Commit Graph

2352 Commits

Author SHA1 Message Date
NeilBrown 5cf00fcd3c md/raid10: collect some geometry fields into a dedicated structure.
We will shortly be adding reshape support for RAID10 which will
require it having 2 concurrent geometries (before and after).
To make that easier, collect most geometry fields into 'struct geom'
and access them from there.  Then we will more easily be able to add
a second set of fields.

Note that 'copies' is not in this struct and so cannot be changed.
There is little need to change this number and doing so is a lot
more difficult as it requires reallocating more things.
So leave it out for now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:28:20 +10:00
NeilBrown b5254dd5fd md/raid5: allow for change in data_offset while managing a reshape.
The important issue here is incorporating the different in data_offset
into calculations concerning when we might need to over-write data
that is still thought to be valid.

To this end we find the minimum offset difference across all devices
and add that where appropriate.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:01 +10:00
NeilBrown 05616be5e1 md/raid5: Use correct data_offset for all IO.
As there can now be two different data_offsets - an 'old' and
a 'new' - we need to carefully choose between them.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
NeilBrown c6563a8c38 md: add possibility to change data-offset for devices.
When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).

So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.

(As we didn't previous check that all 'pad' fields were zero,
 we need a new FEATURE flag for this.
 A (belatedly) check that all remaining 'pad' fields are
 zero to avoid a repeat of this)

The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.

This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be.  At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.

When a reshape finishes we will need to update the data_offset
and rdev->sectors.  So provide an exported function to do that.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
NeilBrown 2c810cddc4 md: allow a reshape operation to be reversed.
Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progressed in the opposite direction.

To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.

However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.

So add a 'reshape_direction' attribute with is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.

This will become more important when we allow the data_offset to
change in a reshape.  Then the explicit statement of what direction is
being used will be more useful.

This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:27:00 +10:00
Shaohua Li b5e1b8cee7 md: using GFP_NOIO to allocate bio for flush request
A flush request is usually issued in transaction commit code path, so
using GFP_KERNEL to allocate memory for flush request bio falls into
the classic deadlock issue.

This is suitable for any -stable kernel to which it applies as it
avoids a possible deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-21 09:26:59 +10:00
Linus Torvalds b1dab2f040 A fix to the thin provisioning userspace interface.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPtuPVAAoJEK2W1qbAHj1n3q8QAJaM5OkZ3DqAoC8XznVBKO1B
 Fr5d0mui4irC9ce+End/MCoPN2EcwQ5bzyh6cQXqxfUOTR5RjYMCn97uID0I35Cc
 AEbDaUMSgKkQkKaDeVpM54SaHBhtLP95gIZmo774etmryzn3HyWiiYOsvswnAw8w
 /zIZwPkRPoskoGt1Hk/o490qYGwstJ6JErJndyuVEdHNWF6xiL9cFAi81oo1UF2J
 lysWmBBaisRcAD3kzbTBIQ3ntMUPSbvM7ZLUaJXEbXwg+XUBRkBfQLXJk3wA5q8R
 QacTz+kvAXG6B+DipFOYyDTkxR+s5kjRserlBwh2gHP9u3pHV9Nf7eWR5BWsYPbV
 tLH8Q/0Ctl31MsrxMl7cO05Kj2sMI/lOg/yGQT3Qx6k0HVtbU1gaR6J21RmpElGC
 TANXpBhV5MRpfZFvwnHOSACnE3hMm1i4XJstTejIfPiNtq8aeL0eSB3Q+ZJTEZzm
 ZCk23ufmaqC75RrYt1P3F5QpxsjD+moMmZT5hZSkGOd0YqvwAMfv9O/5FesQ748R
 tc5zPxS/dTONseGkBaAPeHpcX4WP0wm0fKJcPb0Y/4kx0AkiVQjbUrNNcnVxo5Pw
 pF1MbxY14xZxteZ/UHWmPvyCuJ8NICTtswxj9taquSDb87pliV2xajrK4Mnmed7u
 sLxA0ANWBdJghss/r/g6
 =03zw
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull a dm fix from Alasdair G Kergon:
 "A fix to the thin provisioning userspace interface."

* tag 'dm-3.4-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm thin: fix table output when pool target disables discard passdown internally
2012-05-18 18:22:45 -07:00
Mike Snitzer f402693d06 dm thin: fix table output when pool target disables discard passdown internally
When the thin pool target clears the discard_passdown parameter
internally, it incorrectly changes the table line reported to userspace.
This breaks dumb string comparisons on these table lines in generic
userspace device-mapper library code and leads to tables being reloaded
repeatedly when nothing is actually meant to be changing.

This patch corrects this by no longer changing the table line when
discard passdown was disabled.

We can still tell when discard passdown is overridden by looking for the
message "Discard unsupported by data device (sdX): Disabling discard passdown."

This automatic detection is also moved from the 'load' to the 'resume'
so that it is re-evaluated should the properties of underlying devices
change.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-19 01:01:01 +01:00
Linus Torvalds 2f05af8b59 Fix bug in recent fix to RAID10.
Without this patch, recovery will crash
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT7bVGznsnt1WYoG5AQL0ohAAvpF47NkSnuqw9gko25/Ndr5akrIC9kfz
 YNzxsd0FvNSEQipGS4KgBcRxnFxbcVkmHL+pN8McmiDxNew1LW2+zdtV47RJngnQ
 2slTUiSBLSHcvnOFblBrwIh+XpMdhv4FEMG13Si4tQaETtHp3JiNDr2JqvwbuKbo
 aO/nW38DFNjEcB59X+9npknZYaymvavxso4J7iH/ec0Iys2c2h+jX5afewCXnbYr
 Q6X2YlySUiUM3AXbgV0QI/w6xDM0TD+WWguHq411pgU1wHMYpHelYYwRGHSU73TJ
 061K5WB83Q53A8czeRXK33x0xGr/AhQesTel+mlV3eODHoGSnsYxH8Zja7sV220h
 HgTICrSvAFL2cvBz8dqJQ59KYH3yNo4Cie5xzmBxCuX6yyTxizgrQTp7L/HUy7RL
 eL1SqNFTRJcwSl3MwmgbL57saBLw+/ejV+og7PNBCXSS9gZg39uJyn1uTqJZTrcL
 PuBaizNG9HCYuR/pRTMsQGNWrksqk9r+hxwCJDIGb7xh9SBFI82YtQfK9F5fSB84
 vdmIec8yUvwS/Gxhz+I+YB3kqk/nwfFRbVGfBWvJdRBOR0DsZphsBapsJ83LZqFb
 VAa2nQL+NoHOweixv6RgiH4Y96SNHdWkvDJjj7R01BUDdpHeHMZuff07Effgqd64
 Knk64XvWKQc=
 =s7vy
 -----END PGP SIGNATURE-----

Merge tag 'md-3.4-fixes' of git://neil.brown.name/md

Pull one more md bugfix from NeilBrown:
 "Fix bug in recent fix to RAID10.

  Without this patch, recovery will crash"

* tag 'md-3.4-fixes' of git://neil.brown.name/md:
  md/raid10: fix transcription error in calc_sectors conversion.
2012-05-18 16:19:59 -07:00
NeilBrown b0d634d568 md/raid10: fix transcription error in calc_sectors conversion.
The old code was
		sector_div(stride, fc);
the new code was
		sector_dir(size, conf->near_copies);

'size' is right (the stride various wasn't really needed), but
'fc' means 'far_copies', and that is an important difference.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-19 09:01:13 +10:00
Linus Torvalds 36a1987cd8 md: 2 fixes for 3.4
one fixes a bug in the new raid10 resize code so is relevant
 to 3.4 only
 Other fixes a bug in the use of md by dm-raid, so is relevant
 to any kernel with dm-raid support
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT7SBkznsnt1WYoG5AQK3wQ//Q2sPicPHb5MNGTTBpphYo1QWo+l9jFHs
 ZDBM+MaiNJg3kBN5ueUU+MENvLcaA5+zoxsGVBXBKyXr70ffqiQcLXyU7fHwrGu3
 5MD36p55ZPnq2pemCrp4qdTXEUabmDb+0/R7e5lywnzNdbmCAfh4uYih0VPiaClV
 ihq/Ci12TDnezmLjksc09OCquhm0s3zH2BnMCVdmSAkhnXCxTeZ45s/ob71Y2xvj
 cJ15SYlAG4t0QCikL5R8pZtkh0h2SuUhufDE09eD8yT4RGO4PHSQ4oHujajftzey
 9sB0NGH7Yla8gOXjA+EpzKPaiqtZxJB+1v/bhqA2FoOYAks8VoFfeqgwUbPYE7bk
 GIfGB4hFsUXaJo13uzofyJXBIp9mM/J5Sk1VJsiLE85P7wewg6N199B8lpC3lFDw
 tMLjfTMJzFOUqZBESjJoxyrc4fairZ9VCUWwpqjuioLO50e+lOi/jQHTspX78e+w
 GxgjHp8hh0RqQiTkl7vIz9KVcQIeOTG9uzz61IuDp15cRSrMs6E8gVKoX8gKW9g2
 Hec17fdG/H6ZeZa7MB9GzUD4HCj0PRbODQ3/fPhUdsbgtQjOvsVUH8LCRRU0U6cb
 YF+qsDFtUF7QT2kNbrs9R6adGj97c2HWUMyRWMQAXGuL5TkstvhrRv/rk1+bv2VG
 w7ptbiklj7o=
 =9zxe
 -----END PGP SIGNATURE-----

Merge tag 'md-3.4-fixes' of git://neil.brown.name/md

Pull two md fixes from NeilBrown:
 "One fixes a bug in the new raid10 resize code so is relevant to 3.4
  only.

  The other fixes a bug in the use of md by dm-raid, so is relevant to
  any kernel with dm-raid support"

* tag 'md-3.4-fixes' of git://neil.brown.name/md:
  MD: Add del_timer_sync to mddev_suspend (fix nasty panic)
  md/raid10: set dev_sectors properly when resizing devices in array.
2012-05-17 09:44:35 -07:00
Jonathan Brassow 0d9f4f135e MD: Add del_timer_sync to mddev_suspend (fix nasty panic)
Use del_timer_sync to remove timer before mddev_suspend finishes.

We don't want a timer going off after an mddev_suspend is called.  This is
especially true with device-mapper, since it can call the destructor function
immediately following a suspend.  This results in the removal (kfree) of the
structures upon which the timer depends - resulting in a very ugly panic.
Therefore, we add a del_timer_sync to mddev_suspend to prevent this.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17 10:38:24 +10:00
NeilBrown 6508fdbf40 md/raid10: set dev_sectors properly when resizing devices in array.
raid10 stores dev_sectors in 'conf' separately from the one in
'mddev' because it can have a very significant effect on block
addressing and so need to be updated carefully.

However raid10_resize isn't updating it at all!

To update it correctly, we need to make sure it is a proper
multiple of the chunksize taking various details of the layout
in to account.
This calculation is currently done in setup_conf.   So split it
out from there and call it from raid10_resize as well.
Then set conf->dev_sectors properly.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-17 10:08:45 +10:00
Linus Torvalds 4a873f5399 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David S. Miller:

 1) Since we do RCU lookups on ipv4 FIB entries, we have to test if the
    entry is dead before returning it to our caller.

 2) openvswitch locking and packet validation fixes from Ansis Atteka,
    Jesse Gross, and Pravin B Shelar.

 3) Fix PM resume locking in IGB driver, from Benjamin Poirier.

 4) Fix VLAN header handling in vhost-net and macvtap, from Basil Gor.

 5) Revert a bogus network namespace isolation change that was causing
    regressions on S390 networking devices.

 6) If bonding decides to process and handle a LACPDU frame, we
    shouldn't bump the rx_dropped counter.  From Jiri Bohac.

 7) Fix mis-calculation of available TX space in r8169 driver when doing
    TSO, which can lead to crashes and/or hung device.  From Julien
    Ducourthial.

 8) SCTP does not validate cached routes properly in all cases, from
    Nicolas Dichtel.

 9) Link status interrupt needs to be handled in ks8851 driver, from
    Stephen Boyd.

10) Use capable(), not cap_raised(), in connector/userns netlink code.
    From Eric W. Biederman via Andrew Morton.

11) Fix pktgen OOPS on module unload, from Eric Dumazet.

12) iwlwifi under-estimates SKB truesizes, also from Eric Dumazet.

13) Cure division by zero in SFC driver, from Ben Hutchings.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  ks8851: Update link status during link change interrupt
  macvtap: restore vlan header on user read
  vhost-net: fix handle_rx buffer size
  bonding: don't increase rx_dropped after processing LACPDUs
  connector/userns: replace netlink uses of cap_raised() with capable()
  sctp: check cached dst before using it
  pktgen: fix crash at module unload
  Revert "net: maintain namespace isolation between vlan and real device"
  ehea: fix losing of NEQ events when one event occurred early
  igb: fix rtnl race in PM resume path
  ipv4: Do not use dead fib_info entries.
  r8169: fix unsigned int wraparound with TSO
  sfc: Fix division by zero when using one RX channel and no SR-IOV
  openvswitch: Validation of IPv6 set port action uses IPv4 header
  net: compare_ether_addr[_64bits]() has no ordering
  cdc_ether: Ignore bogus union descriptor for RNDIS devices
  bnx2x: bug fix when loading after SAN boot
  e1000: Silence sparse warnings by correcting type
  igb, ixgbe: netdev_tx_reset_queue incorrectly called from tx init path
  openvswitch: Release rtnl_lock if ovs_vport_cmd_build_info() failed.
  ...
2012-05-12 12:57:01 -07:00
Mike Snitzer 510193a2d3 dm mpath: check if scsi_dh module already loaded before trying to load
If the requested scsi_dh module is already loaded then skip
request_module().

Multipath table loads can hang in an unnecessary __request_module.

Reported-by: Ben Marzinski <bmarzins@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:21 +01:00
Alasdair G Kergon 7cab8bf160 dm thin: correct module description
Remove duplicate copy of string "device-mapper" (DM_NAME) from
MODULE_DESCRIPTION.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:19 +01:00
Mike Snitzer c3a0ce2eab dm thin: fix unprotected use of prepared_discards list
Fix two places in commit 104655fd4d ("dm thin: support discards") that
didn't use pool->lock to protect against concurrent changes to the
prepared_discards list.

Without this fix, thin_endio() can race with process_discard(), leading
to concurrent list_add()s that result in the processes locking up with
an error like the following:

WARNING: at lib/list_debug.c:32 __list_add+0x8f/0xa0()
...
list_add corruption. next->prev should be prev (ffff880323b96140), but was ffff8801d2c48440. (next=ffff8801d2c485c0).
...
Pid: 17205, comm: kworker/u:1 Tainted: G        W  O 3.4.0-rc3.snitm+ #1
Call Trace:
 [<ffffffff8103ca1f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8103cb16>] warn_slowpath_fmt+0x46/0x50
 [<ffffffffa04f6ce6>] ? bio_detain+0xc6/0x210 [dm_thin_pool]
 [<ffffffff8124ff3f>] __list_add+0x8f/0xa0
 [<ffffffffa04f70d2>] process_discard+0x2a2/0x2d0 [dm_thin_pool]
 [<ffffffffa04f6a78>] ? remap_and_issue+0x38/0x50 [dm_thin_pool]
 [<ffffffffa04f7c3b>] process_deferred_bios+0x7b/0x230 [dm_thin_pool]
 [<ffffffffa04f7df0>] ? process_deferred_bios+0x230/0x230 [dm_thin_pool]
 [<ffffffffa04f7e42>] do_worker+0x52/0x60 [dm_thin_pool]
 [<ffffffff81056fa9>] process_one_work+0x129/0x450
 [<ffffffff81059b9c>] worker_thread+0x17c/0x3c0
 [<ffffffff81059a20>] ? manage_workers+0x120/0x120
 [<ffffffff8105eabe>] kthread+0x9e/0xb0
 [<ffffffff814ceda4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105ea20>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff814ceda0>] ? gs_change+0x13/0x13
---[ end trace 7e0a523bc5e52692 ]---

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:16 +01:00
Mike Snitzer 03aaae7cdc dm thin: reinstate missing mempool_free in cell_release_singleton
Fix a significant memory leak inadvertently introduced during
simplification of cell_release_singleton() in commit
6f94a4c45a ("dm thin: fix stacked bi_next
usage").

A cell's hlist_del() must be accompanied by a mempool_free().
Use __cell_release() to do this, like before.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-05-12 01:43:12 +01:00
Eric W. Biederman 38bf195398 connector/userns: replace netlink uses of cap_raised() with capable()
In 2009 Philip Reiser notied that a few users of netlink connector
interface needed a capability check and added the idiom
cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
that netlink was asynchronous.

In 2011 Patrick McHardy noticed we were being silly because netlink is
synchronous and removed eff_cap from the netlink_skb_params and changed
the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).

Looking at those spots with a fresh eye we should be calling
capable(CAP_SYS_ADMIN).  The only reason I can see for not calling capable
is that it once appeared we were not in the same task as the caller which
would have made calling capable() impossible.

In the initial user_namespace the only difference between between
cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
PF_SUPERPRIV if we use the capability.

Since we are going to be using root privilege setting PF_SUPERPRIV seems
the right thing to do.

The motivation for this that patch is that in a child user namespace
cap_raised(current_cap(),...) tests your capabilities with respect to that
child user namespace not capabilities in the initial user namespace and
thus will allow processes that should be unprivielged to use the kernel
services that are only protected with cap_raised(current_cap(),..).

To fix possible user_namespace issues and to just clean up the code
replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
capable(CAP_SYS_ADMIN).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:21:39 -04:00
NeilBrown b16b1b6cd0 md/bitmap: fix calculation of 'chunks' - missing shift.
commit 61a0d80c "md/bitmap: discard CHUNK_BLOCK_SHIFT macro"
replaced CHUNK_BLOCK_RATIO() by the same text that was
replacing CHUNK_BLOCK_SHIFT() - which is clearly wrong.

The result is that 'chunks' is often too small by 1,
which can sometimes result in a crash (not sure how).

So use the correct replacement, and get rid of CHUNK_BLOCK_RATIO
which is no longe used.

Reported-by: Karl Newman <siliconfiend@gmail.com>
Tested-by: Karl Newman <siliconfiend@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-05-04 17:03:18 +10:00
NeilBrown 30b8aa9172 md: fix possible corruption of array metadata on shutdown.
commit c744a65c1e
  md: don't set md arrays to readonly on shutdown.

removed the possibility of a 'BUG' when data is written to an array
that has just been switched to read-only, but also introduced the
possibility that the array metadata could be corrupted.

If, when md_notify_reboot gets the mddev lock, the array is
in a state where it is assembled but hasn't been started (as can
happen if the personality module is not available, or in other unusual
situations), then incorrect metadata will be written out making it
impossible to re-assemble the array.

So only call __md_stop_writes() if the array has actually been
activated.

This patch is needed for any stable kernel which has had the above
commit applied.

Cc: stable@vger.kernel.org
Reported-by: Christoph Nelles <evilazrael@evilazrael.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24 10:23:16 +10:00
NeilBrown ed209584c3 md: don't call ->add_disk unless there is good reason.
Commit 7bfec5f35c

   md/raid5: If there is a spare and a want_replacement device, start replacement.

cause md_check_recovery to call ->add_disk much more often.
Instead of only when the array is degraded, it is now called whenever
md_check_recovery finds anything useful to do, which includes
updating the metadata for clean<->dirty transition.
This causes unnecessary work, and causes info messages from ->add_disk
to be reported much too often.

So refine md_check_recovery to only do any actual recovery checking
(including ->add_disk) if MD_RECOVERY_NEEDED is set.

This fix is suitable for 3.3.y:

Cc: stable@vger.kernel.org
Reported-by: Jan Ceuleers <jan.ceuleers@computer.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24 10:23:14 +10:00
Jonathan Brassow a9ad8526bb DM RAID: Use safe version of rdev_for_each
Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe

Commit dafb20fa34 mistakenly replaced a safe
iterator with an unsafe one when making some macro changes.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-24 10:23:13 +10:00
NeilBrown afbaa90b80 md/bitmap: prevent bitmap_daemon_work running while initialising bitmap
If a bitmap is added while the array is active, it is possible
for bitmap_daemon_work to run while the bitmap is being
initialised.
This is particularly a problem if bitmap_daemon_work sees
bitmap->filemap as non-NULL before it has been filled in properly.
So hold bitmap_info.mutex while filling in ->filemap
to prevent problems.

This patch is suitable for any -stable kernel, though it might not
apply cleanly before about 3.1.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 16:05:06 +10:00
majianpeng f4380a9158 md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.
If r1bio->sectors % 8 != 0,then the memcmp and a later
memcpy will omit the last bio_vec.

This is suitable for any stable kernel since 3.1 when bad-block
management was introduced.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 16:04:47 +10:00
Andrei Warkentin 9e41dd35b3 MD: Bitmap version cleanup.
bitmap_new_disk_sb() would still create V3 bitmap superblock
with host-endian layout.

Perhaps I'm confused, but shouldn't bitmap_new_disk_sb() be
creating a V4 bitmap superblock instead, that is portable,
as per comment in bitmap.h?

Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-12 15:55:21 +10:00
NeilBrown 5020ad7d14 md/raid1,raid10: don't compare excess byte during consistency check.
When comparing two pages read from different legs of a mirror, only
compare the bytes that were read, not the whole page.

In most cases we read a whole page, but in some cases with
bad blocks or odd sizes devices we might read fewer than that.

This bug has been present "forever" but at worst it might cause
a report of two many mismatches and generate a little bit
extra resync IO, so there is no need to back-port to -stable
kernels.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:39:23 +10:00
majianpeng c6d2e084c7 md/raid5: Fix a bug about judging if the operation is syncing or replacing
When create a raid5 using assume-clean and echo check or repair to
sync_action.Then component disks did not operated IO but the raid
check/resync faster than normal.
Because the judgement in function analyse_stripe():
		if (do_recovery ||
		    sh->sector >= conf->mddev->recovery_cp)
			s->syncing = 1;
		else
			s->replacing = 1;
When check or repair,the recovery_cp == MaxSectore,so syncing equal zero
not one.

This bug was introduced by commit 9a3e1101b8
    md/raid5:  detect and handle replacements during recovery.
so this patch is suitable for 3.3-stable.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:38 +10:00
majianpeng a42f9d83b5 md/raid1:Remove unnecessary rcu_dereference(conf->mirrors[i].rdev).
Because rde->nr_pending > 0,so can not remove this disk.
And in any case, we aren't holding rcu_read_lock()

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:33 +10:00
Jes Sorensen 24b961f811 md: Avoid OOPS when reshaping raid1 to raid0
raid1 arrays do not have the notion of chunk size. Calculate the
largest chunk sector size we can use to avoid a divide by zero OOPS
when aligning the size of the new array to the chunk size.

Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:37:26 +10:00
NeilBrown 18b9837ea0 md/raid5: fix handling of bad blocks during recovery.
1/ We can only treat a known-bad-block like a read-error if we
   have the data that belongs in that block.  So fix that test.

2/ If we cannot recovery a stripe due to insufficient data,
   don't tell "md_done_sync" that the sync failed unless we really
   did fail something.  If we successfully record bad blocks,
   that is success.

Reported-by: "majianpeng" <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-03 15:36:17 +10:00
majianpeng 5220ea1e64 md/raid1: If md_integrity_register() failed,run() must free the mem
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:38 +10:00
majianpeng 0366ef8475 md/raid0: If md_integrity_register() fails, raid0_run() must free the mem.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:37 +10:00
majianpeng 98d5561bfb md/linear: If md_integrity_register() fails, linear_run() must free the mem.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-04-02 09:48:37 +10:00
Mikulas Patocka a4ffc15219 dm: add verity target
This device-mapper target creates a read-only device that transparently
validates the data on one underlying device against a pre-generated tree
of cryptographic checksums stored on a second device.

Two checksum device formats are supported: version 0 which is already
shipping in Chromium OS and version 1 which incorporates some
improvements.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Elly Jones <ellyjones@chromium.org>
Cc: Milan Broz <mbroz@redhat.com>
Cc: Olof Johansson <olofj@chromium.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:43:38 +01:00
Mikulas Patocka a66cc28f53 dm bufio: prefetch
This patch introduces a new function dm_bufio_prefetch. It prefetches
the specified range of blocks into dm-bufio cache without waiting
for i/o completion.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:29 +01:00
Joe Thornber 67e2e2b281 dm thin: add pool target flags to control discard
Add dm thin target arguments to control discard support.

ignore_discard: Disables discard support

no_discard_passdown: Don't pass discards down to the underlying data
device, but just remove the mapping within the thin provisioning target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:29 +01:00
Joe Thornber 104655fd4d dm thin: support discards
Support discards in the thin target.

On discard the corresponding mapping(s) are removed from the thin
device.  If the associated block(s) are no longer shared the discard
is passed to the underlying device.

All bios other than discards now have an associated deferred_entry
that is saved to the 'all_io_entry' in endio_hook.  When non-discard
IO completes and associated mappings are quiesced any discards that
were deferred, via ds_add_work() in process_discard(), will be queued
for processing by the worker thread.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>

drivers/md/dm-thin.c |  173 ++++++++++++++++++++++++++++++++++++++++++++++----
 drivers/md/dm-thin.c |  172 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 158 insertions(+), 14 deletions(-)
2012-03-28 18:41:28 +01:00
Joe Thornber eb2aa48d4e dm thin: prepare to support discard
This patch contains the ground work needed for dm-thin to support discard.

  - Adds endio function that replaces shared_read_endio.

  - Introduce an explicit 'quiesced' flag into the new_mapping structure.
    Before, this was implicitly indicated by m->list being empty.

  - The map_info->ptr remains constant for the duration of a bio's trip
    through the thin target.  Make it easier to reason about it.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Alasdair G Kergon 6efd6e8309 dm thin: use dm_target_offset
Use dm_target_offset wrapper instead of referencing the awkward ti->begin
explicitly.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Joe Thornber 2dd9c257fb dm thin: support read only external snapshot origins
Support the use of an external _read only_ device as an origin for a thin
device.

Any read to an unprovisioned area of the thin device will be passed
through to the origin.  Writes trigger allocation of new blocks as
usual.

One possible use case for this would be VM hosts that want to run
guests on thinly-provisioned volumes but have the base image on another
device (possibly shared between many VMs).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Mike Snitzer c4a69ecdb4 dm thin: relax hard limit on the maximum size of a metadata device
The thin metadata format can only make use of a device that is <=
THIN_METADATA_MAX_SECTORS (currently 15.9375 GB).  Therefore, there is no
practical benefit to using a larger device.

However, it may be that other factors impose a certain granularity for
the space that is allocated to a device (E.g. lvm2 can impose a coarse
granularity through the use of large, >= 1 GB, physical extents).

Rather than reject a larger metadata device, during thin-pool device
construction, switch to allowing it but issue a warning if a device
larger than THIN_METADATA_MAX_SECTORS_WARNING (16 GB) is
provided.  Any space over 15.9375 GB will not be used.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:28 +01:00
Joe Thornber 71fd5ae25d dm persistent data: remove space map ref_count entries if redundant
Save space by removing entries from the space map ref_count tree if
they're no longer needed.

Ref counts are stored in two places: a bitmap if the ref_count is
below 3, or a btree of uint32_t if 3 or above.

When a ref_count that was above 3 drops below we can remove it from
the tree and save some metadata space.  This removal was commented out
before because I was unsure why this was causing under-populated btree
nodes.  Earlier patches have fixed this issue.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:27 +01:00
Joe Thornber 905e51b39a dm thin: commit outstanding data every second
Commit unwritten data every second to prevent too much building up.

Released blocks don't become available until after the next commit
(for crash resilience).  Prior to this patch commits were only
triggered by a message to the target or a REQ_{FLUSH,FUA} bio.  This
allowed far too big a position to build up.

The interval is hard-coded to 1 second.  This is a sensible setting.
I'm not making this user configurable, since there isn't much to be
gained by tweaking this - and a lot lost by setting it far too high.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:27 +01:00
Mikulas Patocka 31998ef193 dm: reject trailing characters in sccanf input
Device mapper uses sscanf to convert arguments to numbers. The problem is that
the way we use it ignores additional unmatched characters in the scanned string.

For example, this `if (sscanf(string, "%d", &number) == 1)' will match a number,
but also it will match number with some garbage appended, like "123abc".

As a result, device mapper accepts garbage after some numbers. For example
the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
will pass without an error.

This patch fixes all sscanf uses in device mapper. It appends "%c" with
a pointer to a dummy character variable to every sscanf statement.

The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
only if string is a null-terminated number (optionally preceded by some
whitespace characters). If there is some character appended after the number,
sscanf matches "%c", writes the character to the dummy variable and returns 2.
We check the return value for 1 and consequently reject numbers with some
garbage appended.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:26 +01:00
Jonathan E Brassow 0447568fc5 dm raid: handle failed devices during start up
The dm-raid code currently fails to create a RAID array if any of the
superblocks cannot be read.  This was an oversight as there is already
code to handle this case if the values ('- -') were provided for the
failed array position.

With this patch, if a superblock cannot be read, the array position's
fields are initialized as though '- -' was set in the table.  That is,
the device is failed and the position should not be used, but if there
is sufficient redundancy, the array should still be activated.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:26 +01:00
Joe Thornber fef838cc1a dm thin metadata: pass correct space map to dm_sm_root_size
Fix a harmless typo.

The root is a chunk of data that gets written to the superblock.  This
data is used to recreate the space map when opening a metadata area.
We have two space maps; one tracking space on the metadata device and
one of the data device.  Both of these use the same format for their
root, so this typo was harmless.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Joe Thornber a3aefb395e dm persistent data: remove redundant value_size arg from value_ptr
Now that the value_size is held within every node of the btrees we can
remove this argument from value_ptr().

For the last few months a BUG_ON has been checking this argument is
the same as that held in the node.  No issues were reported.  So this
is a safe change.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Jun'ichi Nomura 466891f995 dm mpath: detect invalid map_context
The map_context pointer should always be set. However, we have reports
that upon requeuing it is not set correctly.  So add set and clear
functions with a BUG_ON() to track the issue properly.

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Hannes Reinecke 4d7b38b7d9 dm: clear bi_end_io on remapping failure
As a precaution, set bi_end_io to NULL when failing to remap.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:25 +01:00
Hannes Reinecke 574ce07eb0 dm table: simplify call to free_devices
free_devices in dm_table.c already uses list_for_each(), so we don't
need to check if the list is empty.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Joe Thornber fe878f34df dm thin: correct comments
Remove documentation for unimplemented 'trim' message.

I'd planned a 'trim' target message for shrinking thin devices, but
this is better handled via the discard ioctl.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Alasdair G Kergon 035220b33d dm raid: no longer experimental
The dm raid module (using md) is becoming the preferred way of creating long-lived
mirrors through userspace LVM so remove the EXPERIMENTAL tag.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Alasdair G Kergon e0b215da8f dm uevent: no longer experimental
Drop EXPERIMENTAL tag from dm-uevent.

It's not changed for a while and some userspace tools are relying upon it.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:24 +01:00
Joe Thornber b0988900ba dm persistent data: fix btree rebalancing after remove
When we remove an entry from a node we sometimes rebalance with it's
two neighbours.  This wasn't being done correctly; in some cases
entries have to move all the way from the right neighbour to the left
neighbour, or vice versa.  This patch pretty much re-writes the
balancing code to fix it.

This code is barely used currently; only when you delete a thin
device, and then only if you have hundreds of them in the same pool.
Once we have discard support, which removes mappings, this will be used
much more heavily.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:23 +01:00
Joe Thornber 6f94a4c45a dm thin: fix stacked bi_next usage
Avoid using the bi_next field for the holder of a cell when deferring
bios because a stacked device below might change it.  Store the
holder in a new field in struct cell instead.

When a cell is created, the bio that triggered creation (the holder) was
added to the same bio list as subsequent bios.  In some cases we pass
this holder bio directly to devices underneath.  If those devices use
the bi_next field there will be trouble...

This also simplifies some code that had to work out which bio was the
holder.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:23 +01:00
Mikulas Patocka 72c6e7afc4 dm crypt: add missing error handling
Always set io->error to -EIO when an error is detected in dm-crypt.

There were cases where an error code would be set only if we finish
processing the last sector. If there were other encryption operations in
flight, the error would be ignored and bio would be returned with
success as if no error happened.

This bug is present in kcryptd_crypt_write_convert, kcryptd_crypt_read_convert
and kcryptd_async_done.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Reviewed-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:22 +01:00
Mikulas Patocka aeb2deae26 dm crypt: fix mempool deadlock
This patch fixes a possible deadlock in dm-crypt's mempool use.

Currently, dm-crypt reserves a mempool of MIN_BIO_PAGES reserved pages.
It allocates first MIN_BIO_PAGES with non-failing allocation (the allocation
cannot fail and waits until the mempool is refilled). Further pages are
allocated with different gfp flags that allow failing.

Because allocations may be done in parallel, this code can deadlock. Example:
There are two processes, each tries to allocate MIN_BIO_PAGES and the processes
run simultaneously.
It may end up in a situation where each process allocates (MIN_BIO_PAGES / 2)
pages. The mempool is exhausted. Each process waits for more pages to be freed
to the mempool, which never happens.

To avoid this deadlock scenario, this patch changes the code so that only
the first page is allocated with non-failing gfp mask. Allocation of further
pages may fail.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:22 +01:00
Andrei Warkentin aadbe266f2 dm exception store: fix init error path
Call the correct exit function on failure in dm_exception_store_init.

Signed-off-by: Andrei Warkentin <andrey.warkentin@gmail.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-28 18:41:22 +01:00
Linus Torvalds 267d7b23dd md updates for 3.4
Mostly tidying up code in preparation for some bigger changes
 next time.
 A few bug fixes tagged for -stable.
 
 Main functionality change is that some RAID10 arrays can now
 grow to use extra space that may have been made available on the
 individual devices.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT2bLBjnsnt1WYoG5AQKN3xAAv1UlR5Kem5WN7Ex4lmR9xj3lr9dbURYT
 TtvrUuCy3pYYWdTuijb+IBqkbODF0kPDHIhUiBx9fXUfMavkp/b9heXS/vJ3pcH4
 1j99NUbOGL/AylD1TPRV9TQxGTKhEjK3n26bY0t/amLc92bWJaytMO1B9cz38LN+
 qx6ufpIepz4DPXXtPYpnkBR4cZ6L4/ZXQvjf5BqG6WfKwc+0Nyncg8ipYEqhBWy7
 R7ztF5yPo0yl96Wopa2KG91OroWflmyZo1DNYcbUbKtbNGGtYC92GFadOH+wNupM
 FnmXv10ivfVGU5w4SpshAwOg+4OSUqmWNsBxUhpYbf8ChbN+lOl0VZdH6UBxo19D
 3SqZWT/yz4I4HYd5rtr35MXFdOeBNM++CHQs4F68BLA0B6OcHfWsA9bvly2tnBVx
 iEBFPd277qWztUr8m6yz7AFf/0dgyXuIhuB3d7IkVrG5yG3FX6hPi2T0FSA33qMx
 Lwi5w6O4DREg5tG09xEYEnXgXe+PnB8HsKb1U/m76XMQ0UScvX6dLA6934Vg+DCv
 xf+AYqob0Tc/Op5I7h2PbVXq7DciNXwlX1WvM0m+TEaV+3fl1FB0VsCcANAV6JVn
 uRLmvtePQRt0hxAog2p7OsumVnxMhbuEo5h8rJMKWM7IbhueKNoz+gBwpcFLzBmY
 ygWc4peLQpE=
 =MGuM
 -----END PGP SIGNATURE-----

Merge tag 'md-3.4' of git://neil.brown.name/md

Pull md updates for 3.4 from Neil Brown:
 "Mostly tidying up code in preparation for some bigger changes next
  time.

  A few bug fixes tagged for -stable.

  Main functionality change is that some RAID10 arrays can now grow to
  use extra space that may have been made available on the individual
  devices."

Fixed up trivial conflicts with the k[un]map_atomic() cleanups in
drivers/md/bitmap.c.

* tag 'md-3.4' of git://neil.brown.name/md: (22 commits)
  md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
  md: fix clearing of the 'changed' flags for the bad blocks list.
  md/bitmap: discard CHUNK_BLOCK_SHIFT macro
  md/bitmap: remove unnecessary indirection when allocating.
  md/bitmap: remove some pointless locking.
  md/bitmap: change a 'goto' to a normal 'if' construct.
  md/bitmap: move printing of bitmap status to bitmap.c
  md/bitmap: remove some unused noise from bitmap.h
  md/raid10 - support resizing some RAID10 arrays.
  md/raid1: handle merge_bvec_fn in member devices.
  md/raid10: handle merge_bvec_fn in member devices.
  md: add proper merge_bvec handling to RAID0 and Linear.
  md: tidy up rdev_for_each usage.
  md/raid1,raid10: avoid deadlock during resync/recovery.
  md/bitmap: ensure to load bitmap when creating via sysfs.
  md: don't set md arrays to readonly on shutdown.
  md: allow re-add to failed arrays.
  md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
  md: Use existed macros instead of numbers
  md/raid5: removed unused 'added_devices' variable.
  ...
2012-03-22 12:29:50 -07:00
Linus Torvalds 9f3938346a Merge branch 'kmap_atomic' of git://github.com/congwang/linux
Pull kmap_atomic cleanup from Cong Wang.

It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().

Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.

* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
  feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
  highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
  drbd: remove the second argument of k[un]map_atomic()
  zcache: remove the second argument of k[un]map_atomic()
  gma500: remove the second argument of k[un]map_atomic()
  dm: remove the second argument of k[un]map_atomic()
  tomoyo: remove the second argument of k[un]map_atomic()
  sunrpc: remove the second argument of k[un]map_atomic()
  rds: remove the second argument of k[un]map_atomic()
  net: remove the second argument of k[un]map_atomic()
  mm: remove the second argument of k[un]map_atomic()
  lib: remove the second argument of k[un]map_atomic()
  power: remove the second argument of k[un]map_atomic()
  kdb: remove the second argument of k[un]map_atomic()
  udf: remove the second argument of k[un]map_atomic()
  ubifs: remove the second argument of k[un]map_atomic()
  squashfs: remove the second argument of k[un]map_atomic()
  reiserfs: remove the second argument of k[un]map_atomic()
  ocfs2: remove the second argument of k[un]map_atomic()
  ntfs: remove the second argument of k[un]map_atomic()
  ...
2012-03-21 09:40:26 -07:00
Linus Torvalds 69a7aebcf0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "It's indeed trivial -- mostly documentation updates and a bunch of
  typo fixes from Masanari.

  There are also several linux/version.h include removals from Jesper."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
  kcore: fix spelling in read_kcore() comment
  constify struct pci_dev * in obvious cases
  Revert "char: Fix typo in viotape.c"
  init: fix wording error in mm_init comment
  usb: gadget: Kconfig: fix typo for 'different'
  Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
  writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
  writeback: fix typo in the writeback_control comment
  Documentation: Fix multiple typo in Documentation
  tpm_tis: fix tis_lock with respect to RCU
  Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
  Doc: Update numastat.txt
  qla4xxx: Add missing spaces to error messages
  compiler.h: Fix typo
  security: struct security_operations kerneldoc fix
  Documentation: broken URL in libata.tmpl
  Documentation: broken URL in filesystems.tmpl
  mtd: simplify return logic in do_map_probe()
  mm: fix comment typo of truncate_inode_pages_range
  power: bq27x00: Fix typos in comment
  ...
2012-03-20 21:12:50 -07:00
Cong Wang c2e022cb65 dm: remove the second argument of k[un]map_atomic()
Acked-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:28 +08:00
Cong Wang b2f46e6882 md: remove the second argument of k[un]map_atomic()
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:18 +08:00
majianpeng ecb178bb2b md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
If there are no unacked bad blocks, then there is no point searching
for them to acknowledge them.


Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:42 +11:00
NeilBrown d0962936bf md: fix clearing of the 'changed' flags for the bad blocks list.
In super_1_sync (the first hunk) we need to clear 'changed' before
checking read_seqretry(), otherwise we might race with other code
adding a bad block and so won't retry later.

In md_update_sb (the second hunk), in the case where there is no
metadata (neither persistent nor external), we treat any bad blocks as
an error.  However we need to clear the 'changed' flag before calling
md_ack_all_badblocks, else it won't do anything.

This patch is suitable for -stable release 3.0 and later.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:41 +11:00
NeilBrown 61a0d80ce4 md/bitmap: discard CHUNK_BLOCK_SHIFT macro
Be redefining ->chunkshift as the shift from sectors to chunks rather
than bytes to chunks, we can just use "bitmap->chunkshift" which is
shorter than the macro call, and less indirect.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:41 +11:00
NeilBrown 792a1d4bbf md/bitmap: remove unnecessary indirection when allocating.
These funcitons don't add anything useful except possibly the trace
points, and I don't think they are worth the extra indirection.
So remove them.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:41 +11:00
NeilBrown 5a6c824ebb md/bitmap: remove some pointless locking.
There is nothing gained by holding a lock while we check if a pointer
is NULL or not.  If there could be a race, then it could become NULL
immediately after the unlock - but there is no race here.

So just remove the locking.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 278c1ca2f2 md/bitmap: change a 'goto' to a normal 'if' construct.
The use of a goto makes the control flow more obscure here.

So make it a normal:
  if (x) {
     Y;
  }

No functional change.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 57148964d9 md/bitmap: move printing of bitmap status to bitmap.c
The part of /proc/mdstat which describes the bitmap should really
be generated by code in bitmap.c.  So move it there.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 4ba97dff71 md/bitmap: remove some unused noise from bitmap.h
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 006a09a0ae md/raid10 - support resizing some RAID10 arrays.
'resizing' an array in this context means making use of extra
space that has become available in component devices, not adding new
devices.
It also includes shrinking the array to take up less space of
component devices.

This is not supported for array with a 'far' layout.  However
for 'near' and 'offset' layout arrays, adding and removing space at
the end of the devices is easy to support, and this patch provides
that support.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:40 +11:00
NeilBrown 6b740b8d79 md/raid1: handle merge_bvec_fn in member devices.
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So create a raid1 merge_bvec_fn to check that function in children
as well.

This introduces a small problem.  There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched().  This will work
providing no preemption happens.  If there is is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown 050b66152f md/raid10: handle merge_bvec_fn in member devices.
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So enhance the raid10 merge_bvec_fn to check that function in children
as well.

This introduces a small problem.  There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched().  This will work
providing no preemption happens.  If there is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown ba13da47ff md: add proper merge_bvec handling to RAID0 and Linear.
These personalities currently set a max request size of one page
when any member device has a merge_bvec_fn because they don't
bother to call that function.

This causes extra works in splitting and combining requests.

So make the extra effort to call the merge_bvec_fn when it exists
so that we end up with larger requests out the bottom.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown dafb20fa34 md: tidy up rdev_for_each usage.
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev.  However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.

So:
 - rename rdev_for_each to rdev_for_each_safe
 - create a new rdev_for_each which uses the plain
   list_for_each_entry,
 - use the 'safe' version only where needed, and convert all other
   list_for_each_entry calls to use rdev_for_each.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:39 +11:00
NeilBrown d6b42dcb99 md/raid1,raid10: avoid deadlock during resync/recovery.
If RAID1 or RAID10 is used under LVM or some other stacking
block device, it is possible to enter a deadlock during
resync or recovery.
This can happen if the upper level block device creates
two requests to the RAID1 or RAID10.  The first request gets
processed, blocks recovery and queue requests for underlying
requests in current->bio_list.  A resync request then starts
which will wait for those requests and block new IO.

But then the second request to the RAID1/10 will be attempted
and it cannot progress until the resync request completes,
which cannot progress until the underlying device requests complete,
which are on a queue behind that second request.

So allow that second request to proceed even though there is
a resync request about to start.

This is suitable for any -stable kernel.

Cc: stable@vger.kernel.org
Reported-by: Ray Morris <support@bettercgi.com>
Tested-by: Ray Morris <support@bettercgi.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:38 +11:00
NeilBrown 4474ca42e2 md/bitmap: ensure to load bitmap when creating via sysfs.
When commit 69e51b449d (md/bitmap:  separate out loading a bitmap...)
created bitmap_load, it missed calling it after bitmap_create when a
bitmap is created through the sysfs interface.
So if a bitmap is added this way, we don't allocate memory properly
and can crash.

This is suitable for any -stable release since 2.6.35.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:37 +11:00
NeilBrown c744a65c1e md: don't set md arrays to readonly on shutdown.
It seems that with recent kernel, writeback can still be happening
while shutdown is happening, and consequently data can be written
after the md reboot notifier switches all arrays to read-only.
This causes a BUG.

So don't switch them to read-only - just mark them clean and
set 'safemode' to '2' which mean that immediately after any
write the array will be switch back to 'clean'.

This could result in the shutdown happening when array is marked
dirty, thus forcing a resync on reboot.  However if you reboot
without performing a "sync" first, you get to keep both halves.

This is suitable for any stable kernel (though there might be some
conflicts with obvious fixes in earlier kernels).

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:37 +11:00
NeilBrown dc10c643e8 md: allow re-add to failed arrays.
When an array is failed (some data inaccessible) then there is no
point attempting to add a spare as it could not possibly be recovered.

However that may be value in re-adding a recently removed device.
e.g. if there is a write-intent-bitmap and it is clear, then access
to the data could be restored by this action.

So don't reject a re-add to a failed array for RAID10 and RAID5 (the
only arrays  types that check for a failed array).

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-19 12:46:37 +11:00
majianpeng 41fe75f60b md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:25 +11:00
NeilBrown 9d4c7d8799 md/raid5: removed unused 'added_devices' variable.
commit 908f4fbd26 removed the last user of this variable,
so we should discard it completely.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:21 +11:00
NeilBrown 547414d19f md/raid10: remove unnecessary smp_mb() from end_sync_write
Recent commit 4ca40c2ce0 (md/raid10: Allow replacement device ...)
added an smp_mb in end_sync_write.
This was to close a possible race with raid10_remove_disk.
However there is no such race as it is never attempted to remove a
disk while resync (or recovery) is happening.
so the smp_mb is just noise.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:20 +11:00
NeilBrown 1e3fa9bd50 md/raid5: make sure reshape_position is cleared on error path.
Leaving a valid reshape_position value in place could be confusing.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-13 11:21:18 +11:00
Linus Torvalds 5d0edf2915 Device-mapper fixes for 3.3.
Eight small device-mapper bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPV8+yAAoJEK2W1qbAHj1nZVAQAI8TNKwnBpKSW3Y9XFHqWjEx
 71wbjDkKkdEUWy52CAkSoRnQdX+ABxxGr5R60n/vJvHi4yDse56LddPzKAo4zD3c
 DVh6RB8CTIY+2IXGzjkDtelmKogKyAMlhmRoj0oLb5/29n6lnn6A0vkq4OimuFJO
 IIdgJxpRLqmV8NcSVC7qCEoErxzTNz9w7HaBBs73VhF8AcN/6Qi/z55zDOzT/Iz8
 iMHGmOHJBb8OxMN8BWWFdDh2YUz3isbM1xbBerYxy3P3WCHpxGBt7yRiHm3Yd5il
 USnJN3Kz0w6Orhgu1eeAuJz1A9cdSP62AQDdM91+v3nHz3mtTdAljmJZgzgzqs5u
 SRO24J6FD201DNh/RitDC1UzNOBqeapfqprT/gH+qM4Pl6X+vuXiSe5cxx+lTOhJ
 GErI1XYpTfzymdpQfqj6VnDMevRf0Hz+mSjEiUh8qjUv9bXHkmTrzjxCvAIEM+4h
 fJSQ0Fp77eV7Du9HkkFbEXVTYOe8VO+6E9AaplBAjZxHS6w+5tMFkHTM28JPxS98
 rYAks9QKbaZaEYZiNv7htux8n2OS9IeGHdLQpsooLh6lD4GxvBJ7NC8wUkfUzn27
 zEr2vqAYuA3PiccSHnT7tlN0PN1JlOjDCf+cdQkKfJj5w0E/qS2Fiv2UFIRLRPEa
 blSbf7wU0mpvorQJn/bd
 =lLJB
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper fixes for 3.3 from Alasdair Kergon

Eight small device-mapper bug fixes.

* tag 'dm-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm raid: fix flush support
  dm raid: set MD_CHANGE_DEVS when rebuilding
  dm thin metadata: decrement counter after removing mapped block
  dm thin metadata: unlock superblock in init_pmd error path
  dm thin metadata: remove incorrect close_device on creation error paths
  dm flakey: fix crash on read when corrupt_bio_byte not set
  dm io: fix discard support
  dm ioctl: do not leak argv if target message only contains whitespace
2012-03-08 17:21:51 -08:00
Jonathan E Brassow 0ca93de9b7 dm raid: fix flush support
Fix dm-raid flush support.

Both md and dm have support for flush, but the dm-raid target
forgot to set the flag to indicate that flushes should be
passed on.  (Important for data integrity e.g. with writeback cache
enabled.)

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:48 +00:00
Jonathan E Brassow 3aa3b2b2b1 dm raid: set MD_CHANGE_DEVS when rebuilding
The 'rebuild' parameter is used to rebuild individual devices in an
array (e.g. resynchronize a RAID1 device or recalculate a parity device
in higher RAID).  The MD_CHANGE_DEVS flag must be set when this
parameter is given in order to write out the superblocks and make the
change take immediate effect.  The code that handles new devices in
super_load already sets MD_CHANGE_DEVS and 'FirstUse'.  (The 'FirstUse'
flag was being set as a special case for rebuilds in
super_init_validation.)

Add a condition for rebuilds in super_load to take care of both flags
without the special case in 'super_init_validation'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:47 +00:00
Joe Thornber af63bcb817 dm thin metadata: decrement counter after removing mapped block
Correct the number of mapped sectors shown on a thin device's
status line by decrementing td->mapped_blocks in __remove() each time
a block is removed.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:44 +00:00
Joe Thornber 4469a5f387 dm thin metadata: unlock superblock in init_pmd error path
If dm_sm_disk_create() fails the superblock must be unlocked.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:43 +00:00
Mike Snitzer 1f3db25d8b dm thin metadata: remove incorrect close_device on creation error paths
The __open_device() error paths in __create_thin() and __create_snap()
incorrectly call __close_device() even if td was not initialized by
__open_device().  Remove this.

Also document __open_device() return values, remove a redundant
td->changed = 1 in __create_thin(), and insert an additional
safeguard against creating an already-existing device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:41 +00:00
Mike Snitzer 1212268fd9 dm flakey: fix crash on read when corrupt_bio_byte not set
The following BUG is hit on the first read that is submitted to a dm
flakey test device while the device is "down" if the corrupt_bio_byte
feature wasn't requested when the device's table was loaded.

Example DM table that will hit this BUG:
0 2097152 flakey 8:0 2048 0 30

This bug was introduced by commit a3998799fb
(dm flakey: add corrupt_bio_byte feature) in v3.1-rc1.

BUG: unable to handle kernel paging request at ffff8801cfce3fff
IP: [<ffffffffa008c233>] corrupt_bio_data+0x6e/0xae [dm_flakey]
PGD 1606063 PUD 0
Oops: 0002 [#1] SMP
...
Call Trace:
 <IRQ>
 [<ffffffffa008c2b5>] flakey_end_io+0x42/0x48 [dm_flakey]
 [<ffffffffa00dca98>] clone_endio+0x54/0xb6 [dm_mod]
 [<ffffffff81130587>] bio_endio+0x2d/0x2f
 [<ffffffff811c819a>] req_bio_endio+0x96/0x9f
 [<ffffffff811c94b9>] blk_update_request+0x1dc/0x3a9
 [<ffffffff812f5ee2>] ? rcu_read_unlock+0x21/0x23
 [<ffffffff811c96a6>] blk_update_bidi_request+0x20/0x6e
 [<ffffffff811c9713>] blk_end_bidi_request+0x1f/0x5d
 [<ffffffff811c978d>] blk_end_request+0x10/0x12
 [<ffffffff8128f450>] scsi_io_completion+0x1e5/0x4b1
 [<ffffffff812882a9>] scsi_finish_command+0xec/0xf5
 [<ffffffff8128f830>] scsi_softirq_done+0xff/0x108
 [<ffffffff811ce284>] blk_done_softirq+0x84/0x98
 [<ffffffff81048d19>] __do_softirq+0xe3/0x1d5
 [<ffffffff8138f83f>] ? _raw_spin_lock+0x62/0x69
 [<ffffffff810997cf>] ? handle_irq_event+0x4c/0x61
 [<ffffffff8139833c>] call_softirq+0x1c/0x30
 [<ffffffff81003b37>] do_softirq+0x4b/0xa3
 [<ffffffff81048a39>] irq_exit+0x53/0xca
 [<ffffffff81398acd>] do_IRQ+0x9d/0xb4
 [<ffffffff81390333>] common_interrupt+0x73/0x73
...

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.1+
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:39 +00:00
Milan Broz 0c535e0d6f dm io: fix discard support
This patch fixes a crash by recognising discards in dm_io.

Currently dm_mirror can send REQ_DISCARD bios if running over a
discard-enabled device and without support in dm_io the system
crashes badly.

BUG: unable to handle kernel paging request at 00800000
IP:  __bio_add_page.part.17+0xf5/0x1e0
...
 bio_add_page+0x56/0x70
 dispatch_io+0x1cf/0x240 [dm_mod]
 ? km_get_page+0x50/0x50 [dm_mod]
 ? vm_next_page+0x20/0x20 [dm_mod]
 ? mirror_flush+0x130/0x130 [dm_mirror]
 dm_io+0xdc/0x2b0 [dm_mod]
...

Introduced in 2.6.38-rc1 by commit 5fc2ffeabb
(dm raid1: support discard).

Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: stable@kernel.org
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:37 +00:00
Jesper Juhl 902c6a96a7 dm ioctl: do not leak argv if target message only contains whitespace
If 'argc' is zero we jump to the 'out:' label, but this leaks the
(unused) memory that 'dm_split_args()' allocated for 'argv' if the
string being split consisted entirely of whitespace.  Jump to the
'out_argv:' label instead to free up that memory.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-03-07 19:09:34 +00:00
Linus Torvalds a2e5f13ce8 3 fixes for md in 3.3-rc
2 relate to the recently added drive replacement.
 
 One causes read error in RAID10 to sometimes be retried indefinitely.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT1VI1znsnt1WYoG5AQK47Q//d51y5QCpABFNUcgIM626zJXlBWFUSmzU
 wFOGXh5emN6/TWguzkiZwrvcspDmXMzz1zmJtGWixYb2jBpn2MHEN4uNz3Vq68w+
 IYk/dJg/CG4+lzX+6IjiHOb3+TASRx94QZHJASx68vypqniAyikshqcbUeZBMTB0
 Fu+sKqsOGYmwQfe6/vtRPVXY7DYK2dFDBRMFpmOl+o4Y2XxmmWzMw4Dg1RIEdtFS
 Jo9GwLHTnlw2xoc0XooufeT0Q2KOpqi9T8L6Nj0ORwpgsFqgtZ/kIOoGU6qOpSri
 ofLTrobVKMpjFtmiYVOp9TaBlPnd/TNX3E4WPLGNsAwYuRUFjq8evmJKjG+pOdeB
 3ArxRKRJCaI2jnVhH+NpT7i/tpkEg/8a/BoOAihX+hM/8QkmsWluaRBOGMhpuuuc
 1baPVTusi/zijO9cM8RGIXaQj5UG4s3LUpCIOIYdDyxsfmAH5KN1F2EPrU4NMME2
 96THSshIZLkgAg5ICwtva0qoHlBlEclAlVAzEomT7R9KwHojEB1xUiyMmaIdMFoy
 JjGFAMp2E5+KBKZ1eYEHjthPWCb+nZ3eYHUh0DOnEt4kASCXnn45GJREQkpkNIR/
 HhDTS8vI743unKnbCtYFMxiw/9OXZbMkdoZhobg7lxcpoQlWJ+5ziOtACl0h0Kv8
 +ET+Kp3W8K4=
 =93ms
 -----END PGP SIGNATURE-----

Merge tag 'md-3.3-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Three fixes for md in 3.3-rc: Two relate to the recently added drive
  replacement.  One fixes the problem where a read error in RAID10 would
  sometimes be retried indefinitely."

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md/raid10: fix assembling of arrays with replacement devices.
  md/raid10: fix handling of error on last working device in array.
  md/raid1: fix buglet in md_raid1_contested.
2012-03-05 16:01:25 -08:00
NeilBrown 7a90484825 md/raid10: fix assembling of arrays with replacement devices.
commit 56a2559bb6 (md/raid10: recognise replacements ...)
changed 'run' to set ->replacement or ->rdev depending on the
'Replacement' status if the device, but it didn't remove the
old unconditional setting of 'rdev'.  So it was largely ineffective.

So remove that now.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-03-06 10:12:45 +11:00
NeilBrown fae8cc5ed0 md/raid10: fix handling of error on last working device in array.
If we get a read error on the last working device in a RAID10 which
contains the target block, then we don't fail the device (which is
good) but we don't abort retries, which is wrong.
We end up in an infinite loop retrying the read on the one device.

This patch fixes the problem in two places:
1/ in raid10_end_read_request we don't even ask for a retry if this
   was the last usable device.  This is efficient but a little racy
   and will sometimes retry when it should not.

2/ in handle_read_error we are careful to exclude any device from
   retry which we tried to mark as faulty (that might have failed if
   it was the last device).  This is race-free but less efficient.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-14 11:10:10 +11:00
NeilBrown f53e29fc87 md/raid1: fix buglet in md_raid1_contested.
Since we added 'replacement' capability, RAID1 can have twice
as many devices as ->raid_disks indicates.
So md_raid1_congested needs to check that many possible devices,
not just ->raid_disks many.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-13 14:24:05 +11:00
Linus Torvalds 4d39aa1b99 Some simple md-related fixes.
1/ two small fixes to ensure we handle an interrupted resync properly.
 2/ avoid loading the bitmap multiple times in dm-raid
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUATzMdiTnsnt1WYoG5AQKICw/9H3Xf/3crCCVRQ+yzSdZ1ZJH24Rps9O6W
 8dLFN4/Ng/qxymWUMrgHAMq5MEEz2M3i7W+j23lFv6Oce06y8GJ4PpoYY5xlXCgO
 SIU1BaO1JFHxQn89EQtP3iOn4AOiZvX0GUObR0P8KO1mMnLmN7cg8J1kBfmQiBKu
 aXcUqqNvcywoix6ve4O/xgnZjd4IExxqG3W8U7CaIwExUDwaLY4NckxJcIJbIYy9
 iapOGMUdcyr6xm819V/xE2DyAtfFCtvAk1hfW/dM4QQctran3MzQIRFn9RW+CwHU
 ComEnv5ti/7g//JPXQArUPk4xgRHrMhqFcmmD8rozJ6FJDi8vw2e0BXaRLVqa0mK
 1qSZkr0Ot3nwAdILzgSbNXQ0Y5OJgc9OLX5GGlVibTW2VTJYFgA7jAsnqq8PAJC5
 sU5h2K3jrSy2unGy6BxleL5D/wvREE5OBnW35TEB5TYbxjp1FLgn+BWp8FfFUYWT
 Eb2cIyAj6cBFJ3ma1K0RH0dmS9cbNjuG+CLiApJOnEEsXzrp/4KnqOwg4672ewW3
 m1Ue2Qv+0avaK3sVyT+qzuemc6b0ps/dix0gMXw2pYqXQWHquW5NdUJcgD2DKFSn
 BB734nUP6KlPg0IFh1eehRHyVRLIAot/uBlUJ3bMx9xeYCkKa+twX90u6EmjTopP
 JjLxNsf6c2I=
 =k0Xz
 -----END PGP SIGNATURE-----

Merge tag 'md-3.3-fixes' of git://neil.brown.name/md

Some simple md-related fixes.

1/ two small fixes to ensure we handle an interrupted resync properly.
2/ avoid loading the bitmap multiple times in dm-raid

* tag 'md-3.3-fixes' of git://neil.brown.name/md:
  md: two small fixes to handling interrupt resync.
  Prevent DM RAID from loading bitmap twice.
2012-02-08 19:06:30 -08:00
NeilBrown db91ff55bd md: two small fixes to handling interrupt resync.
1/ If a resync is aborted we should record how far we got
 (recovery_cp) the last request that we know has completed
 (->curr_resync_completed) rather than the last request that was
 submitted (->curr_resync).

2/ When a resync aborts we still want to update the metadata with
 any changes, so set MD_CHANGE_DEVS even if we 'skip'.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-02-07 12:01:51 +11:00