linux

Commit Graph

Author	SHA1	Message	Date
Mike Snitzer	876fbba1db	dm table: avoid crash if integrity profile changes Commit `a63a5cf` (dm: improve block integrity support) introduced a two-phase initialization of a DM device's integrity profile. This patch avoids dereferencing a NULL 'template_disk' pointer in blk_integrity_register() if there is an integrity profile mismatch in dm_table_set_integrity(). This can occur if the integrity profiles for stacked devices in a DM table are changed between the call to dm_table_prealloc_integrity() and dm_table_set_integrity(). Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org # 2.6.39	2011-09-25 23:26:17 +01:00
Mike Snitzer	68e58a294f	dm: flakey fix corrupt_bio_byte error path If no arguments were provided to the corrupt_bio_byte feature an error should be returned immediately. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-09-25 23:26:15 +01:00
Daniel P. Berrange	2dba6a911c	md: don't delay reboot by 1 second if no MD devices exist The md_notify_reboot() method includes a call to mdelay(1000), to deal with "exotic SCSI devices" which are too volatile on reboot. The delay is unconditional. Even if the machine does not have any block devices, let alone MD devices, the kernel shutdown sequence is slowed down. 1 second does not matter much with physical hardware, but with certain virtualization use cases any wasted time in the bootup & shutdown sequence counts for a lot. * drivers/md/md.c: md_notify_reboot() - only impose a delay if there was at least one MD device to be stopped during reboot Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-23 19:54:04 +10:00
Wang Sheng-Hui	7e84152626	trivial: md_k.h should be md.h in the beginning comment of file md.h Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-21 15:37:46 +10:00
NeilBrown	2585f3ef8c	md/bitmap: improve handling of 'allclean'. The 'allclean' flag is used to cache the fact that there is nothing to do, so we can avoid waking up and scanning the bitmap regularly. The two sorts of pages that might need the attention of the bitmap daemon are BITMAP_PAGE_PENDING and BITMAP_PAGE_NEEDWRITE pages. So make sure allclean reflects exactly when there are none of those. So: set it before scanning all pages with either bit set. clear it whenever these bits are set clear it when we desire not to clear one of these bits. don't clear it any other time. Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-21 15:37:46 +10:00
NeilBrown	5a537df44d	md/bitmap: rename and tidy up BITMAP_PAGE_CLEAN The flag 'BITMAP_PAGE_CLEAN' has a confusing name as it doesn't mean that the page is clean, but rather that there are counters in the page which allow bits in the bitmap to be cleared - i.e. maybe cleaning can happen. So change it to BITMAP_PAGE_PENDING and fix some irregularities: - Don't set it in bitmap_init_from_disk as bitmap_set_memory_bits sets it when needed - in bitmap_daemon_work, if we find a counter that is '1', but need_sync is set, then set BITMAP_PAGE_PENDING again (it was recently cleared) to ensure we don't forget about this bit. Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-21 15:37:46 +10:00
NeilBrown	01f96c0a99	md: Avoid waking up a thread after it has been freed. Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappearing either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protections. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org	2011-09-21 15:30:20 +10:00
Christoph Hellwig	5a7bbad27a	block: remove support for bio remapping from ->make_request There is very little benefit in letting a ->make_request instance update the bio's device and sector and loop around it in __generic_make_request when we can achieve the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:12:01 +02:00
Jens Axboe	c20e8de27f	block: rename __make_request() to blk_queue_bio() Now that it's exported, let's put it in a more sane namespace. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:31 +02:00
Christoph Hellwig	166e1f901b	block: export __make_request Avoid the hacks needed for request based device mappers currently by simply exporting the symbol instead of trying to get it through the back door. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:27 +02:00
NeilBrown	27a7b260f7	md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. 0.90 metadata uses an unsigned 32bit number to count the number of kilobytes used from each device. This should allow up to 4TB per device. However we multiply this by 2 (to get sectors) before casting to a larger type, so sizes above 2TB get truncated. Also we allow rdev->sectors to be larger than 4TB, so it is possible for the array to be resized larger than the metadata can handle. So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in use. Also the sanity check at the end of super_90_load should include level 1 as it uses ->size too. (RAID0 and Linear don't use ->size at all). Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-10 17:21:28 +10:00
NeilBrown	079fa166a2	md/raid1,10: Remove use-after-free bug in make_request. A single request to RAID1 or RAID10 might result in multiple requests if there are known bad blocks that need to be avoided. To detect if we need to submit another write request we test: if (sectors_handled < (bio->bi_size >> 9)) { However this is after we call _write_done() so the 'bio' no longer belongs to us - the writes could have completed and the bio freed. So move the _write_done call until after the test against bio->bi_size. This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862 Reported-by: Bruno Wolff III <bruno@wolff.to> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-10 17:21:23 +10:00
NeilBrown	19d5f834d6	md/raid10: unify handling of write completion. A write can complete at two different places: 1/ when the last member-device write completes, through raid10_end_write_request 2/ in make_request() when we remove the initial bias from ->remaining. These two should do exactly the same thing and the comment says they do, but they don't. So factor the correct code out into a function and call it in both places. This makes the code much more similar to RAID1. The difference is only significant if there is an error, and they usually take a while, so it is unlikely that there will be an error already when make_request is completing, so this is unlikely to cause real problems. Signed-off-by: NeilBrown <neilb@suse.de>	2011-09-10 17:21:17 +10:00
NeilBrown	43220aa0f2	md/raid5: fix a hang on device failure. Waiting for a 'blocked' rdev to become unblocked in the raid5d thread cannot work with internal metadata as it is the raid5d thread which will clear the blocked flag. This wasn't a problem in 3.0 and earlier as we only set the blocked flag when external metadata was used then. However we now set it always, so we need to be more careful. Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-31 12:49:14 +10:00
NeilBrown	7da64a0abc	md: fix clearing of 'blocked' flag in the presence of bad blocks. When the 'blocked' flag on a device is cleared while there are unacknowledged bad blocks we must fail the device. This is needed for backwards compatibility of the interface. The code currently uses the wrong test for "unacknowledged bad blocks exist". Change it to the right test. Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-30 16:20:17 +10:00
NeilBrown	1b6afa1758	md/linear: avoid corrupting structure while waiting for rcu_free to complete. I don't know what I was thinking putting 'rcu' after a dynamically sized array! The array could still be in use when we call rcu_free() (That is the point) so we mustn't corrupt it. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-25 14:43:53 +10:00
Namhyung Kim	a5bf4df0c8	md: use REQ_NOIDLE flag in md_super_write() Queue idling is used for the anticipation of immediate sequential I/Os but md_super_write() is a kind of one-shot operation, coupled with md_super_wait(), so the idling in this case will be just a waste of time. Specifying REQ_NOIDLE prevents it. Instead of adding the flag to submit_bio() directly, use pre-defined macro WRITE_FLUSH_FUA. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-25 14:43:34 +10:00
NeilBrown	aeb9b21184	md: ensure changes to 'write-mostly' are reflected in metadata. The 'write-mostly' flag can be changed through sysfs. With 0.90 metadata, those changes are reflected in the metadata. For 1.x metadata, they aren't. So fix super_1_sync to record 'write-mostly' status. Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-25 14:43:08 +10:00
NeilBrown	5ef56c8fec	md: report failure if a 'set faulty' request doesn't. Sometimes a device will refuse to be set faulty. e.g. RAID1 will never let the last working device become faulty. So check if "md_error()" did manage to set the faulty flag and fail with EBUSY if it didn't. Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198 Reported-by: Mike Hommey <mh+reportbug@glandium.org> Signed-off-by: NeilBrown <neilb@suse.de>	2011-08-25 14:42:51 +10:00
Mike Snitzer	ed8b752bcc	dm table: set flush capability based on underlying devices DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities regardless of whether or not a given DM device's underlying devices also advertised a need for them. Block's flush-merge changes from 2.6.39 have proven to be more costly for DM devices. Performance regressions have been reported even when DM's underlying devices do not advertise that they have a write cache. Fix the performance regressions by configuring a DM device's flushing capabilities based on those of the underlying devices' capabilities. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:08 +01:00
Milan Broz	772ae5f54d	dm crypt: optionally support discard requests Add optional parameter field to dmcrypt table and support "allow_discards" option. Discard requests bypass crypt queue processing. The bio is simply remapped to the underlying device. Note that discard will never be enabled by default because of security consequences. It is up to the administrator to enable it for encrypted devices. (Note that userspace cryptsetup does not understand new optional parameters yet. Support for this will come later. Until then, you should use 'dmsetup' to enable and disable this.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:08 +01:00
Jonathan Brassow	327372797c	dm raid: add md raid1 support Support the MD RAID1 personality through dm-raid. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:07 +01:00
Jonathan Brassow	b12d437b73	dm raid: support metadata devices Add the ability to parse and use metadata devices to dm-raid. Although not strictly required, without the metadata devices, many features of RAID are unavailable. They are used to store a superblock and bitmap. The role, or position in the array, of each device must be recorded in its superblock. This is to help with fault handling, array reshaping, and sanity checks. RAID 4/5/6 devices must be loaded in a specific order: in this way, the 'array_position' field helps validate the correctness of the mapping when it is loaded. It can be used during reshaping to identify which devices are added/removed. Fault handling is impossible without this field. For example, when a device fails it is recorded in the superblock. If this is a RAID1 device and the offending device is removed from the array, there must be a way during subsequent array assembly to determine that the failed device was the one removed. This is done by correlating the 'array_position' field and the bit-field variable 'failed_devices'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:07 +01:00
Jonathan Brassow	46bed2b5c1	dm raid: add write_mostly parameter Add the write_mostly parameter to RAID1 dm-raid tables. This allows the user to set the WriteMostly flag on a RAID1 device that should normally be avoided for read I/O. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:07 +01:00
Jonathan Brassow	c1084561bb	dm raid: add region_size parameter Allow the user to specify the region_size. Ensures that the supplied value meets md's constraints, viz. the number of regions does not exceed 2^21. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:07 +01:00
Mikulas Patocka	759dea204c	dm ioctl: forbid multiple device specifiers Exactly one of name, uuid or device must be specified when referencing an existing device. This removes the ambiguity (risking the wrong device being updated) if two conflicting parameters were specified. Previously one parameter got used and any others were ignored silently. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:06 +01:00
Mikulas Patocka	ba2e19b0f4	dm ioctl: introduce __get_dev_cell Move logic to find device based on major/minor number to a separate function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell). This makes the function __find_device_hash_cell more straightforward. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:06 +01:00
Mikulas Patocka	0ddf9644cc	dm ioctl: fill in device parameters in more ioctls Move parameter filling from find_device to __find_device_hash_cell. This patch causes ioctls using __find_device_hash_cell (DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD) to return device parameters, bringing them into line with the other ioctls. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:06 +01:00
Mike Snitzer	a3998799fb	dm flakey: add corrupt_bio_byte feature Add corrupt_bio_byte feature to simulate corruption by overwriting a byte at a specified position with a specified value during intervals when the device is "down". Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:06 +01:00
Mike Snitzer	b26f5e3d71	dm flakey: add drop_writes Add 'drop_writes' option to drop writes silently while the device is 'down'. Reads are not touched. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:05 +01:00
Mike Snitzer	dfd068b01f	dm flakey: support feature args Add the ability to specify arbitrary feature flags when creating a flakey target. This code uses the same target argument helpers that the multipath target does. Also remove the superfluous 'dm-flakey' prefixes from the error messages, as they already contain the prefix 'flakey'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:05 +01:00
Mike Snitzer	30e4171bfe	dm flakey: use dm_target_offset and support discards Use dm_target_offset() and support discards. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:05 +01:00
Mike Snitzer	498f0103ea	dm table: share target argument parsing functions Move multipath target argument parsing code into dm-table so other targets can share it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:04 +01:00
Mikulas Patocka	a6e50b409d	dm snapshot: skip reading origin when overwriting complete chunk If we write a full chunk in the snapshot, skip reading the origin device because the whole chunk will be overwritten anyway. This patch changes the snapshot write logic when a full chunk is written. In this case: 1. allocate the exception 2. dispatch the bio (but don't report the bio completion to device mapper) 3. write the exception record 4. report bio completed Callbacks must be done through the kcopyd thread, because callbacks must not race with each other. So we create two new functions: dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback. (This function must not be called from interrupt context.) dm_kcopyd_do_callback: submit callback. (This function may be called from interrupt context.) Performance test (on snapshots with 4k chunk size): without the patch: non-direct-io sequential write (dd): 17.7MB/s direct-io sequential write (dd): 20.9MB/s non-direct-io random write (mkfs.ext2): 0.44s with the patch: non-direct-io sequential write (dd): 26.5MB/s direct-io sequential write (dd): 33.2MB/s non-direct-io random write (mkfs.ext2): 0.27s Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:04 +01:00
Mikulas Patocka	d5b9dd04bd	dm: ignore merge_bvec for snapshots when safe Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate whether the device can accept bios larger than the size its merge function returns. When set, use this to send large bios to snapshots which can split them if necessary. Snapshot I/O may be significantly fragmented and this approach seems to improve performance. Before the patch, dm_set_device_limits restricted bio size to page size if the underlying device had a merge function and the target didn't provide a merge function. After the patch, dm_set_device_limits restricts bio size to page size if the underlying device has a merge function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't provide a merge function. The snapshot target can't provide a merge function because when the merge function is called, it is impossible to determine where the bio will be remapped. Previously this led us to impose a 4k limit, which we can now remove if the snapshot store is located on a device without a merge function. Together with another patch for optimizing full chunk writes, it improves performance from 29MB/s to 40MB/s when writing to the filesystem on snapshot store. If the snapshot store is placed on a non-dm device with a merge function (such as md-raid), device mapper still limits all bios to page size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:04 +01:00
Mike Snitzer	0864901254	dm table: clean dm_get_device and move exports There is no need for __table_get_device to be factored out. Also move the exports to the end of their respective functions. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:04 +01:00
Alasdair G Kergon	3e8dbb7f39	dm raid: tidy includes A dm target only needs to use include/linux dm headers. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:03 +01:00
Alasdair G Kergon	2ca4c92f58	dm ioctl: prevent empty message Detect invalid empty messages in core dm instead of requiring every target to check this. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:03 +01:00
Jonathan Brassow	13c87583ea	dm raid: cleanup parameter handling Re-order the parameters so they are handled consistently in the same order where defined, parsed and output. Only include rebuild parameters in the STATUSTYPE_TABLE output if they were supplied in the original table line. Correct the parameter count when outputting rebuild: there are two words, not one. Use case-independent checks for keywords (as in other device-mapper targets). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:03 +01:00
Jonathan Brassow	a2d2b0345a	dm snapshot: style cleanups Coding style cleanups. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>	2011-08-02 12:32:03 +01:00
Mikulas Patocka	aa3f0794d2	dm snapshot: remove unused definitions Remove a couple of unused #defines. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:03 +01:00
Mikulas Patocka	5bf45a3dcd	dm kcopyd: remove nr_pages field from job structure The nr_pages field in struct kcopyd_job is only used temporarily in run_pages_job() to count the number of required pages. We can use a local variable instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:02 +01:00
Mikulas Patocka	4622afb3f5	dm kcopyd: remove offset field from job structure The offset field in struct kcopyd_job is always zero so remove it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:02 +01:00
Joe Perches	e29e65aacb	dm: use vzalloc Use vzalloc() instead of vmalloc()+memset(). Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:02 +01:00
Kirill A. Shutemov	6c9b27ab08	dm log: userspace use list_move Replace list_del() followed by list_add() with list_move(). Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:02 +01:00
Akinobu Mita	c8f543e078	dm log: clean up bit little endian bitops Using __test_and_{set,clear}_bit_le() with ignoring its return value can be replaced with __{set,clear}_bit_le(). This also removes unnecessary casts. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:01 +01:00
Mike Snitzer	936688d7eb	dm table: fix discard support Remove 'discards_supported' from the dm_table structure. The same information can be easily discovered from the table's target(s) in dm_table_supports_discards(). Before this fix dm_table_supports_discards() would skip checking the individual targets' 'discards_supported' flag if any one target in the table didn't set num_discard_requests > 0. Now the per-target 'discards_supported' flag is effective at ensuring the final DM device advertises discard support. But, to be clear, targets that don't support discards (!num_discard_requests) will not receive discard requests. Also DMWARN if a target sets 'discards_supported' override but forgets to set 'num_discard_requests'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:01 +01:00
Alasdair G Kergon	283a8328ca	dm: suppress endian warnings Suppress sparse warnings about cpu_to_le32() by using __le32 types for on-disk data etc. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:01 +01:00
Alasdair G Kergon	d15b774c29	dm: fix idr leak on module removal Destroy _minor_idr when unloading the core dm module. (Found by kmemleak.) Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:01 +01:00
Mikulas Patocka	bb91bc7bac	dm io: flush cpu cache with vmapped io For normal kernel pages, CPU cache is synchronized by the dma layer. However, this is not done for pages allocated with vmalloc. If we do I/O to/from vmallocated pages, we must synchronize CPU cache explicitly. Prior to doing I/O on vmallocated page we must call flush_kernel_vmap_range to flush dirty cache on the virtual address. After finished read we must call invalidate_kernel_vmap_range to invalidate cache on the virtual address, so that accesses to the virtual address return newly read data and not stale data from CPU cache. This patch fixes metadata corruption on dm-snapshots on PA-RISC and possibly other architectures with caches indexed by virtual address. Cc: stable <stable@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:01 +01:00
Mike Snitzer	286f367dad	dm mpath: fix potential NULL pointer in feature arg processing Avoid dereferencing a NULL pointer if the number of feature arguments supplied is fewer than indicated. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2011-08-02 12:32:00 +01:00
Mikulas Patocka	762a80d9fc	dm snapshot: flush disk cache when merging This patch makes dm-snapshot flush disk cache when writing metadata for merging snapshot. Without cache flushing the disk may reorder metadata write and other data writes and there is a possibility of data corruption in case of power fault. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:00 +01:00
Linus Torvalds	6140333d36	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (75 commits) md/raid10: handle further errors during fix_read_error better. md/raid10: Handle read errors during recovery better. md/raid10: simplify read error handling during recovery. md/raid10: record bad blocks due to write errors during resync/recovery. md/raid10: attempt to fix read errors during resync/check md/raid10: Handle write errors by updating badblock log. md/raid10: clear bad-block record when write succeeds. md/raid10: avoid writing to known bad blocks on known bad drives. md/raid10 record bad blocks as needed during recovery. md/raid10: avoid reading known bad blocks during resync/recovery. md/raid10 - avoid reading from known bad blocks - part 3 md/raid10: avoid reading from known bad blocks - part 2 md/raid10: avoid reading from known bad blocks - part 1 md/raid10: Split handle_read_error out from raid10d. md/raid10: simplify/reindent some loops. md/raid5: Clear bad blocks on successful write. md/raid5. Don't write to known bad block on doubtful devices. md/raid5: write errors should be recorded as bad blocks if possible. md/raid5: use bad-block log to improve handling of uncorrectable read errors. md/raid5: avoid reading from known bad blocks. ...	2011-07-28 05:50:27 -07:00
NeilBrown	58c54fcca3	md/raid10: handle further errors during fix_read_error better. If we find more read/write errors we should record a bad block before failing the device. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	5e5702898e	md/raid10: Handle read errors during recovery better. Currently when we get a read error during recovery, we simply abort the recovery. Instead, repeat the read in page-sized blocks. On successful reads, write to the target. On read errors, record a bad block on the destination, and only if that fails do we abort the recovery. As we now retry reads we need to know where we read from. This was in bi_sector but that can be changed during a read attempt. So store the correct from_addr and to_addr in the r10_bio for later access. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	e684e41db3	md/raid10: simplify read error handling during recovery. If a read error is detected during recovery the code currently fails the read device. This isn't really necessary. recovery_request_write will signal a write error to end_sync_write and it will record a write error on the destination device which will record a bad block there or kick it from the array. So just remove this call to do md_error. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	1a0b7cd826	md/raid10: record bad blocks due to write errors during resync/recovery. If we get a write error during resync/recovery don't fail the device but instead record a bad block. If that fails we can then fail the device. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	f84ee364dd	md/raid10: attempt to fix read errors during resync/check We already attempt to fix read errors found during normal IO and a 'repair' process. It is best to try to repair them at any time they are found, so move a test so that during sync and check a read error will be corrected by over-writing with good data. If both (all) devices have known bad blocks in the sync section we won't try to fix even though the bad blocks might not overlap. That should be considered later. Also if we hit a read error during recovery we don't try to fix it. It would only be possible to fix if there were at least three copies of data, which is not very common with RAID10. But it should still be considered later. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:25 +10:00
NeilBrown	bd870a16c5	md/raid10: Handle write errors by updating badblock log. When we get a write error (in the data area, not in metadata), update the badblock log rather than failing the whole device. As the write may well be many blocks, we try writing each block individually and only log the ones which fail. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	749c55e942	md/raid10: clear bad-block record when write succeeds. If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling as the bad-block-list update has to happen in process-context. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	d4432c23be	md/raid10: avoid writing to known bad blocks on known bad drives. Writing to known bad blocks on drives that have seen a write error is asking for trouble. So try to avoid these blocks. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	e875ecea26	md/raid10 record bad blocks as needed during recovery. When recovering one or more devices, if all the good devices have bad blocks we should record a bad block on the device being rebuilt. If this fails, we need to abort the recovery. To ensure we don't think that we aborted later than we actually did, we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync, in particular before mddev->curr_resync is updated. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	40c356ce5a	md/raid10: avoid reading known bad blocks during resync/recovery. During resync/recovery limit the size of the request to avoid reading into a bad block that does not start at-or-before the current read address. Similarly if there is a bad block at this address, don't allow the current request to extend beyond the end of that bad block. Now that we don't ever read from known bad blocks, it is safe to allow devices with those blocks into the array. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	8dbed5cebd	md/raid10 - avoid reading from known bad blocks - part 3 When attempting to repair a read error, don't read from devices with a known bad block. As we are only reading PAGE_SIZE blocks, we don't try to narrow down to smaller regions in the hope that only part of this page is bad - it isn't worth the effort. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:24 +10:00
NeilBrown	7399c31bc9	md/raid10: avoid reading from known bad blocks - part 2 When redirecting a read error to a different device, we must again avoid bad blocks and possibly split the request. Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	856e08e237	md/raid10: avoid reading from known bad blocks - part 1 This patch just covers the basic read path: 1/ read_balance needs to check for badblocks, and return not only the chosen slot, but also how many good blocks are available there. 2/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' On read error we currently just fail the request if another target cannot handle the whole request. Next patch refines that a bit. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	560f8e5532	md/raid10: Split handle_read_error out from raid10d. raid10d() is too big and is about to get bigger, so split handle_read_error() out as a separate function. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	1294b9c973	md/raid10: simplify/reindent some loops. When a loop ends with a large if, it can be neater to change the if to invert the condition and just 'continue'. Then the body of the if can be indented to a lower level. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	b84db560ea	md/raid5: Clear bad blocks on successful write. On a successful write to a known bad block, flag the sh so that raid5d can remove the known bad block from the list. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:23 +10:00
NeilBrown	73e92e51b7	md/raid5. Don't write to known bad block on doubtful devices. If a device has seen write errors, don't write to any known bad blocks on that device. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:22 +10:00
NeilBrown	bc2607f393	md/raid5: write errors should be recorded as bad blocks if possible. When a write error is detected, don't mark the device as failed immediately but rather record the fact for handle_stripe to deal with. Handle_stripe then attempts to record a bad block. Only if that fails does the device get marked as faulty. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:22 +10:00
NeilBrown	7f0da59bdc	md/raid5: use bad-block log to improve handling of uncorrectable read errors. If we get an uncorrectable read error - record a bad block rather than failing the device. And if these errors (which may be due to known bad blocks) cause recovery to be impossible, record a bad block on the recovering devices, or abort the recovery. As we might abort a recovery without failing a device we need to teach RAID5 about recovery_disabled handling. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:22 +10:00
NeilBrown	31c176ecdf	md/raid5: avoid reading from known bad blocks. There are two times that we might read in raid5: 1/ when a read request fits within a chunk on a single working device. In this case, if there is any bad block in the range of the read, we simply fail the cache-bypass read and perform the read through the stripe cache. 2/ when reading into the stripe cache. In this case we mark as failed any device which has a bad block in that strip (1 page wide). Note that we will both avoid reading and avoid writing. This is correct (as we will never read from the block, there is no point writing), but not optimal (as writing could 'fix' the error) - that will be addressed later. If we have not seen any write errors on the device yet, we treat a bad block like a recent read error. This will encourage an attempt to fix the read error which will either generate a write error, or will ensure good data is stored there. We don't yet forget the bad block in that case. That comes later. Now that we honour bad blocks when reading we can allow devices with bad blocks into the array. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:39:22 +10:00
NeilBrown	62096bce23	md/raid1: factor several functions out of raid1d() raid1d is too big with several deep branches. So separate them out into their own functions. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:38:13 +10:00
NeilBrown	3a9f28a511	md/raid1: improve handling of read failure during recovery. If we cannot read a block from anywhere during recovery, there is now a better approach than just giving up. We can record a bad block on each device and keep going - being careful not to clear the bad block when a write succeeds as it might - it will be a write of incorrect data. We have now reached the state where - for raid1 - we only call md_error if md_set_badblocks has failed. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:33:42 +10:00
NeilBrown	d8f05d2995	md/raid1: record badblocks found during resync etc. If we find a bad block while writing as part of resync/recovery we need to report that back to raid1d which must record the bad block, or fail the device. Similarly when fixing a read error, a further error should just record a bad block if possible rather than failing the device. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:33:00 +10:00
NeilBrown	cd5ff9a16f	md/raid1: Handle write errors by updating badblock log. When we get a write error (in the data area, not in metadata), update the badblock log rather than failing the whole device. As the write may well be many blocks, we try writing each block individually and only log the ones which fail. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:32:41 +10:00
NeilBrown	2ca68f5ed7	md/raid1: store behind-write pages in bi_vecs. When performing write-behind we allocate pages to store the data during write. Previously we just keep a list of pages. Now we keep a list of bi_vec which includes offset and size. This means that the r1bio has complete information to create a new bio which will be needed for retrying after write errors. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:32:10 +10:00
NeilBrown	4367af5561	md/raid1: clear bad-block record when write succeeds. If we succeed in writing to a block that was recorded as being bad, we clear the bad-block record. This requires some delayed handling as the bad-block-list update has to happen in process-context. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:49 +10:00
NeilBrown	1f68f0c4b6	md/raid1: avoid writing to known-bad blocks on known-bad drives. If we have seen any write error on a drive, then don't write to any known-bad blocks on that drive. If necessary, we divide the write request up into pieces just like we do for reads, so each piece is either all written or all not written to any given drive. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:48 +10:00
NeilBrown	de393cdea6	md: make it easier to wait for bad blocks to be acknowledged. It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlocks'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested in. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" we also clear "BlockedBadBlocks" in case it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty or it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	d7a9d443bc	md: add 'write_error' flag to component devices. If a device has ever seen a write error, we will want to handle known-bad-blocks differently. So create an appropriate state flag and export it via sysfs. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:48 +10:00
NeilBrown	06f603851f	md/raid1: avoid reading known bad blocks during resync When performing resync/etc, keep the size of the request small enough that it doesn't overlap any known bad blocks. Devices with badblocks at the start of the request are completely excluded. If there is nowhere to read from due to bad blocks, record a bad block on each target device. Now that we never read from known-bad-blocks we can allow devices with known-bad-blocks into a RAID1. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	d2eb35acfd	md/raid1: avoid reading from known bad blocks. Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:48 +10:00
NeilBrown	9f2f383078	md: Disable bad blocks and v0.90 metadata. v0.90 metadata cannot record bad blocks, so when loading metadata for such a device, set shift to -1. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:47 +10:00
NeilBrown	2699b67223	md: load/store badblock list from v1.x metadata Space must have been allocated when array was created. A feature flag is set when the badblock list is non-empty, to ensure old kernels don't load and trust the whole device. We only update the on-disk badblocklist when it has changed. If the badblocklist (or other metadata) is stored on a bad block, we don't cope very well. If metadata has no room for bad block, flag bad-blocks as disabled, and do the same for 0.90 metadata. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 11:31:47 +10:00
NeilBrown	34b343cff4	md: don't allow arrays to contain devices with bad blocks. As no personality understands bad block lists yet, we must reject any device that is known to contain bad blocks. As the personalities get taught, these tests can be removed. This only applies to raid1/raid5/raid10. For linear/raid0/multipath/faulty the whole concept of bad blocks doesn't mean anything so there is no point adding the checks. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:47 +10:00
NeilBrown	16c791a5af	md/bad-block-log: add sysfs interface for accessing bad-block-log. This can show the log (providing it fits in one page) and allows bad blocks to be 'acknowledged' meaning that they have safely been recorded in metadata. Clearing bad blocks is not allowed via sysfs (except for code testing). A bad block can only be cleared when a write to the block succeeds. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:47 +10:00
NeilBrown	2230dfe4cc	md: beginnings of bad block management. This is the first step in allowing md to track bad-blocks per-device so that we can fail individual blocks rather than the whole device. This patch just adds a data structure for recording bad blocks, with routines to add, remove, search the list. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-28 11:31:46 +10:00
NeilBrown	a519b26dbe	md: remove suspicious size_of() When calling bioset_create we pass the size of the front_pad as sizeof(mddev) which looks suspicious as mddev is a pointer and so it looks like a common mistake where sizeof(*mddev) was intended. The size is actually correct as we want to store a pointer in the front padding of the bios created by the bioset, so make the intent more explicit by using sizeof(mddev_t *) Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-28 07:56:24 +10:00
Jonathan Brassow	768e587e18	MD: generate an event when array sync is complete This patch causes MD to generate an event (for device-mapper) when the synchronization thread is reaped. This is expected behavior for device-mapper. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:37 +10:00
Jonathan Brassow	3520fa4db7	MD bitmap: Revert DM dirty log hooks Revert most of commit `e384e58549` md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. MD should not need to use DM's dirty log - we decided to use md's bitmaps instead. Keeping the DIV_ROUND_UP clean-ups that were part of commit `e384e58549`, however. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:37 +10:00
Jonathan Brassow	654e8b5abc	MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe If device-mapper creates a RAID1 array that includes devices to be rebuilt, it will deref a NULL pointer when finished because sysfs is not used by device-mapper instantiated RAID devices. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	8cfa7b0f67	md/raid5: Avoid BUG caused by multiple failures. While preparing to write a stripe we keep the parity block or blocks locked (R5_LOCKED) - towards the end of schedule_reconstruction. If the array is discovered to have failed before this write completes we can leave those blocks LOCKED, and init_stripe will notice that a free stripe still has a locked block and will complain. So clear the R5_LOCKED flag in handle_failed_stripe, and demote the 'BUG' to a 'WARN_ON'. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	cbea21703b	md/raid10: move rdev->corrected_errors counting Read errors are considered to be corrected if the write-back and re-read cycle finishes without further problems. Thus moving the rdev->corrected_errors counting after the re-reading looks more reasonable IMHO. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	ddd5115fe5	md/raid5: move rdev->corrected_errors counting Read errors are considered to be corrected if the write-back and re-read cycle finishes without further problems. Thus moving the rdev->corrected_errors counting after the re-reading looks more reasonable IMHO. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	9d3d80113d	md/raid1: move rdev->corrected_errors counting Read errors are considered to be corrected if the write-back and re-read cycle finishes without further problems. Thus moving the rdev->corrected_errors counting after the re-reading looks more reasonable IMHO. Also included a couple of whitespace fixes on sync_page_io(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	65a06f0674	md: get rid of unnecessary casts on page_address() page_address() returns void pointer, so the casts can be removed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	700c721389	md/raid10: Improve decision on whether to fail a device with a read error. Normally we would fail a device with a READ error. However if doing so causes the array to fail, it is better to leave the device in place and just return the read error to the caller. The current test to decide whether the array will fail is overly simplistic. We have a function 'enough' which can tell if the array is failed or not, so use it to guide the decision. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	2bb77736ae	md/raid10: Make use of new recovery_disabled handling When we get a read error during recovery, RAID10 previously arranged for the recovering device to appear to fail so that the recovery stops and doesn't restart. This is misleading and wrong. Instead, make use of the new recovery_disabled handling and mark the target device as having recovery disabled. Add appropriate checks in add_disk and remove_disk so that devices are removed and not re-added when recovery is disabled. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	5389042ffa	md: change managed of recovery_disabled. If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk with a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality wants to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	a478a069b6	md: remove ro check in md_check_recovery() Commit `c89a8eee61` ("Allow faulty devices to be removed from a readonly array.") added some work on ro array in the function, but it couldn't be done since we didn't allow the ro array to be handled from the beginning. Fix it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Namhyung Kim	36fad858a7	md: introduce link/unlink_rdev() helpers There are places where sysfs links to rdev are handled in the same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Christian Dietrich	8bda470e8e	md/raid: use printk_ratelimited instead of printk_ratelimit As per printk_ratelimit comment, it should not be used. Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
Akinobu Mita	a0a02a7ad6	md: use proper little-endian bitops Using __test_and_{set,clear}_bit_le() with ignoring its return value can be replaced with __{set,clear}_bit_le(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: NeilBrown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-27 11:00:36 +10:00
NeilBrown	acfe726bdd	md/raid5: finalise new merged handle_stripe. handle_stripe5() and handle_stripe6() are now virtually identical. So discard one and rename the other to 'analyse_stripe()'. It always returns 0, so change it to 'void' and remove the 'done' variable in handle_stripe(). Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	474af965fe	md/raid5: move some more common code into handle_stripe The RAID6 version of this code is usable for RAID5 providing: - we test "conf->max_degraded" rather than "2" as appropriate - we make sure s->failed_num[1] is meaningful (and not '-1') when s->failed > 1 The 'return 1' must become 'goto finish' in the new location. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	84789554e9	md/raid5: move more common code into handle_stripe Apart from 'prexor' which can only be set for RAID5, and 'qd_idx' which can only be meaningful for RAID6, these two chunks of code are nearly the same. So combine them into one adding a test to call either handle_parity_checks5 or handle_parity_checks6 as appropriate. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	c8ac1803ff	md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6 RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is also allowed 'read-modify-write'. Apart from this difference, handle_stripe_dirtying[56] are nearly identical. So resolve these differences and create just one function. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	93b3dbce64	md/raid5: unite fetch_block5 and fetch_block6 Provided that ->failed_num[1] is not a valid device number (which is easily achieved) fetch_block6 provides all the functionality of fetch_block5. So remove the latter and rename the former to simply "fetch_block". Then handle_stripe_fill5 and handle_stripe_fill6 become the same and can similarly be united. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	5d35e09cae	md/raid5: rearrange a test in fetch_block6. Next patch will unite fetch_block5 and fetch_block6. First I want to make the differences a little more clear. For RAID6 if we are writing at all and there is a failed device, then we need to load or compute every block so we can do a reconstruct-write. This case isn't needed for RAID5 - we will do a read-modify-write in that case. So make that test a separate test in fetch_block6 rather than merged with two other tests. Make a similar change in fetch_block5 so the one bit that is not needed for RAID6 is clearly separate. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	c5a3100062	md/raid5: move more code into common handle_stripe The difference between the RAID5 and RAID6 code here is easily resolved using conf->max_degraded. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	3687c06188	md/raid5: Move code for finishing a reconstruction into handle_stripe. Prior to commit `ab69ae12ce` the code in handle_stripe5 and handle_stripe6 to "Finish reconstruct operations initiated by the expansion process" was identical. That commit added an identical stanza of code to each function, but in different places. That was careless. The raid5 code was correct, so move that out into handle_stripe and remove raid6 version. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
NeilBrown	86c374ba9f	md/raid5: Remove stripe_head_state arg from handle_stripe_expansion. This arg is only used to differentiate between RAID5 and RAID6 but that is not needed. For RAID5, raid5_compute_sector will set qd_idx to "~0" so j will certainly not equal qd_idx, so there is no need for a guard on that condition. So remove the guard and remove the arg from the declaration and callers of handle_stripe_expansion. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-27 11:00:36 +10:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
NeilBrown	cc94015a9e	md/raid5: move stripe_head_state and more code into handle_stripe. By defining the 'stripe_head_state' in 'handle_stripe', we can move some common code out of handle_stripe[56]() and into handle_stripe. This means that all accesses for stripe_head_state in handle_stripe[56] need to be 's->' instead of 's.', but the compiler should inline those functions and just use a direct stack reference, and future patches will hoist most of this code up into handle_stripe() so we will revert to "s.". Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-26 11:35:35 +10:00
NeilBrown	c5709ef6a0	md/raid5: add some more fields to stripe_head_state Adding these three fields will allow more common code to be moved to handle_stripe() struct field rearrangement by Namhyung Kim. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-26 11:35:20 +10:00
NeilBrown	f2b3b44dee	md/raid5: unify stripe_head_state and r6_state 'struct stripe_head_state' stores state about the 'current' stripe that is passed around while handling the stripe. For RAID6 there is an extension structure: r6_state, which is also passed around. There is no value in keeping these separate, so move the fields from the latter into the former. This means that all code now needs to treat s->failed_num as a small array, but this is a small cost. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-26 11:35:19 +10:00
NeilBrown	82e5a1718b	md/raid5: move common code into handle_stripe There is common code at the start of handle_stripe5 and handle_stripe6. Move it into handle_stripe. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-26 11:35:15 +10:00
NeilBrown	c4c1663be4	md/raid5: replace sh->lock with an 'active' flag. sh->lock is now mainly used to ensure that two threads aren't running in the locked part of handle_stripe[56] at the same time. That can more neatly be achieved with an 'active' flag which we set while running handle_stripe. If we find the flag is set, we simply requeue the stripe for later by setting STRIPE_HANDLE. For safety we take ->device_lock while examining the state of the stripe and creating a summary in 'stripe_head_state / r6_state'. This possibly isn't needed but as shared fields like ->toread, ->towrite are checked it is safer for now at least. We leave the label after the old 'unlock' called "unlock" because it will disappear in a few patches, so renaming seems pointless. This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE later, but that is not a problem. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-26 11:34:20 +10:00
NeilBrown	cbe47ec559	md/raid5: Protect some more code with ->device_lock. Other places that change or follow dev->towrite and dev->written take the device_lock as well as the sh->lock. So it should really be held in these places too. Also, doing so will allow sh->lock to be discarded. with merged fixes by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-26 11:20:35 +10:00
NeilBrown	83206d66b6	md/raid5: Remove use of sh->lock in sync_request This is the start of a series of patches to remove sh->lock. sync_request takes sh->lock before setting STRIPE_SYNCING to ensure there is no race with testing it in handle_stripe[56]. Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early in handle_stripe[56] (after getting the same lock) and perform the same set/clear operations if it was set. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>	2011-07-26 11:19:49 +10:00
Linus Torvalds	bbd9d6f7fb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.	2011-07-22 19:02:39 -07:00
Kay Sievers	f15146380d	fs: seq_file - add event counter to simplify poll() support Moving the event counter into the dynamically allocated 'struct seq_file' allows poll() support without the need to allocate its own tracking structure. All current users are switched over to use the new counter. Requested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: NeilBrown <neilb@suse.de> Tested-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:50 -04:00
Lai Jiangshan	b119cbab3a	md,rcu: Convert call_rcu(free_conf) to kfree_rcu() The rcu callback free_conf() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_conf). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: NeilBrown <neilb@suse.de> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-07-20 11:05:29 -07:00
Namhyung Kim	ffd96e35c1	md/raid5: get rid of duplicated call to bio_data_dir() In raid5::make_request(), once bio_data_dir(@bi) is detected it never (and couldn't) be changed. Use the result always. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:51 +10:00
Namhyung Kim	6ce328462c	md/raid5: use kmem_cache_zalloc() Replace kmem_cache_alloc + memset(,0,) to kmem_cache_zalloc. I think it's not harmful since @conf->slab_cache already knows actual size of struct stripe_head. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:50 +10:00
Namhyung Kim	c65060ad42	md/raid10: share pages between read and write bio's during recovery When performing a recovery, only first 2 slots in r10_bio are in use, for read and write respectively. However all of pages in the write bio are never used and just replaced to read bio's when the read completes. Get rid of those unused pages and share read pages properly. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:49 +10:00
Namhyung Kim	778ca01852	md/raid10: factor out common bio handling code When normal-write and sync-read/write bio completes, we should find out the disk number the bio belongs to. Factor those common code out to a separate function. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:47 +10:00
Namhyung Kim	2c4193df37	md/raid10: get rid of duplicated conditional expression Variable 'first' is initialized to zero and updated to @rdev->raid_disk only if it is greater than 0. Thus condition '>= first' always implies '>= 0' so the latter is not needed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-07-18 17:38:43 +10:00
NeilBrown	4274215d24	md: avoid endless recovery loop when waiting for fail device to complete. If a device fails in a way that causes pending request to take a while to complete, md will not be able to immediately remove it from the array in remove_and_add_spares. It will then incorrectly look like a spare device and md will try to recover it even though it is failed. This leads to a recovery process starting and instantly aborting over and over again. We should check if the device is faulty before considering it to be a spare. This will avoid trying to start a recovery that cannot proceed. This bug was introduced in 2.6.26 so that patch is suitable for any kernel since then. Cc: stable@kernel.org Reported-by: Jim Paradis <james.paradis@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-28 16:59:42 +10:00
Namhyung Kim	fcde90759a	md/raid5: remove unusual use of bio_iovec_idx() In the bio_for_each_segment loop, bvl always points current bio_vec, so the same as bio_iovec_idx(, i). Let's get rid of it. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-14 14:23:57 +10:00
Namhyung Kim	b062962edb	md/raid5: fix FUA request handling in ops_run_io() Commit `e9c7469bb4` ("md: implment REQ_FLUSH/FUA support") introduced R5_WantFUA flag and set rw to WRITE_FUA in that case. However remaining code still checks whether rw is exactly same as WRITE or not, so FUAed-write ends up with being treated as READ. Fix it. This bug has been present since 2.6.37 and the fix is suitable for any -stable kernel since then. It is not clear why this has not caused more problems. Cc: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-14 14:20:19 +10:00
Namhyung Kim	9b2dc8b665	md/raid5: fix raid5_set_bi_hw_segments The @bio->bi_phys_segments consists of active stripes count in the lower 16 bits and processed stripes count in the upper 16 bits. So logical-OR operator should be bitwise one. This bug has been present since 2.6.27 and the fix is suitable for any -stable kernel since then. Fortunately the bad code is only used on error paths and is relatively unlikely to be hit. Cc: stable@kernel.org Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-14 14:09:41 +10:00
Namhyung Kim	97b3d4aacf	md/bitmap: remove unused fields from struct bitmap Get rid of ->syncchunk and ->counter_bits since they're never used. Also discard COUNTER_BYTE_RATIO which is unused. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:43:01 +10:00
Namhyung Kim	27d5ea04d0	md/bitmap: use proper accessor macro Use COUNTER()/NEEDED() macro instead of open-coding them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:42:57 +10:00
Namhyung Kim	01393f3d58	md: check ->hot_remove_disk when removing disk Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store() during disk removal. The linear personality only has ->hot_add_disk and no ->hot_remove_disk, so that removing disk in the array resulted to following kernel bug: $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3] $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) PGD c9f5d067 PUD 8575a067 PMD 0 Oops: 0010 [#1] SMP CPU 2 Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO RIP: 0010:[<0000000000000000>] [< (null)>] (null) RSP: 0018:ffff880085757df0 EFLAGS: 00010282 RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000 RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000 FS: 00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000) Stack: ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000 ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90 ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98 Call Trace: [<ffffffff8138496a>] ? slot_store+0xaa/0x265 [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8 [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144 [<ffffffff81106b87>] vfs_write+0xb1/0x10d [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135 [<ffffffff81106cac>] sys_write+0x4d/0x77 [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [< (null)>] (null) RSP <ffff880085757df0> CR2: 0000000000000000 ---[ end trace ba5fc64319a826fb ]--- Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:42:54 +10:00
马建朋	9864c0053d	md: Using poll /proc/mdstat can monitor the events of adding a spare disks Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:42:48 +10:00
Jonathan Brassow	d744540cd3	MD: use is_power_of_2 macro Make use of is_power_of_2 macro. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:42:36 +10:00
Jonathan Brassow	d6b212f4b1	MD: raid5 do not set fullsync Add check to determine if a device needs full resync or if partial resync will do RAID 5 was assuming that if a device was not In_sync, it must undergo a full resync. We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'. If it is, we can safely skip the full resync and rely on the bitmap for partial recovery instead. This is the legitimate purpose of 'saved_raid_disk', from md.h: int saved_raid_disk; /* role that device used to have in the * array and could again if we did a partial * resync from the bitmap */ Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:42:29 +10:00
Jonathan Brassow	9c81075f43	MD: support initial bitmap creation in-kernel Add bitmap support to the device-mapper specific metadata area. This patch allows the creation of the bitmap metadata area upon initial array creation via device-mapper. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-09 11:41:36 +10:00
Jonathan Brassow	076f968b37	MD: add sync_super to mddev_t struct Add the 'sync_super' function pointer to MD array structure (struct mddev_s) If device-mapper (dm-raid.c) is to define its own on-disk superblock and be able to load it, there must still be a way for MD to initiate superblock updates. The simplest way to make this happen is to provide a pointer in the MD array structure that can be set by device-mapper (or other module) with a function to do this. If the function has been set, it will be used; otherwise, the method will be looked up via 'super_types' as usual. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:11:31 +10:00
Jonathan Brassow	1ed7242e59	MD: raid1 changes to allow use by device mapper MD RAID1: Changes to allow RAID1 to be used by device-mapper (dm-raid.c) Added the necessary congestion function and conditionalize calls requiring an array 'queue' or 'gendisk'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:11:31 +10:00
Jonathan Brassow	0fd018af37	MD: move thread wakeups into resume Move personality and sync/recovery thread starting outside md_run. Moving the wakeups of the personality and sync/recovery threads out of md_run and into do_md_run and mddev_resume solves two issues: 1) It allows bitmap_load to be called before the sync_thread is run and 2) when MD personalities are used by device-mapper (dm-raid.c), the start-up of the array is better aligned with device-mapper primitives (CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should not happen until after a resume has taken place. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:11:31 +10:00
Jonathan Brassow	ac42450c7c	MD: possible typo Make message a bit clearer by s/blocks/k/ I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the message. 'k' may be a bit ambiguous, but I think it's better than "blocks" which normally means 512, but means 1024 in MD. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:11:31 +10:00
Jonathan Brassow	68866e425b	MD: no sync IO while suspended Disallow resync I/O while the RAID array is suspended. Recovery, resync, and metadata I/O should not be allowed while a device is suspended. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:10:08 +10:00
Jonathan Brassow	629acb6aba	MD: no integrity register if no gendisk Don't attempt md_integrity_register if there is no gendisk struct available. When MD arrays are built via device-mapper, the gendisk structure is not available via mddev. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-06-08 15:10:08 +10:00
Mikulas Patocka	fa34ce7307	dm kcopyd: return client directly and not through a pointer Return client directly from dm_kcopyd_client_create, not through a parameter, making it consistent with dm_io_client_create. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:03:13 +01:00
Mikulas Patocka	5f43ba2950	dm kcopyd: reserve fewer pages Reserve just the minimum of pages needed to process one job. Because we allocate pages from page allocator, we don't need to reserve a large number of pages. The maximum job size is SUB_JOB_SIZE and we calculate the number of reserved pages based on this. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:03:11 +01:00
Mikulas Patocka	bda8efec5c	dm io: use fixed initial mempool size Replace the arbitrary calculation of an initial io struct mempool size with a constant. The code calculated the number of reserved structures based on the request size and used a "magic" multiplication constant of 4. This patch changes it to reserve a fixed number - itself still chosen quite arbitrarily. Further testing might show if there is a better number to choose. Note that if there is no memory pressure, we can still allocate an arbitrary number of "struct io" structures. One structure is enough to process the whole request. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:03:09 +01:00
Mikulas Patocka	d04714580f	dm kcopyd: alloc pages from the main page allocator This patch changes dm-kcopyd so that it allocates pages from the main page allocator with __GFP_NOWARN | __GFP_NORETRY flags (so that it can fail in case of memory pressure). If the allocation fails, dm-kcopyd allocates pages from its own reserve. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:03:07 +01:00
Mikulas Patocka	f99b55eec7	dm kcopyd: add gfp parm to alloc_pl Introduce a parameter for gfp flags to alloc_pl() for use in following patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:03:04 +01:00
Mikulas Patocka	4cc1b4cffd	dm kcopyd: remove superfluous page allocation spinlock Remove the spinlock protecting the pages allocation. The spinlock is only taken on initialization or from single-threaded workqueue. Therefore, the spinlock is useless. The spinlock is taken in kcopyd_get_pages and kcopyd_put_pages. kcopyd_get_pages is only called from run_pages_job, which is only called from process_jobs called from do_work. kcopyd_put_pages is called from client_alloc_pages (which is initialization function) or from run_complete_job. run_complete_job is only called from process_jobs called from do_work. Another spinlock, kc->job_lock is taken each time someone pushes or pops some work for the worker thread. Once we take kc->job_lock, we guarantee that any written memory is visible to the other CPUs. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:03:02 +01:00
Mikulas Patocka	c6ea41fbbe	dm kcopyd: preallocate sub jobs to avoid deadlock There's a possible theoretical deadlock in dm-kcopyd because multiple allocations from the same mempool are required to finish a request. Avoid this by preallocating sub jobs. There is a mempool of 512 entries. Each request requires up to 9 entries from the mempool. If we have at least 57 concurrent requests running, the mempool may overflow and mempool allocations may start blocking until another entry is freed to the mempool. Because the same thread is used to free entries to the mempool and allocate entries from the mempool, this may result in a deadlock. This patch changes it so that one mempool entry contains all 9 "struct kcopyd_job" required to fulfill the whole request. The allocation is done only once in dm_kcopyd_copy and no further mempool allocations are done during request processing. If dm_kcopyd_copy is not run in the completion thread, this implementation is deadlock-free. MIN_JOBS needs reducing accordingly and we've chosen to reduce it further to 8. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:03:00 +01:00
Mikulas Patocka	a705a34a56	dm kcopyd: avoid pointless job splitting Don't split SUB_JOB_SIZE jobs If the job size equals SUB_JOB_SIZE, there is no point in splitting it. Splitting it just unnecessarily wastes time, because the split job size is SUB_JOB_SIZE too. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:02:58 +01:00
Martin K. Petersen	6f13f6fba7	dm mpath: do not fail paths after integrity errors Integrity errors need to be passed to the owner of the integrity metadata for processing. Consequently EILSEQ should be passed up the stack. Cc: stable@kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:02:55 +01:00
Milan Broz	f4808ca99a	dm table: reject devices without request fns This patch adds a check that a block device has a request function defined before it is used. Otherwise, misconfiguration can cause an oops. Because we are allowing devices with zero size e.g. an offline multipath device as in commit `2cd54d9bed` ("dm: allow offline devices") there needs to be an additional check to ensure devices are initialised. Some block devices, like a loop device without a backing file, exist but have no request function. Reproducer is trivial: dm-mirror on unbound loop device (no backing file on loop devices) dmsetup create x --table "0 8 mirror core 2 8 sync 2 /dev/loop0 0 /dev/loop1 0" and mirror resync will immediately cause an oops. BUG: unable to handle kernel NULL pointer dereference at (null) ? generic_make_request+0x2bd/0x590 ? kmem_cache_alloc+0xad/0x190 submit_bio+0x53/0xe0 ? bio_add_page+0x3b/0x50 dispatch_io+0x1ca/0x210 [dm_mod] ? read_callback+0x0/0xd0 [dm_mirror] dm_io+0xbb/0x290 [dm_mod] do_mirror+0x1e0/0x748 [dm_mirror] Signed-off-by: Milan Broz <mbroz@redhat.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 13:02:52 +01:00
Mike Snitzer	4c25932701	dm table: allow targets to support discards internally Permit a target to support discards regardless of whether or not all its underlying devices do. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-05-29 12:52:55 +01:00
Linus Torvalds	57d19e80f4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) b43: fix comment typo reqest -> request Haavard Skinnemoen has left Atmel cris: typo in mach-fs Makefile Kconfig: fix copy/paste-ism for dell-wmi-aio driver doc: timers-howto: fix a typo ("unsgined") perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). treewide: fix a few typos in comments regulator: change debug statement be consistent with the style of the rest Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations" audit: acquire creds selectively to reduce atomic op overhead rtlwifi: don't touch with treewide double semicolon removal treewide: cleanup continuations and remove logging message whitespace ath9k_hw: don't touch with treewide double semicolon removal include/linux/leds-regulator.h: fix syntax in example code tty: fix typo in descripton of tty_termios_encode_baud_rate xtensa: remove obsolete BKL kernel option from defconfig m68k: fix comment typo 'occcured' arch:Kconfig.locks Remove unused config option. treewide: remove extra semicolons ...	2011-05-23 09:12:26 -07:00
NeilBrown	b098636cf0	md: allow resync_start to be set while an array is active. The sysfs attribute 'resync_start' (known internally as recovery_cp), records where a resync is up to. A value of 0 means the array is not known to be in-sync at all. A value of MaxSector means the array is believed to be fully in-sync. When the size of member devices of an array (RAID1,RAID4/5/6) is increased, the array can be increased to match. This process sets resync_start to the old end-of-device offset so that the new part of the array gets resynced. However with RAID1 (and RAID6) a resync is not technically necessary and may be undesirable. So it would be good if the implied resync after the array is resized could be avoided. So: change 'resync_start' so the value can be changed while the array is active, and as a precaution only allow it to be changed while resync/recovery is 'frozen'. Changing it once resync has started is not going to be useful anyway. This allows the array to be resized without a resync by: write 'frozen' to 'sync_action' write new size to 'component_size' (this will set resync_start) write 'none' to 'resync_start' write 'idle' to 'sync_action'. Also slightly improve some tests on recovery_cp when resizing raid1/raid5. Now that an arbitrary value could be set we should be more careful in our tests. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 15:52:21 +10:00
NeilBrown	ab9d47e990	md/raid10: reformat some loops with less indenting. When a loop ends with an 'if' with a large body, it is neater to make the if 'continue' on the inverse condition, and then the body is indented less. Apply this pattern 3 times, and wrap some other long lines. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:54:41 +10:00
NeilBrown	f17ed07c85	md/raid10: remove unused variable. This variable 'disk' is never used - how odd. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:54:32 +10:00
NeilBrown	a8830bcaf3	md/raid10: make more use of 'slot' in raid10d. Now that we have a 'slot' variable, make better use of it to simplify some code a little. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:54:19 +10:00
NeilBrown	7c4e06ff2b	md/raid10: some tidying up in fix_read_error Currently the rdev on which a read error happened could be removed before we perform the fix_error handling. This requires extra tests for NULL. So delay the rdev_dec_pending call until after the call to fix_read_error so that we can be sure that the rdev still exists. This allows an 'if' clause to be removed so the body gets re-indented back one level. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:53:17 +10:00
NeilBrown	af6d7b760c	md/raid1: improve handling of pages allocated for write-behind. The current handling and freeing of these pages is a bit fragile. We only keep the list of allocated pages in each bio, so we need to still have a valid bio when freeing the pages, which is a bit clumsy. So simply store the allocated page list in the r1_bio so it can easily be found and freed when we are finished with the r1_bio. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:51:19 +10:00
NeilBrown	7ca78d57d1	md/raid1: try fix_sync_read_error before process_checks. If we get a read error during resync/recovery we currently repeat with single-page reads to find out just where the error is, and possibly read each page from a different device. With check/repair we don't currently do that, we just fail. However it is possible that while all devices fail on the large 64K read, we might be able to satisfy each 4K from one device or another. So call fix_sync_read_error before process_checks to maximise the chance of finding good data and writing it out to the devices with read errors. For this to work, we need to set the 'uptodate' flags properly after fix_sync_read_error has succeeded. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:50:37 +10:00
NeilBrown	78d7f5f726	md/raid1: tidy up new functions: process_checks and fix_sync_read_error. These changes are mostly cosmetic: 1/ change mddev->raid_disks to conf->raid_disks because the latter is technically safer, though in current practice it doesn't matter in this particular context. 2/ Rearrange two for / if loops to have an early 'continue' so the body of the 'if' doesn't need to be indented so much. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:48:56 +10:00
NeilBrown	a68e587035	md/raid1: split out two sub-functions from sync_request_write sync_request_write is too big and too deep. So split out two self-contained bits of functionality into separate functions. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:40:44 +10:00
NeilBrown	6f8d0c77ce	md: make error_handler functions more uniform and correct. - there is no need to test_bit Faulty, as that was already done in md_error which is the only caller of these functions. - MD_CHANGE_DEVS should be set after faulty is set to ensure metadata is updated correctly. - spinlock should be held while updating ->degraded. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:38:44 +10:00
NeilBrown	92f861a72a	md/multipath: discard ->working_disks in favour of ->degraded conf->working_disks duplicates information already available in mddev->degraded. So remove working_disks. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:38:02 +10:00
NeilBrown	76073054c9	md/raid1: clean up read_balance. read_balance has two loops which both look for a 'best' device based on slightly different criteria. This is clumsy and makes it hard to add extra criteria. So replace it all with a single loop that combines everything. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:34:56 +10:00
NeilBrown	56d9912106	md: simplify raid10 read_balance raid10 read balance has two different loops for looking through possible devices to choose the best. Collapse those into one loop and generally make the code more readable. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:27:03 +10:00
NeilBrown	8258c53208	md/bitmap: fix saving of events_cleared and other state. If a bitmap is found to be 'stale' the events_cleared value is set to match 'events'. However if the array is degraded this does not get stored on disk. This can subsequently lead to incorrect behaviour. So change bitmap_update_sb to always update events_cleared in the superblock from the known events_cleared. For neatness also set ->state from ->flags. This requires updating ->state whenever we update ->flags, which makes sense anyway. This is suitable for any active -stable release. cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:26:30 +10:00
NeilBrown	bedd86b777	md: reject a re-add request that cannot be honoured. The 'add_new_disk' ioctl can be used to add a device either as a spare, or as an active disk that just needs to be resynced based on write-intent-bitmap information (re-add) Currently if a re-add is requested but fails we add as a spare instead. This makes it impossible for user-space to check for failure. So change to require that a re-add attempt will either succeed or completely fail. User-space can then decide what to do next. Signed-off-by: NeilBrown <neilb@suse.de>	2011-05-11 14:26:20 +10:00
NeilBrown	b0140891a8	md: Fix race when creating a new md device. There is a race when creating an md device by opening /dev/mdXX. If two processes do this at much the same time they will follow the call path __blkdev_get -> get_gendisk -> kobj_lookup The first will call -> md_probe -> md_alloc -> add_disk -> blk_register_region and the race happens when the second gets to kobj_lookup after add_disk has called blk_register_region but before it returns to md_alloc. In that case the second will not call md_probe (as the probe is already done) but will get a handle on the gendisk, return to __blkdev_get which will then call md_open (via the ->open pointer). As mddev->gendisk hasn't been set yet, md_open will think something is wrong and return with ERESTARTSYS. This can loop endlessly while the first thread makes no progress through add_disk. Nothing is blocking it, but due to scheduler behaviour it doesn't get a turn. So this is essentially a live-lock. We fix this by simply moving the assignment to mddev->gendisk before the call to add_disk() so md_open doesn't get confused. Also move blk_queue_flush earlier because add_disk should be as late as possible. To make sure that md_open doesn't complete until md_alloc has done all that is needed, we take mddev->open_mutex during the last part of md_alloc. md_open will wait for this. This can cause a lock-up on boot so Cc:ing for stable. For 2.6.36 and earlier a different patch will be needed as the 'blk_queue_flush' call isn't there. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Cc: stable@kernel.org	2011-05-11 14:26:17 +10:00
Jesper Juhl	aeb878b096	md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). There's a small typo in a comment in drivers/md/raid5.c - 'Of course' is misspelled as 'Ofcourse'. This patch fixes the spelling error. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-05-10 10:18:36 +02:00
Randy Dunlap	d76c8420c3	raid5: fix build error, sector_t usage Change <sectors> from unsigned long long to sector_t. This matches its source field. ERROR: "__udivdi3" [drivers/md/raid456.ko] undefined! Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-04-21 10:00:00 -07:00
Krzysztof Wojcik	fee68723cf	md: Cleanup after raid45->raid0 takeover Problem: After a raid4->raid0 takeover operation, another takeover operation (e.g. raid0->raid10) results in a kernel oops. Root cause: The variable 'degraded' in the mddev structure is not cleared on raid45->raid0 takeover. This patch resets this variable. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-20 15:39:53 +10:00
NeilBrown	3b71bd9337	md: Fix dev_sectors on takeover from raid0 to raid4/5 A raid0 array doesn't set 'dev_sectors' as each device might contribute a different number of sectors. So when converting to a RAID4 or RAID5 we need to set dev_sectors as they need the number. We have already verified that in fact all devices do contribute the same number of sectors, so use that number. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-20 15:38:18 +10:00
NeilBrown	2b7da309ff	md/raid5: remove setting of ->queue_lock We previously needed to set ->queue_lock to match the raid5 device_lock so we could safely use queue_flag_* operations (e.g. for plugging), which test that the ->queue_lock is in fact locked. However that need has completely gone away and is unlikely to come back, so remove this now-pointless setting. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-20 15:38:07 +10:00
NeilBrown	c3b328ac84	md: fix up raid1/raid10 unplugging. We just need to make sure that an unplug event wakes up the md thread, which is exactly what mddev_check_plugged does. Also remove some plug-related code that is no longer needed. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:43 +10:00
NeilBrown	7c13edc875	md: incorporate new plugging into raid5. In raid5 plugging is used for 2 things: 1/ collecting writes that require a bitmap update 2/ collecting writes in the hope that we can create full stripes - or at least more-full. We now release these different sets of stripes when plug_cnt is zero. Also in make_request, we call mddev_check_plugged to hopefully increase plug_cnt, and wake up the thread at the end if plugging wasn't achieved for some reason. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:43 +10:00
NeilBrown	97658cdd3a	md: provide generic support for handling unplug callbacks. When an md device adds a request to a queue, it can call mddev_check_plugged. If this succeeds then we know that the md thread will be woken up shortly, and ->plug_cnt will be non-zero until then, so some processing can be delayed. If it fails, then no unplug callback is expected and the make_request function needs to do whatever is required to make the request happen. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:42 +10:00
NeilBrown	482c083492	md - remove old plugging code. md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches will restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:42 +10:00
NeilBrown	af1db72d8b	md/dm - remove remains of plug_fn callback. Now that unplugging is done differently, the unplug_fn callback is never called, so it can be completely discarded. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:41 +10:00
NeilBrown	e1dfa0a297	md: use new plugging interface for RAID IO. md/raid submits a lot of IO from the various raid threads. So adding start/finish plug calls to those so that some plugging happens. Signed-off-by: NeilBrown <neilb@suse.de>	2011-04-18 18:25:41 +10:00
Linus Torvalds	42933bac11	Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings	2011-04-07 11:14:49 -07:00
Mike Snitzer	a63a5cf84d	dm: improve block integrity support The current block integrity (DIF/DIX) support in DM is verifying that all devices' integrity profiles match during DM device resume (which is past the point of no return). To some degree that is unavoidable (stacked DM devices force this late checking). But for most DM devices (which aren't stacking on other DM devices) the ideal time to verify all integrity profiles match is during table load. Introduce the notion of an "initialized" integrity profile: a profile that was blk_integrity_register()'d with a non-NULL 'blk_integrity' template. Add blk_integrity_is_initialized() to allow checking if a profile was initialized. Update DM integrity support to: - check all devices with _initialized_ integrity profiles match during table load; uninitialized profiles (e.g. for underlying DM device(s) of a stacked DM device) are ignored. - disallow a table load that would result in an integrity profile that conflicts with a DM device's existing (in-use) integrity profile - avoid clearing an existing integrity profile - validate all integrity profiles match during resume; but if they don't all we can do is report the mismatch (during resume we're past the point of no return) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-05 23:52:43 +02:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Martin K. Petersen	89078d572e	md: Fix integrity registration error when no devices are capable We incorrectly returned -EINVAL when none of the devices in the array had an integrity profile. This in turn prevented mdadm from starting the metadevice. Fix this so we only return errors on mismatched profiles and memory allocation failures. Reported-by: Giacomo Catenazzi <cate@cateee.net> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-28 17:53:29 -07:00
Linus Torvalds	44bbd7ac26	Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm stripe: implement merge method dm mpath: allow table load with no priority groups dm mpath: fail message ioctl if specified path is not valid dm ioctl: add flag to wipe buffers for secure data dm ioctl: prepare for crypt key wiping dm crypt: wipe keys string immediately after key is set dm: add flakey target dm: fix opening log and cow devices for read only tables	2011-03-25 20:51:44 -07:00
Linus Torvalds	6c51038900	Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}	2011-03-24 10:16:26 -07:00
Mustafa Mesanovic	2991520200	dm stripe: implement merge method Implement a merge function in the striped target. When the striped target's underlying devices provide a merge_bvec_fn (like all DM devices do via dm_merge_bvec) it is important to call down to them when building a biovec that doesn't span a stripe boundary. Without the merge method, a striped DM device stacked on DM devices causes bios with a single page to be submitted which results in unnecessary overhead that hurts performance. This change really helps filesystems (e.g. XFS and now ext4) which take care to assemble larger bios. By implementing stripe_merge(), DM and the stripe target no longer undermine the filesystem's work by only allowing a single page per bio. Buffered IO sees the biggest improvement (particularly uncached reads, buffered writes to a lesser degree). This is especially so for more capable "enterprise" storage LUNs. The performance improvement has been measured to be ~12-35% -- when a reasonable chunk_size is used (e.g. 64K) in conjunction with a stripe count that is a power of 2. In contrast, the performance penalty is ~5-7% for the pathological worst case stripe configuration (small chunk_size with a stripe count that is not a power of 2). The reason for this is that stripe_map_sector() is now called once for every call to dm_merge_bvec(). stripe_map_sector() will use slower division if stripe count isn't a power of 2. Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:54:35 +00:00
Mike Snitzer	a490a07a67	dm mpath: allow table load with no priority groups This patch adjusts the multipath target to allow a table with both 0 priority groups and 0 for the initial priority group number. If any mpath device is held open when all paths in the last priority group have failed, userspace multipathd will attempt to reload the associated DM table to reflect the fact that the device no longer has any priority groups. But the reload attempt always failed because the multipath target did not allow 0 priority groups. All multipath target messages related to priority group (enable_group, disable_group, switch_group) will handle a priority group of 0 (will cause error). When reloading a multipath table with 0 priority groups, userspace multipathd must be updated to specify an initial priority group number of 0 (rather than 1). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Babu Moger <babu.moger@lsi.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:54:33 +00:00
Mike Snitzer	19040c0bc8	dm mpath: fail message ioctl if specified path is not valid Fail the reinstate_path and fail_path message ioctl if the specified path is not valid. The message ioctl would succeed for the 'reinstate_path' and 'fail_path' messages even if action was not taken because the specified device was not a valid path of the multipath device. Before, when /dev/vdb is not a path of mpathb: $ dmsetup message mpathb 0 reinstate_path /dev/vdb $ echo $? 0 After: $ dmsetup message mpathb 0 reinstate_path /dev/vdb device-mapper: message ioctl failed: Invalid argument Command failed $ echo $? 1 Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:54:31 +00:00
Milan Broz	f868120549	dm ioctl: add flag to wipe buffers for secure data Add DM_SECURE_DATA_FLAG which userspace can use to ensure that all buffers allocated for dm-ioctl are wiped immediately after use. The user buffer is wiped as well (we do not want to keep and return sensitive data back to userspace if the flag is set). Wiping is useful for cryptsetup to ensure that the key is present in memory only in defined places and only for the time needed. (For crypt, key can be present in table during load or table status, wait and message commands). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:54:30 +00:00
Milan Broz	6bb43b5d1f	dm ioctl: prepare for crypt key wiping Prepare code for implementing buffer wipe flag. No functional change in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:54:28 +00:00
Milan Broz	de8be5ac70	dm crypt: wipe keys string immediately after key is set Always wipe the original copy of the key after processing it in crypt_set_key(). Signed-off-by: Milan Broz <mbroz@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:54:27 +00:00
Josef Bacik	3407ef5262	dm: add flakey target This target is the same as the linear target except that it returns I/O errors periodically. It's been found useful in simulating failing devices for testing purposes. I needed a dm target to do some failure testing on btrfs's raid code, and Mike pointed me at this. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:54:24 +00:00
Milan Broz	024d37e95e	dm: fix opening log and cow devices for read only tables If a table is read-only, also open any log and cow devices it uses read-only. Previously, even read-only devices were opened read-write internally. After patch `75f1dc0d07` block: check bdev_read_only() from blkdev_get() was applied, loading such tables began to fail. The patch was reverted by `e51900f7d3` block: revert block_dev read-only check but this patch fixes this part of the code to work with the original patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-03-24 13:52:14 +00:00
Akinobu Mita	bb5cda3d70	dm: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:20 -07:00
Akinobu Mita	6b33aff368	md: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:20 -07:00
Shaohua Li	1e9bb8808a	block: fix non-atomic access to genhd inflight structures After the stack plugging introduction, these are called lockless. Ensure that the counters are updated atomically. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-22 08:35:35 +01:00
Linus Torvalds	c55d267de2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (170 commits) [SCSI] scsi_dh_rdac: Add MD36xxf into device list [SCSI] scsi_debug: add consecutive medium errors [SCSI] libsas: fix ata list corruption issue [SCSI] hpsa: export resettable host attribute [SCSI] hpsa: move device attributes to avoid forward declarations [SCSI] scsi_debug: Logical Block Provisioning (SBC3r26) [SCSI] sd: Logical Block Provisioning update [SCSI] Include protection operation in SCSI command trace [SCSI] hpsa: fix incorrect PCI IDs and add two new ones (2nd try) [SCSI] target: Fix volume size misreporting for volumes > 2TB [SCSI] bnx2fc: Broadcom FCoE offload driver [SCSI] fcoe: fix broken fcoe interface reset [SCSI] fcoe: precedence bug in fcoe_filter_frames() [SCSI] libfcoe: Remove stale fcoe-netdev entries [SCSI] libfcoe: Move FCOE_MTU definition from fcoe.h to libfcoe.h [SCSI] libfc: introduce __fc_fill_fc_hdr that accepts fc_hdr as an argument [SCSI] fcoe, libfc: initialize EM anchors list and then update npiv EMs [SCSI] Revert "[SCSI] libfc: fix exchange being deleted when the abort itself is timed out" [SCSI] libfc: Fixing a memory leak when destroying an interface [SCSI] megaraid_sas: Version and Changelog update ... Fix up trivial conflicts due to whitespace differences in drivers/scsi/libsas/{sas_ata.c,sas_scsi_host.c}	2011-03-17 17:54:40 -07:00
Martin K. Petersen	a91a2785b2	block: Require subsystems to explicitly allocate bio_set integrity mempool MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 11:11:05 +01:00
Linus Torvalds	7a6362800c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits) bonding: enable netpoll without checking link status xfrm: Refcount destination entry on xfrm_lookup net: introduce rx_handler results and logic around that bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag bonding: wrap slave state work net: get rid of multiple bond-related netdevice->priv_flags bonding: register slave pointer for rx_handler be2net: Bump up the version number be2net: Copyright notice change. Update to Emulex instead of ServerEngines e1000e: fix kconfig for crc32 dependency netfilter ebtables: fix xt_AUDIT to work with ebtables xen network backend driver bonding: Improve syslog message at device creation time bonding: Call netif_carrier_off after register_netdevice bonding: Incorrect TX queue offset net_sched: fix ip_tos2prio xfrm: fix __xfrm_route_forward() be2net: Fix UDP packet detected status in RX compl Phonet: fix aligned-mode pipe socket buffer header reserve netxen: support for GbE port settings ... Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c with the staging updates.	2011-03-16 16:29:25 -07:00
Linus Torvalds	bd2895eead	Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER	2011-03-16 08:20:19 -07:00
Jens Axboe	4c63f5646e	Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/core Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:58:35 +01:00
Jens Axboe	721a9602e6	block: kill off REQ_UNPLUG With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:27 +01:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00
David S. Miller	0a0e9ae1bd	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x.h	2011-03-03 21:27:42 -08:00
Patrick McHardy	01a16b21d6	netlink: kill eff_cap from struct netlink_skb_parms Netlink message processing in the kernel is synchronous these days, capabilities can be checked directly in security_netlink_recv() from the current process. Signed-off-by: Patrick McHardy <kaber@trash.net> Reviewed-by: James Morris <jmorris@namei.org> [chrisw: update to include pohmelfs and uvesafb] Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-03 13:32:07 -08:00
NeilBrown	f0b4f7e2f2	md: Fix - again - partition detection when array becomes active Revert `b821eaa572` and `f3b99be19d` When I wrote the first of these I had a wrong idea about the lifetime of 'struct block_device'. It can disappear at any time that the block device is not open if it falls out of the inode cache. So relying on the 'size' recorded with it to detect when the device size has changed and so we need to revalidate, is wrong. Rather, we really do need the 'changed' attribute stored directly in the mddev and set/tested as appropriate. Without this patch, a sequence of: mknod / open / close / unlink (which can cause a block_device to be created and then destroyed) will result in a rescan of the partition table and consequent removal and addition of partitions. Several of these in a row can get udev racing to create and unlink and other code can get confused. With the patch, the rescan is only performed when needed and so there are no races. This is suitable for any stable kernel from 2.6.35. Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2011-02-24 17:26:41 +11:00
Tejun Heo	43d133c18b	Merge branch 'master' into for-2.6.39	2011-02-21 09:43:56 +01:00
NeilBrown	da9cf5050a	md: avoid spinlock problem in blk_throtl_exit blk_throtl_exit assumes that ->queue_lock still exists, so make sure that it does. To do this, we stop redirecting ->queue_lock to conf->device_lock and leave it pointing where it is initialised - __queue_lock. As the blk_plug functions check the ->queue_lock is held, we now take that spin_lock explicitly around the plug functions. We don't need the locking, just the warning removal. This is needed for any kernel with the blk_throtl code, which is 2.6.37 and later. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-21 18:25:57 +11:00
NeilBrown	8f5f02c460	md: correctly handle probe of an 'mdp' device. 'mdp' devices are md devices with preallocated device numbers for partitions. As such it is possible to mknod and open a partition before opening the whole device. this causes md_probe() to be called with a device number of a partition, which in-turn calls mddev_find with such a number. However mddev_find expects the number of a 'whole device' and does the wrong thing with partition numbers. So add code to mddev_find to remove the 'partition' part of a device number and just work with the 'whole device'. This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652 Reported-by: hkmaly@bigfoot.com Signed-off-by: NeilBrown <neilb@suse.de> Cc: <stable@kernel.org>	2011-02-16 13:58:51 +11:00
NeilBrown	cbe6ef1d26	md: don't set_capacity before array is active. If the desired size of an array is set (via sysfs) before the array is active (which is the normal sequence), we currently call set_capacity immediately. This means that a subsequent 'open' (as can be caused by some udev-triggers program) will notice the new size and try to probe for partitions. However as the array isn't quite ready yet the read will fail. Then when the array is read, as the size doesn't change again we don't try to re-probe. So when setting array size via sysfs, only call set_capacity if the array is already active. Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-16 13:58:38 +11:00
Krzysztof Wojcik	f7bee80945	md: Fix raid1->raid0 takeover Takeover raid1->raid0 did not succeed. Kernel message is shown: "md/raid0:md126: too few disks (1 of 2) - aborting!" Problem was that we weren't updating ->raid_disks for that takeover, unlike all the others. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-14 10:01:41 +11:00
Hannes Reinecke	751b2a7d62	[SCSI] dm mpath: propagate target errors immediately DM now has more information about the nature of the underlying storage failure. Path failure is avoided if a request failed due to a target error. Instead the target error is immediately passed up the stack. Discard requests that fail due to non-target errors may now be retried. Errors restricted to the path will be retried or returned if no paths are available, regardless of the no_path_retry setting. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2011-02-12 10:33:29 -06:00
Krzysztof Wojcik	02214dc546	FIX: md: process hangs at wait_barrier after 0->10 takeover Following symptoms were observed: 1. After raid0->raid10 takeover operation we have array with 2 missing disks. When we add disk for rebuild, recovery process starts as expected but it does not finish- it stops at about 90%, md126_resync process hangs in "D" state. 2. Similar behavior is when we have mounted raid0 array and we execute takeover to raid10. After this when we try to unmount array- it causes process umount hangs in "D" In scenarios above processes hang at the same function- wait_barrier in raid10.c. Process waits in macro "wait_event_lock_irq" until the "!conf->barrier" condition will be true. In scenarios above it never happens. Reason was that at the end of level_store, after calling pers->run, we call mddev_resume. This calls pers->quiesce(mddev, 0) with RAID10, that calls lower_barrier. However raise_barrier hadn't been called on that 'conf' yet, so conf->barrier becomes negative, which is bad. This patch introduces setting conf->barrier=1 after takeover operation. It prevents to become barrier negative after call lower_barrier(). Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-08 11:49:02 +11:00
Chris Mason	e91ece5590	md_make_request: don't touch the bio after calling make_request md_make_request was calling bio_sectors() for part_stat_add after it was calling the make_request function. This is bad because the make_request function can free the bio and because the bi_size field can change around. The fix here was suggested by Jens Axboe. It saves the sector count before the make_request call. I hit this with CONFIG_DEBUG_PAGEALLOC turned on while trying to break his pretty fusionio card. Cc: <stable@kernel.org> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-08 09:53:28 +11:00
NeilBrown	c6751b2bde	md: Don't allow slot_store while resync/recovery is happening. Activating a spare in an array while resync/recovery is already happening can lead to that spare being marked in-sync when it isn't really. So don't allow the 'slot' to be set (thus activating the device) while resync/recovery is happening. Signed-off-by: NeilBrown <neilb@suse.de>	2011-02-02 11:57:13 +11:00
NeilBrown	7281f8129c	md: don't clear curr_resync_completed at end of resync. There is no need to set this to zero at this point. It will be set to zero by remove_and_add_spares or at the start of md_do_sync at the latest. And setting it to zero before MD_RECOVERY_RUNNING is cleared can make a 'zero' appear briefly in the 'sync_completed' sysfs attribute just as resync is finishing. So simply remove this setting to zero. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 14:30:27 +11:00
NeilBrown	a8c42c7f47	md: Don't use remove_and_add_spares to remove failed devices from a read-only array remove_and_add_spares is called in two places where the needs really are very different. remove_and_add_spares should not be called on an array which is about to be reshaped as some extra devices might have been manually added and that would remove them. However if the array is 'read-auto', that will currently happen, which is bad. So in the 'ro != 0' case don't call remove_and_add_spares but simply remove the failed devices as the comment suggests is needed. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 13:47:13 +11:00
Krzysztof Wojcik	fc3a08b85b	Add raid1->raid0 takeover support This patch introduces raid 1 to raid0 takeover operation in kernel space. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: Neil Brown <neilb@nbeee.brown>	2011-01-31 13:47:13 +11:00
NeilBrown	f21e9ff7f7	md: Remove the AllReserved flag for component devices. This flag is not needed and is used badly. Devices that are included in a native-metadata array are reserved exclusively for that array - and currently have AllReserved set. They all are bd_claimed for the rdev and so cannot be shared. Devices that are included in external-metadata arrays can be shared among multiple arrays - providing there is no overlap. These are bd_claimed for md in general - not for a particular rdev. When changing the amount of a device that is used in an array we need to check for overlap. This currently includes a check on AllReserved So even without overlap, sharing with an AllReserved device is not allowed. However the bd_claim usage already precludes sharing with these devices, so the test on AllReserved is not needed. And in fact it is wrong. As this is the only use of AllReserved, simply remove all usage and definition of AllReserved. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 12:10:09 +11:00
NeilBrown	50da084096	md: don't abort checking spares as soon as one cannot be added. As spares can be added manually before a reshape starts, we need to find them all to mark some of them as in_sync. Previously we would abort looking for spares when we found an unallocated spare that could not be added to the array (implying there was no room for new spares). However already-added spares could be later in the list, so we need to keep searching. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 11:57:43 +11:00
NeilBrown	469518a345	md: fix the test for finding spares in raid5_start_reshape. As spares can be added to the array before the reshape is started, we need to find and count them when checking there are enough. The array could have been degraded, so we need to check all devices, not just those outside of the range of devices in the array before the reshape. So instead of checking the index, check the In_sync flag as that reliably tells if the device is a spare for this purpose. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 11:57:43 +11:00
NeilBrown	87a8dec91e	md: simplify some 'if' conditionals in raid5_start_reshape. There are two consecutive 'if' statements. if (mddev->delta_disks >= 0) .... if (mddev->delta_disks > 0) The code in the second is equally valid if delta_disks == 0, and these two statements are the only place that 'added_devices' is used. So make them a single if statement, make added_devices a local variable, and re-indent it all. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 11:57:43 +11:00
NeilBrown	de171cb9a5	md: revert change to raid_disks on failure. If we try to update_raid_disks and it fails, we should put 'delta_disks' back to zero. This is important because some code, such as slot_store, assumes that delta_disks has been validated. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-31 11:57:42 +11:00
Tejun Heo	ada609ee2a	workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER WQ_RESCUER is now an internal flag and should only be used in the workqueue implementation proper. Use WQ_MEM_RECLAIM instead. This doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de>	2011-01-25 14:35:54 +01:00
Tejun Heo	49731baa41	block: restore multiple bd_link_disk_holder() support Commit `e09b457b` (block: simplify holder symlink handling) incorrectly assumed that there is only one link at maximum. dm may use multiple links and expects block layer to track reference count for each link, which is different from and unrelated to the exclusive device holder identified by @holder when the device is opened. Remove the single holder assumption and automatic removal of the link and revive the per-link reference count tracking. The code essentially behaves the same as before commit `e09b457b` sans the unnecessary kobject reference count dancing. While at it, note that this facility should not be used by anyone else than the current ones. Sysfs symlinks shouldn't be abused like this and the whole thing doesn't belong in the block layer at all. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Milan Broz <mbroz@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-01-14 18:44:22 +01:00
Linus Torvalds	f6bcfd94c0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits) dm: raid456 basic support dm: per target unplug callback support dm: introduce target callbacks and congestion callback dm mpath: delay activate_path retry on SCSI_DH_RETRY dm: remove superfluous irq disablement in dm_request_fn dm log: use PTR_ERR value instead of ENOMEM dm snapshot: avoid storing private suspended state dm snapshot: persistent make metadata_wq multithreaded dm: use non reentrant workqueues if equivalent dm: convert workqueues to alloc_ordered dm stripe: switch from local workqueue to system_wq dm: dont use flush_scheduled_work dm snapshot: remove unused dm_snapshot queued_bios_work dm ioctl: suppress needless warning messages dm crypt: add loop aes iv generator dm crypt: add multi key capability dm crypt: add post iv call to iv generator dm crypt: use io thread for reads only if mempool exhausted dm crypt: scale to multiple cpus dm crypt: simplify compatible table output ...	2011-01-13 17:30:47 -08:00
Linus Torvalds	509e4aef44	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: Fix removal of extra drives when converting RAID6 to RAID5 md: range check slot number when manually adding a spare. md/raid5: handle manually-added spares in start_reshape. md: fix sync_completed reporting for very large drives (>2TB) md: allow suspend_lo and suspend_hi to decrease as well as increase. md: Don't let implementation detail of curr_resync leak out through sysfs. md: separate meta and data devs md-new-param-to_sync_page_io md-new-param-to-calc_dev_sboffset md: Be more careful about clearing flags bit in ->recovery md: md_stop_writes requires mddev_lock. md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer md: Ensure no IO request to get md device before it is properly initialised. md: Fix single printks with multiple KERN_<level>s md: fix regression resulting in delays in clearing bits in a bitmap md: fix regression with re-adding devices to arrays with no metadata	2011-01-13 17:30:20 -08:00
NeilBrown	bf2cb0dab8	md: Fix removal of extra drives when converting RAID6 to RAID5 When a RAID6 is converted to a RAID5, the extra drive should be discarded. However it isn't due to a typo in a comparison. This bug was introduced in commit `e93f68a1fc` in 2.6.35-rc4, and the fix is suitable for any -stable since then. As the extra drive is not removed, the 'degraded' counter is wrong and so the RAID5 will not respond correctly to a subsequent failure. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
NeilBrown	ba1b41b6b4	md: range check slot number when manually adding a spare. When adding a spare to an active array, we should check the slot number, but allow it to be larger than raid_disks if a reshape is being prepared. Apply the same test when adding a device to an array-under-construction. It already had most of the test in place, but not quite all. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
NeilBrown	1a940fcee3	md/raid5: handle manually-added spares in start_reshape. It is possible to manually add spares to specific slots before starting a reshape. raid5_start_reshape should recognise this possibility and include it in the accounting. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
Rémi Rérolle	13ae864bc8	md: fix sync_completed reporting for very large drives (>2TB) The values exported in the sync_completed file are unsigned long, which overflows with very large drives, resulting in wrong values reported. Since sync_completed uses sectors as unit, we'll start getting wrong values with components larger than 2TB. This patch simply replaces the use of unsigned long by unsigned long long. Signed-off-by: Rémi Rérolle <rrerolle@lacie.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
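The overflow described in the entry above can be checked with a little arithmetic. This is a minimal sketch, assuming 512-byte sectors and a 32-bit `unsigned long` (as on 32-bit platforms); the helper names are hypothetical, and the masking only simulates C unsigned wraparound rather than reproducing the kernel code.

```python
SECTOR_BYTES = 512  # assumption: sync_completed counts 512-byte sectors


def as_unsigned(value, bits):
    """Simulate storing a value in a C unsigned integer of the given width."""
    return value & ((1 << bits) - 1)


def sectors_for(size_bytes):
    """Number of whole sectors in a device of the given size."""
    return size_bytes // SECTOR_BYTES


# A hypothetical 3 TiB component device.
three_tib = 3 * 2**40
true_sectors = sectors_for(three_tib)

# Assumed 32-bit unsigned long: the sector count wraps around.
as_ulong32 = as_unsigned(true_sectors, 32)

# unsigned long long (64 bits): the sector count is preserved.
as_ull64 = as_unsigned(true_sectors, 64)

print(true_sectors, as_ulong32, as_ull64)
```

Under these assumptions a 32-bit counter holds at most 2^32 sectors, which is exactly 2 TiB, matching the threshold the commit message mentions.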
NeilBrown	23ddff3792	md: allow suspend_lo and suspend_hi to decrease as well as increase. The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region to which read/writes are suspended so that the underlying data can be manipulated without user-space noticing. Currently the window they describe can only move forwards along the device. However this is an unnecessary restriction which will cause problems with planned developments. So relax this restriction and allow these endpoints to move arbitrarily. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
NeilBrown	75d3da43cb	md: Don't let implementation detail of curr_resync leak out through sysfs. mddev->curr_resync has artificial values of '1' and '2' which are used by the code which ensures only one resync is happening at a time on any given device. These values are internal and should never be exposed to user-space (except when translated appropriately as in the 'pending' status in /proc/mdstat). Unfortunately they are exposed, as ->curr_resync is assigned to ->curr_resync_completed and that value is directly visible through sysfs. So change the assignments to ->curr_resync_completed to get the same value from elsewhere in a form that doesn't have the magic '1' or '2' values. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
Jonathan Brassow	a6ff7e089c	md: separate meta and data devs Allow the metadata to be on a separate device from the data. This doesn't mean the data and metadata will be on separate physical devices - it simply gives device-mapper and userspace tools more flexibility. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:34 +11:00
Jonathan Brassow	ccebd4c415	md-new-param-to_sync_page_io Add new parameter to 'sync_page_io'. The new parameter allows us to distinguish between metadata and data operations. This becomes important later when we add the ability to use separate devices for data and metadata. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>	2011-01-14 09:14:33 +11:00
Jonathan Brassow	57b2caa394	md-new-param-to-calc_dev_sboffset When we allow for separate devices for data and metadata in a later patch, we will need to be able to calculate the superblock offset based on more than the bdev. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>	2011-01-14 09:14:33 +11:00
NeilBrown	7ebc0be7ff	md: Be more careful about clearing flags bit in ->recovery Setting ->recovery to 0 is generally not a good idea as it could clear bits that shouldn't be cleared. In particular, MD_RECOVERY_FROZEN should only be cleared on explicit request from user-space. So when we need to clear things, just clear the bits that need clearing. As there are a few different places which reap a resync process - and some do an incomplete job - factor out the code for doing that from md_check_recovery and call that function instead of open coding part of it. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Jonathan Brassow <jbrassow@redhat.com>	2011-01-14 09:14:33 +11:00
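The difference between zeroing a flags word and clearing only selected bits, as described in the entry above, can be sketched as follows. This is an illustration only: the bit positions are hypothetical and are not taken from the kernel headers, and the two helper functions are invented for the example.

```python
# Hypothetical bit positions, chosen only for illustration.
MD_RECOVERY_RUNNING = 1 << 0
MD_RECOVERY_SYNC = 1 << 1
MD_RECOVERY_FROZEN = 1 << 2


def clear_all(recovery):
    """The risky approach: every bit is dropped, including FROZEN."""
    return 0


def clear_selected(recovery, bits):
    """Clear only the named bits and leave every other bit untouched."""
    return recovery & ~bits


state = MD_RECOVERY_RUNNING | MD_RECOVERY_SYNC | MD_RECOVERY_FROZEN

# Reaping a finished resync should not silently unfreeze the array.
after_selective = clear_selected(state, MD_RECOVERY_RUNNING | MD_RECOVERY_SYNC)
after_zeroing = clear_all(state)

print(bool(after_selective & MD_RECOVERY_FROZEN))  # FROZEN survives
print(bool(after_zeroing & MD_RECOVERY_FROZEN))    # FROZEN is lost
```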
NeilBrown	defad61a5b	md: md_stop_writes requires mddev_lock. As md_stop_writes manipulates the sync_thread and calls md_update_sb, it needs to be called with mddev_lock held. In all internal cases it is, but the symbol is exported for dm-raid to call and in that case the lock won't be held. So make an exported version which takes the lock, and an internal version which does not. Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:33 +11:00
Jonathan Brassow	43c73ca43b	md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer With the module parameter 'start_dirty_degraded' set, raid5_spare_active() previously called sysfs_notify_dirent() with a NULL argument (rdev->sysfs_state) when a rebuild finished. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2011-01-14 09:14:33 +11:00
NeilBrown	0ca69886a8	md: Ensure no IO request to get md device before it is properly initialised. When an md device is in the process of coming on line it is possible for an IO request (typically a partition table probe) to get through before the array is fully initialised, which can cause unexpected behaviour (e.g. a crash). So explicitly record when the array is ready for IO and don't allow IO through until then. There is no possibility for a similar problem when the array is going off-line as there must only be one 'open' at that time, and it is busy off-lining the array and so cannot send IO requests. So no memory barrier is needed in md_stop() This has been a bug since commit `409c57f380` in 2.6.30 which introduced md_make_request. Before then, each personality would register its own make_request_fn when it was ready. This is suitable for any stable kernel from 2.6.30.y onwards. Cc: <stable@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>	2011-01-14 09:14:33 +11:00
Joe Perches	067032bc62	md: Fix single printks with multiple KERN_<level>s Noticed-by: Russell King <linux@arm.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:14:33 +11:00
NeilBrown	6c98791014	md: fix regression resulting in delays in clearing bits in a bitmap commit `589a594be1` (2.6.37-rc4) fixed a problem where md_thread would sometimes call the ->run function at a bad time. If an error is detected during array start up after the md_thread has been started, the md_thread is killed. This resulted in the ->run function being called once. However the array may not be in a state in which it is safe to call ->run. However the fix imposed meant that ->run was not called on a timeout. This means that when an array goes idle, bitmap bits do not get cleared promptly. While the array is busy the bits will still be cleared when appropriate so this is not very serious. There is no risk to data. Change the test so that we only avoid calling ->run when the thread is being stopped. This more explicitly addresses the problem situation. This is suitable for 2.6.37-stable and any -stable kernel to which `589a594be1` was applied. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2011-01-14 09:13:53 +11:00
NeilBrown	9d09e663d5	dm: raid456 basic support This patch is the skeleton for the DM target that will be the bridge from DM to MD (initially RAID456 and later RAID1). It provides a way to use device-mapper interfaces to the MD RAID456 drivers. As with all device-mapper targets, the nominal public interfaces are the constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO and STATUSTYPE_TABLE). The CTR table looks like the following: 1: <s> <l> raid \ 2: <raid_type> <#raid_params> <raid_params> \ 3: <#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN> Line 1 contains the standard first three arguments to any device-mapper target - the start, length, and target type fields. The target type in this case is "raid". Line 2 contains the arguments that define the particular raid type/personality/level, the required arguments for that raid type, and any optional arguments. Possible raid types include: raid4, raid5_la, raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is planned for the future.) The list of required and optional parameters is the same for all the current raid types. The required parameters are positional, while the optional parameters are given as key/value pairs. The possible parameters are as follows: <chunk_size> Chunk size in sectors. [[no]sync] Force/Prevent RAID initialization [rebuild <idx>] Rebuild the drive indicated by the index [daemon_sleep <ms>] Time between bitmap daemon work to clear bits [min_recovery_rate <kB/sec/disk>] Throttle RAID initialization [max_recovery_rate <kB/sec/disk>] Throttle RAID initialization [max_write_behind <value>] See '-write-behind=' (man mdadm) [stripe_cache <sectors>] Stripe cache size for higher RAIDs Line 3 contains the list of devices that compose the array in metadata/data device pairs. If the metadata is stored separately, a '-' is given for the metadata device position. 
If a drive has failed or is missing at creation time, a '-' can be given for both the metadata and data drives for a given position. Examples: # RAID4 - 4 data drives, 1 parity # No metadata devices specified to hold superblock/bitmap info # Chunk size of 1MiB # (Lines separated for easy reading) 0 1960893648 raid \ raid4 1 2048 \ 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81 # RAID4 - 4 data drives, 1 parity (no metadata devices) # Chunk size of 1MiB, force RAID initialization, # min recovery rate at 20 kiB/sec/disk 0 1960893648 raid \ raid4 4 2048 min_recovery_rate 20 sync\ 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81 Performing a 'dmsetup table' should display the CTR table used to construct the mapping (with possible reordering of optional parameters). Performing a 'dmsetup status' will yield information on the state and health of the array. The output is as follows: 1: <s> <l> raid \ 2: <raid_type> <#devices> <1 health char for each dev> <resync_ratio> Line 1 is standard DM output. Line 2 is best shown by example: 0 1960893648 raid raid4 5 AAAAA 2/490221568 Here we can see the RAID type is raid4, there are 5 devices - all of which are 'A'live, and the array is 2/490221568 complete with recovery. Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-01-13 20:00:02 +00:00
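The 'dmsetup status' layout described in the entry above can be read with a small parser. This is a minimal sketch, assuming exactly the seven whitespace-separated fields shown in the commit message's example; `RaidStatus` and `parse_raid_status` are hypothetical names, not part of any dm tool.

```python
from dataclasses import dataclass


@dataclass
class RaidStatus:
    start: int        # <s>: start sector of the mapping
    length: int       # <l>: length of the mapping in sectors
    target: str       # target type, expected to be "raid"
    raid_type: str    # e.g. "raid4"
    num_devices: int  # <#devices>
    health: str       # one health character per device
    synced: int       # numerator of <resync_ratio>
    total: int        # denominator of <resync_ratio>


def parse_raid_status(line):
    """Parse one status line in the format quoted in the commit message."""
    fields = line.split()
    if len(fields) != 7 or fields[2] != "raid":
        raise ValueError("not a dm raid status line: %r" % line)
    synced, total = fields[6].split("/")
    status = RaidStatus(
        start=int(fields[0]),
        length=int(fields[1]),
        target=fields[2],
        raid_type=fields[3],
        num_devices=int(fields[4]),
        health=fields[5],
        synced=int(synced),
        total=int(total),
    )
    if len(status.health) != status.num_devices:
        raise ValueError("expected one health character per device")
    return status


status = parse_raid_status("0 1960893648 raid raid4 5 AAAAA 2/490221568")
all_alive = all(ch == "A" for ch in status.health)
print(status.raid_type, status.num_devices, all_alive, status.synced, status.total)
```

The same sector arithmetic explains the constructor examples: assuming 512-byte sectors, a chunk size of 2048 sectors is 2048 * 512 = 1048576 bytes, which is the 1MiB the comments mention.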

2283 Commits