linux

Commit Graph

Author	SHA1	Message	Date
NeilBrown	91adb56473	md/raid5: refactor raid5 "run" .. so that the code to create the private data structures is separate. This will help with future code to change the level of an active array. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:39 +11:00
NeilBrown	34817e8c39	md: make sure new_level, new_chunksize, new_layout always have sensible values. When an md array is undergoing a change, we have new_* fields that show the new values. When no change is happening, it is least confusing if these have the same value as the normal fields. This is true in most cases, but not when the values are set via sysfs. So fix this up. A subsequent patch will BUG_ON if these things aren't consistent. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	67cc2b8165	md/raid5: finish support for DDF/raid6 DDF requires RAID6 calculations over different devices in a different order. For md/raid6, we calculate over just the data devices, starting immediately after the 'Q' block. For ddf/raid6 we calculate over all devices, using zeros in place of the P and Q blocks. This requires unfortunately complex loops... Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	99c0fb5f92	md/raid5: Add support for new layouts for raid5 and raid6. DDF uses different layouts for P and Q blocks than current md/raid6 so add those that are missing. Also add support for RAID6 layouts that are identical to various raid5 layouts with the simple addition of one device to hold all of the 'Q' blocks. Finally add 'raid5' layouts to match raid4. These last to will allow online level conversion. Note that this does not provide correct support for DDF/raid6 yet as the order in which data blocks are summed to produce the Q block is significant and different between current md code and DDF requirements. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	911d4ee853	md/raid5: simplify raid5_compute_sector interface Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass a 'struct stripe_head *' and fill in the relevant fields. This is more extensible. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	d0dabf7e57	md/raid6: remove expectation that Q device is immediately after P device. Code currently assumes that the devices in a raid6 stripe are 0 1 ... N-1 P Q in some rotated order. We will shortly add new layouts in which this strict pattern is broken. So remove this expectation. We still assume that the data disks are roughly in-order. However P and Q can be inserted anywhere within that order. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	112bf8970d	md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument This similar to the recent change to get_active_stripe. There is no functional change, just come rearrangement to make future patches cleaner. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
NeilBrown	b5663ba405	md/raid5: simplify interface for init_stripe and get_active_stripe Rather than passing 'pd_idx' and 'disks' to these functions, just pass 'previous' which tells whether to use the 'previous' or 'current' geometry during a reshape, and let init_stripe calculate disks and pd_idx and anything else it might need. This is not a substantial simplification and even adds a division. However we will shortly be adding more complexity to init_stripe to handle more interesting 'reshape' activities, and without this change, the interface to these functions would get very complex. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:39:38 +11:00
Andre Noll	dd8ac336c1	md: Represent raid device size in sectors. This patch renames the "size" field of struct mdk_rdev_s to "sectors" and changes this field to store sectors instead of blocks. All users of this field, linear.c, raid0.c and md.c, are fixed up accordingly which gets rid of many multiplications and divisions. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
Andre Noll	58c0fed400	md: Make mddev->size sector-based. This patch renames the "size" field of struct mddev_s to "dev_sectors" and stores the number of 512-byte sectors instead of the number of 1K-blocks in it. All users of that field, including raid levels 1,4-6,10, are adjusted accordingly. This simplifies the code a bit because it allows to get rid of a couple of divisions/multiplications by two. In order to make checkpatch happy, some minor coding style issues have also been addressed. In particular, size_store() now uses strict_strtoull() instead of simple_strtoull(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	575a80fa4f	md: be more consistent about setting WriteMostly flag when adding a drive to an array When a drive is added to an array using ADD_NEW_DISK, there are two places we can get certain flags from: the metadata on the disk or the flags passed through the IOCTL. For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value from either of those sources depending on if it is set (i.e. we effectively 'or' the two sources together). This makes it awkward to clear, and is at best inconsistent. As documented code (in mdadm) requires that setting MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the inconsistency by always using the value for this flag from the ioctl, and ignoring the value on disk. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	97e4f42d62	md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash Version 1.x metadata has the ability to record the status of a partially completed drive recovery. However we only update that record on a clean shutdown. It would be nice to update it on unclean shutdowns too, particularly when using a bitmap that removes much to the 'sync' effort after an unclean shutdown. One complication with checkpointing recovery is that we only know where we are up to in terms of IO requests started, not which ones have completed. And we need to know what has completed to record how much is recovered. So occasionally pause the recovery until all submitted requests are completed, then update the record of where we are up to. When we have a bitmap, we already do that pause occasionally to keep the bitmap up-to-date. So enhance that code to record the recovery offset and schedule a superblock update. And when there is no bitmap, just pause 16 times during the resync to do a checkpoint. '16' is a fairly arbitrary number. But we don't really have any good way to judge how often is acceptable, and it seems like a reasonable number for now. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	43b2e5d86d	md: move md_k.h from include/linux/raid/ to drivers/md/ It really is nicer to keep related code together.. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	bff61975b3	md: move lots of #include lines out of .h files and into .c This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:33:13 +11:00
NeilBrown	8b2b5c217c	md: move LEVEL_* definition from md_k.h to md_u.h .. as they are part of the user-space interface. Also move MdpMinorShift into there so we can remove duplication. Lastly move mdp_major in. It is less obviously part of the user-space interface, but do_mounts_md.c uses it, and it is acting a bit like user-space. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:03 +11:00
Christoph Hellwig	ef740c372d	md: move headers out of include/linux/raid/ Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now as there are some uses from the outside. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:03 +11:00
Christoph Hellwig	2a40a8aed0	cleanup drivers/md/Makefile Use the -y variables instead of the old -objs so we can easily add conditional objects to the modules. Also always use += to add subobjects to avoid problems when placing additional objects in some place in the file. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
Christoph Hellwig	3dbd8c2e3f	md: stop defining MAJOR_NR MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier kernels, so no need to keep it around. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
Martin K. Petersen	3f9d99c12a	MD data integrity support md: Add support for data integrity to MD If all subdevices support the same protection format the MD device is flagged as integrity capable. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	355a43e641	md: write bitmap information to devices that are undergoing recovery. When we add some spares to an array and start recovery, and we have a bitmap which is stored 'internally' on all devices, we call bitmap_write_all to make sure the bitmap is correct on the new device(s). However that doesn't work as write_sb_page only writes to 'In_sync' devices, and devices undergoing recovery are not 'In_sync' until recovery finishes. So extend write_sb_page (actually next_active_rdev) to include devices that are under recovery. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	d0a4bb4927	md: never clear bit from the write-intent bitmap when the array is degraded. It is safe to clear a bit from the write-intent bitmap for a raid1 if we know the data has been written to all devices, which is what the current test does. But it is not always safe to update the 'events_cleared' counter in that case. This is because one request could complete successfully after some other request has partially failed. So simply disable the clearing and updating of events_cleared whenever the array is degraded. This might end up not clearing some bits that could safely be cleared, but it is safest approach. Note that the bug fixed here did not risk corrupting data by letting the array get out-of-sync. Rather it meant that when a device is removed and re-added to the array, it might incorrectly require a full recovery rather than just recovering based on the bitmap. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	1187cf0a3c	md: Allow write-intent bitmaps to have chunksize < PAGE_SIZE md currently insists that the chunk size used for write-intent bitmaps (the amount of data that corresponds to one chunk) be at least one page. The reason for this restriction is lost in the mists of time, but a review of the code (and a vague memory) suggests that the only problem would be related to resync. Resync tries very hard to work in multiples of a page, but also needs to sync with units of a bitmap_chunk too. This connection comes out in the bitmap_start_sync call. So change bitmap_start_sync to always work in multiples of a page. If the bitmap chunk size is less that one page, we flag multiple chunks as 'syncing' and generally make them all appear to the resync routines like one chunk. All other code either already works with data ranges that could span multiple chunks, or explicitly only cares about a single chunk. Signed-off-by: Neil Brown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
NeilBrown	eea1bf384e	md: Fix is_mddev_idle test (again). There are two problems with is_mddev_idle. 1/ sync_io is 'atomic_t' and hence 'int'. curr_events and all the rest are 'long'. So if sync_io were to wrap on a 64bit host, the value of curr_events would go very negative suddenly, and take a very long time to return to positive. So do all calculations as 'int'. That gives us plenty of precision for what we need. 2/ To initialise rdev->last_events we simply call is_mddev_idle, on the assumption that it will make sure that last_events is in a suitable range. It used to do this, but now it does not. So now we need to be more explicit about initialisation. Signed-off-by: NeilBrown <neilb@suse.de>	2009-03-31 14:27:02 +11:00
Milan Broz	b35f8caa08	dm crypt: wait for endio to complete before destruction The following oops has been reported when dm-crypt runs over a loop device. ... [ 70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000) ... [ 70.381058] Call Trace: [ 70.381058] [<d0d76601>] ? crypt_dec_pending+0x5e/0x62 [dm_crypt] [ 70.381058] [<d0d767b8>] ? crypt_endio+0xa2/0xaa [dm_crypt] [ 70.381058] [<d0d76716>] ? crypt_endio+0x0/0xaa [dm_crypt] [ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e [ 70.381058] [<d0806530>] ? dec_pending+0x224/0x23b [dm_mod] [ 70.381058] [<d08066e4>] ? clone_endio+0x79/0xa4 [dm_mod] [ 70.381058] [<d080666b>] ? clone_endio+0x0/0xa4 [dm_mod] [ 70.381058] [<c01a2f24>] ? bio_endio+0x2b/0x2e [ 70.381058] [<c02bad86>] ? loop_thread+0x380/0x3b7 [ 70.381058] [<c02ba8a1>] ? do_lo_send_aops+0x0/0x165 [ 70.381058] [<c013754f>] ? autoremove_wake_function+0x0/0x33 [ 70.381058] [<c02baa06>] ? loop_thread+0x0/0x3b7 When a table is being replaced, it waits for I/O to complete before destroying the mempool, but the endio function doesn't call mempool_free() until after completing the bio. Fix it by swapping the order of those two operations. The same problem occurs in dm.c with md referenced after dec_pending. Again, we swap the order. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:36 +00:00
Huang Ying	b2174eebd1	dm crypt: fix kcryptd_async_done parameter In the async encryption-complete function (kcryptd_async_done), the crypto_async_request passed in may be different from the one passed to crypto_ablkcipher_encrypt/decrypt. Only crypto_async_request->data is guaranteed to be same as the one passed in. The current kcryptd_async_done uses the passed-in crypto_async_request directly which may cause the AES-NI-based AES algorithm implementation to panic. This patch fixes this bug by only using crypto_async_request->data, which points to dm_crypt_request, the crypto_async_request passed in. The original data (convert_context) is gotten from dm_crypt_request. [mbroz@redhat.com: reworked] Cc: stable@kernel.org Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:33 +00:00
Mikulas Patocka	d659e6cc98	dm io: respect BIO_MAX_PAGES limit dm-io calls bio_get_nr_vecs to get the maximum number of pages to use for a given device. It allocates one additional bio_vec to use internally but failed to respect BIO_MAX_PAGES, so fix this. This was the likely cause of: https://bugzilla.redhat.com/show_bug.cgi?id=173153 Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:30 +00:00
Mikulas Patocka	f80a557008	dm table: rework reference counting fix Fix an error introduced in dm-table-rework-reference-counting.patch. When there is failure after table initialization, we need to use dm_table_destroy, not dm_table_put, to free the table. dm_table_put may be used only after dm_table_get. Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 17:44:26 +00:00
Milan Broz	bc0fd67feb	dm ioctl: validate name length when renaming When renaming a mapped device validate the length of the new name. The rename ioctl accepted any correctly-terminated string enclosed within the data passed from userspace. The other ioctls enforce a size limit of DM_NAME_LEN. If the name is changed and becomes longer than that, the device can no longer be addressed by name. Fix it by properly checking for device name length (including terminating zero). Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-03-16 16:56:01 +00:00
Dan Williams	5fd3a17ed4	md: fix deadlock when stopping arrays Resolve a deadlock when stopping redundant arrays, i.e. ones that require a call to sysfs_remove_group when shutdown. The deadlock is summarized below: Thread1 Thread2 ------- ------- read sysfs attribute stop array take mddev lock sysfs_remove_group sysfs_get_active wait for mddev lock wait for active Sysrq-w: -------- mdmon S 00000017 2212 4163 1 f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc 00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd Call Trace: [<c044d9dd>] ? debug_mutex_add_waiter+0x1d/0x58 [<c06ef451>] __mutex_lock_common+0x1d9/0x338 [<c06ef451>] ? __mutex_lock_common+0x1d9/0x338 [<c06ef5e3>] mutex_lock_interruptible_nested+0x33/0x3a [<c0634553>] ? mddev_lock+0x14/0x16 [<c0634553>] mddev_lock+0x14/0x16 [<c0634eda>] md_attr_show+0x2a/0x49 [<c04e9997>] sysfs_read_file+0x93/0xf9 mdadm D 00000017 2812 4177 1 f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70 00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000 Call Trace: [<c06eed1b>] schedule_timeout+0x1b/0x95 [<c06eed1b>] ? schedule_timeout+0x1b/0x95 [<c06eeb97>] ? wait_for_common+0x34/0xdc [<c044fa8a>] ? trace_hardirqs_on_caller+0x18/0x145 [<c044fbc2>] ? trace_hardirqs_on+0xb/0xd [<c06eec03>] wait_for_common+0xa0/0xdc [<c0428c7c>] ? default_wake_function+0x0/0x12 [<c06eeccc>] wait_for_completion+0x17/0x19 [<c04ea620>] sysfs_addrm_finish+0x19f/0x1d1 [<c04e920e>] sysfs_hash_and_remove+0x42/0x55 [<c04eb4db>] sysfs_remove_group+0x57/0x86 [<c0638086>] do_md_stop+0x13a/0x499 This has been there for a while, but is easier to trigger now that mdmon is closely watching sysfs. Cc: <stable@kernel.org> Reported-by: Jacek Danecki <jacek.danecki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2009-03-04 00:57:25 -07:00
NeilBrown	73d5c38a95	md: avoid races when stopping resync. There has been a race in raid10 and raid1 for a long time which has only recently started showing up due to a scheduler changed. When a sync_read request finishes, as soon as reschedule_retry is called, another thread can mark the resync request as having completed, so md_do_sync can finish, ->stop can be called, and ->conf can be freed. So using conf after reschedule_retry is not safe. Similarly, when finishing a sync_write, calling md_done_sync must be the last thing we do, as it allows a chain of events which will free conf and other data structures. The first of these requires action in raid10.c The second requires action in raid1.c and raid10.c Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-25 13:18:47 +11:00
NeilBrown	78200d45cd	md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery. For raid1/4/5/6, resync (fixing inconsistencies between devices) is very similar to recovery (rebuilding a failed device onto a spare). The both walk through the device addresses in order. For raid10 it can be quite different. resync follows the 'array' address, and makes sure all copies are the same. Recover walks through 'device' addresses and recreates each missing block. The 'bitmap_cond_end_sync' function allows the write-intent-bitmap (When present) to be updated to reflect a partially completed resync. It makes assumptions which mean that it does not work correctly for raid10 recovery at all. In particularly, it can cause bitmap-directed recovery of a raid10 to not recovery some of the blocks that need to be recovered. So move the call to bitmap_cond_end_sync into the resync path, rather than being in the common "resync or recovery" path. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-25 13:18:47 +11:00
NeilBrown	09b4068a7f	md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery. When doing recovery on a raid10 with a write-intent bitmap, we only need to recovery chunks that are flagged in the bitmap. However if we choose to skip a chunk as it isn't flag, the code currently skips the whole raid10-chunk, thus it might not recovery some blocks that need recovering. This patch fixes it. In case that is confusing, it might help to understand that there is a 'raid10 chunk size' which guides how data is distributed across the devices, and a 'bitmap chunk size' which says how much data corresponds to a single bit in the bitmap. This bug only affects cases where the bitmap chunk size is smaller than the raid10 chunk size. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-25 13:18:47 +11:00
Jens Axboe	93dbb39350	block: fix bad definition of BIO_RW_SYNC We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before `213d9417fe`. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-02-18 10:32:00 +01:00
NeilBrown	de01dfadf2	md: Ensure an md array never has too many devices. Each different metadata format supported by md supports a different maximum number of devices. We really should be enforcing this maximum in the kernel, but we aren't quite doing that properly. We currently only enforce it at the 'hot_add' point, which is an older interface which is not used by current userspace. We need to also enforce it at 'add_new_disk' time for active arrays and at 'do_md_run' time when starting a new array. So move the test from 'hot_add' into 'bind_rdev_to_array' which is called from both 'hot_add' and 'add_new_disk, and add a new test in 'analyse_sbs' which is called from 'do_md_run'. This bug (or missing feature) has been around "forever" and so the patch is suitable for any -stable that is currently maintained. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-06 18:02:46 +11:00
Andre Noll	852c8bf484	md: Fix a bug in linear.c causing which_dev() to return the wrong device. `ab5bd5cbc8` introduced the following bug in linear software raid for large arrays on 32 bit machines: which_dev() computes the device holding a given sector by shifting down the sector number to a 32 bit range, dividing by the array spacing and looking up the resulting index in the hash table of the array. Because the computed index might be slightly too small, a loop at the end of which_dev() increases the index until the given sector actually falls into the range of the device associated with that index. The changes of the above mentioned commit caused this loop to check whether the _index_ rather than the sector number is small enough, effectively bypassing the loop and thus possibly returning the wrong device. As reported by Simon Kirby, this leads to errors such as linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464 Fix this bug by introducing a local variable for the index so that the variable containing the passed sector is left unchanged. Cc: stable@kernel.org Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-06 15:10:52 +11:00
NeilBrown	4706b349f4	md: Allow read error in a single drive raid1 to be passed up. If a raid1 only has a single working device and gets a read error, we choose to simply return that error up to the filesystem (or whatever) rather than failing the whole array. However the codes doesn't quite do that. We attempt a readbalance which allocates the same drive, so we retry the read - indefinitely. Instead: If read_balance in the error case chooses the same drive that just failed, treat it as a failure and don't retry. Signed-off-by: NeilBrown <neilb@suse.de>	2009-02-06 15:06:47 +11:00
NeilBrown	4044ba58dd	md: don't retry recovery of raid1 that fails due to error on source drive. If a raid1 has only one working drive and it has a sector which gives an error on read, then an attempt to recover onto a spare will fail, but as the single remaining drive is not removed from the array, the recovery will be immediately re-attempted, resulting in an infinite recovery loop. So detect this situation and don't retry recovery once an error on the lone remaining drive is detected. Allow recovery to be retried once every time a spare is added in case the problem wasn't actually a media error. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:11 +11:00
NeilBrown	efeb53c0e5	md: Allow md devices to be created by name. Using sequential numbers to identify md devices is somewhat artificial. Using names can be a lot more user-friendly. Also, creating md devices by opening the device special file is a bit awkward. So this patch provides a new option for creating and naming devices. Writing a name such as "md_home" to /sys/modules/md_mod/parameters/new_array will cause an array with that name to be created. It will appear in /sys/block/ /proc/partitions and /proc/mdstat as 'md_home'. It will have an arbitrary minor number allocated. md devices that a created by an open are destroyed on the last close when the device is inactive. For named md devices, they will not be destroyed until the array is explicitly stopped, either with the STOP_ARRAY ioctl or by writing 'clear' to /sys/block/md_XXXX/md/array_state. The name of the array must start 'md_' to avoid conflict with other devices. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:10 +11:00
NeilBrown	d3374825ce	md: make devices disappear when they are no longer needed. Currently md devices, once created, never disappear until the module is unloaded. This is essentially because the gendisk holds a reference to the mddev, and the mddev holds a reference to the gendisk, this a circular reference. If we drop the reference from mddev to gendisk, then we need to ensure that the mddev is destroyed when the gendisk is destroyed. However it is not possible to hook into the gendisk destruction process to enable this. So we drop the reference from the gendisk to the mddev and destroy the gendisk when the mddev gets destroyed. However this has a complication. Between the call __blkdev_get->get_gendisk->kobj_lookup->md_probe and the call __blkdev_get->md_open there is no obvious way to hold a reference on the mddev any more, so unless something is done, it will disappear and gendisk will be destroyed prematurely. Also, once we decide to destroy the mddev, there will be an unlockable moment before the gendisk is unlinked (blk_unregister_region) during which a new reference to the gendisk can be created. We need to ensure that this reference can not be used. i.e. the ->open must fail. So: 1/ in md_probe we set a flag in the mddev (hold_active) which indicates that the array should be treated as active, even though there are no references, and no appearance of activity. This is cleared by md_release when the device is closed if it is no longer needed. This ensures that the gendisk will survive between md_probe and md_open. 2/ In md_open we check if the mddev we expect to open matches the gendisk that we did open. If there is a mismatch we return -ERESTARTSYS and modify __blkdev_get to retry from the top in that case. In the -ERESTARTSYS sys case we make sure to wait until the old gendisk (that we succeeded in opening) is really gone so we loop at most once. Some udev configurations will always open an md device when it first appears. If we allow an md device that was just created by an open to disappear on an immediate close, then this can race with such udev configurations and result in an infinite loop the device being opened and closed, then re-open due to the 'ADD' even from the first open, and then close and so on. So we make sure an md device, once created by an open, remains active at least until some md 'ioctl' has been made on it. This means that all normal usage of md devices will allow them to disappear promptly when not needed, but the worst that an incorrect usage will do it cause an inactive md device to be left in existence (it can easily be removed). As an array can be stopped by writing to a sysfs attribute echo clear > /sys/block/mdXXX/md/array_state we need to use scheduled work for deleting the gendisk and other kobjects. This allows us to wait for any pending gendisk deletion to complete by simply calling flush_scheduled_work(). Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:10 +11:00
NeilBrown	a21d15042d	md: centralise all freeing of an 'mddev' in 'md_free' md_free is the .release handler for the md kobj_type. So it makes sense to release all the objects referenced by the mddev in there, rather than just prior to calling kobject_put for what we think is the last time. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:09 +11:00
NeilBrown	8b76539823	md: move allocation of ->queue from mddev_find to md_probe It is more balanced to just do simple initialisation in mddev_find, which allocates and links a new md device, and leave all the more sophisticated allocation to md_probe (which calls mddev_find). md_probe already allocated the gendisk. It should allocate the queue too. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Cheng Renquan	cd2ac9321c	md: need another print_sb for mdp_superblock_1 md_print_devices is called in two code path: MD_BUG(...), and md_ioctl with PRINT_RAID_DEBUG. it will dump out all in use md devices information; However, it wrongly processed two types of superblock in one: The header file <linux/raid/md_p.h> has defined two types of superblock, struct mdp_superblock_s (typedefed with mdp_super_t) according to md with metadata 0.90, and struct mdp_superblock_1 according to md with metadata 1.0 and later, These two types of superblock are very different, The md_print_devices code processed them both in mdp_super_t, that would lead to wrong informaton dump like: [ 6742.345877] [ 6742.345887] md: ********************************** [ 6742.345890] md: * <COMPLETE RAID STATE PRINTOUT> * [ 6742.345892] md: ******************************** [ 6742.345896] md1: <ram7><ram6><ram5><ram4> [ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3 [ 6742.345909] md: rdev superblock: [ 6742.345914] md: SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d [ 6742.345918] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536 [ 6742.345922] md: UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001 [ 6742.345924] D 0: DISK<N:0,(1,8),R:0,S:6> [ 6742.345930] D 1: DISK<N:1,(1,10),R:1,S:6> [ 6742.345933] D 2: DISK<N:2,(1,12),R:2,S:6> [ 6742.345937] D 3: DISK<N:3,(1,14),R:3,S:6> [ 6742.345942] md: THIS: DISK<N:3,(1,14),R:3,S:6> ... [ 6742.346058] md0: <ram3><ram2><ram1><ram0> [ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3 [ 6742.346070] md: rdev superblock: [ 6742.346073] md: SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c [ 6742.346077] md: L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610 [ 6742.346081] md: UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000 [ 6742.346084] D 0: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346089] D 1: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346092] D 2: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346096] D 3: DISK<N:-1,(-1,-1),R:-1,S:-1> [ 6742.346102] md: THIS: DISK<N:0,(0,0),R:0,S:0> ... [ 6742.346219] md: ****************************** [ 6742.346221] Here md1 is metadata 0.90.0, and md0 is metadata 1.2 After some more code to distinguish these two types of superblock, in this patch, it will generate dump information like: [ 7906.755790] [ 7906.755799] md: ******************************** [ 7906.755802] md: * <COMPLETE RAID STATE PRINTOUT> * [ 7906.755804] md: ******************************** [ 7906.755808] md1: <ram7><ram6><ram5><ram4> [ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3 [ 7906.755821] md: rdev superblock (MJ:0): [ 7906.755826] md: SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3 [ 7906.755830] md: L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536 [ 7906.755834] md: UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001 [ 7906.755836] D 0: DISK<N:0,(1,8),R:0,S:6> [ 7906.755842] D 1: DISK<N:1,(1,10),R:1,S:6> [ 7906.755845] D 2: DISK<N:2,(1,12),R:2,S:6> [ 7906.755849] D 3: DISK<N:3,(1,14),R:3,S:6> [ 7906.755855] md: THIS: DISK<N:3,(1,14),R:3,S:6> ... [ 7906.755972] md0: <ram3><ram2><ram1><ram0> [ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3 [ 7906.755984] md: rdev superblock (MJ:1): [ 7906.755989] md: SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd> [ 7906.755990] md: Name: "DG5:0" CT:1226410480 [ 7906.755998] md: L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0 [ 7906.755999] md: Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a [ 7906.756001] md: (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829 [ 7906.756003] md: (MaxDev:384) ... [ 7906.756113] md: ******************************** [ 7906.756116] this md0 (metadata 1.2) information dumping is exactly according to struct mdp_superblock_1. Signed-off-by: Cheng Renquan <crquan@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Dan Williams <dan.j.williams@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Cheng Renquan	159ec1fc06	md: use list_for_each_entry macro directly The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to list_for_each_entry_safe, from <linux/list.h>, it should be defined to use list_for_each_entry_safe, instead of reinventing the wheel. But some calls to each_entry_safe don't really need a safe version, just a direct list_for_each_entry is enough, this could save a temp variable (tmp) in every function that used rdev_for_each. In this patch, most rdev_for_each loops are replaced by list_for_each_entry, totally save many tmp vars; and only in the other situations that will call list_del to delete an entry, the safe version is used. Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Andre Noll	ccacc7d2cf	md: raid0: make hash_spacing and preshift sector-based. This patch renames the hash_spacing and preshift members of struct raid0_private_data to spacing and sector_shift respectively and changes the semantics as follows: We always have spacing = 2 * hash_spacing. In case sizeof(sector_t) > sizeof(u32) we also have sector_shift = preshift + 1 while sector_shift = preshift = 0 otherwise. Note that the values of nb_zone and zone are unaffected by these changes because in the sector_div() preceeding the assignement of these two variables both arguments double. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:08 +11:00
Andre Noll	83838ed878	md: raid0: Represent the size of strip zones in sectors. This completes the block -> sector conversion of struct strip_zone. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	0825b87a7d	md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's. This patch consists only of these trivial changes. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	6b8796cc3d	md: raid0 create_strip_zones(): Make two local variables sector-based. current_offset and curr_zone_offset stored the corresponding offsets as 1K quantities. Rename them to current_start and curr_zone_start to match the naming of struct strip_zone and store the offsets as sector counts. Also, add KERN_INFO to the printk() affected by this change to make checkpatch happy. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	6199d3db0f	md: raid0: Represent zone->zone_offset in sectors. For the same reason as in the previous patch, rename it from zone_offset to zone_start. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:07 +11:00
Andre Noll	019c4e2f3e	md: raid0: Represent device offset in sectors. Rename zone->dev_offset to zone->dev_start to make sure all users have been converted. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
Andre Noll	e0f0686834	md: raid0_make_request(): Replace local variable block by sector. This change already simplifies the code a bit. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
Andre Noll	a471200595	md: raid0_make_request(): Remove local variable chunk_size. We might as well use chunk_sects instead. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
Andre Noll	1b7fdf8ff7	md: raid0_make_request(): Replace chunksize_bits by chunksect_bits. As ffz(~(2 * x)) = ffz(~x) + 1, we have chunksect_bits = chunksize_bits + 1. Fixup all users accordingly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:06 +11:00
NeilBrown	0c3573f19d	md: use sysfs_notify_dirent to notify changes to md/sync_action. There is no compelling need for this, but sysfs_notify_dirent is a nicer interface and the change is good for consistency. Signed-off-by: NeilBrown <neilb@suse.de>	2009-01-09 08:31:05 +11:00
NeilBrown	538452700d	md: fix bitmap-on-external-file bug. commit `a2ed9615e3` fixed a bug with 'internal' bitmaps, but in the process broke 'in a file' bitmaps. So they are broken in 2.6.28 This fixes it, and needs to go in 2.6.28-stable. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2009-01-09 08:31:05 +11:00
Jonathan Brassow	a159c1ac5f	dm snapshot: extend exception store functions Supply dm_add_exception as a callback to the read_metadata function. Add a status function ready for a later patch and name the functions consistently. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:19 +00:00
Alasdair G Kergon	4db6bfe02b	dm snapshot: split out exception store implementations Move the existing snapshot exception store implementations out into separate files. Later patches will place these behind a new interface in preparation for alternative implementations. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:17 +00:00
Jonathan Brassow	1ae25f9c93	dm snapshot: rename struct exception_store Rename struct exception_store to dm_exception_store. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:16 +00:00
Jonathan Brassow	aea53d92f7	dm snapshot: separate out exception store interface Pull structures that bridge the gap between snapshot and exception store out of dm-snap.h and put them in a new .h file - dm-exception-store.h. This file will define the API for new exception stores. Ultimately, dm-snap.h is unnecessary, since only dm-snap.c should be using it. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:15 +00:00
Alasdair G Kergon	fe9cf30eb8	dm mpath: move trigger_event to system workqueue The same workqueue is used both for sending uevents and processing queued I/O. Deadlock has been reported in RHEL5 when sending a uevent was blocked waiting for the queued I/O to be processed. Use scheduled_work() for the asynchronous uevents instead. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:13 +00:00
Milan Broz	784aae735d	dm: add name and uuid to sysfs Implement simple read-only sysfs entry for device-mapper block device. This patch adds a simple sysfs directory named "dm" under block device properties and implements - name attribute (string containing mapped device name) - uuid attribute (string containing UUID, or empty string if not set) The kobject is embedded in mapped_device struct, so no additional memory allocation is needed for initializing sysfs entry. During the processing of sysfs attribute we need to lock mapped device which is done by a new function dm_get_from_kobj, which returns the md associated with kobject and increases the usage count. Each 'show attribute' function is responsible for its own locking. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:12 +00:00
Mikulas Patocka	d58168763f	dm table: rework reference counting Rework table reference counting. The existing code uses a reference counter. When the last reference is dropped and the counter reaches zero, the table destructor is called. Table reference counters are acquired/released from upcalls from other kernel code (dm_any_congested, dm_merge_bvec, dm_unplug_all). If the reference counter reaches zero in one of the upcalls, the table destructor is called from almost random kernel code. This leads to various problems: * dm_any_congested being called under a spinlock, which calls the destructor, which calls some sleeping function. * the destructor attempting to take a lock that is already taken by the same process. * stale reference from some other kernel code keeps the table constructed, which keeps some devices open, even after successful return from "dmsetup remove". This can confuse lvm and prevent closing of underlying devices or reusing device minor numbers. The patch changes reference counting so that the table destructor can be called only at predetermined places. The table has always exactly one reference from either mapped_device->map or hash_cell->new_map. After this patch, this reference is not counted in table->holders. A pair of dm_create_table/dm_destroy_table functions is used for table creation/destruction. Temporary references from the other code increase table->holders. A pair of dm_table_get/dm_table_put functions is used to manipulate it. When the table is about to be destroyed, we wait for table->holders to reach 0. Then, we call the table destructor. We use active waiting with msleep(1), because the situation happens rarely (to one user in 5 years) and removing the device isn't performance-critical task: the user doesn't care if it takes one tick more or not. This way, the destructor is called only at specific points (dm_table_destroy function) and the above problems associated with lazy destruction can't happen. Finally remove the temporary protection added to dm_any_congested(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:10 +00:00
Andi Kleen	ab4c142488	dm: support barriers on simple devices Implement barrier support for single device DM devices This patch implements barrier support in DM for the common case of dm linear just remapping a single underlying device. In this case we can safely pass the barrier through because there can be no reordering between devices. NB. Any DM device might cease to support barriers if it gets reconfigured so code must continue to allow for a possible -EOPNOTSUPP on every barrier bio submitted. - agk Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:09 +00:00
Kiyoshi Ueda	8fbf26ad5b	dm request: add caches This patch prepares some kmem_caches for request-based dm. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:06 +00:00
Milan Broz	23d39f63aa	dm ioctl: allow dm_copy_name_and_uuid to return only one field Allow NULL buffer in dm_copy_name_and_uuid if you only want to return one of the fields. (Required by a following patch that adds these fields to sysfs.) Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:04 +00:00
Milan Broz	ac1f0ac22c	dm log: ensure log bitmap fits on log device Check that the log bitmap will fit within the log device. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:02 +00:00
Milan Broz	2045e88edb	dm log: move region_size validation Move log size validation from mirror target to log constructor. Removed PAGE_SIZE restriction we no longer think necessary. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:05:01 +00:00
Takahiro Yasui	6f3af01cb0	dm log: avoid reinitialising io_req on every operation rw_header function updates three members of io_req data every time when I/O is processed. bi_rw and notify.fn are never modified once they get initialized, and so they can be set in advance. header_to_disk() can also be pulled out of write_header() since only one caller needs it and write_header() can be replaced by rw_header() directly. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:59 +00:00
Mikulas Patocka	10d3bd09a3	dm: consolidate target deregistration error handling Change dm_unregister_target to return void and use BUG() for error reporting. dm_unregister_target can only fail because of programming bug in the target driver. It can't fail because of user's behavior or disk errors. This patch changes unregister_target to return void and use BUG if someone tries to unregister non-registered target or unregister target that is in use. This patch removes code duplication (testing of error codes in all dm targets) and reports bugs in just one place, in dm_unregister_target. In some target drivers, these return codes were ignored, which could lead to a situation where bugs could be missed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:58 +00:00
Jonathan Brassow	d460c65a6a	dm raid1: fix error count Always increase the error count when I/O on a leg of a mirror fails. The error count is used to decide whether to select an alternative mirror leg. If the target doesn't use the "handle_errors" feature, the error count is not updated and the bio can get requeued forever by the read callback. Fix it by increasing error_count before the handle_errors feature checking. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:57 +00:00
Takahiro Yasui	c7a2bd19b7	dm log: fix dm_io_client leak on error paths In create_log_context function, dm_io_client_destroy function needs to be called, when memory allocation of disk_header, sync_bits and recovering_bits failed, but dm_io_client_destroy is not called. Cc: stable@kernel.org Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Acked-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:56 +00:00
Mikulas Patocka	90fa1527bd	dm snapshot: change yield to msleep Change yield() to msleep(1). If the thread had realtime priority, yield() doesn't really yield, so the yielding process would loop indefinitely and cause machine lockup. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:54 +00:00
Mikulas Patocka	a1b51e9867	dm table: drop reference at unbind Move one dm_table_put() so that the last reference in the thread gets dropped in __unbind(). This is required for a following patch, dm-table-rework-reference-counting.patch, which will change the logic in such a way that table destructor is called only at specific points in the code. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2009-01-06 03:04:53 +00:00
Jens Axboe	bb799ca020	bio: allow individual slabs in the bio_set Instead of having a global bio slab cache, add a reference to one in each bio_set that is created. This allows for personalized slabs in each bio_set, so that they can have bios of different sizes. This means we can personalize the bios we return. File systems may want to embed the bio inside another structure, to avoid allocation more items (and stuffing them in ->bi_private) after the get a bio. Or we may want to embed a number of bio_vecs directly at the end of a bio, to avoid doing two allocations to return a bio. This is now possible. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-12-29 08:29:23 +01:00
Ingo Molnar	db8862eafe	Merge branch 'linus' into tracing/hw-branch-tracing	2008-12-24 21:08:26 +01:00
NeilBrown	a2ed9615e3	md: Don't read past end of bitmap when reading bitmap. When we read the write-intent-bitmap off the device, we currently read a whole number of pages. When PAGE_SIZE is 4K, this works due to the alignment we enforce on the superblock and bitmap. When PAGE_SIZE is 64K, this case read past the end-of-device which causes an error. When we write the superblock, we ensure to clip the last page to just be the required size. Copy that code into the read path to just read the required number of sectors. Signed-off-by: Neil Brown <neilb@suse.de> Cc: stable@kernel.org	2008-12-19 16:25:01 +11:00
Ingo Molnar	970987beb9	Merge branches 'tracing/ftrace', 'tracing/function-graph-tracer' and 'tracing/urgent' into tracing/core	2008-12-05 14:45:22 +01:00
Milan Broz	0e435ac26e	block: fix setting of max_segment_size and seg_boundary mask Fix setting of max_segment_size and seg_boundary mask for stacked md/dm devices. When stacking devices (LVM over MD over SCSI) some of the request queue parameters are not set up correctly in some cases by default, namely max_segment_size and and seg_boundary mask. If you create MD device over SCSI, these attributes are zeroed. Problem become when there is over this mapping next device-mapper mapping - queue attributes are set in DM this way: request_queue max_segment_size seg_boundary_mask SCSI 65536 0xffffffff MD RAID1 0 0 LVM 65536 -1 (64bit) Unfortunately bio_add_page (resp. bio_phys_segments) calculates number of physical segments according to these parameters. During the generic_make_request() is segment cout recalculated and can increase bio->bi_phys_segments count over the allowed limit. (After bio_clone() in stack operation.) Thi is specially problem in CCISS driver, where it produce OOPS here BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); (MAXSEGENTRIES is 31 by default.) Sometimes even this command is enough to cause oops: dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10 This command generates bios with 250 sectors, allocated in 32 4k-pages (last page uses only 1024 bytes). For LVM layer, it allocates bio with 31 segments (still OK for CCISS), unfortunatelly on lower layer it is recalculated to 32 segments and this violates CCISS restriction and triggers BUG_ON(). The patch tries to fix it by: * initializing attributes above in queue request constructor blk_queue_make_request() * make sure that blk_queue_stack_limits() inherits setting (DM uses its own function to set the limits because it blk_queue_stack_limits() was introduced later. It should probably switch to use generic stack limit function too.) * sets the default seg_boundary value in one place (blkdev.h) * use this mask as default in DM (instead of -1, which differs in 64bit) Bugs related to this: https://bugzilla.redhat.com/show_bug.cgi?id=471639 http://bugzilla.kernel.org/show_bug.cgi?id=8672 Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Tejun Heo <htejun@gmail.com> Cc: Mike Miller <mike.miller@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-12-03 12:55:55 +01:00
Ingo Molnar	0bfc24559d	blktrace: port to tracepoints, update Port to the new tracepoints API: split DEFINE_TRACE() and DECLARE_TRACE() sites. Spread them out to the usage sites, as suggested by Mathieu Desnoyers. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>	2008-11-26 13:04:35 +01:00
Arnaldo Carvalho de Melo	5f3ea37c77	blktrace: port to tracepoints This was a forward port of work done by Mathieu Desnoyers, I changed it to encode the 'what' parameter on the tracepoint name, so that one can register interest in specific events and not on classes of events to then check the 'what' parameter. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-11-26 12:13:34 +01:00
Chandra Seetharaman	8a57dfc6f9	dm: avoid destroying table in dm_any_congested dm_any_congested() just checks for the DMF_BLOCK_IO and has no code to make sure that suspend waits for dm_any_congested() to complete. This patch adds such a check. Without it, a race can occur with dm_table_put() attempting to destroying the table in the wrong thread, the one running dm_any_congested() which is meant to be quick and return immediately. Two examples of problems: 1. Sleeping functions called from congested code, the caller of which holds a spin lock. 2. An ABBA deadlock between pdflush and multipathd. The two locks in contention are inode lock and kernel lock. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:14 +00:00
Mikulas Patocka	d221d2e776	dm: move pending queue wake_up end_io_acct This doesn't fix any bug, just moves wake_up immediately after decrementing md->pending, for better code readability. It must be clear to anyone manipulating md->pending to wake up the queue if md->pending reaches zero, so move the wakeup as close to the decrementing as possible. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:10 +00:00
Chandra Seetharaman	14e98c5ca8	dm mpath: warn if args ignored Currently dm ignores the parameters provided to hardware handlers without providing any notifications to the user. This patch just prints a warning message so that the user knows that the arguments are ignored. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:06 +00:00
Chandra Seetharaman	b81aa1c792	dm mpath: avoid attempting to activate null path Path activation code is called even when the pgpath is NULL. This could lead to a panic in activate_path(). Such a panic is seen in -rt kernel. This problem has been there before the pg_init() was moved to a workqueue. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:39:00 +00:00
Heinz Mauelshagen	6edebdee48	dm stripe: fix init failure Don't proceed if dm_stripe_init() fails to register itself as a dm target. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-11-13 23:38:56 +00:00
Mikulas Patocka	18776c7316	dm raid1: flush workqueue before destruction We queue work on keventd queue --- so this queue must be flushed in the destructor. Otherwise, keventd could access mirror_set after it was freed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-11-13 23:38:52 +00:00
Andre Noll	f1cd14ae52	md: linear: Fix a division by zero bug for very small arrays. We currently oops with a divide error on starting a linear software raid array consisting of at least two very small (< 500K) devices. The bug is caused by the calculation of the hash table size which tries to compute sector_div(sz, base) with "base" being zero due to the small size of the component devices of the array. Fix this by requiring the hash spacing to be at least one which implies that also "base" is non-zero. This bug has existed since about 2.6.14. Cc: stable@kernel.org Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-11-06 19:41:24 +11:00
NeilBrown	a53a6c8575	md: fix bug in raid10 recovery. Adding a spare to a raid10 doesn't cause recovery to start. This is due to an silly type in commit `6c2fce2ef6` and so is a bug in 2.6.27 and .28-rc. Thanks to Thomas Backlund for bisecting to find this. Cc: Thomas Backlund <tmb@mandriva.org> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2008-11-06 17:28:20 +11:00
NeilBrown	cb3ac42b8a	md: revert the recent addition of a call to the BLKRRPART ioctl. It turns out that it is only safe to call blkdev_ioctl when the device is actually open (as ->bd_disk is set to NULL on last close). And it is quite possible for do_md_stop to be called when the device is not open. So discard the call to blkdev_ioctl(BLKRRPART) which was added in commit `934d9c23b4` It is just as easy to call this ioctl from userspace when needed (on mdadm -S) so leave it out of the kernel Signed-off-by: NeilBrown <neilb@suse.de>	2008-11-06 17:28:01 +11:00
Linus Torvalds	721d5dfe7e	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: destroy partitions and notify udev when md array is stopped.	2008-10-30 18:36:16 -07:00
Mikulas Patocka	879129d208	dm snapshot: wait for chunks in destructor If there are several snapshots sharing an origin and one is removed while the origin is being written to, the snapshot's mempool may get deleted while elements are still referenced. Prior to dm-snapshot-use-per-device-mempools.patch the pending exceptions may still have been referenced after the snapshot was destroyed, but this was not a problem because the shared mempool was still there. This patch fixes the problem by tracking the number of mempool elements in use. The scenario: - You have an origin and two snapshots 1 and 2. - Someone writes to the origin. - It creates two exceptions in the snapshots, snapshot 1 will be primary exception, snapshot 2's pending_exception->primary_pe will point to the exception in snapshot 1. - The exceptions are being relocated, relocation of exception 1 finishes (but it's pending_exception is still allocated, because it is referenced by an exception from snapshot 2) - The user lvremoves snapshot 1 --- it calls just suspend (does nothing) and destructor. md->pending is zero (there is no I/O submitted to the snapshot by md layer), so it won't help us. - The destructor waits for kcopyd jobs to finish on snapshot 1 --- but there are none. - The destructor on snapshot 1 cleans up everything. - The relocation of exception on snapshot 2 finishes, it drops reference on primary_pe. This frees its primary_pe pointer. Primary_pe points to pending exception created for snapshot 1. So it frees memory into non-existing mempool. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-30 13:33:16 +00:00
Mikulas Patocka	60c856c8e2	dm snapshot: fix register_snapshot deadlock register_snapshot() performs a GFP_KERNEL allocation while holding _origins_lock for write, but that could write out dirty pages onto a device that attempts to acquire _origins_lock for read, resulting in deadlock. So move the allocation up before taking the lock. This path is not performance-critical, so it doesn't matter that we allocate memory and free it if we find that we won't need it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-30 13:33:12 +00:00
Ilpo Jarvinen	b34578a484	dm raid1: fix do_failures Missing braces. Commit `1f965b1943` (dm raid1: separate region_hash interface part1) broke it. Signed-off-by: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Heinz Mauelshagen <hjm@redhat.com>	2008-10-30 13:33:07 +00:00
NeilBrown	934d9c23b4	md: destroy partitions and notify udev when md array is stopped. md arrays are not currently destroyed when they are stopped - they remain in /sys/block. Last time I tried this I tripped over locking too much. A consequence of this is that udev doesn't remove anything from /dev. This is rather ugly. As an interim measure until proper device removal can be achieved, make sure all partitions are removed using the BLKRRPART ioctl, and send a KOBJ_CHANGE when an md array is stopped. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-28 17:01:23 +11:00
Linus Torvalds	f8d56f1771	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: allow extended partitions on md devices. md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state md: use sysfs_notify_dirent to notify changes to md/array_state	2008-10-26 16:42:18 -07:00
Linus Torvalds	2248485640	Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits) [PATCH] kill the rest of struct file propagation in block ioctls [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET [PATCH] get rid of blkdev_locked_ioctl() [PATCH] get rid of blkdev_driver_ioctl() [PATCH] sanitize blkdev_get() and friends [PATCH] remember mode of reiserfs journal [PATCH] propagate mode through swsusp_close() [PATCH] propagate mode through open_bdev_excl/close_bdev_excl [PATCH] pass fmode_t to blkdev_put() [PATCH] kill the unused bsize on the send side of /dev/loop [PATCH] trim file propagation in block/compat_ioctl.c [PATCH] end of methods switch: remove the old ones [PATCH] switch sr [PATCH] switch sd [PATCH] switch ide-scsi [PATCH] switch tape_block [PATCH] switch dcssblk [PATCH] switch dasd [PATCH] switch mtd_blkdevs [PATCH] switch mmc ...	2008-10-23 10:23:07 -07:00
Linus Torvalds	5ed487bc2c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (46 commits) [PATCH] fs: add a sanity check in d_free [PATCH] i_version: remount support [patch] vfs: make security_inode_setattr() calling consistent [patch 1/3] FS_MBCACHE: don't needlessly make it built-in [PATCH] move executable checking into ->permission() [PATCH] fs/dcache.c: update comment of d_validate() [RFC PATCH] touch_mnt_namespace when the mount flags change [PATCH] reiserfs: add missing llseek method [PATCH] fix ->llseek for more directories [PATCH vfs-2.6 6/6] vfs: add LOOKUP_RENAME_TARGET intent [PATCH vfs-2.6 5/6] vfs: remove LOOKUP_PARENT from non LOOKUP_PARENT lookup [PATCH vfs-2.6 4/6] vfs: remove unnecessary fsnotify_d_instantiate() [PATCH vfs-2.6 3/6] vfs: add __d_instantiate() helper [PATCH vfs-2.6 2/6] vfs: add d_ancestor() [PATCH vfs-2.6 1/6] vfs: replace parent == dentry->d_parent by IS_ROOT() [PATCH] get rid of on-stack dentry in udf [PATCH 2/2] anondev: switch to IDA [PATCH 1/2] anondev: init IDR statically [JFFS2] Use d_splice_alias() not d_add() in jffs2_lookup() [PATCH] Optimise NFS readdir hack slightly. ...	2008-10-23 10:22:40 -07:00
Christoph Hellwig	72e8264eda	[PATCH] dm: kill lookup_device wrapper Now that lookup_bdev is exported and used by dm just use it directly instead of through a trivial wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-23 05:12:57 -04:00
Kiyoshi Ueda	51157b4ab4	dm: tidy local_init This patch tidies local_init() in preparation for request-based dm. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:08 +01:00
Kiyoshi Ueda	f431d9666f	dm: remove unused flush_all This patch removes the DM_WQ_FLUSH_ALL state that is unnecessary. The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend() is never invoked because: - 'goto flush_and_out' is the same as 'goto out' because the 'goto flush_and_out' is called only when '!noflush' - If r is non-zero, then the code above will invoke 'goto out' and skip this code. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:07 +01:00
Heinz Mauelshagen	1f965b1943	dm raid1: separate region_hash interface part1 Separate the region hash code from raid1 so it can be shared by forthcoming targets. Use BUG_ON() for failed async dm_io() calls. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:06 +01:00
Martin K. Petersen	f3e1d26ede	dm: mark split bio as cloned When a bio gets split, mark its fragments with the BIO_CLONED flag. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:04 +01:00
Milan Broz	0a4a1047a4	dm crypt: remove waitqueue Remove waitqueue no longer needed with the async crypto interface. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:03 +01:00
Milan Broz	393b47ef23	dm crypt: fix async split When writing io, dm-crypt has to allocate a new cloned bio and encrypt the data into newly-allocated pages attached to this bio. In rare cases, because of hw restrictions (e.g. physical segment limit) or memory pressure, sometimes more than one cloned bio has to be used, each processing a different fragment of the original. Currently there is one waitqueue which waits for one fragment to finish and continues processing the next fragment. But when using asynchronous crypto this doesn't work, because several fragments may be processed asynchronously or in parallel and there is only one crypt context that cannot be shared between the bio fragments. The result may be corruption of the data contained in the encrypted bio. The patch fixes this by allocating new dm_crypt_io structs (with new crypto contexts) and running them independently. The fragments contains a pointer to the base dm_crypt_io struct to handle reference counting, so the base one is properly deallocated after all the fragments are finished. In a low memory situation, this only uses one additional object from the mempool. If the mempool is empty, the next allocation simple waits for previous fragments to complete. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:02 +01:00
Milan Broz	b635b00e0e	dm crypt: tidy sector Prepare local sector variable (offset) for later patch. Do not update io->sector for still-running I/O. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:45:00 +01:00
Mikulas Patocka	586e80e6ee	dm: remove dm header from targets Change #include "dm.h" to #include <linux/device-mapper.h> in all targets. Targets should not need direct access to internal DM structures. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:44:59 +01:00
Mikulas Patocka	d63a5ce3c0	dm: publish array_too_big Move array_too_big to include/linux/device-mapper.h because it is used by targets. Remove the test from dm-raid1 as the number of mirror legs is limited such that it can never fail. (Even for stripes it seems rather unlikely.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:44:57 +01:00
Mikulas Patocka	7acedc5b98	dm exception store: fix misordered writes We must zero the next chunk on disk before writing out the current chunk, not after. Otherwise if the machine crashes at the wrong time, the "end of metadata" marker may be missing. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:56 +01:00
Alasdair G Kergon	7c9e6c1732	dm exception store: refactor zero_area Use a separate buffer for writing zeroes to the on-disk snapshot exception store, make the updating of ps->current_area explicit and refactor the code in preparation for the fix in the next patch. No functional change. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:55 +01:00
Mikulas Patocka	f68d4f3d39	dm snapshot: drop unused last_percent The last_percent field is unused - remove it. (It dates from when events were triggered as each X% filled up.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-21 17:44:53 +01:00
Mikulas Patocka	7c5f78b9d7	dm snapshot: fix primary_pe race Fix a race condition with primary_pe ref_count handling. put_pending_exception runs under dm_snapshot->lock, it does atomic_dec_and_test on primary_pe->ref_count, and later does atomic_read primary_pe->ref_count. __origin_write does atomic_dec_and_test on primary_pe->ref_count without holding dm_snapshot->lock. This opens the following race condition: Assume two CPUs, CPU1 is executing put_pending_exception (and holding dm_snapshot->lock). CPU2 is executing __origin_write in parallel. primary_pe->ref_count == 2. CPU1: if (primary_pe && atomic_dec_and_test(&primary_pe->ref_count)) origin_bios = bio_list_get(&primary_pe->origin_bios); ... decrements primary_pe->ref_count to 1. Doesn't load origin_bios CPU2: if (first && atomic_dec_and_test(&primary_pe->ref_count)) { flush_bios(bio_list_get(&primary_pe->origin_bios)); free_pending_exception(primary_pe); /* If we got here, pe_queue is necessarily empty. */ return r; } ... decrements primary_pe->ref_count to 0, submits pending bios, frees primary_pe. CPU1: if (!primary_pe \|\| primary_pe != pe) free_pending_exception(pe); ... this has no effect. if (primary_pe && !atomic_read(&primary_pe->ref_count)) free_pending_exception(primary_pe); ... sees ref_count == 0 (written by CPU 2), does double free !! This bug can happen only if someone is simultaneously writing to both the origin and the snapshot. If someone is writing only to the origin, __origin_write will submit kcopyd request after it decrements primary_pe->ref_count (so it can't happen that the finished copy races with primary_pe->ref_count decrementation). If someone is writing only to the snapshot, __origin_write isn't invoked at all and the race can't happen. The race happens when someone writes to the snapshot --- this creates pending_exception with primary_pe == NULL and starts copying. Then, someone writes to the same chunk in the snapshot, and __origin_write races with termination of already submitted request in pending_complete (that calls put_pending_exception). This race may be reason for bugs: http://bugzilla.kernel.org/show_bug.cgi?id=11636 https://bugzilla.redhat.com/show_bug.cgi?id=465825 The patch fixes the code to make sure that: 1. If atomic_dec_and_test(&primary_pe->ref_count) returns false, the process must no longer dereference primary_pe (because someone else may free it under us). 2. If atomic_dec_and_test(&primary_pe->ref_count) returns true, the process is responsible for freeing primary_pe. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:51 +01:00
Kazuo Ito	b673c3a819	dm kcopyd: avoid queue shuffle Write throughput to LVM snapshot origin volume is an order of magnitude slower than those to LV without snapshots or snapshot target volumes, especially in the case of sequential writes with O_SYNC on. The following patch originally written by Kevin Jamieson and Jan Blunck and slightly modified for the current RCs by myself tries to improve the performance by modifying the behaviour of kcopyd, so that it pushes back an I/O job to the head of the job queue instead of the tail as process_jobs() currently does when it has to wait for free pages. This way, write requests aren't shuffled to cause extra seeks. I tested the patch against 2.6.27-rc5 and got the following results. The test is a dd command writing to snapshot origin followed by fsync to the file just created/updated. A couple of filesystem benchmarks gave me similar results in case of sequential writes, while random writes didn't suffer much. dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=... [conv=notrunc when updating] 1) linux 2.6.27-rc5 without the patch, write to snapshot origin, average throughput (MB/s) 10M 100M 1000M create,dd 511.46 610.72 11.81 create,dd+fsync 7.10 6.77 8.13 update,dd 431.63 917.41 12.75 update,dd+fsync 7.79 7.43 8.12 compared with write throughput to LV without any snapshots, all dd+fsync and 1000 MiB writes perform very poorly. 10M 100M 1000M create,dd 555.03 608.98 123.29 create,dd+fsync 114.27 72.78 76.65 update,dd 152.34 1267.27 124.04 update,dd+fsync 130.56 77.81 77.84 2) linux 2.6.27-rc5 with the patch, write to snapshot origin, average throughput (MB/s) 10M 100M 1000M create,dd 537.06 589.44 46.21 create,dd+fsync 31.63 29.19 29.23 update,dd 487.59 897.65 37.76 update,dd+fsync 34.12 30.07 26.85 Although still not on par with plain LV performance - cannot be avoided because it's copy on write anyway - this simple patch successfully improves throughtput of dd+fsync while not affecting the rest. Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Kazuo Ito <ito.kazuo@oss.ntt.co.jp> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-10-21 17:44:50 +01:00
Al Viro	9a1c354276	[PATCH] pass fmode_t to blkdev_put() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:58 -04:00
Al Viro	a39907fa2f	[PATCH] switch md Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:31 -04:00
Al Viro	fe5f9f2cd5	[PATCH] switch dm ioctl() doesn't need BKL here Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:29 -04:00
Al Viro	d4430d62fa	[PATCH] beginning of methods conversion To keep the size of changesets sane we split the switch by drivers; to keep the damn thing bisectable we do the following: 1) rename the affected methods, add ones with correct prototypes, make (few) callers handle both. That's this changeset. 2) for each driver convert to new methods. ALL drivers are converted in this series. 3) kill the old (renamed) methods. Note that it _is_ a flagday; all in-tree drivers are converted and by the end of this series no trace of old methods remain. The only reason why we do that this way is to keep the damn thing bisectable and allow per-driver debugging if anything goes wrong. New methods: open(bdev, mode) release(disk, mode) ioctl(bdev, mode, cmd, arg) /* Called without BKL / compat_ioctl(bdev, mode, cmd, arg) locked_ioctl(bdev, mode, cmd, arg) / Called with BKL, legacy */ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:32 -04:00
Al Viro	633a08b812	[PATCH] introduce __blkdev_driver_ioctl() Analog of blkdev_driver_ioctl() with sane arguments. For now uses fake struct file, by the end of the series it won't and blkdev_driver_ioctl() will become a wrapper around it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:26 -04:00
Al Viro	647b3d0084	[PATCH] lose unused arguments in dm ioctl callbacks Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:18 -04:00
Al Viro	aeb5d72706	[PATCH] introduce fmode_t, do annotations Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:47:06 -04:00
NeilBrown	92850bbd71	md: allow extended partitions on md devices. The new extended partition support provides a much nicer was to have partitions on md devices that the 'mdp' alternate major. We cannot really get rid of 'mdp' at this time, but we can enable extended partitions as that will probably make life easier for sysadmins. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-21 13:25:32 +11:00
NeilBrown	3c0ee63a64	md: use sysfs_notify_dirent to notify changes to md/dev-xxx/state The 'state' file for a device reports, for example, when the device has failed. Changes should be reported to userspace ASAP without the possibility of blocking on low-memory. sysfs_notify does have that possibility (as it takes a mutex which can be held across a kmalloc) so use sysfs_notify_dirent instead. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-21 13:25:28 +11:00
NeilBrown	b62b75905d	md: use sysfs_notify_dirent to notify changes to md/array_state Now that we have sysfs_notify_dirent, use it to notify changes to md/array_state. As sysfs_notify_dirent can be called in atomic context, we can remove the delayed notify and the MD_NOTIFY_ARRAY_STATE flag. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-21 13:25:21 +11:00
Linus Torvalds	ed09441dac	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (39 commits) [SCSI] sd: fix compile failure with CONFIG_BLK_DEV_INTEGRITY=n libiscsi: fix locking in iscsi_eh_device_reset libiscsi: check reason why we are stopping iscsi session to determine error value [SCSI] iscsi_tcp: return a descriptive error value during connection errors [SCSI] libiscsi: rename host reset to target reset [SCSI] iscsi class: fix endpoint id handling [SCSI] libiscsi: Support drivers initiating session removal [SCSI] libiscsi: fix data corruption when target has to resend data-in packets [SCSI] sd: Switch kernel printing level for DIF messages [SCSI] sd: Correctly handle all combinations of DIF and DIX [SCSI] sd: Always print actual protection_type [SCSI] sd: Issue correct protection operation [SCSI] scsi_error: fix target reset handling [SCSI] lpfc 8.2.8 v2 : Add statistical reporting control and additional fc vendor events [SCSI] lpfc 8.2.8 v2 : Add sysfs control of target queue depth handling [SCSI] lpfc 8.2.8 v2 : Revert target busy in favor of transport disrupted [SCSI] scsi_dh_alua: remove REQ_NOMERGE [SCSI] lpfc 8.2.8 : update driver version to 8.2.8 [SCSI] lpfc 8.2.8 : Add MSI-X support [SCSI] lpfc 8.2.8 : Update driver to use new Host byte error code DID_TRANSPORT_DISRUPTED ...	2008-10-17 09:00:23 -07:00
Linus Torvalds	c472273f86	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: fix input truncation in safe_delay_store() md: check for memory allocation failure in faulty personality md: build failure due to missing delay.h md: Relax minimum size restrictions on chunk_size. md: remove space after function name in declaration and call. md: Remove unnecessary #includes, #defines, and function declarations. md: Convert remaining 1k representations in linear.c to sectors. md: linear.c: Make two local variables sector-based. md: linear: Represent dev_info->size and dev_info->offset in sectors. md: linear.c: Remove broken debug code. md: linear.c: Remove pointless initialization of curr_offset. md: linear.c: Fix typo in comment. md: Don't try to set an array to 'read-auto' if it is already in that state. md: Allow metadata_version to be updated for externally managed metadata. md: Fix rdev_size_store with size == 0	2008-10-16 11:55:11 -07:00
Dan Williams	97ce0a7f9c	md: fix input truncation in safe_delay_store() safe_delay_store() currently truncates the last character of input since it tells strlcpy that the buffer can only hold 'len' characters, off by one. sysfs already null terminates the buffer, so just increase the last argument to strlcpy. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-16 17:03:08 +11:00
Sven Wegener	08ff39f1c8	md: check for memory allocation failure in faulty personality It's a fault injection module, but I don't think we should oops here. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-10-16 14:16:53 +11:00
Stephen Rothwell	255707274e	md: build failure due to missing delay.h Today's linux-next build (powerpc ppc64_defconfig) failed like this: drivers/md/raid1.c: In function 'sync_request': drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible' make[3]: * [drivers/md/raid1.o] Error 1 make[3]: * Waiting for unfinished jobs.... drivers/md/raid10.c: In function 'sync_request': drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible' make[3]: *** [drivers/md/raid10.o] Error 1 drivers/md/md.c: In function 'md_do_sync': drivers/md/md.c:5915: error: implicit declaration of function 'msleep' Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove unnecessary #includes, #defines, and function declarations"). I added the following patch. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-15 21:57:05 +11:00
Mike Christie	6000a368cd	[SCSI] block: separate failfast into multiple bits. Multipath is best at handling transport errors. If it gets a device error then there is not much the multipath layer can do. It will just access the same device but from a different path. This patch breaks up failfast into device, transport and driver errors. The multipath layers (md and dm mutlipath) only ask the lower levels to fast fail transport errors. The user of failfast, read ahead, will ask to fast fail on all errors. Note that blk_noretry_request will return true if any failfast bit is set. This allows drivers that do not support the multipath failfast bits to continue to fail on any failfast error like before. Drivers like scsi that are able to fail fast specific errors can check for the specific fail fast type. In the next patch I will convert scsi. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-10-13 09:28:52 -04:00
NeilBrown	4bbf3771ca	md: Relax minimum size restrictions on chunk_size. Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE. This makes moving an array to a machine with a larger PAGE_SIZE, or changing the kernel to use a larger PAGE_SIZE, can stop an array from working. For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync process works on whole pages at a time, and assumes them to be wholly within a stripe. For other raid personalities, this restriction is not needed at all and can be dropped. So remove the test on chunk_size from common can, and add it in just the places where it is needed: raid10 and raid4/5/6. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	d710e13812	md: remove space after function name in declaration and call. Having function (args) instead of function(args) make is harder to search for calls of particular functions. So remove all those spaces. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	fb4d8c76e5	md: Remove unnecessary #includes, #defines, and function declarations. A lot of cruft has gathered over the years. Time to remove it. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	ab5bd5cbc8	md: Convert remaining 1k representations in linear.c to sectors. This patch renames hash_spacing and preshift to spacing and sector_shift respectively with the following change of semantics: Case 1: (sizeof(sector_t) <= sizeof(u32)). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In this case, we have sector_shift = preshift = 0 and spacing = 2 * hash_spacing. Hence, the index for the hash table which is computed by the new code in which_dev() as sector / spacing equals the old value which was (sector/2) / hash_spacing. Note also that the value of nb_zone stays the same because both sz and base double. Case 2: (sizeof(sector_t) > sizeof(u32)). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ (aka the shifting dance case). Here we have sector_shift = preshift + 1 and spacing = 2 * hash_spacing during the computation of nb_zone and curr_sector, but spacing = hash_spacing in which_dev() because in the last hunk of the patch for linear.c we shift down conf->spacing (= 2 * hash_spacing) by one more bit than in the old code. Hence in the computation of nb_zone, sz and base have the same value as before, so nb_zone is not affected. Also curr_sector in the next hunk stays the same. In which_dev() the hash table index is computed as (sector >> sector_shift) / spacing In view of sector_shift = preshift + 1 and spacing = hash_spacing, this equals ((sector/2) >> preshift) / hash_spacing which is the value computed by the old code. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	23242fbb47	md: linear.c: Make two local variables sector-based. This is a preparation for representing also the remaining fields of struct linear_private_data as sectors. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	6283815d18	md: linear: Represent dev_info->size and dev_info->offset in sectors. Rename them to num_sectors and start_sector which is more descriptive. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	451708d2a4	md: linear.c: Remove broken debug code. conf->smallest_size is undefined since day one of the git repo.. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	481d86c7eb	md: linear.c: Remove pointless initialization of curr_offset. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
Andre Noll	e61130228e	md: linear.c: Fix typo in comment. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	80268ee927	md: Don't try to set an array to 'read-auto' if it is already in that state. 'read-auto' is a variant of 'readonly' which will switch to writable on the first write attempt. Calling do_md_stop to set the array readonly when it is already readonly returns an error. So make sure not to do that. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:12 +11:00
NeilBrown	ea43ddd849	md: Allow metadata_version to be updated for externally managed metadata. For externally managed metadata, the 'metadata_version' sysfs attribute is really just a channel for user-space programs to communicate about how the array is being managed. It can be useful for this to be changed while the array is active. Normally changes to metadata_version are not permitted while the array is active. Change that so that if the metadata is externally managed, the metadata_version can be changed to a different flavour of external management. Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:11 +11:00
Chris Webb	7d3c6f8717	md: Fix rdev_size_store with size == 0 Fix rdev_size_store with size == 0. size == 0 means to use the largest size allowed by the underlying device and is used when modifying an active array. This fixes a regression introduced by commit `d7027458d6` Cc: <stable@kernel.org> Signed-off-by: Chris Webb <chris@arachsys.com> Signed-off-by: NeilBrown <neilb@suse.de>	2008-10-13 11:55:11 +11:00
Alan Jenkins	ce52aebd02	raid, fastboot: hide RAID autodetect option if MD is compiled as a module Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-12 08:25:14 -07:00
Arjan van de Ven	a364092a41	raid: make RAID autodetect default a KConfig option RAID autodetect has the side effect of requiring synchronisation of all device drivers, which can make the boot several seconds longer (I've measured 7 on one of my laptops).... even for systems that don't have RAID setup for the root filesystem (the only FS where this matters). This patch makes the default for autodetect a config option; either way the user can always override via the kernel command line. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: NeilBrown <neilb@suse.de>	2008-10-12 08:25:02 -07:00
Linus Torvalds	b0af205afb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm: detect lost queue dm: publish dm_vcalloc dm: publish dm_table_unplug_all dm: publish dm_get_mapinfo dm: export struct dm_dev dm crypt: avoid unnecessary wait when splitting bio dm crypt: tidy ctx pending dm crypt: fix async inc_pending dm crypt: move dec_pending on error into write_io_submit dm crypt: remove inc_pending from write_io_submit dm crypt: tidy write loop pending dm crypt: tidy crypt alloc dm crypt: tidy inc pending dm exception store: use chunk_t for_areas dm exception store: introduce area_location function dm raid1: kcopyd should stop on error if errors handled dm mpath: remove is_active from struct dm_path dm mpath: use more error codes Fixed up trivial conflict in drivers/md/dm-mpath.c manually.	2008-10-10 11:11:47 -07:00
Alasdair G Kergon	0c2322e4ce	dm: detect lost queue Detect and report buggy drivers that destroy their request_queue. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Stefan Raspl <raspl@linux.vnet.ibm.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2008-10-10 13:37:13 +01:00
Mikulas Patocka	5416090426	dm: publish dm_vcalloc Publish dm_vcalloc in include/linux/device-mapper.h because this function is used by targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:12 +01:00
Mikulas Patocka	ea0ec64094	dm: publish dm_table_unplug_all Publish dm_table_unplug_all in include/linux/device-mapper.h because this function is used by targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:11 +01:00
Mikulas Patocka	89343da077	dm: publish dm_get_mapinfo Publish dm_get_mapinfo in include/linux/device-mapper.h because this function is used by targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:10 +01:00
Mikulas Patocka	82b1519b34	dm: export struct dm_dev Split struct dm_dev in two and publish the part that other targets need in include/linux/device-mapper.h. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:09 +01:00
Milan Broz	933f01d433	dm crypt: avoid unnecessary wait when splitting bio Don't wait between submitting crypt requests for a bio unless we are short of memory. There are two situations when we must split an encrypted bio: 1) there are no free pages; 2) the new bio would violate underlying device restrictions (e.g. max hw segments). In case (2) we do not need to wait. Add output variable to crypt_alloc_buffer() to distinguish between these cases. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:08 +01:00
Milan Broz	c8081618a9	dm crypt: tidy ctx pending Move the initialisation of ctx->pending into one place, at the start of crypt_convert(). Introduce crypt_finished to indicate whether or not the encryption is finished, for use in a later patch. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:08 +01:00
Milan Broz	4e59409891	dm crypt: fix async inc_pending The pending reference count must be incremented before the async work is queued to another thread, not after. Otherwise there's a race if the work completes and decrements the reference count before it gets incremented. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:07 +01:00
Milan Broz	6c031f41db	dm crypt: move dec_pending on error into write_io_submit Make kcryptd_crypt_write_io_submit() responsible for decrementing the pending count after an error. Also fixes a bug in the async path that forgot to decrement it. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:06 +01:00
Alasdair G Kergon	1e37bb8e55	dm crypt: remove inc_pending from write_io_submit Make the caller reponsible for incrementing the pending count before calling kcryptd_crypt_write_io_submit() in the non-async case to bring it into line with the async case. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:05 +01:00
Milan Broz	fc5a5e9aa8	dm crypt: tidy write loop pending Move kcryptd_crypt_write_convert_loop inside kcryptd_crypt_write_convert. This change is needed for a later patch. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:04 +01:00
Milan Broz	dc440d1e56	dm crypt: tidy crypt alloc Factor out crypt io allocation code. Later patches will call it from another place. No functional change. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:03 +01:00
Milan Broz	3e1a8bdd05	dm crypt: tidy inc pending Move io pending to one place. No functional change, usefull to simplify debugging. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:02 +01:00
Mikulas Patocka	fd14acf6fc	dm exception store: use chunk_t for_areas Change uint32_t into chunk_t to remove 32-bit limitation on the number of chunks on systems with 64-bit sector numbers. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:01 +01:00
Mikulas Patocka	a481db7846	dm exception store: introduce area_location function Move this logic to a function, because it will be reused later. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:37:00 +01:00
Jonathan Brassow	f7c83e2e47	dm raid1: kcopyd should stop on error if errors handled dm-raid1 is setting the 'DM_KCOPYD_IGNORE_ERROR' flag unconditionally when assigning kcopyd work. kcopyd is responsible for copying an assigned section of disk to one or more other disks. The 'DM_KCOPYD_IGNORE_ERROR' flag affects kcopyd in the following way: When not set: kcopyd will immediately stop the copy operation when an error is encountered. When set: kcopyd will try to proceed regardless of errors and try to continue copying any remaining amount. Since dm-raid1 tracks regions of the address space that are (or are not) in sync and it now has the ability to handle these errors, we can safely enable this optimization. This optimization is conditional on whether mirror error handling has been enabled. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:36:59 +01:00
Kiyoshi Ueda	6680073d3e	dm mpath: remove is_active from struct dm_path This patch moves 'is_active' from struct dm_path to struct pgpath as it does not need exporting. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:36:58 +01:00
Benjamin Marzinski	01460f3520	dm mpath: use more error codes This patch allows path errors from the multipath ctr function to propagate up to userspace as errno values from the ioctl() call. This is in response to https://www.redhat.com/archives/dm-devel/2008-May/msg00000.html and https://bugzilla.redhat.com/show_bug.cgi?id=444421 The patch only lets through the errors that it needs to in order to get the path errors from parse_path(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-10 13:36:57 +01:00
Denis ChengRq	6feef531f5	block: mark bio_split_pool static Since all bio_split calls refer the same single bio_split_pool, the bio_split function can use bio_split_pool directly instead of the mempool_t parameter; then the mempool_t parameter can be removed from bio_split param list, and bio_split_pool is only referred in fs/bio.c file, can be marked static. Signed-off-by: Denis ChengRq <crquan@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:57:05 +02:00
Mike Anderson	224cb3e981	dm: Call blk_abort_queue on failed paths Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:14 +02:00
Tejun Heo	074a7aca7a	block: move stats from disk to part0 Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_() now updates part0 together if the specified partition is not part0. ie. part_stat_() are now essentially all_stat_(). {disk\|all}_stat_() are gone. part_round_stats() is updated similary. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc\|dec}_in_fligh() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	0762b8bde9	block: always set bdev->bd_part Till now, bdev->bd_part is set only if the bdev was for parts other than part0. This patch makes bdev->bd_part always set so that code paths don't have to differenciate common handling. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:08 +02:00
Tejun Heo	b7db9956e5	block: move policy from disk to part0 Move disk->policy to part0->policy. Implement and use get_disk_ro(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	ed9e198234	block: implement and use {disk\|part}_to_dev() Implement {disk\|part}_to_dev() and use them to access generic device instead of directly dereferencing {disk\|part}->dev. To make sure no user is left behind, rename generic devices fields to __dev. This is in preparation of unifying partition 0 handling with other partitions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:07 +02:00
Tejun Heo	c995905916	block: fix diskstats access There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemtion is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disables preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now has @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:06 +02:00
Tejun Heo	f331c0296f	block: don't depend on consecutive minor space * Implement disk_devt() and part_devt() and use them to directly access devt instead of computing it from ->major and ->first_minor. Note that all references to ->major and ->first_minor outside of block layer is used to determine devt of the disk (the part0) and as ->major and ->first_minor will continue to represent devt for the disk, converting these users aren't strictly necessary. However, convert them for consistency. * Implement disk_max_parts() to avoid directly deferencing genhd->minors. * Update bdget_disk() such that it doesn't assume consecutive minor space. * Move devt computation from register_disk() to add_disk() and make it the only one (all other usages use the initially determined value). These changes clean up the code and will help disk->part dereference fix and extended block device numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:05 +02:00
Jens Axboe	5b99c2ffa9	block: make bi_phys_segments an unsigned int instead of short raid5 can overflow with more than 255 stripes, and we can increase it to an int for free on both 32 and 64-bit archs due to the padding. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Jens Axboe	960e739d9e	block: raid fixups for removal of bi_hw_segments Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Mikulas Patocka	5df97b91b5	drop vmerge accounting Remove hw_segments field from struct bio and struct request. Without virtual merge accounting they have no purpose. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-09 08:56:03 +02:00
Chandra Seetharaman	7253a33434	dm mpath: add missing path switching locking Moving the path activation to workqueue along with scsi_dh patches introduced a race. It is due to the fact that the current_pgpath (in the multipath data structure) can be modified if changes happen in any of the paths leading to the lun. If the changes lead to current_pgpath being set to NULL, then it leads to the invalid access which results in the panic below. This patch fixes that by storing the pgpath to activate in the multipath data structure and properly protecting it. Note that if activate_path is called twice in succession with different pgpath, with the second one being called before the first one is done, then activate path will be called twice for the second pgpath, which is fine. Unable to handle kernel paging request for data at address 0x00000020 Faulting instruction address: 0xd000000000aa1844 cpu 0x1: Vector: 300 (Data Access) at [c00000006b987a80] pc: d000000000aa1844: .activate_path+0x30/0x218 [dm_multipath] lr: c000000000087a2c: .run_workqueue+0x114/0x204 sp: c00000006b987d00 msr: 8000000000009032 dar: 20 dsisr: 40000000 current = 0xc0000000676bb3f0 paca = 0xc0000000006f3680 pid = 2528, comm = kmpath_handlerd enter ? for help [c00000006b987da0] c000000000087a2c .run_workqueue+0x114/0x204 [c00000006b987e40] c000000000088b58 .worker_thread+0x120/0x144 [c00000006b987f00] c00000000008ca70 .kthread+0x78/0xc4 [c00000006b987f90] c000000000027cc8 .kernel_thread+0x4c/0x68 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-01 14:39:27 +01:00
Mikulas Patocka	b01cd5ac43	dm: cope with access beyond end of device in dm_merge_bvec If for any reason dm_merge_bvec() is given an offset beyond the end of the device, avoid an oops and always allow one page to be added to an empty bio. We'll reject the I/O later after the bio is submitted. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-01 14:39:24 +01:00
Mikulas Patocka	5037108acd	dm: always allow one page in dm_merge_bvec Some callers assume they can always add at least one page to an empty bio, so dm_merge_bvec should not return 0 in this case: we'll reject the I/O later after the bio is submitted. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-10-01 14:39:17 +01:00
NeilBrown	9744197c3d	md: Don't wait UNINTERRUPTIBLE for other resync to finish When two md arrays share some block device (e.g each uses different partitions on the one device), a resync of one array will wait for the resync on the other to finish. This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE, the softlockup code notices and complains. So use TASK_INTERRUPTIBLE instead and make sure to flush signals before calling schedule. Signed-off-by: NeilBrown <neilb@suse.de>	2008-09-19 11:49:54 +10:00
NeilBrown	b2d2c4cead	Fix problem with waiting while holding rcu read lock in md/bitmap.c A recent patch to protect the rdev list with rcu locking leaves us with a problem because we can sleep on memalloc while holding the rcu lock. The rcu lock is only needed while walking the linked list as uninteresting devices (failed or spares) can be removed at any time. So only take the rcu lock while actually walking the linked list. Take a refcount on the rdev during the time when we drop the lock and do the memalloc to start IO. When we return to the locked code, all the interesting devices on the list will not have moved, so we can simply use list_for_each_continue_rcu to pick up where we left off. Signed-off-by: NeilBrown <neilb@suse.de>	2008-09-01 12:48:13 +10:00
NeilBrown	271f5a9b8f	Remove invalidate_partition call from do_md_stop. When stopping an md array, or just switching to read-only, we currently call invalidate_partition while holding the mddev lock. The main reason for this is probably to ensure all dirty buffers are flushed (invalidate_partition calls fsync_bdev). However if any dirty buffers are found, it will almost certainly cause a deadlock as starting writeout will require an update to the superblock, and performing that updates requires taking the mddev lock - which is already held. This deadlock can be demonstrated by running "reboot -f -n" with a root filesystem on md/raid, and some dirty buffers in memory. All other calls to stop an array should already happen after a flush. The normal sequence is to stop using the array (e.g. umount) which will cause __blkdev_put to call sync_blockdev. Then open the array and issue the STOP_ARRAY ioctl while the buffers are all still clean. So this invalidate_partition is normally a no-op, except for one case where it will cause a deadlock. So remove it. This patch possibly addresses the regression recored in http://bugzilla.kernel.org/show_bug.cgi?id=11460 and http://bugzilla.kernel.org/show_bug.cgi?id=11452 though it isn't yet clear how it ever worked. Signed-off-by: NeilBrown <neilb@suse.de>	2008-09-01 12:32:52 +10:00
Dan Williams	56ac36d722	md: cancel check/repair requests when recovery is needed If a 'repair' is requested when an array is in a position to 'recover' raid1 will perform the repair while md believes a recovery is happening. Address this at both ends, i.e. cancel check/repair requests upon detecting a recover condition and do not call ->spare_active after completing a check/repair. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-08-07 10:02:47 -07:00
NeilBrown	0310fa216d	Allow raid10 resync to happening in larger chunks. The raid10 resync/recovery code currently limits the amount of in-flight resync IO to 2Meg. This was copied from raid1 where it seems quite adequate. However for raid10, some layouts require a bit of seeking to perform a resync, and allowing a larger buffer size means that the seeking can be significantly reduced. There is probably no real need to limit the amount of in-flight IO at all. Any shortage of memory will naturally reduce the amount of buffer space available down to a set minimum, and any concurrent normal IO will quickly cause resync IO to back off. The only problem would be that normal IO has to wait for all resync IO to finish, so a very large amount of resync IO could cause unpleasant latency when normal IO starts up. So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is available) which seems to be a good amount. Also reduce the amount of memory reserved as there is no need to keep 2Meg just for resync if memory is tight. Thanks to Keld for the suggestion. Cc: Keld Jørn Simonsen <keld@dkuug.dk> Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	c89a8eee61	Allow faulty devices to be removed from a readonly array. Removing faulty devices from an array is a two stage process. First the device is moved from being a part of the active array to being similar to a spare device. Then it can be removed by a request from user space. The first step is currently not performed for read-only arrays, so the second step can never succeed. So allow readonly arrays to remove failed devices (which aren't blocked). Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	ac4090d24c	Don't let a blocked_rdev interfere with read request in raid5/6 When we have externally managed metadata, we need to mark a failed device as 'Blocked' and not allow any writes until that device have been marked as faulty in the metadata and the Blocked flag has been removed. However it is perfectly OK to allow read requests when there is a Blocked device, and with a readonly array, there may not be any metadata-handler watching for blocked devices. So in raid5/raid6 only allow a Blocked device to interfere with Write request or resync. Read requests go through untouched. raid1 and raid10 already differentiate between read and write properly. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	dba034eef2	Fail safely when trying to grow an array with a write-intent bitmap. We cannot currently change the size of a write-intent bitmap. So if we change the size of an array which has such a bitmap, it tries to set bits beyond the end of the bitmap. For now, simply reject any request to change the size of an array which has a bitmap. mdadm can remove the bitmap and add a new one after the array has changed size. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:32 +10:00
NeilBrown	2b25000bf5	Restore force switch of md array to readonly at reboot time. A recent patch allowed do_md_stop to know whether it was being called via an ioctl or not, and thus where to allow for an extra open file descriptor when checking if it is in use. This broke then switch to readonly performed by the shutdown notifier, which needs to work even when the array is still (apparently) active (as md doesn't get told when the filesystem becomes readonly). So restore this feature by pretending that there can be lots of file descriptors open, but we still want do_md_stop to switch to readonly. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:31 +10:00
NeilBrown	19052c0e85	Make writes to md/safe_mode_delay immediately effective. If we reduce the 'safe_mode_delay', it could still wait for the old delay to completely expire before doing anything about safe_mode. Thus the effect if the change is delayed. To make the effect more immediate, run the timeout function immediately if the delay was reduced. This may cause it to run slightly earlier that required, but that is the safer option. Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-05 15:56:31 +10:00
Linus Torvalds	1e24b15b26	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: raid10: wake up frozen array md: do not count blocked devices as spares md: do not progress the resync process if the stripe was blocked md: delay notification of 'active_idle' to the recovery thread md: fix merge error md: move async_tx_issue_pending_all outside spin_lock_irq	2008-08-01 11:56:07 -07:00
Linus Torvalds	b17b3d479c	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: md: the bitmap code needs to use blk_plug_device_unlocked() block: add a blk_plug_device_unlocked() that grabs the queue lock	2008-08-01 11:46:00 -07:00
Jens Axboe	93769f5807	md: the bitmap code needs to use blk_plug_device_unlocked() It doesn't hold the queue lock, so it's both racey on the queue flags and thus spews a warning. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-08-01 20:32:31 +02:00
Al Viro	d5686b444f	[PATCH] switch mtd and dm-table to lookup_bdev() No need to open-code it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 11:25:31 -04:00
Arthur Jones	388667bed5	md: raid10: wake up frozen array When rescheduling a bio in raid10, we wake up the md thread, but if the array is frozen, this will have no effect. This causes the array to remain frozen for eternity. We add a wake_up to allow the array to de-freeze. This code is nearly identical to the raid1 code, which has this fix already. Signed-off-by: Arthur Jones <ajones@riverbed.com> Signed-off-by: NeilBrown <neilb@suse.de>	2008-08-01 12:55:14 +10:00
Dan Williams	e542713529	md: do not count blocked devices as spares remove_and_add_spares() assumes that failed devices have been hot-removed from the array. Removal is skipped in the 'blocked' case so do not count a device in this state as 'spare'. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-28 17:52:44 -07:00
Dan Williams	df10cfbc4d	md: do not progress the resync process if the stripe was blocked handle_stripe will take no action on a stripe when waiting for userspace to unblock the array, so do not report completed sectors. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-28 17:52:37 -07:00
Hannes Reinecke	ae11b1b36d	[SCSI] scsi_dh: attach to hardware handler from dm-mpath multipath keeps a separate device table which may be more current than the built-in one. So we should make sure to always call ->attach whenever a multipath map with hardware handler is instantiated. And we should call ->detach on removal, too. [sekharan: update as per comments from agk] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-26 15:14:53 -04:00
Dan Williams	d8e64406a0	md: delay notification of 'active_idle' to the recovery thread sysfs_notify might sleep, so do not call it from md_safemode_timeout. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-23 13:09:48 -07:00
Dan Williams	2339788376	md: fix merge error The original STRIPE_OP_IO removal patch had the following hunk: - for (i = conf->raid_disks; i--; ) { + for (i = conf->raid_disks; i--; ) set_bit(R5_Wantwrite, &sh->dev[i].flags); - if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - } However it appears the hunk became broken after merging: - for (i = conf->raid_disks; i--; ) { + for (i = conf->raid_disks; i--; ) set_bit(R5_Wantwrite, &sh->dev[i].flags); set_bit(R5_LOCKED, &dev->flags); s.locked++; - if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - } Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-23 13:09:45 -07:00
Dan Williams	c9f21aaff1	md: move async_tx_issue_pending_all outside spin_lock_irq Some dma drivers need to call spin_lock_bh in their device_issue_pending routines. This change avoids: WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85() Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-23 12:05:51 -07:00
Linus Torvalds	b7e6f62fe2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm crypt: add merge dm table: remove merge_bvec sector restriction dm: linear add merge dm: introduce merge_bvec_fn dm snapshot: use per device mempools dm snapshot: fix race during exception creation dm snapshot: track snapshot reads dm mpath: fix test for reinstate_path dm mpath: return parameter error dm io: remove struct padding dm log: make dm_dirty_log init and exit static dm mpath: free path selector on invalid args	2008-07-21 10:30:10 -07:00
Linus Torvalds	8a392625b6	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (52 commits) md: Protect access to mddev->disks list using RCU md: only count actual openers as access which prevent a 'stop' md: linear: Make array_size sector-based and rename it to array_sectors. md: Make mddev->array_size sector-based. md: Make super_type->rdev_size_change() take sector-based sizes. md: Fix check for overlapping devices. md: Tidy up rdev_size_store a bit: md: Remove some unused macros. md: Turn rdev->sb_offset into a sector-based quantity. md: Make calc_dev_sboffset() return a sector count. md: Replace calc_dev_size() by calc_num_sectors(). md: Make update_size() take the number of sectors. md: Better control of when do_md_stop is allowed to stop the array. md: get_disk_info(): Don't convert between signed and unsigned and back. md: Simplify restart_array(). md: alloc_disk_sb(): Return proper error value. md: Simplify sb_equal(). md: Simplify uuid_equal(). md: sb_equal(): Fix misleading printk. md: Fix a typo in the comment to cmd_match(). ...	2008-07-21 10:29:12 -07:00
Milan Broz	d41e26b901	dm crypt: add merge This patch implements biovec merge function for crypt target. If the underlying device has merge function defined, call it. If not, keep precomputed value. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:40 +01:00
Milan Broz	9980c638a6	dm table: remove merge_bvec sector restriction Remove max_sector restriction - merge function replaced it. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:39 +01:00
Milan Broz	7bc3447b69	dm: linear add merge This patch implements biovec merge function for linear target. If the underlying device has merge function defined, call it. If not, keep precomputed value. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:38 +01:00
Milan Broz	f6fccb1213	dm: introduce merge_bvec_fn Introduce a bvec merge function for device mapper devices for dynamic size restrictions. This code ensures the requested biovec lies within a single target and then calls a target-specific function to check against any constraints imposed by underlying devices. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:37 +01:00
Mikulas Patocka	92e868122e	dm snapshot: use per device mempools Change snapshot per-module mempool to per-device mempool. Per-module mempools could cause a deadlock if multiple snapshot devices are stacked above each other. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:35 +01:00
Mikulas Patocka	a8d41b59f3	dm snapshot: fix race during exception creation Fix a race condition that returns incorrect data when a write causes an exception to be allocated whilst a read is still in flight. The race condition happens as follows: * A read to non-reallocated sector in the snapshot is submitted so that the read is routed to the original device. * A write to the original device is submitted. The write causes an exception that reallocates the block. The write proceeds. * The original read is dequeued and reads the wrong data. This race can be triggered with CFQ scheduler and one thread writing and multiple threads reading simultaneously. (This patch relies upon the earlier dm-kcopyd-per-device.patch to avoid a deadlock.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:34 +01:00
Mikulas Patocka	cd45daffd1	dm snapshot: track snapshot reads Whenever a snapshot read gets mapped through to the origin, track it in a per-snapshot hash table indexed by chunk number, using memory allocated from a new per-snapshot mempool. We need to track these reads to avoid race conditions which will be fixed by patches that follow. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:32 +01:00
Alasdair G Kergon	def052d21c	dm mpath: fix test for reinstate_path Fix test for reinstate_path method before attempting to use it. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Julia Lawall <julia@diku.dk>	2008-07-21 12:00:31 +01:00
Mikulas Patocka	148acff615	dm mpath: return parameter error Return a specific error message if there are an invalid number of multipath arguments. This invalid command returns an "Unknown error" because the ti->error field is not set dmsetup create --table '0 2 multipath 0 0 1 1 round-robin 0 1 1 /dev/sdh' mpath0 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:30 +01:00
Richard Kennedy	6ae2fa6718	dm io: remove struct padding Rearrange struct dm_io. Shrinks size from 40 -> 32 allowing more objects/slab. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:28 +01:00
Adrian Bunk	c8da2f8dd8	dm log: make dm_dirty_log init and exit static dm_dirty_log_{init,exit}() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:27 +01:00
Mikulas Patocka	371b2e348b	dm mpath: free path selector on invalid args Free path selector if the arguments are invalid. This command (note that it is invalid) causes reference leak on module "dm_round_robin" and prevents the module from being removed. dmsetup create --table '0 2 multipath 0 0 1 1 round-robin /dev/sdh' mpath0 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-21 12:00:24 +01:00
NeilBrown	4b80991c6c	md: Protect access to mddev->disks list using RCU All modifications and most access to the mddev->disks list are made under the reconfig_mutex lock. However there are three places where the list is walked without any locking. If a reconfig happens at this time, havoc (and oops) can ensue. So use RCU to protect these accesses: - wrap them in rcu_read_{,un}lock() - use list_for_each_entry_rcu - add to the list with list_add_rcu - delete from the list with list_del_rcu - delay the 'free' with call_rcu rather than schedule_work Note that export_rdev did a list_del_init on this list. In almost all cases the entry was not in the list anymore so it was a no-op and so safe. It is no longer safe as after list_del_rcu we may not touch the list_head. An audit shows that export_rdev is called: - after unbind_rdev_from_array, in which case the delete has already been done, - after bind_rdev_to_array fails, in which case the delete isn't needed. - before the device has been put on a list at all (e.g. in add_new_disk where reading the superblock fails). - and in autorun devices after a failure when the device is on a different list. So remove the list_del_init call from export_rdev, and add it back immediately before the called to export_rdev for that last case. Note also that ->same_set is sometimes used for lists other than mddev->list (e.g. candidates). In these cases rcu is not needed. Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 17:05:25 +10:00
NeilBrown	f2ea68cf42	md: only count actual openers as access which prevent a 'stop' Open isn't the only thing that increments ->active. e.g. reading /proc/mdstat will increment it briefly. So to avoid false positives in testing for concurrent access, introduce a new counter that counts just the number of times the md device it open. Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 17:05:25 +10:00
Andre Noll	d6e2215052	md: linear: Make array_size sector-based and rename it to array_sectors. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 17:05:25 +10:00
Andre Noll	f233ea5c9e	md: Make mddev->array_size sector-based. This patch renames the array_size field of struct mddev_s to array_sectors and converts all instances to use units of 512 byte sectors instead of 1k blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 17:05:22 +10:00
Andre Noll	15f4a5fdf3	md: Make super_type->rdev_size_change() take sector-based sizes. Also, change the type of the size parameter from unsigned long long to sector_t and rename it to num_sectors. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 14:42:12 +10:00
Andre Noll	d07bd3bcc4	md: Fix check for overlapping devices. The checks in overlaps() expect all parameters either in block-based or sector-based quantities. However, its single caller passes two rdev->data_offset arguments as well as two rdev->size arguments, the former being sector counts while the latter are measured in 1K blocks. This could cause rdev_size_store() to accept an invalid size from user space. Fix it by passing only sector-based quantities to overlaps(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 14:42:07 +10:00
Neil Brown	d7027458d6	md: Tidy up rdev_size_store a bit: - used strict_strtoull in place of simple_strtoull - use my_mddev in place of rdev->mddev (they have the same value) and more significantly, - don't adjust mddev->size to fit, rather reject changes which make rdev->size smaller than mddev->size Adjusting mddev->size is a hangover from bind_rdev_to_array which does a similar thing. But it really is a better design to insist that mddev->size is set as required, then the rdev->sizes are set to allow for that. The previous way invites confusion. Signed-off-by: NeilBrown <neilb@suse.de>	2008-07-21 14:22:18 +10:00
Linus Torvalds	89a93f2f48	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (102 commits) [SCSI] scsi_dh: fix kconfig related build errors [SCSI] sym53c8xx: Fix bogus sym_que_entry re-implementation of container_of [SCSI] scsi_cmnd.h: remove double inclusion of linux/blkdev.h [SCSI] make struct scsi_{host,target}_type static [SCSI] fix locking in host use of blk_plug_device() [SCSI] zfcp: Cleanup external header file [SCSI] zfcp: Cleanup code in zfcp_erp.c [SCSI] zfcp: zfcp_fsf cleanup. [SCSI] zfcp: consolidate sysfs things into one file. [SCSI] zfcp: Cleanup of code in zfcp_aux.c [SCSI] zfcp: Cleanup of code in zfcp_scsi.c [SCSI] zfcp: Move status accessors from zfcp to SCSI include file. [SCSI] zfcp: Small QDIO cleanups [SCSI] zfcp: Adapter reopen for large number of unsolicited status [SCSI] zfcp: Fix error checking for ELS ADISC requests [SCSI] zfcp: wait until adapter is finished with ERP during auto-port [SCSI] ibmvfc: IBM Power Virtual Fibre Channel Adapter Client Driver [SCSI] sg: Add target reset support [SCSI] lib: Add support for the T10 (SCSI) Data Integrity Field CRC [SCSI] sd: Move scsi_disk() accessor function to sd.h ...	2008-07-15 18:58:04 -07:00
Chandra Seetharaman	fe9233fb69	[SCSI] scsi_dh: fix kconfig related build errors Do not automatically "select" SCSI_DH for dm-multipath. If SCSI_DH doesn't exist,just do not allow hardware handlers to be used. Handle SCSI_DH being a module also. Make sure it doesn't allow DM_MULTIPATH to be compiled in when SCSI_DH is a module. [jejb: added comment for Kconfig syntax] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-07-15 09:16:43 -05:00
Linus Torvalds	dddec01eb8	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits) splice: fix generic_file_splice_read() race with page invalidation ramfs: enable splice write drivers/block/pktcdvd.c: avoid useless memset cdrom: revert commit `22a9189` (cdrom: use kmalloced buffers instead of buffers on stack) scsi: sr avoids useless buffer allocation block: blk_rq_map_kern uses the bounce buffers for stack buffers block: add blk_queue_update_dma_pad DAC960: push down BKL pktcdvd: push BKL down into driver paride: push ioctl down into driver block: use get_unaligned_* helpers block: extend queue_flag bitops block: request_module(): use format string Add bvec_merge_data to handle stacked devices and ->merge_bvec() block: integrity flags can't use bit ops on unsigned short cmdfilter: extend default read filter sg: fix odd style (extra parenthesis) introduced by cmd filter patch block: add bounce support to blk_rq_map_user_iov cfq-iosched: get rid of enable_idle being unused warning allow userspace to modify scsi command filter on per device basis ...	2008-07-14 13:15:14 -07:00
Andre Noll	0f420358e3	md: Turn rdev->sb_offset into a sector-based quantity. Rename it to sb_start to make sure all users have been converted. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:23 +10:00
Andre Noll	b73df2d3d6	md: Make calc_dev_sboffset() return a sector count. As BLOCK_SIZE_BITS is 10 and MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x), the return value of calc_dev_sboffset() doubles. Fix up all three callers accordingly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:23 +10:00
Andre Noll	e7debaa495	md: Replace calc_dev_size() by calc_num_sectors(). Number of sectors is the preferred unit for sizes of raid devices, so change calc_dev_size() so that it returns this unit instead of the number of 1K blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:23 +10:00
Andre Noll	d71f9f88d7	md: Make update_size() take the number of sectors. Changing the internal representations of sizes of raid devices from 1K blocks to sector counts (512B units) is desirable because it allows to get rid of many divisions/multiplications and unnecessary casts that are present in the current code. This patch is a first step in this direction. It replaces the old 1K-based "size" argument of update_size() by "num_sectors" and fixes up its two callers. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:22 +10:00
Neil Brown	df5b20cf68	md: Better control of when do_md_stop is allowed to stop the array. do_md_stop check the number of active users before allowing the array to be stopped. Two problems: 1/ it assumes the request is coming through an open file descriptor (via ioctl) so it allows for that. This is not always the case. 2/ it doesn't do the check it the array hasn't been activated. This is not good for cases when we use an inactive array to hold some devices in a container. Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:22 +10:00
Andre Noll	26ef379f53	md: get_disk_info(): Don't convert between signed and unsigned and back. The current code copies a signed int from user space, converts it to unsigned and passes the unsigned value to find_rdev_nr() which expects a signed value. Simply pass the signed value from user space directly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:21 +10:00
Andre Noll	80fab1d77b	md: Simplify restart_array(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:21 +10:00
Andre Noll	ebc2433728	md: alloc_disk_sb(): Return proper error value. If alloc_page() fails, ENOMEM is a more suitable error value than EINVAL. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:20 +10:00
Andre Noll	ce0c8e05f8	md: Simplify sb_equal(). The only caller of sb_equal() tests the return value against zero, so it's OK to return the negated return value of memcmp(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:20 +10:00
Andre Noll	05710466c9	md: Simplify uuid_equal(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-11 22:02:20 +10:00
Linus Torvalds	2283af5b0b	Merge branch 'for-2.6.26' of git://neil.brown.name/md * 'for-2.6.26' of git://neil.brown.name/md: md: ensure all blocks are uptodate or locked when syncing	2008-07-10 09:49:46 -07:00
Dan Williams	7a1fc53c5a	md: ensure all blocks are uptodate or locked when syncing Remove the dubious attempt to prefer 'compute' over 'read'. Not only is it wrong given commit `c337869d` (md: do not compute parity unless it is on a failed drive), but it can trigger a BUG_ON in handle_parity_checks5(). Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-10 15:25:18 +10:00
Andre Noll	35020f1a06	md: sb_equal(): Fix misleading printk. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-08 10:53:20 +10:00
Andre Noll	7f6ce76928	md: Fix a typo in the comment to cmd_match(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-08 10:53:00 +10:00
Andre Noll	910d8cb3f4	md: Fix typo in array_state comment. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-08 10:52:45 +10:00
Andre Noll	9687a60c78	md: sync_speed_show(): Trivial cleanups. - Remove superfluous parentheses. - Make format string match the type of the variable that is printed. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-08 10:52:26 +10:00
Andre Noll	13e53df354	md: do_md_run(): Fix misleading error message. In case pers->run() succeeds but creating the bitmap fails, we print an error message stating that pers->run() has failed. Print this message only if pers->run() really failed. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-08 10:52:15 +10:00
Andre Noll	2f9618ce63	md: md_getgeo(): Move comment to proper position. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-08 10:52:00 +10:00
Andre Noll	bb57fc64b2	md: md_ioctl(): Fix misleading indentation. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-07-08 10:51:29 +10:00
Neil Brown	0529613a19	Merge branch 'for-neil' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/md into for-next	2008-07-08 10:13:28 +10:00
Neil Brown	5b1a4bf220	Merge branch 'master' into for-next	2008-07-08 10:11:50 +10:00
Alasdair G Kergon	cc371e66e3	Add bvec_merge_data to handle stacked devices and ->merge_bvec() When devices are stacked, one device's merge_bvec_fn may need to perform the mapping and then call one or more functions for its underlying devices. The following bio fields are used: bio->bi_sector bio->bi_bdev bio->bi_size bio->bi_rw using bio_data_dir() This patch creates a new struct bvec_merge_data holding a copy of those fields to avoid having to change them directly in the struct bio when going down the stack only to have to change them back again on the way back up. (And then when the bio gets mapped for real, the whole exercise gets repeated, but that's a problem for another day...) Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-07-03 13:21:15 +02:00
Linus Torvalds	cefcade9e7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm crypt: use cond_resched	2008-07-02 18:55:17 -07:00
Milan Broz	c7f1b20441	dm crypt: use cond_resched Add cond_resched() to prevent monopolising CPU when processing large bios. dm-crypt processes encryption of bios in sector units. If the bio request is big it can spend a long time in the encryption call. Signed-off-by: Milan Broz <mbroz@redhat.com> Tested-by: Yan Li <elliot.li.tech@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-07-02 09:34:28 +01:00
Dan Williams	b5470dc5fc	md: resolve external metadata handling deadlock in md_allow_write md_allow_write() marks the metadata dirty while holding mddev->lock and then waits for the write to complete. For externally managed metadata this causes a deadlock as userspace needs to take the lock to communicate that the metadata update has completed. Change md_allow_write() in the 'external' case to start the 'mark active' operation and then return -EAGAIN. The expected side effects while waiting for userspace to write 'active' to 'array_state' are holding off reshape (code currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and cause updates to 'raid_disks' to fail. Except for 'stripe_cache_size' changes these failures can be mitigated by coordinating with mdmon. md_write_start() still prevents writes from occurring until the metadata handler has had a chance to take action as it unconditionally waits for MD_CHANGE_CLEAN to be cleared. [neilb@suse.de: return -EAGAIN, try GFP_NOIO] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-06-30 17:18:19 -07:00
Dan Williams	1fe797e67f	md: rationalize raid5 function names From: Dan Williams <dan.j.williams@intel.com> Commit `a4456856` refactored some of the deep code paths in raid5.c into separate functions. The names chosen at the time do not consistently indicate what is going to happen to the stripe. So, update the names, and since a stripe is a cache element use cache semantics like fill, dirty, and clean. (also, fix up the indentation in fetch_block5) Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 09:16:30 +10:00
Dan Williams	7b3a871ed9	md: handle operation chaining in raid5_run_ops From: Dan Williams <dan.j.williams@intel.com> Neil said: > At the end of ops_run_compute5 you have: > /* ack now if postxor is not set to be run */ > if (tx && !test_bit(STRIPE_OP_POSTXOR, &s->ops_run)) > async_tx_ack(tx); > > It looks odd having that test there. Would it fit in raid5_run_ops > better? The intended global interpretation is that raid5_run_ops can build a chain of xor and memcpy operations. When MD registers the compute-xor it tells async_tx to keep the operation handle around so that another item in the dependency chain can be submitted. If we are just computing a block to satisfy a read then we can terminate the chain immediately. raid5_run_ops gives a better context for this test since it cares about the entire chain. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:32:09 +10:00
Dan Williams	d8ee0728b5	md: replace R5_WantPrexor with R5_WantDrain, add 'prexor' reconstruct_states From: Dan Williams <dan.j.williams@intel.com> Currently ops_run_biodrain and other locations have extra logic to determine which blocks are processed in the prexor and non-prexor cases. This can be eliminated if handle_write_operations5 flags the blocks to be processed in all cases via R5_Wantdrain. The presence of the prexor operation is tracked in sh->reconstruct_state. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:32:06 +10:00
Dan Williams	600aa10993	md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states' From: Dan Williams <dan.j.williams@intel.com> Track the state of reconstruct operations (recalculating the parity block usually due to incoming writes, or as part of array expansion) Reduces the scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether a reconstruct operation has been requested via the ops_request field of struct stripe_head_state. This is the final step in the removal of ops.{pending,ack,complete,count}, i.e. the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do not track the state of the operation. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:32:05 +10:00
Dan Williams	976ea8d475	md: replace STRIPE_OP_COMPUTE_BLK with STRIPE_COMPUTE_RUN From: Dan Williams <dan.j.williams@intel.com> Track the state of compute operations (recalculating a block from all the other blocks in a stripe) with a state flag. Reduces the scope of the STRIPE_OP_COMPUTE_BLK flag to only tracking whether a compute operation has been requested via the ops_request field of struct stripe_head_state. Note, the compute operation that is performed in the course of doing a 'repair' operation (check the parity block, recalculate it and write it back if the check result is not zero) is tracked separately with the 'check_state' variable. Compute operations are held off while a 'check' is in progress, and moving this check out to handle_issuing_new_read_requests5 the helper routine __handle_issuing_new_read_requests5 can be simplified. This is another step towards the removal of ops.{pending,ack,complete,count}, i.e. STRIPE_OP_COMPUTE_BLK only requests an operation and does not track the state of the operation. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:32:03 +10:00
Dan Williams	83de75cc92	md: replace STRIPE_OP_BIOFILL with STRIPE_BIOFILL_RUN From: Dan Williams <dan.j.williams@intel.com> Track the state of read operations (copying data from the stripe cache to bio buffers outside the lock) with a state flag. Reduce the scope of the STRIPE_OP_BIOFILL flag to only tracking whether a biofill operation has been requested via the ops_request field of struct stripe_head_state. This is another step towards the removal of ops.{pending,ack,complete,count}, i.e. STRIPE_OP_BIOFILL only requests an operation and does not track the state of the operation. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:58 +10:00
Dan Williams	ecc65c9b3f	md: replace STRIPE_OP_CHECK with 'check_states' From: Dan Williams <dan.j.williams@intel.com> The STRIPE_OP_* flags record the state of stripe operations which are performed outside the stripe lock. Their use in indicating which operations need to be run is straightforward; however, interpolating what the next state of the stripe should be based on a given combination of these flags is not straightforward, and has led to bugs. An easier to read implementation with minimal degrees of freedom is needed. Towards this goal, this patch introduces explicit states to replace what was previously interpolated from the STRIPE_OP_* flags. For now this only converts the handle_parity_checks5 path, removing a user of the ops.{pending,ack,complete,count} fields of struct stripe_operations. This conversion also found a remaining issue with the current code. There is a small window for a drive to fail between when we schedule a repair and when the parity calculation for that repair completes. When this happens we will writeback to 'failed_num' when we really want to write back to 'pd_idx'. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:57 +10:00
Dan Williams	f0e43bcdeb	md: unify raid5/6 i/o submission From: Dan Williams <dan.j.williams@intel.com> Let the raid6 path call ops_run_io to get pending i/o submitted. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:55 +10:00
Dan Williams	c4e5ac0a22	md: use stripe_head_state in ops_run_io() From: Dan Williams <dan.j.williams@intel.com> In handle_stripe after taking sh->lock we sample some bits into 's' (struct stripe_head_state): s.syncing = test_bit(STRIPE_SYNCING, &sh->state); s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state); Use these values from 's' in ops_run_io() rather than re-sampling the bits. This ensures a consistent snapshot (as seen under sh->lock) is used. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:53 +10:00
Dan Williams	2b7497f0e0	md: kill STRIPE_OP_IO flag From: Dan Williams <dan.j.williams@intel.com> The R5_Want{Read,Write} flags already gate i/o. So, this flag is superfluous and we can unconditionally call ops_run_io(). Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:52 +10:00
Dan Williams	b203886edb	md: kill STRIPE_OP_MOD_DMA in raid5 offload From: Dan Williams <dan.j.williams@intel.com> This micro-optimization allowed the raid code to skip a re-read of the parity block after checking parity. It took advantage of the fact that xor-offload-engines have their own internal result buffer and can check parity without writing to memory. Remove it for the following reasons: 1/ It is a layering violation for MD to need to manage the DMA and non-DMA paths within async_xor_zero_sum 2/ Bad precedent to toggle the 'ops' flags outside the lock 3/ Hard to realize a performance gain as reads will not need an updated parity block and writes will dirty it anyways. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:50 +10:00
Chris Webb	0cd17fec98	Support changing rdev size on running arrays. From: Chris Webb <chris@arachsys.com> Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the superblock if necessary for this metadata version. We prevent the available space from shrinking to less than the used size, and allow it to be set to zero to fill all the available space on the underlying device. Signed-off-by: Chris Webb <chris@arachsys.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:46 +10:00
Neil Brown	526647320e	Make sure all changes to md/dev-XX/state are notified The important state change happens during an interrupt in md_error. So just set a flag there and call sysfs_notify later in process context. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:44 +10:00
Neil Brown	a99ac97113	Make sure all changes to md/degraded are notified. When a device fails, when a spare is activated, when an array is reshaped, or when an array is started, the extent to which the array is degraded can change. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:43 +10:00
Neil Brown	72a23c211e	Make sure all changes to md/sync_action are notified. When the 'resync' thread starts or stops, when we explicitly set sync_action, or when we determine that there is definitely nothing to do, we notify sync_action. To stop "sync_action" from occasionally showing the wrong value, we introduce a new flags - MD_RECOVERY_RECOVER - to say that a recovery is probably needed or happening, and we make sure that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:41 +10:00
Neil Brown	0fd62b861e	Make sure all changes to md/array_state are notified. Changes in md/array_state could be of interest to a monitoring program. So make sure all changes trigger a notification. Exceptions: changing active_idle to active is not reported because it is frequent and not interesting. changing active to active_idle is only reported on arrays with externally managed metadata, as it is not interesting otherwise. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:36 +10:00
Neil Brown	c7d0c941ae	Don't reject HOT_REMOVE_DISK request for an array that is not yet started. There is really no need for this test here, and there are valid cases for selectively removing devices from an array that it not actually active. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:34 +10:00
Neil Brown	199050ea1f	rationalise return value for ->hot_add_disk method. For all array types but linear, ->hot_add_disk returns 1 on success, 0 on failure. For linear, it returns 0 on success and -errno on failure. This doesn't cause a functional problem because the ->hot_add_disk function of linear is used quite differently to the others. However it is confusing. So convert all to return 0 for success or -errno on failure and fix call sites to match. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:33 +10:00
Neil Brown	6c2fce2ef6	Support adding a spare to a live md array with external metadata. i.e. extend the 'md/dev-XXX/slot' attribute so that you can tell a device to fill an vacant slot in an and md array. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:31 +10:00
Neil Brown	8ed0a5216a	Enable setting of 'offset' and 'size' of a hot-added spare. offset_store and rdev_size_store allow control of the region of a device which is to be using in an md/raid array. They only allow these values to be set when an array is being assembled, as changing them on an active array could be dangerous. However when adding a spare device to an array, we might need to set the offset and size before starting recovery. So allow these values to be set also if "->raid_disk < 0" which indicates that the device is still a spare. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:29 +10:00
Neil Brown	1a0fd49773	Don't try to make md arrays dirty if that is not meaningful. Arrays personalities such as 'raid0' and 'linear' have no redundancy, and so marking them as 'clean' or 'dirty' is not meaningful. So always allow write requests without requiring a superblock update. Such arrays types are detected by ->sync_request being NULL. If it is not possible to send a sync request we don't need a 'dirty' flag because all a dirty flag does is trigger some sync_requests. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:27 +10:00
Neil Brown	f48ed53838	Close race in md_probe There is a possible race in md_probe. If two threads call md_probe for the same device, then one could exit (having checked that ->gendisk exists) before the other has called kobject_init_and_add, thus returning an incomplete kobj which will cause problems when we try to add children to it. So extend the range of protection of disks_mutex slightly to avoid this possibility. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:26 +10:00
Neil Brown	5e96ee65c8	Allow setting start point for requested check/repair This makes it possible to just resync a small part of an array. e.g. if a drive reports that it has questionable sectors, a 'repair' of just the region covering those sectors will cause them to be read and, if there is an error, re-written with correct data. Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:24 +10:00
Neil Brown	a0da84f35b	Improve setting of "events_cleared" for write-intent bitmaps. When an array is degraded, bits in the write-intent bitmap are not cleared, so that if the missing device is re-added, it can be synced by only updated those parts of the device that have changed since it was removed. The enable this a 'events_cleared' value is stored. It is the event counter for the array the last time that any bits were cleared. Sometimes - if a device disappears from an array while it is 'clean' - the events_cleared value gets updated incorrectly (there are subtle ordering issues between updateing events in the main metadata and the bitmap metadata) resulting in the missing device appearing to require a full resync when it is re-added. With this patch, we update events_cleared precisely when we are about to clear a bit in the bitmap. We record events_cleared when we clear the bit internally, and copy that to the superblock which is written out before the bit on storage. This makes it more "obviously correct". We also need to update events_cleared when the event_count is going backwards (as happens on a dirty->clean transition of a non-degraded array). Thanks to Mike Snitzer for identifying this problem and testing early "fixes". Cc: "Mike Snitzer" <snitzer@gmail.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:22 +10:00
Neil Brown	0e13fe23a0	use bio_endio instead of a call to bi_end_io Turn calls to bi->bi_end_io() into bio_endio(). Apparently bio_endio does exactly the same error processing as is hardcoded at these places. bio_endio() avoids recursion (or will soon), so it should be used. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:20 +10:00
Nikanth Karthikesan	13864515f7	linear: correct disk numbering error check From: "Nikanth Karthikesan" <knikanth@novell.com> Correct disk numbering problem check. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:19 +10:00
Neil Brown	9bbbca3a0e	Fix error paths if md_probe fails. md_probe can fail (e.g. alloc_disk could fail) without returning an error (as it alway returns NULL). So when we call mddev_find immediately afterwards, we need to check that md_probe actually succeeded. This means checking that mdev->gendisk is non-NULL. cc: <stable@kernel.org> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:17 +10:00
Neil Brown	efe3114318	Don't acknowlege that stripe-expand is complete until it really is. We shouldn't acknowledge that a stripe has been expanded (When reshaping a raid5 by adding a device) until the moved data has actually been written out. However we are currently acknowledging (by calling md_done_sync) when the POST_XOR is complete and before the write. So track in s.locked whether there are pending writes, and don't call md_done_sync yet if there are. Note: we all set R5_LOCKED on devices which are are about to read from. This probably isn't technically necessary, but is usually done when writing a block, and justifies the use of s.locked here. This bug can lead to a crash if an array is stopped while an reshape is in progress. Cc: <stable@kernel.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:31:14 +10:00
Neil Brown	8c2e870a62	Ensure interrupted recovery completed properly (v1 metadata plus bitmap) If, while assembling an array, we find a device which is not fully in-sync with the array, it is important to set the "fullsync" flags. This is an exact analog to the setting of this flag in hot_add_disk methods. Currently, only v1.x metadata supports having devices in an array which are not fully in-sync (it keep track of how in sync they are). The 'fullsync' flag only makes a difference when a write-intent bitmap is being used. In this case it tells recovery to ignore the bitmap and recovery all blocks. This fix is already in place for raid1, but not raid5/6 or raid10. So without this fix, a raid1 ir raid4/5/6 array with version 1.x metadata and a write intent bitmaps, that is stopped in the middle of a recovery, will appear to complete the recovery instantly after it is reassembled, but the recovery will not be correct. If you might have an array like that, issueing echo repair > /sys/block/mdXX/md/sync_action will make sure recovery completes properly. Cc: <stable@kernel.org> Signed-off-by: Neil Brown <neilb@suse.de>	2008-06-28 08:30:52 +10:00
Dan Williams	c337869d95	md: do not compute parity unless it is on a failed drive If a block is computed (rather than read) then a check/repair operation may be lead to believe that the data on disk is correct, when infact it isn't. So only compute blocks for failed devices. This issue has been around since at least 2.6.12, but has become harder to hit in recent kernels since most reads bypass the cache. echo repair > /sys/block/mdN/md/sync_action will set the parity blocks to the correct state. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-06-06 11:29:08 -07:00
Dan Williams	a6d8113a98	md: fix uninitialized use of mddev->recovery_wait If an array was created with --assume-clean we will oops when trying to set ->resync_max. Fix this by initializing ->recovery_wait in mddev_find. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-06-06 11:29:08 -07:00
Dan Williams	e0a115e5aa	md: fix prexor vs sync_request race During the initial array synchronization process there is a window between when a prexor operation is scheduled to a specific stripe and when it completes for a sync_request to be scheduled to the same stripe. When this happens the prexor completes and the stripe is unconditionally marked "insync", effectively canceling the sync_request for the stripe. Prior to 2.6.23 this was not a problem because the prexor operation was done under sh->lock. The effect in older kernels being that the prexor would still erroneously mark the stripe "insync", but sync_request would be held off and re-mark the stripe as "!in_sync". Change the write completion logic to not mark the stripe "in_sync" if a prexor was performed. The effect of the change is to sometimes not set STRIPE_INSYNC. The worst this can do is cause the resync to stall waiting for STRIPE_INSYNC to be set. If this were happening, then STRIPE_SYNCING would be set and handle_issuing_new_read_requests would cause all available blocks to eventually be read, at which point prexor would never be used on that stripe any more and STRIPE_INSYNC would eventually be set. echo repair > /sys/block/mdN/md/sync_action will correct arrays that may have lost this race. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-06-06 11:29:08 -07:00
Chandra Seetharaman	688864e298	[SCSI] scsi_dh: Remove hardware handler infrastructure from dm This patch just removes infrastructure that provided support for hardware handlers in the dm layer as it is not needed anymore. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-06-05 09:23:42 -05:00
Chandra Seetharaman	cb520223d7	[SCSI] scsi_dh: Remove hardware handlers from dm This patch removes the 3 hardware handlers that currently exist under dm as the functionality is moved to SCSI layer in the earlier patches. [jejb: removed more makefile hunks and rejection fixes] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-06-05 09:23:41 -05:00
Chandra Seetharaman	2651f5d7d3	[SCSI] scsi_dh: Remove dm_pg_init_complete This patch just removes the dm layer's path initialization completion routine. This is separated from the other patch(scsi_dh: Use SCSI device handler in dm-multipath) Just to make that patch more readable. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-06-05 09:23:41 -05:00
Chandra Seetharaman	bab7cfc733	[SCSI] scsi_dh: Add a single threaded workqueue for initializing paths Before this patch set (SCSI hardware handlers), initialization of a path was done asynchronously. Doing that requires a workqueue in each device/hardware handler module and leads to unneccessary complication in the device handler code, making it difficult to read the code and follow the state diagram. Moving that workqueue to this level makes the device handler code simpler. Hence, the workqueue is moved to dm level. A new workqueue is added instead of adding it to the existing workqueue (kmpathd) for the following reasons: 1. Device activation has to happen faster, stacking them along with the other workqueue might lead to unnecessary delay in the activation of the path. 2. The effect could be felt the other way too. i.e the current events that are handled by the existing workqueue might get a delayed response. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-06-05 09:23:41 -05:00
Chandra Seetharaman	cfae5c9bb6	[SCSI] scsi_dh: Use SCSI device handler in dm-multipath This patch converts dm-mpath to use scsi device handlers instead of dm's hardware handlers. This patch does not add any new functionality. Old behaviors remain and userspace tools work as is except that arguments supplied with hardware handler are ignored. One behavioral exception is: Activation of a path is synchronous in this patch, opposed to the older behavior of being asynchronous (changed in patch 07: scsi_dh: Add a single threaded workqueue for initializing a path) Note: There is no need to get a reference for the device handler module (as it was done in the dm hardware handler case) here as the reference is held when the device was first found. Instead we check and make sure that support for the specified device is present at table load time. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2008-06-05 09:23:41 -05:00
NeilBrown	dfc7064500	md: restart recovery cleanly after device failure. When we get any IO error during a recovery (rebuilding a spare), we abort the recovery and restart it. For RAID6 (and multi-drive RAID1) it may not be best to restart at the beginning: when multiple failures can be tolerated, the recovery may be able to continue and re-doing all that has already been done doesn't make sense. We already have the infrastructure to record where a recovery is up to and restart from there, but it is not being used properly. This is because: - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR, which causes the recovery not be be checkpointed. - We remove spares and then re-added them which loses important state information. The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't needed. If there is an error, the relevant drive will be marked as Faulty, and that is enough to ensure correct handling of the error. So we first remove MD_RECOVERY_ERR, changing some of the uses of it to MD_RECOVERY_INTR. Then we cause the attempt to remove a non-faulty device from an array to fail (unless recovery is impossible as the array is too degraded). Then when remove_and_add_spares attempts to remove the devices on which recovery can continue, it will fail, they will remain in place, and recovery will continue on them as desired. Issue: If we are halfway through rebuilding a spare and another drive fails, and a new spare is immediately available, do we want to: 1/ complete the current rebuild, then go back and rebuild the new spare or 2/ restart the rebuild from the start and rebuild both devices in parallel. Both options can be argued for. The code currently takes option 2 as a/ this requires least code change b/ this results in a minimally-degraded array in minimal time. Cc: "Eivind Sarto" <ivan@kasenna.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Bernd Schubert	90b08710e4	md: allow parallel resync of md-devices. In some configurations, a raid6 resync can be limited by CPU speed (Calculating P and Q and moving data) rather than by device speed. In these cases there is nothing to be gained byt serialising resync of arrays that share a device, and doing the resync in parallel can provide benefit. So add a sysfs tunable to flag an array as being allowed to resync in parallel with other arrays that use (a different part of) the same device. Signed-off-by: Bernd Schubert <bs@q-leap.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Dan Williams	4f54b0e948	md: notify userspace on 'stop' events This additional notification to 'array_state' is needed to allow the monitor application to learn about stop events via sysfs. The sysfs_notify("sync_action") call that comes at the end of do_md_stop() (via md_new_event) is insufficient since the 'sync_action' attribute has been removed by this point. (Seems like a sysfs-notify-on-removal patch is a better fix. Currently removal updates the event count but does not wake up waiters) Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
NeilBrown	09a44cc150	md: notify userspace on 'write-pending' changes to array_state When an array enters write pending, 'array_state' changes, so we must be sure to sysfs_notify. Also, when waiting for user-space to acknowledge 'write-pending' by marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to be cleared as that might not happen. So explicity test for the bits that we are really interested in. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
NeilBrown	698b18c1e8	md: raid1: Fix restoration of bio between failed read and write. When performing a "recovery" or "check" pass on a RAID1 array, we read from each device and possible, if there is a difference or a read error, write back to some devices. We use the same 'bio' for both read and write, resetting various fields between the two operations. We forgot to reset bv_offset and bv_len however. These are often left unchanged, but in the case where there is an IO error one or two sectors into a page, they are changed. This results in correctable errors not being corrected properly. It does not result in any data corruption. Cc: "Fairbanks, David" <David.Fairbanks@stratus.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Bernd Schubert	6be9d49401	md: md: raid5 rate limit error printk Last night we had scsi problems and a hardware raid unit was offlined during heavy i/o. While this happened we got for about 3 minutes a huge number messages like these Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2). I guess the high error rate is responsible for not scheduling other events - during this time the system was not pingable and in the end also other devices run into scsi command timeouts causing problems on these unrelated devices as well. Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Christoph Hellwig	6bcfd60186	md: kill file_path wrapper Kill the trivial and rather pointless file_path wrapper around d_path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:09 -07:00
NeilBrown	84255d1018	md: fix possible oops when removing a bitmap from an active array It is possible to add a write-intent bitmap to an active array, or remove the bitmap that is there. When we do with the 'quiesce' the array, which causes make_request to block in "wait_barrier()". However we are sampling the value of "mddev->bitmap" before the wait_barrier call, and using it afterwards. This can result in using a bitmap structure that has been freed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:09 -07:00
Neil Brown	e7e72bf641	Remove blkdev warning triggered by using md As setting and clearing queue flags now requires that we hold a spinlock on the queue, and as blk_queue_stack_limits is called without that lock, get the lock inside blk_queue_stack_limits. For blk_queue_stack_limits to be able to find the right lock, each md personality needs to set q->queue_lock to point to the appropriate lock. Those personalities which didn't previously use a spin_lock, us q->__queue_lock. So always initialise that lock when allocated. With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no longer cause warnings as it will be clear that the proper lock is held. Thanks to Dan Williams for review and fixing the silly bugs. Signed-off-by: NeilBrown <neilb@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Alistair John Strachan <alistair@devzero.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Jacek Luczak <difrost.kernel@gmail.com> Cc: Prakash Punnoor <prakash@punnoor.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-14 19:11:15 -07:00
Dan Williams	c8894419ac	md: fix raid5 'repair' operations commit `bd2ab67030` "md: close a livelock window in handle_parity_checks5" introduced a bug in handling 'repair' operations. After a repair operation completes we clear the state bits tracking this operation. However, they are cleared too early and this results in the code deciding to re-run the parity check operation. Since we have done the repair in memory the second check does not find a mismatch and thus does not do a writeback. Test results: $ echo repair > /sys/block/md0/md/sync_action $ cat /sys/block/md0/md/mismatch_cnt 51072 $ echo repair > /sys/block/md0/md/sync_action $ cat /sys/block/md0/md/mismatch_cnt 0 (also fix incorrect indentation) Cc: <stable@kernel.org> Tested-by: George Spelvin <linux@horizon.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-13 08:02:24 -07:00
Harvey Harrison	cb6969e8cd	misc: fix integer as NULL pointer warnings drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-08 10:46:55 -07:00
Dan Williams	6bfe0b4990	md: support blocking writes to an array on device failure Allows a userspace metadata handler to take action upon detecting a device failure. Based on an original patch by Neil Brown. Changes: -added blocked_wait waitqueue to rdev -don't qualify Blocked with Faulty always let userspace block writes -added md_wait_for_blocked_rdev to wait for the block device to be clear, if userspace misses the notification another one is sent every 5 seconds -set MD_RECOVERY_NEEDED after clearing "blocked" -kill DoBlock flag, just test mddev->external Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Dan Williams	11e2ede022	md: prevent duplicates in bind_rdev_to_array Found when trying to reassemble an active externally managed array. Without this check we hit the more noisy "sysfs duplicate" warning in the later call to kobject_add. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Dan Williams	242b363e22	md: remove a stray command from a copy and paste error in resync_start_store Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
NeilBrown	648b629ed4	md: fix up switching md arrays between read-only and read-write When setting an array to 'readonly' or to 'active' via sysfs, we must make the appropriate set_disk_ro call too. Also when switching to "read_auto" (which is like readonly, but blocks on the first write so that metadata can be marked 'dirty') we need to be more careful about what state we are changing from. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
NeilBrown	31a59e3425	md: fix 'safemode' handling for external metadata. 'safemode' relates to marking an array as 'clean' if there has been no write traffic for a while (a couple of seconds), to reduce the chance of the array being found dirty on reboot. ->safemode is set to '1' when there have been no write for a while, and it gets set to '0' when the superblock is updates with the 'clean' flag set. This requires a few fixes for 'external' metadata: - When an array is set to 'clean' via sysfs, 'safemode' must be cleared. - when we write to an array that has 'safemode' set (there must have been some delay in updating the metadata), we need to clear safemode. - Don't try to update external metadata in md_check_recovery for safemode transitions - it won't work. Also, don't try to support "immediate safe mode" (safemode==2) for external metadata, it cannot really work (the safemode timeout can be set very low if this is really needed). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
NeilBrown	d897dbf914	md: reinitialise more mddev fields in do_md_stop. I keep finding problems where an mddev gets reused and some fields has a value from a previous usage that confuses the new usage. So clear all fields that could possible need clearing when calling do_md_stop. Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
NeilBrown	8377bc8080	md: skip all metadata update processing when using external metadata. All the metadata update processing for external metadata is on in user-space or through the sysfs interfaces, so make "md_update_sb" a no-op in that case. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
Dan Williams	6a51830e14	md: fix use after free when removing rdev via sysfs rdev->mddev is no longer valid upon return from entry->store() when the 'remove' command is given. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:32 -07:00
Jens Axboe	c9a3f6d6f5	dm: use unlocked variants of queue flag check/set dm.c already provides mutual exclusion through ->map_lock. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 10:21:12 -07:00
Linus Torvalds	bd5d435a96	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: Skip I/O merges when disabled block: add large command support block: replace sizeof(rq->cmd) with BLK_MAX_CDB ide: use blk_rq_init() to initialize the request block: use blk_rq_init() to initialize the request block: rename and export rq_init() block: no need to initialize rq->cmd with blk_get_request block: no need to initialize rq->cmd in prepare_flush_fn hook block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline block/elevator.c:elv_rq_merge_ok() mustn't be inline block: make queue flags non-atomic block: add dma alignment and padding support to blk_rq_map_kern unexport blk_max_pfn ps3disk: Remove superfluous cast block: make rq_init() do a full memset() relay: fix splice problem	2008-04-29 08:18:03 -07:00
Denis V. Lunev	c7705f3449	drivers: use non-racy method for proc entries creation (2) Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data be setup before gluing PDE to main tree. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Neil Brown <neilb@suse.de> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:22 -07:00
FUJITA Tomonori	992b5bceee	block: no need to initialize rq->cmd with blk_get_request blk_get_request initializes rq->cmd (rq_init does) so the users don't need to do that. The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd, as a preparation for large command support, which changes rq->cmd from the static array to a pointer. sizeof(rq->cmd) will not make sense and &rq->cmd won't work. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:55 +02:00
Nick Piggin	75ad23bc0f	block: make queue flags non-atomic We can save some atomic ops in the IO path, if we clearly define the rules of how to modify the queue flags. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 14:48:33 +02:00
Julia Lawall	62b0559aad	drivers/md: use time_before, time_before_eq, etc The functions time_before, time_before_eq, time_after, and time_after_eq are more robust for comparing jiffies against other values. A simplified version of the semantic patch making this change is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @ change_compare_np @ expression E; @@ ( - jiffies <= E + time_before_eq(jiffies,E) \| - jiffies >= E + time_after_eq(jiffies,E) \| - jiffies < E + time_before(jiffies,E) \| - jiffies > E + time_after(jiffies,E) ) @ include depends on change_compare_np @ @@ #include <linux/jiffies.h> @ no_include depends on !include && change_compare_np @ @@ #include <linux/...> + #include <linux/jiffies.h> // </smpl> [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Nick Andrew	d7a420c947	raid: remove leading TAB on printk messages MD drivers use one printk() call to print 2 log messages and the second line may be prefixed by a TAB character. It may also output a trailing space before newline. klogd (I think) turns the TAB character into the 2 characters '^I' when logging to a file. This looks ugly. Instead of a leading TAB to indicate continuation, prefix both output lines with 'raid:' or similar. Also remove any trailing space in the vicinity of the affected code and consistently end the sentences with a period. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Dan Williams	4ef197d87a	md: raid5.c convert simple_strtoul to strict_strtoul strict_strtoul handles the open-coded sanity checks in raid5_store_stripe_cache_size and raid5_store_preread_threshold Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Dan Williams	8b3e6cdc53	md: introduce get_priority_stripe() to improve raid456 write performance Improve write performance by preventing the delayed_list from dumping all its stripes onto the handle_list in one shot. Delayed stripes are now further delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when: * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list' * 'handle_list' is empty and i/o is being done to satisfy full stripe-width write requests * 'bypass_count' is less than 'bypass_threshold'. By default the threshold is 1, i.e. every other stripe handled is a preread stripe provided the top two conditions are false. Benchmark data: System: 2x Xeon 5150, 4x SATA, mem=1GB Baseline: 2.6.24-rc7 Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 * patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512) Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS) * patched: +13% * patched + preread_bypass_threshold = 0: +37% Changes since v1: * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and make it configurable. This defaults to fairness and modest performance gains out of the box. Changes since v2: * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not necessary, the important change was clearing STRIPE_DELAYED in add_stripe_bio and this has been moved out to make_request for the hang fix. * [neilb@suse.de]: simplify get_priority_stripe * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is sampled empty (+11%) * [dan.j.williams@intel.com]: decrement the bypass_count at the detection of stripes being naturally promoted off of hold_list +2%. Note, resetting bypass_count instead of decrementing on these events yields +4% but that is probably too aggressive. Changes since v3: * cosmetic fixups Tested-by: James W. Laferriere <babydr@baby-dragons.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Harvey Harrison	e46b272b66	md: replace remaining __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Harvey Harrison	9a7b2b0f36	md: fix integer as NULL pointer warnings in md.c drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer Add some braces to match the else-block as well. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:42 -07:00
Frederik Deweerdt	cf13ab8e02	dm: remove md argument from specific_minor The small patch below: - Removes the unused md argument from both specific_minor() and next_free_minor() - Folds kmalloc + memset(0) into a single kzalloc call in alloc_dev() This has been compile tested on x86. Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:27:02 +01:00
Adrian Bunk	4fdfe401e9	dm table: remove unused dm_create_error_table dm_create_error_table() was added in kernel 2.6.18 and never used... Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:27:00 +01:00
Adrian Bunk	e8488d0858	dm table: drop void suspend_targets return void returning functions returned the return value of another void returning function... Spotted by sparse. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:59 +01:00
Mikulas Patocka	7ff14a3615	dm: unplug queues in threads Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O. It is specified that any submitted bio without BIO_RW_SYNC flag may plug the queue (i.e. block the requests from being dispatched to the physical device). The queue is unplugged when the caller calls blk_unplug() function. Usually, the sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs the queue and waits (to be possibly joined with other adjacent bios). Then, when the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to the disk. This was happenning: When doing O_SYNC writes, function fsync_buffers_list() submits a list of bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is woken up. fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but there are no bios on the device queue as they are still in the dm_raid1 queue. wait_on_buffer() starts waiting until the IO is finished. kmirrord is scheduled, kmirrord takes bios and submits them to the devices. The submitted bio plugs the harddisk queue but there is no one to unplug it. (The process that called wait_on_buffer() is already sleeping.) So there is a 3ms timeout, after which the queues on the harddisks are unplugged and requests are processed. This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes), dm-raid1 is 10 times slower than md raid1. Every time we submit something asynchronously via dm_io, we must unplug the queue actually to send the request to the device. This patch adds an unplug call to kmirrord - while processing requests, it keeps the queue plugged (so that adjacent bios can be merged); when it finishes processing all the bios, it unplugs the queue to submit the bios. It also fixes kcopyd which has the same potential problem. All kcopyd requests are submitted with BIO_RW_SYNC. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-25 13:26:57 +01:00
Mikulas Patocka	a2aebe03be	dm raid1: use timer This patch replaces the schedule() in the main kmirrord thread with a timer. The schedule() could introduce an unwanted delay when work is ready to be processed. The code instead calls wake() when there's work to be done immediately, and delayed_wake() after a failure to give a short delay before retrying. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:56 +01:00
Alasdair G Kergon	a765e20eeb	dm: move include files Publish the dm-io, dm-log and dm-kcopyd headers in include/linux. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:55 +01:00
Alasdair G Kergon	2d1e580afe	dm kcopyd: rename Rename kcopyd.[ch] to dm-kcopyd.[ch]. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:54 +01:00
Alasdair G Kergon	0da336e5fa	dm: expose macros Make dm.h macros and inlines available in include/linux/device-mapper.h Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:53 +01:00
Mikulas Patocka	945fa4d283	dm kcopyd: remove redundant client counting Remove client counting code that is no longer needed. Initialization and destruction is made globally from dm_init and dm_exit and is not based on client counts. Initialization allocates only one empty slab cache, so there is no negative impact from performing the initialization always, regardless of whether some client uses kcopyd or not. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:52 +01:00
Mikulas Patocka	08d8757a4d	dm kcopyd: private mempool Change the global mempool in kcopyd into a per-device mempool to avoid deadlock possibilities. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:50 +01:00
Mikulas Patocka	8c0cbc2f79	dm kcopyd: per device Make one kcopyd thread per device. The original shared kcopyd could deadlock. Configuration:	2008-04-25 13:26:49 +01:00
Jonathan Brassow	2a23aa1ddb	dm log: make module use tracking internal Remove internal module reference fields from the interface. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:48 +01:00
Alasdair G Kergon	b8206bc3de	dm log: move register functions Reorder a couple of functions in the file so the next patch is readable. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:47 +01:00
Heinz Mauelshagen	416cd17b19	dm log: clean interface Clean up the dm-log interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:46 +01:00
Heinz Mauelshagen	eb69aca5d3	dm kcopyd: clean interface Clean up the kcopyd interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:44 +01:00
Heinz Mauelshagen	22a1ceb1e6	dm io: clean interface Clean up the dm-io interface to prepare for publishing it in include/linux. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:43 +01:00
Alasdair G Kergon	e01fd7eeb0	dm io: rename error to error_bits Rename 'error' to 'error_bits' for clarity. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:41 +01:00
Mikulas Patocka	72727bad54	dm snapshot: store pointer to target instance Save pointer to dm_target in dm_snapshot structure. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:40 +01:00
Heinz Mauelshagen	769aef30f0	dm log: move dirty region log code into separate module Move the dirty region log code into a separate module so other targets can share the code. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:39 +01:00
Heinz Mauelshagen	b7fd54a70f	dm log: generalise name in messages Change dm-log.c messages from "mirror log" to "dirty region log" as a new dm target wants to share this code. Signed-off-by: Heinz Mauelshagen <hjm@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:38 +01:00
Robert P. J. Day	c12bfc923e	dm raid1: use list_split_init Use shorter list_splice_init() for brevity. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:36 +01:00
Milan Broz	8ee2767a59	dm snapshot: reduce default memory allocation Limit the amount of memory allocated per snapshot on systems with a large page size. (The larger default chunk size on these systems compensates for the smaller number of pages reserved.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2008-04-25 13:26:35 +01:00
Mikulas Patocka	924362629b	dm snapshot: fix chunksize sector conversion If a snapshot has a smaller chunksize than the page size the conversion to pages currently returns 0 instead of 1, causing: kernel BUG in mempool_resize. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org	2008-04-25 13:26:34 +01:00
Nick Andrew	fdefa4d87e	RAID: remove trailing space from printk line drivers/md/*.[ch] contains only one more printk line with a trailing space. Remove it. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>	2008-04-21 22:42:58 +00:00
Dan Williams	bd2ab67030	md: close a livelock window in handle_parity_checks5 If a failure is detected after a parity check operation has been initiated, but before it completes handle_parity_checks5 will never quiesce operations on the stripe. Explicitly handle this case by "canceling" the parity check, i.e. clear the STRIPE_OP_CHECK flags and queue the stripe on the handle list again to refresh any non-uptodate blocks. Kernel versions >= 2.6.23 are susceptible. Cc: <stable@kernel.org> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-11 08:06:44 -07:00
Alasdair G Kergon	4cdc1d1fa5	dm io: write error bits form long not int write_err is an unsigned long used with set_bit() so should not be passed around as unsigned int. http://bugzilla.kernel.org/show_bug.cgi?id=10271 Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-28 14:45:23 -07:00
Milan Broz	3f1e9070f6	dm crypt: fix ctx pending Fix regression in dm-crypt introduced in commit `3a7f6c990a` ("dm crypt: use async crypto"). If write requests need to be split into pieces, the code must not process them in parallel because the crypto context cannot be shared. So there can be parallel crypto operations on one part of the write, but only one write bio can be processed at a time. This is not optimal and the workqueue code needs to be optimized for parallel processing, but for now it solves the problem without affecting the performance of synchronous crypto operation (most of current dm-crypt users). http://bugzilla.kernel.org/show_bug.cgi?id=10242 http://bugzilla.kernel.org/show_bug.cgi?id=10207 Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-28 14:45:22 -07:00
Andrew Morton	9ea85ebae1	drivers/md/raid5.c: fix printk warnings gcc-3.4.5 on sparc64: drivers/md/raid5.c: In function `raid5_end_read_request': drivers/md/raid5.c:1147: warning: long long unsigned int format, long unsigned int arg (arg 4) drivers/md/raid5.c:1164: warning: long long unsigned int format, long unsigned int arg (arg 3) drivers/md/raid5.c:1170: warning: long long unsigned int format, long unsigned int arg (arg 3) sector_t is u64, and we don't know what type the architecture uses to implement u64 (on some it is unsigned long). Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-19 18:53:37 -07:00
NeilBrown	0e82989d95	md: remove the 'super' sysfs attribute from devices in an 'md' array Exposing the binary blob which is the md 'super-block' via sysfs doesn't really fit with the whole sysfs model, and ever since commit `8118a859dc` ("sysfs: fix off-by-one error in fill_read_buffer()") it doesn't actually work at all (as the size of the blob is often one page). (akpm: as in, fs/sysfs/file.c:fill_read_buffer() goes BUG) So just remove it altogether. It isn't really useful. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-19 18:53:35 -07:00
NeilBrown	7be3dfec47	md: reduce CPU wastage on idle md array with a write-intent bitmap Recent patch titled Reduce CPU wastage on idle md array with a write-intent bitmap. would sometimes leave the array with dirty bitmap bits that stay dirty. A subsequent write would sort things out so it isn't a big problem, but should be fixed nonetheless. We need to make sure that when the bitmap becomes not "allclean", the daemon_sleep really does get set to a sensible value. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-10 18:01:19 -07:00
NeilBrown	52720ae77d	md: fix formatting error in /proc/mdstat If an md array is "auto-read-only", then this appears in /proc/mdstat as /dev/md0: active(auto-read-only) whereas if it is truely readonly, it appears as /dev/md0: active (read-only) The difference being a space. One program known to parse this file expects the space and gets badly confused. It will be fixed, but it would be best if what the kernel generates is more consistent too. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-10 18:01:19 -07:00
K.Tanaka	a07e6ab41b	md: the md RAID10 resync thread could cause a md RAID10 array deadlock This message describes another issue about md RAID10 found by testing the 2.6.24 md RAID10 using new scsi fault injection framework. Abstract: When a scsi error results in disabling a disk during RAID10 recovery, the resync threads of md RAID10 could stall. This case, the raid array has already been broken and it may not matter. But I think stall is not preferable. If it occurs, even shutdown or reboot will fail because of resource busy. The deadlock mechanism: The r10bio_s structure has a "remaining" member to keep track of BIOs yet to be handled when recovering. The "remaining" counter is incremented when building a BIO in sync_request() and is decremented when finish a BIO in end_sync_write(). If building a BIO fails for some reasons in sync_request(), the "remaining" should be decremented if it has already been incremented. I found a case where this decrement is forgotten. This causes a md_do_sync() deadlock because md_do_sync() waits for md_done_sync() called by end_sync_write(), but end_sync_write() never calls md_done_sync() because of the "remaining" counter mismatch. For example, this problem would be reproduced in the following case: Personalities : [raid10] md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F) 3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_] [>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec This case, sdf1 is recovering, sdb1 and sde1 are disabled. An additional error with detaching sdd will cause a deadlock. md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F) 3919616 blocks 64K chunks 2 near-copies [4/1] [_U__] [=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec 2739 ? S< 0:17 [md0_raid10] 28608 ? D< 0:00 [md0_resync] 28629 pts/1 Ss 0:00 bash 28830 pts/1 R+ 0:00 ps ax 31819 ? D< 0:00 [kjournald] The resync thread keeps working, but actually it is deadlocked. Patch: By this patch, the remaining counter will be decremented if needed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	1c830532f6	md: fix possible raid1/raid10 deadlock on read error during resync Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for another possible deadlock in raid1/raid10 error handing. If a read request returns an error while a resync is happening and a resync request is pending, the attempt to fix the error will block until the resync progresses, and the resync will block until the read request completes. Thus a deadlock. This patch fixes the problem. Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
Keld Simonsen	8ed3a19563	md: don't attempt read-balancing for raid10 'far' layouts This patch changes the disk to be read for layout "far > 1" to always be the disk with the lowest block address. Thus the chunks to be read will always be (for a fully functioning array) from the first band of stripes, and the raid will then work as a raid0 consisting of the first band of stripes. Some advantages: The fastest part which is the outer sectors of the disks involved will be used. The outer blocks of a disk may be as much as 100 % faster than the inner blocks. Average seek time will be smaller, as seeks will always be confined to the first part of the disks. Mixed disks with different performance characteristics will work better, as they will work as raid0, the sequential read rate will be number of disks involved times the IO rate of the slowest disk. If a disk is malfunctioning, the first disk which is working, and has the lowest block address for the logical block will be used. Signed-off-by: Keld Simonsen <keld@dkuug.dk> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	27c529bb8e	md: lock access to rdev attributes properly When we access attributes of an rdev (component device on an md array) through sysfs, we really need to lock the array against concurrent changes. We currently do that when we change an attribute, but not when we read an attribute. We need to lock when reading as well else rdev->mddev could become NULL while we are accessing it. So add appropriate locking (mddev_lock) to rdev_attr_show. rdev_size_store requires some extra care as well as it needs to unlock the mddev while scanning other mddevs for overlapping regions. We currently assume that rdev->mddev will still be unchanged after the scan, but that cannot be certain. So take a copy of rdev->mddev for use at the end of the function. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	2515619823	md: make sure a reshape is started when device switches to read-write A resync/reshape/recovery thread will refuse to progress when the array is marked read-only. So whenever it mark it not read-only, it is important to wake up thread resync thread. There is one place we didn't do this. The problem manifests if the start_ro module parameters is set, and a raid5 array that is in the middle of a reshape (restripe) is started. The array will initially be semi-read-only (meaning it acts like it is readonly until the first write). So the reshape will not proceed. On the first write, the array will become read-write, but the reshape will not be started, and there is no event which will ever restart that thread. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	d0fae18f1b	md: clean up irregularity with raid autodetect When a raid1 array is stopped, all components currently get added to the list for auto-detection. However we should really only add components that were found by autodetection in the first place. So add a flag to record that information, and use it. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:18 -08:00
NeilBrown	a1801f858e	md: guard against possible bad array geometry in v1 metadata Make sure the data doesn't start before the end of the superblock when the superblock is at the start of the device. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:17 -08:00
NeilBrown	8311c29d40	md: reduce CPU wastage on idle md array with a write-intent bitmap On an md array with a write-intent bitmap, a thread wakes up every few seconds and scans the bitmap looking for work to do. If the array is idle, there will be no work to do, but a lot of scanning is done to discover this. So cache the fact that the bitmap is completely clean, and avoid scanning the whole bitmap when the cache is known to be clean. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:17 -08:00

... 5 6 7 8 9 ...

1424 Commits