linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	292088ee03	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "A couple of fixes + getting rid of __blkdev_put() return value" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: proc: Use PDE attribute setting accessor functions make blkdev_put() return void block_device_operations->release() should return void mtd_blktrans_ops->release() should return void hfs: SMP race on directory close()	2013-05-07 15:14:53 -07:00
Roger Pau Monne	bb642e8315	xen-blkback: allocate list of pending reqs in small chunks Allocate pending requests in smaller chunks instead of allocating them all at the same time. This change also removes the global array of pending_reqs, it is no longer necessay. Variables related to the grant mapping have been grouped into a struct called "grant_page", this allows to allocate them in smaller chunks, and also improves memory locality. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-05-07 09:42:17 -04:00
Al Viro	db2a144bed	block_device_operations->release() should return void The value passed is 0 in all but "it can never happen" cases (and those only in a couple of drivers) and it would've been lost on the way out anyway, even if something tried to pass something meaningful. Just don't bother. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-07 02:16:21 -04:00
Linus Torvalds	91f8575685	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Alex Elder: "This is a big pull. Most of it is culmination of Alex's work to implement RBD image layering, which is now complete (yay!). There is also some work from Yan to fix i_mutex behavior surrounding writes in cephfs, a sync write fix, a fix for RBD images that get resized while they are mapped, and a few patches from me that resolve annoying auth warnings and fix several bugs in the ceph auth code." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (254 commits) rbd: fix image request leak on parent read libceph: use slab cache for osd client requests libceph: allocate ceph message data with a slab allocator libceph: allocate ceph messages with a slab allocator rbd: allocate image object names with a slab allocator rbd: allocate object requests with a slab allocator rbd: allocate name separate from obj_request rbd: allocate image requests with a slab allocator rbd: use binary search for snapshot lookup rbd: clear EXISTS flag if mapped snapshot disappears rbd: kill off the snapshot list rbd: define rbd_snap_size() and rbd_snap_features() rbd: use snap_id not index to look up snap info rbd: look up snapshot name in names buffer rbd: drop obj_request->version rbd: drop rbd_obj_method_sync() version parameter rbd: more version parameter removal rbd: get rid of some version parameters rbd: stop tracking header object version rbd: snap names are pointer to constant data ...	2013-05-06 13:11:19 -07:00
Linus Torvalds	736a2dd257	Lots of virtio work which wasn't quite ready for last merge window. Plus I dived into lguest again, reworking the pagetable code so we can move the switcher page: our fixmaps sometimes take more than 2MB now... Cheers, Rusty. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJRga7lAAoJENkgDmzRrbjx/yIQAKpqIBtxOJeYH3SY+Uoe7Cfp toNYcpJEldvb0UcWN8M2cSZpHoxl1SUoq9djwcM29tcKa7EZAjHaGtb/Q1qMTDgv +B3WAfiGU2pmXFxLAkbrlLNGnysy24JspqJQ5hcYV84EiBxQdZp+nCYgOphd+GMK ww16vo9ya8jFjzt3GeRp/Heb3vEzV4Cp6BC3i0m8A3WNpEpbRb66pqXNk5o8ggJO SxQOKSXmUM+0m+jKSul5xn3e2Ls2LOrZZ8/DIHA+gW66N4Zab7n2/j1Q9VRxb4lh FqnR7KwgBX8OCh9IsBDqQYS7MohvMYge6eUdLtFrq84jvMleMEhrC8q9v2tucFUb 5t18CLwvyK7Gdg6UCKiZ7YSPcuURAILO16al9bh5IseeBDsuX+43VsvQoBmFn9k6 cLOVTZ6BlOmahK5PyRYFSvLa9Rxzr/05Mr7oYq9UgshD9io78dnqczFYIORF53rW zD7C4HuTZfYJFfNd0wAJ0RfVXnf8QvDlMdo7zPC26DSXNWqj8OexCY0qqSWUB+2F vcfJP6NkV4fZB8aawWIFUVwc64yqtt2uPVLa7ATZWqk16PgKrchGewmw3tiEwOgu 1l7xgffTRRUIJsqaCZoXdgw3yezcKRjuUBcOxL09lDAAhc+NxWNvzZBsKp66DwDk yZQKn0OdXnuf0CeEOfFf =1tYL -----END PGP SIGNATURE----- Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull virtio & lguest updates from Rusty Russell: "Lots of virtio work which wasn't quite ready for last merge window. Plus I dived into lguest again, reworking the pagetable code so we can move the switcher page: our fixmaps sometimes take more than 2MB now..." Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename. Hopefully correctly resolved. * tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits) caif_virtio: Remove bouncing email addresses lguest: improve code readability in lg_cpu_start. virtio-net: fill only rx queues which are being used lguest: map Switcher below fixmap. lguest: cache last cpu we ran on. lguest: map Switcher text whenever we allocate a new pagetable. lguest: don't share Switcher PTE pages between guests. lguest: expost switcher_pages array (as lg_switcher_pages). lguest: extract shadow PTE walking / allocating. lguest: make check_gpte et. al return bool. lguest: assume Switcher text is a single page. lguest: rename switcher_page to switcher_pages. lguest: remove RESERVE_MEM constant. lguest: check vaddr not pgd for Switcher protection. lguest: prepare to make SWITCHER_ADDR a variable. virtio: console: replace EMFILE with EBUSY for already-open port virtio-scsi: reset virtqueue affinity when doing cpu hotplug virtio-scsi: introduce multiqueue support virtio-scsi: push vq lock/unlock into virtscsi_vq_done virtio-scsi: pass struct virtio_scsi to virtqueue completion function ...	2013-05-02 14:14:04 -07:00
Keith Busch	78f8d2577b	NVMe: Schedule timeout for sync commands Schedule a timeout on sync commands in case the command times out and the device is not being polled for timeouts. This prevents device removal from hanging forever if the device has stopped responding. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 15:36:02 -04:00
Keith Busch	f410c680b5	NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO This adds support for namespaces with separate meta-data formats in the submit io ioctl. The meta-data buffer has to be a contiguous, so such a buffer is allocated and the mapped user pages are copied to/from this buffer for write/read commands. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 15:35:09 -04:00
Keith Busch	159b67d7ae	NVMe: Device specific stripe size handling We have an nvme device that has a concept of a stripe size. IO requests that do not transfer data crossing a stripe boundary has greater performance compared to IO that does cross it. This patch sets the stripe size for the device if the device and vendor ids match one with this feature and splits IO requests that cross the stripe boundary. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:41:05 -04:00
Keith Busch	427e970801	NVMe: Split non-mergeable bio requests It is possible a bio request can not be submitted as a single NVMe IO command if the bio_vec is not mergeable with the NVMe PRP alignement constraints. This condition was handled by submitting an IO for the mergeable portion then submitting a follow on IO for the remaining data after the previous IO completes. The remainder to be sent was tracked by manipulating the bio->bi_idx and bio->bi_sector. This patch splits the request as many times as necessary and submits the bios together. Since submitting the bio may cause it to be requeued on split, nvme_resubmit_bios had to be modified to remove the wait queue when the bio list is empty prior to submitting the bio since a split would have added the wait queue a second time, corrupting the wait queue head task list. There are a few other benefits from doing this: it fixes a potential issue with the previous handling of a non-mergeable bio as the requeuing method could would use an unlocked nvme_queue if the callback isn't invoked on the queue's associated cpu; it will be possible to retry a failed bio if desired at some later time since it does not manipulate the original bio; the bio integrity extensions require the bio to be in its original condition for the checks to work correctly if we implement the end-to-end data protection in the future. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:38:59 -04:00
Keith Busch	cbb6218fd4	NVMe: Remove dead code in nvme_dev_add There is no situation that could occur where we could error out of this function and require cleaning up allocated namespaces. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:36:45 -04:00
Keith Busch	a9ef4343af	NVMe: Check for NULL memory in nvme_dev_add Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:35:44 -04:00
Keith Busch	68b8eca5f8	NVMe: Fix error clean-up on nvme_alloc_queue The nvme_queue's depth is not set if we fail to allocate submission queue entries, which was being used to determine how much coherent memory to free on error. Use the depth variable instead. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:34:35 -04:00
Keith Busch	025c557a71	NVMe: Free admin queue on request_irq error Fixes a potential memory leak if requesting the admin queue irq fails. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:33:53 -04:00
Keith Busch	ec50373350	NVMe: Add scsi unmap to SG_IO Translates a scsi unmap request from SG_IO ioctl to NVMe data-set-management deallocate. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:32:08 -04:00
Keith Busch	14385de117	NVMe: queue usage fixes in nvme-scsi Fixes nvme queue usages in scsi-to-nvme translation code to not get a queue more often than it is being put, and not use the queue in an unsafe way without it being locked. Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-02 14:30:53 -04:00
Alex Elder	b5b09be30c	rbd: fix image request leak on parent read When a read for a layered image object finds the target object doesn't exist, a read image request for the parent image is created and submitted. When that completes, the callback routine was not releasing that parent image request. Fix that. The slab allocation stuff just added has greatly simplified the search for the source of this memory leak. This resolves: http://tracker.ceph.com/issues/4803 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 12:15:28 -05:00
Alex Elder	78c2a44aae	rbd: allocate image object names with a slab allocator The names of objects used for image object requests are always fixed size. So create a slab cache to manage them. Define a new function rbd_segment_name_free() to match rbd_segment_name() (which is what supplies the dynamically-allocated name buffer). This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:30 -05:00
Alex Elder	868311b1eb	rbd: allocate object requests with a slab allocator Create a slab cache to manage rbd_obj_request allocation. We aren't using a constructor, and we'll zero-fill object request structures when they're allocated. This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:30 -05:00
Alex Elder	f907ad5596	rbd: allocate name separate from obj_request The next patch will define a slab allocator for a object requests. To use that we'll need to allocate the name of an object separate from the request structure itself. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:29 -05:00
Alex Elder	1c2a9dfe21	rbd: allocate image requests with a slab allocator Create a slab cache to manage rbd_img_request allocation. Nothing too fancy at this point--we'll still initialize everything at allocation time (no constructor) This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:29 -05:00
Alex Elder	30d1cff817	rbd: use binary search for snapshot lookup Use bsearch(3) to make snapshot lookup by id more efficient. (There could be thousands of snapshots, and conceivably many more.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:17 -05:00
Alex Elder	15228ede7d	rbd: clear EXISTS flag if mapped snapshot disappears This functionality inadvertently disappeared in the last patch. Image snapshots can get removed at just about any time. In particular it can disappear even if it is in use by an rbd client as a mapped image. The rbd client deals with such a disappearance by responding to new requests with ENXIO. This is implemented by each rbd device maintaining an EXISTS flag, which is normally set but cleared if a snapshot disappears. This patch (re-)implements the clearing of that flag. Whenever mapped image header information is refreshed, if the mapping is for a snapshot, verify the mapped snapshot is still present in the updated snapshot context. If it is not, clear the flag. It is not necessary to check this in the initial probe, because the probe will not succeed if the snapshot doesn't exist. This resolves: http://tracker.ceph.com/issues/4880 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:57:03 -05:00
Alex Elder	33dca39f5c	rbd: kill off the snapshot list We no longer use the snapshot list for anything. When we need to look up a snapshot name, id, size, or feature mask, we just do it directly rather than relying on this list being updated with every refresh. The main reason it existed was for the benefit of the device/sysfs entries that previously were associated with snapshots. So get rid of the snapshot list, and struct rbd_snap, and the hundreds of lines of code that supported them. This resolves: http://tracker.ceph.com/issues/4868 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:22 -07:00
Alex Elder	2ad3d7167e	rbd: define rbd_snap_size() and rbd_snap_features() This patch defines a handful of new functions that will allow us to get rid of the rbd device structure's list of snapshots. Define rbd_snap_id_by_name() to look up a snapshot id given its name. This is efficient for format 1 images but not for format 2. Fortunately it only gets called at mapping time so it's not that critical. Use rbd_snap_id_by_name() to find out the id for a snapshot getting mapped, and pass that id to new functions rbd_snap_size() and rbd_snap_features() to look up information about a given snapshot's size and feature mask given its snapshot id. All this gets done in rbd_dev_mapping_set(). As a result, snap_by_name() is no longer needed, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:20 -07:00
Alex Elder	54cac61fb6	rbd: use snap_id not index to look up snap info In order to align with what was needed for format 1 rbd images, rbd_dev_v2_snap_info() was set up to take as argument an index into the array of snapshot ids in a rbd device's snapshot context. This switches that around, so we pass the snapshot id instead. In doing this, rbd_snap_name() now returns a dynamically-allocated string rather than a fixed one, so there's no need to make a duplicate in its caller, rbd_dev_spec_update(). This means the following functions take a snapshot id where they previously used an index value: rbd_dev_snap_info() rbd_dev_v1_snap_info() rbd_dev_v2_snap_info() A new function, rbd_dev_snap_index(), determines the snap index for format 1 images and uses it to look up the name. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:19 -07:00
Alex Elder	9682fc6d3a	rbd: look up snapshot name in names buffer Rather than scanning the list of snapshot structures for it, scan the snapshot context buffer containing snapshot names in order to determine for a format 1 image the name associated with a given snapshot id. Pull out the part of rbd_dev_v1_snap_info() that does this scan into a new function, _rbd_dev_v1_snap_name(). Have that function return a dynamically-allocated copy of the name, and don't duplicate it in rbd_dev_v1_snap_info(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:18 -07:00
Alex Elder	dedc81ea84	rbd: drop obj_request->version Nothing ever uses the version field maintained in the object request structure any more, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:17 -07:00
Alex Elder	e2a58ee55b	rbd: drop rbd_obj_method_sync() version parameter Only NULL is passed as the version argument to rbd_obj_method_sync(), so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:16 -07:00
Alex Elder	cc4a38bdd5	rbd: more version parameter removal Continued from the last patch, more parameters that can go away because we no longer have a need to track object versions. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:15 -07:00
Alex Elder	7097f8df6e	rbd: get rid of some version parameters Several functions in rbd have parameters meant to allow the version of an object to be passed in or out. The purpose of those was to allow the version of a header object to be maintained, but we no longer do that. As a result, these parameters are never actually needed or used, so get rid of them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:14 -07:00
Alex Elder	b21ebdddeb	rbd: stop tracking header object version The rbd code takes care to maintain the version of the header object. This was done in hopes of using it to detect a change in the object between reading it and setting up a watch request to be notified of changes. The mechanism was never fully implemented, however. And we now avoid the original problem by setting up the watch request before ever reading the content of the header. The osd doesn't interpret the object version supplied with a WATCH osd op, nor does it use the version supplied with a NOTIFY_ACK op (we can just supply 0 for both). There is therefore no need to maintain the header's object version any more, so stop doing so. We'll be able to simplify some more rbd code in the next few patches as a result of this. This resolves: http://tracker.ceph.com/issues/3952 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:13 -07:00
Alex Elder	cb75223d2b	rbd: snap names are pointer to constant data Make explicit that snapshot names don't change by making functions return and take parameters that that point to const qualified data. This resolves: http://tracker.ceph.com/issues/4867 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:12 -07:00
Alex Elder	a3fbe5d447	rbd: don't revalidate so much Whenever a header object event causes a mapped rbd image to refresh its header information, revalidate_disk() is being called. This was done in rbd_dev_refresh() outside the control mutex in order to avoid a lock inversion. Although a an event like this might indicate the image has changed size, most of the time it does not. Record the image size before and after the refresh, and only call revalidate_disk() if it changes. This resolves: http://tracker.ceph.com/issues/4867 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:11 -07:00
Alex Elder	96882f55c4	rbd: fix up the layering warning message A warning gets spewed for any image being probed, including parent images. Set up a condition such that the warning message only gets printed for the image being mapped, not any of its parents. Also, I didn't like the way the warning ended up being so long. Make it a terse warning instead. People experimenting with layering will know what the message means. This is part of: http://tracker.ceph.com/issues/4867 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:10 -07:00
Alex Elder	812164f8c3	ceph: use ceph_create_snap_context() Now that we have a library routine to create snap contexts, use it. This is part of: http://tracker.ceph.com/issues/4857 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:09 -07:00
Alex Elder	b536f69a3a	rbd: set up devices only for mapped images Stop setting up Linux devices during the image probe operation. Instead, set up the devices as a separate step after the image probe, in rbd_add(). A consequence of this is that only mapped images get devices assigned to them, which is pretty sweet. This resolves: http://tracker.ceph.com/issues/4774 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:07 -07:00
Alex Elder	8ad42cd0c0	rbd: don't have device release destroy rbd_dev Currently an rbd_device structure gets destroyed from the release routine for the device embedded within it. Stop doing that, instead calling rbd_dev_image_release() right after rbd_bus_del_dev() wherever the latter is called. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:05 -07:00
Alex Elder	6fd48b3be9	rbd: define rbd_dev_unprobe() Define a new function rbd_dev_unprobe() which undoes state changes that occur from calling rbd_dev_v1_probe() or rbd_dev_v2_probe(). Note that this is a superset of rbd_header_free(), which is now getting removed (it seems to have been used improperly anyway). Flesh out rbd_dev_image_release() so it undoes exactly what rbd_dev_image_probe() does. This means that: - rbd_dev_device_release() gets called when the last device reference gets dropped; - that undoes everything done by the rbd_dev_device_setup() call at the end of rbd_dev_image_probe() (and nothing more), ending by calling rbd_dev_image_release(); and - rbd_dev_image_release() undoes everything else done by rbd_dev_image_probe() (and this includes a call to rbd_dev_unprobe(). This means the image and device portions of an rbd device are fairly cleanly separated now, so error paths should be a little easier to verify than they used to be. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:04 -07:00
Alex Elder	200a6a8be5	rbd: don't destroy rbd_dev in device release function Rename rbd_dev_probe_finish() to be rbd_dev_device_setup(). Its purpose is to set up the Linux side of an rbd device mapping. Rename rbd_dev_release() to be rbd_dev_device_release(), making it more obvious it serves as the inverse of the setup function (or it will). Encapsulate some of what was done in rbd_dev_release() into a new function rbd_dev_image_release(), which serves as the inverse of setting up the ceph side of the mapped rbd image. Define a new helper rbd_dev_clear_mapping() to simply zero out the fields of a mapping structure--the inverse of rbd_dev_set_mapping(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:03 -07:00
Alex Elder	79ab7558aa	rbd: drop module later Drop the module reference at the end of rbd_remove() for symmetry with adding a reference at the top of rbd_add(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:02 -07:00
Alex Elder	b644de2ba0	rbd: set up watch in rbd_dev_image_probe() Move setting up the watch request for an image so it's done in rbd_dev_image_probe() rather than rbd_dev_probe_finish(). Move it all the way up to before doing the initial probe. This avoids a potential race condition, in which we get (and use) the initial snapshot context for an image, and it gets changed between that time and the time we get the watch set up. This resolves: http://tracker.ceph.com/issues/3871 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:01 -07:00
Alex Elder	96f03e08f9	rbd: don't bother checking whether order changes When a format 2 image is refreshed, code is in place to verify that the object order never changes from what it was originally. This relies on the fact that the refresh will occur after an initial load of information about the image. An upcoming patch makes it possible for the refresh to occur first, so we can no longer make this order check. The order really can't ever change anyway--this was just a sanity check. So get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:00 -07:00
Alex Elder	0d8189e175	rbd: don't clean up watch in device release function Currently, a watch on an rbd device header object gets torn down when its final Linux device reference gets dropped. Instead, tear it down when removing the device. If an error occurs cleaning up the watch event when unmapping, abort the unmap request. All images (including parents) still get watch requests set up, so tear these down also, in rbd_dev_remove_parent(). For now, ignore any errors that occur in this case. Get rid of local variable "rc" in rbd_remove(); use "ret" instead (they both somehow ended up defined in the function and only one is needed). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:59 -07:00
Alex Elder	332bb12db9	rbd: define rbd_header_name() Define a new function rbd_header_name(), which allocates and formats the name of the header object for the rbd device. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:58 -07:00
Alex Elder	9bb81c9be9	rbd: move more initialization into rbd_dev_image_probe() Move a block of initialization related to the "ceph-side" of an rbd image out of rbd_dev_probe_finish() and into rbd_dev_image_probe(). Add appropriate error handling to clean things up in the event any of these new functions return an error. We know that rbd_dev_snaps_update(), rbd_dev_spec_update(), and rbd_dev_probe_parent() all clean up after themselves before they return an error, so no special cleanup is required except when an earlier call succeeds. Since rbd_dev_spec_update() only updates the spec field (whose cleanup will be handled by dropping the last reference to the spec) there is no cleanup action associatied with that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:57 -07:00
Alex Elder	5de10f3b0c	rbd: probe for the parent earlier Probe for a parent device earlier in rbd_dev_probe_finish(), before starting to set up the Linux side of the rbd device. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:56 -07:00
Alex Elder	2e93bf9e46	rbd: remove parent devices on probe error When an error occurs while finishing probing a device it is assumed that parent devices get cleaned up when deleting a device. They don't. Add a call to clean them up. Note that this means the parent spec will already be cleaned up so it doesn't have to be in one of the rbd_add() error paths. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:55 -07:00
Alex Elder	ad945fc1da	rbd: fix rbd_dev_remove_parent() In certain error paths, it is possible for an rbd device to have a parent spec but no parent rbd_dev. In rbd_dev_remove_parent() use the parent field rather than parent_spec in determining whether to try to remove any parent devices. Use assertions to indicate that any non-null parent pointer has parent_spec associated with it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:54 -07:00
Alex Elder	b480815a17	rbd: kill __rbd_remove() The function __rbd_remove() is used in two spots, and it's fairly simple. It combines cleanup of part of the ceph-side state as well as cleaning up the Linux-side state. Just open code it in the two callers and eliminate the function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:53 -07:00
Alex Elder	d1cf578845	rbd: set mapping info earlier Set the mapping size and features earlier in rbd_dev_probe_finish(). Define rbd_dev_mapping_clear() as an inverse for setting those fields, and use it both in error handling in rbd_dev_image_probe() and in the final cleanup in rbd_dev_release(). Change the name of rbd_dev_set_mapping() to of rbd_dev_mapping_set(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:51 -07:00
Alex Elder	05a46afdc7	rbd: encapsulate removing parent devices Encapsulate the code that removes an rbd device's parent images into a new function, rbd_dev_remove_parent(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:50 -07:00
Alex Elder	124afba25d	rbd: encapsulate probing for parent devices Encapsulate the code that probes for an rbd device's parent images into a new function, rbd_dev_probe_parent(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:49 -07:00
Alex Elder	b5156e76da	rbd: defer setting disk capacity Don't set the disk capacity until right before we announce the device as available for use. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:48 -07:00
Alex Elder	129b79d449	rbd: only set device exists flag when ready Hold off setting the EXISTS rbd device flag until just before we announce the disk as available for use. There's no point in doing so any earlier than that, and at that point the device truly is fully set up and ready to use. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:47 -07:00
Alex Elder	fc71d8330e	rbd: fix up some sysfs stuff This just tweaks a few things in the routines that implement rbd sysfs files. All of the entries for an rbd device in /sys/bus/rbd/devices/<id>/ will represent information whose valid values are known by the time they are accessible. Right now we get the size of the mapped image by a call to get_capacity(). There's no need to do this, because that will return what we last set the capacity to, which is just the size recorded for the mapping. So just show that value instead. We also get this under protection of the header semaphore, in order to provide a precisely correct value. This isn't really necessary; these files are really informational only and it's not necessary to be so careful. Finally, print a special value in case the major device number is not recorded. Right now that won't matter much but soon the parent images won't have devices associated with them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:46 -07:00
Alex Elder	e28626a08b	rbd: fix a bug in resizing a mapping When a snapshot context update occurs, rbd_update_mapping_size() is called to set the capacity of the disk to record the updated size of the image in case it has changed. There's a bug though. The mapping size is in units of bytes. The code that updates the mapping size field is assigning a value that has been scaled down to sectors. Fix that. Also, check to see if the size has actually changed, and don't bother updating things (specifically, calling set_capacity()) if it has not. This resolves: http://tracker.ceph.com/issues/4833 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:45 -07:00
Alex Elder	2e9f7f1c0d	rbd: refactor rbd_dev_probe_update_spec() Fairly straightforward refactoring of rbd_dev_probe_update_spec(). The name is changed to rbd_dev_spec_update(). Rearrange it so nothing gets assigned to the spec until all of the names have been successfully acquired. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:44 -07:00
Alex Elder	71f293e26e	rbd: rename rbd_dev_probe() Rename rbd_dev_probe() to be rbd_dev_image_probe(). Its purpose will eventually be to probe for the existence of a valid rbd image for the rbd device--focusing only on the ceph side and not the Linux device side of initialization. For now the two "sides" are not fully separated, and this function is still the entry point for initializing the full rbd device. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:43 -07:00
Alex Elder	9f5dffdc8f	rbd: make rbd_dev_destroy() match rbd_dev_create() Currently, rbd_dev_destroy() does more than just the inverse of what rbd_dev_create() does. Stop doing that, and move the two extra things it does into the three call sites. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:42 -07:00
Alex Elder	468521c1b1	rbd: define rbd snap context routines Encapsulate the creation of a snapshot context for rbd in a new function rbd_snap_context_create(). Define rbd wrappers for getting and dropping references to them once they're created. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:41 -07:00
Alex Elder	c0cd10db46	rbd: use rbd_warn(), not WARN_ON() Change some calls to WARN_ON() so they use rbd_warn() instead, so we get consistent messaging. A few remain but they can probably just go away eventually. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:40 -07:00
Alex Elder	500d0c0fbb	rbd: move stripe_unit and stripe_count into header This commit added fetching if fancy striping parameters: 09186ddb rbd: get and check striping parameters They are almost unused, but the two fields storing the information really belonged in the rbd_image_header structure. This patch moves them there. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:39 -07:00
Alex Elder	ecb4dc2256	rbd: make rbd spec names pointer to const Make the names and image id in an rbd_spec be pointers to constant data. This required the use of a local variable to hold the snapshot name in rbd_add_parse_args() to avoid a warning. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:37 -07:00
Alex Elder	e1d4213f09	rbd: set snapshot id in rbd_dev_probe_update_spec() Set the rbd spec's snapshot id for an image getting mapped in rbd_dev_probe_update_spec() rather than rbd_dev_set_mapping(). This is the more logical place for that to happen (even though it means we might look up the snapshot by name twice). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:36 -07:00
Alex Elder	8b0241f85a	rbd: have snap_by_name() return a snapshot A function called snap_by_name() ought to just look up a snapshot by name. It does that, but then it assigns some stuff to the rbd device structure as well. Change the function to do just the lookup, and have the caller do the assignments that follow. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:35 -07:00
Alex Elder	5655c4d940	rbd: fix image id leak in initial probe If a format 2 image id is found for an image being mapped, but the subsequent probe of the image fails, rbd_dev_probe() quits without freeing the image id. Fix that. Also drop a redundant hunk of code in rbd_dev_image_id(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:34 -07:00
Alex Elder	c0fba36880	rbd: have rbd_dev_image_id() set format 1 image id Currently, rbd_dev_probe() assumes that any error returned by rbd_dev_image_id() is most likely -ENOENT, and responds by calling the format 1 probe routine, rbd_dev_v1_probe(). Then, at the top of rbd_dev_v1_probe(), an empty string is allocated for the image id. This is sort of unbalanced. Fix this by having rbd_dev_image_id() look for -ENOENT from its "get_id" method call. If that is seen, have it allocate the empty string there rather than depending on rbd_dev_v1_probe() to do it. Given that this is effectively defining the format of the image, set rbd_dev->image_format inside rbd_dev_image_id() rather than in the format-specific probe routines. Also drop a redundant hunk of code in rbd_dev_image_id(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:33 -07:00
Alex Elder	a0cab92432	rbd: avoid dropping extra reference in rbd_free_disk() I found during some failure injection testing that the call to rbd_free_disk() in the error path of rbd_dev_probe_finish() was dropping an extra reference to the disk queue. The problem occurred when put_disk tried to drop a reference to the disk's queue. A call to blk_cleanup_queue() just prior to that will have also dropped a reference to the queue. The problem is that the reference dropped by put_disk() is assumed to have been taken by add_disk(). Our code has error paths that can occur after the disk and its queue are initialized, but before the call to add_disk(), and in those paths we won't have that extra reference. The fix is easy though. In rbd_free_disk() we're already checking the disk's GENHD_FL_UP flag. That flag is an indication that add_disk() has been called, so just call blk_cleanup_queue() conditional on that flag being set. This resolves: http://tracker.ceph.com/issues/4800 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:32 -07:00
Alex Elder	f40eb349e0	rbd: use rbd_obj_method_sync() return value Now that rbd_obj_method_sync() returns the number of bytes returned by the method call, that value should be used by callers to ensure we don't overrun the valid portion of the buffer. Fix the two spots that remained that weren't doing that, rbd_dev_image_name() and rbd_dev_v2_snap_name(). Rearrange the error path slightly in rbd_dev_v2_snap_name(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:31 -07:00
Alex Elder	6e584f5244	rbd: fix leak of format 2 snapshot names When the snapshot context for an rbd device gets updated (or the initial one is recorded) a a list of snapshot structures is created to represent them, one entry per snapshot. Each entry includes a dynamically-allocated copy of the snapshot name. Currently the name is allocated in rbd_snap_create(), as a duplicate of the passed-in name. For format 1 images, the snapshot name provided is just a pointer to an existing name. But for format 2 images, the passed-in name is already dynamically allocated, and in the the process of duplicating it here we are leaking the passed-in name. Fix this by dynamically allocating the name for format 1 snapshots also, and then stop allocating a duplicate in rbd_snap_create(). Change rbd_dev_v1_snap_info() so none of its parameters is side-effected unless it's going to return success. This is part of: http://tracker.ceph.com/issues/4803 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:30 -07:00
Alex Elder	6087b51b9e	rbd: rename __rbd_add_snap_dev() Rename __rbd_add_snap_dev() to be rbd_snap_create(). We no longer have devices for non-mapped snapshots, and we're not actually "adding" it to the list in this function, just creating it. Rename rbd_remove_snap_dev() to be rbd_snap_destroy() for reasons similar to the above. Stop having this function delete the snapshot from its list (to be symmetrical with its create counterpart) and do that in the caller instead. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:29 -07:00
Alex Elder	acb1b6caf1	rbd: only update values on snap_info success Change rbd_dev_v2_snap_info() so it only ever sets values of the size and features parameters if looking up the snapshot name was successful. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:28 -07:00
Alex Elder	c86f86e9e7	rbd: make snap_size order parameter optional Only one of the two callers of _rbd_dev_v2_snap_size() needs the order value returned. So make that an optional argument--a null pointer if the caller doesn't need it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:27 -07:00
Alex Elder	522a0cc0f0	rbd: fix leak of snapshots during initial probe When an rbd image is initially mapped, its snapshot context is collected, and then a list of snapshot entries representing the snapshots in that context is created. The list is created using rbd_dev_snaps_update(). (This function also supports updating an existing snapshot list based on a new snapshot context.) If an error occurs, updating the list is aborted, and the list is currently left as-is, in an inconsistent state. At that point, there may be a partially-constructed list, but the calling functions (rbd_dev_probe_finish() from rbd_dev_probe() from rbd_add()) never clean them up. So this constitutes a leak. A snapshot list that is inconsistent with the current snapshot context is of no use, and might even be actively bad. So rather than just having the caller clean it up, have rbd_dev_snaps_update() just clear out the entire snapshot list in the event an error occurs. The other place rbd_dev_snaps_update() is used is when a refresh is triggered, either because of a watch callback or via a write to the /sys/bus/rbd/devices/<id>/refresh interface. An error while updating the snapshots has no substantive effect in either of those cases, but one of them issues a warning. Move that warning to the common rbd_dev_refresh() function so it gets issued regardless of how it got initiated. This is part of: http://tracker.ceph.com/issues/4803 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:26 -07:00
Alex Elder	3e83b65bb9	rbd: don't create sysfs entries for non-mapped snapshots When an rbd image gets mapped a device entry gets created for it under /sys/bus/rbd/devices/<id>/. Inside that directory there are sysfs files that contain information about the image: its size, feature bits, major device number, and so on. Additionally, if that image has any snapshots, a device entry gets created for each of those as a "child" of the mapped device. Each of these is a subdirectory of the mapped device, and each directory contains a few files with information about the snapshot (its snapshot id, size, and feature mask). There is no clear benefit to having those device entries for the snapshots. The information provided via sysfs of of little real value--and all of it is available via rbd CLI commands. If we still wanted to see the kernel's view of this information it could be done much more simply by including it in a single sysfs file for the mapped image. But there is a clear cost to supporting them. Every time a snapshot context changes, these entries need to be updated (deleted snapshots removed, new snapshots created). The rbd driver is notified of changes to the snapshot context via callbacks from an osd, and care must be taken to coordinate removal of snapshot data structures with the possibility of one these notifications occurring. Things would be considerably simpler if we just didn't have to maintain device entries for the snapshots. So get rid of them. The ability to map a snapshot of an rbd image will remain; the only thing lost will be the ability to query these sysfs directories for information about snapshots of mapped images. This resolves: http://tracker.ceph.com/issues/4796 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:25 -07:00
Alex Elder	770eba6e29	rbd: activate support for layered images Now that we have most everything in place to support layered rbd images, enable support for them in the kernel client. Issue a warning to the log that the support is considered experimental whenever a format 2 layered image is mapped. Note that we also have to claim to support the STRIPINGV2 feature, due to a mistake in the way the rbd CLI set up those flags. This feature can work if it has the right parameters, and safeguards have been put in place to reject those images that do not have compatible parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:23 -07:00
Alex Elder	cc070d59bc	rbd: get and check striping parameters If an rbd format 2 image indicates it supports the STRIPINGV2 feature we need to find out its stripe unit and stripe count in order to know whether we can use it. We don't yet support fancy striping fully, but if the default parameters are used the behavior is indistinguishible from non-fancy striping. This is necessary because some images require the STRIPINGV2 feature even if they use the default parameters. (Which is to say the feature bit was erroneously set even if the feature was not used.) This resolves: http://tracker.ceph.com/issues/4709 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:21 -07:00
Alex Elder	57385b51c3	rbd: have rbd_obj_method_sync() return transfer count Callers of rbd_obj_method_sync() don't know how many bytes of data got returned by the class method call. As a result, they have been assuming enough got returned to decode whatever was expected. This isn't safe. We know how many bytes got transferred, so have rbd_obj_method_sync() return that amount (rather than just 0) if the call is successful. Change all callers to use this return value to ensure decoding of the results is done safely. On the other hand, most callers of rbd_obj_method_sync() only indicate success or failure, so all of their callers can simply test for non-zero result. This resolves: http://tracker.ceph.com/issues/4773 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:20 -07:00
Alex Elder	4157976b27	rbd: void data pointers for rbd_obj_method_sync() Make the inbound and outbound data parameters have void rather than character type for rbd_obj_method_sync(). This makes it more clear they don't expect typed data, and eliminates the need for some silly type casts. One more unrelated change: define the features buffer used in _rbd_dev_v2_snap_features() to be a packed data structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:19 -07:00
Alex Elder	80ef15bf71	rbd: give rbd_obj_read_sync() buffer void type Make the buf parameter into which the data is to be read have type void pointer. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:18 -07:00
Alex Elder	a9e8ba2cb3	rbd: enforce parent overlap A clone image has a defined overlap point with its parent image. That is the byte offset beyond which the parent image has no defined data to back the clone, and anything thereafter can be viewed as being zero-filled by the clone image. This is needed because a clone image can be resized. If it gets resized larger than the snapshot it is based on, the overlap defines the original size. If the clone gets resized downward below the original size the new clone size defines the overlap. If the clone is subsequently resized to be larger, the overlap won't be increased because the previous resize invalidated any parent data beyond that point. This resolves: http://tracker.ceph.com/issues/4724 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:15 -07:00
Alex Elder	0eefd470f0	rbd: issue a copyup for layered writes This implements the main copyup functionality for layered writes. Here we add a copyup_pages field to the object request, which is used only for copyup requests to keep track of the page array containing data read from the parent image. A copyup request is currently the only request rbd has that requires two osd operations. Because of this we handle copyup specially. All image object requests get an osd request allocated when they are created. For a write request, if a copyup is required, the osd request originally allocated is released, and a new one (with room for two osd ops) is allocated to replace it. A new function rbd_osd_req_create_copyup() allocates an osd request suitable for a copyup request. The first op is then filled with a copyup object class method call, supplying the array of pages containing data read from the parent. The second op is filled in with the original write request. The original request otherwise remains intact, and it describes the original write request (found in the second osd op). The presence of the copyup op is sort of implicit; a non-null copyup_pages field could be used to distinguish between a "normal" write request and a request containing both a copyup call and a write. This resolves: http://tracker.ceph.com/issues/3419 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:14 -07:00
Alex Elder	3d7efd18d9	rbd: implement full object parent reads As a step toward implementing layered writes, implement reading the data for a target object from the parent image for a write request whose target object is known to not exist. Add a copyup_pages field to an image request to track the page array used (only) for such a request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:13 -07:00
Laurent Barbe	d98df63ea7	rbd: revalidate_disk upon rbd resize If rbd disk is open and rbd resize is done, new size is not visible by filesystem. Like is done in virtio-blk and dm driver, revalidate_disk() permits to update the bd_inode size. Signed-off-by: Laurent Barbe <laurent@ksperis.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-05-01 21:19:12 -07:00
Alex Elder	f1a4739f33	rbd: support page array image requests This patch adds the ability to build an image request whose data will be written from or read into memory described by a page array. (Previously only bio lists were supported.) Originally this was going to define a new function for this purpose but it was largely identical to the rbd_img_request_fill_bio(). So instead, rbd_img_request_fill_bio() has been generalized to handle both types of image request. For the moment we still only fill image requests with bio data. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:11 -07:00
Alex Elder	b9434c5b43	rbd: define zero_pages() Define a new function zero_pages() that zeroes a range of memory defined by a page array, along the lines of zero_bio_chain(). It saves and the irq flags like bvec_kmap_irq() does, though I'm not sure at this point that it's necessary. Update rbd_img_obj_request_read_callback() to use the new function if the object request contains page rather than bio data. For the moment, only bio data is used for osd READ ops. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:10 -07:00
Alex Elder	b454e36d26	rbd: encapsulate submission of image object requests Object requests that are part of an image request are subject to some additional handling. Define rbd_img_obj_request_submit() to encapsulate that, and use it when initially submitting an image object request, and when re-submitting it during callback of an object existence check. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:08 -07:00
Alex Elder	9d4df01f08	rbd: define separate read and write format funcs Separate rbd_osd_req_format() into two functions, one for read requests and the other for write requests. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:07 -07:00
Alex Elder	c5b5ef6c51	rbd: issue stat request before layered write This is a step toward fully implementing layered writes. Add checks before request submission for the object(s) associated with an image request. For write requests, if we don't know that the target object exists, issue a STAT request to find out. When that request completes, mark the known and exists flags for the original object request accordingly and re-submit the object request. (Note that this still does the existence check only; the copyup operation is not yet done.) A new object request is created to perform the existence check. A pointer to the original request is added to that object request to allow the stat request to re-issue the original request after updating its flags. If there is a failure with the stat request the error code is stored with the original request, which is then completed. This resolves: http://tracker.ceph.com/issues/3418 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:04 -07:00
Alex Elder	5679c59f60	rbd: add target object existence flags This creates two new flags for object requests to indicate what is known about the existence of the object to which a request is to be sent. The KNOWN flag will be true if the the EXISTS flag is meaningful. That is: KNOWN EXISTS ----- ------ 0 0 don't know whether the object exists 0 1 (not used/invalid) 1 0 object is known to not exist 1 0 object is known to exist This will be used in determining how to handle write requests for data objects for layered rbd images. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:03 -07:00
Alex Elder	57acbaa7fb	rbd: always check IMG_DATA flag In a few spots, whether the an object request's img_request pointer is null is used to determine whether an object request is being done as part of an image data request. Stop doing that, and instead always use the object request IMG_DATA flag for that purpose. Swap the order of the definition of the IMG_DATA and DONE flag helpers, because obj_request_done_set() now refers to obj_request_img_data_set() to get its rbd_dev value. This will become important because the img_request pointer is about to become part of a union. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:02 -07:00
Alex Elder	b155e86cf6	rbd: adjust image object request ref counting An extra reference is taken when an object request is added as one of the requests making up an image object. A reference is dropped again when the image's object requests get submitted. The original reference for the object request will remain throughout this period, so we don't need to add and then take away an extra one. This can be interpreted as the image request inheriting the original object request's reference. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:19:01 -07:00
Alex Elder	406e2c9f92	libceph: kill off osd data write_request parameters In the incremental move toward supporting distinct data items in an osd request some of the functions had "write_request" parameters to indicate, basically, whether the data belonged to in_data or the out_data. Now that we maintain the data fields in the op structure there is no need to indicate the direction, so get rid of the "write_request" parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:58 -07:00
Alex Elder	8b3e1a5698	rbd: implement layered reads Implement layered read requests for format 2 rbd images. If an rbd image is a clone of a snapshot, the snapshot will be the clone's "parent" image. When an object read request on a clone comes back with ENOENT it indicates that the clone is not yet populated with that portion of the image's data, and the parent image should be consulted to satisfy the read. When this occurs, a new image request is created, directed to the parent image. The offset and length of the image are the same as the image-relative offset and length of the object request that produced ENOENT. Data from the parent image therefore satisfies the object read request for the original image request. While this code works, it will not be active until we enable the layering feature (by adding RBD_FEATURE_LAYERING to the value of RBD_FEATURES_SUPPORTED). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:48 -07:00
Alex Elder	2f82ee54d9	rbd: probe the parent of an image if present Call the probe function for the parent device if one is present. Since we don't formally support the layering feature we won't be using this functionality just yet. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:47 -07:00
Alex Elder	6365d33a27	rbd: add an object request flag for image data objects Add a flag to distinguish between object requests being done on standalone objects and requests being sent for objects representing rbd image data (i.e., object requests that are the result of image request). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:46 -07:00
Alex Elder	926f9b3f08	rbd: define an rbd object request flags field We're going to need some more Boolean values for object requests, so create a flags bit field and use it to record whether the request is done. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:44 -07:00
Alex Elder	1217857fbf	rbd: encapsulate image object end request handling Encapsulate the code that completes processing of an object request that's part of an image request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:43 -07:00
Alex Elder	d0b2e94455	rbd: define image request layered flag Define a flag indicating whether an image request is for a layered image (one with a parent image to which requests will be redirected if the target object of a request does not exist). The code that checks this flag will be added shortly. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:42 -07:00
Alex Elder	9849e98636	rbd: define image request originator flag Define a flag indicating whether an image request originated from the Linux block layer (from blk_fetch_request()) or whether it was initiated in order to satisfy an object request for a child image of a layered rbd device. For image requests initiated by objects of child images we'll save a pointer to the object request rather than the Linux block request. For now, only block requests are used. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:41 -07:00
Alex Elder	0c425248e0	rbd: define image request flags There are several Boolean values we'll be maintaining for image requests. Switch from the single write_request field to a general-purpose flags field, and use one if its bits to represent the direction of I/O for the image request. Define helper functions for setting and testing that flag. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:40 -07:00
Alex Elder	7da22d296d	rbd: record image-relative offset in object requests For an image object request we will need to know what offset within the rbd image the request covers. Record that when the object request gets created. Update the I/O error warnings so they use this so what's reported is more informative. Rename a local variable to fit the convention used everywhere else. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:39 -07:00
Alex Elder	55f27e0931	rbd: record aggregate image transfer count Compute the total number of bytes transferred for an image request--the sum across each of the request's object requests. To avoid contention do it only when all object requests are complete, in rbd_img_request_complete(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:38 -07:00
Alex Elder	a5a337d438	rbd: record overall image request result If any image object request produces a non-zero result, preserve that as the result of the overall image request. If multiple objects have non-zero results, save only the first one. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:37 -07:00
Alex Elder	5cbf6f12c4	rbd: update feature bits There is a new rbd feature bit defined for "fancy striping." Add it to the ones defined in the kernel client. Change RBD_FEATURES_ALL so it represents the set of all feature bits (rather than just the ones we support). Define a new symbol RBD_FEATURES_SUPPORTED to indicate the supported ones. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:36 -07:00
Alex Elder	04017e29bb	libceph: make method call data be a separate data item Right now the data for a method call is specified via a pointer and length, and it's copied--along with the class and method name--into a pagelist data item to be sent to the osd. Instead, encode the data in a data item separate from the class and method names. This will allow large amounts of data to be supplied to methods without copying. Only rbd uses the class functionality right now, and when it really needs this it will probably need to use a page array rather than a page list. But this simple implementation demonstrates the functionality on the osd client, and that's enough for now. This resolves: http://tracker.ceph.com/issues/4104 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:35 -07:00
Alex Elder	a4ce40a9a7	libceph: combine initializing and setting osd data This ends up being a rather large patch but what it's doing is somewhat straightforward. Basically, this is replacing two calls with one. The first of the two calls is initializing a struct ceph_osd_data with data (either a page array, a page list, or a bio list); the second is setting an osd request op so it associates that data with one of the op's parameters. In place of those two will be a single function that initializes the op directly. That means we sort of fan out a set of the needed functions: - extent ops with pages data - extent ops with pagelist data - extent ops with bio list data and - class ops with page data for receiving a response We also have define another one, but it's only used internally: - class ops with pagelist data for request parameters Note that we still haven't gotten rid of the osd request's r_data_in and r_data_out fields. All the osd ops refer to them for their data. For now, these data fields are pointers assigned to the appropriate r_data_* field when these new functions are called. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:23 -07:00
Alex Elder	2169238dd3	rbd: rearrange some code for consistency This patch just trivially moves around some code for consistency. In preparation for initializing osd request data fields in ceph_osdc_build_request(), I wanted to verify that rbd did in fact call that immediately before it called ceph_osdc_start_request(). It was true (although image requests are built in a group and then started as a group). But I made the changes here just to make it more obvious, by making all of the calls follow a common sequence: osd_req_op_<optype>_init(); ceph_osd_data_<type>_init() osd_req_op_<optype>_<datafield>() rbd_osd_req_format() ... ret = rbd_obj_request_submit() I moved the initialization of the callback for image object requests into rbd_img_request_fill_bio(), again, for consistency. To avoid a forward reference, I moved the definition of rbd_img_obj_callback() up in the file. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:18 -07:00
Alex Elder	44cd188d48	rbd: separate initialization of osd data The osd data for a request is currently initialized inside rbd_osd_req_create(), but that assumes an object request's data belongs in the osd request's data in or data out field. There are only three places where requests with data are set up, and it turns out it's easier to call just the osd data init routines directly there rather than handling it in rbd_osd_req_create(). (The real motivation here is moving toward getting rid of the osd request in and out data fields.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:17 -07:00
Alex Elder	2fa123201a	rbd: don't set data in rbd_osd_req_format_op() Currently an object request has its osd request's data field set in rbd_osd_req_format_op(). That assumes a single osd op per object request, and that won't be the case for long. Move the code that sets this out and into the caller. Rename rbd_osd_req_format_op() to be just rbd_osd_req_format(), removing the notion that it's doing anything op-specific. This and the next patch resolve: http://tracker.ceph.com/issues/4658 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:16 -07:00
Alex Elder	c99d2d4abb	libceph: specify osd op by index in request An osd request now holds all of its source op structures, and every place that initializes one of these is in fact initializing one of the entries in the the osd request's array. So rather than supplying the address of the op to initialize, have caller specify the osd request and an indication of which op it would like to initialize. This better hides the details the op structure (and faciltates moving the data pointers they use). Since osd_req_op_init() is a common routine, and it's not used outside the osd client code, give it static scope. Also make it return the address of the specified op (so all the other init routines don't have to repeat that code). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:15 -07:00
Alex Elder	8c042b0df9	libceph: add data pointers in osd op structures An extent type osd operation currently implies that there will be corresponding data supplied in the data portion of the request (for write) or response (for read) message. Similarly, an osd class method operation implies a data item will be supplied to receive the response data from the operation. Add a ceph_osd_data pointer to each of those structures, and assign it to point to eithre the incoming or the outgoing data structure in the osd message. The data is not always available when an op is initially set up, so add two new functions to allow setting them after the op has been initialized. Begin to make use of the data item pointer available in the osd operation rather than the request data in or out structure in places where it's convenient. Add some assertions to verify pointers are always set the way they're expected to be. This is a sort of stepping stone toward really moving the data into the osd request ops, to allow for some validation before making that jump. This is the first in a series of patches that resolve: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:14 -07:00
Alex Elder	79528734f3	libceph: keep source rather than message osd op array An osd request keeps a pointer to the osd operations (ops) array that it builds in its request message. In order to allow each op in the array to have its own distinct data, we will need to keep track of each op's data, and that information does not go over the wire. As long as we're tracking the data we might as well just track the entire (source) op definition for each of the ops. And if we're doing that, we'll have no more need to keep a pointer to the wire-encoded version. This patch makes the array of source ops be kept with the osd request structure, and uses that instead of the version encoded in the message in places where that was previously used. The array will be embedded in the request structure, and the maximum number of ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP to 2 to reduce the size of the structure. The result of doing this sort of ripples back up, and as a result various function parameters and local variables become unnecessary. Make r_num_ops be unsigned, and move the definition of struct ceph_osd_req_op earlier to ensure it's defined where needed. It does not yet add per-op data, that's coming soon. This resolves: http://tracker.ceph.com/issues/4656 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:12 -07:00
Alex Elder	430c28c3cb	rbd: define rbd_osd_req_format_op() Define rbd_osd_req_format_op(), which encapsulates formatting an osd op into an object request's osd request message. Only one op is supported right now. Stop calling ceph_osdc_build_request() in rbd_osd_req_create(). Instead, call rbd_osd_req_format_op() in each of the callers of rbd_osd_req_create(). This is to prepare for the next patch, in which the source ops for an osd request will be held in the osd request itself. Because of that, we won't have the source op to work with until after the request is created, so we can't format the op until then. This an the next patch resolve: http://tracker.ceph.com/issues/4656 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:11 -07:00
Alex Elder	43bfe5de9f	libceph: define osd data initialization helpers Define and use functions that encapsulate the initializion of a ceph_osd_data structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:06 -07:00
Alex Elder	6010a451c3	rbd: define inbound data size for method ops When rbd creates an object request containing an object method call operation it is passing 0 for the size. I originally thought this was because the length was not needed for method calls, but I think it really should be supplied, to describe how much space is available to receive response data. So provide the supplied length. This resolves: http://tracker.ceph.com/issues/4659 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:18:04 -07:00
Alex Elder	fdce58ccb5	libceph: record length of bio list with bio When assigning a bio pointer to an osd request, we don't have an efficient way of knowing the total length bytes in the bio list. That information is available at the point it's set up by the rbd code, so record it with the osd data when it's set. This and the next patch are related to maintaining the length of a message's data independent of the message header, as described here: http://tracker.ceph.com/issues/4589 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:56 -07:00
Alex Elder	33803f3300	libceph: define source request op functions The rbd code has a function that allocates and populates a ceph_osd_req_op structure (the in-core version of an osd request operation). When reviewed, Josh suggested two things: that the big varargs function might be better split into type-specific functions; and that this functionality really belongs in the osd client rather than rbd. This patch implements both of Josh's suggestions. It breaks up the rbd function into separate functions and defines them in the osd client module as exported interfaces. Unlike the rbd version, however, the functions don't allocate an osd_req_op structure; they are provided the address of one and that is initialized instead. The rbd function has been eliminated and calls to it have been replaced by calls to the new routines. The rbd code now now use a stack (struct) variable to hold the op rather than allocating and freeing it each time. For now only the capabilities used by rbd are implemented. Implementing all the other osd op types, and making the rest of the code use it will be done separately, in the next few patches. Note that only the extent, cls, and watch portions of the ceph_osd_req_op structure are currently used. Delete the others (xattr, pgls, and snap) from its definition so nobody thinks it's actually implemented or needed. We can add it back again later if needed, when we know it's been tested. This (and a few follow-on patches) resolves: http://tracker.ceph.com/issues/3861 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:45 -07:00
Alex Elder	adfe695a25	ceph: move max constant definitions Move some definitions for max integer values out of the rbd code and into the more central "decode.h" header file. These really belong in a Linux (or libc) header somewhere, but I haven't gotten around to proposing that yet. This is in preparation for moving some code out of rbd.c and into the osd client. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:42 -07:00
Alex Elder	175face2ba	libceph: let osd ops determine request data length The length of outgoing data in an osd request is dependent on the osd ops that are embedded in that request. Each op is encoded into a request message using osd_req_encode_op(), so that should be used to determine the amount of outgoing data implied by the op as it is encoded. Have osd_req_encode_op() return the number of bytes of outgoing data implied by the op being encoded, and accumulate and use that in ceph_osdc_build_request(). As a result, ceph_osdc_build_request() no longer requires its "len" parameter, so get rid of it. Using the sum of the op lengths rather than the length provided is a valid change because: - The only callers of osd ceph_osdc_build_request() are rbd and the osd client (in ceph_osdc_new_request() on behalf of the file system). - When rbd calls it, the length provided is only non-zero for write requests, and in that case the single op has the same length value as what was passed here. - When called from ceph_osdc_new_request(), (it's not all that easy to see, but) the length passed is also always the same as the extent length encoded in its (single) write op if present. This resolves: http://tracker.ceph.com/issues/4406 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:17:02 -07:00
Alex Elder	e0c594878e	libceph: record byte count not page count Record the byte count for an osd request rather than the page count. The number of pages can always be derived from the byte count (and alignment/offset) but the reverse is not true. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:36 -07:00
Alex Elder	0fff87ec79	libceph: separate read and write data An osd request defines information about where data to be read should be placed as well as where data to write comes from. Currently these are represented by common fields. Keep information about data for writing separate from data to be read by splitting these into data_in and data_out fields. This is the key patch in this whole series, in that it actually identifies which osd requests generate outgoing data and which generate incoming data. It's less obvious (currently) that an osd CALL op generates both outgoing and incoming data; that's the focus of some upcoming work. This resolves: http://tracker.ceph.com/issues/4127 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:27 -07:00
Alex Elder	2ac2b7a6d4	libceph: distinguish page and bio requests An osd request uses either pages or a bio list for its data. Use a union to record information about the two, and add a data type tag to select between them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:25 -07:00
Alex Elder	2794a82a11	libceph: separate osd request data info Pull the fields in an osd request structure that define the data for the request out into a separate structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:16:24 -07:00
Linus Torvalds	20b4fb4852	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...	2013-05-01 17:51:54 -07:00
Arjan van de Ven	564a232c05	NVMe: Set TASK_INTERRUPTIBLE before processing queues The kthread has two tasks; handling timeouts (for which it runs once per second), and submitting queued BIOs. If a BIO happens to be queued after the thread has processed the queue but before it calls schedule_timeout(), the thread will sleep for a second before submitting it, which can cause performance problems in some rare cases (that will become more common in a subsequent patch). Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-05-01 16:40:27 -04:00
Mihnea Dobrescu-Balaur	60abc786dd	aoe: replace kmalloc and then memcpy with kmemdup Signed-off-by: Mihnea Dobrescu-Balaur <mihneadb@gmail.com> Cc: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-30 17:04:08 -07:00
Michal Belczyk	078be02b80	nbd: increase default and max request sizes Raise the default max request size for nbd to 128KB (from 127KB) to get it 4KB aligned. This patch also allows the max request size to be increased (via /sys/block/nbd<x>/queue/max_sectors_kb) to 32MB. The patch makes nbd network traffic more efficient by: - reducing request fragmentation (4KB alignment) - reducing the number of requests (fewer round trips, less network overhead) Especially in high latency networks, larger request size can make a dramatic Signed-off-by: Paul Clements <paul.clements@steeleye.com> Signed-off-by: Michal Belczyk <belczyk@bsd.krakow.pl> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-30 17:04:07 -07:00
Akinobu Mita	38b682b261	drbd: rename random32() to prandom_u32() Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 18:28:42 -07:00
Mike Miller	0821e90405	cciss: bug fix to prevent cciss from loading in kdump crash kernel By default the cciss driver supports all "older" HP Smart Array controllers and hpsa supports all controllers starting with the G6 family. There are module parameters that allow a user to override those defaults and use hpsa for any HP Smart Array controller. If the user does override the default behavior and uses hpsa for older controllers it is possible that cciss may try to load in a kdump crash kernel. This may happen if cciss is loaded first from the kdump initrd image. If cciss does load rather than hpsa and reset_devices is true we immediately call cciss_hard_reset_controller. This will result in a kernel panic and the core file cannot be created. This patch prevents cciss from trying to load in this scenario. Signed-off-by: Mike Miller <mike.miller@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-29 21:24:02 +02:00
Mike Miller	e4292e05d4	cciss: add cciss_allow_hpsa module parameter Add the cciss_allow_hpsa modules parameter. This allows users to use the hpsa driver instead of cciss for older controllers. Signed-off-by: Mike Miller <mike.miller@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-29 21:24:02 +02:00
Jingoo Han	aed3d67e57	drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following build warning when CONFIG_PM_SLEEP is not selected. This is because sleep PM callbacks defined by SIMPLE_DEV_PM_OPS are only used when the CONFIG_PM_SLEEP is enabled. drivers/block/mg_disk.c:783:12: warning: 'mg_suspend' defined but not used [-Wunused-function] drivers/block/mg_disk.c:807:12: warning: 'mg_resume' defined but not used [-Wunused-function] Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-29 21:24:02 +02:00
Asai Thambi S P	2077d94726	mtip32xx: Workaround for unaligned writes Workaround for handling unaligned writes: limit number of outstanding unaligned writes Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-29 21:19:49 +02:00
Roger Pau Monne	402b27f9f2	xen-block: implement indirect descriptors Indirect descriptors introduce a new block operation (BLKIF_OP_INDIRECT) that passes grant references instead of segments in the request. This grant references are filled with arrays of blkif_request_segment_aligned, this way we can send more segments in a request. The proposed implementation sets the maximum number of indirect grefs (frames filled with blkif_request_segment_aligned) to 256 in the backend and 32 in the frontend. The value in the frontend has been chosen experimentally, and the backend value has been set to a sane value that allows expanding the maximum number of indirect descriptors in the frontend if needed. The migration code has changed from the previous implementation, in which we simply remapped the segments on the shared ring. Now the maximum number of segments allowed in a request can change depending on the backend, so we have to requeue all the requests in the ring and in the queue and split the bios in them if they are bigger than the new maximum number of segments. [v2: Fixed minor comments by Konrad. [v1: Added padding to make the indirect request 64bit aligned. Added some BUGs, comments; fixed number of indirect pages in blkif_get_x86_{32/64}_req. Added description about the indirect operation in blkif.h] Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> [v3: Fixed spaces and tabs mix ups] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-18 14:16:00 -04:00
Roger Pau Monne	31552ee32d	xen-blkback: expand map/unmap functions Preparatory change for implementing indirect descriptors. Change xen_blkbk_{map/unmap} in order to be able to map/unmap a random amount of grants (previously it was limited to BLKIF_MAX_SEGMENTS_PER_REQUEST). Also, remove the usage of pending_req in the map/unmap functions, so we can map/unmap grants without needing to pass a pending_req. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-18 09:29:25 -04:00
Roger Pau Monne	bf0720c48c	xen-blkback: make the queue of free requests per backend Remove the last dependency from blkbk by moving the list of free requests to blkif. This change reduces the contention on the list of available requests. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-18 09:29:25 -04:00
Roger Pau Monne	bb6acb289f	xen-blkback: move pending handles list from blkbk to pending_req Moving grant ref handles from blkbk to pending_req will allow us to get rid of the shared blkbk structure. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-18 09:29:24 -04:00
Roger Pau Monne	3f3aad5e66	xen-blkback: implement LRU mechanism for persistent grants This mechanism allows blkback to change the number of grants persistently mapped at run time. The algorithm uses a simple LRU mechanism that removes (if needed) the persistent grants that have not been used since the last LRU run, or if all grants have been used it removes the first grants in the list (that are not in use). The algorithm allows the user to change the maximum number of persistent grants, by changing max_persistent_grants in sysfs. Since we are storing the persistent grants used inside the request struct (to be able to mark them as "unused" when unmapping), we no longer need the bitmap (unmap_seg). Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-18 09:29:23 -04:00
Roger Pau Monne	c6cc142dac	xen-blkback: use balloon pages for all mappings Using balloon pages for all granted pages allows us to simplify the logic in blkback, especially in the xen_blkbk_map function, since now we can decide if we want to map a grant persistently or not after we have actually mapped it. This could not be done before because persistent grants used ballooned pages, whereas non-persistent grants used pages from the kernel. This patch also introduces several changes, the first one is that the list of free pages is no longer global, now each blkback instance has it's own list of free pages that can be used to map grants. Also, a run time parameter (max_buffer_pages) has been added in order to tune the maximum number of free pages each blkback instance will keep in it's buffer. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-18 09:29:22 -04:00
Roger Pau Monne	c1a15d08f4	xen-blkback: print stats about persistent grants Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-18 09:29:21 -04:00
Linus Torvalds	96d8683483	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fix from Sage Weil: "It's a simple fix for a hard to hit race, but low-risk and clearly correct" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: do a safe list traversal in rbd_img_request_submit()	2013-04-17 12:52:02 -07:00
Alex Elder	46faeed4a6	rbd: do a safe list traversal in rbd_img_request_submit() It's possible that the reference to the object request dropped inside the loop in rbd_img_request_submit() will be the last one, in which case the content of the object pointer can't be trusted. Use a safe form of the object request list traversal to avoid problems. This resolves: http://tracker.ceph.com/issues/4705 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-04-17 11:39:09 -07:00
Keith Busch	5e82e952f0	NVMe: Add a character device for each nvme device Registers a miscellaneous device for each nvme controller probed. This creates character device files as /dev/nvmeN, where N is the device instance, and supports nvme admin ioctl commands so devices without namespaces can be managed. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-04-16 15:43:55 -04:00
Matthew Wilcox	1c9b52651d	NVMe: Fix endian-related problems in user I/O submission path When constructing the command, dsmgmt needs to be treated as a 32-bit value, not a 16-bit value. reftag, apptag and appmask all need to be converted from native-endian to little-endian. Again, sparse's bitwise warnings caught this problem. Thanks to Keith for pointing out the correct way to fix the reftag. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Keith Busch <keith.busch@intel.com>	2013-04-16 15:21:06 -04:00
Matthew Wilcox	af2d9ca744	NVMe: Fix I/O cancellation status on big-endian machines The sparse bitwise checks pointed out that I needed to shift the status before changing its endianness, not after. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-04-16 15:18:30 -04:00
Vishal Verma	8741ee4cb6	NVMe: Fix sparse warnings in scsi emulation Sparse produced warnings for some instances of mismatched types and direct userspace dereferences. This patch fixes those for the scsi emulation layer. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-04-16 15:14:09 -04:00
Matthew Wilcox	422ef0c7c8	NVMe: Don't fail initialisation unnecessarily The nvme_dev_add() function currently returns the last error code that it saw, which (if everything else succeeds) happens to be the result of an optional command, so it can legitimately fail. Looking at the error path more closely reveals that we should return success from this function, even if no device namespaces are added. So once the queues are created and the device has responded to Identify, make sure that this function succeeds. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Keith Busch <keith.busch@intel.com>	2013-04-16 15:13:41 -04:00
Matthew Wilcox	063cc6d559	NVMe: Abstract out sector to block number conversion Introduce nvme_block_nr() to help convert sectors to block numbers. This fixes an integer overflow in the SCSI conversion layer, and it's slightly less typing than opencoding it. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Keith Busch <keith.busch@intel.com>	2013-04-16 15:05:22 -04:00
Arjan van de Ven	acb7aa0db0	NVMe: Use round_jiffies_relative() for the periodic, once-per-second timer The nvme driver has a "once per second" event where the management kthread wakes up the system and then reschedules itself for 1 second later. For power efficiency reasons, I'd like this timer to happen together with other wakeups in the system. This patch makes the schedule_timeout() call in the kthread use round_jiffies_relative(), causing the wakeup to at least align with other "once per X seconds" events in the kernel. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-04-16 15:05:16 -04:00
Asai Thambi S P	68466cbf92	mtip32xx: mtip32xx: Disable TRIM support Temporarily disabling TRIM support until TRIM related issues are addressed in the firmware. Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-14 10:24:24 +02:00
Asai Thambi S P	5a79e1ac21	mtip32xx: fix a smatch warning Reported smatch warning: drivers/block/mtip32xx/mtip32xx.c:4163 mtip_block_shutdown() warn: variable dereferenced before check 'dd->disk' (see line 4159) dd->disk->disk_name accessed before the check if dd->disk is NULL. Fixed this and access of dd->queue/dd->disk->queue. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-14 10:24:24 +02:00
Linus Torvalds	27387dd8c6	for-linus-20130409 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABCAAGBQJRZCnRAAoJEPfTWPspceCmQD0QAMS6a7kYln9n7ByzMa1rfm7x Mhu3NkT3I34a5+hgdiIwWTjqJ2Uu78S5pD522eAP4m9yGcG7m1c3aqdTEewDck7K xcIjgzS1j1J/XRlrtgPxJhsiA23Kzhliwztrr1YhczWLMijvIrNu2KYZ4EK4/YPB KkdGJJitunrOw4nhmnB1AYCRJoZJ5KXOpigqVr7vxTh0Ye7ue2k9mDD3x6tOp/Q0 TxKxXyEZF9yHgUkpARMcy4OEcr63APfAeiAnZOPpdK5rbCc8pVXpoadLE8UipnlI 0FcGBzuEezQE3FZrT9PL58FFNEPUmx3NWz2SwYV7Te5Gw5Zw/ffMxk8geE12ALmb YuULHGOxz97rxjFImYYGqIOQl7F6V59bYgbXegmKnWjorKEn6LxHyNI1Q76oH5V3 UgvkmTK91yNa6DvB24Vrt1yKb3GvKQApHAwEn82y+LqcZFkWq1iKqlPVNNGGjCbR j47Qu4G/6Bif0vTFC34ciI79ga/di4S4FUOCthsjskm7waDUNvrlbP4Mvwq1VPAc VZaYGrQcWCDxuAY/S3XTR+2ci/B0n4Fhyn/iFg/6S+nQawJDeiM+C3yVDbH8caq4 wV2PXEQ/r+SII4PVGFAmATewUKF3dkUpT2TKhc9KRHZh9+mFBXFj13MZ3+WpBPHl NTvEyi5nUL+MdyZGuJUi =CB+W -----END PGP SIGNATURE----- Merge tag 'for-linus-20130409' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "I've got a few smaller fixes queued up for 3.9 that should go in. The major one is the loop regression, the others are nice fixes on their own though. It contains: - Fix for unitialized var in the block sysfs code, courtesy of Arnd and gcc-4.8. - Two fixes for mtip32xx, fixing probe and command timeout. Also a debug measure that could have waited for 3.10, but it's driver only, so I let it slip in. - Revert the loop partition cleanup fix, it could cause a deadlock on auto-teardown as part of umount. The fix is clear, but at this point we just want to revert it and get a real fix in for 3.10." * tag 'for-linus-20130409' of git://git.kernel.dk/linux-block: Revert "loop: cleanup partitions when detaching loop device" mtip32xx: fix two smatch warnings mtip32xx: Add debugfs entry device_status mtip32xx: return 0 from pci probe in case of rebuild mtip32xx: recovery from command timeout block: avoid using uninitialized value in from queue_var_store	2013-04-09 12:05:41 -07:00
Al Viro	d9dda78bad	procfs: new helper - PDE_DATA(inode) The only part of proc_dir_entry the code outside of fs/proc really cares about is PDE(inode)->data. Provide a helper for that; static inline for now, eventually will be moved to fs/proc, along with the knowledge of struct proc_dir_entry layout. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:13:32 -04:00
Al Viro	e88b7bb002	cciss: switch to ->show_info() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:13:19 -04:00
Al Viro	03d95eb2f2	lift sb_start_write() out of ->write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:12:56 -04:00
Jens Axboe	c2fccc1c9f	Revert "loop: cleanup partitions when detaching loop device" This reverts commit `8761a3dc1f`. There are situations where the destruction path is called with the bdev->bd_mutex already held, which then deadlocks in loop_clr_fd(). The normal partition cleanup does a trylock() on the mutex, but it'd be nice to have a more bullet proof method in loop. So punt this more involved fix to the next merge window, and just back out this buggy fix for now. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-08 10:12:11 +02:00
Jens Axboe	c66bb3f075	mtip32xx: fix two smatch warnings Dan reports: New smatch warnings: drivers/block/mtip32xx/mtip32xx.c:2728 show_device_status() warn: variable dereferenced before check 'dd' (see line 2727) drivers/block/mtip32xx/mtip32xx.c:2758 show_device_status() warn: variable dereferenced before check 'dd' (see line 2757) which are checking if dd == NULL, in a list_for_each_entry() type loop. Get rid of the check, dd can never be NULL here. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-04 09:03:41 +02:00
Asai Thambi S P	0caff00390	mtip32xx: Add debugfs entry device_status This patch adds a new debugfs entry 'device_status' in /sys/kernel/debug/rssd. The value of this entry shows devices online and those in the process of removing. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-03 21:54:02 +02:00
Asai Thambi S P	6b06d35f3f	mtip32xx: return 0 from pci probe in case of rebuild Fix to return 0 from pci probe in case of rebuild. If not, pci consider probe has failed, and crash during rmmod. Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-03 21:54:02 +02:00
Asai Thambi S P	d0d096b1d8	mtip32xx: recovery from command timeout To recover from command timeouts, reset the device. In addition to that improved timeout handling of PIO commands. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-03 21:54:01 +02:00
Jens Axboe	64f8de4da7	Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. `e3620a3ad5` ("MD RAID5: Avoid accessing gendisk or queue structs when not available") `2f6db2a707` ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-02 10:04:39 +02:00
Anatol Pomozov	c1681bf8a7	loop: prevent bdev freeing while device in use struct block_device lifecycle is defined by its inode (see fs/block_dev.c) - block_device allocated first time we access /dev/loopXX and deallocated on bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile" we want that block_device stay alive until we destroy the loop device with "losetup -d". But because we do not hold /dev/loopXX inode its counter goes 0, and inode/bdev can be destroyed at any moment. Usually it happens at memory pressure or when user drops inode cache (like in the test below). When later in loop_clr_fd() we want to use bdev we have use-after-free error with following stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000280 bd_set_size+0x10/0xa0 loop_clr_fd+0x1f8/0x420 [loop] lo_ioctl+0x200/0x7e0 [loop] lo_compat_ioctl+0x47/0xe0 [loop] compat_blkdev_ioctl+0x341/0x1290 do_filp_open+0x42/0xa0 compat_sys_ioctl+0xc1/0xf20 do_sys_open+0x16e/0x1d0 sysenter_dispatch+0x7/0x1a To prevent use-after-free we need to grab the device in loop_set_fd() and put it later in loop_clr_fd(). The issue is reprodusible on current Linus head and v3.3. Here is the test: dd if=/dev/zero of=loop.file bs=1M count=1 while [ true ]; do losetup /dev/loop0 loop.file echo 2 > /proc/sys/vm/drop_caches losetup -d /dev/loop0 done [ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every time we call loop_set_fd() we check that loop_device->lo_state is Lo_unbound and set it to Lo_bound If somebody will try to set_fd again it will get EBUSY. And if we try to loop_clr_fd() on unbound loop device we'll get ENXIO. loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under loop_device->lo_ctl_mutex. ] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-01 15:48:47 -07:00
Linus Torvalds	ff3421dee6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) sadb_msg prepared for IPSEC userspace forgets to initialize the satype field, fix from Nicolas Dichtel. 2) Fix mac80211 synchronization during station removal, from Johannes Berg. 3) Fix IPSEC sequence number notifications when they wrap, from Steffen Klassert. 4) Fix cfg80211 wdev tracing crashes when add_virtual_intf() returns an error pointer, from Johannes Berg. 5) In mac80211, don't call into the channel context code with the interface list mutex held. From Johannes Berg. 6) In mac80211, if we don't actually associate, do not restart the STA timer, otherwise we can crash. From Ben Greear. 7) Missing dma_mapping_error() check in e1000, ixgb, and e1000e. From Christoph Paasch. 8) Fix sja1000 driver defines to not conflict with SH port, from Marc Kleine-Budde. 9) Don't call il4965_rs_use_green with a NULL station, from Colin Ian King. 10) Suspend/Resume in the FEC driver fail because the buffer descriptors are not initialized at all the moments in which they should. Fix from Frank Li. 11) cpsw and davinci_emac drivers both use the wrong interface to restart a stopped TX queue. Use netif_wake_queue not netif_start_queue, the latter is for initialization/bringup not active management of the queue. From Mugunthan V N. 12) Fix regression in rate calculations done by psched_ratecfg_precompute(), missing u64 type promotion. From Sergey Popovich. 13) Fix length overflow in tg3 VPD parsing, from Kees Cook. 14) AOE driver fails to allocate enough headroom, resulting in crashes. Fix from Eric Dumazet. 15) RX overflow happens too quickly in sky2 driver because pause packet thresholds are not programmed correctly. From Mirko Lindner. 16) Bonding driver manages arp_interval and miimon settings incorrectly, disabling one unintentionally disables both. Fix from Nikolay Aleksandrov. 17) smsc75xx drivers don't program the RX mac properly for jumbo frames. Fix from Steve Glendinning. 18) Fix off-by-one in Codel packet scheduler. From Vijay Subramanian. 19) Fix packet corruption in atl1c by disabling MSI support, from Hannes Frederic Sowa. 20) netdev_rx_handler_unregister() needs a synchronize_net() to fix crashes in bonding driver unload stress tests. From Eric Dumazet. 21) rxlen field of ks8851 RX packet descriptors not interpreted correctly (it is 12 bits not 16 bits, so needs to be masked after shifting the 32-bit value down 16 bits). Fix from Max Nekludov. 22) Fix missed RX/TX enable in sh_eth driver due to mishandling of link change indications. From Sergei Shtylyov. 23) Fix crashes during spurious ECI interrupts in sh_eth driver, also from Sergei Shtylyov. 24) dm9000 driver initialization is done wrong for revision B devices with DSP PHY, from Joseph CHANG. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits) DM9000B: driver initialization upgrade sh_eth: make 'link' field of 'struct sh_eth_private' int sh_eth: workaround for spurious ECI interrupt sh_eth: fix handling of no LINK signal ks8851: Fix interpretation of rxlen field. net: add a synchronize_net() in netdev_rx_handler_unregister() MAINTAINERS: Update netxen_nic maintainers list atl1e: drop pci-msi support because of packet corruption net: fq_codel: Fix off-by-one error net: calxedaxgmac: Wake-on-LAN fixes net: calxedaxgmac: fix rx ring handling when OOM net: core: Remove redundant call to 'nf_reset' in 'dev_forward_skb' smsc75xx: fix jumbo frame support net: fix the use of this_cpu_ptr bonding: fix disabling of arp_interval and miimon ipv6: don't accept node local multicast traffic from the wire sky2: Threshold for Pause Packet is set wrong sky2: Receive Overflows not counted aoe: reserve enough headroom on skbs line up comment for ndo_bridge_getlink ...	2013-04-01 08:06:30 -07:00
Linus Torvalds	d299c29039	for-linus-20130331 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABCAAGBQJRWHWXAAoJEPfTWPspceCmyGIQANHJlvexkzqkPsxzfA+hKi36 90ramlHmOIGLqxKk8pJLEhAJAEAEmR1sN5FfPBeiI3I7E8RT+vuPHCOCqXhAXgku 5saB294H0OGeaGsw4cxIl4KQFxBwa2PDskFq5irV4AYJd1IMolwUdyELr2wv37g1 d4vJJUeJIUBON47pZjVfV96nQ4utISMjtHLeBmvpeREcmfqn2I1qKyYcEXxDkNeX DWRIyeJ/UApCxEWbZcxFgaVNVWE/9nGg861HgnuazCu+OiwUVhfMpS+azj/dtl8G wdZLhokjXZBi9yd70h8mZ9XReIqMbTUP6k4texNrUQXgHaN87OVUiCgbzL5JBfUB Iq2bmlCkSIUOwxV9qOsv1MfNo9TJTB2ZcOZJH381BAqf/ua1ouGzZu9KLTxmalZi yIO3oTpifELxgfCV7O/HGEP1jkRTROwpRFjErqPOFx+Jr9vhT+xj/LGZYgAzaVhX 1HCXMtp8xjRBZa7TrHq/FZY2iO4fS3JZNGg0XaIVim8yHiFWfMnGxOg4TSs5rqEy AyPg3rFVufb7n9zSdRpYfgAg6gYK/pgHZ7OcyFTt44wRrGSWpMlR8TMxJREytbJx JjKlO2qRuIbBJXnoBS1J3W22Yt8NN/TaaMIoVL4GHD3fUYMbL88NugsjIZ5VKe/N /sw12PuUld2rTR+FghHV =u2RH -----END PGP SIGNATURE----- Merge tag 'for-linus-20130331' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Alright, this time from 10K up in the air. Collection of fixes that have been queued up since the merge window opened, hence postponed until later in the cycle. The pull request contains: - A bunch of fixes for the xen blk front/back driver. - A round of fixes for the new IBM RamSan driver, fixing various nasty issues. - Fixes for multiple drives from Wei Yongjun, bad handling of return values and wrong pointer math. - A fix for loop properly killing partitions when being detached." * tag 'for-linus-20130331' of git://git.kernel.dk/linux-block: (25 commits) mg_disk: fix error return code in mg_probe() rsxx: remove unused variable rsxx: enable error return of rsxx_eeh_save_issued_dmas() block: removes dynamic allocation on stack Block: blk-flush: Fixed indent code style cciss: fix invalid use of sizeof in cciss_find_cfgtables() loop: cleanup partitions when detaching loop device loop: fix error return code in loop_add() mtip32xx: fix error return code in mtip_pci_probe() xen-blkfront: remove frame list from blk_shadow xen-blkfront: pre-allocate pages for requests xen-blkback: don't store dev_bus_addr xen-blkfront: switch from llist to list xen-blkback: fix foreach_grant_safe to handle empty lists xen-blkfront: replace kmalloc and then memcpy with kmemdup xen-blkback: fix dispatch_rw_block_io() error path rsxx: fix missing unlock on error return in rsxx_eeh_remap_dmas() Adding in EEH support to the IBM FlashSystem 70/80 device driver block: IBM RamSan 70/80 error message bug fix. block: IBM RamSan 70/80 branding changes. ...	2013-03-31 11:38:59 -07:00
Linus Torvalds	b92eded4b7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fix from Sage Weil: "This fixes a regression introduced during the last merge window when mapping non-existent images." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: don't zero-fill non-image object requests	2013-03-29 11:47:43 -07:00
Alex Elder	6e2a4505db	rbd: don't zero-fill non-image object requests A result of ENOENT from a read request for an object that's part of an rbd image indicates that there is a hole in that portion of the image. Similarly, a short read for such an object indicates that the remainder of the read should be interpreted a full read with zeros filling out the end of the request. This behavior is not correct for objects that are not backing rbd image data. Currently rbd_img_obj_request_callback() assumes it should be done for all objects. Change rbd_img_obj_request_callback() so it only does this zeroing for image objects. Encapsulate that special handling in its own function. Add an assertion that the image object request is a bio request, since we assume that (and we currently don't support any other types). This resolves a problem identified here: http://tracker.ceph.com/issues/4559 The regression was introduced by `bf0d5f503d`. Reported-by: Dan van der Ster <dan@vanderster.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-off-by: Sage Weil <sage@inktank.com>	2013-03-29 11:32:07 -07:00
Vishal Verma	5d0f6131a7	NVMe: Add nvme-scsi.c Translates SCSI commands in SG_IO ioctl to NVMe commands. Uses the scsi-nvme translation spec from nvmexpress.org as reference. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-03-28 14:50:49 -04:00
Eric Dumazet	91c5746425	aoe: reserve enough headroom on skbs Some network drivers use a non default hard_header_len Transmitted skb should take into account dev->hard_header_len, or risk crashes or expensive reallocations. In the case of aoe, lets reserve MAX_HEADER bytes. David reported a crash in defxx driver, solved by this patch. Reported-by: David Oostdyk <daveo@ll.mit.edu> Tested-by: David Oostdyk <daveo@ll.mit.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ed Cashin <ecashin@coraid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-28 14:29:47 -04:00
Lars Ellenberg	0b6ef4164f	drbd: fix if(); found by kbuild test robot Recently introduced al_begin_io_nonblock() was returning -EBUSY, even when it should return -EWOULDBLOCK. Impact: A few spurious wake_up() calls in prepare_al_transaction_nonblock(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:26 -06:00
Philipp Reisner	3990e04df0	drbd: use sched_setscheduler() It was unnoticed for some time that assigning to current->policy is no longer sufficient to set a real time priority for a kernel thread. Reported-by: Charlie Suffin <Charlie.Suffin@stratus.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Philipp Reisner	7c689e63a8	drbd: fix for deadlock when using automatic split-brain-recovery With an automatic after split-brain recovery policy of "after-sb-1pri call-pri-lost-after-sb", when trying to drbd_set_role() to R_SECONDARY, we run into a deadlock. This was first recognized and supposedly fixed by 2009-06-10 "Fixed a deadlock when using automatic split brain recovery when both nodes are" replacing drbd_set_role() with drbd_change_state() in that code-path, but the first hunk of that patch forgets to remove the drbd_set_role(). We apparently only ever tested the "two primaries" case. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Alexey Khoroshilov	193d01532a	drbd: add module_put() on error path in drbd_proc_open() If single_open() fails in drbd_proc_open(), module refcount is left incremented. The patch adds module_put() on the error path. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Lars Ellenberg	607f25e56e	drbd: fix drbd epoch write count for ahead/behind mode The sanity check when receiving P_BARRIER_ACK does expect all write requests with a given req->epoch to have been either all replicated, or all not replicated. Because req->epoch was assigned before calling maybe_pull_ahead(), this expectation was not met, leading to an off-by-one in the sanity check, and further to a "Protocol Error". Fix: move the call to maybe_pull_ahead() a few lines up, and assign req->epoch only after that. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Philipp Reisner	ef57f9e6bb	drbd: Fix build error when CONFIG_CRYPTO_HMAC is not set Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Lars Ellenberg	a3f8f7dc7a	drbd: validate resync_after dependency on attach already We validated resync_after dependencies, if changed via disk-options. But we did not validate them when first created via attach. We also did not check or cleanup dependencies that used to be correct, but now point to meanwhile removed minor devices. If the drbd_resync_after_valid() validation in disk-options tried to follow a dependency chain in this way, this could lead to NULL pointer dereference. Validate resync_after settings in drbd_adm_attach() already, as well as in drbd_adm_disk_opts(), and and only reject dependency loops. Depending on non-existing disks is allowed and equivalent to no dependency. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Lars Ellenberg	94ad0a1014	drbd: fix memory leak We forgot to free the disk_conf, so for each attach/detach cycle we leaked 336 bytes. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Lars Ellenberg	7074e4a745	drbd: only fail empty flushes if no good data is reachable We completed empty flushes (blkdev_issue_flush()) with IO error if we lost the local disk, even if we still have an established replication link to a healthy remote disk. Fix this to only report errors to upper layers, if neither local nor remote data is reachable. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Philipp Reisner	2bd5ed5d67	drbd: Fix disconnect to keep the peer disk state if connection breaks during operation The issue was that if the connection broke while we did the gracefull state change to C_DISCONNECTING (C_TEARDOWN), then we returned a success code from the state engine. (SS_CW_NO_NEED) The result of that is that we missed to call the fence-peer script in such a case. Fixed that by introducing a new error code (SS_OUTDATE_WO_CONN). This one should never reach back into user space. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Philipp Reisner	bb45185de2	drbd: fix spurious warning about bitmap being locked from detach Introduced in drbd: always write bitmap on detach, the bitmap bulk writeout on detach was indicating it expected exclusive bitmap access. Where I meant to say: expect no more modifications, but testing/counting is still allowed. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:25 -06:00
Philipp Reisner	0b2dafcd9f	drbd: drop now useless duplicate state request from invalidate Patch best viewed with git diff --ignore-space-change. Now that we attempt the fallback to local bitmap operation only when disconnected, we can safely drop the extra "silent" state request from both invalidate and invalidate-remote. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:24 -06:00
Philipp Reisner	5c4f13d991	drbd: fix effective error returned when refusing an invalidate Since commit drbd: Disallow the peer_disk_state to be D_OUTDATED while connected trying to invalidate a disconnected Primary returned an error code that did not really match the situation: "Refusing to be Outdated while Connected" Insert two more specific conditions into is_valid_state(), changing that to "Need access to UpToDate data", respectively "Need a connection to start verify or resync". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:24 -06:00
Philipp Reisner	9376d9f8b9	drbd: move invalidating the whole bitmap out of after_state ch() To avoid other state change requests, after passing through sanitize_state(), to be mistaken for an invalidate, move the "set all bits as out-of-sync" into the invalidate path. Make invalidate and invalidate-remote behave consistently wrt. current connection state (need either an established replication link, or really be disconnected). Also mention that in the documentation. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:24 -06:00
Philipp Reisner	a700471bf3	drbd: abort start of resync early, if it raced with connection breakage We've seen a spurious full resync, because a connection breakage raced with drbd_start_resync(, C_SYNC_TARGET), and the resulting state change request intended to start the resync ended up looking like a local invalidate. Fix: Double check the state inside the lock, and don't even request that state change, if we had connection or IO problems. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:24 -06:00
Philipp Reisner	2d56a974f3	drbd: reset ap_in_flight counter for new connections Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 10:10:24 -06:00
Wei Yongjun	c613c5f686	mg_disk: fix error return code in mg_probe() Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-28 09:43:43 -06:00
Vishal Verma	f8ebf8409a	NVMe: Add definitions for format command The SCSI emulation has the ability to send format commands, so we need to add the definition of the command. Also add a missing error code. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-03-27 07:17:49 -04:00
Vishal Verma	13c3b0fcc8	NVMe: Move structures & definitions to header file nvme-scsi.c uses several data structures and definitions that were previously private to nvme-core.c. Move the definitions to nvme.h, protected by __KERNEL__. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-03-27 07:05:42 -04:00
Philip J Kelleher	80b00df291	rsxx: remove unused variable Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-26 14:48:12 -06:00
Philip J Kelleher	4dcaf47258	rsxx: enable error return of rsxx_eeh_save_issued_dmas() Commit `d8d595df` introduced a bug where we did not check for a NULL return from kmalloc(). Make rsxx_eeh_save_issued_dmas() return an error for that case, and make the callers handle that. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-26 14:48:11 -06:00
Vishal Verma	729dd1bd80	NVMe: Rename nvme.c to nvme-core.c In preparation for adding nvme-scsi.c It is preferable to retain the module name 'nvme' Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-03-26 14:24:56 -04:00
Keith Busch	0e5e4f0e56	NVMe: Add discard support for capable devices This adds discard support to block queues if the nvme device is capable of deallocating blocks as indicated by the controller's optional command support. A discard flagged bio request will submit an NVMe deallocate Data Set Management command for the requested blocks. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-03-26 14:18:58 -04:00
Philip J Kelleher	d8d595dfce	block: removes dynamic allocation on stack This patch removes dynamic allocation on the stack error. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-25 19:22:31 -06:00
Jens Axboe	2124469efa	aoe: get rid of cached bv variable in bufinit() Less error prone if we just kill it, it's only used once anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-25 15:27:26 -06:00
Kent Overstreet	f1fb3449ef	aoe: Fix unitialized var usage Commit `4f2ac93c17` (block: Remove bi_idx references) accidently removed the bit that set bv - readd that. Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: fengguang.wu@intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-25 13:46:14 -06:00
Kent Overstreet	d74c6d514f	block: Add bio_for_each_segment_all() __bio_for_each_segment() iterates bvecs from the specified index instead of bio->bv_idx. Currently, the only usage is to walk all the bvecs after the bio has been advanced by specifying 0 index. For immutable bvecs, we need to split these apart; bio_for_each_segment() is going to have a different implementation. This will also help document the intent of code that's using it - bio_for_each_segment_all() is only legal to use for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Neil Brown <neilb@suse.de> CC: Boaz Harrosh <bharrosh@panasas.com>	2013-03-23 14:26:28 -07:00
Kent Overstreet	ff8e0070d1	pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage In the short term this'll help with code auditing, and if this code ever gets used now it's converted :) Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jiri Kosina <jkosina@suse.cz>	2013-03-23 14:15:37 -07:00
Kent Overstreet	ffb25dc60f	pktcdvd: use bio_copy_data() Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz>	2013-03-23 14:15:37 -07:00
Kent Overstreet	4f2ac93c17	block: Remove bi_idx references For immutable bvecs, all bi_idx usage needs to be audited - so here we're removing all the unnecessary uses. Most of these are places where it was being initialized on a bio that was just allocated, a few others are conversions to standard macros. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>	2013-03-23 14:15:31 -07:00
Kent Overstreet	aa8b57aa3d	block: Use bio_sectors() more consistently Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: "Ed L. Cashin" <ecashin@coraid.com> CC: Nick Piggin <npiggin@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz> CC: Jim Paris <jim@jtan.com> CC: Geoff Levand <geoff@infradead.org> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ed Cashin <ecashin@coraid.com>	2013-03-23 14:15:30 -07:00
Kent Overstreet	f73a1c7d11	block: Add bio_end_sector() Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2013-03-23 14:15:29 -07:00
Lars Ellenberg	5bbcf5e6ab	drbd: adjust upper limit for activity log extents Now that the on-disk activity-log ring buffer size is adjustable, the maximum active set can become larger, and is now limited by the use of 16bit "labels". This increases the maximum working set from 6433 to 65534 extents, each of which covers an area of 4MiB. Which means that if you use the maximum, you'd have to resync more than 250 GiB after an unclean Primary shutdown. With capable backend storage and replication links, this is entirely feasible. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 22:18:09 -06:00
Lars Ellenberg	45ad07b3ac	drbd: try hard to max out the updates per AL transaction There may have been more incoming requests while we where preparing the current transaction. Try to consolidate more updates into this transaction until we make no more progres. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 22:18:09 -06:00
Lars Ellenberg	7e8c288f6c	drbd: move start io accounting before activity log transaction The IO accounting of the drbd "queue depth" was misleading. We only started IO accounting once we already wrote the activity log. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 22:18:09 -06:00
Lars Ellenberg	08a1ddab6d	drbd: consolidate as many updates as possible into one AL transaction Depending on current IO depth, try to consolidate as many updates as possible into one activity log transaction. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 22:18:09 -06:00
Lars Ellenberg	779b3fe4c0	drbd: queue writes on submitter thread, unless they pass the activity log fastpath Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:15:17 -06:00
Lars Ellenberg	6c3c4355d6	drbd: split out some helper functions to drbd_al_begin_io To make the code easier to follow, use an explicit find_active_resync_extent(), and add a "nonblock" parameter to _al_get(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:15:17 -06:00
Lars Ellenberg	b5bc8e0864	drbd: split drbd_al_begin_io into fastpath, prepare, and commit Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:15:17 -06:00
Lars Ellenberg	113fef9e20	drbd: prepare to queue write requests on a submit worker Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:14:40 -06:00
Lars Ellenberg	6d9febe237	drbd: split __drbd_make_request in before and after drbd_al_begin_io This is in preparation to be able to defer requests that need to wait for an activity log transaction to a submitter workqueue. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:14:00 -06:00
Lars Ellenberg	ebfd5d8f71	drbd: drbd_al_being_io: short circuit to reduce latency A request hitting an already "hot" extent should proceed right away, even if some other requests need to wait for pending transactions. Without that short-circuit, several simultaneous make_request contexts race for committing the transaction, possibly penalizing the innocent. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:14:00 -06:00
Lars Ellenberg	56392d2f40	drbd: Clarify when activity log I/O is delegated to the worker thread Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:14:00 -06:00
Lars Ellenberg	c04ccaa669	drbd: read meta data early, base on-disk offsets on super block We used to calculate all on-disk meta data offsets, and then compare the stored offsets, basically treating them as magic numbers. Now with the activity log striping, the activity log size is no longer fixed. We need to first read the super block, then base the activity log and bitmap offsets on the stored offsets/al stripe settings. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:13:59 -06:00
Lars Ellenberg	cccac9857d	drbd: mechanically rename la_size to la_size_sect Make it obvious that this value is in units of 512 Byte sectors. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:13:59 -06:00
Lars Ellenberg	68e41a43f1	drbd: use the cached meta_dev_idx Now we have the cached meta_dev_idx member, we can get rid of a few rcu_read_lock() sections and rcu_dereference(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:13:59 -06:00
Lars Ellenberg	3a4d4eb3cb	drbd: prepare for new striped layout of activity log Introduce two new on-disk meta data fields: al_stripes and al_stripe_size_4k The intended use case is activity log on RAID 0 or similar. Logically consecutive transactions will advance their on-disk position by al_stripe_size_4k 4kB (transaction sized) blocks. Right now, these are still asserted to be the backward compatible values al_stripes = 1, al_stripe_size_4k = 8 (which amounts to 32kB). Also introduce a caching member for meta_dev_idx in the in-core structure: even though it is initially passed in in the rcu-protected disk_conf structure, it cannot change without a detach/attach cycle. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:13:59 -06:00
Lars Ellenberg	ae8bf312e9	drbd: cleanup ondisk meta data layout calculations and defines Add a comment about our meta data layout variants, and rename a few defines (e.g. MD_RESERVED_SECT -> MD_128MB_SECT) to make it clear that they are short hand for fixed constants, and not arbitrarily to be redefined as one may see fit. Properly pad struct meta_data_on_disk to 4kB, and initialize to zero not only the first 512 Byte, but all of it in drbd_md_sync(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:13:59 -06:00
Lars Ellenberg	9114d79569	drbd: cleanup bogus assert message This fixes ASSERT( mdev->state.disk == D_FAILED ) in drivers/block/drbd/drbd_main.c When we detach from local disk, we let the local refcount hit zero twice. First, we transition to D_FAILED, so we won't give out new references to incoming requests; we still may give out internal references, though. Once the refcount hits zero [1] while in D_FAILED, we queue a transition to D_DISKLESS to our worker. We need to queue it, because we may be in atomic context when putting the reference. Once the transition to D_DISKLESS actually happened [2] from worker context, we don't give out new internal references either. Between hitting zero the first time [1] and actually transition to D_DISKLESS [2], there may be a few very short lived internal get/put, so we may hit zero more than once while being in D_FAILED, or even see a race where a an internal get_ldev() happened while D_FAILED, but the corresponding put_ldev() happens just after the transition to D_DISKLESS. That's why we have the additional test_and_set_bit(GO_DISKLESS,); and that's why the assert was placed wrong. Since there was exactly one code path left to drbd_go_diskless(), and that checks already for D_FAILED, drop that assert, and fold in the drbd_queue_work(). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 18:13:59 -06:00
Linus Torvalds	5da273fe3f	Merge git://git.infradead.org/users/willy/linux-nvme Pull NVMe driver update from Matthew Wilcox: "These patches have mostly been baking for a few months; sorry I didn't get them in during the merge window. They're all bug fixes, except for the addition of the SMART log and the addition to MAINTAINERS." * git://git.infradead.org/users/willy/linux-nvme: NVMe: Add namespaces with no LBA range feature MAINTAINERS: Add entry for the NVMe driver NVMe: Initialize iod nents to 0 NVMe: Define SMART log NVMe: Add result to nvme_get_features NVMe: Set result from user admin command NVMe: End queued bio requests when freeing queue NVMe: Free cmdid on nvme_submit_bio error	2013-03-22 16:43:53 -07:00
Keith Busch	122090366d	NVMe: Add namespaces with no LBA range feature The LBA Range Type feature is optional in the NVMe specification, so we should continue with adding namespaces for controllers that do not implement this feature. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>	2013-03-22 14:50:23 -04:00
Wei Yongjun	d2b805d895	cciss: fix invalid use of sizeof in cciss_find_cfgtables() sizeof() when applied to a pointer typed expression gives the size of the pointer, not that of the pointed data. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: scameron@beardog.cce.hp.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 12:22:49 -06:00
Phillip Susi	8761a3dc1f	loop: cleanup partitions when detaching loop device Any partitions added by user space to the loop device were being left in place after detaching the loop device. This was because the detach path issued a BLKRRPART to clean up partitions if LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto scanned on attach. Replace this BLKRRPART with code that unconditionally cleans up partitions on detach instead. Signed-off-by: Phillip Susi <psusi@ubuntu.com> Modified by Jens to export delete_partition(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 12:21:53 -06:00
Wei Yongjun	183cfb5720	loop: fix error return code in loop_add() Fix to return a negative error code from the error handling case, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 08:59:19 -06:00
Wei Yongjun	d137c8306c	mtip32xx: fix error return code in mtip_pci_probe() Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-22 08:58:23 -06:00
Jens Axboe	7fbaee72ff	Merge branch 'stable/for-jens-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus Konrad writes: [the branch] has a bunch of fixes. They vary from being able to deal with unknown requests, overflow in statistics, compile warnings, bug in the error path, removal of unnecessary logic. There is also one performance fix - which is to allocate pages for requests when the driver loads - instead of doing it per request	2013-03-22 08:56:32 -06:00
Rusty Russell	0a11cc36f7	virtio_blk: remove nents member. It's simply a flag as to whether we have data now, so make it an explicit function parameter rather than a member of struct virtblk_req. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Asias He <asias@redhat.com>	2013-03-20 15:44:58 +10:30
Paolo Bonzini	20af3cfd20	virtio-blk: use virtqueue_add_sgs on req path (This is a respin of Paolo Bonzini's patch, but it calls virtqueue_add_sgs() instead of his multi-part API). This is similar to the previous patch, but a bit more radical because the bio and req paths now share the buffer construction code. Because the req path doesn't use vbr->sg, however, we need to add a couple of arguments to __virtblk_add_req. We also need to teach __virtblk_add_req how to build SCSI command requests. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Asias He <asias@redhat.com>	2013-03-20 15:44:55 +10:30
Paolo Bonzini	8f39db9d37	virtio-blk: use virtqueue_add_sgs on bio path (This is a respin of Paolo Bonzini's patch, but it calls virtqueue_add_sgs() instead of his multi-part API). Move the creation of the request header and response footer to __virtblk_add_req. vbr->sg only contains the data scatterlist, the header/footer are added separately using virtqueue_add_sgs(). With this change, virtio-blk (with use_bio) is not relying anymore on the virtio functions ignoring the end markers in a scatterlist. The next patch will do the same for the other path. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Asias He <asias@redhat.com>	2013-03-20 15:44:54 +10:30
Paolo Bonzini	5ee21a52c0	virtio-blk: reorganize virtblk_add_req Right now, both virtblk_add_req and virtblk_add_req_wait call virtqueue_add_buf. To prepare for the next patches, abstract the call to virtqueue_add_buf into a new function __virtblk_add_req, and include the waiting logic directly in virtblk_add_req. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Asias He <asias@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2013-03-20 15:44:54 +10:30
Roger Pau Monne	b1173e316b	xen-blkfront: remove frame list from blk_shadow We already have the frame (pfn of the grant page) stored inside struct grant, so there's no need to keep an aditional list of mapped frames for a specific request. This reduces memory usage in blkfront. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-19 12:50:07 -04:00
Roger Pau Monne	9c1e050cae	xen-blkfront: pre-allocate pages for requests This prevents us from having to call alloc_page while we are preparing the request. Since blkfront was calling alloc_page with a spinlock held we used GFP_ATOMIC, which can fail if we are requesting a lot of pages since it is using the emergency memory pools. Allocating all the pages at init prevents us from having to call alloc_page, thus preventing possible failures. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-19 12:50:06 -04:00
Roger Pau Monne	ffb1dabd1e	xen-blkback: don't store dev_bus_addr dev_bus_addr returned in the grant ref map operation is the mfn of the passed page, there's no need to store it in the persistent grant entry, since we can always get it provided that we have the page. This reduces the memory overhead of persistent grants in blkback. While at it, rename the 'seg[i].buf' to be 'seg[i].offset' as it makes much more sense - as we use that value in bio_add_page which as the fourth argument expects the offset. We hadn't used the physical address as part of this at all. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org [v1: s/buf/offset/] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-19 12:50:00 -04:00
Roger Pau Monne	155b7edb51	xen-blkfront: switch from llist to list The git commit `f84adf4921` (xen-blkfront: drop the use of llist_for_each_entry_safe) was a stop-gate to fix a GCC4.1 bug. The appropiate way is to actually use an list instead of using an llist. As such this patch replaces the usage of llist with an list. Since we always manipulate the list while holding the io_lock, there's no need for additional locking (llist used previously is safe to use concurrently without additional locking). Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> CC: stable@vger.kernel.org [v1: Redid the git commit description] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-19 10:27:56 -04:00
Roger Pau Monne	217fd5e709	xen-blkback: fix foreach_grant_safe to handle empty lists We may use foreach_grant_safe in the future with empty lists, so make sure we can handle them. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-19 08:53:15 -04:00
Mihnea Dobrescu-Balaur	29d0b218c8	xen-blkfront: replace kmalloc and then memcpy with kmemdup The benefits are: * code is cleaner * kmemdup adds additional debugging info useful for tracking the real place where memory was allocated (CONFIG_DEBUG_SLAB). Signed-off-by: Mihnea Dobrescu-Balaur <mihneadb@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-18 16:31:31 -04:00
Jan Beulich	0e5e098ac2	xen-blkback: fix dispatch_rw_block_io() error path Commit `7708992` ("xen/blkback: Seperate the bio allocation and the bio submission") consolidated the pendcnt updates to just a single write, neglecting the fact that the error path relied on it getting set to 1 up front (such that the decrement in __end_block_io_op() would actually drop the count to zero, triggering the necessary cleanup actions). Also remove a misleading and a stale (after said commit) comment. CC: stable@vger.kernel.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-18 16:31:22 -04:00
Jens Axboe	351a2c6e7d	rsxx: fix missing unlock on error return in rsxx_eeh_remap_dmas() Spotted by Fenguan Wu's super build robot. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-16 10:10:48 +01:00
Philip J Kelleher	c95246c3a2	Adding in EEH support to the IBM FlashSystem 70/80 device driver Changes in v2 include: o Fixed spelling of guarantee. o Fixed potential memory leak if slot reset fails out. o Changed list_for_each_entry_safe with list_for_each_entry. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-16 08:22:25 +01:00
Milos Vyletel	9d9598b81c	virtio-blk: emit udev event when device is resized When virtio-blk device is resized from host (using block_resize from QEMU) emit KOBJ_CHANGE uevent to notify guest about such change. This allows user to have custom udev rules which would take whatever action if such event occurs. As a proof of concept I've created simple udev rule that automatically resize filesystem on virtio-blk device. ACTION=="change", KERNEL=="vd", \ ENV{RESIZE}=="1", \ ENV{ID_FS_TYPE}=="ext[3-4]", \ RUN+="/sbin/resize2fs /dev/%k" ACTION=="change", KERNEL=="vd", \ ENV{RESIZE}=="1", \ ENV{ID_FS_TYPE}=="LVM2_member", \ RUN+="/sbin/pvresize /dev/%k" Signed-off-by: Milos Vyletel <milos.vyletel@sde.cz> Tested-by: Asias He <asias@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor simplification)	2013-03-12 15:44:55 +10:30
Philip J Kelleher	1ebfd10982	block: IBM RamSan 70/80 error message bug fix. This patch includes a simple change to the rsxx_pci_remove function that caused error messages because traffic was halted too early. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-11 19:53:55 +01:00
Philip J Kelleher	9bb3c4469e	block: IBM RamSan 70/80 branding changes. This patch includes changing the hardware branding name from IBM RamSan to IBM FlashSystem. v2 Changes include: o Removing the unnecessary IBM Vendor ID #define v1 Changes include: o Changed all references of RamSan to FlashSystem. o Changed the vendor/device IDs for the product. o Changed driver version number. o Updated the MAINTAINERS file. o Various other little things. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-11 19:53:55 +01:00
Philip J Kelleher	03ac03a897	block: IBM RamSan 70/80 fixes inconsistent locking. This patch includes changes to the cregs locking scheme. Before, inconsistant locking would occur because of misuse of spin_lock, spin_lock_bh, and counter parts. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-11 19:53:55 +01:00
Philip J Kelleher	f37912039e	block: IBM RamSan 70/80 trivial changes. This patch includes trivial changes that were recommended by different members of the Linux Community. Changes include: o Removing the redundant wmb(). o Formatting o Various other little things. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-03-11 19:53:55 +01:00
Zoltan Kiss	986cacbd26	xen/blkback: Change statistics counter types to unsigned These values shouldn't be negative, but after an overflow their value can turn into negative, if they are signed. xentop can show bogus values in this case. Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> Reported-by: Ichiro Ogino <ichiro.ogino@citrix.co.jp> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-11 13:56:54 -04:00
David Vrabel	0e367ae465	xen/blkback: correctly respond to unknown, non-native requests If the frontend is using a non-native protocol (e.g., a 64-bit frontend with a 32-bit backend) and it sent an unrecognized request, the request was not translated and the response would have the incorrect ID. This may cause the frontend driver to behave incorrectly or crash. Since the ID field in the request is always in the same place, regardless of the request type we can get the correct ID and make a valid response (which will report BLKIF_RSP_EOPNOTSUPP). This bug affected 64-bit SLES 11 guests when using a 32-bit backend. This guest does a BLKIF_OP_RESERVED_1 (BLKIF_OP_PACKET in the SLES source) and would crash in blkif_int() as the ID in the response would be invalid. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-11 13:54:28 -04:00
Chen Gang	a72d9002f8	xen/xen-blkback: preq.dev is used without initialized If call xen_vbd_translate failed, the preq.dev will be not initialized. Use blkif->vbd.pdevice instead (still better to print relative info). Note that before commit `01c681d4c7` (xen/blkback: Don't trust the handle from the frontend.) the value bogus, as it was the guest provided value from req->u.rw.handle rather than the actual device. Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-01 08:41:43 -05:00
Linus Torvalds	1cf0209c43	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "A few groups of patches here. Alex has been hard at work improving the RBD code, layout groundwork for understanding the new formats and doing layering. Most of the infrastructure is now in place for the final bits that will come with the next window. There are a few changes to the data layout. Jim Schutt's patch fixes some non-ideal CRUSH behavior, and a set of patches from me updates the client to speak a newer version of the protocol and implement an improved hashing strategy across storage nodes (when the server side supports it too). A pair of patches from Sam Lang fix the atomicity of open+create operations. Several patches from Yan, Zheng fix various mds/client issues that turned up during multi-mds torture tests. A final set of patches expose file layouts via virtual xattrs, and allow the policies to be set on directories via xattrs as well (avoiding the awkward ioctl interface and providing a consistent interface for both kernel mount and ceph-fuse users)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits) libceph: add support for HASHPSPOOL pool flag libceph: update osd request/reply encoding libceph: calculate placement based on the internal data types ceph: update support for PGID64, PGPOOL3, OSDENC protocol features ceph: update "ceph_features.h" libceph: decode into cpu-native ceph_pg type libceph: rename ceph_pg -> ceph_pg_v1 rbd: pass length, not op for osd completions rbd: move rbd_osd_trivial_callback() libceph: use a do..while loop in con_work() libceph: use a flag to indicate a fault has occurred libceph: separate non-locked fault handling libceph: encapsulate connection backoff libceph: eliminate sparse warnings ceph: eliminate sparse warnings in fs code rbd: eliminate sparse warnings libceph: define connection flag helpers rbd: normalize dout() calls rbd: barriers are hard rbd: ignore zero-length requests ...	2013-02-28 17:43:09 -08:00
Linus Torvalds	f042fea0da	Merge branch 'for-3.9/drivers' of git://git.kernel.dk/linux-block Pull block driver bits from Jens Axboe: "After the block IO core bits are in, please grab the driver updates from below as well. It contains: - Fix ancient regression in dac960. Nobody must be using that anymore... - Some good fixes from Guo Ghao for loop, fixing both potential oopses and deadlocks. - Improve mtip32xx for NUMA systems, by being a bit more clever in distributing work. - Add IBM RamSan 70/80 driver. A second round of fixes for that is pending, that will come in through for-linus during the 3.9 cycle as per usual. - A few xen-blk{back,front} fixes from Konrad and Roger. - Other minor fixes and improvements." * 'for-3.9/drivers' of git://git.kernel.dk/linux-block: loopdev: ignore negative offset when calculate loop device size loopdev: remove an user triggerable oops loopdev: move common code into loop_figure_size() loopdev: update block device size in loop_set_status() loopdev: fix a deadlock xen-blkback: use balloon pages for persistent grants xen-blkfront: drop the use of llist_for_each_entry_safe xen/blkback: Don't trust the handle from the frontend. xen-blkback: do not leak mode property block: IBM RamSan 70/80 driver fixes rsxx: add slab.h include to dma.c drivers/block/mtip32xx: add missing GENERIC_HARDIRQS dependency block: remove new __devinit/exit annotations on ramsam driver block: IBM RamSan 70/80 device driver drivers/block/mtip32xx/mtip32xx.c:1726:5: sparse: symbol 'mtip_send_trim' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4029:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static? dac960: return success instead of -ENOTTY mtip32xx: add trim support mtip32xx: Add workqueue and NUMA support block: delete super ancient PC-XT driver for 1980's hardware	2013-02-28 13:16:07 -08:00
Linus Torvalds	ee89f81252	Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch\|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front\|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...	2013-02-28 12:52:24 -08:00
Alex Elder	398eb08555	nbd: fix sparse warning I just fixed this in "drivers/block/rbd.c" and I noticed that "drivers/block/nbd.c" has the same problem. Fix a warning issued by sparse by adding some lockdep annotations to indicate the queue lock gets dropped (because it's held when do_nbd_request() is called) and re-acquired within the function. Signed-off-by: Alex Elder <elder@inktank.com> Cc: Paul Clements <paul.clements@steeleye.com> Cc: Paul Clements <paul.clements@us.sios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:22 -08:00
Paolo Bonzini	a83e814b5b	nbd: show read-only state in sysfs Pass the read-only flag to set_device_ro, so that it will be visible to the block layer and in sysfs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: Alex Bligh <alex@alex.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:22 -08:00
Paolo Bonzini	3a2d63f879	nbd: fsync and kill block device on shutdown There are two problems with shutdown in the NBD driver. 1: Receiving the NBD_DISCONNECT ioctl does not sync the filesystem. This patch adds the sync operation into __nbd_ioctl()'s NBD_DISCONNECT handler. This is useful because BLKFLSBUF is restricted to processes that have CAP_SYS_ADMIN, and the NBD client may not possess it (fsync of the block device does not sync the filesystem, either). 2: Once we clear the socket we have no guarantee that later reads will come from the same backing storage. The patch adds calls to kill_bdev() in __nbd_ioctl()'s socket clearing code so the page cache is cleaned, lest reads that hit on the page cache will return stale data from the previously-accessible disk. Example: # qemu-nbd -r -c/dev/nbd0 /dev/sr0 # file -s /dev/nbd0 /dev/stdin: # UDF filesystem data (version 1.5) etc. # qemu-nbd -d /dev/nbd0 # qemu-nbd -r -c/dev/nbd0 /dev/sda # file -s /dev/nbd0 /dev/stdin: # UDF filesystem data (version 1.5) etc. While /dev/sda has: # file -s /dev/sda /dev/sda: x86 boot sector; etc. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Paul Clements <Paul.Clements@steeleye.com> Cc: Alex Bligh <alex@alex.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:22 -08:00
Alex Bligh	75f187aba5	nbd: support FLUSH requests Currently, the NBD device does not accept flush requests from the Linux block layer. If the NBD server opened the target with neither O_SYNC nor O_DSYNC, however, the device will be effectively backed by a writeback cache. Without issuing flushes properly, operation of the NBD device will not be safe against power losses. The NBD protocol has support for both a cache flush command and a FUA command flag; the server will also pass a flag to note its support for these features. This patch adds support for the cache flush command and flag. In the kernel, we receive the flags via the NBD_SET_FLAGS ioctl, and map NBD_FLAG_SEND_FLUSH to the argument of blk_queue_flush. When the flag is active the block layer will send REQ_FLUSH requests, which we translate to NBD_CMD_FLUSH commands. FUA support is not included in this patch because all free software servers implement it with a full fdatasync; thus it has no advantage over supporting flush only. Because I [Paolo] cannot really benchmark it in a realistic scenario, I cannot tell if it is a good idea or not. It is also not clear if it is valid for an NBD server to support FUA but not flush. The Linux block layer gives a warning for this combination, the NBD protocol documentation says nothing about it. The patch also fixes a small problem in the handling of flags: nbd->flags must be cleared at the end of NBD_DO_IT, but the driver was not doing that. The bug manifests itself as follows. Suppose you two different client/server pairs to start the NBD device. Suppose also that the first client supports NBD_SET_FLAGS, and the first server sends NBD_FLAG_SEND_FLUSH; the second pair instead does neither of these two things. Before this patch, the second invocation of NBD_DO_IT will use a stale value of nbd->flags, and the second server will issue an error every time it receives an NBD_CMD_FLUSH command. This bug is pre-existing, but it becomes much more important after this patch; flush failures make the device pretty much unusable, unlike Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alex Bligh <alex@alex.org.uk> Acked-by: Paul Clements <Paul.Clements@steeleye.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:22 -08:00
Tejun Heo	56de210245	drbd: convert to idr_alloc() Convert to the much saner new idr interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:15 -08:00
Tejun Heo	c718aa652d	block/loop: convert to idr_alloc() Convert to the much saner new idr interface. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:15 -08:00
Tejun Heo	9d60916677	block/loop: don't use idr_remove_all() idr_destroy() can destroy idr by itself and idr_remove_all() is being deprecated. Drop its usage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:13 -08:00
Linus Torvalds	d895cb1af1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle and ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...	2013-02-26 20:16:07 -08:00
Sage Weil	1b83bef24c	libceph: update osd request/reply encoding Use the new version of the encoding for osd requests and replies. In the process, update the way we are tracking request ops and reply lengths and results in the struct ceph_osd_request. Update the rbd and fs/ceph users appropriately. The main changes are: - we keep pointers into the request memory for fields we need to update each time the request is sent out over the wire - we keep information about the result in an array in the request struct where the users can easily get at it. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>	2013-02-26 15:02:50 -08:00
Alex Elder	c47f937154	rbd: pass length, not op for osd completions The only thing type-specific osd completion functions do with their osd op parameter is (in some cases) extract the number of bytes transferred from it. In the other cases, the xferred bytes field is not used, and total message data transfer byte count (which may well be zero) is used. Just set the object request transfer count in the main osd request callback function and provide that to the other routines. There is then no longer any need to pass the op pointer to the type-specific completion routines, so drop those parameters. Stop doing anything with the total message data length. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-02-26 15:00:06 -08:00
Alex Elder	39bf2c5d09	rbd: move rbd_osd_trivial_callback() This function is slightly out of place, probably the result of an errant automatic merge or something. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-02-26 14:59:49 -08:00
Al Viro	3dadecce20	switch vfs_getattr() to struct path Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-26 02:46:08 -05:00
Alex Elder	cc344fa1b5	rbd: eliminate sparse warnings Fengguang Wu reminded me that there were outstanding sparse reports in the ceph and rbd code. This patch fixes these problems in rbd that lead to those reports: - Convert functions that are never referenced externally to have static scope. - Add a lockdep annotation to rbd_request_fn(), because it releases a lock before acquiring it again. This partially resolves: http://tracker.ceph.com/issues/4184 Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-25 15:37:08 -06:00
Alex Elder	37206ee5be	rbd: normalize dout() calls Add dout() calls to facilitate tracing of image and object requests. Change a few existing calls so they use __func__ rather than the hard-coded function name. Have calls always add ":" after the name of the function, and prefix pointer values with a consistent tag indicating what it represents. (Note that there remain some older dout() calls that are left untouched by this patch.) Issue a warning if rbd_osd_write_callback() ever gets a short write. This resolves: http://tracker.ceph.com/issues/4235 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-25 15:36:56 -06:00
Alex Elder	632b88cade	rbd: barriers are hard Let's go shopping! I'm afraid this may not have gotten it right: `07741308` rbd: add barriers near done flag operations The smp_wmb() should have been done before setting the done flag, to ensure all other data was valid before marking the object request done. Switch to use atomic_inc_return() here to set the done flag, which allows us to verify we don't mark something done more than once. Doing this also implies general barriers before and after the call. And although a read memory barrier might have been sufficient before reading the done flag, convert this to a full memory barrier just to put this issue to bed. This resolves: http://tracker.ceph.com/issues/4238 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-25 15:36:50 -06:00
Alex Elder	4dda41d3d7	rbd: ignore zero-length requests The old request code simply ignored zero-length requests. We should still operate that same way to avoid any changes in behavior. We can implement handling for special zero-length requests separately (see http://tracker.ceph.com/issues/4236). Add some assertions based on this new constraint. This resolves: http://tracker.ceph.com/issues/4237 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-25 15:36:36 -06:00
Al Viro	496ad9aa8e	new helper: file_inode(file) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-22 23:31:31 -05:00
Guo Chao	b7a1da695f	loopdev: ignore negative offset when calculate loop device size Negative offset may cause loop device size larger than backing file size. $ fallocate -l 1M a $ losetup --offset 0xffffffffffff0000 /dev/loop0 a $ blockdev --getsize64 /dev/loop0 1114112 $ ls -l a -rw-r--r-- 1 root root 1048576 Jan 23 12:46 a $ cat /dev/loop0 cat: /dev/loop0: Input/output error It makes no sense to do that. Only apply offset when it's positive. Fix a typo in the comment by the way. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:43:22 +01:00
Guo Chao	b1a6650406	loopdev: remove an user triggerable oops When loopdev is built as module and we pass an invalid parameter, loop_init() will return directly without deregister misc device, which will cause an oops when insert loop module next time because we left some garbage in the misc device list. Test case: sudo modprobe loop max_part=1024 (failed due to invalid parameter) sudo modprobe loop (oops) Clean up nicely to avoid such oops. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:43:22 +01:00
Guo Chao	7b0576a3d8	loopdev: move common code into loop_figure_size() Update block device size in accord with gendisk size and let userspace know the change in loop_figure_size(). This is a clean up to remove common code of loop_figure_size()'s two callers. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:43:22 +01:00
Guo Chao	541c742a75	loopdev: update block device size in loop_set_status() Loop device driver sometimes fails to impose the size limit on the device. Keep issuing following two commands: losetup --offset 7517244416 --sizelimit 3224971264 /dev/loop0 backed_file blockdev --getsize64 /dev/loop0 blockdev reports file size instead of sizelimit several out of 100 times. The problems are: - losetup set up the device in two ioctl: LOOP_SET_FD and LOOP_SET_STATUS64. - LOOP_SET_STATUS64 only update size of gendisk. Block device size will be updated lazily when device comes to use. If udev rushes in between the two ioctl, it will bring in a block device whose size is backing file size. If the device is not released after LOOP_SET_STATUS64 ioctl, blockdev will not see the updated size. Update block size in LOOP_SET_STATUS64 ioctl. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Reported-by: M. Hindess <hindessm@uk.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:43:22 +01:00
Guo Chao	5370019dc2	loopdev: fix a deadlock bd_mutex and lo_ctl_mutex can be held in different order. Path #1: blkdev_open blkdev_get __blkdev_get (hold bd_mutex) lo_open (hold lo_ctl_mutex) Path #2: blkdev_ioctl lo_ioctl (hold lo_ctl_mutex) lo_set_capacity (hold bd_mutex) Lockdep does not report it, because path #2 actually holds a subclass of lo_ctl_mutex. This subclass seems creep into the code by mistake. The patch author actually just mentioned it in the changelog, see commit `f028f3b2` ("loop: fix circular locking in loop_clr_fd()"), also see: http://marc.info/?l=linux-kernel&m=123806169129727&w=2 Path #2 hold bd_mutex to call bd_set_size(), I've protected it with i_mutex in a previous patch, so drop bd_mutex at this site. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:43:21 +01:00
Cong Ding	7414d4f64b	drivers/block/swim3.c: fix null pointer dereference The use of pointer fs should be after the null check. Signed-off-by: Cong Ding <dinggnu@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:42:46 +01:00
Linus Torvalds	06991c28f3	Driver core patches for 3.9-rc1 Here is the big driver core merge for 3.9-rc1 There are two major series here, both of which touch lots of drivers all over the kernel, and will cause you some merge conflicts: - add a new function called devm_ioremap_resource() to properly be able to check return values. - remove CONFIG_EXPERIMENTAL If you need me to provide a merged tree to handle these resolutions, please let me know. Other than those patches, there's not much here, some minor fixes and updates. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9 weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW =yWAQ -----END PGP SIGNATURE----- Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg Kroah-Hartman: "Here is the big driver core merge for 3.9-rc1 There are two major series here, both of which touch lots of drivers all over the kernel, and will cause you some merge conflicts: - add a new function called devm_ioremap_resource() to properly be able to check return values. - remove CONFIG_EXPERIMENTAL Other than those patches, there's not much here, some minor fixes and updates" Fix up trivial conflicts * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits) base: memory: fix soft/hard_offline_page permissions drivercore: Fix ordering between deferred_probe and exiting initcalls backlight: fix class_find_device() arguments TTY: mark tty_get_device call with the proper const values driver-core: constify data for class_find_device() firmware: Ignore abort check when no user-helper is used firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER firmware: Make user-mode helper optional firmware: Refactoring for splitting user-mode helper code Driver core: treat unregistered bus_types as having no devices watchdog: Convert to devm_ioremap_resource() thermal: Convert to devm_ioremap_resource() spi: Convert to devm_ioremap_resource() power: Convert to devm_ioremap_resource() mtd: Convert to devm_ioremap_resource() mmc: Convert to devm_ioremap_resource() mfd: Convert to devm_ioremap_resource() media: Convert to devm_ioremap_resource() iommu: Convert to devm_ioremap_resource() drm: Convert to devm_ioremap_resource() ...	2013-02-21 12:05:51 -08:00
Linus Torvalds	e177bb587e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k Pull m68k update from Geert Uytterhoeven. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k: Sort out !CONFIG_MMU_SUN3 vs. CONFIG_HAS_DMA swim: Add missing spinlock init	2013-02-20 14:27:00 -08:00
Jens Axboe	d4308febf3	Merge branch 'stable/for-jens-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.9/drivers Konrad writes: Please git pull the following branch: git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.9 which has bug-fixes that did not make it in v3.8. They all are marked as material for the stable tree as well. There are two bug-fixes for the code that has been in there for some time (that is the Jan's fix and one of mine). And there are two bug-fixes for the persistent grant feature that debuted in v3.8 for xen blk[back\|front]end.	2013-02-20 08:26:06 +01:00
Alex Elder	4c7a08c83a	Merge branch 'testing' of github.com:ceph/ceph-client into into linux-3.8-ceph	2013-02-19 19:21:08 -06:00
Alex Elder	903bb32e89	libceph: drop return value from page vector copy routines The return values provided for ceph_copy_to_page_vector() and ceph_copy_from_page_vector() serve no purpose, so get rid of them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-19 19:14:05 -06:00
Alex Elder	23ed6e13b3	rbd: ignore result of ceph_copy_from_page_vector() The result of ceph_copy_from_page_vector() is simply the length argument it is provided. This is called by rbd_obj_method_sync(), which returns the result if it's non-negative. But we always either ignore or overwrite that return value. So explicitly ignore what's returned by the copy function, and have rbd_obj_method_sync() always return either a negative errno or 0. We also return the result of ceph_copy_from_page_vector() in rbd_obj_read_sync(). There we still want to return the number of bytes transferred, but we can use the value we already have in hand rather than what ceph_copy_from_page_vector() provides. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-19 19:14:05 -06:00
Alex Elder	1ceae7ef0f	rbd: prevent bytes transferred overflow In rbd_obj_read_sync(), verify the number of bytes transferred won't exceed what can be represented by a size_t before using it to indicate the number of bytes to copy to the result buffer. (The real motivation for this is to prepare for the next patch.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-19 19:14:04 -06:00
Alex Elder	fbfab53966	libceph: allow STAT osd operations Add support for CEPH_OSD_OP_STAT operations in the osd client and in rbd. This operation sends no data to the osd; everything required is encoded in identity of the target object. The result will be ENOENT if the object doesn't exist. If it does exist and no other error occurs the server returns the size and last modification time of the target object as output data (in little endian format). The size is a 64 bit unsigned and the time is ceph_timespec structure (two unsigned 32-bit integers, representing a seconds and nanoseconds value). This resolves: http://tracker.ceph.com/issues/4007 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-19 19:14:03 -06:00
Alex Elder	ef06f4d32a	rbd: add parentheses to object request iterator macros The for_each_obj_request*() macros should parenthesize their uses of the ireq parameter. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-19 19:14:02 -06:00
Roger Pau Monne	087ffecdaa	xen-blkback: use balloon pages for persistent grants With current persistent grants implementation we are not freeing the persistent grants after we disconnect the device. Since grant map operations change the mfn of the allocated page, and we can no longer pass it to __free_page without setting the mfn to a sane value, use balloon grant pages instead, as the gntdev device does. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: stable@vger.kernel.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-02-19 15:17:21 -05:00
Konrad Rzeszutek Wilk	f84adf4921	xen-blkfront: drop the use of llist_for_each_entry_safe Replace llist_for_each_entry_safe with a while loop. llist_for_each_entry_safe can trigger a bug in GCC 4.1, so it's best to remove it and use a while loop and do the deletion manually. Specifically this bug can be triggered by hot-unplugging a disk, either by doing xm block-detach or by save/restore cycle. BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffffa0047223>] blkif_free+0x63/0x130 [xen_blkfront] The crash call trace is: ... bad_area_nosemaphore+0x13/0x20 do_page_fault+0x25e/0x4b0 page_fault+0x25/0x30 ? blkif_free+0x63/0x130 [xen_blkfront] blkfront_resume+0x46/0xa0 [xen_blkfront] xenbus_dev_resume+0x6c/0x140 pm_op+0x192/0x1b0 device_resume+0x82/0x1e0 dpm_resume+0xc9/0x1a0 dpm_resume_end+0x15/0x30 do_suspend+0x117/0x1e0 When drilling down to the assembler code, on newer GCC it does .L29: cmpq $-16, %r12 #, persistent_gnt check je .L30 #, out of the loop .L25: ... code in the loop testq %r13, %r13 # n je .L29 #, back to the top of the loop cmpq $-16, %r12 #, persistent_gnt check movq 16(%r12), %r13 # <variable>.node.next, n jne .L25 #, back to the top of the loop .L30: While on GCC 4.1, it is: L78: ... code in the loop testq %r13, %r13 # n je .L78 #, back to the top of the loop movq 16(%rbx), %r13 # <variable>.node.next, n jmp .L78 #, back to the top of the loop Which basically means that the exit loop condition instead of being: &(pos)->member != NULL; is: ; which makes the loop unbound. Since xen-blkfront is the only user of the llist_for_each_entry_safe macro remove it from llist.h. Orabug: 16263164 CC: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-02-19 15:17:08 -05:00
Konrad Rzeszutek Wilk	01c681d4c7	xen/blkback: Don't trust the handle from the frontend. The 'handle' is the device that the request is from. For the life-time of the ring we copy it from a request to a response so that the frontend is not surprised by it. But we do not need it - when we start processing I/Os we have our own 'struct phys_req' which has only most essential information about the request. In fact the 'vbd_translate' ends up over-writing the preq.dev with a value from the backend. This assignment of preq.dev with the 'handle' value is superfluous so lets not do it. Cc: stable@vger.kernel.org Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-02-19 15:17:03 -05:00
Jan Beulich	9d092603cc	xen-blkback: do not leak mode property "be->mode" is obtained from xenbus_read(), which does a kmalloc() for the message body. The short string is never released, so do it along with freeing "be" itself, and make sure the string isn't kept when backend_changed() doesn't complete successfully (which made it desirable to slightly re-structure that function, so that the error cleanup can be done in one place). Reported-by: Olaf Hering <olaf@aepfle.de> CC: stable@vger.kernel.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-02-19 15:16:52 -05:00
Philip J Kelleher	c206c70924	block: IBM RamSan 70/80 driver fixes This patch includes the following driver fixes for the IBM RamSan 70/80 driver: o Changed the creg_ctrl lock from a mutex to a spinlock. o Added a count check for ioctl calls. o Removed unnecessary casting of void pointers. o Made every function static that needed to be. o Added comments to explain things more thoroughly. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-18 21:35:59 +01:00
Alex Elder	3c663bbdcd	libceph: kill ceph_osdc_create_event() "one_shot" parameter There is only one caller of ceph_osdc_create_event(), and it provides 0 as its "one_shot" argument. Get rid of that argument and just use 0 in its place. Replace the code in handle_watch_notify() that executes if one_shot is nonzero in the event with a BUG_ON() call. While modifying "osd_client.c", give handle_watch_notify() static scope. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-18 12:20:00 -06:00
Linus Torvalds	11e7651432	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc Pull sparc fixes from David Miller: "A couple small fixes for sparc including some THP brown-paper-bag material: 1) During the merging of all the THP support for various architectures, sparc missed adding a HAVE_ARCH_TRANSPARENT_HUGEPAGE to it's Kconfig, oops. 2) Sparc needs to be mindful of hugepages in get_user_pages_fast(). 3) Fix memory leak in SBUS probe, from Cong Ding. 4) The sunvdc virtual disk client driver has a test of the bitmask of vdisk server supported operations which was off by one bit" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sunvdc: Fix off-by-one in generic_request(). sparc64: Fix get_user_pages_fast() wrt. THP. sparc64: Add missing HAVE_ARCH_TRANSPARENT_HUGEPAGE. sparc: kernel/sbus.c: fix memory leakage	2013-02-15 12:05:57 -08:00
David S. Miller	f4d9605434	sunvdc: Fix off-by-one in generic_request(). The 'operations' bitmap corresponds one-for-one with the operation codes, no adjustment is necessary. Reported-by: Mark Kettenis <mark.kettenis@xs4all.nl> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-14 11:49:01 -08:00
Jens Axboe	ec8edc764e	Merge branch 'delete-xt-disk' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux into for-3.9/drivers Paul writes: Please pull the following to get the removal of the original IBM PC-XT hard disk driver from the block layer (drivers/block/xd.c). As near as I can tell, it hasn't seen a run time fix in over a dozen years, and with drive sizes of 10-20MB, and performance of about 128kB/s maximum, it is no surprise that it has been completely unused for well over a decade. The removal was originally posted[1] well over a month ago, and since then, there has been nobody objecting to the removal, aside from someone who had mistakenly confused it with a completely different driver (hd.c)	2013-02-14 16:29:34 +01:00
Alex Elder	077413082f	rbd: add barriers near done flag operations Somehow, I missed this little item in Documentation/atomic_ops.txt: * WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! * Create and use some helper functions that include the proper memory barriers for manipulating the done field. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:11 -08:00
Alex Elder	a14ea269dd	rbd: turn off interrupts for open/remove locking This commit: bc7a62ee5 rbd: prevent open for image being removed added checking for removing rbd before allowing an open, and used the same request spinlock for protecting that and updating the open count as is used for the request queue. However it used the non-irq protected version of the spinlocks. Fix that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:11 -08:00
Alex Elder	9cbb1d7268	libceph: don't require r_num_pages for bio requests There is a check in the completion path for osd requests that ensures the number of pages allocated is enough to hold the amount of incoming data expected. For bio requests coming from rbd the "number of pages" is not really meaningful (although total length would be). So stop requiring that nr_pages be supplied for bio requests. This is done by checking whether the pages pointer is null before checking the value of nr_pages. Note that this value is passed on to the messenger, but there it's only used for debugging--it's never used for validation. While here, change another spot that used r_pages in a debug message inappropriately, and also invalidate the r_con_filling_msg pointer after dropping a reference to it. This resolves: http://tracker.ceph.com/issues/3875 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:11 -08:00
Alex Elder	1e32d34cfa	rbd: don't take extra bio reference for osd client Currently, if the OSD client finds an osd request has had a bio list attached to it, it drops a reference to it (or rather, to the first entry on that list) when the request is released. The code that added that reference (i.e., the rbd client) is therefore required to take an extra reference to that first bio structure. The osd client doesn't really do anything with the bio pointer other than transfer it from the osd request structure to outgoing (for writes) and ingoing (for reads) messages. So it really isn't the right place to be taking or dropping references. Furthermore, the rbd client already holds references to all bio structures it passes to the osd client, and holds them until the request is completed. So there's no need for this extra reference whatsoever. So remove the bio_put() call in ceph_osdc_release_request(), as well as its matching bio_get() call in rbd_osd_req_create(). This change could lead to a crash if old libceph.ko was used with new rbd.ko. Add a compatibility check at rbd initialization time to avoid this possibilty. This resolves: http://tracker.ceph.com/issues/3798 and http://tracker.ceph.com/issues/3799 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:11 -08:00
Alex Elder	b82d167be6	rbd: prevent open for image being removed An open request for a mapped rbd image can arrive while removal of that mapping is underway. We need to prevent such an open request from succeeding. (It appears that Maciej Galkiewicz ran into this problem.) Define and use a "removing" flag to indicate a mapping is getting removed. Set it in the remove path after verifying nothing holds the device open. And check it in the open path before allowing the open to proceed. Acquire the rbd device's lock around each of these spots to avoid any races accessing the flags and open_count fields. This addresses: http://tracker.newdream.net/issues/3427 Reported-by: Maciej Galkiewicz <maciejgalkiewicz@ragnarson.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:10 -08:00
Alex Elder	6d292906f8	rbd: define flags field, use it for exists flag Define a new rbd device flags field, manipulated using bit operations. Replace the use of the current "exists" flag with a bit in this new "flags" field. Add a little commentary about the "exists" flag, which does not need to be manipulated atomically. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:10 -08:00
Alex Elder	8eb8756530	rbd: don't drop watch requests on completion When we register an osd request to linger, it means that request will stay around (under control of the osd client) until we've unregistered it. We do that for an rbd image's header object, and we keep a pointer to the object request associated with it. Keep a reference to the watch object request for as long as it is registered to linger. Drop it again after we've removed the linger registration. This resolves: http://tracker.ceph.com/issues/3937 (Note: this originally came about because the osd client was issuing a callback more than once. But that behavior will be changing soon, documented in tracker issue 3967.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:10 -08:00
Alex Elder	25dcf954c3	rbd: decrement obj request count when deleting Decrement the obj_request_count value when deleting an object request from its image request's list. Rearrange a few lines in the surrounding code. This resolves: http://tracker.ceph.com/issues/3940 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:10 -08:00
Alex Elder	975241afcb	rbd: track object rather than osd request for watch Switch to keeping track of the object request pointer rather than the osd request used to watch the rbd image header object. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:10 -08:00
Alex Elder	6977c3f983	rbd: unregister linger in watch sync routine Move the code that unregisters an rbd device's lingering header object watch request into rbd_dev_header_watch_sync(), so it occurs in the same function that originally sets up that request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:09 -08:00
Alex Elder	9f20e02a53	rbd: get rid of rbd_req_sync_exec() Get rid rbd_req_sync_exec() because it is no longer used. That eliminates the last use of rbd_req_sync_op(), so get rid of that too. And finally, that leaves rbd_do_request() unreferenced, so get rid of that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:09 -08:00
Alex Elder	36be9a7618	rbd: implement sync method with new code Reimplement synchronous object method calls using the new request tracking code. Use the name rbd_obj_method_sync() Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:09 -08:00
Alex Elder	cf81b60e4b	rbd: send notify ack asynchronously When we receive notification of a change to an rbd image's header object we need to refresh our information about the image (its size and snapshot context). Once we have refreshed our rbd image we need to acknowledge the notification. This acknowledgement was previously done synchronously, but there's really no need to wait for it to complete. Change it so the caller doesn't wait for the notify acknowledgement request to complete. And change the name to reflect it's no longer synchronous. This resolves: http://tracker.newdream.net/issues/3877 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:09 -08:00
Alex Elder	5ae9db81b4	rbd: get rid of rbd_req_sync_notify_ack() Get rid rbd_req_sync_notify_ack() because it is no longer used. As a result rbd_simple_req_cb() becomes unreferenced, so get rid of that too. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:09 -08:00
Alex Elder	b8d70035b3	rbd: use new code for notify ack Use the new object request tracking mechanism for handling a notify_ack request. Move the callback function below the definition of this so we don't have to do a pre-declaration. This resolves: http://tracker.newdream.net/issues/3754 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:08 -08:00
Alex Elder	ecf7a0318b	rbd: get rid of rbd_req_sync_watch() Get rid of rbd_req_sync_watch(), because it is no longer used. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:08 -08:00
Alex Elder	9969ebc5af	rbd: implement watch/unwatch with new code Implement a new function to set up or tear down a watch event for an mapped rbd image header using the new request code. Create a new object request type "nodata" to handle this. And define rbd_osd_trivial_callback() which simply marks a request done. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:08 -08:00
Alex Elder	86ea43bfcb	rbd: get rid of rbd_req_sync_read() Delete rbd_req_sync_read() is no longer used, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:08 -08:00
Alex Elder	788e2df3b9	rbd: implement sync object read with new code Reimplement the synchronous read operation used for reading a version 1 header using the new request tracking code. Name the resulting function rbd_obj_read_sync() to better reflect that it's a full object operation, not an object request. To do this, implement a new OBJ_REQUEST_PAGES object request type. This implements a new mechanism to allow the caller to wait for completion for an rbd_obj_request by calling rbd_obj_request_wait(). This partially resolves: http://tracker.newdream.net/issues/3755 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:08 -08:00
Alex Elder	7d250b949a	rbd: kill rbd_req_coll and rbd_request The two remaining callers of rbd_do_request() always pass a null collection pointer, so the "coll" and "coll_index" parameters are not needed. There is no other use of that data structure, so it can be eliminated. Deleting them means there is no need to allocate a rbd_request structure for the callback function. And since that's the only use of that structure, it too can be eliminated. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:07 -08:00
Alex Elder	2250a71b59	rbd: kill rbd_rq_fn() and all other related code Now that the request function has been replaced by one using the new request management data structures the old one can go away. Deleting it makes rbd_dev_do_request() no longer needed, and deleting that makes other functions unneeded, and so on. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:07 -08:00
Alex Elder	bf0d5f503d	rbd: new request tracking code This patch fully implements the new request tracking code for rbd I/O requests. Each I/O request to an rbd image will get an rbd_image_request structure allocated to track it. This provides access to all information about the original request, as well as access to the set of one or more object requests that are initiated as a result of the image request. An rbd_obj_request structure defines a request sent to a single osd object (possibly) as part of an rbd image request. An rbd object request refers to a ceph_osd_request structure built up to represent the request; for now it will contain a single osd operation. It also provides space to hold the result status and the version of the object when the osd request completes. An rbd_obj_request structure can also stand on its own. This will be used for reading the version 1 header object, for issuing acknowledgements to event notifications, and for making object method calls. All rbd object requests now complete asynchronously with respect to the osd client--they supply a common callback routine. This resolves: http://tracker.newdream.net/issues/3741 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-02-13 18:29:07 -08:00
Jean Delvare	243eeb7890	swim: Add missing spinlock init It doesn't seem this spinlock was properly initialized. Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Finn Thain <fthain@telegraphics.com.au> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>	2013-02-09 14:23:33 +01:00
Jens Axboe	e5e9fdaad4	rsxx: add slab.h include to dma.c kbuild test robot says: tree: git://git.kernel.dk/linux-block.git for-3.9/drivers head: `1262e24a59` commit: `8722ff8cdb` [6/8] block: IBM RamSan 70/80 device driver config: make ARCH=alpha allyesconfig All error/warnings: drivers/block/rsxx/dma.c: In function 'rsxx_complete_dma': >> drivers/block/rsxx/dma.c:251:2: error: implicit declaration of function 'kmem_cache_free' [-Werror=implicit-function-declaration] drivers/block/rsxx/dma.c: In function 'rsxx_queue_discard': >> drivers/block/rsxx/dma.c:567:2: error: implicit declaration of function 'kmem_cache_alloc' [-Werror=implicit-function-declaration] >> drivers/block/rsxx/dma.c:567:6: warning: assignment makes pointer from integer without a cast [enabled by default] drivers/block/rsxx/dma.c: In function 'rsxx_queue_dma': >> drivers/block/rsxx/dma.c:601:6: warning: assignment makes pointer from integer without a cast [enabled by default] drivers/block/rsxx/dma.c: In function 'rsxx_dma_init': >> drivers/block/rsxx/dma.c:985:2: error: implicit declaration of function 'KMEM_CACHE' [-Werror=implicit-function-declaration] >> drivers/block/rsxx/dma.c:985:29: error: 'rsxx_dma' undeclared (first use in this function) drivers/block/rsxx/dma.c:985:29: note: each undeclared identifier is reported only once for each function it appears in >> drivers/block/rsxx/dma.c:985:39: error: 'SLAB_HWCACHE_ALIGN' undeclared (first use in this function) drivers/block/rsxx/dma.c: In function 'rsxx_dma_cleanup': >> drivers/block/rsxx/dma.c:995:2: error: implicit declaration of function 'kmem_cache_destroy' [-Werror=implicit-function-declaration] cc1: some warnings being treated as errors Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-07 10:04:21 +01:00
Heiko Carstens	1262e24a59	drivers/block/mtip32xx: add missing GENERIC_HARDIRQS dependency The MTIP32XX driver calls devm_request_irq() and therefore needs a GENERIC_HARDIRQS dependency to prevent building it on s390. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-07 07:37:29 +01:00
Linus Torvalds	2110cf029a	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "I've got a few bits pending for 3.8 final, that I better get sent out. It's all been sitting for a while, I consider it safe. It contains: - Two bug fixes for mtip32xx, fixing a driver hang and a crash. - A few-liner protocol error fix for drbd. - A few fixes for the xen block front/back driver, fixing a potential data corruption issue. - A race fix for disk_clear_events(), causing spurious warnings. Out of the Chrome OS base. - A deadlock fix for disk_clear_events(), moving it to the a unfreezable workqueue. Also from the Chrome OS base." * 'for-linus' of git://git.kernel.dk/linux-block: drbd: fix potential protocol error and resulting disconnect/reconnect mtip32xx: fix for crash when the device surprise removed during rebuild mtip32xx: fix for driver hang after a command timeout block: prevent race/cleanup block: remove deadlock in disk_clear_events xen-blkfront: handle bvecs with partial data llist/xen-blkfront: implement safe version of llist_for_each_entry xen-blkback: implement safe iterator for the list of persistent grants	2013-02-07 08:38:33 +11:00
Stephen Rothwell	82bed4d5f8	block: remove new __devinit/exit annotations on ramsam driver Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-06 09:34:20 +01:00
josh.h.morris@us.ibm.com	8722ff8cdb	block: IBM RamSan 70/80 device driver This patch includes the device driver for the IBM RamSan family of PCI SSD flash storage cards. This driver will include support for the RamSan 70 and 80. The driver presents a block device for device I/O. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-05 14:16:05 +01:00
Alex Elder	969e5aa3b0	Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing	2013-01-30 07:54:34 -06:00
Greg Kroah-Hartman	422d26b6ec	Merge 3.8-rc5 into driver-core-next This resolves a gpio driver merge issue pointed out in linux-next. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-25 21:06:30 -08:00
Alex Elder	c04306471a	rbd: don't retry setting up header watch When an rbd image is initially mapped a watch event is registered so we can do something if the header object changes. The code that does this currently loops if initiating the watch request results in an ERANGE error. The osds will never return ERANGE, so there's no reason to do this loop, so get rid of it. This resolves: http://tracker.newdream.net/issues/3860 Note that the problem this loop was intended to solve is a race between collecting image header information and setting up the watch on the header object. The real fix for that problem is described here: http://tracker.newdream.net/issues/3871 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-25 17:33:37 -06:00
Alex Elder	38901e0f24	rbd: check for overflow in rbd_get_num_segments() The return type of rbd_get_num_segments() is int, but the values it operates on are u64. Although it's not likely, there's no guarantee the result won't exceed what can be respresented in an int. The function is already designed to return -ERANGE on error, so just add this possible overflow as another reason to return that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-25 17:33:14 -06:00
Alex Elder	98571b5aa7	rbd: small changes A few very minor changes to the rbd code: - RBD_MAX_OPT_LEN is unused, so get rid of it - Consolidate rbd options definitions - Make rbd_segment_name() return pointer to const char Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-25 17:32:46 -06:00
Jens Axboe	1383923d19	Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-linus	2013-01-22 08:22:11 -07:00
Kees Cook	0cb3d9c6ba	drivers/block/paride: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Tim Waugh <tim@cyberelk.net> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:52:42 -08:00
Lars Ellenberg	2681f7f6ce	drbd: fix potential protocol error and resulting disconnect/reconnect When we notice a disk failure on the receiving side, we stop sending it new incoming writes. Depending on exact timing of various events, the same transfer log epoch could end up containing both replicated (before we noticed the failure) and local-only requests (after we noticed the failure). The sanity checks in tl_release(), called when receiving a P_BARRIER_ACK, check that the ack'ed transfer log epoch matches the expected epoch, and the number of contained writes matches the number of ack'ed writes. In this case, they counted both replicated and local-only writes, but the peer only acknowledges those it has seen. We get a mismatch, resulting in a protocol error and disconnect/reconnect cycle. Messages logged are "BAD! BarrierAck #%u received with n_writes=%u, expected n_writes=%u!\n" A similar issue can also be triggered when starting a resync while having a healthy replication link, by invalidating one side, forcing a full sync, or attaching to a diskless node. Fix this by closing the current epoch if the state changes in a way that would cause the replication intent of the next write. Epochs now contain either only non-replicated, or only replicated writes. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>	2013-01-21 22:58:36 +01:00
Linus Torvalds	226364766f	Various minor fixes, but a slightly more complex one to fix the per-cpu overload problem introduced recently by kvm id changes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJQ/IaJAAoJENkgDmzRrbjxOjAQAIrI9+Jo3Lsxk1v9gXeo9xn2 ST4LNv7/oW2+3NFBOkKsGVpcXe1JtGySIXyx9k+dELPa5xe4Rs4HE3pHQj/VoEx8 FKz3oUXSHkuh+paKuFXvZ2u/z0/FI99GmqHPObvGQ4iS3hTXAibzO83yYYPxwApq Zq4kof/dAcVVPLm8fGVAMPA2Rbh/WmjDfrIv8gv71QkDjtRLzcr40VIgky5cvu7V FWcBl4/DVoKkGnDPsLDhLK9QGqgBGhFIlNIcVX4Jv50DiCibOyzdjeUXYxMftoGr Rw56hHwGpPdqbRIjBkR071vIl/mlXTmxIv+d77vZNBin2MIBwAzCQXo8I1/HojCK /wKhI+RFj0J5DaDo/BTB80cmI3X2oah5sRUebW6vd9HjunhFFndg4mVeDNPa0E0+ F72xWlj79BjdIOuD06TLg6Tg2klL49nC8bUc0wrsh6onEjhd9v7Cp/X/rxi5cKYW eEv3oLkKwUHoheF9gBlpnT0Yyl/HpFe+nemblzj/ybRKnk4A5vtJqV9eZnqoOS16 lgIkKOpgXT9dzSom2EL/f4sMCeLLYC44DQwOvxNKt/BdMY0r5y8OLaJORXQGfEDF Ztvu2G8PmELxV0B3JZcGR/zOcKxpOBsrGoVn0/EQIul3A/0C0ID7i5zwJAyX6LP7 V+6vyF2eHMf10tB0rbfB =SpOo -----END PGP SIGNATURE----- Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module fixes and a virtio block fix from Rusty Russell: "Various minor fixes, but a slightly more complex one to fix the per-cpu overload problem introduced recently by kvm id changes." * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: module: put modules in list much earlier. module: add new state MODULE_STATE_UNFORMED. module: prevent warning when finit_module a 0 sized file virtio-blk: Don't free ida when disk is in use	2013-01-20 16:44:28 -08:00
Alex Elder	e0b49868d3	rbd: fix type of snap_id in rbd_dev_v2_snap_info() The type of the snap_id local variable is defined with the wrong byte order. Fix that. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:35:55 -06:00
Alex Elder	8b84de7940	rbd: assign watch request more directly Both rbd_req_sync_op() and rbd_do_request() have a "linger" parameter, which is the address of a pointer that should refer to the osd request structure used to issue a request to an osd. Only one case ever supplies a non-null "linger" argument: an CEPH_OSD_OP_WATCH start. And in that one case it is assigned &rbd_dev->watch_request. Within rbd_do_request() (where the assignment ultimately gets made) we know the rbd_dev and therefore its watch_request field. We also know whether the op being sent is CEPH_OSD_OP_WATCH start. Stop opaquely passing down the "linger" pointer, and instead just assign the value directly inside rbd_do_request() when it's needed. This makes it unnecessary for rbd_req_sync_watch() to make arrangements to hold a value that's not available until a bit later. This more clearly separates setting up a watch request from submitting it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:59 -06:00
Alex Elder	5efea49a98	rbd: move remaining osd op setup into rbd_osd_req_op_create() The two remaining osd ops used by rbd are CEPH_OSD_OP_WATCH and CEPH_OSD_OP_NOTIFY_ACK. Move the setup of those operations into rbd_osd_req_op_create(), and get rid of rbd_create_rw_op() and rbd_destroy_op(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:59 -06:00
Alex Elder	2647ba3810	rbd: move call osd op setup into rbd_osd_req_op_create() Move the initialization of the CEPH_OSD_OP_CALL operation into rbd_osd_req_op_create(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:59 -06:00
Alex Elder	8d23bf2909	rbd: don't assign extent info in rbd_req_sync_op() Move the assignment of the extent offset and length and payload length out of rbd_req_sync_op() and into its caller in the one spot where a read (and note--no write) operation might be initiated. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	c561191813	rbd: don't assign extent info in rbd_do_request() In rbd_do_request() there's a sort of last-minute assignment of the extent offset and length and payload length for read and write operations. Move those assignments into the caller (in those spots that might initiate read or write operations) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	1821665749	rbd: don't leak rbd_req for rbd_req_sync_notify_ack() When rbd_req_sync_notify_ack() calls rbd_do_request() it supplies rbd_simple_req_cb() as its callback function. Because the callback is supplied, an rbd_req structure gets allocated and populated so it can be used by the callback. However rbd_simple_req_cb() is not freeing (or even using) the rbd_req structure, so it's getting leaked. Since rbd_simple_req_cb() has no need for the rbd_req structure, just avoid allocating one for this case. Of the three calls to rbd_do_request(), only the one from rbd_do_op() needs the rbd_req structure, and that call can be distinguished from the other two because it supplies a non-null rbd_collection pointer. So fix this leak by only allocating the rbd_req structure if a non-null "coll" value is provided to rbd_do_request(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	2e53c6c379	rbd: don't leak rbd_req on synchronous requests When rbd_do_request() is called it allocates and populates an rbd_req structure to hold information about the osd request to be sent. This is done for the benefit of the callback function (in particular, rbd_req_cb()), which uses this in processing when the request completes. Synchronous requests provide no callback function, in which case rbd_do_request() waits for the request to complete before returning. This case is not handling the needed free of the rbd_req structure like it should, so it is getting leaked. Note however that the synchronous case has no need for the rbd_req structure at all. So rather than simply freeing this structure for synchronous requests, just don't allocate it to begin with. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	907703d050	rbd: combine rbd sync watch/unwatch functions The rbd_req_sync_watch() and rbd_req_sync_unwatch() functions are nearly identical. Combine them into a single function with a flag indicating whether a watch is to be initiated or torn down. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	0903e875ca	rbd: use a common layout for each device Each osd message includes a layout structure, and for rbd it is always the same (at least for osd's in a given pool). Initialize a layout structure when an rbd_dev gets created and just copy that into osd requests for the rbd image. Replace an assertion that was done when initializing the layout structures with code that catches and handles anything that would trigger the assertion as soon as it is identified. This precludes that (bad) condition from ever occurring. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	47dba7ba26	rbd: don't bother calculating file mapping When rbd_do_request() has a request to process it initializes a ceph file layout structure and uses it to compute offsets and limits for the range of the request using ceph_calc_file_object_mapping(). The layout used is fixed, and is based on RBD_MAX_OBJ_ORDER (30). It sets the layout's object size and stripe unit to be 1 GB (2^30), and sets the stripe count to be 1. The job of ceph_calc_file_object_mapping() is to determine which of a sequence of objects will contain data covered by range, and within that object, at what offset the range starts. It also truncates the length of the range at the end of the selected object if necessary. This is needed for ceph fs, but for rbd it really serves no purpose. It does its own blocking of images into objects, echo of which is (1 << obj_order) in size, and as a result it ignores the "bno" value returned by ceph_calc_file_object_mapping(). In addition, by the point a request has reached this function, it is already destined for a single rbd object, and its length will not exceed that object's extent. Because of this, and because the mapping will result in blocking up the range using an integer multiple of the image's object order, ceph_calc_file_object_mapping() will never change the offset or length values defined by the request. In other words, this call is a big no-op for rbd data requests. There is one exception. We read the header object using this function, and in that case we will not have already limited the request size. However, the header is a single object (not a file or rbd image), and should not be broken into pieces anyway. So in fact we should not be calling ceph_calc_file_object_mapping() when operating on the header object. So... Don't call ceph_calc_file_object_mapping() in rbd_do_request(), because useless for image data and incorrect to do sofor the image header. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	e01e79273b	rbd: open code rbd_calc_raw_layout() This patch gets rid of rbd_calc_raw_layout() by simply open coding it in its one caller. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	0829661863	rbd: pull in ceph_calc_raw_layout() This is the first in a series of patches aimed at eliminating the use of ceph_calc_raw_layout() by rbd. It simply pulls in a copy of that function and renames it rbd_calc_raw_layout(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:58 -06:00
Alex Elder	30573d6803	rbd: assume single op in a request We now know that every of rbd_req_sync_op() passes an array of exactly one operation, as evidenced by all callers passing 1 as its num_op argument. So get rid of that argument, assuming a single op. Similarly, we now know that all callers of rbd_do_request() pass 1 as the num_op value, so that parameter can be eliminated as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:57 -06:00
Alex Elder	139b4318ad	rbd: there is really only one op Throughout the rbd code there are spots where it appears we can handle an osd request containing more than one osd request op. But that is only the way it appears. In fact, currently only one operation at a time can be supported, and supporting more than one will require much more than fleshing out the support that's there now. This patch changes names to make it perfectly clear that anywhere we're dealing with a block of ops, we're in fact dealing with exactly one of them. We'll be able to simplify some things as a result. When multiple op support is implemented, we can update things again accordingly. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:57 -06:00
Alex Elder	ae7ca4a35b	libceph: pass num_op with ops Both ceph_osdc_alloc_request() and ceph_osdc_build_request() are provided an array of ceph osd request operations. Rather than just passing the number of operations in the array, the caller is required append an additional zeroed operation structure to signal the end of the array. All callers know the number of operations at the time these functions are called, so drop the silly zero entry and supply that number directly. As a result, get_num_ops() is no longer needed. This also means that ceph_osdc_alloc_request() never uses its ops argument, so that can be dropped. Also rbd_create_rw_ops() no longer needs to add one to reserve room for the additional op. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:57 -06:00
Alex Elder	d07c09589f	rbd: pass num_op with ops array Add a num_op parameter to rbd_do_request() and rbd_req_sync_op() to indicate the number of entries in the array. The callers of these functions always know how many entries are in the array, so just pass that information down. This is in anticipation of eliminating the extra zero-filled entry in these ops arrays. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:57 -06:00
Alex Elder	54a5400721	libceph: don't set pages or bio in ceph_osdc_alloc_request() Only one of the two callers of ceph_osdc_alloc_request() provides page or bio data for its payload. And essentially all that function was doing with those arguments was assigning them to fields in the osd request structure. Simplify ceph_osdc_alloc_request() by having the caller take care of making those assignments Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 16:34:57 -06:00
Alex Elder	d178a9e740	libceph: don't set flags in ceph_osdc_alloc_request() The only thing ceph_osdc_alloc_request() really does with the flags value it is passed is assign it to the newly-created osd request structure. Do that in the caller instead. Both callers subsequently call ceph_osdc_build_request(), so have that function (instead of ceph_osdc_alloc_request()) issue a warning if a request comes through with neither the read nor write flags set. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:05 -06:00
Alex Elder	e75b45cf36	libceph: drop osdc from ceph_calc_raw_layout() The osdc parameter to ceph_calc_raw_layout() is not used, so get rid of it. Consequently, the corresponding parameter in calc_layout() becomes unused, so get rid of that as well. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:05 -06:00
Alex Elder	4d6b250bf1	libceph: drop snapid in ceph_calc_raw_layout() A snapshot id must be provided to ceph_calc_raw_layout() even though it is not needed at all for calculating the layout. Where the snapshot id is needed is when building the request message for an osd operation. Drop the snapid parameter from ceph_calc_raw_layout() and pass that value instead in ceph_osdc_build_request(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:05 -06:00
Alex Elder	0120be3c60	libceph: pass length to ceph_osdc_build_request() The len argument to ceph_osdc_build_request() is set up to be passed by address, but that function never updates its value so there's no need to do this. Tighten up the interface by passing the length directly. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:04 -06:00
Alex Elder	7c3d22cf16	rbd: don't bother setting snapid in rbd_do_request() For some reason, the snapid field of the osd request header is explicitly set to CEPH_NOSNAP in rbd_do_request(). Just a few lines later--with no code that would access this field in between--a call is made to ceph_calc_raw_layout() passing the snapid provided to rbd_do_request(), which encodes the snapid value it is provided into that field instead. In other words, there is no need to fill in CEPH_NOSNAP, and doing so suggests it might be necessary. Don't do that any more. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:03 -06:00
Alex Elder	25704ac9de	rbd: kill rbd_req_sync_op() snapc and snapid parameters The snapc and snapid parameters to rbd_req_sync_op() always take the values NULL and CEPH_NOSNAP, respectively. So just get rid of them and use those values where needed. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:02 -06:00
Alex Elder	07b2391fbb	rbd: drop flags parameter from rbd_req_sync_exec() All callers of rbd_req_sync_exec() pass CEPH_OSD_FLAG_READ as their flags argument. Delete that parameter and use CEPH_OSD_FLAG_READ within the function. If we find a need to support write operations we can add it back again. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:02 -06:00
Alex Elder	4775618d92	rbd: drop snapid parameter from rbd_req_sync_read() There is only one caller of rbd_req_sync_read(), and it passes CEPH_NOSNAP as the snapshot id argument. Delete that parameter and just use CEPH_NOSNAP within the function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:01 -06:00
Alex Elder	af77f26caa	rbd: drop oid parameters from ceph_osdc_build_request() The last two parameters to ceph_osd_build_request() describe the object id, but the values passed always come from the osd request structure whose address is also provided. Get rid of those last two parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:01 -06:00
Alex Elder	0ec8ce87f3	rbd: separate layout init Pull a block of code that initializes the layout structure in an osd request into its own function so it can be reused. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:01 -06:00
Alex Elder	a7b4c65f4f	rbd: only get snap context for write requests Right now we get the snapshot context for an rbd image (under protection of the header semaphore) for every request processed. There's no need to get the snap context if we're doing a read, so avoid doing so in that case. Note that we no longer need to hold the header semaphore to check the rbd_dev's existence flag. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:00 -06:00
Alex Elder	d78b650a59	rbd: make exists flag atomic The rbd_device->exists field can be updated asynchronously, changing from set to clear if a mapped snapshot disappears from the base image's snapshot context. Currently, value of the "exists" flag is only read and modified under protection of the header semaphore, but that will change with the next patch. Making it atomic ensures this won't be a problem because the a the non-existence of device will be immediately known. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:52:00 -06:00
Alex Elder	b395e8b5b8	rbd: a little more cleanup of rbd_rq_fn() Now that a big hunk in the middle of rbd_rq_fn() has been moved into its own routine we can simplify it a little more. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:51:51 -06:00
Alex Elder	cd323ac0eb	rbd: end request on error in rbd_do_request() caller Only one of the three callers of rbd_do_request() provide a collection structure to aggregate status. If an error occurs in rbd_do_request(), have the caller take care of calling rbd_coll_end_req() if necessary in that one spot. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:33:41 -06:00
Alex Elder	8295cda7ce	rbd: encapsulate handling for a single request In rbd_rq_fn(), requests are fetched from the block layer and each request is processed, looping through the request's list of bio's until they've all been consumed. Separate the handling for a single request into its own function to make it a bit easier to see what's going on. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 15:04:47 -06:00
Alex Elder	8986cb37b1	rbd: be picky about osd request status type The result field in a ceph osd reply header is a signed 32-bit type, but rbd code often casually uses int to represent it. The following changes the types of variables that handle this result value to be "s32" instead of "int" to be completely explicit about it. Only at the point we pass that result to __blk_end_request() does the type get converted to the plain old int defined for that interface. There is almost certainly no binary impact of this change, but I prefer to show the exact size and signedness of the value since we know it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>	2013-01-17 14:53:20 -06:00
Alex Elder	5f29ddd4f0	rbd: standardize ceph_osd_request variable names There are spots where a ceph_osds_request pointer variable is given the name "req". Since we're dealing with (at least) three types of requests (block layer, rbd, and osd), I find this slightly distracting. Change such instances to use "osd_req" consistently to make the abstraction represented a little more obvious. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:53:15 -06:00
Alex Elder	725afc97c9	rbd: standardize rbd_request variable names There are two names used for items of rbd_request structure type: "req" and "req_data". The former name is also used to represent items of pointers to struct ceph_osd_request. Change all variables that have these names so they are instead called "rbd_req" consistently. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:53:07 -06:00
Alex Elder	935dc89f3e	rbd: add warnings to rbd_dev_probe_update_spec() Josh suggested adding warnings to this function to help users diagnose problems. Other than memory allocatino errors, there are two places where errors can be returned. Both represent problems that should have been caught earlier, and as such might well have been handled with BUG_ON() calls. But if either ever did manage to happen, it will be reported. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:12:46 -06:00
Alex Elder	f5400b7a0e	rbd: add a warning in bio_chain_clone_range() Add a warning in bio_chain_clone_range() to help a user determine what exactly might have led to a failure. There is only one; please say something if you disagree with the following reasoning. There are three places this can return abnormally: - Initially, if there is nothing to clone. It turns out that right now this cannot happen anyway. The test is in place because the code below it doesn't work if those conditions don't hold. As such they could be assertions but since I can return a null to indicate an error I just do that instead. I have not added a warning here because it won't happen. - While processing bio's, if none remain but there are supposed to be more bytes to clone. Here I have added a warning. - If bio_clone_range() returns a null pointer. That function will have already produced a warning (at least the first time, via WARN_ON_ONCE()) to distinguish the cause of the error. The only exception is memory exhaustion, and I'd rather not pepper the code with warnings in all those spots. So no warning is added in that place. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:12:31 -06:00
Alex Elder	4fb5d67139	rbd: add warning messages for missing arguments Tell the user (via dmesg) what was wrong with the arguments provided via /sys/bus/rbd/add. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>	2013-01-17 14:10:21 -06:00
Alex Elder	06ecc6cbf7	rbd: define and use rbd_warn() Define a new function rbd_warn() that produces a boilerplate warning message, identifying in the resulting message the affected rbd device in the best way available. Use it in a few places that now use pr_warning(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:09:29 -06:00
Alex Elder	4caf35f9ec	rbd: use kmemdup() This replaces two kmalloc()/memcpy() combinations with a single call to kmemdup(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:09:00 -06:00
Alex Elder	979ed480a2	rbd: kill rbd_spec->image_id_len There is no real benefit to keeping the length of an image id, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:08:54 -06:00
Alex Elder	69e7a02f63	rbd: kill rbd_spec->image_name_len There may have been a benefit to hanging on to the length of an image name before, but there is really none now. The only time it's used is when probing for rbd images, so we can just compute the length then. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:08:46 -06:00
Alex Elder	c66c6e0c0b	rbd: document rbd_spec structure I promised Josh I would document whether there were any restrictions needed for accessing fields of an rbd_spec structure. This adds a big block of comments that documents the structure and how it is used--including the fact that we don't attempt to synchronize access to it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-01-17 14:07:50 -06:00
Fengguang Wu	478c030eec	drivers/block/mtip32xx/mtip32xx.c:1726:5: sparse: symbol 'mtip_send_trim' was not declared. Should it be static? Hi Asai, FYI, there are new sparse warnings show up in tree: git://git.kernel.dk/linux-block.git for-3.9/drivers head: `3d6a87430e` commit: `152834694d` [2/3] mtip32xx: add trim support >> drivers/block/mtip32xx/mtip32xx.c:1726:5: sparse: symbol 'mtip_send_trim' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:3348:17: sparse: cast to restricted __le32 drivers/block/mtip32xx/mtip32xx.c:4125:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4126:1: sparse: symbol 'mtip_workq_sdbf1' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4127:1: sparse: symbol 'mtip_workq_sdbf2' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4128:1: sparse: symbol 'mtip_workq_sdbf3' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4129:1: sparse: symbol 'mtip_workq_sdbf4' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4130:1: sparse: symbol 'mtip_workq_sdbf5' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4131:1: sparse: symbol 'mtip_workq_sdbf6' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c:4132:1: sparse: symbol 'mtip_workq_sdbf7' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_flags': drivers/block/mtip32xx/mtip32xx.c:2804:1: warning: the frame size of 1036 bytes is larger than 1024 bytes [-Wframe-larger-than=] drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_registers': drivers/block/mtip32xx/mtip32xx.c:2781:1: warning: the frame size of 1044 bytes is larger than 1024 bytes [-Wframe-larger-than=] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-12 09:15:19 +01:00
Fengguang Wu	25bac122b8	drivers/block/mtip32xx/mtip32xx.c:4029:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static? Hi Asai, FYI, there are new sparse warnings show up in tree: git://git.kernel.dk/linux-block.git for-3.9/drivers head: `3d6a87430e` commit: `16c906e51c` [1/3] mtip32xx: Add workqueue and NUMA support drivers/block/mtip32xx/mtip32xx.c:3267:17: sparse: cast to restricted __le32 >> drivers/block/mtip32xx/mtip32xx.c:4029:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static? >> drivers/block/mtip32xx/mtip32xx.c:4030:1: sparse: symbol 'mtip_workq_sdbf1' was not declared. Should it be static? >> drivers/block/mtip32xx/mtip32xx.c:4031:1: sparse: symbol 'mtip_workq_sdbf2' was not declared. Should it be static? >> drivers/block/mtip32xx/mtip32xx.c:4032:1: sparse: symbol 'mtip_workq_sdbf3' was not declared. Should it be static? >> drivers/block/mtip32xx/mtip32xx.c:4033:1: sparse: symbol 'mtip_workq_sdbf4' was not declared. Should it be static? >> drivers/block/mtip32xx/mtip32xx.c:4034:1: sparse: symbol 'mtip_workq_sdbf5' was not declared. Should it be static? >> drivers/block/mtip32xx/mtip32xx.c:4035:1: sparse: symbol 'mtip_workq_sdbf6' was not declared. Should it be static? >> drivers/block/mtip32xx/mtip32xx.c:4036:1: sparse: symbol 'mtip_workq_sdbf7' was not declared. Should it be static? drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_flags': drivers/block/mtip32xx/mtip32xx.c:2723:1: warning: the frame size of 1036 bytes is larger than 1024 bytes [-Wframe-larger-than=] drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_registers': drivers/block/mtip32xx/mtip32xx.c:2700:1: warning: the frame size of 1044 bytes is larger than 1024 bytes [-Wframe-larger-than=] Please consider folding the below diff :-) Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-12 09:15:11 +01:00
Dan Carpenter	3d6a87430e	dac960: return success instead of -ENOTTY There is a missing break statement here. This used to return directly but we re-worked it in 2008 to add locking as part of the BKL push down. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-11 14:42:36 +01:00
Asai Thambi S P	152834694d	mtip32xx: add trim support TRIM support added through vendor unique command. Signed-off-by: Sam Bradshaw < sbradshaw@micron.com> Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-11 14:41:34 +01:00
Asai Thambi S P	16c906e51c	mtip32xx: Add workqueue and NUMA support This patch contains * parallel command completion using workers * bind the workers to the chosen numa node * bind isr to the chosen numa node * allocating memory in the chosen numa node Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-11 14:38:57 +01:00
Asai Thambi S P	58c49df378	mtip32xx: fix for crash when the device surprise removed during rebuild When rebuild is in progress, disk->queue is yet to be created. Surprise removing the device will call remove()-> del_gendisk(). del_gendisk() expect disk->queue be not NULL. Fix is to call put_disk() when disk_queue is NULL. Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-11 14:35:58 +01:00
Asai Thambi S P	47cd4b3c7e	mtip32xx: fix for driver hang after a command timeout If an I/O command times out when a PIO command is active, MTIP_PF_EH_ACTIVE_BIT is not cleared. This results in I/O hang in the driver. Fix is to clear this bit. Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-11 14:35:55 +01:00
Paul Gortmaker	d1a6f4f197	block: delete super ancient PC-XT driver for 1980's hardware This driver was for the 8 bit ISA cards that were installed in the PC-XT machines of 1980 vintage. They supported the dual ribbon cable MFM drives of 10-20MB capacity, and ran at a 3:1 interleave, giving performance on the order of 128kB/s. By the introduction of the PC-AT (286) these controllers were already scrapped in favour of 16 bit controllers with some onboard RAM that could support a 1:1 interleave. The git history doesn't show any evidence of runtime fixes that would reflect active usage; instead just the usual tree-wide API type changes/cleanups. Going back to in-source changelogs, the last "runtime" fix that is evident is something I did over a dozen years ago[1] -- and even back then, the hardware was long since unavailable, so that ancient fix was also not runtime tested. The time is long overdue for this to get flushed, so lets get rid of it before anyone wastes more time doing builds and sparse checks etc. on long since dead code. [1] http://lkml.indiana.edu/hypermail/linux/kernel/0102.2/0027.html Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2013-01-04 20:17:40 -05:00
Greg Kroah-Hartman	8d85fce77e	Drivers: block: remove __dev* attributes. CONFIG_HOTPLUG is going away as an option. As a result, the __dev* markings need to be removed. This change removes the use of __devinit, __devexit_p, __devinitdata, __devinitconst, and __devexit from these drivers. Based on patches originally written by Bill Pemberton, but redone by me in order to handle some of the coding style issues better, by hand. Cc: Bill Pemberton <wfp5p@virginia.edu> Cc: Mike Miller <mike.miller@hp.com> Cc: Chirag Kantharia <chirag.kantharia@hp.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Jim Paris <jim@jtan.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: NeilBrown <neilb@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tao Guo <Tao.Guo@emc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-03 15:57:15 -08:00
Alexander Graf	f4953fe6c4	virtio-blk: Don't free ida when disk is in use When a file system is mounted on a virtio-blk disk, we then remove it and then reattach it, the reattached disk gets the same disk name and ids as the hot removed one. This leads to very nasty effects - mostly rendering the newly attached device completely unusable. Trying what happens when I do the same thing with a USB device, I saw that the sd node simply doesn't get free'd when a device gets forcefully removed. Imitate the same behavior for vd devices. This way broken vd devices simply are never free'd and newly attached ones keep working just fine. Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org	2013-01-02 15:37:58 +10:30
Linus Torvalds	40889e8d9f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph update from Sage Weil: "There are a few different groups of commits here. The largest is Alex's ongoing work to enable the coming RBD features (cloning, striping). There is some cleanup in libceph that goes along with it. Cyril and David have fixed some problems with NFS reexport (leaking dentries and page locks), and there is a batch of patches from Yan fixing problems with the fs client when running against a clustered MDS. There are a few bug fixes mixed in for good measure, many of which will be going to the stable trees once they're upstream. My apologies for the late pull. There is still a gremlin in the rbd map/unmap code and I was hoping to include the fix for that as well, but we haven't been able to confirm the fix is correct yet; I'll send that in a separate pull once it's nailed down." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits) rbd: get rid of rbd_{get,put}_dev() libceph: register request before unregister linger libceph: don't use rb_init_node() in ceph_osdc_alloc_request() libceph: init event->node in ceph_osdc_create_event() libceph: init osd->o_node in create_osd() libceph: report connection fault with warning libceph: socket can close in any connection state rbd: don't use ENOTSUPP rbd: remove linger unconditionally rbd: get rid of RBD_MAX_SEG_NAME_LEN libceph: avoid using freed osd in __kick_osd_requests() ceph: don't reference req after put rbd: do not allow remove of mounted-on image libceph: Unlock unprocessed pages in start_read() error path ceph: call handle_cap_grant() for cap import message ceph: Fix __ceph_do_pending_vmtruncate ceph: Don't add dirty inode to dirty list if caps is in migration ceph: Fix infinite loop in __wake_requests ceph: Don't update i_max_size when handling non-auth cap bdi_register: add __printf verification, fix arg mismatch ...	2012-12-20 14:00:13 -08:00
Alex Elder	c3e946ce72	rbd: get rid of rbd_{get,put}_dev() The functions rbd_get_dev() and rbd_put_dev() are trivial wrappers that add no value, and their existence suggests they may do more than what they do. Get rid of them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>	2012-12-20 10:56:44 -06:00
Jens Axboe	b6c46cfa31	Merge branch 'stable/for-jens-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus Konrad writes: Please git pull the following branch: git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.8 which has a bug-fix to the xen-blkfront and xen-blkback driver when using the persistent mode. An issue was discovered where LVM disks could not be read correctly and this fixes it. There is also a change in llist.h which has been blessed by akpm.	2012-12-19 20:37:10 +01:00
Linus Torvalds	848b81415c	Merge branch 'akpm' (Andrew's patch-bomb) Merge misc patches from Andrew Morton: "Incoming: - lots of misc stuff - backlight tree updates - lib/ updates - Oleg's percpu-rwsem changes - checkpatch - rtc - aoe - more checkpoint/restart support I still have a pile of MM stuff pending - Pekka should be merging later today after which that is good to go. A number of other things are twiddling thumbs awaiting maintainer merges." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits) scatterlist: don't BUG when we can trivially return a proper error. docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output fs, fanotify: add @mflags field to fanotify output docs: add documentation about /proc/<pid>/fdinfo/<fd> output fs, notify: add procfs fdinfo helper fs, exportfs: add exportfs_encode_inode_fh() helper fs, exportfs: escape nil dereference if no s_export_op present fs, epoll: add procfs fdinfo helper fs, eventfd: add procfs fdinfo helper procfs: add ability to plug in auxiliary fdinfo providers tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test breakpoint selftests: print failure status instead of cause make error kcmp selftests: print fail status instead of cause make error kcmp selftests: make run_tests fix mem-hotplug selftests: print failure status instead of cause make error cpu-hotplug selftests: print failure status instead of cause make error mqueue selftests: print failure status instead of cause make error vm selftests: print failure status instead of cause make error ubifs: use prandom_bytes mtd: nandsim: use prandom_bytes ...	2012-12-17 20:58:12 -08:00
Roger Pau Monne	d62f691858	xen-blkfront: handle bvecs with partial data Currently blkfront fails to handle cases in blkif_completion like the following: 1st loop in rq_for_each_segment * bv_offset: 3584 * bv_len: 512 * offset += bv_len * i: 0 2nd loop: * bv_offset: 0 * bv_len: 512 * i: 0 In the second loop i should be 1, since we assume we only wanted to read a part of the previous page. This patches fixes this cases where only a part of the shared page is read, and blkif_completion assumes that if the bv_offset of a bvec is less than the previous bv_offset plus the bv_size we have to switch to the next shared page. Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: linux-kernel@vger.kernel.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2012-12-17 21:56:03 -05:00
Roger Pau Monne	ebb351cf78	llist/xen-blkfront: implement safe version of llist_for_each_entry Implement a safe version of llist_for_each_entry, and use it in blkif_free. Previously grants where freed while iterating the list, which lead to dereferences when trying to fetch the next item. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> [v2: Move the llist_for_each_entry_safe in llist.h] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2012-12-17 21:55:56 -05:00
Dan Carpenter	31279b1457	aoe: fix use after free in aoedev_by_aoeaddr() We should return NULL on failure instead of returning a freed pointer. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:26 -08:00
Ed Cashin	2b37c7d865	aoe: update internal version number to 81 This version number is printed to the console on module initialization and is available in sysfs, which is where the userland aoe-version tool looks for it. Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:26 -08:00
Ed Cashin	bf29754ae8	aoe: identify source of runt AoE packets This change only affects experimental AoE storage networks. It modifies the console message about runt packets detected so that the AoE major and minor addresses of the AoE target that generated the runt are mentioned. Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:26 -08:00
Ed Cashin	4a6c9ee93c	aoe: allow comma separator in aoe_iflist value By default, the aoe driver uses any ethernet interface for AoE, but the aoe_iflist module parameter provides a convenient way to limit AoE traffic to a specific list of local network interfaces. This change allows a list to be specified using the comma character as a separator. For example, modprobe aoe aoe_iflist=eth2,eth3 Before, it was inconvenient to get the quoting right in shell scripts when setting aoe_iflist to have more than one network interface. Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:26 -08:00
Ed Cashin	c450ba0fc1	aoe: allow user to disable target failure timeout With this change, the aoe driver treats the value zero as special for the aoe_deadsecs module parameter. Normally, this value specifies the number of seconds during which the driver will continue to attempt retransmits to an unresponsive AoE target. After aoe_deadsecs has elapsed, the aoe driver marks the aoe device as "down" and fails all I/O. The new meaning of an aoe_deadsecs of zero is for the driver to retransmit commands indefinitely. Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	71114ec45f	aoe: use dynamic number of remote ports for AoE storage target Many AoE targets have four or fewer network ports, but some existing storage devices have many, and the AoE protocol sets no limit. This patch allows the use of more than eight remote MAC addresses per AoE target, while reducing the amount of memory used by the aoe driver in cases where there are many AoE targets with fewer than eight MAC addresses each. Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	e52a293264	aoe: avoid races between device destruction and discovery This change avoids a race that could result in a NULL pointer derference following a WARNing from kobject_add_internal, "don't try to register things with the same name in the same directory." The problem was found with a test that forgets and discovers an aoe device in a loop: while test ! -r /tmp/stop; do aoe-flush -a aoe-discover done The race was between aoedev_flush taking aoedevs out of the devlist, allowing a new discovery of the same AoE target to take place before the driver gets around to calling sysfs_remove_group. Fixing that one revealed another race between do_open and add_disk, and this patch avoids that, too. The fix required some care, because for flushing (forgetting) an aoedev, some of the steps must be performed under lock and some must be able to sleep. Also, for discovering a new aoedev, some steps might sleep. The check for a bad aoedev pointer remains from a time when about half of this patch was done, and it was possible for the bdev->bd_disk->private_data to become corrupted. The check should be removed eventually, but it is not expected to add significant overhead, occurring in the aoeblk_open routine. Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	bbb44e30d0	aoe: improve handling of misbehaving network paths An AoE target can have multiple network ports used for AoE, and in the aoe driver, those are tracked by the aoetgt struct. These changes allow the aoe driver to handle network paths, or aoetgts, that are not working well, compared to the others. Paths that do not get responses despite the retransmission of AoE commands are marked as "tainted", and non-tainted paths are preferred. Meanwhile, the aoe driver attempts to "probe" the tainted path in the background by issuing reads of LBA 0 that are padded out to full (possibly jumbo-frame) size. If the probes get responses, then the path is "redeemed", and its taint is removed. This mechanism has been shown to be helpful in transparently handling and recovering from real-world network "brown outs" in ways that the earlier "shoot the help-needing target in the head" mechanism could not. Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	b91316f2b7	aoe: return real minor number for static minors The value returned by the static minor device number number allocator is the real minor number, so it must be multiplied by the supported number of partitions per aoedev. Without this fix the support for systems without udev is incomplete, and the few users of aoe on such systems will have surprising results when device nodes names do not match the AoE target. Signed-off-by: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	10935d052e	aoe: initialize sysminor to avoid compiler warning Because the minor_get and related functions use the return values for errors, the compiler doesn't know that sysminor will always either 1) be initialized in aoedev_by_aoeaddr by the call to minor_get, or 2) be unused as the "goto out" is executed. This patch avoids the compiler warning. Signed-off-by: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	e0b2bbab0b	aoe: make error messages more specific in static minor allocation For some special-purpose systems where udev isn't present, static allocation of minor numbers is desirable. This update distinguishes different failure scenarios, to help the user understand what went wrong. Signed-off-by: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	60116cf773	aoe: remove call to request handler from I/O completion There is no need to call the request handler function in the I/O completion routine. The user impact of not doing it is a more "nice" aoe driver that is less susceptible to causing soft lockups. Signed-off-by: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	72837600ee	aoe: cleanup: correct comment for aoetgt nout A misplaced comment was attached to the nout member of the aoetgt. This change corrects the comment. Signed-off-by: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00
Ed Cashin	7b6ccc5f97	aoe: increase default cap on outstanding AoE commands in the network The aoe driver will never be waiting for more than aoe_maxout AoE commands from a given remote network port on an AoE target. Increasing the cap increases performance. Users can tighten the setting to reduce the amount of memory used for handling AoE traffic or the network bandwidth used for AoE. Signed-off-by: Ed Cashin <ecashin@coraid.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:25 -08:00

... 6 7 8 9 10 ...

4096 Commits