linux

Commit Graph

Author	SHA1	Message	Date
Ilya Dryomov	663ae2cc04	rbd: get/put img_request in rbd_img_request_submit() By the time we get to checking for_each_obj_request_safe(img_request) terminating condition, all obj_requests may be complete and img_request ref, that rbd_img_request_submit() takes away from its caller, may be put. Moving the next_obj_request cursor is then a use-after-free on img_request. It's totally benign, as the value that's read is never used, but I think it's still worth fixing. Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-05-26 00:36:20 +02:00
Ilya Dryomov	d3767f0fae	rbd: report unsupported features to syslog ... instead of just returning an error. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>	2016-04-28 10:07:43 +02:00
Ilya Dryomov	811c668877	rbd: fix rbd map vs notify races A while ago, commit `9875201e10` ("rbd: fix use-after free of rbd_dev->disk") fixed rbd unmap vs notify race by introducing an exported wrapper for flushing notifies and sticking it into do_rbd_remove(). A similar problem exists on the rbd map path, though: the watch is registered in rbd_dev_image_probe(), while the disk is set up quite a few steps later, in rbd_dev_device_setup(). Nothing prevents a notify from coming in and crashing on a NULL rbd_dev->disk: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 Call Trace: [<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd] [<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e3ab>] worker_thread+0x11b/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff81645dd8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 RIP [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd] If an error occurs during rbd map, we have to error out, potentially tearing down a watch. Just like on rbd unmap, notifies have to be flushed, otherwise rbd_watch_cb() may end up trying to read in the image header after rbd_dev_image_release() has run: Assertion failure in rbd_dev_header_info() at line 4722: rbd_assert(rbd_image_format_valid(rbd_dev->image_format)); Call Trace: [<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150 [<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390 [<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290 [<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0 [<ffffffff81107799>] process_one_work+0x689/0x1a80 [<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80 [<ffffffff81132065>] ? finish_task_switch+0x225/0x640 [<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0 [<ffffffff81108c69>] worker_thread+0xd9/0x1320 [<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80 [<ffffffff8111b02d>] kthread+0x21d/0x2e0 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 [<ffffffff82022802>] ret_from_fork+0x22/0x40 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 RIP [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30 To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling revalidate_disk(), b) move ceph_osdc_flush_notifies() call into rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn header read-in into a critical section. The latter also happens to take care of rbd map foo@bar vs rbd snap rm foo@bar race. Fixes: http://tracker.ceph.com/issues/15490 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>	2016-04-28 10:07:22 +02:00
David Disseldorp	2224d879c7	rbd: use GFP_NOIO consistently for request allocations As of `5a60e87603`, RBD object request allocations are made via rbd_obj_request_create() with GFP_NOIO. However, subsequent OSD request allocations in rbd_osd_req_create*() use GFP_ATOMIC. With heavy page cache usage (e.g. OSDs running on same host as krbd client), rbd_osd_req_create() order-1 GFP_ATOMIC allocations have been observed to fail, where direct reclaim would have allowed GFP_NOIO allocations to succeed. Cc: stable@vger.kernel.org # 3.18+ Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Neil Brown <neilb@suse.com> Signed-off-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-04-05 22:11:37 +02:00
Geliang Tang	03d9440676	rbd: use KMEM_CACHE macro Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:56 +01:00
Ilya Dryomov	3f1af42ad0	libceph: enable large, variable-sized OSD requests Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:43 +01:00
Yan, Zheng	7665d85b73	libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op This avoids defining large array of r_reply_op_{len,result} in in struct ceph_osd_request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-03-25 18:51:42 +01:00
Markus Elfring	1761b22966	rbd: delete an unnecessary check before rbd_dev_destroy() The rbd_dev_destroy() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-01-21 19:36:07 +01:00
Ilya Dryomov	70b16db86f	rbd: don't put snap_context twice in rbd_queue_workfn() Commit `4e752f0ab0` ("rbd: access snapshot context and mapping size safely") moved ceph_get_snap_context() out of rbd_img_request_create() and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the error path in rbd_queue_workfn(). However, rbd_img_request_create() consumes a ref on snapc, so calling ceph_put_snap_context() after a successful rbd_img_request_create() leads to an extra put. Fix it. Cc: stable@vger.kernel.org # 3.18+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>	2015-12-04 14:29:18 +01:00
Ilya Dryomov	4afb04c0c8	rbd: remove duplicate calls to rbd_dev_mapping_clear() Commit `d1cf578845` ("rbd: set mapping info earlier") defined rbd_dev_mapping_clear(), but, just a few days after, commit `f35a4dee14` ("rbd: set the mapping size and features later") moved rbd_dev_mapping_set() calls and added another rbd_dev_mapping_clear() call instead of moving the old one. Around the same time, another duplicate was introduced in rbd_dev_device_release() - kill both. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-11-02 23:36:48 +01:00
Ilya Dryomov	6cac4695f2	rbd: set device_type::release instead of device::release No point in providing an empty device_type::release callback and then setting device::release for each rbd_dev dynamically. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-11-02 23:36:48 +01:00
Ilya Dryomov	dd5ac32d42	rbd: don't free rbd_dev outside of the release callback struct rbd_device has struct device embedded in it, which means it's part of kobject universe and has an unpredictable life cycle. Freeing its memory outside of the release callback is flawed, yet commits `200a6a8be5` ("rbd: don't destroy rbd_dev in device release function") and `8ad42cd0c0` ("rbd: don't have device release destroy rbd_dev") moved rbd_dev_destroy() out to rbd_dev_image_release(). This commit reverts most of that, the key points are: - rbd_dev->dev is initialized in rbd_dev_create(), making it possible to use rbd_dev_destroy() - which is just a put_device() - both before we register with device core and after. - rbd_dev_release() (the release callback) is the only place we kfree(rbd_dev). It's also where we do module_put(), keeping the module unload race window as small as possible. - We pin the module in rbd_dev_create(), but only for mapping rbd_dev-s. Moving image related stuff out of struct rbd_device into another struct which isn't tied with sysfs and device core is long overdue, but until that happens, this will keep rbd module refcount (which users can observe with lsmod) sane. Fixes: http://tracker.ceph.com/issues/12697 Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-11-02 23:36:48 +01:00
Ilya Dryomov	b51c83c241	rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes userspace think it needs to retry the write. Fix it - it's a leftover from the times when the equivalent of rbd_dev_create() was the first action in rbd_add(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-11-02 23:36:48 +01:00
Julia Lawall	13bf283408	rbd: drop null test before destroy functions Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) { $kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy$(x); x = NULL; -} // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-11-02 23:36:47 +01:00
Ronny Hegewald	bae818ee15	rbd: require stable pages if message data CRCs are enabled rbd requires stable pages, as it performs a crc of the page data before they are send to the OSDs. But since kernel 3.9 (patch `1d1d1a7672` "mm: only enforce stable page writes if the backing device requires it") it is not assumed anymore that block devices require stable pages. This patch sets the necessary flag to get stable pages back for rbd. In a ceph installation that provides multiple ext4 formatted rbd devices "bad crc" messages appeared regularly (ca 1 message every 1-2 minutes on every OSD that provided the data for the rbd) in the OSD-logs before this patch. After this patch this messages are pretty much gone (only ca 1-2 / month / OSD). Cc: stable@vger.kernel.org # 3.9+, needs backporting Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de> [idryomov@gmail.com: require stable pages only in crc case, changelog] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-10-30 19:25:02 +01:00
Ilya Dryomov	6d69bb536b	rbd: prevent kernel stack blow up on rbd map Mapping an image with a long parent chain (e.g. image foo, whose parent is bar, whose parent is baz, etc) currently leads to a kernel stack overflow, due to the following recursion in the reply path: rbd_osd_req_callback() rbd_obj_request_complete() rbd_img_obj_callback() rbd_img_parent_read_callback() rbd_obj_request_complete() ... Limit the parent chain to 16 images, which is ~5K worth of stack. When the above recursion is eliminated, this limit can be lifted. Fixes: http://tracker.ceph.com/issues/12538 Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>	2015-10-23 18:37:24 +02:00
Ilya Dryomov	1f2c6651f6	rbd: don't leak parent_spec in rbd_dev_probe_parent() Currently we leak parent_spec and trigger a "parent reference underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. The problem is we take the !parent out_err branch and that only drops refcounts; parent_spec that would've been freed had we called rbd_dev_unparent() remains and triggers rbd_warn() in rbd_dev_parent_put() - at that point we have parent_spec != NULL and parent_ref == 0, so counter ends up being -1 after the decrement. Redo rbd_dev_probe_parent() to fix this. Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 4.2 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-10-23 18:36:03 +02:00
Ilya Dryomov	e30b7577bf	rbd: use writefull op for object size writes This covers only the simplest case - an object size sized write, but it's still useful in tiering setups when EC is used for the base tier as writefull op can be proxied, saving an object promotion. Even though updating ceph_osdc_new_request() to allow writefull should just be a matter of fixing an assert, I didn't do it because its only user is cephfs. All other sites were updated. Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-10-16 16:49:01 +02:00
Ilya Dryomov	0d9fde4fc8	rbd: set max_sectors explicitly Commit `30e2bc08b2` ("Revert "block: remove artifical max_hw_sectors cap"") restored a clamp on max_sectors. It's now 2560 sectors instead of 1024, but it's not good enough: we set max_hw_sectors to rbd object size because we don't want object sized I/Os to be split, and the default object size is 4M. So, set max_sectors to max_hw_sectors in rbd at queue init time. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-10-16 16:48:36 +02:00
Linus Torvalds	e013f74b60	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph update from Sage Weil: "There are a few fixes for snapshot behavior with CephFS and support for the new keepalive protocol from Zheng, a libceph fix that affects both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya, and several small fixes and cleanups from Jianpeng and others" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: improve readahead for file holes ceph: get inode size for each append write libceph: check data_len in ->alloc_msg() libceph: use keepalive2 to verify the mon session is alive rbd: plug rbd_dev->header.object_prefix memory leak rbd: fix double free on rbd_dev->header_name libceph: set 'exists' flag for newly up osd ceph: cleanup use of ceph_msg_get ceph: no need to get parent inode in ceph_open ceph: remove the useless judgement ceph: remove redundant test of head->safe and silence static analysis warnings ceph: fix queuing inode to mdsdir's snaprealm libceph: rename con_work() to ceph_con_workfn() libceph: Avoid holding the zero page on ceph_msgr_slab_init errors libceph: remove the unused macro AES_KEY_SIZE ceph: invalidate dirty pages after forced umount ceph: EIO all operations after forced umount	2015-09-11 12:33:03 -07:00
Ilya Dryomov	d194cd1dd1	rbd: plug rbd_dev->header.object_prefix memory leak Need to free object_prefix when rbd_dev_v2_snap_context() fails, but only if this is the first time we are reading in the header. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-09-08 23:14:30 +03:00
Ilya Dryomov	3ebe138ac6	rbd: fix double free on rbd_dev->header_name If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name is freed twice: once in rbd_dev_probe_parent() and then in its caller rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to handle parent images). rbd_dev_probe_parent() is responsible for probing the parent, so it shouldn't muck with clone's fields. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-09-08 23:14:29 +03:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
Kent Overstreet	8ae126660f	block: kill merge_bvec_fn() completely As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:57 -06:00
Ilya Dryomov	2761713d35	rbd: fix copyup completion race For write/discard obj_requests that involved a copyup method call, the opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is rbd_img_obj_copyup_callback(). The latter frees copyup pages, sets ->xferred and delegates to rbd_img_obj_callback(), the "normal" image object callback, for reporting to block layer and putting refs. rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op, which means obj_request is marked done in rbd_osd_trivial_callback(), before ->callback is invoked and rbd_img_obj_copyup_callback() has a chance to run. Marking obj_request done essentially means giving rbd_img_obj_callback() a license to end it at any moment, so if another obj_request from the same img_request is being completed concurrently, rbd_img_obj_end_request() may very well be called on such prematurally marked done request: <obj_request-1/2 reply> handle_reply() rbd_osd_req_callback() rbd_osd_trivial_callback() rbd_obj_request_complete() rbd_img_obj_copyup_callback() rbd_img_obj_callback() <obj_request-2/2 reply> handle_reply() rbd_osd_req_callback() rbd_osd_trivial_callback() for_each_obj_request(obj_request->img_request) { rbd_img_obj_end_request(obj_request-1/2) rbd_img_obj_end_request(obj_request-2/2) <-- } Calling rbd_img_obj_end_request() on such a request leads to trouble, in particular because its ->xfferred is 0. We report 0 to the block layer with blk_update_request(), get back 1 for "this request has more data in flight" and then trip on rbd_assert(more ^ (which == img_request->obj_request_count)); with rhs (which == ...) being 1 because rbd_img_obj_end_request() has been called for both requests and lhs (more) being 1 because we haven't got a chance to set ->xfferred in rbd_img_obj_copyup_callback() yet. To fix this, leverage that rbd wants to call class methods in only two cases: one is a generic method call wrapper (obj_request is standalone) and the other is a copyup (obj_request is part of an img_request). So make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke rbd_img_obj_copyup_callback() from it if obj_request is part of an img_request, similar to how CEPH_OSD_OP_READ handler invokes rbd_img_obj_request_read_callback(). Since rbd_img_obj_copyup_callback() is now being called from the OSD request callback (only), it is renamed to rbd_osd_copyup_callback(). Cc: Alex Elder <elder@linaro.org> Cc: stable@vger.kernel.org # 3.10+, needs backporting for < 3.18 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-07-31 11:38:57 +03:00
Jens Axboe	2bb4cd5cc4	block: have drivers use blk_queue_max_discard_sectors() Some drivers use it now, others just set the limits field manually. But in preparation for splitting this into a hard and soft limit, ensure that they all call the proper function for setting the hw limit for discards. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-17 08:41:53 -06:00
Ilya Dryomov	5a60e87603	rbd: use GFP_NOIO in rbd_obj_request_create() rbd_obj_request_create() is called on the main I/O path, so we need to use GFP_NOIO to make sure allocation doesn't blow back on us. Not all callers need this, but I'm still hardcoding the flag inside rather than making it a parameter because a) this is going to stable, and b) those callers shouldn't really use rbd_obj_request_create() and will be fixed in the future. More memory allocation fixes will follow. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-07-01 00:46:46 +03:00
Ilya Dryomov	b55841807f	rbd: queue_depth map option nr_requests (/sys/block/rbd<id>/queue/nr_requests) is pretty much irrelevant in blk-mq case because each driver sets its own max depth that it can handle and that's the number of tags that gets preallocated on setup. Users can't increase queue depth beyond that value via writing to nr_requests. For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most cases but we want to give users the opportunity to increase it. Introduce a new per-device queue_depth option to do just that: $ sudo rbd map -o queue_depth=1024 ... Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 18:30:55 +03:00
Ilya Dryomov	d147543d79	rbd: store rbd_options in rbd_device Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 18:30:55 +03:00
Ilya Dryomov	210c104c54	rbd: terminate rbd_opts_tokens with Opt_err Also nuke useless Opt_last_bool and don't break lines unnecessarily. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 18:30:54 +03:00
Ilya Dryomov	d3834fefcf	rbd: bump queue_max_segments The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128) unnecessarily limits bio sizes to 512k (assuming 4k pages). rbd, being a virtual block device, doesn't have any restrictions on the number of physical segments, so bump max_segments to max_hw_sectors, in theory allowing a sector per segment (although the only case this matters that I can think of is some readv/writev style thing). In practice this is going to give us 1M bios - the number of segments in a bio is limited in bio_get_nr_vecs() by BIO_MAX_PAGES = 256. Note that this doesn't result in any improvement on a typical direct sequential test. This is because on a box with a not too badly fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice rbd object size sized requests. The only difference is the size of bios being merged - 512k vs 1M for something like $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 18:30:49 +03:00
Ilya Dryomov	2894e1d769	rbd: timeout watch teardown on unmap with mount_timeout As part of unmap sequence, kernel client has to talk to the OSDs to teardown watch on the header object. If none of the OSDs are available it would hang forever, until interrupted by a signal - when that happens we follow through with the rest of unmap procedure (i.e. unregister the device and put all the data structures) and the unmap is still considired successful (rbd cli tool exits with 0). The watch on the userspace side should eventually timeout so that's fine. This isn't very nice, because various userspace tools (pacemaker rbd resource agent, for example) then have to worry about setting up their own timeouts. Timeout it with mount_timeout (60 seconds by default). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Sage Weil <sage@redhat.com>	2015-06-25 11:49:30 +03:00
Ilya Dryomov	a319bf56a6	libceph: store timeouts in jiffies, verify user input There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 11:49:29 +03:00
Yan, Zheng	144cba1493	libceph: allow setting osd_req_op's flags Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 11:49:27 +03:00
Ilya Dryomov	082a75dad8	rbd: end I/O the entire obj_request on error When we end I/O struct request with error, we need to pass obj_request->length as @nr_bytes so that the entire obj_request worth of bytes is completed. Otherwise block layer ends up confused and we trip on rbd_assert(more ^ (which == img_request->obj_request_count)); in rbd_img_obj_callback() due to more being true no matter what. We already do it in most cases but we are missing some, in particular those where we don't even get a chance to submit any obj_requests, due to an early -ENOMEM for example. A number of obj_request->xferred assignments seem to be redundant but I haven't touched any of obj_request->xferred stuff to keep this small and isolated. Cc: Alex Elder <elder@linaro.org> Cc: stable@vger.kernel.org # 3.10+ Reported-by: Shawn Edwards <lesser.evil@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-05-01 16:44:30 -07:00
Ilya Dryomov	f77303bdda	rbd: rbd_wq comment is obsolete After the switch to blk-mq rbd_wq processes requests, not devices. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-22 18:33:49 +03:00
Ilya Dryomov	d8a2c89c86	rbd: mark block queue as non-rotational Set QUEUE_FLAG_NONROT. Following commit `b277da0a8a` ("block: disable entropy contributions for nonrot devices") we should also clear QUEUE_FLAG_ADD_RANDOM, but it's off by default for blk-mq drivers, so just note it in the comment. Also remove physical block size assignment - no sense in repeating defaults that are not going to change. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-20 18:55:38 +03:00
Ilya Dryomov	1fe480235a	rbd: be more informative on -ENOENT failures pr_info what exactly was the culprit: missing pool, image or snap. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-20 18:55:33 +03:00
Christoph Hellwig	7ad18afad0	rbd: convert to blk-mq This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical. This was tested by Alexandre DERUMIER in November, and found to give him 120000 iops, although the only comparism available was an old 3.10 kernel which gave 80000iops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <elder@linaro.org> [idryomov@gmail.com: context, blk_mq_init_queue() EH] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-02-19 14:27:42 +03:00
Ilya Dryomov	cf32bd9c86	rbd: do not treat standalone as flatten If the clone is resized down to 0, it becomes standalone. If such resize is carried over while an image is mapped we would detect this and call rbd_dev_parent_put() which means "let go of all parent state, including the spec(s) of parent images(s)". This leads to a mismatch between "rbd info" and sysfs parent fields, so a fix is in order. # rbd create --image-format 2 --size 1 foo # rbd snap create foo@snap # rbd snap protect foo@snap # rbd clone foo@snap bar # DEV=$(rbd map bar) # rbd resize --allow-shrink --size 0 bar # rbd resize --size 1 bar # rbd info bar \| grep parent parent: rbd/foo@snap Before: # cat /sys/bus/rbd/devices/0/parent (no parent image) After: # cat /sys/bus/rbd/devices/0/parent pool_id 0 pool_name rbd image_id 10056b8b4567 image_name foo snap_id 2 snap_name snap overlap 0 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-02-19 13:31:39 +03:00
Ilya Dryomov	73e39e4dba	rbd: fix error paths in rbd_dev_refresh() header_rwsem should be released on errors. Also remove useless rbd_dev->mapping.size != rbd_dev->header.image_size test. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:38 +03:00
Rickard Strandqvist	3a25cf43e0	rbd: nuke copy_token() It's been largely superseded by dup_token() and unused for over 2 years, identified by cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> [idryomov@redhat.com: changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:38 +03:00
Ilya Dryomov	e69b8d414f	rbd: drop parent_ref in rbd_dev_unprobe() unconditionally This effectively reverts the last hunk of `392a9dad7e` ("rbd: detect when clone image is flattened"). The problem with parent_overlap != 0 condition is that it's possible and completely valid to have an image with parent_overlap == 0 whose parent state needs to be cleaned up on unmap. The next commit, which drops the "clone image now standalone" logic, opens up another window of opportunity to hit this, but even without it # cat parent-ref.sh #!/bin/bash rbd create --image-format 2 --size 1 foo rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd resize --allow-shrink --size 0 bar rbd resize --size 1 bar DEV=$(rbd map bar) rbd unmap $DEV leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client hanging around. My thinking behind calling rbd_dev_parent_put() unconditionally is that there shouldn't be any requests in flight at that point in time as we are deep into unmap sequence. Hence, even if rbd_dev_unparent() caused by flatten is delayed by in-flight requests, it will have finished by the time we reach rbd_dev_unprobe() caused by unmap, thus turning unconditional rbd_dev_parent_put() into a no-op. Fixes: http://tracker.ceph.com/issues/10352 Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-01-28 16:12:02 +03:00
Ilya Dryomov	ae43e9d05e	rbd: fix rbd_dev_parent_get() when parent_overlap == 0 The comment for rbd_dev_parent_get() said * We must get the reference before checking for the overlap to * coordinate properly with zeroing the parent overlap in * rbd_dev_v2_parent_info() when an image gets flattened. We * drop it again if there is no overlap. but the "drop it again if there is no overlap" part was missing from the implementation. This lead to absurd parent_ref values for images with parent_overlap == 0, as parent_ref was incremented for each img_request and virtually never decremented. Fix this by leveraging the fact that refresh path calls rbd_dev_v2_parent_info() under header_rwsem and use it for read in rbd_dev_parent_get(), instead of messing around with atomics. Get rid of barriers in rbd_dev_v2_parent_info() while at it - I don't see what they'd pair with now and I suspect we are in a pretty miserable situation as far as proper locking goes regardless. Cc: stable@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-01-28 16:11:51 +03:00
Ilya Dryomov	7e868b6eff	rbd: don't treat CEPH_OSD_OP_DELETE as extent op CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This sneaked in with discard patches - it's one of the three osd ops (the other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard is implemented with. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-12-17 20:09:51 +03:00
SF Markus Elfring	e96a650a81	ceph, rbd: delete unnecessary checks before two function calls The functions ceph_put_snap_context() and iput() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> [idryomov@redhat.com: squashed rbd.c hunk, changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:50 +03:00
Jan Kara	a8d4205623	rbd: Fix error recovery in rbd_obj_read_sync() When we fail to allocate page vector in rbd_obj_read_sync() we just basically ignore the problem and continue which will result in an oops later. Fix the problem by returning proper error. CC: Yehuda Sadeh <yehuda@inktank.com> CC: Sage Weil <sage@inktank.com> CC: ceph-devel@vger.kernel.org CC: stable@vger.kernel.org Coverity-id: 1226882 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2014-10-30 13:11:51 +03:00
Ilya Dryomov	f5ee37bd31	rbd: use a single workqueue for all devices Using one queue per device doesn't make much sense given that our workfn processes "devices" and not "requests". Switch to a single workqueue for all devices. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-30 13:11:25 +03:00
Ilya Dryomov	792c3a9149	rbd: rbd workqueues need a resque worker Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups under memory pressure - we sit on the memory reclaim path. Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Tested-by: Micha Krause <micha@krausam.de> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:05 -07:00
Josh Durgin	b76f82398c	rbd: set the remaining discard properties to enable support max_discard_sectors must be set for the queue to support discard. Operations implementing discard for rbd zero data, so report that. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:37 +04:00
Josh Durgin	d3246fb0da	rbd: use helpers to handle discard for layered images correctly Only allocate two osd ops for discard requests, since the preallocation hint is only added for regular writes. Use rbd_img_obj_request_fill() to recreate the original write or discard osd operations, isolating that logic to one place, and change the assert in rbd_osd_req_create_copyup() to accept discard requests as well. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:36 +04:00
Josh Durgin	3b434a2aff	rbd: extract a method for adding object operations rbd_img_request_fill() creates a ceph_osd_request and has logic for adding the appropriate osd ops to it based on the request type and image properties. For layered images, the original rbd_obj_request is resent with a copyup operation in front, using a new ceph_osd_request. The logic for adding the original operations should be the same as when first sending them, so move it to a helper function. op_type only needs to be checked once, so create a helper for that as well and call it outside the loop in rbd_img_request_fill(). Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:35 +04:00
Josh Durgin	1c220881e3	rbd: make discard trigger copy-on-write Discard requests are a form of write, so they should go through the same process as plain write requests and trigger copy-on-write for layered images. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:34 +04:00
Josh Durgin	d0265de7c3	rbd: tolerate -ENOENT for discard operations Discard may try to delete an object from a non-layered image that does not exist. If this occurs, the image already has no data in that range, so change the result to success. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:33 +04:00
Josh Durgin	bef95455a4	rbd: fix snapshot context reference count for discards Discards take a reference to the snapshot context of an image when they are created. This reference needs to be cleaned up when the request is done just as it is for regular writes. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:32 +04:00
Josh Durgin	3c5df89367	rbd: read image size for discard check safely In rbd_img_request_fill() the image size is only checked to determine whether we can truncate an object instead of zeroing it for discard requests. Take rbd_dev->header_rwsem while reading the image size, and move this read into the discard check, so that non-discard ops don't need to take the semaphore in this function. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:32 +04:00
Guangliang Zhao	90e98c5229	rbd: initial discard bits from Guangliang Zhao This patch add the discard support for rbd driver. There are three types operation in the driver: 1. The objects would be removed if they completely contained within the discard range. 2. The objects would be truncated if they partly contained within the discard range, and align with their boundary. 3. Others would be zeroed. A discard request from blkdev_issue_discard() is defined which REQ_WRITE and REQ_DISCARD both marked and no data, so we must check the REQ_DISCARD first when getting the request type. This resolve: http://tracker.ceph.com/issues/190 [ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up commits by Josh Durgin for refinements and fixes which weren't folded in to preserve authorship. ] Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:31 +04:00
Guangliang Zhao	6d2940c881	rbd: extend the operation type It could only handle the read and write operations now, extend it for the coming discard support. Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:30 +04:00
Guangliang Zhao	c622d22615	rbd: skip the copyup when an entire object writing It need to copyup the parent's content when layered writing, but an entire object write would overwrite it, so skip it. Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:29 +04:00
Ilya Dryomov	70d045f660	rbd: add img_obj_request_simple() helper To clarify the conditions and make it easier to add new ones. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-10-14 21:03:28 +04:00
Josh Durgin	4e752f0ab0	rbd: access snapshot context and mapping size safely These fields may both change while the image is mapped if a snapshot is created or deleted or the image is resized. They are guarded by rbd_dev->header_rwsem, so hold that while reading them, and store a local copy to refer to outside of the critical section. The local copy will stay consistent since the snapshot context is reference counted, and the mapping size is just a u64. This prevents torn loads from giving us inconsistent values. Move reading header.snapc into the caller of rbd_img_request_create() so that we only need to take the semaphore once. The read-only caller, rbd_parent_request_create() can just pass NULL for snapc, since the snapshot context is only relevant for writes. Signed-off-by: Josh Durgin <josh.durgin@inktank.com>	2014-10-14 21:03:27 +04:00
Ilya Dryomov	7dd440c9e0	rbd: do not return -ERANGE on auth failures Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL sink, thus propagating rbd::get_id cls method errors. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:26 +04:00
Wei Yongjun	255939e783	rbd: fix error return code in rbd_dev_device_setup() Fix to return -ENOMEM from the workqueue alloc error handling case instead of 0, as done elsewhere in this function. Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>	2014-09-10 11:59:06 +04:00
Ilya Dryomov	58d1362b50	rbd: avoid format-security warning inside alloc_workqueue() drivers/block/rbd.c: In function ‘rbd_dev_device_setup’: drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security] Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-09-10 11:59:06 +04:00
Ilya Dryomov	9584d50826	rbd: remove extra newlines from rbd_warn() messages rbd_warn() string should be a single line - rbd_warn() appends \n. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-08-07 16:01:09 +04:00
Ilya Dryomov	7a716aac01	rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC Now that rbd_img_request_create() is called from work functions, no need to use GFP_ATOMIC. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-08-07 16:00:52 +04:00
Ilya Dryomov	bc1ecc65a2	rbd: rework rbd_request_fn() While it was never a good idea to sleep in request_fn(), commit `34c6bc2c91` ("locking/mutexes: Add extra reschedule point") made it a bad idea. mutex_lock() since 3.15 may reschedule before putting task on the mutex wait queue, which for tasks in !TASK_RUNNING state means block forever. request_fn() may be called with !TASK_RUNNING on the way to schedule() in io_schedule(). Offload request handling to a workqueue, one per rbd device, to avoid calling blocking primitives from rbd_request_fn(). Fixes: http://tracker.ceph.com/issues/8818 Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Tested-by: Eric Eastman <eric0e@aol.com> Tested-by: Greg Wilson <greg.wilson@keepertech.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-08-07 14:56:20 +04:00
Ilya Dryomov	4d9b67cddd	rbd: take snap_id into account when reading in parent info If we are mapping a snapshot, we must read in the parent_overlap value of that snapshot instead of that of the base image. Not doing so may in particular result in us returning zeros instead of user data: # cat overlap-snap.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null echo "Base image" dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null \| xxd rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd snap create bar@snap BAR_DEV=$(rbd map bar@snap) echo "Snapshot" dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null \| xxd rbd resize --allow-shrink --size 4 bar echo "Snapshot after base image resize" dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null \| xxd # ./overlap-snap.sh Base image 0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6. Snapshot 0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6. Resizing image: 100% complete...done. Snapshot after base image resize 0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........ Even though bar@snap is taken with the old bar parent_overlap (8M), reads from bar@snap beyond the new bar parent_overlap (4M) return zeroes. Fix it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:16:50 +04:00
Ilya Dryomov	e8f59b595d	rbd: do not read in parent info before snap context Currently rbd_dev_v2_header_info() reads in parent info before the snap context is read in. This is wrong, because we may need to look at the the parent_overlap value of the snapshot instead of that of the base image, for example when mapping a snapshot - see next commit. (When mapping a snapshot, all we got is its name and we need the snap context to translate that name into an id to know which parent info to look for.) The approach taken here is to make sure rbd_dev_v2_parent_info() is called after the snap context has been read in. The other approach would be to add a parent_overlap field to struct rbd_mapping and maintain it the same way rbd_mapping::size is maintained. The reason I chose the first approach is that the value of keeping around both base image values and the actual mapping values is unclear to me. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:16:14 +04:00
Ilya Dryomov	5ff1108ccc	rbd: update mapping size only on refresh There is no sense in trying to update the mapping size before it's even been set. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:58 +04:00
Ilya Dryomov	52bb1f9bed	rbd: harden rbd_dev_refresh() and callers a bit Recently discovered watch/notify problems showed that we really can't ignore errors in anything refresh related. Alas, currently there is not much we can do in response to those errors, except print warnings. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:44 +04:00
Ilya Dryomov	0407759971	rbd: split rbd_dev_spec_update() into two functions rbd_dev_spec_update() has two modes of operation, with nothing in common between them. Split it into two functions, one for each mode and make our expectations more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:35 +04:00
Ilya Dryomov	7626eb7d82	rbd: remove unnecessary asserts in rbd_dev_image_probe() spec->image_id assert doesn't buy us much and image_format is asserted in rbd_dev_header_name() and rbd_dev_header_info() anyway. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:18 +04:00
Ilya Dryomov	a720ae0901	rbd: introduce rbd_dev_header_info() A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:15:00 +04:00
Ilya Dryomov	ff96128fb0	rbd: show the entire chain of parent images Make /sys/bus/rbd/devices/<id>/parent show the entire chain of parent images. While at it, kernel sprintf() doesn't return negative values, casting to unsigned long long is no longer necessary and there is no good reason to split into multiple sprintf() calls. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-25 13:14:27 +04:00
Himangi Saraogi	7d5079aa8b	rbd: use rbd_segment_name_free() instead of kfree() Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather than kfree. The helper rbd_segment_name_free does the job here. Its position is shifted above the calling function. The Coccinelle semantic patch that detects this change is as follows: // <smpl> @@ expression x,E,c; @@ x = $kmem_cache_alloc\\|kmem_cache_zalloc\\|kmem_cache_alloc_node$(c,...) ... when != x = E when != &x ?-kfree(x) +kmem_cache_free(c,x) // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-07-24 11:57:27 +04:00
Ilya Dryomov	fbba11b3be	rbd: do not leak image_id in rbd_dev_v2_parent_info() image_id is leaked if the parent happens to have been recorded already. Fix it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:46 +04:00
Ilya Dryomov	76756a51e2	rbd: use rbd_obj_watch_request_helper() helper Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix rbd_dev_header_unwatch_sync() to destroy watch_request structures before queuing watch-remove message while at it. This mistake slipped into commit `b30a01f2a3` ("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()") and could lead to "image still in use" errors on image removal. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:45 +04:00
Ilya Dryomov	bb040aa03c	rbd: add rbd_obj_watch_request_helper() helper In the past, rbd_dev_header_watch_sync() used to handle both watch and unwatch requests and was entangled and leaky. Commit `b30a01f2a3` ("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()") split it into two separate functions. This commit cleanly abstracts the common bits, relying on the fixed rbd_obj_request_wait(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:45 +04:00
Ilya Dryomov	71c20a066f	rbd: rbd_obj_request_wait() should cancel the request if interrupted rbd_obj_request_wait() should cancel the underlying OSD request if interrupted. Otherwise libceph will hold onto it indefinitely, causing assert failures or leaking the original object request. This also adds an rbd wrapper around ceph_osdc_cancel_request() to match rbd_obj_request_submit() and rbd_obj_request_wait(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:45 +04:00
Ilya Dryomov	9638556a27	rbd: handle parent_overlap on writes correctly The following check in rbd_img_obj_request_submit() rbd_dev->parent_overlap <= obj_request->img_offset allows the fall through to the non-layered write case even if both parent_overlap and obj_request->img_offset belong to the same RADOS object. This leads to data corruption, because the area to the left of parent_overlap ends up unconditionally zero-filled instead of being populated with parent data. Suppose we want to write 1M to offset 6M of image bar, which is a clone of foo@snap; object_size is 4M, parent_overlap is 5M: rbd_data.<id>.0000000000000001 ---------------------\|----------------------\|------------ \| should be copyup'ed \| should be zeroed out \| write ... ---------------------\|----------------------\|------------ 4M 5M 6M parent_overlap obj_request->img_offset 4..5M should be copyup'ed from foo, yet it is zero-filled, just like 5..6M is. Given that the only striping mode kernel client currently supports is chunking (i.e. stripe_unit == object_size, stripe_count == 1), round parent_overlap up to the next object boundary for the purposes of the overlap check. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-06-23 12:55:37 +04:00
Josh Durgin	22001f619f	rbd: only set disk to read-only once rbd_open(), called every time the device is opened, calls set_device_ro(). There's no reason to set the device read-only or read-write every time it is opened. Just do this once during device setup, using set_disk_ro() instead because the struct block_device isn't available to us there. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-06-10 18:09:25 -07:00
Josh Durgin	77f33c0373	rbd: move calls that may sleep out of spin lock range get_user() and set_disk_ro() may allocate memory, leading to a potential deadlock if theye are called while a spin lock is held. Move the acquisition and release of rbd_dev->lock from rbd_ioctl() into rbd_ioctl_set_ro(), so it can occur between get_user() and set_disk_ro(). Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-06-10 18:09:25 -07:00
Guangliang Zhao	131fd9f6fc	rbd: add ioctl for rbd When running the following commands: [root@ceph0 mnt]# blockdev --setro /dev/rbd1 [root@ceph0 mnt]# blockdev --getro /dev/rbd1 0 The block setro didn't take effect, it is because the rbd doesn't support ioctl of block driver. This resolves: http://tracker.ceph.com/issues/6265 Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-06-10 18:09:24 -07:00
Ilya Dryomov	ffe312cf31	rbd: fix ida/idr memory leak ida_destroy() needs to be called on module exit to release ida caches. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-06-06 09:30:00 +08:00
Alex Elder	0f2d5be792	rbd: use reference counts for image requests Each image request contains a reference count, but to date it has not actually been used. (I think this was just an oversight.) A recent report involving rbd failing an assertion shed light on why and where we need to use these reference counts. Every OSD request associated with an object request uses rbd_osd_req_callback() as its callback function. That function will call a helper function (dependent on the type of OSD request) that will set the object request's "done" flag if the object request if appropriate. If that "done" flag is set, the object request is passed to rbd_obj_request_complete(). In rbd_obj_request_complete(), requests are processed in sequential order. So if an object request completes before one of its predecessors in the image request, the completion is deferred. Otherwise, if it's a completing object's "turn" to be completed, it is passed to rbd_img_obj_end_request(), which records the result of the operation, accumulates transferred bytes, and so on. Next, the successor to this request is checked and if it is marked "done", (deferred) completion processing is performed on that request, and so on. If the last object request in an image request is completed, rbd_img_request_complete() is called, which (typically) destroys the image request. There is a race here, however. The instant an object request is marked "done" it can be provided (by a thread handling completion of one of its predecessor operations) to rbd_img_obj_end_request(), which (for the last request) can then lead to the image request getting torn down. And this can happen before that object has itself entered rbd_img_obj_end_request(). As a result, once it does enter that function, the image request (and even the object request itself) may have been freed and become invalid. All that's necessary to avoid this is to properly count references to the image requests. We tear down an image request's object requests all at once--only when the entire image request has completed. So there's no need for an image request to count references for its object requests. However, we don't want an image request to go away until the last of its object requests has passed through rbd_img_obj_callback(). In other words, we don't want rbd_img_request_complete() to necessarily result in the image request being destroyed, because it may get called before we've finished processing on all of its object requests. So the fix is to add a reference to an image request for each of its object requests. The reference can be viewed as representing an object request that has not yet finished its call to rbd_img_obj_callback(). That is emphasized by getting the reference right after assigning that as the image object's callback function. The corresponding release of that reference is done at the end of rbd_img_obj_callback(), which every image object request passes through exactly once. Cc: stable@vger.kernel.org Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-06-06 09:29:59 +08:00
Ilya Dryomov	b30a01f2a3	rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() osd_request, along with r_request and r_reply messages attached to it are leaked in __rbd_dev_header_watch_sync() if the requested image doesn't exist. This is because lingering requests are special and get an extra ref in the reply path. Fix it by unregistering linger request on the error path and split __rbd_dev_header_watch_sync() into two functions to make it maintainable. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-06-06 09:29:59 +08:00
Ilya Dryomov	30ba1f0202	rbd: make sure we have latest osdmap on 'rbd map' Given an existing idle mapping (img1), mapping an image (img2) in a newly created pool (pool2) fails: $ ceph osd pool create pool1 8 8 $ rbd create --size 1000 pool1/img1 $ sudo rbd map pool1/img1 $ ceph osd pool create pool2 8 8 $ rbd create --size 1000 pool2/img2 $ sudo rbd map pool2/img2 rbd: sysfs write failed rbd: map failed: (2) No such file or directory This is because client instances are shared by default and we don't request an osdmap update when bumping a ref on an existing client. The fix is to use the mon_get_version request to see if the osdmap we have is the latest, and block until the requested update is received if it's not. Fixes: http://tracker.ceph.com/issues/8184 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-06-06 09:29:58 +08:00
Duan Jiong	461f758ac0	rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO This patch fixes coccinelle error regarding usage of IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>	2014-06-06 09:29:51 +08:00
Ilya Dryomov	0ccd592669	rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op In an effort to reduce fragmentation, prefix every rbd write with a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set to the object size (1 << order). Backwards compatibility is taken care of on the libceph/osd side. "The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to do it once. The reason every rbd write is prefixed is that rbd doesn't explicitly create objects and relies on writes creating them implicitly, so there is no place to stick a single hint op into. To get around that we decided to prefix every rbd write with a hint (just like write and setattr ops, hint op will create an object implicitly if it doesn't exist)." Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-03 10:33:52 +08:00
Ilya Dryomov	deb236b300	rbd: num_ops parameter for rbd_osd_req_create() In preparation for prefixing rbd writes with an allocation hint introduce a num_ops parameter for rbd_osd_req_create(). The rationale is that not every write request is a write op that needs to be prefixed (e.g. watch op), so the num_ops logic needs to be in the callers. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-03 10:33:52 +08:00
Ilya Dryomov	7cc69d42e6	libceph: bump CEPH_OSD_MAX_OP to 3 Our longest osd request now contains 3 ops: copyup+hint+write. Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was hard-coded to 2. Fix it, and switch to rbd_assert while at it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-03 10:33:52 +08:00
Ilya Dryomov	42dd037c08	rbd: fix error paths in rbd_img_request_fill() Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is not only insufficient, but also triggers an rbd_assert() in rbd_obj_request_destroy(): Assertion failure in rbd_obj_request_destroy() at line 1867: rbd_assert(obj_request->img_request == NULL); rbd_img_obj_request_add() adds obj_requests to the img_request, the opposite is rbd_img_obj_request_del(). Use it. Fixes: http://tracker.ceph.com/issues/7327 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-03 10:33:51 +08:00
Ilya Dryomov	62054da65c	rbd: remove out_partial label in rbd_img_request_fill() Commit `03507db631` ("rbd: fix buffer size for writes to images with snapshots") moved the call to rbd_img_obj_request_add() up, making the out_partial label bogus. Remove it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-03 10:33:50 +08:00
Alex Elder	638c323c4d	rbd: drop an unsafe assertion Olivier Bonvalet reported having repeated crashes due to a failed assertion he was hitting in rbd_img_obj_callback(): Assertion failure in rbd_img_obj_callback() at line 2165: rbd_assert(which >= img_request->next_completion); With a lot of help from Olivier with reproducing the problem we were able to determine the object and image requests had already been completed (and often freed) at the point the assertion failed. There was a great deal of discussion on the ceph-devel mailing list about this. The problem only arose when there were two (or more) object requests in an image request, and the problem was always seen when the second request was being completed. The problem is due to a race in the window between setting the "done" flag on an object request and checking the image request's next completion value. When the first object request completes, it checks to see if its successor request is marked "done", and if so, that request is also completed. In the process, the image request's next_completion value is updated to reflect that both the first and second requests are completed. By the time the second request is able to check the next_completion value, it has been set to a value greater than its own "which" value, which caused an assertion to fail. Fix this problem by skipping over any completion processing unless the completing object request is the next one expected. Test only for inequality (not >=), and eliminate the bad assertion. Tested-by: Olivier Bonvalet <ob@daevel.fr> Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-03-29 10:38:14 -07:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
Ilya Dryomov	3c972c95c6	libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before introducing r_target_{oloc,oid} needed for redirects. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-01-27 23:57:49 +02:00
Ilya Dryomov	4295f2217a	libceph: introduce and start using oid abstraction In preparation for tiering support, which would require having two (base and target) object names for each osd request and also copying those names around, introduce struct ceph_object_id (oid) and a couple helpers to facilitate those copies and encapsulate the fact that object name is not necessarily a NUL-terminated string. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-01-27 23:57:28 +02:00
Ilya Dryomov	2d0ebc5d59	libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-01-27 23:57:24 +02:00
Ilya Dryomov	22116525ba	libceph: start using oloc abstraction Instead of relying on pool fields in ceph_file_layout (for mapping) and ceph_pg (for enconding), start using ceph_object_locator (oloc) abstraction. Note that userspace oloc currently consists of pool, key, nspace and hash fields, while this one contains only a pool. This is OK, because at this point we only send (i.e. encode) olocs and never have to receive (i.e. decode) them. This makes keeping a copy of ceph_file_layout in every osd request unnecessary, so ceph_osd_request::r_file_layout field is nuked. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-01-27 23:57:03 +02:00
Ilya Dryomov	e37180c0f2	rbd: tear down watch request if rbd_dev_device_setup() fails Tear down watch request if rbd_dev_device_setup() fails. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:32:07 +02:00
Ilya Dryomov	fca2706539	rbd: introduce rbd_dev_header_unwatch_sync() and switch to it Rename rbd_dev_header_watch_sync() to __rbd_dev_header_watch_sync() and introduce two helpers: rbd_dev_header_{,un}watch_sync() to make it more clear what is going on. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:32:06 +02:00
Ilya Dryomov	7e513d4366	rbd: enable extended devt in single-major mode If single-major device number allocation scheme is turned on, instead of reserving 256 minors per device, which imposes a limit of 4096 images mapped at once, reserve 16 minors per device and enable extended devt feature. This results in a theoretical limit of 65536 images mapped at once, and still allows to have more than 15 partititions: partitions starting with 16th are mapped under major 259 (Block Extended Major): $ rbd showmapped id pool image snap device 0 rbd b5 - /dev/rbd0 # no partitions 1 rbd b2 - /dev/rbd1 # 40 partitions 2 rbd b3 - /dev/rbd2 # 2 partitions $ cat /proc/partitions 251 0 1024 rbd0 251 16 1024 rbd1 251 17 0 rbd1p1 251 18 0 rbd1p2 ... 251 30 0 rbd1p14 251 31 0 rbd1p15 259 0 0 rbd1p16 259 1 0 rbd1p17 ... 259 23 0 rbd1p39 259 24 0 rbd1p40 251 32 1024 rbd2 251 33 0 rbd2p1 251 34 0 rbd2p2 (major 251 was assigned dynamically at module load time) Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:32:04 +02:00
Ilya Dryomov	9b60e70b3b	rbd: add support for single-major device number allocation scheme Currently each rbd device is allocated its own major number, which leads to a hard limit of 230-250 images mapped at once. This commit adds support for a new single-major device number allocation scheme, which is hidden behind a new single_major boolean module parameter and is disabled by default for backwards compatibility reasons. (Old userspace cannot correctly unmap images mapped under single-major scheme and would essentially just unmap a random image, if that.) $ rbd showmapped id pool image snap device 0 rbd b100 - /dev/rbd0 1 rbd b101 - /dev/rbd1 2 rbd b102 - /dev/rbd2 3 rbd b103 - /dev/rbd3 Old scheme (modprobe rbd): $ ls -l /dev/rbd* brw-rw---- 1 root disk 253, 0 Dec 10 12:24 /dev/rbd0 brw-rw---- 1 root disk 252, 0 Dec 10 12:28 /dev/rbd1 brw-rw---- 1 root disk 252, 1 Dec 10 12:28 /dev/rbd1p1 brw-rw---- 1 root disk 252, 2 Dec 10 12:28 /dev/rbd1p2 brw-rw---- 1 root disk 252, 3 Dec 10 12:28 /dev/rbd1p3 brw-rw---- 1 root disk 251, 0 Dec 10 12:28 /dev/rbd2 brw-rw---- 1 root disk 251, 1 Dec 10 12:28 /dev/rbd2p1 brw-rw---- 1 root disk 250, 0 Dec 10 12:24 /dev/rbd3 New scheme (modprobe rbd single_major=Y): $ ls -l /dev/rbd* brw-rw---- 1 root disk 253, 0 Dec 10 12:30 /dev/rbd0 brw-rw---- 1 root disk 253, 256 Dec 10 12:30 /dev/rbd1 brw-rw---- 1 root disk 253, 257 Dec 10 12:30 /dev/rbd1p1 brw-rw---- 1 root disk 253, 258 Dec 10 12:30 /dev/rbd1p2 brw-rw---- 1 root disk 253, 259 Dec 10 12:30 /dev/rbd1p3 brw-rw---- 1 root disk 253, 512 Dec 10 12:30 /dev/rbd2 brw-rw---- 1 root disk 253, 513 Dec 10 12:30 /dev/rbd2p1 brw-rw---- 1 root disk 253, 768 Dec 10 12:30 /dev/rbd3 (major 253 was assigned dynamically at module load time) The new limit is 4096 images mapped at once, and it comes from the fact that, as before, 256 minor numbers are reserved for each mapping. (A follow-up commit changes the number of minors reserved and the way we deal with partitions over that number.) If single_major is set to true, two new sysfs interfaces show up: /sys/bus/rbd/{add,remove}_single_major. These are to be used instead of /sys/bus/rbd/{add,remove}, which are disabled for backwards compatibility reasons outlined above. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:31:59 +02:00
Ilya Dryomov	92c76dc036	rbd: wire up is_visible() sysfs callback for rbd bus In preparation for single-major device number allocation scheme, wire up attribute_group::is_visible() callback for rbd bus. This allows us to make the new single-major attributes conditional. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:31:58 +02:00
Ilya Dryomov	dd82fff1e8	rbd: add 'minor' sysfs rbd device attribute Introduce /sys/bus/rbd/devices/<id>/minor sysfs attribute for exporting rbd whole disk minor numbers. This is a step towards single-major device number allocation scheme, but also a good thing on its own. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:31:57 +02:00
Ilya Dryomov	f8a22fc238	rbd: switch to ida for rbd id assignments Currently rbd ids are allocated using an atomic variable that keeps track of the highest id currently in use and each new id is simply one more than the value of that variable. That's nice and cheap, but it does mean that rbd ids are allowed to grow boundlessly, and, more importantly, it's completely unpredictable. So, in preparation for single-major device number allocation scheme, which is going to establish and rely on a constant mapping between rbd ids and device numbers, switch to ida for rbd id assignments. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:31:56 +02:00
Ilya Dryomov	e1b4d96dea	rbd: refactor rbd_init() a bit Refactor rbd_init() a bit to make it more clear what's going on. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:31:55 +02:00
Ilya Dryomov	90da258b88	rbd: tweak "loaded" message and module description Tweak "loaded" message, so that it looks like [ 30.184235] rbd: loaded instead of [ 38.056564] rbd: loaded rbd (rados block device) Also move (and slightly tweak) MODULE_DESCRIPTION so that all authors are next to each other in modinfo output. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:31:54 +02:00
Ilya Dryomov	70eebd2006	rbd: rbd_device::dev_id is an int, format it as such rbd_device::dev_id is an int, format it as such. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-12-31 20:31:53 +02:00
Kent Overstreet	5341a6278b	rbd: Refactor bio cloning Now that we've got drivers converted to the new immutable bvec primitives, bio splitting becomes much easier - this is how the new bio_split() will work. (Someone more familiar with the ceph code could probably use bio_clone_fast() instead of bio_clone() here). Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org	2013-11-23 22:33:54 -08:00
Kent Overstreet	7988613b0e	block: Convert bio_for_each_segment() to bvec_iter More prep work for immutable biovecs - with immutable bvecs drivers won't be able to use the biovec directly, they'll need to use helpers that take into account bio->bi_iter.bi_bvec_done. This updates callers for the new usage without changing the implementation yet. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: Jim Paris <jim@jtan.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: support@lsi.com Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Cc: linux-m68k@lists.linux-m68k.org Cc: linuxppc-dev@lists.ozlabs.org Cc: drbd-user@lists.linbit.com Cc: nbd-general@lists.sourceforge.net Cc: cbe-oss-dev@lists.ozlabs.org Cc: xen-devel@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: DL-MPTFusionLinux@lsi.com Cc: linux-scsi@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com Cc: linux-mm@kvack.org Acked-by: Geoff Levand <geoff@infradead.org>	2013-11-23 22:33:49 -08:00
Kent Overstreet	4f024f3797	block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6	2013-11-23 22:33:47 -08:00
Linus Torvalds	e9ff04dd94	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "These fix several bugs with RBD from 3.11 that didn't get tested in time for the merge window: some error handling, a use-after-free, and a sequencing issue when unmapping and image races with a notify operation. There is also a patch fixing a problem with the new ceph + fscache code that just went in" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: fscache: check consistency does not decrement refcount rbd: fix error handling from rbd_snap_name() rbd: ignore unmapped snapshots that no longer exist rbd: fix use-after free of rbd_dev->disk rbd: make rbd_obj_notify_ack() synchronous rbd: complete notifies before cleaning up osd_client and rbd_dev libceph: add function to ensure notifies are complete	2013-09-19 12:50:37 -05:00
Jingoo Han	bb8e0e84b3	block: replace strict_strtoul() with kstrtoul() The use of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-11 15:56:56 -07:00
Josh Durgin	da6a6b6397	rbd: fix error handling from rbd_snap_name() rbd_snap_name() calls rbd_dev_v{1,2}_snap_name() depending on the format of the image. The format 1 version returns NULL on error, which is handled by the caller. The format 2 version returns an ERR_PTR, which the caller of rbd_snap_name() does not expect. Fortunately this is unlikely to occur in practice because rbd_snap_id_by_name() is called before rbd_snap_name(). This would hit similar errors to rbd_snap_name() (like the snapshot not existing) and return early, so rbd_snap_name() would not hit an error unless the snapshot was removed between the two calls or memory was exhausted. Use an ERR_PTR in rbd_dev_v1_snap_name() so that the specific error can be propagated, and it is consistent with rbd_dev_v2_snap_name(). Handle the ERR_PTR in the only rbd_snap_name() caller. Suggested-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2013-09-09 11:16:44 -07:00
Josh Durgin	efadc98aab	rbd: ignore unmapped snapshots that no longer exist This prevents erroring out while adding a device when a snapshot unrelated to the current mapping is deleted between reading the snapshot context and reading the snapshot names. If the mapped snapshot name is not found an error still occurs as usual. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2013-09-09 11:16:32 -07:00
Josh Durgin	9875201e10	rbd: fix use-after free of rbd_dev->disk Removing a device deallocates the disk, unschedules the watch, and finally cleans up the rbd_dev structure. rbd_dev_refresh(), called from the watch callback, updates the disk size and rbd_dev structure. With no locking between them, rbd_dev_refresh() may use the device or rbd_dev after they've been freed. To fix this, check whether RBD_DEV_FLAG_REMOVING is set before updating the disk size in rbd_dev_refresh(). In order to prevent a race where rbd_dev_refresh() is already revalidating the disk when rbd_remove() is called, move the call to rbd_bus_del_dev() after the watch is unregistered and all notifies are complete. It's safe to defer deleting this structure because no new requests can be submitted once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be opened. Fixes: http://tracker.ceph.com/issues/5636 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2013-09-09 11:16:25 -07:00
Josh Durgin	20e0af67ce	rbd: make rbd_obj_notify_ack() synchronous The only user of rbd_obj_notify_ack() is rbd_watch_cb(). It used asynchronously with no tracking of when the notify ack completes, so it may still be in progress when the osd_client is shut down. This results in a BUG() since the osd client assumes no requests are in flight when it stops. Since all notifies are flushed before the osd_client is stopped, waiting for the notify ack to complete before returning from the watch callback ensures there are no notify acks in flight during shutdown. Rename rbd_obj_notify_ack() to rbd_obj_notify_ack_sync() to reflect its new synchronous nature. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2013-09-09 11:16:02 -07:00
Josh Durgin	9abc59908e	rbd: complete notifies before cleaning up osd_client and rbd_dev To ensure rbd_dev is not used after it's released, flush all pending notify callbacks before calling rbd_dev_image_release(). No new notifies can be added to the queue at this point because the watch has already be unregistered with the osd_client. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2013-09-09 11:15:57 -07:00
Linus Torvalds	6cccc7d301	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "This includes both the first pile of Ceph patches (which I sent to torvalds@vger, sigh) and a few new patches that add support for fscache for Ceph. That includes a few fscache core fixes that David Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski for putting this feature together) This first batch of patches (included here) had (has) several important RBD bug fixes, hole punch support, several different cleanups in the page cache interactions, improvements in the truncate code (new truncate mutex to avoid shenanigans with i_mutex), and a series of fixes in the synchronous striping read/write code. On top of that is a random collection of small fixes all across the tree (error code checks and error path cleanup, obsolete wq flags, etc)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits) ceph: use d_invalidate() to invalidate aliases ceph: remove ceph_lookup_inode() ceph: trivial buildbot warnings fix ceph: Do not do invalidate if the filesystem is mounted nofsc ceph: page still marked private_2 ceph: ceph_readpage_to_fscache didn't check if marked ceph: clean PgPrivate2 on returning from readpages ceph: use fscache as a local presisent cache fscache: Netfs function for cleanup post readpages FS-Cache: Fix heading in documentation CacheFiles: Implement interface to check cache consistency FS-Cache: Add interface to check consistency of a cached object rbd: fix null dereference in dout rbd: fix buffer size for writes to images with snapshots libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc rbd: fix I/O error propagation for reads ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem ceph: allow sync_read/write return partial successed size of read/write. ceph: fix bugs about handling short-read for sync read mode. ceph: remove useless variable revoked_rdcache ...	2013-09-09 09:13:22 -07:00
Josh Durgin	c35455791c	rbd: fix null dereference in dout The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but the dout() always derefences it. Move this to another dout() protected by a check that order is non-NULL. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org>	2013-09-03 22:08:51 -07:00
Josh Durgin	03507db631	rbd: fix buffer size for writes to images with snapshots rbd_osd_req_create() needs to know the snapshot context size to create a buffer large enough to send it with the message front. It gets this from the img_request, which was not set for the obj_request yet. This resulted in trying to write past the end of the front payload, hitting this BUG: libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len); Fix this by associating the obj_request with its img_request immediately after it's created, before the osd request is created. Fixes: http://tracker.ceph.com/issues/5760 Suggested-by: Alex Elder <alex.elder@linaro.org> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org>	2013-09-03 22:08:46 -07:00
Josh Durgin	17c1cc1d92	rbd: fix I/O error propagation for reads When a request returns an error, the driver needs to report the entire extent of the request as completed. Writes already did this, since they always set xferred = length, but reads were skipping that step if an error other than -ENOENT occurred. Instead, rbd would end up passing 0 xferred to blk_end_request(), which would always report needing more data. This resulted in an assert failing when more data was required by the block layer, but all the object requests were done: [ 1868.719077] rbd: obj_request read result -108 xferred 0 [ 1868.719077] [ 1868.719518] end_request: I/O error, dev rbd1, sector 0 [ 1868.719739] [ 1868.719739] Assertion failure in rbd_img_obj_callback() at line 1736: [ 1868.719739] [ 1868.719739] rbd_assert(more ^ (which == img_request->obj_request_count)); Without this assert, reads that hit errors would hang forever, since the block layer considered them incomplete. Fixes: http://tracker.ceph.com/issues/5647 CC: stable@vger.kernel.org # v3.10 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org>	2013-09-03 22:06:10 -07:00
Greg Kroah-Hartman	b15a21ddda	rbd: convert bus code to use bus_groups The bus_attrs field of struct bus_type is going away soon, dev_groups should be used instead. This converts the RBD bus code to use the correct field. Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Acked-by: Alex Elder <elder@linaro.org> Cc: <ceph-devel@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-08-27 22:07:49 -07:00
Jingoo Han	a158073c43	block: rbd: use NULL instead of 0 The local variables such as 'bio_list', and 'pages' are pointers; thus, use NULL instead of 0 to fix the following sparse warnings. drivers/block/rbd.c:2166:32: warning: Using plain integer as NULL pointer drivers/block/rbd.c:2168:31: warning: Using plain integer as NULL pointer Signed-off-by: Jingoo Han <jg1.han@samsung.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-08-09 17:55:40 -07:00
Sage Weil	e976cad0f0	rbd: fix a couple warnings gcc isn't quite smart enough and generates these warnings: drivers/block/rbd.c: In function 'rbd_img_request_fill': drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized] drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized] even though they are initialized for their respective code paths. Signed-off-by: Sage Weil <sage@inktank.com>	2013-07-03 15:32:50 -07:00
Alex Elder	d552c6191b	rbd: take a little credit Add a name to the list of authors. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:44 -07:00
Alex Elder	cfbf6377b6	rbd: use rwsem to protect header updates Updating an image header needs to be protected to ensure it's done consistently. However distinct headers can be updated concurrently without a problem. Instead of using the global control lock to serialize headder updates, just rely on the header semaphore. (It's already used, this just moves it out to cover a broader section of the code.) That leaves the control mutex protecting only the creation of rbd clients, so rename it. This resolves: http://tracker.ceph.com/issues/5222 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:43 -07:00
Alex Elder	1ba0f1e797	rbd: don't hold ctl_mutex to get/put device When an rbd device is first getting mapped, its device registration is protected the control mutex. There is no need to do that though, because the device has already been assigned an id that's guaranteed to be unique. An unmap of an rbd device won't proceed if the device has a non-zero open count or is already being unmapped. So there's no need to hold the control mutex in that case either. Finally, an rbd device can't be opened if it is being removed, and it won't go away if there is a non-zero open count. So here too there's no need to hold the control mutex while getting or putting a reference to an rbd device's Linux device structure. Drop the mutex calls in these cases. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:42 -07:00
Alex Elder	82a442d239	rbd: protect against concurrent unmaps Make sure two concurrent unmap operations on the same rbd device won't collide, by only proceeding with the removal and cleanup of a device if is not already underway. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:41 -07:00
Alex Elder	751cc0e3cf	rbd: set removing flag while holding list lock When unmapping a device, its id is supplied, and that is used to look up which rbd device should be unmapped. Looking up the device involves searching the rbd device list while holding a spinlock that protects access to that list. Currently all of this is done under protection of the control lock, but that protection is going away soon. To ensure the rbd_dev is still valid (still on the list) while setting its REMOVING flag, do so while still holding the list lock. To do so, get rid of __rbd_get_dev(), and open code what it did in the one place it was used. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:41 -07:00
Alex Elder	08f75463c1	rbd: protect against duplicate client creation If more than one rbd image has the same ceph cluster configuration (same options, same set of monitors, same keys) they normally share a single rbd client. When an image is getting mapped, rbd looks to see if an existing client can be used, and creates a new one if not. The lookup and creation are not done under a common lock though, so mapping two images concurrently could lead to duplicate clients getting set up needlessly. This isn't a major problem, but it's wasteful and different from what's intended. This patch fixes that by using the control mutex to protect both the lookup and (if needed) creation of the client. It was previously used just when creating. This resolves: http://tracker.ceph.com/issues/3094 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:39 -07:00
Alex Elder	3b5cf2a2f1	rbd: clean up a few things in the refresh path This includes a few relatively small fixes I found while examining the code that refreshes image information. This resolves: http://tracker.ceph.com/issues/5040 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:38 -07:00
Alex Elder	e215605417	rbd: flush dcache after zeroing page data Neither zero_bio_chain() nor zero_pages() contains a call to flush caches after zeroing a portion of a page. This can cause problems on architectures that have caches that allow virtual address aliasing. This resolves: http://tracker.ceph.com/issues/4777 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-03 15:32:37 -07:00
Alex Elder	912c317d46	rbd: drop original request earlier for existence check The reference to the original request dropped at the end of rbd_img_obj_exists_callback() corresponds to the reference taken in rbd_img_obj_exists_submit() to account for the stat request referring to it. Move the put of that reference up right after clearing that pointer to make its purpose more obvious. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-07-01 09:52:02 -07:00
Geert Uytterhoeven	491205a8b4	rbd: Use min_t() to fix comparison of distinct pointer types warning drivers/block/rbd.c: In function ‘zero_pages’: drivers/block/rbd.c:1102: warning: comparison of distinct pointer types lacks a cast Remove the hackish casts and use min_t() to fix this. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Alex Elder <elder@inktank.com>	2013-07-01 09:52:01 -07:00
Linus Torvalds	bd2931b5cf	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fix from Sage Weil: "This is a recently spotted regression in the snapshot behavior... It turns out several tests weren't being run in the nightlies so this took a while to spot" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: send snapshot context with writes	2013-06-29 10:31:15 -07:00
Josh Durgin	d2d1f17a0d	rbd: send snapshot context with writes Sending the right snapshot context with each write is required for snapshots to work. Due to the ordering of calls, the snapshot context is never set for any requests. This causes writes to the current version of the image to be reflected in all snapshots, which are supposed to be read-only. This happens because rbd_osd_req_format_write() sets the snapshot context based on obj_request->img_request. At this point, however, obj_request->img_request has not been set yet, to the snapshot context is set to NULL. Fix this by moving rbd_img_obj_request_add(), which sets obj_request->img_request, before the osd request formatting calls. This resolves: http://tracker.ceph.com/issues/5465 Reported-by: Karol Jurak <karol.jurak@gmail.com> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2013-06-27 05:55:29 -07:00
Linus Torvalds	78750f1908	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fix from Sage Weil: "This fixes another problem with using v2 images on 3.10 due to the order in which fields are read from the image header. Hopefully this is the last one" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fetch object order before using it	2013-06-26 08:47:46 -10:00
Josh Durgin	1617e40c1e	rbd: fetch object order before using it rbd_dev_v2_header_onetime() fetches striping information, and checks whether the image can be read by compariing the stripe unit to the object size. It determines the object size by shifting the object order, which is 0 at this point since it has not been read yet. Move the call to get the image size and object order before rbd_dev_v2_header_onetime() so it is set before use. Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-06-25 12:27:31 -07:00
Linus Torvalds	7ecba6f2f3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fix from Sage Weil: "This fixes a problem preventing the kernel and userland librbd libraries from sharing data with the new format 2 images" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: use the correct length for format 2 object names	2013-06-21 06:27:40 -10:00
Josh Durgin	3a96d5cd7b	rbd: use the correct length for format 2 object names Format 2 objects use 16 characters for the object name suffix to be able to express the full 64-bit range of object numbers. Format 1 images only use 12 characters for this. Using 12-character names for format 2 caused userspace and kernel rbd clients to read differently named objects, which made an image written by one client look empty to the other client. CC: stable@vger.kernel.org # 3.9+ Reported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2013-06-13 08:46:15 -07:00
Linus Torvalds	8d7a8fe2ce	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph fixes from Sage Weil: "There is a pair of fixes for double-frees in the recent bundle for 3.10, a couple of fixes for long-standing bugs (sleep while atomic and an endianness fix), and a locking fix that can be triggered when osds are going down" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: fix cleanup in rbd_add() rbd: don't destroy ceph_opts in rbd_add() ceph: ceph_pagelist_append might sleep while atomic ceph: add cpu_to_le32() calls when encoding a reconnect capability libceph: must hold mutex for reset_changed_osds()	2013-06-12 08:28:19 -07:00
Alex Elder	3abef3b358	rbd: fix cleanup in rbd_add() Bjorn Helgaas pointed out that a recent commit introduced a use-after-free condition in an error path for rbd_add(). He correctly stated: I think `b536f69a3a` "rbd: set up devices only for mapped images" introduced a use-after-free error in rbd_add(): ... If rbd_dev_device_setup() returns an error, we call rbd_dev_image_release(), which ultimately kfrees rbd_dev. Then we call rbd_dev_destroy(), which references fields in the already-freed rbd_dev struct before kfreeing it again. The simple fix is to return the error code after the call to rbd_dev_image_release(). Closer examination revealed that there's no need to clean up rbd_opts in that function, so fix that too. Update some other comments that have also become out of date. Reported-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-17 12:50:10 -05:00
Alex Elder	7262cfca43	rbd: don't destroy ceph_opts in rbd_add() Whether rbd_client_create() successfully creates a new client or not, it takes responsibility for getting the ceph_opts structure it's passed destroyed. If successful, the structure becomes associated with the created client; if not, rbd_client_create() will destroy it. Previously, rbd_get_client() would call ceph_destroy_options() if rbd_get_client() failed, and that meant it got called twice. That led freeing various pointers more than once, which is never a good idea. This resolves: http://tracker.ceph.com/issues/4559 Cc: stable@vger.kernel.org # 3.8+ Reported-by: Dan van der Ster <dan@vanderster.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-17 12:50:03 -05:00
Linus Torvalds	109c3c0292	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes from Sage Weil: "Yes, this is a much larger pull than I would like after -rc1. There are a few things included: - a few fixes for leaks and incorrect assertions - a few patches fixing behavior when mapped images are resized - handling for cloned/layered images that are flattened out from underneath the client The last bit was non-trivial, and there is some code movement and associated cleanup mixed in. This was ready and was meant to go in last week but I missed the boat on Friday. My only excuse is that I was waiting for an all clear from the testing and there were many other shiny things to distract me. Strictly speaking, handling the flatten case isn't a regression and could wait, so if you like we can try to pull the series apart, but Alex and I would much prefer to have it all in as it is a case real users will hit with 3.10." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (33 commits) rbd: re-submit flattened write request (part 2) rbd: re-submit write request for flattened clone rbd: re-submit read request for flattened clone rbd: detect when clone image is flattened rbd: reference count parent requests rbd: define parent image request routines rbd: define rbd_dev_unparent() rbd: don't release write request until necessary rbd: get parent info on refresh rbd: ignore zero-overlap parent rbd: support reading parent page data for writes rbd: fix parent request size assumption libceph: init sent and completed when starting rbd: kill rbd_img_request_get() rbd: only set up watch for mapped images rbd: set mapping read-only flag in rbd_add() rbd: support reading parent page data rbd: fix an incorrect assertion condition rbd: define rbd_dev_v2_header_info() rbd: get rid of trivial v1 header wrappers ...	2013-05-15 13:36:19 -07:00
Alex Elder	638f5abed3	rbd: re-submit flattened write request (part 2) Add code to rbd_img_obj_exists_callback() to detect when a clone's parent image has disappeared, and re-submit the original write request in that case. Kill off some redundant assertions. This completes the resolution for: http://tracker.ceph.com/issues/3763 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:46 -05:00
Alex Elder	bbea1c1a31	rbd: re-submit write request for flattened clone Add code to rbd_img_parent_read_full_callback() to detect when a clone's parent image has disappeared, and re-submit the original write request in that case. (See the previous commit for more reasoning about why this is appropriate.) Rename some variables in rbd_img_obj_parent_read_full_callback() to match the convention used in the previous patch. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:45 -05:00
Alex Elder	02c74fbad9	rbd: re-submit read request for flattened clone If a clone image gets flattened while a parent read request is underway, the original rbd object request needs to be resubmitted. The reason is that by the time we get the response to the parent read request, the data read from the parent may be out of date. In other words, we could see this sequence of events: rbd client parent image/osd ---------- ---------------- original object ENOENT; issue parent read respond to parent read child image flattened original image header refresh <--- original object written independently here parent read response received Add code to rbd_img_parent_read_callback() to detect when a clone's parent image has disappeared (as evidenced by its parent overlap becoming 0), and re-submit the original read request in that case. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:45 -05:00
Alex Elder	392a9dad7e	rbd: detect when clone image is flattened A format 2 clone image can be the subject of a "flatten" operation, during which all of its data gets "copied up" from its parent image, leaving the image fully populated. Once this is complete, the clone's association with the parent is abolished. Since this can occur when a clone is mapped, we need to detect when it has occurred and handle it accordingly. We know an image has been flattened when we know it at one time had a parent, but we have learned (via a "get_parent" object class method call) it no longer has one. There might be in-flight requests at the point we learn an image has been flattened, so we can't simply clean up parent data structures right away. Instead, we'll drop the initial parent reference when the parent has disappeared (rather than when the image gets destroyed), which will allow the last in-flight reference to clean things up when it's complete. We leverage the fact that a zero parent overlap renders an image effectively unlayered. We set the overlap to 0 at the point we detect the clone image has flattened, which allows the unlayered behavior to take effect immediately, while keeping other parent structures in place until in-flight requests to complete. This and the next few patches resolve: http://tracker.ceph.com/issues/3763 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:45 -05:00
Alex Elder	a2acd00e79	rbd: reference count parent requests Keep a reference count for uses of the parent information for an rbd device. An initial reference is set in rbd_img_request_create() if the target image has a parent (with non-zero overlap). Each image request for an image with a non-zero parent overlap gets another reference when it's created, and that reference is dropped when the request is destroyed. The initial reference is dropped when the image gets torn down. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:44 -05:00
Alex Elder	e93f315235	rbd: define parent image request routines Define rbd_parent_request_create() and rbd_parent_request_destroy() to handle the creation of parent image requests submitted for layered image objects. For simplicity, let rbd_img_request_put() handle dropping the reference to any image request (parent or not), and call whichever destructor is appropriate on the last put. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:44 -05:00
Alex Elder	fb65d2284c	rbd: define rbd_dev_unparent() Define rbd_dev_unparent() to encapsulate cleaning up parent data structures from a layered rbd image. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:43 -05:00
Alex Elder	8785b1d487	rbd: don't release write request until necessary Previously when a layered write was going to involve a copyup request, the original osd request was released before submitting the parent full-object read. The osd request for the copyup would then be allocated in rbd_img_obj_parent_read_full_callback(). Shortly we will be handling the event of mapped layered images getting flattened, and when that occurs we need to resubmit the original request. We therefore don't want to release the osd request until we really konw we're going to replace it--in the callback function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:43 -05:00
Alex Elder	642a25375f	rbd: get parent info on refresh Get parent info for format 2 images on every refresh (rather than just during the initial probe). This will be needed to detect the disappearance of the parent image in the event a mapped image becomes unlayered (i.e., flattened). Avoid leaking the previous parent spec on the second and subsequent times this information is requested by dropping the previous one (if any) before updating it. (Also, extract the pool id into a local variable before assigning it into the parent spec.) Switch to using a non-zero parent overlap value rather than the existence of a parent (a non-null parent_spec pointer) to determine whether to mark a request layered. It will soon be possible for a layered image to become unlayered while a request is in flight. This means that the layered flag for an image request indicates that there was a non-zero parent overlap at the time the image request was created. The parent overlap can change thereafter, which may lead to special handling at request submission or completion time. This and the next several patches are related to: http://tracker.ceph.com/issues/3763 NOTE: If an error occurs while refreshing the parent info (i.e., requesting it after initial probe), the old parent info will persist. This is not really correct, and is a scenario that needs to be addressed. For now we'll assert that the failure mode is unlikely, but the issue has been documented in tracker issue 5040. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 15:06:33 -05:00
Alex Elder	70cf49cfc7	rbd: ignore zero-overlap parent An rbd clone image that has an overlap with its parent of 0 is effectively not a layered image at all. Detect this case and treat such an image as non-layered. Issue a warning to be sure the user knows what's going on. This resolves: http://tracker.ceph.com/issues/5028 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 14:12:41 -05:00
Alex Elder	b91f09f17b	rbd: support reading parent page data for writes Currently, rbd_img_obj_parent_read_full() assumes the incoming object request contains bio data. But if a layered image is part of a multi-layer stack of images it will result in read requests of page data to parent images. This is handling the same kind of issue as was resolved by this commit: `5b2ab72d` rbd: support reading parent page data This resolves: http://tracker.ceph.com/issues/5027 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 14:12:40 -05:00
Alex Elder	ebda6408f2	rbd: fix parent request size assumption The code that reads object data from the parent for a copyup on write request currently assumes that the size of that request is the size of a "full" object from the original target image. That is not necessarily the case. The parent overlap could reduce the request size below that. To fix that assumption we need to record the number of pages in the copyup_pages array, for both an image request and an object request. Rename a local variable in rbd_img_obj_parent_read_full_callback() to reflect we're recording the length of the parent read request, not the size of the target object. This resolves: http://tracker.ceph.com/issues/5038 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-13 14:09:01 -05:00
Alex Elder	c48f3f86e2	rbd: kill rbd_img_request_get() Get rid of rbd_img_request_get(), because it isn't used, and maybe won't ever be needed. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 20:17:00 -05:00
Alex Elder	1f3ef78861	rbd: only set up watch for mapped images Any changes to parent images are immaterial to any mapped clone. So there is no need to have a watch event registered on header objects except for the header object of an image that is mapped. In fact, a watch request is a write operation, and we may only have read access to a parent image. We can't set up the watch request until we know the name of the header object though. So pass a flag to rbd_dev_image_probe() to indicate whether this probe is for a mapping or for a parent image. Change the second parameter to rbd_dev_header_watch_sync() be Boolean while we're at it. This resolves: http://tracker.ceph.com/issues/4941 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 20:16:55 -05:00
Alex Elder	7ce4eef7b5	rbd: set mapping read-only flag in rbd_add() The rbd_dev->mapping field for a parent image is not meaningful. Since rbd_image_probe() is used both for images being mapped and their parents, it doesn't make sense to set that flag in that function. So move the setting of the mapping.read_only flag out of rbd_dev_image_probe() and into rbd_add() instead. This resolves: http://tracker.ceph.com/issues/4940 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 20:16:50 -05:00
Alex Elder	5b2ab72d36	rbd: support reading parent page data Currently, rbd_img_parent_read() assumes the incoming object request contains bio data. But if a layered image is part of a multi-layer stack of images it will result in read requests of page data to parent images. Fortunately, it's not hard to add support for page data. This resolves: http://tracker.ceph.com/issues/4939 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 20:16:25 -05:00
Alex Elder	91c6febb38	rbd: fix an incorrect assertion condition In rbd_img_obj_parent_read_full_callback() there is an assertion intended to verify the size of the image request for a full parent read was the size of the original request's target object. But assertion was looking at the parent image order rather than the original one, and these values can differ. Fix that. This resolves: http://tracker.ceph.com/issues/4938 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 20:16:10 -05:00
Alex Elder	2df3fac758	rbd: define rbd_dev_v2_header_info() This rearranges rbd_dev_v2_refresh() so it works more like rbd_dev_v1_header_info(). While format 1 images need to read the whole header object to get any information, format 2 can collect almost all information selectively. So the one-time initialization will remain in a separate function--based on rbd_dev_v2_probe(). Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have it call rbd_dev_v2_header_onetime() if it's being called for the first time for the given rbd device. Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and remove the image size and snapshot context calls it held in common with the refresh function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 17:00:52 -05:00
Alex Elder	99a41ebcee	rbd: get rid of trivial v1 header wrappers Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls in their place. Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to be more generic (it will better reflect what happens with format 2 images). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 17:00:46 -05:00
Alex Elder	30d60ba2f2	rbd: simplify rbd_dev_v1_probe() An rbd_dev structure's fields are all zero-filled for an initial probe, so there's no need to explicitly zero the parent_spec and parent_overlap fields in rbd_dev_v1_probe(). Removing these assignments makes rbd_dev_v1_probe() almost trivial. Move the dout() message that announces discovery of an image into rbd_dev_image_probe(), generalize to support images in either format and only show it if an image is fully discovered. This highlights that are some unnecessary cleanups in the error path for rbd_dev_v1_probe(), so they can be removed. Now rbd_dev_v1_probe() is a trivial wrapper function. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 17:00:41 -05:00
Alex Elder	662518b128	rbd: update in-core header directly Now that rbd_header_from_disk() only fills in one-time fields once, we can extend it slightly so it releases the other fields before replacing their values. This way there's no need to pass a temporary buffer and then copy all the results in. Just use the rbd device header structure in rbd_header_from_disk() so its values get updated directly. Note that this means we need to take the header semaphore at the point we update things. So pass the rbd_dev rather than the address of its header as its first argument to rbd_header_from_disk(), and have it return an error code. As a result, rbd_dev_v1_header_read() does all the work, rbd_read_header() becomes unnecessary, and rbd_dev_v1_refresh() becomes a very simple wrapper. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 17:00:37 -05:00
Alex Elder	bb23e37acb	rbd: refactor rbd_header_from_disk() This rearranges rbd_header_from_disk so that it: - allocates the snapshot context right away - keeps results in local variables, not changing the passed-in header until it's known we'll succeed - does initialization of set-once fields in a header only if they have not already been set The last point is moot at the moment, because rbd_read_header() (the only caller) always supplies a zero-filled header buffer. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 17:00:33 -05:00
Alex Elder	46578dcdca	rbd: zero format 1 header structure earlier The passed-in header structure is zeroed in rbd_header_from_disk(). Instead, have the caller do it. Note that there are two callers, rbd_dev_v1_refresh() and rbd_dev_v1_probe(). The latter already has a zeroed header structure so zeroing it isn't necessary there. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 17:00:28 -05:00
Alex Elder	f35a4dee14	rbd: set the mapping size and features later Defer setting the size and features fields of a mapped image until after the Linux disk structure is set up. Set the capacity of the disk after that. Rearrange the definition of rbd_image_header, separating the fields that are set only once from those that can be updated. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 17:00:00 -05:00
Linus Torvalds	4de13d7aa8	Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...	2013-05-08 10:13:35 -07:00
Alex Elder	51344a38ba	rbd: always set read-only flag in rbd_add() Hold off setting the read-only flag in rbd_add() for an image being mapped until we have successfully probed the image. At that point we know whether it's a snapshot mapping or not, so we can set the read-only flag in that one place rather than doing so (for snapshots) in rbd_dev_mapping_set(). To do this, pass a flag to the image probe routine indicating whether we want a read-only mapping. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:48:12 -05:00
Alex Elder	6d80b130d5	rbd: kill rbd_dev_clear_mapping() This function is a duplicate of rbd_dev_mapping_clear(), and was added by mistake. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:48:12 -05:00
Alex Elder	8f4b7d9821	rbd: don't look up snapshot id in rbd_dev_mapping_set() Currently rbd_dev_mapping_set() looks up the snapshot id for the snapshot whose name is found in the rbd device's spec structure. That function gets called by rbd_dev_device_setup(), which is called by rbd_add() after rbd_dev_image_probe(). If the image probe succeeds, the rbd device's spec will already have been updated to include names and ids for all fields. Therefore there's no need to look up the snapshot id in rbd_dev_mapping_set(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:48:11 -05:00
Alex Elder	c734b79655	rbd: don't print warning if not mapping a parent The presence of the LAYERING bit in an rbd image's feature mask does not guarantee the image actually has a parent image. Currently that bit is set only when a clone (i.e., image with a parent) is created, but it is (currently) not cleared if that clone gets flattened back into a "normal" image. A "parent_id" query will leave the parent_spec for the image being mapped a null pointer, but will not return an error. Currently, whenever an image with the LAYERED feature gets mapped, a warning about the use of layered images gets printed. But we don't want to do this for a flattened image, so print the warning only if we find there is a parent spec after the probe. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:48:11 -05:00
Alex Elder	29334ba49c	rbd: kill rbd_update_mapping_size() Since rbd_update_mapping_size() is now a trivial wrapper, just open code it in its two callers. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:45:39 -05:00
Alex Elder	00a653e216	rbd: update capacity in rbd_dev_refresh() When a mapped image changes size, we change the capacity recorded for the Linux disk associated with it, in rbd_update_mapping_size(). That function is called in two places--the format 1 and format 2 refresh routines. There is no need to set the capacity while holding the header semaphore. Instead, do it in the common rbd_dev_refresh(), using the logic that's already there to initiate disk revalidation. Add handling in the request function, just in case a request that exceeds the capacity of the device comes in (perhaps one that was started before a refresh shrunk the device). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:45:30 -05:00
Alex Elder	e627db085e	rbd: revalidate only for mapping size changes This commit: `d98df63e` rbd: revalidate_disk upon rbd resize instituted a call to revalidate_disk() to notify interested parties that a mapped image has changed size. This works well, as long as the the rbd device doesn't map a snapshot. A snapshot will never change size. However, the base image the snapshot is associated with can, and it can do so while the snapshot is mapped. The problem is that the test for the size is looking at the size of the base image, not the size of the mapped snapshot. This patch corrects that. Update the warning message shown in the event of error, and move it into the callers. This resolves: http://tracker.ceph.com/issues/4911 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:40:48 -05:00
Alex Elder	49ece55428	rbd: fix leak of format 2 snapshot context When rbd_dev_v2_refresh() is called, the rbd device already has a snapshot context associated with it. But that never gets freed, the pointer just gets overwritten. Fix this by dropping the rbd device's reference to the snapshot context before overwriting the pointer. Because ceph_put_snap_context() already handles for a null pointer we don't need to check for that (for the probe case, where no context has yet been assigned). This resolves: http://tracker.ceph.com/issues/4912 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-08 07:38:30 -05:00
Linus Torvalds	292088ee03	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "A couple of fixes + getting rid of __blkdev_put() return value" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: proc: Use PDE attribute setting accessor functions make blkdev_put() return void block_device_operations->release() should return void mtd_blktrans_ops->release() should return void hfs: SMP race on directory close()	2013-05-07 15:14:53 -07:00
Al Viro	db2a144bed	block_device_operations->release() should return void The value passed is 0 in all but "it can never happen" cases (and those only in a couple of drivers) and it would've been lost on the way out anyway, even if something tried to pass something meaningful. Just don't bother. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-07 02:16:21 -04:00
Alex Elder	b5b09be30c	rbd: fix image request leak on parent read When a read for a layered image object finds the target object doesn't exist, a read image request for the parent image is created and submitted. When that completes, the callback routine was not releasing that parent image request. Fix that. The slab allocation stuff just added has greatly simplified the search for the source of this memory leak. This resolves: http://tracker.ceph.com/issues/4803 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 12:15:28 -05:00
Alex Elder	78c2a44aae	rbd: allocate image object names with a slab allocator The names of objects used for image object requests are always fixed size. So create a slab cache to manage them. Define a new function rbd_segment_name_free() to match rbd_segment_name() (which is what supplies the dynamically-allocated name buffer). This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:30 -05:00
Alex Elder	868311b1eb	rbd: allocate object requests with a slab allocator Create a slab cache to manage rbd_obj_request allocation. We aren't using a constructor, and we'll zero-fill object request structures when they're allocated. This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:30 -05:00
Alex Elder	f907ad5596	rbd: allocate name separate from obj_request The next patch will define a slab allocator for a object requests. To use that we'll need to allocate the name of an object separate from the request structure itself. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:29 -05:00
Alex Elder	1c2a9dfe21	rbd: allocate image requests with a slab allocator Create a slab cache to manage rbd_img_request allocation. Nothing too fancy at this point--we'll still initialize everything at allocation time (no constructor) This is part of: http://tracker.ceph.com/issues/3926 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:29 -05:00
Alex Elder	30d1cff817	rbd: use binary search for snapshot lookup Use bsearch(3) to make snapshot lookup by id more efficient. (There could be thousands of snapshots, and conceivably many more.) Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:58:17 -05:00
Alex Elder	15228ede7d	rbd: clear EXISTS flag if mapped snapshot disappears This functionality inadvertently disappeared in the last patch. Image snapshots can get removed at just about any time. In particular it can disappear even if it is in use by an rbd client as a mapped image. The rbd client deals with such a disappearance by responding to new requests with ENXIO. This is implemented by each rbd device maintaining an EXISTS flag, which is normally set but cleared if a snapshot disappears. This patch (re-)implements the clearing of that flag. Whenever mapped image header information is refreshed, if the mapping is for a snapshot, verify the mapped snapshot is still present in the updated snapshot context. If it is not, clear the flag. It is not necessary to check this in the initial probe, because the probe will not succeed if the snapshot doesn't exist. This resolves: http://tracker.ceph.com/issues/4880 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-02 11:57:03 -05:00
Alex Elder	33dca39f5c	rbd: kill off the snapshot list We no longer use the snapshot list for anything. When we need to look up a snapshot name, id, size, or feature mask, we just do it directly rather than relying on this list being updated with every refresh. The main reason it existed was for the benefit of the device/sysfs entries that previously were associated with snapshots. So get rid of the snapshot list, and struct rbd_snap, and the hundreds of lines of code that supported them. This resolves: http://tracker.ceph.com/issues/4868 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:22 -07:00
Alex Elder	2ad3d7167e	rbd: define rbd_snap_size() and rbd_snap_features() This patch defines a handful of new functions that will allow us to get rid of the rbd device structure's list of snapshots. Define rbd_snap_id_by_name() to look up a snapshot id given its name. This is efficient for format 1 images but not for format 2. Fortunately it only gets called at mapping time so it's not that critical. Use rbd_snap_id_by_name() to find out the id for a snapshot getting mapped, and pass that id to new functions rbd_snap_size() and rbd_snap_features() to look up information about a given snapshot's size and feature mask given its snapshot id. All this gets done in rbd_dev_mapping_set(). As a result, snap_by_name() is no longer needed, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:20 -07:00
Alex Elder	54cac61fb6	rbd: use snap_id not index to look up snap info In order to align with what was needed for format 1 rbd images, rbd_dev_v2_snap_info() was set up to take as argument an index into the array of snapshot ids in a rbd device's snapshot context. This switches that around, so we pass the snapshot id instead. In doing this, rbd_snap_name() now returns a dynamically-allocated string rather than a fixed one, so there's no need to make a duplicate in its caller, rbd_dev_spec_update(). This means the following functions take a snapshot id where they previously used an index value: rbd_dev_snap_info() rbd_dev_v1_snap_info() rbd_dev_v2_snap_info() A new function, rbd_dev_snap_index(), determines the snap index for format 1 images and uses it to look up the name. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:19 -07:00
Alex Elder	9682fc6d3a	rbd: look up snapshot name in names buffer Rather than scanning the list of snapshot structures for it, scan the snapshot context buffer containing snapshot names in order to determine for a format 1 image the name associated with a given snapshot id. Pull out the part of rbd_dev_v1_snap_info() that does this scan into a new function, _rbd_dev_v1_snap_name(). Have that function return a dynamically-allocated copy of the name, and don't duplicate it in rbd_dev_v1_snap_info(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:18 -07:00
Alex Elder	dedc81ea84	rbd: drop obj_request->version Nothing ever uses the version field maintained in the object request structure any more, so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:17 -07:00
Alex Elder	e2a58ee55b	rbd: drop rbd_obj_method_sync() version parameter Only NULL is passed as the version argument to rbd_obj_method_sync(), so get rid of it. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:16 -07:00
Alex Elder	cc4a38bdd5	rbd: more version parameter removal Continued from the last patch, more parameters that can go away because we no longer have a need to track object versions. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:15 -07:00
Alex Elder	7097f8df6e	rbd: get rid of some version parameters Several functions in rbd have parameters meant to allow the version of an object to be passed in or out. The purpose of those was to allow the version of a header object to be maintained, but we no longer do that. As a result, these parameters are never actually needed or used, so get rid of them. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:14 -07:00
Alex Elder	b21ebdddeb	rbd: stop tracking header object version The rbd code takes care to maintain the version of the header object. This was done in hopes of using it to detect a change in the object between reading it and setting up a watch request to be notified of changes. The mechanism was never fully implemented, however. And we now avoid the original problem by setting up the watch request before ever reading the content of the header. The osd doesn't interpret the object version supplied with a WATCH osd op, nor does it use the version supplied with a NOTIFY_ACK op (we can just supply 0 for both). There is therefore no need to maintain the header's object version any more, so stop doing so. We'll be able to simplify some more rbd code in the next few patches as a result of this. This resolves: http://tracker.ceph.com/issues/3952 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:13 -07:00
Alex Elder	cb75223d2b	rbd: snap names are pointer to constant data Make explicit that snapshot names don't change by making functions return and take parameters that that point to const qualified data. This resolves: http://tracker.ceph.com/issues/4867 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:12 -07:00
Alex Elder	a3fbe5d447	rbd: don't revalidate so much Whenever a header object event causes a mapped rbd image to refresh its header information, revalidate_disk() is being called. This was done in rbd_dev_refresh() outside the control mutex in order to avoid a lock inversion. Although a an event like this might indicate the image has changed size, most of the time it does not. Record the image size before and after the refresh, and only call revalidate_disk() if it changes. This resolves: http://tracker.ceph.com/issues/4867 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2013-05-01 21:20:11 -07:00

... 2 3 4 5 6 ...

737 Commits