CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This
sneaked in with discard patches - it's one of the three osd ops (the
other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard
is implemented with.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
When we fail to allocate page vector in rbd_obj_read_sync() we just
basically ignore the problem and continue which will result in an oops
later. Fix the problem by returning proper error.
CC: Yehuda Sadeh <yehuda@inktank.com>
CC: Sage Weil <sage@inktank.com>
CC: ceph-devel@vger.kernel.org
CC: stable@vger.kernel.org
Coverity-id: 1226882
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Using one queue per device doesn't make much sense given that our
workfn processes "devices" and not "requests". Switch to a single
workqueue for all devices.
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups
under memory pressure - we sit on the memory reclaim path.
Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>
max_discard_sectors must be set for the queue to support discard.
Operations implementing discard for rbd zero data, so report that.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Only allocate two osd ops for discard requests, since the
preallocation hint is only added for regular writes. Use
rbd_img_obj_request_fill() to recreate the original write or discard
osd operations, isolating that logic to one place, and change the
assert in rbd_osd_req_create_copyup() to accept discard requests as
well.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
rbd_img_request_fill() creates a ceph_osd_request and has logic for
adding the appropriate osd ops to it based on the request type and
image properties.
For layered images, the original rbd_obj_request is resent with a
copyup operation in front, using a new ceph_osd_request. The logic for
adding the original operations should be the same as when first
sending them, so move it to a helper function.
op_type only needs to be checked once, so create a helper for that as
well and call it outside the loop in rbd_img_request_fill().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discard requests are a form of write, so they should go through the
same process as plain write requests and trigger copy-on-write for
layered images.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discard may try to delete an object from a non-layered image that does not exist.
If this occurs, the image already has no data in that range, so change the
result to success.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Discards take a reference to the snapshot context of an image when
they are created. This reference needs to be cleaned up when the
request is done just as it is for regular writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_img_request_fill() the image size is only checked to determine
whether we can truncate an object instead of zeroing it for discard
requests. Take rbd_dev->header_rwsem while reading the image size, and
move this read into the discard check, so that non-discard ops don't
need to take the semaphore in this function.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
This patch add the discard support for rbd driver.
There are three types operation in the driver:
1. The objects would be removed if they completely contained
within the discard range.
2. The objects would be truncated if they partly contained within
the discard range, and align with their boundary.
3. Others would be zeroed.
A discard request from blkdev_issue_discard() is defined which
REQ_WRITE and REQ_DISCARD both marked and no data, so we must
check the REQ_DISCARD first when getting the request type.
This resolve:
http://tracker.ceph.com/issues/190
[ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up
commits by Josh Durgin for refinements and fixes which weren't
folded in to preserve authorship. ]
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It could only handle the read and write operations now,
extend it for the coming discard support.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
It need to copyup the parent's content when layered writing,
but an entire object write would overwrite it, so skip it.
Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
These fields may both change while the image is mapped if a snapshot
is created or deleted or the image is resized. They are guarded by
rbd_dev->header_rwsem, so hold that while reading them, and store a
local copy to refer to outside of the critical section. The local copy
will stay consistent since the snapshot context is reference counted,
and the mapping size is just a u64. This prevents torn loads from
giving us inconsistent values.
Move reading header.snapc into the caller of rbd_img_request_create()
so that we only need to take the semaphore once. The read-only caller,
rbd_parent_request_create() can just pass NULL for snapc, since the
snapshot context is only relevant for writes.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors. (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Fix to return -ENOMEM from the workqueue alloc error handling
case instead of 0, as done elsewhere in this function.
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
drivers/block/rbd.c: In function ‘rbd_dev_device_setup’:
drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security]
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Now that rbd_img_request_create() is called from work functions, no
need to use GFP_ATOMIC.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
While it was never a good idea to sleep in request_fn(), commit
34c6bc2c91 ("locking/mutexes: Add extra reschedule point") made it
a *bad* idea. mutex_lock() since 3.15 may reschedule *before* putting
task on the mutex wait queue, which for tasks in !TASK_RUNNING state
means block forever. request_fn() may be called with !TASK_RUNNING on
the way to schedule() in io_schedule().
Offload request handling to a workqueue, one per rbd device, to avoid
calling blocking primitives from rbd_request_fn().
Fixes: http://tracker.ceph.com/issues/8818
Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Tested-by: Eric Eastman <eric0e@aol.com>
Tested-by: Greg Wilson <greg.wilson@keepertech.com>
Reviewed-by: Alex Elder <elder@linaro.org>
If we are mapping a snapshot, we must read in the parent_overlap value
of that snapshot instead of that of the base image. Not doing so may
in particular result in us returning zeros instead of user data:
# cat overlap-snap.sh
#!/bin/bash
rbd create --size 10 --image-format 2 foo
FOO_DEV=$(rbd map foo)
dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
echo "Base image"
dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
rbd snap create bar@snap
BAR_DEV=$(rbd map bar@snap)
echo "Snapshot"
dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
rbd resize --allow-shrink --size 4 bar
echo "Snapshot after base image resize"
dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
# ./overlap-snap.sh
Base image
0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
Snapshot
0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
Resizing image: 100% complete...done.
Snapshot after base image resize
0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........
Even though bar@snap is taken with the old bar parent_overlap (8M),
reads from bar@snap beyond the new bar parent_overlap (4M) return
zeroes. Fix it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Currently rbd_dev_v2_header_info() reads in parent info before the snap
context is read in. This is wrong, because we may need to look at the
the parent_overlap value of the snapshot instead of that of the base
image, for example when mapping a snapshot - see next commit. (When
mapping a snapshot, all we got is its name and we need the snap context
to translate that name into an id to know which parent info to look
for.)
The approach taken here is to make sure rbd_dev_v2_parent_info() is
called after the snap context has been read in. The other approach
would be to add a parent_overlap field to struct rbd_mapping and
maintain it the same way rbd_mapping::size is maintained. The reason
I chose the first approach is that the value of keeping around both
base image values and the actual mapping values is unclear to me.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
There is no sense in trying to update the mapping size before it's even
been set.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Recently discovered watch/notify problems showed that we really can't
ignore errors in anything refresh related. Alas, currently there is
not much we can do in response to those errors, except print warnings.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
rbd_dev_spec_update() has two modes of operation, with nothing in
common between them. Split it into two functions, one for each mode
and make our expectations more clear.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
spec->image_id assert doesn't buy us much and image_format is asserted
in rbd_dev_header_name() and rbd_dev_header_info() anyway.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Make /sys/bus/rbd/devices/<id>/parent show the entire chain of parent
images. While at it, kernel sprintf() doesn't return negative values,
casting to unsigned long long is no longer necessary and there is no
good reason to split into multiple sprintf() calls.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Free memory allocated using kmem_cache_zalloc using kmem_cache_free
rather than kfree. The helper rbd_segment_name_free does the job here.
Its position is shifted above the calling function.
The Coccinelle semantic patch that detects this change is as follows:
// <smpl>
@@
expression x,E,c;
@@
x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
... when != x = E
when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
image_id is leaked if the parent happens to have been recorded already.
Fix it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix
rbd_dev_header_unwatch_sync() to destroy watch_request structures
before queuing watch-remove message while at it. This mistake slipped
into commit b30a01f2a3 ("rbd: fix osd_request memory leak in
__rbd_dev_header_watch_sync()") and could lead to "image still in use"
errors on image removal.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
In the past, rbd_dev_header_watch_sync() used to handle both watch and
unwatch requests and was entangled and leaky. Commit b30a01f2a3
("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()")
split it into two separate functions. This commit cleanly abstracts
the common bits, relying on the fixed rbd_obj_request_wait().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
rbd_obj_request_wait() should cancel the underlying OSD request if
interrupted. Otherwise libceph will hold onto it indefinitely, causing
assert failures or leaking the original object request.
This also adds an rbd wrapper around ceph_osdc_cancel_request() to
match rbd_obj_request_submit() and rbd_obj_request_wait().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
The following check in rbd_img_obj_request_submit()
rbd_dev->parent_overlap <= obj_request->img_offset
allows the fall through to the non-layered write case even if both
parent_overlap and obj_request->img_offset belong to the same RADOS
object. This leads to data corruption, because the area to the left of
parent_overlap ends up unconditionally zero-filled instead of being
populated with parent data. Suppose we want to write 1M to offset 6M
of image bar, which is a clone of foo@snap; object_size is 4M,
parent_overlap is 5M:
rbd_data.<id>.0000000000000001
---------------------|----------------------|------------
| should be copyup'ed | should be zeroed out | write ...
---------------------|----------------------|------------
4M 5M 6M
parent_overlap obj_request->img_offset
4..5M should be copyup'ed from foo, yet it is zero-filled, just like
5..6M is.
Given that the only striping mode kernel client currently supports is
chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
parent_overlap up to the next object boundary for the purposes of the
overlap check.
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
rbd_open(), called every time the device is opened, calls
set_device_ro(). There's no reason to set the device read-only or
read-write every time it is opened. Just do this once during device
setup, using set_disk_ro() instead because the struct block_device
isn't available to us there.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
get_user() and set_disk_ro() may allocate memory, leading to a
potential deadlock if theye are called while a spin lock is held.
Move the acquisition and release of rbd_dev->lock from rbd_ioctl()
into rbd_ioctl_set_ro(), so it can occur between get_user() and
set_disk_ro().
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
When running the following commands:
[root@ceph0 mnt]# blockdev --setro /dev/rbd1
[root@ceph0 mnt]# blockdev --getro /dev/rbd1
0
The block setro didn't take effect, it is because
the rbd doesn't support ioctl of block driver.
This resolves:
http://tracker.ceph.com/issues/6265
Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ida_destroy() needs to be called on module exit to release ida caches.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Each image request contains a reference count, but to date it has
not actually been used. (I think this was just an oversight.) A
recent report involving rbd failing an assertion shed light on why
and where we need to use these reference counts.
Every OSD request associated with an object request uses
rbd_osd_req_callback() as its callback function. That function will
call a helper function (dependent on the type of OSD request) that
will set the object request's "done" flag if the object request if
appropriate. If that "done" flag is set, the object request is
passed to rbd_obj_request_complete().
In rbd_obj_request_complete(), requests are processed in sequential
order. So if an object request completes before one of its
predecessors in the image request, the completion is deferred.
Otherwise, if it's a completing object's "turn" to be completed, it
is passed to rbd_img_obj_end_request(), which records the result of
the operation, accumulates transferred bytes, and so on. Next, the
successor to this request is checked and if it is marked "done",
(deferred) completion processing is performed on that request, and
so on. If the last object request in an image request is completed,
rbd_img_request_complete() is called, which (typically) destroys
the image request.
There is a race here, however. The instant an object request is
marked "done" it can be provided (by a thread handling completion of
one of its predecessor operations) to rbd_img_obj_end_request(),
which (for the last request) can then lead to the image request
getting torn down. And this can happen *before* that object has
itself entered rbd_img_obj_end_request(). As a result, once it
*does* enter that function, the image request (and even the object
request itself) may have been freed and become invalid.
All that's necessary to avoid this is to properly count references
to the image requests. We tear down an image request's object
requests all at once--only when the entire image request has
completed. So there's no need for an image request to count
references for its object requests. However, we don't want an
image request to go away until the last of its object requests
has passed through rbd_img_obj_callback(). In other words,
we don't want rbd_img_request_complete() to necessarily
result in the image request being destroyed, because it may
get called before we've finished processing on all of its
object requests.
So the fix is to add a reference to an image request for
each of its object requests. The reference can be viewed
as representing an object request that has not yet finished
its call to rbd_img_obj_callback(). That is emphasized by
getting the reference right after assigning that as the image
object's callback function. The corresponding release of that
reference is done at the end of rbd_img_obj_callback(), which
every image object request passes through exactly once.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
osd_request, along with r_request and r_reply messages attached to it
are leaked in __rbd_dev_header_watch_sync() if the requested image
doesn't exist. This is because lingering requests are special and get
an extra ref in the reply path. Fix it by unregistering linger request
on the error path and split __rbd_dev_header_watch_sync() into two
functions to make it maintainable.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Given an existing idle mapping (img1), mapping an image (img2) in
a newly created pool (pool2) fails:
$ ceph osd pool create pool1 8 8
$ rbd create --size 1000 pool1/img1
$ sudo rbd map pool1/img1
$ ceph osd pool create pool2 8 8
$ rbd create --size 1000 pool2/img2
$ sudo rbd map pool2/img2
rbd: sysfs write failed
rbd: map failed: (2) No such file or directory
This is because client instances are shared by default and we don't
request an osdmap update when bumping a ref on an existing client. The
fix is to use the mon_get_version request to see if the osdmap we have
is the latest, and block until the requested update is received if it's
not.
Fixes: http://tracker.ceph.com/issues/8184
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the libceph/osd side.
"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
In preparation for prefixing rbd writes with an allocation hint
introduce a num_ops parameter for rbd_osd_req_create(). The rationale
is that not every write request is a write op that needs to be prefixed
(e.g. watch op), so the num_ops logic needs to be in the callers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Our longest osd request now contains 3 ops: copyup+hint+write.
Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was
hard-coded to 2. Fix it, and switch to rbd_assert while at it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is
not only insufficient, but also triggers an rbd_assert() in
rbd_obj_request_destroy():
Assertion failure in rbd_obj_request_destroy() at line 1867:
rbd_assert(obj_request->img_request == NULL);
rbd_img_obj_request_add() adds obj_requests to the img_request, the
opposite is rbd_img_obj_request_del(). Use it.
Fixes: http://tracker.ceph.com/issues/7327
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Commit 03507db631 ("rbd: fix buffer size for writes to images with
snapshots") moved the call to rbd_img_obj_request_add() up, making the
out_partial label bogus. Remove it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>