Keep a reference count for uses of the parent information for an rbd
device.
An initial reference is set in rbd_img_request_create() if the
target image has a parent (with non-zero overlap). Each image
request for an image with a non-zero parent overlap gets another
reference when it's created, and that reference is dropped when the
request is destroyed.
The initial reference is dropped when the image gets torn down.
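
The counting scheme described above can be modeled in userspace as follows. This is only a sketch: struct parent_dev_model, parent_ref_get() and parent_ref_put() are hypothetical names, and a plain integer stands in for whatever counter the driver actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the device state; not the driver's struct. */
struct parent_dev_model {
    uint64_t parent_overlap;  /* 0 means there is no usable parent data */
    int parent_ref;           /* uses of the parent information */
};

/* Take a reference only while there is a non-zero parent overlap. */
static bool parent_ref_get(struct parent_dev_model *dev)
{
    if (dev->parent_overlap == 0)
        return false;
    dev->parent_ref++;
    return true;
}

/* Drop a reference; returns true when the last one goes away. */
static bool parent_ref_put(struct parent_dev_model *dev)
{
    assert(dev->parent_ref > 0);
    return --dev->parent_ref == 0;
}
```

In this model the initial reference and each per-request reference go through the same get/put pair, so the parent information can only be released once both the image teardown and every outstanding request have dropped theirs.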
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define rbd_parent_request_create() and rbd_parent_request_destroy()
to handle the creation of parent image requests submitted for
layered image objects. For simplicity, let rbd_img_request_put()
handle dropping the reference to any image request (parent or not),
and call whichever destructor is appropriate on the last put.
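
A rough userspace model of "one put routine, two destructors" might look like the following. All names here (struct img_req_model, img_req_put(), the DESTROY_* values) are made up for illustration and are not the driver's own.

```c
#include <assert.h>
#include <stdbool.h>

enum { DESTROY_NONE, DESTROY_PLAIN, DESTROY_PARENT };

/* Hypothetical model of an image request. */
struct img_req_model {
    int refs;          /* outstanding references */
    bool is_parent;    /* request was created for a parent image */
    int destroyed_by;  /* records which destructor ran (demo only) */
};

static void plain_destroy(struct img_req_model *r)
{
    r->destroyed_by = DESTROY_PLAIN;
}

static void parent_destroy(struct img_req_model *r)
{
    r->destroyed_by = DESTROY_PARENT;
}

/* One put routine for every image request; the last put picks the destructor. */
static void img_req_put(struct img_req_model *r)
{
    assert(r->refs > 0);
    if (--r->refs > 0)
        return;
    if (r->is_parent)
        parent_destroy(r);
    else
        plain_destroy(r);
}
```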
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define rbd_dev_unparent() to encapsulate cleaning up parent data
structures from a layered rbd image.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Previously when a layered write was going to involve a copyup
request, the original osd request was released before submitting the
parent full-object read. The osd request for the copyup would then
be allocated in rbd_img_obj_parent_read_full_callback().
Shortly we will be handling the event of mapped layered images
getting flattened, and when that occurs we need to resubmit the
original request. We therefore don't want to release the osd
request until we really know we're going to replace it--in the
callback function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get parent info for format 2 images on every refresh (rather than
just during the initial probe). This will be needed to detect the
disappearance of the parent image in the event a mapped image
becomes unlayered (i.e., flattened). Avoid leaking the previous
parent spec on the second and subsequent times this information is
requested by dropping the previous one (if any) before updating it.
(Also, extract the pool id into a local variable before assigning
it into the parent spec.)
Switch to using a non-zero parent overlap value rather than the
existence of a parent (a non-null parent_spec pointer) to determine
whether to mark a request layered. It will soon be possible for
a layered image to become unlayered while a request is in flight.
This means that the layered flag for an image request indicates that
there was a non-zero parent overlap at the time the image request
was created. The parent overlap can change thereafter, which may
lead to special handling at request submission or completion time.
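
To illustrate the point that the layered flag is a snapshot of the overlap at creation time, here is a small userspace sketch. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical models; only the fields needed for the illustration. */
struct layered_dev_model { uint64_t parent_overlap; };
struct layered_req_model { bool layered; };

/* The flag captures the overlap at creation; later changes do not rewrite it. */
static struct layered_req_model layered_req_create(const struct layered_dev_model *dev)
{
    struct layered_req_model req;

    assert(dev != NULL);
    req.layered = dev->parent_overlap != 0;
    return req;
}
```

A request created while the overlap was non-zero stays marked layered even if the overlap later drops to zero, which is why submission and completion paths may need to re-check the current overlap.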
This and the next several patches are related to:
http://tracker.ceph.com/issues/3763
NOTE:
If an error occurs while refreshing the parent info (i.e.,
requesting it after initial probe), the old parent info will
persist. This is not really correct, and is a scenario that needs
to be addressed. For now we'll assert that the failure mode is
unlikely, but the issue has been documented in tracker issue 5040.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An rbd clone image that has an overlap with its parent of 0 is
effectively not a layered image at all. Detect this case and treat
such an image as non-layered. Issue a warning to be sure the user
knows what's going on.
This resolves:
http://tracker.ceph.com/issues/5028
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, rbd_img_obj_parent_read_full() assumes the incoming
object request contains bio data. But if a layered image is part of
a multi-layer stack of images it will result in read requests of
page data to parent images.
This is handling the same kind of issue as was resolved by this
commit:
5b2ab72d rbd: support reading parent page data
This resolves:
http://tracker.ceph.com/issues/5027
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The code that reads object data from the parent for a copyup on
write request currently assumes that the size of that request is the
size of a "full" object from the original target image.
That is not necessarily the case. The parent overlap could reduce
the request size below that. To fix that assumption we need to
record the number of pages in the copyup_pages array, for both an
image request and an object request. Rename a local variable in
rbd_img_obj_parent_read_full_callback() to reflect we're recording
the length of the parent read request, not the size of the target
object.
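
The arithmetic behind this can be sketched as below. This is an assumption-laden model rather than the driver code: MODEL_PAGE_SIZE is an assumed 4 KiB page, and parent_read_length() and page_count_for() are hypothetical helpers that clip an object-sized read to the parent overlap and round the result up to whole pages.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL  /* assumed page size for this sketch */

/* Bytes to read from the parent: the object size clipped by the overlap. */
static uint64_t parent_read_length(uint64_t obj_start, uint64_t obj_size,
                                   uint64_t parent_overlap)
{
    if (obj_start >= parent_overlap)
        return 0;                          /* object lies beyond the overlap */
    if (parent_overlap - obj_start < obj_size)
        return parent_overlap - obj_start; /* overlap ends inside the object */
    return obj_size;
}

/* Pages needed to hold `length` bytes, rounding up. */
static uint64_t page_count_for(uint64_t length)
{
    assert(length <= UINT64_MAX - (MODEL_PAGE_SIZE - 1));
    return (length + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

With a 4 MiB object whose parent overlap ends 5000 bytes into it, the read length is 5000 bytes and only 2 pages are needed, not the 1024 a full object would take.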
This resolves:
http://tracker.ceph.com/issues/5038
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get rid of rbd_img_request_get(), because it isn't used, and maybe
won't ever be needed.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Any changes to parent images are immaterial to any mapped clone.
So there is no need to have a watch event registered on header
objects except for the header object of an image that is mapped.
In fact, a watch request is a write operation, and we may only
have read access to a parent image.
We can't set up the watch request until we know the name of the
header object though. So pass a flag to rbd_dev_image_probe() to
indicate whether this probe is for a mapping or for a parent image.
Change the second parameter to rbd_dev_header_watch_sync() to be
Boolean while we're at it.
This resolves:
http://tracker.ceph.com/issues/4941
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd_dev->mapping field for a parent image is not meaningful.
Since rbd_dev_image_probe() is used both for images being mapped and
their parents, it doesn't make sense to set that flag in that
function.
So move the setting of the mapping.read_only flag out of
rbd_dev_image_probe() and into rbd_add() instead.
This resolves:
http://tracker.ceph.com/issues/4940
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, rbd_img_parent_read() assumes the incoming object request
contains bio data. But if a layered image is part of a multi-layer
stack of images it will result in read requests of page data to parent
images.
Fortunately, it's not hard to add support for page data.
This resolves:
http://tracker.ceph.com/issues/4939
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_img_obj_parent_read_full_callback() there is an assertion
intended to verify the size of the image request for a full parent
read was the size of the original request's target object. But the
assertion was looking at the parent image order rather than the
original one, and these values can differ.
Fix that.
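
As an illustration of why the choice of order matters, consider this sketch. It relies on an image's object size being 2^order bytes; object_size() and length_matches() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An image's object size is 2^order bytes. */
static uint64_t object_size(uint8_t order)
{
    assert(order < 64);
    return 1ULL << order;
}

/* True when a full-object read length matches the image it was sized for. */
static bool length_matches(uint64_t length, uint8_t order)
{
    return length == object_size(order);
}
```

If the clone uses order 22 and its parent uses order 23, a read sized for the clone's object passes the check against order 22 and fails it against order 23.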
This resolves:
http://tracker.ceph.com/issues/4938
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This rearranges rbd_dev_v2_refresh() so it works more like
rbd_dev_v1_header_info(). While format 1 images need to read the
whole header object to get any information, format 2 can collect
almost all information selectively. So the one-time initialization
will remain in a separate function--based on rbd_dev_v2_probe().
Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have
it call rbd_dev_v2_header_onetime() if it's being called for the
first time for the given rbd device.
Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and
remove the image size and snapshot context calls it held in
common with the refresh function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and
rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls
in their place.
Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to
be more generic (it will better reflect what happens with format 2
images).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An rbd_dev structure's fields are all zero-filled for an initial
probe, so there's no need to explicitly zero the parent_spec
and parent_overlap fields in rbd_dev_v1_probe(). Removing these
assignments makes rbd_dev_v1_probe() *almost* trivial.
Move the dout() message that announces discovery of an image into
rbd_dev_image_probe(), generalize to support images in either format
and only show it if an image is fully discovered.
This highlights that there are some unnecessary cleanups in the error
path for rbd_dev_v1_probe(), so they can be removed.
Now rbd_dev_v1_probe() *is* a trivial wrapper function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Now that rbd_header_from_disk() only fills in one-time fields once,
we can extend it slightly so it releases the other fields before
replacing their values. This way there's no need to pass a
temporary buffer and then copy all the results in. Just use the rbd
device header structure in rbd_header_from_disk() so its values get
updated directly.
Note that this means we need to take the header semaphore at the
point we update things. So pass the rbd_dev rather than the address
of its header as its first argument to rbd_header_from_disk(), and
have it return an error code.
As a result, rbd_dev_v1_header_read() does all the work,
rbd_read_header() becomes unnecessary, and rbd_dev_v1_refresh()
becomes a very simple wrapper.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This rearranges rbd_header_from_disk() so that it:
    - allocates the snapshot context right away
    - keeps results in local variables, not changing the passed-in
      header until it's known we'll succeed
    - does initialization of set-once fields in a header only if
      they have not already been set
The last point is moot at the moment, because rbd_read_header()
(the only caller) always supplies a zero-filled header buffer.
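
The second and third points above follow a common pattern, sketched here in userspace. struct hdr_model and hdr_update() are hypothetical, and the header is reduced to one set-once field and one updatable field; snapshot context allocation is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much-reduced header; not the driver's rbd_image_header. */
struct hdr_model {
    char *object_prefix;      /* set once */
    unsigned long long size;  /* may change on every refresh */
};

/* Fill `hdr` from already-decoded values; leave it untouched on failure. */
static int hdr_update(struct hdr_model *hdr, const char *prefix,
                      unsigned long long size)
{
    char *new_prefix = NULL;

    assert(hdr != NULL);
    if (!hdr->object_prefix) {          /* set-once field not yet set */
        if (!prefix)
            return -EINVAL;
        new_prefix = malloc(strlen(prefix) + 1);
        if (!new_prefix)
            return -ENOMEM;
        strcpy(new_prefix, prefix);
    }

    /* Nothing can fail past this point, so commit the results. */
    if (new_prefix)
        hdr->object_prefix = new_prefix;
    hdr->size = size;
    return 0;
}
```

A second call with a different prefix leaves object_prefix alone and only updates size, which is the behavior a refresh path would want.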
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The passed-in header structure is zeroed in rbd_header_from_disk().
Instead, have the caller do it. Note that there are two callers,
rbd_dev_v1_refresh() and rbd_dev_v1_probe(). The latter already has
a zeroed header structure so zeroing it isn't necessary there.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Defer setting the size and features fields of a mapped image until
after the Linux disk structure is set up. Set the capacity of the
disk after that.
Rearrange the definition of rbd_image_header, separating the fields
that are set only once from those that can be updated.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull block core updates from Jens Axboe:
 - Major bit is Kent's prep work for immutable bio vecs.
 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.
 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.
 - Tejun's changes to convert the writeback thread pool to the generic
   workqueue mechanism.
 - Runtime PM framework; SCSI patches exist on top of these in James'
   tree.
 - A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
Hold off setting the read-only flag in rbd_add() for an image being
mapped until we have successfully probed the image. At that point
we know whether it's a snapshot mapping or not, so we can set the
read-only flag in that one place rather than doing so (for
snapshots) in rbd_dev_mapping_set(). To do this, pass a flag to the
image probe routine indicating whether we want a read-only mapping.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This function is a duplicate of rbd_dev_mapping_clear(), and was
added by mistake.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently rbd_dev_mapping_set() looks up the snapshot id for the
snapshot whose name is found in the rbd device's spec structure.
That function gets called by rbd_dev_device_setup(), which is
called by rbd_add() *after* rbd_dev_image_probe(). If the
image probe succeeds, the rbd device's spec will already have
been updated to include names and ids for all fields.
Therefore there's no need to look up the snapshot id in
rbd_dev_mapping_set().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The presence of the LAYERING bit in an rbd image's feature mask does
not guarantee the image actually has a parent image. Currently that
bit is set only when a clone (i.e., image with a parent) is created,
but it is (currently) not cleared if that clone gets flattened back
into a "normal" image. A "parent_id" query will leave the
parent_spec for the image being mapped a null pointer, but will not
return an error.
Currently, whenever an image with the LAYERING feature gets mapped, a
warning about the use of layered images gets printed. But we don't
want to do this for a flattened image, so print the warning only
if we find there is a parent spec after the probe.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Since rbd_update_mapping_size() is now a trivial wrapper, just open
code it in its two callers.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When a mapped image changes size, we change the capacity recorded
for the Linux disk associated with it, in rbd_update_mapping_size().
That function is called in two places--the format 1 and format 2
refresh routines.
There is no need to set the capacity while holding the header
semaphore. Instead, do it in the common rbd_dev_refresh(), using
the logic that's already there to initiate disk revalidation.
Add handling in the request function, just in case a request
that exceeds the capacity of the device comes in (perhaps one
that was started before a refresh shrunk the device).
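
The range check in the request function could look roughly like this. It is a sketch under assumptions: 512-byte sectors, and hypothetical helpers sectors_to_bytes() and request_in_range(). The comparison is arranged so that offset + length is never computed and cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9  /* assumed 512-byte sectors */

/* Convert a sector count to bytes, guarding against overflow. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    assert(sectors <= (UINT64_MAX >> MODEL_SECTOR_SHIFT));
    return sectors << MODEL_SECTOR_SHIFT;
}

/* True if [offset, offset + length) fits within `capacity` bytes. */
static bool request_in_range(uint64_t offset, uint64_t length,
                             uint64_t capacity)
{
    if (offset > capacity)
        return false;
    return length <= capacity - offset;  /* avoids computing offset + length */
}
```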
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This commit:
d98df63e rbd: revalidate_disk upon rbd resize
instituted a call to revalidate_disk() to notify interested parties
that a mapped image has changed size. This works well, as long as
the rbd device doesn't map a snapshot.
A snapshot will never change size. However, the base image the
snapshot is associated with can, and it can do so while the snapshot
is mapped.
The problem is that the test for the size is looking at the size of
the base image, not the size of the mapped snapshot. This patch
corrects that.
Update the warning message shown in the event of error, and move
it into the callers.
This resolves:
http://tracker.ceph.com/issues/4911
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When rbd_dev_v2_refresh() is called, the rbd device already has a
snapshot context associated with it. But that never gets freed,
the pointer just gets overwritten.
Fix this by dropping the rbd device's reference to the snapshot
context before overwriting the pointer.
Because ceph_put_snap_context() already handles a null pointer,
we don't need to check for that (for the probe case, where no
context has yet been assigned).
This resolves:
http://tracker.ceph.com/issues/4912
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull more vfs updates from Al Viro:
"A couple of fixes + getting rid of __blkdev_put() return value"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
proc: Use PDE attribute setting accessor functions
make blkdev_put() return void
block_device_operations->release() should return void
mtd_blktrans_ops->release() should return void
hfs: SMP race on directory close()
The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When a read for a layered image object finds the target object
doesn't exist, a read image request for the parent image is created
and submitted. When that completes, the callback routine was
not releasing that parent image request. Fix that.
The slab allocation stuff just added has greatly simplified the
search for the source of this memory leak.
This resolves:
http://tracker.ceph.com/issues/4803
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The names of objects used for image object requests are always fixed
size. So create a slab cache to manage them. Define a new function
rbd_segment_name_free() to match rbd_segment_name() (which is what
supplies the dynamically-allocated name buffer).
This is part of:
http://tracker.ceph.com/issues/3926
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Create a slab cache to manage rbd_obj_request allocation. We aren't
using a constructor, and we'll zero-fill object request structures
when they're allocated.
This is part of:
http://tracker.ceph.com/issues/3926
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The next patch will define a slab allocator for object requests.
To use that we'll need to allocate the name of an object separate
from the request structure itself.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Create a slab cache to manage rbd_img_request allocation. Nothing
too fancy at this point--we'll still initialize everything at
allocation time (no constructor).
This is part of:
http://tracker.ceph.com/issues/3926
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use bsearch(3) to make snapshot lookup by id more efficient. (There
could be thousands of snapshots, and conceivably many more.)
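
For illustration, a userspace equivalent using the C library's bsearch() is sketched below. It assumes the snapshot id array is sorted from highest to lowest id; id_compare_desc() and snap_index() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparator for ids sorted from highest to lowest. */
static int id_compare_desc(const void *key, const void *elem)
{
    uint64_t k = *(const uint64_t *)key;
    uint64_t e = *(const uint64_t *)elem;

    if (k > e)
        return -1;  /* a larger id sorts earlier in a descending array */
    if (k < e)
        return 1;
    return 0;
}

/* Return the index of snap_id in ids[], or -1 if it is absent. */
static long snap_index(const uint64_t *ids, size_t count, uint64_t snap_id)
{
    const uint64_t *found;

    assert(ids != NULL || count == 0);
    if (count == 0)
        return -1;
    found = bsearch(&snap_id, ids, count, sizeof(*ids), id_compare_desc);
    return found ? (long)(found - ids) : -1;
}
```

The lookup cost grows with the logarithm of the snapshot count rather than linearly, which is what makes it attractive when there are thousands of snapshots.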
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This functionality inadvertently disappeared in the last patch.
A snapshot can be removed at just about any time. In particular,
it can disappear even while it is in use by an rbd client as a
mapped image.
The rbd client deals with such a disappearance by responding to new
requests with ENXIO. This is implemented by each rbd device
maintaining an EXISTS flag, which is normally set but cleared if a
snapshot disappears.
This patch (re-)implements the clearing of that flag.
Whenever mapped image header information is refreshed, if the
mapping is for a snapshot, verify the mapped snapshot is still
present in the updated snapshot context. If it is not, clear the
flag.
It is not necessary to check this in the initial probe, because the
probe will not succeed if the snapshot doesn't exist.
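
A userspace model of this check is sketched below. struct mapping_model and its helpers are hypothetical, and a linear scan stands in for whatever lookup the driver uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a mapping and its EXISTS flag. */
struct mapping_model {
    bool is_snapshot;  /* mapping a snapshot rather than the base image */
    uint64_t snap_id;
    bool exists;       /* models the EXISTS flag */
};

static bool snap_present(const uint64_t *ids, size_t count, uint64_t id)
{
    for (size_t i = 0; i < count; i++)
        if (ids[i] == id)
            return true;
    return false;
}

/* After a refresh, clear EXISTS if the mapped snapshot has vanished. */
static void refresh_exists(struct mapping_model *m, const uint64_t *ids,
                           size_t count)
{
    assert(m != NULL);
    if (m->is_snapshot && !snap_present(ids, count, m->snap_id))
        m->exists = false;
}

/* New requests against a vanished snapshot are refused with ENXIO. */
static int request_check(const struct mapping_model *m)
{
    return m->exists ? 0 : -ENXIO;
}
```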
This resolves:
http://tracker.ceph.com/issues/4880
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
We no longer use the snapshot list for anything. When we need to
look up a snapshot name, id, size, or feature mask, we just do it
directly rather than relying on this list being updated with every
refresh. The main reason it existed was for the benefit of the
device/sysfs entries that previously were associated with snapshots.
So get rid of the snapshot list, and struct rbd_snap, and the
hundreds of lines of code that supported them.
This resolves:
http://tracker.ceph.com/issues/4868
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch defines a handful of new functions that will allow
us to get rid of the rbd device structure's list of snapshots.
Define rbd_snap_id_by_name() to look up a snapshot id given its
name. This is efficient for format 1 images but not for format 2.
Fortunately it only gets called at mapping time so it's not that
critical.
Use rbd_snap_id_by_name() to find out the id for a snapshot getting
mapped, and pass that id to new functions rbd_snap_size() and
rbd_snap_features() to look up information about a given snapshot's
size and feature mask given its snapshot id. All this gets done
in rbd_dev_mapping_set().
As a result, snap_by_name() is no longer needed, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In order to align with what was needed for format 1 rbd images,
rbd_dev_v2_snap_info() was set up to take as argument an index into
the array of snapshot ids in a rbd device's snapshot context.
This switches that around, so we pass the snapshot id instead.
In doing this, rbd_snap_name() now returns a dynamically-allocated
string rather than a fixed one, so there's no need to make a
duplicate in its caller, rbd_dev_spec_update().
This means the following functions take a snapshot id where they
previously used an index value:
rbd_dev_snap_info()
rbd_dev_v1_snap_info()
rbd_dev_v2_snap_info()
A new function, rbd_dev_snap_index(), determines the snap index for
format 1 images and uses it to look up the name.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rather than scanning the list of snapshot structures for it, scan
the snapshot context buffer containing snapshot names in order to
determine for a format 1 image the name associated with a given
snapshot id.
Pull out the part of rbd_dev_v1_snap_info() that does this scan into
a new function, _rbd_dev_v1_snap_name(). Have that function return
a dynamically-allocated copy of the name, and don't duplicate it in
rbd_dev_v1_snap_info().
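
The scan can be pictured with the following userspace sketch. It assumes the names are stored back to back as NUL-terminated strings, in the same order as the id array; snap_name_by_id() is a hypothetical stand-in, not the function added by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * ids:   `count` snapshot ids
 * names: `count` NUL-terminated names stored back to back, same order
 * Returns a malloc'd copy of the matching name, or NULL if none.
 */
static char *snap_name_by_id(const uint64_t *ids, const char *names,
                             size_t count, uint64_t snap_id)
{
    const char *p = names;
    char *copy;
    size_t i;

    assert(count == 0 || (ids != NULL && names != NULL));
    for (i = 0; i < count; i++) {
        if (ids[i] == snap_id)
            break;
        p += strlen(p) + 1;  /* skip this name and its terminator */
    }
    if (i == count)
        return NULL;         /* no such snapshot id */

    copy = malloc(strlen(p) + 1);
    if (copy)
        strcpy(copy, p);
    return copy;             /* caller frees */
}
```

Because the function hands back its own copy, the caller does not need to duplicate the name again before keeping it.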
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Nothing ever uses the version field maintained in the object request
structure any more, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Only NULL is passed as the version argument to rbd_obj_method_sync(),
so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Continued from the last patch, more parameters that can go away
because we no longer have a need to track object versions.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Several functions in rbd have parameters meant to allow the version
of an object to be passed in or out. The purpose of those was to
allow the version of a header object to be maintained, but we no
longer do that. As a result, these parameters are never actually
needed or used, so get rid of them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd code takes care to maintain the version of the header
object. This was done in hopes of using it to detect a change in
the object between reading it and setting up a watch request to
be notified of changes.
The mechanism was never fully implemented, however. And we now
avoid the original problem by setting up the watch request before
ever reading the content of the header.
The osd doesn't interpret the object version supplied with a WATCH
osd op, nor does it use the version supplied with a NOTIFY_ACK op
(we can just supply 0 for both). There is therefore no need to
maintain the header's object version any more, so stop doing so.
We'll be able to simplify some more rbd code in the next few patches
as a result of this.
This resolves:
http://tracker.ceph.com/issues/3952
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Make explicit that snapshot names don't change by making functions
return and take parameters that point to const qualified data.
This resolves:
http://tracker.ceph.com/issues/4867
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Whenever a header object event causes a mapped rbd image to refresh
its header information, revalidate_disk() is being called. This was
done in rbd_dev_refresh() outside the control mutex in order to
avoid a lock inversion. Although an event like this *might*
indicate the image has changed size, most of the time it does not.
Record the image size before and after the refresh, and only
call revalidate_disk() if it changes.
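
In outline, the before/after comparison works like this userspace sketch. struct refresh_model and refresh_size() are hypothetical, and a counter stands in for the revalidate_disk() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the state touched by a refresh. */
struct refresh_model {
    uint64_t size;           /* current mapped size */
    unsigned revalidations;  /* counts simulated revalidate calls */
};

/* Apply a refreshed size; count a revalidation only if the size changed. */
static bool refresh_size(struct refresh_model *dev, uint64_t refreshed_size)
{
    uint64_t before;

    assert(dev != NULL);
    before = dev->size;
    dev->size = refreshed_size;
    if (dev->size == before)
        return false;
    dev->revalidations++;    /* stands in for revalidate_disk() */
    return true;
}
```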
This resolves:
http://tracker.ceph.com/issues/4867
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A warning gets spewed for any image being probed, including parent
images. Set up a condition such that the warning message only gets
printed for the image being mapped, not any of its parents.
Also, I didn't like the way the warning ended up being so long.
Make it a terse warning instead. People experimenting with layering
will know what the message means.
This is part of:
http://tracker.ceph.com/issues/4867
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Now that we have a library routine to create snap contexts, use it.
This is part of:
http://tracker.ceph.com/issues/4857
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Stop setting up Linux devices during the image probe operation.
Instead, set up the devices as a separate step after the image
probe, in rbd_add().
A consequence of this is that only mapped images get devices
assigned to them, which is pretty sweet.
This resolves:
http://tracker.ceph.com/issues/4774
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently an rbd_device structure gets destroyed from the release
routine for the device embedded within it. Stop doing that, instead
calling rbd_dev_image_release() right after rbd_bus_del_dev()
wherever the latter is called.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new function rbd_dev_unprobe() which undoes state changes
that occur from calling rbd_dev_v1_probe() or rbd_dev_v2_probe().
Note that this is a superset of rbd_header_free(), which is now
getting removed (it seems to have been used improperly anyway).
Flesh out rbd_dev_image_release() so it undoes exactly what
rbd_dev_image_probe() does.
This means that:
    - rbd_dev_device_release() gets called when the last device
      reference gets dropped;
    - that undoes everything done by the rbd_dev_device_setup() call
      at the end of rbd_dev_image_probe() (and nothing more), ending
      by calling rbd_dev_image_release(); and
    - rbd_dev_image_release() undoes everything else done by
      rbd_dev_image_probe() (and this includes a call to
      rbd_dev_unprobe()).
This means the image and device portions of an rbd device are fairly
cleanly separated now, so error paths should be a little easier to
verify than they used to be.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rename rbd_dev_probe_finish() to be rbd_dev_device_setup(). Its
purpose is to set up the Linux side of an rbd device mapping.
Rename rbd_dev_release() to be rbd_dev_device_release(), making
it more obvious it serves as the inverse of the setup function
(or it will).
Encapsulate some of what was done in rbd_dev_release() into a new
function rbd_dev_image_release(), which serves as the inverse of
setting up the ceph side of the mapped rbd image.
Define a new helper rbd_dev_clear_mapping() to simply zero out the
fields of a mapping structure--the inverse of rbd_dev_set_mapping().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Drop the module reference at the end of rbd_remove() for symmetry
with adding a reference at the top of rbd_add().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move setting up the watch request for an image so it's done in
rbd_dev_image_probe() rather than rbd_dev_probe_finish(). Move
it all the way up to before doing the initial probe. This avoids
a potential race condition, in which we get (and use) the initial
snapshot context for an image, and it gets changed between that
time and the time we get the watch set up.
This resolves:
http://tracker.ceph.com/issues/3871
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When a format 2 image is refreshed, code is in place to verify that
the object order never changes from what it was originally. This
relies on the fact that the refresh will occur *after* an initial
load of information about the image.
An upcoming patch makes it possible for the refresh to occur first,
so we can no longer make this order check. The order really can't
ever change anyway--this was just a sanity check. So get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, a watch on an rbd device header object gets torn down
when its final Linux device reference gets dropped. Instead, tear
it down when removing the device. If an error occurs cleaning up
the watch event when unmapping, abort the unmap request.
All images (including parents) still get watch requests set up, so
tear these down also, in rbd_dev_remove_parent(). For now, ignore
any errors that occur in this case.
Get rid of local variable "rc" in rbd_remove(); use "ret" instead
(they both somehow ended up defined in the function and only one is
needed).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new function rbd_header_name(), which allocates and formats
the name of the header object for the rbd device.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move a block of initialization related to the "ceph-side" of an rbd
image out of rbd_dev_probe_finish() and into rbd_dev_image_probe().
Add appropriate error handling to clean things up in the event any
of these new functions return an error.
We know that rbd_dev_snaps_update(), rbd_dev_spec_update(), and
rbd_dev_probe_parent() all clean up after themselves before they
return an error, so no special cleanup is required except when an
earlier call succeeds. Since rbd_dev_spec_update() only updates the
spec field (whose cleanup will be handled by dropping the last
reference to the spec) there is no cleanup action associated with
that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Probe for a parent device earlier in rbd_dev_probe_finish(), before
starting to set up the Linux side of the rbd device.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When an error occurs while finishing probing a device it is assumed
that parent devices get cleaned up when deleting a device. They
don't. Add a call to clean them up. Note that this means the
parent spec will already be cleaned up so it doesn't have to be
in one of the rbd_add() error paths.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In certain error paths, it is possible for an rbd device to have a
parent spec but no parent rbd_dev. In rbd_dev_remove_parent() use
the parent field rather than parent_spec in determining whether to
try to remove any parent devices. Use assertions to indicate that
any non-null parent pointer has parent_spec associated with it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The function __rbd_remove() is used in two spots, and it's fairly
simple. It combines cleanup of part of the ceph-side state as well
as cleaning up the Linux-side state. Just open code it in the two
callers and eliminate the function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Set the mapping size and features earlier in rbd_dev_probe_finish().
Define rbd_dev_mapping_clear() as an inverse for setting those
fields, and use it both in error handling in rbd_dev_image_probe()
and in the final cleanup in rbd_dev_release(). Change the name
of rbd_dev_set_mapping() to rbd_dev_mapping_set().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Encapsulate the code that removes an rbd device's parent images into
a new function, rbd_dev_remove_parent().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Encapsulate the code that probes for an rbd device's parent images
into a new function, rbd_dev_probe_parent().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Don't set the disk capacity until right before we announce the
device as available for use.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Hold off setting the EXISTS rbd device flag until just before we
announce the disk as available for use. There's no point in doing
so any earlier than that, and at that point the device truly is
fully set up and ready to use.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This just tweaks a few things in the routines that implement
rbd sysfs files.
All of the entries for an rbd device in /sys/bus/rbd/devices/<id>/
will represent information whose valid values are known by the time
they are accessible.
Right now we get the size of the mapped image by a call to
get_capacity(). There's no need to do this, because that will
return what we last set the capacity to, which is just the size
recorded for the mapping. So just show that value instead.
We also get this under protection of the header semaphore, in order
to provide a precisely correct value. This isn't really necessary;
these files are really informational only and it's not necessary to
be so careful.
Finally, print a special value in case the major device number is
not recorded. Right now that won't matter much but soon the parent
images won't have devices associated with them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When a snapshot context update occurs, rbd_update_mapping_size() is
called to set the capacity of the disk to record the updated
size of the image in case it has changed.
There's a bug though. The mapping size is in units of *bytes*. The
code that updates the mapping size field is assigning a value that
has been scaled down to *sectors*.
Fix that. Also, check to see if the size has actually changed, and
don't bother updating things (specifically, calling set_capacity())
if it has not.
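To make the unit distinction concrete, here is a small user-space
model of the fixed behavior (not the driver code). The struct
mapping_model and the model_* helpers are hypothetical names made up
for this sketch; model_set_capacity() merely stands in for
set_capacity() and records what it was given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT	9	/* 512-byte sectors */

/* Hypothetical stand-in for the part of an rbd device used here. */
struct mapping_model {
	uint64_t size;			/* mapping size, in bytes */
	uint64_t capacity;		/* last capacity set, in sectors */
	unsigned int capacity_updates;	/* how often capacity was set */
};

/* Stand-in for set_capacity(); it just records the value. */
static void model_set_capacity(struct mapping_model *m, uint64_t sectors)
{
	m->capacity = sectors;
	m->capacity_updates++;
}

/* Returns true if the size changed and the capacity was updated. */
static bool model_update_mapping_size(struct mapping_model *m,
				      uint64_t image_size)
{
	assert(m);
	if (m->size == image_size)
		return false;		/* unchanged; nothing to do */

	m->size = image_size;		/* stays in bytes */
	model_set_capacity(m, image_size >> SECTOR_SHIFT);

	return true;
}
```

Only the value handed to the capacity call is scaled to sectors; the
recorded mapping size is left in bytes.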
This resolves:
http://tracker.ceph.com/issues/4833
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Fairly straightforward refactoring of rbd_dev_probe_update_spec().
The name is changed to rbd_dev_spec_update().
Rearrange it so nothing gets assigned to the spec until all of the
names have been successfully acquired.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rename rbd_dev_probe() to be rbd_dev_image_probe(). Its purpose
will eventually be to probe for the existence of a valid rbd image
for the rbd device--focusing only on the ceph side and not the Linux
device side of initialization.
For now the two "sides" are not fully separated, and this function
is still the entry point for initializing the full rbd device.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, rbd_dev_destroy() does more than just the inverse of what
rbd_dev_create() does. Stop doing that, and move the two extra
things it does into the three call sites.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Encapsulate the creation of a snapshot context for rbd in a new
function rbd_snap_context_create(). Define rbd wrappers for getting
and dropping references to them once they're created.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change some calls to WARN_ON() so they use rbd_warn() instead, so we
get consistent messaging. A few remain but they can probably just
go away eventually.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This commit added fetching of fancy striping parameters:
09186ddb rbd: get and check striping parameters
They are almost unused, but the two fields storing the information
really belonged in the rbd_image_header structure.
This patch moves them there.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Make the names and image id in an rbd_spec be pointers to constant
data. This required the use of a local variable to hold the
snapshot name in rbd_add_parse_args() to avoid a warning.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Set the rbd spec's snapshot id for an image getting mapped in
rbd_dev_probe_update_spec() rather than rbd_dev_set_mapping().
This is the more logical place for that to happen (even though
it means we might look up the snapshot by name twice).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A function called snap_by_name() ought to just look up a snapshot by
name. It does that, but then it assigns some stuff to the rbd
device structure as well.
Change the function to do just the lookup, and have the caller do
the assignments that follow.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If a format 2 image id is found for an image being mapped, but the
subsequent probe of the image fails, rbd_dev_probe() quits without
freeing the image id. Fix that.
Also drop a redundant hunk of code in rbd_dev_image_id().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, rbd_dev_probe() assumes that any error returned by
rbd_dev_image_id() is most likely -ENOENT, and responds by
calling the format 1 probe routine, rbd_dev_v1_probe(). Then,
at the top of rbd_dev_v1_probe(), an empty string is allocated
for the image id.
This is sort of unbalanced. Fix this by having rbd_dev_image_id()
look for -ENOENT from its "get_id" method call. If that is seen,
have it allocate the empty string there rather than depending on
rbd_dev_v1_probe() to do it.
Given that this is effectively defining the format of the image,
set rbd_dev->image_format inside rbd_dev_image_id() rather than in
the format-specific probe routines.
Also drop a redundant hunk of code in rbd_dev_image_id().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
I found during some failure injection testing that the call to
rbd_free_disk() in the error path of rbd_dev_probe_finish() was
dropping an extra reference to the disk queue. The problem
occurred when put_disk tried to drop a reference to the disk's
queue. A call to blk_cleanup_queue() just prior to that will have
also dropped a reference to the queue.
The problem is that the reference dropped by put_disk() is assumed
to have been taken by add_disk(). Our code has error paths that can
occur after the disk and its queue are initialized, but before the
call to add_disk(), and in those paths we won't have that extra
reference.
The fix is easy though. In rbd_free_disk() we're already checking
the disk's GENHD_FL_UP flag. That flag is an indication that
add_disk() has been called, so just call blk_cleanup_queue()
conditional on that flag being set.
This resolves:
http://tracker.ceph.com/issues/4800
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Now that rbd_obj_method_sync() returns the number of bytes
returned by the method call, that value should be used by
callers to ensure we don't overrun the valid portion of the
buffer.
Fix the two spots that remained that weren't doing that,
rbd_dev_image_name() and rbd_dev_v2_snap_name().
Rearrange the error path slightly in rbd_dev_v2_snap_name().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When the snapshot context for an rbd device gets updated (or the
initial one is recorded) a list of snapshot structures is created
to represent them, one entry per snapshot. Each entry includes a
dynamically-allocated copy of the snapshot name.
Currently the name is allocated in rbd_snap_create(), as a duplicate
of the passed-in name.
For format 1 images, the snapshot name provided is just a pointer to
an existing name. But for format 2 images, the passed-in name is
already dynamically allocated, and in the process of duplicating
it here we are leaking the passed-in name.
Fix this by dynamically allocating the name for format 1 snapshots
also, and then stop allocating a duplicate in rbd_snap_create().
Change rbd_dev_v1_snap_info() so none of its parameters is
side-effected unless it's going to return success.
This is part of:
http://tracker.ceph.com/issues/4803
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rename __rbd_add_snap_dev() to be rbd_snap_create(). We no longer
have devices for non-mapped snapshots, and we're not actually
"adding" it to the list in this function, just creating it.
Rename rbd_remove_snap_dev() to be rbd_snap_destroy() for reasons
similar to the above. Stop having this function delete the snapshot
from its list (to be symmetrical with its create counterpart) and do
that in the caller instead.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change rbd_dev_v2_snap_info() so it only ever sets values of the
size and features parameters if looking up the snapshot name was
successful.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Only one of the two callers of _rbd_dev_v2_snap_size() needs the
order value returned. So make that an optional argument--a null
pointer if the caller doesn't need it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When an rbd image is initially mapped, its snapshot context is
collected, and then a list of snapshot entries representing the
snapshots in that context is created. The list is created using
rbd_dev_snaps_update(). (This function also supports updating an
existing snapshot list based on a new snapshot context.)
If an error occurs, updating the list is aborted, and the list is
currently left as-is, in an inconsistent state. At that point,
there may be a partially-constructed list, but the calling functions
(rbd_dev_probe_finish() from rbd_dev_probe() from rbd_add()) never
clean them up. So this constitutes a leak.
A snapshot list that is inconsistent with the current snapshot
context is of no use, and might even be actively bad. So rather
than just having the caller clean it up, have rbd_dev_snaps_update()
just clear out the entire snapshot list in the event an error
occurs.
The other place rbd_dev_snaps_update() is used is when a refresh is
triggered, either because of a watch callback or via a write to the
/sys/bus/rbd/devices/<id>/refresh interface. An error while
updating the snapshots has no substantive effect in either of those
cases, but one of them issues a warning. Move that warning to the
common rbd_dev_refresh() function so it gets issued regardless of
how it got initiated.
This is part of:
http://tracker.ceph.com/issues/4803
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When an rbd image gets mapped a device entry gets created for it
under /sys/bus/rbd/devices/<id>/. Inside that directory there are
sysfs files that contain information about the image: its size,
feature bits, major device number, and so on.
Additionally, if that image has any snapshots, a device entry gets
created for each of those as a "child" of the mapped device. Each
of these is a subdirectory of the mapped device, and each directory
contains a few files with information about the snapshot (its
snapshot id, size, and feature mask).
There is no clear benefit to having those device entries for the
snapshots. The information provided via sysfs is of little real
value--and all of it is available via rbd CLI commands. If we
still wanted to see the kernel's view of this information it could
be done much more simply by including it in a single sysfs file for
the mapped image.
But there *is* a clear cost to supporting them. Every time a snapshot
context changes, these entries need to be updated (deleted snapshots
removed, new snapshots created). The rbd driver is notified of
changes to the snapshot context via callbacks from an osd, and care
must be taken to coordinate removal of snapshot data structures
with the possibility of one of these notifications occurring.
Things would be considerably simpler if we just didn't have to
maintain device entries for the snapshots.
So get rid of them.
The ability to map a snapshot of an rbd image will remain; the only
thing lost will be the ability to query these sysfs directories for
information about snapshots of mapped images.
This resolves:
http://tracker.ceph.com/issues/4796
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Now that we have most everything in place to support layered rbd
images, enable support for them in the kernel client. Issue a
warning to the log that the support is considered experimental
whenever a format 2 layered image is mapped.
Note that we also have to claim to support the STRIPINGV2 feature,
due to a mistake in the way the rbd CLI set up those flags. This
feature can work if it has the right parameters, and safeguards
have been put in place to reject those images that do not have
compatible parameters.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If an rbd format 2 image indicates it supports the STRIPINGV2
feature we need to find out its stripe unit and stripe count in
order to know whether we can use it. We don't yet support fancy
striping fully, but if the default parameters are used the behavior
is indistinguishable from non-fancy striping.
This is necessary because some images require the STRIPINGV2 feature
even if they use the default parameters. (Which is to say the feature
bit was erroneously set even if the feature was not used.)
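A sketch of what such a compatibility check could look like follows.
It assumes that "default parameters" means a stripe unit equal to the
object size and a stripe count of 1; that definition, and the name
model_striping_is_default(), are assumptions of this example rather
than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the striping parameters behave like non-fancy
 * striping.  Assumption: that is the case when each stripe unit is
 * one whole object (1 << obj_order bytes) and there is a single
 * stripe.
 */
static bool model_striping_is_default(uint8_t obj_order,
				      uint64_t stripe_unit,
				      uint64_t stripe_count)
{
	uint64_t obj_size;

	assert(obj_order < 64);
	obj_size = (uint64_t)1 << obj_order;

	return stripe_unit == obj_size && stripe_count == 1;
}
```

An image for which this returns false would be rejected at map time,
even though the feature bit itself is accepted.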
This resolves:
http://tracker.ceph.com/issues/4709
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Callers of rbd_obj_method_sync() don't know how many bytes of data
got returned by the class method call. As a result, they have been
assuming enough got returned to decode whatever was expected.
This isn't safe. We know how many bytes got transferred, so have
rbd_obj_method_sync() return that amount (rather than just 0) if
the call is successful.
Change all callers to use this return value to ensure decoding of
the results is done safely.
On the other hand, most callers of rbd_obj_method_sync() only
indicate success or failure, so all of *their* callers can simply
test for non-zero result.
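The pattern for a caller that decodes the result can be sketched in
user space as below. Both functions are hypothetical stand-ins:
model_method_sync() only copies a canned reply and returns the byte
count, and the use of -ERANGE for a short reply is an assumption of
this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a class method call: copies up to
 * buf_len bytes of a canned reply into buf and returns the number
 * of bytes copied.
 */
static int model_method_sync(const void *reply, size_t reply_len,
			     void *buf, size_t buf_len)
{
	size_t len = reply_len < buf_len ? reply_len : buf_len;

	assert(reply && buf);
	memcpy(buf, reply, len);

	return (int)len;
}

/* Decode a little-endian 64-bit size, but only if enough bytes arrived. */
static int model_get_size(const void *reply, size_t reply_len,
			  uint64_t *size)
{
	unsigned char buf[sizeof(uint64_t)];
	uint64_t value = 0;
	int ret;
	int i;

	ret = model_method_sync(reply, reply_len, buf, sizeof(buf));
	if (ret < 0)
		return ret;
	if ((size_t)ret < sizeof(buf))
		return -ERANGE;	/* short reply; don't decode garbage */

	for (i = (int)sizeof(buf) - 1; i >= 0; i--)
		value = (value << 8) | buf[i];
	*size = value;

	return 0;
}
```

Note that model_get_size() itself still returns only 0 or a negative
error, so its callers can keep testing for a non-zero result.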
This resolves:
http://tracker.ceph.com/issues/4773
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Make the inbound and outbound data parameters have void rather than
character type for rbd_obj_method_sync(). This makes it more clear
they don't expect typed data, and eliminates the need for some silly
type casts.
One more unrelated change: define the features buffer used in
_rbd_dev_v2_snap_features() to be a packed data structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Make the buf parameter into which the data is to be read have type
void pointer.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A clone image has a defined overlap point with its parent image.
That is the byte offset beyond which the parent image has no
defined data to back the clone, and anything thereafter can be
viewed as being zero-filled by the clone image.
This is needed because a clone image can be resized. If it gets
resized larger than the snapshot it is based on, the overlap defines
the original size. If the clone gets resized downward below the
original size the new clone size defines the overlap. If the clone
is subsequently resized to be larger, the overlap won't be increased
because the previous resize invalidated any parent data beyond that
point.
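The resize rule can be modeled in a few lines. This is only an
illustration of the semantics described here, not code from the
driver; struct clone_model and the clone_model_* functions are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a clone: its current size and parent overlap. */
struct clone_model {
	uint64_t size;
	uint64_t overlap;
};

/* A new clone initially overlaps the whole parent snapshot. */
static void clone_model_init(struct clone_model *c, uint64_t snap_size)
{
	c->size = snap_size;
	c->overlap = snap_size;
}

/* Resizing can shrink the overlap but never grows it. */
static void clone_model_resize(struct clone_model *c, uint64_t new_size)
{
	assert(c);
	c->size = new_size;
	if (new_size < c->overlap)
		c->overlap = new_size;
}

/* Number of bytes of [offset, offset + length) backed by the parent. */
static uint64_t clone_model_parent_bytes(const struct clone_model *c,
					 uint64_t offset, uint64_t length)
{
	if (offset >= c->overlap)
		return 0;
	if (length > c->overlap - offset)
		return c->overlap - offset;

	return length;
}
```

For example, a clone of a 100-byte snapshot that is resized to 200,
then 50, then 150 bytes ends up with an overlap of 50; anything past
that offset reads as zeroes unless the clone has written it.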
This resolves:
http://tracker.ceph.com/issues/4724
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This implements the main copyup functionality for layered writes.
Here we add a copyup_pages field to the object request, which is
used only for copyup requests to keep track of the page array
containing data read from the parent image.
A copyup request is currently the only request rbd has that requires
two osd operations. Because of this we handle copyup specially.
All image object requests get an osd request allocated when they are
created. For a write request, if a copyup is required, the osd
request originally allocated is released, and a new one (with room
for two osd ops) is allocated to replace it. A new function
rbd_osd_req_create_copyup() allocates an osd request suitable for
a copyup request.
The first op is then filled with a copyup object class method call,
supplying the array of pages containing data read from the parent.
The second op is filled in with the original write request.
The original request otherwise remains intact, and it describes the
original write request (found in the second osd op). The presence
of the copyup op is sort of implicit; a non-null copyup_pages field
could be used to distinguish between a "normal" write request and a
request containing both a copyup call and a write.
This resolves:
http://tracker.ceph.com/issues/3419
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
As a step toward implementing layered writes, implement reading the
data for a target object from the parent image for a write request
whose target object is known to not exist. Add a copyup_pages field
to an image request to track the page array used (only) for such a
request.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If an rbd disk is open when an rbd resize is done, the new size is
not visible to the filesystem. As is done in the virtio-blk and dm
drivers, call revalidate_disk() to update the bd_inode size.
Signed-off-by: Laurent Barbe <laurent@ksperis.com>
Reviewed-by: Alex Elder <elder@inktank.com>
This patch adds the ability to build an image request whose data
will be written from or read into memory described by a page array.
(Previously only bio lists were supported.)
Originally this was going to define a new function for this purpose
but it was largely identical to the rbd_img_request_fill_bio(). So
instead, rbd_img_request_fill_bio() has been generalized to handle
both types of image request.
For the moment we still only fill image requests with bio data.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new function zero_pages() that zeroes a range of memory
defined by a page array, along the lines of zero_bio_chain(). It
saves and restores the irq flags like bvec_kmap_irq() does, though
I'm not sure at this point that it's necessary.
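The page-walking part of such a function can be sketched in user
space as follows. Here a "page" is just a pointer to a 4096-byte
buffer, and no mapping or irq handling is modeled; the MODEL_* names
and model_zero_pages() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		((size_t)1 << MODEL_PAGE_SHIFT)

/*
 * Zero bytes [offset, end) of the memory described by pages[], where
 * each entry points to a MODEL_PAGE_SIZE buffer.
 */
static void model_zero_pages(unsigned char **pages, uint64_t offset,
			     uint64_t end)
{
	unsigned char **page = &pages[offset >> MODEL_PAGE_SHIFT];

	assert(end > offset);
	while (offset < end) {
		size_t page_offset = (size_t)(offset & (MODEL_PAGE_SIZE - 1));
		size_t length = MODEL_PAGE_SIZE - page_offset;

		/* Don't run past the end of the requested range. */
		if (length > end - offset)
			length = (size_t)(end - offset);
		memset(*page + page_offset, 0, length);
		offset += length;
		page++;
	}
}
```

Only the first and last pages in the range can be partially zeroed;
every page in between is cleared in full.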
Update rbd_img_obj_request_read_callback() to use the new function
if the object request contains page rather than bio data.
For the moment, only bio data is used for osd READ ops.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Object requests that are part of an image request are subject to
some additional handling. Define rbd_img_obj_request_submit() to
encapsulate that, and use it when initially submitting an image
object request, and when re-submitting it during callback of
an object existence check.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Separate rbd_osd_req_format() into two functions, one for read
requests and the other for write requests.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This is a step toward fully implementing layered writes.
Add checks before request submission for the object(s) associated
with an image request. For write requests, if we don't know that
the target object exists, issue a STAT request to find out. When
that request completes, mark the known and exists flags for the
original object request accordingly and re-submit the object
request. (Note that this still does the existence check only; the
copyup operation is not yet done.)
A new object request is created to perform the existence check. A
pointer to the original request is added to that object request to
allow the stat request to re-issue the original request after
updating its flags. If there is a failure with the stat request
the error code is stored with the original request, which is then
completed.
This resolves:
http://tracker.ceph.com/issues/3418
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This creates two new flags for object requests to indicate what is
known about the existence of the object to which a request is to be
sent. The KNOWN flag will be true if the EXISTS flag is
meaningful. That is:
    KNOWN EXISTS
    ----- ------
      0      0    don't know whether the object exists
      0      1    (not used/invalid)
      1      0    object is known to not exist
      1      1    object is known to exist
This will be used in determining how to handle write requests for
data objects for layered rbd images.
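One way to keep the two flags consistent is to only ever set them
together, as in this sketch. The flag names, struct obj_req_model,
and the helper functions are hypothetical and exist only for this
example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for an object request. */
enum obj_req_model_flags {
	OBJ_REQ_MODEL_KNOWN	= 1U << 0,
	OBJ_REQ_MODEL_EXISTS	= 1U << 1,
};

struct obj_req_model {
	unsigned int flags;
};

/* Record the result of an existence check; KNOWN is always set. */
static void obj_req_model_existence_set(struct obj_req_model *req,
					bool exists)
{
	req->flags &= ~(unsigned int)OBJ_REQ_MODEL_EXISTS;
	if (exists)
		req->flags |= OBJ_REQ_MODEL_EXISTS;
	req->flags |= OBJ_REQ_MODEL_KNOWN;
}

static bool obj_req_model_known(const struct obj_req_model *req)
{
	return (req->flags & OBJ_REQ_MODEL_KNOWN) != 0;
}

/* EXISTS is only meaningful once KNOWN is set. */
static bool obj_req_model_exists(const struct obj_req_model *req)
{
	assert(obj_req_model_known(req));

	return (req->flags & OBJ_REQ_MODEL_EXISTS) != 0;
}
```

Because EXISTS is never set without KNOWN, the invalid combination
cannot be produced through these helpers.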
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In a few spots, whether an object request's img_request pointer
is null is used to determine whether an object request is being done
as part of an image data request.
Stop doing that, and instead always use the object request IMG_DATA
flag for that purpose. Swap the order of the definition of the
IMG_DATA and DONE flag helpers, because obj_request_done_set() now
refers to obj_request_img_data_set() to get its rbd_dev value.
This will become important because the img_request pointer is
about to become part of a union.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An extra reference is taken when an object request is added as one
of the requests making up an image request. A reference is dropped
again when the image's object requests get submitted.
The original reference for the object request will remain throughout
this period, so we don't need to add and then take away an extra
one.
This can be interpreted as the image request inheriting the original
object request's reference.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In the incremental move toward supporting distinct data items in an
osd request some of the functions had "write_request" parameters to
indicate, basically, whether the data belonged to in_data or
out_data. Now that we maintain the data fields in the op structure
there is no need to indicate the direction, so get rid of the
"write_request" parameters.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Implement layered read requests for format 2 rbd images.
If an rbd image is a clone of a snapshot, the snapshot will be the
clone's "parent" image. When an object read request on a clone
comes back with ENOENT it indicates that the clone is not yet
populated with that portion of the image's data, and the parent
image should be consulted to satisfy the read.
When this occurs, a new image request is created, directed to the
parent image. The offset and length of the new request are the same as
the image-relative offset and length of the object request that
produced ENOENT. Data from the parent image therefore satisfies the
object read request for the original image request.
While this code works, it will not be active until we enable the
layering feature (by adding RBD_FEATURE_LAYERING to the value of
RBD_FEATURES_SUPPORTED).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Call the probe function for the parent device if one is present.
Since we don't formally support the layering feature we won't
be using this functionality just yet.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add a flag to distinguish between object requests being done on
standalone objects and requests being sent for objects representing
rbd image data (i.e., object requests that are the result of image
request).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
We're going to need some more Boolean values for object requests,
so create a flags bit field and use it to record whether the request
is done.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Encapsulate the code that completes processing of an object request
that's part of an image request.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a flag indicating whether an image request is for a layered
image (one with a parent image to which requests will be redirected
if the target object of a request does not exist). The code that
checks this flag will be added shortly.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a flag indicating whether an image request originated from
the Linux block layer (from blk_fetch_request()) or whether it was
initiated in order to satisfy an object request for a child image
of a layered rbd device. For image requests initiated by objects of
child images we'll save a pointer to the object request rather than
the Linux block request.
For now, only block requests are used.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are several Boolean values we'll be maintaining for image
requests. Switch from the single write_request field to a
general-purpose flags field, and use one of its bits to represent
the direction of I/O for the image request. Define helper functions
for setting and testing that flag.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
For an image object request we will need to know what offset within
the rbd image the request covers. Record that when the object
request gets created.
Update the I/O error warnings to report this offset, which makes
them more informative.
Rename a local variable to fit the convention used everywhere else.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Compute the total number of bytes transferred for an image
request--the sum across each of the request's object requests.
To avoid contention do it only when all object requests are
complete, in rbd_img_request_complete().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
If any image object request produces a non-zero result, preserve
that as the result of the overall image request. If multiple
objects have non-zero results, save only the first one.
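The "first non-zero result wins" rule amounts to the following fold
over the per-object results. The function name and the array-based
interface are hypothetical; in the driver the results arrive one at a
time as object requests complete.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fold per-object results into one image-level result: zero if every
 * object request succeeded, otherwise the first non-zero result seen.
 */
static int model_img_result(const int *obj_results, size_t count)
{
	int result = 0;
	size_t i;

	assert(obj_results || !count);
	for (i = 0; i < count; i++)
		if (!result && obj_results[i])
			result = obj_results[i];

	return result;
}
```

Later errors are dropped on purpose, so the reported result points at
the earliest failure.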
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is a new rbd feature bit defined for "fancy striping." Add
it to the ones defined in the kernel client.
Change RBD_FEATURES_ALL so it represents the set of all feature
bits (rather than just the ones we support). Define a new symbol
RBD_FEATURES_SUPPORTED to indicate the supported ones.
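The distinction between "all known" and "supported" feature bits can
be sketched as below. The MODEL_* names are hypothetical, the bit
positions are assumptions to be checked against the rbd headers, and
returning -ENXIO for an unsupported image is likewise an assumption
of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed bit positions; verify against the rbd headers. */
#define MODEL_FEATURE_LAYERING		(1ULL << 0)
#define MODEL_FEATURE_STRIPINGV2	(1ULL << 1)

/* Every feature bit the client knows about, supported or not. */
#define MODEL_FEATURES_ALL \
	(MODEL_FEATURE_LAYERING | MODEL_FEATURE_STRIPINGV2)

/*
 * Accept an image only if every feature it requires is in the
 * supported set.  The supported set must be a subset of the known
 * bits.
 */
static int model_check_features(uint64_t image_features,
				uint64_t supported)
{
	assert((supported & ~MODEL_FEATURES_ALL) == 0);
	if (image_features & ~supported)
		return -ENXIO;

	return 0;
}
```

Passing the supported set as a parameter keeps the sketch neutral
about which features are enabled at any given point in the series.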
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Right now the data for a method call is specified via a pointer and
length, and it's copied--along with the class and method name--into
a pagelist data item to be sent to the osd. Instead, encode the
data in a data item separate from the class and method names.
This will allow large amounts of data to be supplied to methods
without copying. Only rbd uses the class functionality right now,
and when it really needs this it will probably need to use a page
array rather than a page list. But this simple implementation
demonstrates the functionality on the osd client, and that's enough
for now.
This resolves:
http://tracker.ceph.com/issues/4104
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This ends up being a rather large patch but what it's doing is
somewhat straightforward.
Basically, this is replacing two calls with one. The first of the
two calls is initializing a struct ceph_osd_data with data (either a
page array, a page list, or a bio list); the second is setting an
osd request op so it associates that data with one of the op's
parameters. In place of those two will be a single function that
initializes the op directly.
That means we sort of fan out a set of the needed functions:
- extent ops with pages data
- extent ops with pagelist data
- extent ops with bio list data
and
- class ops with page data for receiving a response
We also have to define another one, but it's only used internally:
- class ops with pagelist data for request parameters
Note that we *still* haven't gotten rid of the osd request's
r_data_in and r_data_out fields. All the osd ops refer to them for
their data. For now, these data fields are pointers assigned to the
appropriate r_data_* field when these new functions are called.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch just trivially moves around some code for consistency.
In preparation for initializing osd request data fields in
ceph_osdc_build_request(), I wanted to verify that rbd did in fact
call that immediately before it called ceph_osdc_start_request().
It was true (although image requests are built in a group and then
started as a group). But I made the changes here just to make
it more obvious, by making all of the calls follow a common
sequence:
    osd_req_op_<optype>_init();
    ceph_osd_data_<type>_init();
    osd_req_op_<optype>_<datafield>();
    rbd_osd_req_format();
    ...
    ret = rbd_obj_request_submit();
I moved the initialization of the callback for image object requests
into rbd_img_request_fill_bio(), again, for consistency. To avoid
a forward reference, I moved the definition of rbd_img_obj_callback()
up in the file.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The osd data for a request is currently initialized inside
rbd_osd_req_create(), but that assumes an object request's data
belongs in the osd request's data in or data out field.
There are only three places where requests with data are set up, and
it turns out it's easier to call just the osd data init routines
directly there rather than handling it in rbd_osd_req_create().
(The real motivation here is moving toward getting rid of the
osd request in and out data fields.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently an object request has its osd request's data field set in
rbd_osd_req_format_op(). That assumes a single osd op per object
request, and that won't be the case for long.
Move the code that sets this out and into the caller.
Rename rbd_osd_req_format_op() to be just rbd_osd_req_format(),
removing the notion that it's doing anything op-specific.
This and the next patch resolve:
http://tracker.ceph.com/issues/4658
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the osd request's array.
So rather than supplying the address of the op to initialize, have
the caller specify the osd request and an indication of which op it
would like to initialize. This better hides the details of the
op structure (and facilitates moving the data pointers they use).
Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope. Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).
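The calling convention described above can be sketched in userspace C
as follows. This is only an illustration, not the libceph code: the
sketch_* names, the field layout, and the two-entry limit are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_OPS 2	/* assumed size of the embedded op array */

struct sketch_op {
	unsigned short opcode;
	unsigned long long offset;
	unsigned long long length;
};

struct sketch_request {
	unsigned int num_ops;
	struct sketch_op ops[SKETCH_MAX_OPS];	/* source ops live in the request */
};

/* Common init: the caller names the request and an index, never an op address. */
static struct sketch_op *sketch_op_init(struct sketch_request *req,
					unsigned int which,
					unsigned short opcode)
{
	struct sketch_op *op;

	assert(which < req->num_ops);
	op = &req->ops[which];
	memset(op, 0, sizeof(*op));
	op->opcode = opcode;

	return op;	/* returned so type-specific helpers need not look it up again */
}

/* A type-specific helper built on top of the common routine. */
static void sketch_op_extent_init(struct sketch_request *req,
				  unsigned int which, unsigned short opcode,
				  unsigned long long offset,
				  unsigned long long length)
{
	struct sketch_op *op = sketch_op_init(req, which, opcode);

	op->offset = offset;
	op->length = length;
}
```

Because callers only ever pass an index, the layout of the op structure
can change without touching them.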
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message. Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.
Add a ceph_osd_data pointer to each of those structures, and assign
it to point to either the incoming or the outgoing data structure in
the osd message. The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.
Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient. Add some assertions to verify
pointers are always set the way they're expected to be.
This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.
This is the first in a series of patches that resolve:
http://tracker.ceph.com/issues/4657
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request keeps a pointer to the osd operations (ops) array
that it builds in its request message.
In order to allow each op in the array to have its own distinct
data, we will need to keep track of each op's data, and that
information does not go over the wire.
As long as we're tracking the data we might as well just track the
entire (source) op definition for each of the ops. And if we're
doing that, we'll have no more need to keep a pointer to the
wire-encoded version.
This patch makes the array of source ops be kept with the osd
request structure, and uses that instead of the version encoded in
the message in places where that was previously used. The array
will be embedded in the request structure, and the maximum number of
ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP
to 2 to reduce the size of the structure.
The result of doing this sort of ripples back up, and as a result
various function parameters and local variables become unnecessary.
Make r_num_ops be unsigned, and move the definition of struct
ceph_osd_req_op earlier to ensure it's defined where needed.
This patch does not yet add per-op data; that's coming soon.
This resolves:
http://tracker.ceph.com/issues/4656
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define rbd_osd_req_format_op(), which encapsulates formatting
an osd op into an object request's osd request message. Only
one op is supported right now.
Stop calling ceph_osdc_build_request() in rbd_osd_req_create().
Instead, call rbd_osd_req_format_op() in each of the callers of
rbd_osd_req_create().
This is to prepare for the next patch, in which the source ops for
an osd request will be held in the osd request itself. Because of
that, we won't have the source op to work with until after the
request is created, so we can't format the op until then.
This and the next patch resolve:
http://tracker.ceph.com/issues/4656
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use functions that encapsulate the initialization of a
ceph_osd_data structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When rbd creates an object request containing an object method call
operation it is passing 0 for the size. I originally thought this
was because the length was not needed for method calls, but I think
it really should be supplied, to describe how much space is
available to receive response data. So provide the supplied length.
This resolves:
http://tracker.ceph.com/issues/4659
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When assigning a bio pointer to an osd request, we don't have an
efficient way of knowing the total length in bytes of the bio list.
That information is available at the point it's set up by the rbd
code, so record it with the osd data when it's set.
This and the next patch are related to maintaining the length of a
message's data independent of the message header, as described here:
http://tracker.ceph.com/issues/4589
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd code has a function that allocates and populates a
ceph_osd_req_op structure (the in-core version of an osd request
operation). When reviewed, Josh suggested two things: that the
big varargs function might be better split into type-specific
functions; and that this functionality really belongs in the osd
client rather than rbd.
This patch implements both of Josh's suggestions. It breaks
up the rbd function into separate functions and defines them
in the osd client module as exported interfaces. Unlike the
rbd version, however, the functions don't allocate an osd_req_op
structure; they are provided the address of one and that is
initialized instead.
The rbd function has been eliminated and calls to it have been
replaced by calls to the new routines. The rbd code now uses a
stack (struct) variable to hold the op rather than allocating and
freeing it each time.
For now only the capabilities used by rbd are implemented.
Implementing all the other osd op types, and making the rest of the
code use it will be done separately, in the next few patches.
Note that only the extent, cls, and watch portions of the
ceph_osd_req_op structure are currently used. Delete the others
(xattr, pgls, and snap) from its definition so nobody thinks it's
actually implemented or needed. We can add it back again later
if needed, when we know it's been tested.
This (and a few follow-on patches) resolves:
http://tracker.ceph.com/issues/3861
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move some definitions for max integer values out of the rbd code and
into the more central "decode.h" header file. These really belong
in a Linux (or libc) header somewhere, but I haven't gotten around
to proposing that yet.
This is in preparation for moving some code out of rbd.c and into
the osd client.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The length of outgoing data in an osd request is dependent on the
osd ops that are embedded in that request. Each op is encoded into
a request message using osd_req_encode_op(), so that should be used
to determine the amount of outgoing data implied by the op as it
is encoded.
Have osd_req_encode_op() return the number of bytes of outgoing data
implied by the op being encoded, and accumulate and use that in
ceph_osdc_build_request().
As a result, ceph_osdc_build_request() no longer requires its "len"
parameter, so get rid of it.
Using the sum of the op lengths rather than the length provided is
a valid change because:
- The only callers of ceph_osdc_build_request() are
rbd and the osd client (in ceph_osdc_new_request() on
behalf of the file system).
- When rbd calls it, the length provided is only non-zero for
write requests, and in that case the single op has the
same length value as what was passed here.
- When called from ceph_osdc_new_request(), (it's not all that
easy to see, but) the length passed is also always the same
as the extent length encoded in its (single) write op if
present.
This resolves:
http://tracker.ceph.com/issues/4406
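A rough userspace sketch of the accumulation described above follows.
It is not the osd client code: the sketch_* names and opcodes are
hypothetical, the wire encoding itself is omitted, and only write ops
are assumed to imply outgoing data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_opcode { SKETCH_OP_READ, SKETCH_OP_WRITE, SKETCH_OP_STAT };

struct sketch_op {
	enum sketch_opcode code;
	uint64_t length;	/* extent length for read and write ops */
};

/* Stand-in for encoding one op; returns the outgoing data bytes it implies. */
static uint64_t sketch_encode_op(const struct sketch_op *op)
{
	/* (wire encoding of the op would happen here) */
	return op->code == SKETCH_OP_WRITE ? op->length : 0;
}

/* Builds the request and derives the data length from the ops themselves. */
static uint64_t sketch_build_request(const struct sketch_op *ops, size_t num_ops)
{
	uint64_t data_len = 0;
	size_t i;

	for (i = 0; i < num_ops; i++)
		data_len += sketch_encode_op(&ops[i]);

	return data_len;
}
```

With the length derived this way, a separate "len" argument can only
duplicate (or contradict) what the ops already say.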
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.
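As an illustration, the derivation of a page count from a byte range
might look like the sketch below. It is a userspace example, not the
kernel helper; sketch_pages_for() is a hypothetical name and 4 KiB
pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Number of pages touched by the byte range [off, off + len). */
static uint64_t sketch_pages_for(uint64_t off, uint64_t len)
{
	if (len == 0)
		return 0;

	return ((off + len - 1) >> SKETCH_PAGE_SHIFT) -
	       (off >> SKETCH_PAGE_SHIFT) + 1;
}
```

A one-byte request and a 4096-byte request at offset 0 both occupy one
page, so the byte count cannot be recovered from the page count alone.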
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.
Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.
This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.
This resolves:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.
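A tagged union of this kind could be sketched as below. The type and
field names are hypothetical stand-ins chosen for the example, not the
definitions added by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct page and struct bio. */
struct sketch_page;
struct sketch_bio;

enum sketch_data_type {
	SKETCH_DATA_NONE = 0,
	SKETCH_DATA_PAGES,
	SKETCH_DATA_BIO,
};

struct sketch_osd_data {
	enum sketch_data_type type;	/* selects the valid union member */
	union {
		struct {
			struct sketch_page **pages;
			size_t num_pages;
		} p;
		struct sketch_bio *bio;
	} u;
};

static void sketch_data_pages_init(struct sketch_osd_data *d,
				   struct sketch_page **pages, size_t n)
{
	d->type = SKETCH_DATA_PAGES;
	d->u.p.pages = pages;
	d->u.p.num_pages = n;
}

static void sketch_data_bio_init(struct sketch_osd_data *d,
				 struct sketch_bio *bio)
{
	d->type = SKETCH_DATA_BIO;
	d->u.bio = bio;
}

/* Returns 1 if data is attached, reading only the member the tag selects. */
static int sketch_data_present(const struct sketch_osd_data *d)
{
	switch (d->type) {
	case SKETCH_DATA_PAGES:
		return d->u.p.pages != NULL;
	case SKETCH_DATA_BIO:
		return d->u.bio != NULL;
	default:
		return 0;
	}
}
```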
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
It's possible that the reference to the object request dropped
inside the loop in rbd_img_request_submit() will be the last
one, in which case the content of the object pointer can't be
trusted.
Use a safe form of the object request list traversal to avoid
problems.
This resolves:
http://tracker.ceph.com/issues/4705
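The hazard and the safe traversal can be sketched in userspace C as
follows. The list here is a simple hand-rolled one with hypothetical
sketch_* names; the kernel code would use its own list helpers (such
as list_for_each_entry_safe()) rather than this loop.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_obj_req {
	struct sketch_obj_req *next;
	int refs;
};

static int sketch_freed;	/* counts destroyed requests, for illustration */

static void sketch_obj_req_put(struct sketch_obj_req *req)
{
	if (--req->refs == 0) {
		free(req);
		sketch_freed++;
	}
}

/* Build a list of n requests, each holding a single reference. */
static struct sketch_obj_req *sketch_list_create(int n)
{
	struct sketch_obj_req *head = NULL;

	while (n-- > 0) {
		struct sketch_obj_req *req = malloc(sizeof(*req));

		if (!req)
			abort();
		req->refs = 1;
		req->next = head;
		head = req;
	}

	return head;
}

/* Drop one reference on every entry; returns the number of entries visited. */
static int sketch_submit_all(struct sketch_obj_req *head)
{
	struct sketch_obj_req *req, *next;
	int visited = 0;

	for (req = head; req; req = next) {
		next = req->next;	/* read before the put may free req */
		sketch_obj_req_put(req);
		visited++;
	}

	return visited;
}
```

An unsafe loop would evaluate req->next after the put, which is a
use-after-free whenever that put drops the last reference.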
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A result of ENOENT from a read request for an object that's part of
an rbd image indicates that there is a hole in that portion of the
image. Similarly, a short read for such an object indicates that
the remainder of the read should be interpreted as a full read with
zeros filling out the end of the request.
This behavior is not correct for objects that are not backing rbd
image data. Currently rbd_img_obj_request_callback() assumes it
should be done for all objects.
Change rbd_img_obj_request_callback() so it only does this zeroing
for image objects. Encapsulate that special handling in its own
function. Add an assertion that the image object request is a bio
request, since we assume that (and we currently don't support any
other types).
This resolves a problem identified here:
http://tracker.ceph.com/issues/4559
The regression was introduced by bf0d5f503d.
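The special handling for image objects might be sketched like this in
userspace C. The function name, its signature, and the flat buffer are
assumptions for the example; the driver operates on bio data instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Fix up the outcome of reading `length` bytes of an image object into
 * buf.  `result` is 0 or a negative errno, and *xferred is the number of
 * bytes actually read.  Returns the adjusted result.
 */
static int sketch_img_obj_read_fixup(unsigned char *buf, size_t length,
				     int result, size_t *xferred)
{
	if (result == -ENOENT) {
		memset(buf, 0, length);		/* hole: whole range reads as zeros */
		*xferred = length;
		return 0;
	}
	if (result == 0 && *xferred < length) {
		/* short read: zero-fill the tail and report a full read */
		memset(buf + *xferred, 0, length - *xferred);
		*xferred = length;
	}

	return result;
}
```

Requests for objects that do not back image data would skip this
function and see ENOENT and short reads unchanged.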
Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bi_idx. Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.
For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Neil Brown <neilb@suse.de>
CC: Boaz Harrosh <bharrosh@panasas.com>
Use the new version of the encoding for osd requests and replies. In the
process, update the way we are tracking request ops and reply lengths and
results in the struct ceph_osd_request. Update the rbd and fs/ceph users
appropriately.
The main changes are:
- we keep pointers into the request memory for fields we need to update
each time the request is sent out over the wire
- we keep information about the result in an array in the request struct
where the users can easily get at it.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
The only thing type-specific osd completion functions do with their
osd op parameter is (in some cases) extract the number of bytes
transferred from it. In the other cases, the xferred bytes field
is not used, and total message data transfer byte count (which may
well be zero) is used.
Just set the object request transfer count in the main osd request
callback function and provide that to the other routines. There is
then no longer any need to pass the op pointer to the type-specific
completion routines, so drop those parameters.
Stop doing anything with the total message data length.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
This function is slightly out of place, probably the result
of an errant automatic merge or something.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Fengguang Wu reminded me that there were outstanding sparse reports
in the ceph and rbd code. This patch fixes these problems in rbd
that lead to those reports:
- Convert functions that are never referenced externally to have
static scope.
- Add a lockdep annotation to rbd_request_fn(), because it
releases a lock before acquiring it again.
This partially resolves:
http://tracker.ceph.com/issues/4184
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add dout() calls to facilitate tracing of image and object requests.
Change a few existing calls so they use __func__ rather than the
hard-coded function name. Have calls always add ":" after the name
of the function, and prefix pointer values with a consistent tag
indicating what it represents. (Note that there remain some older
dout() calls that are left untouched by this patch.)
Issue a warning if rbd_osd_write_callback() ever gets a short write.
This resolves:
http://tracker.ceph.com/issues/4235
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Let's go shopping!
I'm afraid this may not have gotten it right:
07741308 rbd: add barriers near done flag operations
The smp_wmb() should have been done *before* setting the done flag,
to ensure all other data was valid before marking the object request
done.
Switch to use atomic_inc_return() here to set the done flag, which
allows us to verify we don't mark something done more than once.
Doing this also implies general barriers before and after the call.
And although a read memory barrier might have been sufficient before
reading the done flag, convert this to a full memory barrier just
to put this issue to bed.
This resolves:
http://tracker.ceph.com/issues/4238
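The ordering requirement can be illustrated with C11 atomics in
userspace. This is an analogy under stated assumptions, not the kernel
code: the sketch_* names are hypothetical, and atomic_fetch_add()
returns the old value, whereas the kernel's atomic_inc_return() returns
the new one.

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_obj_req {
	int result;		/* ordinary data published by the completer */
	atomic_int done;	/* 0 until the request is marked done */
};

/* Completer side: fill in the data first, then mark the request done once. */
static void sketch_mark_done(struct sketch_obj_req *req, int result)
{
	int prev;

	req->result = result;
	/*
	 * Sequentially consistent read-modify-write: the store to
	 * req->result above is ordered before the flag update.
	 */
	prev = atomic_fetch_add(&req->done, 1);
	assert(prev == 0);	/* catches a request marked done twice */
}

/* Waiter side: load the flag with full ordering before reading the data. */
static int sketch_test_done(struct sketch_obj_req *req)
{
	return atomic_load(&req->done) != 0;
}
```

A waiter that sees sketch_test_done() return nonzero can then read
req->result and expect the value stored by the completer.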
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The old request code simply ignored zero-length requests. We should
still operate that same way to avoid any changes in behavior. We
can implement handling for special zero-length requests separately
(see http://tracker.ceph.com/issues/4236).
Add some assertions based on this new constraint.
This resolves:
http://tracker.ceph.com/issues/4237
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The return values provided for ceph_copy_to_page_vector() and
ceph_copy_from_page_vector() serve no purpose, so get rid of them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The result of ceph_copy_from_page_vector() is simply the length
argument it is provided.
This is called by rbd_obj_method_sync(), which returns the result if
it's non-negative. But we always either ignore or overwrite that
return value. So explicitly ignore what's returned by the copy
function, and have rbd_obj_method_sync() always return either a
negative errno or 0.
We also return the result of ceph_copy_from_page_vector() in
rbd_obj_read_sync(). There we still want to return the number of
bytes transferred, but we can use the value we already have in hand
rather than what ceph_copy_from_page_vector() provides.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_obj_read_sync(), verify the number of bytes transferred won't
exceed what can be represented by a size_t before using it to
indicate the number of bytes to copy to the result buffer.
(The real motivation for this is to prepare for the next patch.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add support for CEPH_OSD_OP_STAT operations in the osd client
and in rbd.
This operation sends no data to the osd; everything required is
encoded in the identity of the target object.
The result will be ENOENT if the object doesn't exist. If it does
exist and no other error occurs the server returns the size and last
modification time of the target object as output data (in little
endian format). The size is a 64-bit unsigned integer and the time is
a ceph_timespec structure (two unsigned 32-bit integers, representing
seconds and nanoseconds values).
This resolves:
http://tracker.ceph.com/issues/4007
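Decoding such a reply could look like the following userspace sketch.
It assumes the size comes first, followed by seconds and then
nanoseconds, with no padding; the sketch_* names are hypothetical and
this is not the decoding code used by rbd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_stat {
	uint64_t size;
	uint32_t mtime_sec;
	uint32_t mtime_nsec;
};

static uint32_t sketch_get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t sketch_get_le64(const unsigned char *p)
{
	return (uint64_t)sketch_get_le32(p) |
	       (uint64_t)sketch_get_le32(p + 4) << 32;
}

/* Returns 0 on success, or -1 if the buffer is shorter than 16 bytes. */
static int sketch_decode_stat(const unsigned char *buf, size_t len,
			      struct sketch_stat *out)
{
	if (len < 16)
		return -1;

	out->size = sketch_get_le64(buf);		/* assumed offset 0 */
	out->mtime_sec = sketch_get_le32(buf + 8);	/* assumed offset 8 */
	out->mtime_nsec = sketch_get_le32(buf + 12);	/* assumed offset 12 */

	return 0;
}
```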
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The for_each_obj_request*() macros should parenthesize their uses of
the ireq parameter.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller of ceph_osdc_create_event(), and it
provides 0 as its "one_shot" argument. Get rid of that argument and
just use 0 in its place.
Replace the code in handle_watch_notify() that executes if one_shot
is nonzero in the event with a BUG_ON() call.
While modifying "osd_client.c", give handle_watch_notify() static
scope.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Somehow, I missed this little item in Documentation/atomic_ops.txt:
*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
Create and use some helper functions that include the proper memory
barriers for manipulating the done field.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This commit:
bc7a62ee5 rbd: prevent open for image being removed
added checking for removing rbd before allowing an open, and used
the same request spinlock for protecting that and updating the open
count as is used for the request queue.
However it used the non-irq protected version of the spinlocks.
Fix that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is a check in the completion path for osd requests that
ensures the number of pages allocated is enough to hold the amount
of incoming data expected.
For bio requests coming from rbd the "number of pages" is not really
meaningful (although total length would be). So stop requiring that
nr_pages be supplied for bio requests. This is done by checking
whether the pages pointer is null before checking the value of
nr_pages.
Note that this value is passed on to the messenger, but there it's
only used for debugging--it's never used for validation.
While here, change another spot that used r_pages in a debug message
inappropriately, and also invalidate the r_con_filling_msg pointer
after dropping a reference to it.
This resolves:
http://tracker.ceph.com/issues/3875
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently, if the OSD client finds an osd request has had a bio list
attached to it, it drops a reference to it (or rather, to the first
entry on that list) when the request is released.
The code that added that reference (i.e., the rbd client) is
therefore required to take an extra reference to that first bio
structure.
The osd client doesn't really do anything with the bio pointer other
than transfer it from the osd request structure to outgoing (for
writes) and ingoing (for reads) messages. So it really isn't the
right place to be taking or dropping references.
Furthermore, the rbd client already holds references to all bio
structures it passes to the osd client, and holds them until the
request is completed. So there's no need for this extra reference
whatsoever.
So remove the bio_put() call in ceph_osdc_release_request(), as
well as its matching bio_get() call in rbd_osd_req_create().
This change could lead to a crash if old libceph.ko was used with
new rbd.ko. Add a compatibility check at rbd initialization time to
avoid this possibility.
This resolves:
http://tracker.ceph.com/issues/3798 and
http://tracker.ceph.com/issues/3799
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An open request for a mapped rbd image can arrive while removal of
that mapping is underway. We need to prevent such an open request
from succeeding. (It appears that Maciej Galkiewicz ran into this
problem.)
Define and use a "removing" flag to indicate a mapping is getting
removed. Set it in the remove path after verifying nothing holds
the device open. And check it in the open path before allowing the
open to proceed. Acquire the rbd device's lock around each of these
spots to avoid any races accessing the flags and open_count fields.
This addresses:
http://tracker.newdream.net/issues/3427
Reported-by: Maciej Galkiewicz <maciejgalkiewicz@ragnarson.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a new rbd device flags field, manipulated using bit
operations. Replace the use of the current "exists" flag with a bit
in this new "flags" field. Add a little commentary about the
"exists" flag, which does not need to be manipulated atomically.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When we register an osd request to linger, it means that request
will stay around (under control of the osd client) until we've
unregistered it. We do that for an rbd image's header object, and
we keep a pointer to the object request associated with it.
Keep a reference to the watch object request for as long as it is
registered to linger. Drop it again after we've removed the linger
registration.
This resolves:
http://tracker.ceph.com/issues/3937
(Note: this originally came about because the osd client was
issuing a callback more than once. But that behavior will be
changing soon, documented in tracker issue 3967.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Decrement the obj_request_count value when deleting an object
request from its image request's list. Rearrange a few lines
in the surrounding code.
This resolves:
http://tracker.ceph.com/issues/3940
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Switch to keeping track of the object request pointer rather than
the osd request used to watch the rbd image header object.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move the code that unregisters an rbd device's lingering header
object watch request into rbd_dev_header_watch_sync(), so it
occurs in the same function that originally sets up that request.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get rid of rbd_req_sync_exec() because it is no longer used. That
eliminates the last use of rbd_req_sync_op(), so get rid of that
too. And finally, that leaves rbd_do_request() unreferenced, so get
rid of that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reimplement synchronous object method calls using the new request
tracking code. Use the name rbd_obj_method_sync().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When we receive notification of a change to an rbd image's header
object we need to refresh our information about the image (its
size and snapshot context). Once we have refreshed our rbd image
we need to acknowledge the notification.
This acknowledgement was previously done synchronously, but there's
really no need to wait for it to complete.
Change it so the caller doesn't wait for the notify acknowledgement
request to complete. And change the name to reflect it's no longer
synchronous.
This resolves:
http://tracker.newdream.net/issues/3877
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get rid of rbd_req_sync_notify_ack() because it is no longer used.
As a result rbd_simple_req_cb() becomes unreferenced, so get rid
of that too.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use the new object request tracking mechanism for handling a
notify_ack request.
Move the callback function below the definition of this so we don't
have to do a pre-declaration.
This resolves:
http://tracker.newdream.net/issues/3754
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Get rid of rbd_req_sync_watch(), because it is no longer used.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Implement a new function to set up or tear down a watch event
for a mapped rbd image header using the new request code.
Create a new object request type "nodata" to handle this. And
define rbd_osd_trivial_callback() which simply marks a request done.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
rbd_req_sync_read() is no longer used, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reimplement the synchronous read operation used for reading a
version 1 header using the new request tracking code. Name the
resulting function rbd_obj_read_sync() to better reflect that
it's a full object operation, not an object request. To do this,
implement a new OBJ_REQUEST_PAGES object request type.
This implements a new mechanism to allow the caller to wait for
completion for an rbd_obj_request by calling rbd_obj_request_wait().
This partially resolves:
http://tracker.newdream.net/issues/3755
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The two remaining callers of rbd_do_request() always pass a null
collection pointer, so the "coll" and "coll_index" parameters are
not needed. There is no other use of that data structure, so it
can be eliminated.
Deleting them means there is no need to allocate a rbd_request
structure for the callback function. And since that's the only use
of *that* structure, it too can be eliminated.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Now that the request function has been replaced by one using the new
request management data structures the old one can go away.
Deleting it makes rbd_dev_do_request() no longer needed, and
deleting that makes other functions unneeded, and so on.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch fully implements the new request tracking code for rbd
I/O requests.
Each I/O request to an rbd image will get an rbd_image_request
structure allocated to track it. This provides access to all
information about the original request, as well as access to the
set of one or more object requests that are initiated as a result
of the image request.
An rbd_obj_request structure defines a request sent to a single osd
object (possibly) as part of an rbd image request. An rbd object
request refers to a ceph_osd_request structure built up to represent
the request; for now it will contain a single osd operation. It
also provides space to hold the result status and the version of the
object when the osd request completes.
An rbd_obj_request structure can also stand on its own. This will
be used for reading the version 1 header object, for issuing
acknowledgements to event notifications, and for making object
method calls.
All rbd object requests now complete asynchronously with respect
to the osd client--they supply a common callback routine.
This resolves:
http://tracker.newdream.net/issues/3741
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When an rbd image is initially mapped a watch event is registered so
we can do something if the header object changes.
The code that does this currently loops if initiating the watch
request results in an ERANGE error. The osds will never return
ERANGE, so there's no reason for this loop. Get rid of it.
This resolves:
http://tracker.newdream.net/issues/3860
Note that the problem this loop was intended to solve is a race
between collecting image header information and setting up the watch
on the header object. The real fix for that problem is described
here:
http://tracker.newdream.net/issues/3871
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The return type of rbd_get_num_segments() is int, but the values it
operates on are u64. Although it's not likely, there's no guarantee
the result won't exceed what can be represented in an int. The
function is already designed to return -ERANGE on error, so just add
this possible overflow as another reason to return that.
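The check can be sketched in userspace C roughly as follows. The
function name and signature are hypothetical (the real function takes
different arguments), and the handling of an empty range is an
assumption; only the final range check is what this patch describes.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for rbd_get_num_segments(): count the objects
 * ("segments") touched by the byte range [ofs, ofs + len) when each
 * object is (1 << obj_order) bytes long.
 */
static int num_segments_sketch(unsigned int obj_order, uint64_t ofs,
			       uint64_t len)
{
	uint64_t start_seg, end_seg, result;

	assert(obj_order < 64);		/* larger shifts are undefined */
	if (!len)
		return 0;		/* assumption: empty range touches nothing */
	if (len - 1 > UINT64_MAX - ofs)
		return -ERANGE;		/* ofs + len - 1 would wrap */

	start_seg = ofs >> obj_order;
	end_seg = (ofs + len - 1) >> obj_order;
	result = end_seg - start_seg + 1;
	if (result > (uint64_t)INT_MAX)
		return -ERANGE;		/* count does not fit in the int return type */

	return (int)result;
}
```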
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A few very minor changes to the rbd code:
- RBD_MAX_OPT_LEN is unused, so get rid of it
- Consolidate rbd options definitions
- Make rbd_segment_name() return pointer to const char
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The type of the snap_id local variable is defined with the
wrong byte order. Fix that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Both rbd_req_sync_op() and rbd_do_request() have a "linger"
parameter, which is the address of a pointer that should refer to
the osd request structure used to issue a request to an osd.
Only one case ever supplies a non-null "linger" argument: a
CEPH_OSD_OP_WATCH start. And in that one case it is assigned
&rbd_dev->watch_request.
Within rbd_do_request() (where the assignment ultimately gets made)
we know the rbd_dev and therefore its watch_request field. We
also know whether the op being sent is CEPH_OSD_OP_WATCH start.
Stop opaquely passing down the "linger" pointer, and instead just
assign the value directly inside rbd_do_request() when it's needed.
This makes it unnecessary for rbd_req_sync_watch() to make
arrangements to hold a value that's not available until a
bit later. This more clearly separates setting up a watch
request from submitting it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The two remaining osd ops used by rbd are CEPH_OSD_OP_WATCH and
CEPH_OSD_OP_NOTIFY_ACK. Move the setup of those operations into
rbd_osd_req_op_create(), and get rid of rbd_create_rw_op() and
rbd_destroy_op().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move the initialization of the CEPH_OSD_OP_CALL operation into
rbd_osd_req_op_create().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move the assignment of the extent offset and length and payload
length out of rbd_req_sync_op() and into its caller in the one spot
where a read (and note--no write) operation might be initiated.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_do_request() there's a sort of last-minute assignment of the
extent offset and length and payload length for read and write
operations. Move those assignments into the caller (in those spots
that might initiate read or write operations).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When rbd_req_sync_notify_ack() calls rbd_do_request() it supplies
rbd_simple_req_cb() as its callback function. Because the callback
is supplied, an rbd_req structure gets allocated and populated so it
can be used by the callback. However rbd_simple_req_cb() is not
freeing (or even using) the rbd_req structure, so it's getting
leaked.
Since rbd_simple_req_cb() has no need for the rbd_req structure,
just avoid allocating one for this case. Of the three calls to
rbd_do_request(), only the one from rbd_do_op() needs the rbd_req
structure, and that call can be distinguished from the other two
because it supplies a non-null rbd_collection pointer.
So fix this leak by only allocating the rbd_req structure if a
non-null "coll" value is provided to rbd_do_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When rbd_do_request() is called it allocates and populates an
rbd_req structure to hold information about the osd request to be
sent. This is done for the benefit of the callback function (in
particular, rbd_req_cb()), which uses this in processing when
the request completes.
Synchronous requests provide no callback function, in which case
rbd_do_request() waits for the request to complete before returning.
This case is not handling the needed free of the rbd_req structure
like it should, so it is getting leaked.
Note however that the synchronous case has no need for the rbd_req
structure at all. So rather than simply freeing this structure for
synchronous requests, just don't allocate it to begin with.
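The shape of the fix can be sketched as below. This is a simplified
userspace model with hypothetical names: completion is simulated by
calling the callback inline, and a counter stands in for leak
detection. It only illustrates allocating the per-request context when
a callback exists to consume and free it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-request context used only by a completion callback. */
struct req_ctx_sketch {
	int result;
};

typedef void (*req_cb_sketch)(struct req_ctx_sketch *ctx);

static int live_ctx_count;	/* contexts allocated but not yet freed */

/* A callback that consumes its context and frees it. */
static void freeing_cb_sketch(struct req_ctx_sketch *ctx)
{
	free(ctx);
	live_ctx_count--;
}

/*
 * Submit a request.  A NULL callback means the request is synchronous,
 * so no context is allocated and there is nothing to leak.
 */
static int do_request_sketch(req_cb_sketch cb)
{
	struct req_ctx_sketch *ctx = NULL;

	if (cb) {
		ctx = malloc(sizeof(*ctx));
		if (!ctx)
			return -ENOMEM;
		ctx->result = 0;
		live_ctx_count++;
	}

	/* ... build and send the request here ... */

	if (cb)
		cb(ctx);	/* stands in for asynchronous completion */
	return 0;
}
```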
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd_req_sync_watch() and rbd_req_sync_unwatch() functions are
nearly identical. Combine them into a single function with a flag
indicating whether a watch is to be initiated or torn down.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Each osd message includes a layout structure, and for rbd it is
always the same (at least for osds in a given pool).
Initialize a layout structure when an rbd_dev gets created and just
copy that into osd requests for the rbd image.
Replace an assertion that was done when initializing the layout
structures with code that catches and handles anything that would
trigger the assertion as soon as it is identified. This precludes
that (bad) condition from ever occurring.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When rbd_do_request() has a request to process it initializes a ceph
file layout structure and uses it to compute offsets and limits for
the range of the request using ceph_calc_file_object_mapping().
The layout used is fixed, and is based on RBD_MAX_OBJ_ORDER (30).
It sets the layout's object size and stripe unit to be 1 GB (2^30),
and sets the stripe count to be 1.
The job of ceph_calc_file_object_mapping() is to determine which
of a sequence of objects will contain data covered by a range, and
within that object, at what offset the range starts. It also
truncates the length of the range at the end of the selected object
if necessary.
This is needed for ceph fs, but for rbd it really serves no purpose.
rbd does its own blocking of images into objects, each of which is
(1 << obj_order) in size, and as a result it ignores the "bno"
value returned by ceph_calc_file_object_mapping(). In addition,
by the point a request has reached this function, it is already
destined for a single rbd object, and its length will not exceed
that object's extent. Because of this, and because the mapping will
result in blocking up the range using an integer multiple of the
image's object order, ceph_calc_file_object_mapping() will never
change the offset or length values defined by the request.
In other words, this call is a big no-op for rbd data requests.
There is one exception. We read the header object using this
function, and in that case we will not have already limited the
request size. However, the header is a single object (not a file or
rbd image), and should not be broken into pieces anyway. So in fact
we should *not* be calling ceph_calc_file_object_mapping() when
operating on the header object.
So...
Don't call ceph_calc_file_object_mapping() in rbd_do_request(),
because it is useless for image data and incorrect to do so for the
image header.
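The no-op argument can be checked with a small model. The code below
is a simplified sketch of a stripe-count-1 mapping, not the libceph
function, and its names are hypothetical. It assumes the request is
expressed as an offset within a single rbd object and already fits
inside that object.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_OBJ_ORDER 30	/* value the text gives for RBD_MAX_OBJ_ORDER */

/*
 * Simplified model of mapping a byte range onto objects when the
 * stripe count is 1 and object size == stripe unit == 2^30 bytes.
 */
static void map_range_sketch(uint64_t off, uint64_t len, uint64_t *bno,
			     uint64_t *oxoff, uint64_t *oxlen)
{
	const uint64_t obj_size = UINT64_C(1) << SKETCH_MAX_OBJ_ORDER;
	uint64_t room;

	*bno = off / obj_size;		/* which object holds the start */
	*oxoff = off % obj_size;	/* offset within that object */
	room = obj_size - *oxoff;	/* bytes left before the object ends */
	*oxlen = len < room ? len : room;
}

/*
 * Returns 1 when the mapping leaves the request unchanged.  The
 * request must lie within one rbd object of (1 << obj_order) bytes.
 */
static int mapping_is_noop_sketch(unsigned int obj_order, uint64_t seg_ofs,
				  uint64_t len)
{
	uint64_t seg_size, bno, oxoff, oxlen;

	assert(obj_order <= SKETCH_MAX_OBJ_ORDER);
	seg_size = UINT64_C(1) << obj_order;
	assert(seg_ofs < seg_size && len <= seg_size - seg_ofs);

	map_range_sketch(seg_ofs, len, &bno, &oxoff, &oxlen);
	return bno == 0 && oxoff == seg_ofs && oxlen == len;
}
```

In this model any request confined to one rbd object comes back
unchanged, while a length that has not been limited beforehand (as in
the header read case) would be cut off at 2^30 bytes.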
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch gets rid of rbd_calc_raw_layout() by simply open coding
it in its one caller.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This is the first in a series of patches aimed at eliminating
the use of ceph_calc_raw_layout() by rbd.
It simply pulls in a copy of that function and renames it
rbd_calc_raw_layout().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
We now know that every caller of rbd_req_sync_op() passes an array of
exactly one operation, as evidenced by all callers passing 1 as its
num_op argument. So get rid of that argument, assuming a single op.
Similarly, we now know that all callers of rbd_do_request() pass 1
as the num_op value, so that parameter can be eliminated as well.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Throughout the rbd code there are spots where it appears we can
handle an osd request containing more than one osd request op.
But that is only the way it appears. In fact, currently only one
operation at a time can be supported, and supporting more than
one will require much more than fleshing out the support that's
there now.
This patch changes names to make it perfectly clear that anywhere
we're dealing with a block of ops, we're in fact dealing with
exactly one of them. We'll be able to simplify some things as
a result.
When multiple op support is implemented, we can update things again
accordingly.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Both ceph_osdc_alloc_request() and ceph_osdc_build_request() are
provided an array of ceph osd request operations. Rather than just
passing the number of operations in the array, the caller is
required to append an additional zeroed operation structure to signal
the end of the array.
All callers know the number of operations at the time these
functions are called, so drop the silly zero entry and supply that
number directly. As a result, get_num_ops() is no longer needed.
This also means that ceph_osdc_alloc_request() never uses its ops
argument, so that can be dropped.
Also rbd_create_rw_ops() no longer needs to add one to reserve room
for the additional op.
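The two conventions can be contrasted with a small userspace sketch.
The structure and function names below are hypothetical, and the op
structure is reduced to two fields; the point is only the difference
between scanning for a zeroed terminator and passing the count.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced osd request op. */
struct op_sketch {
	uint16_t op;		/* opcode; 0 marks the end in the old scheme */
	uint32_t payload_len;
};

/* Old convention: reserve one extra zeroed entry as a terminator. */
static struct op_sketch *create_ops_terminated_sketch(unsigned int num_ops)
{
	return calloc((size_t)num_ops + 1, sizeof(struct op_sketch));
}

/* Old convention: recover the count by scanning for the terminator. */
static unsigned int count_ops_sketch(const struct op_sketch *ops)
{
	unsigned int i = 0;

	while (ops[i].op)
		i++;
	return i;
}

/* New convention: allocate exactly num_ops entries (num_ops >= 1). */
static struct op_sketch *create_ops_counted_sketch(unsigned int num_ops)
{
	return calloc(num_ops, sizeof(struct op_sketch));
}

/* New convention: consumers are handed the count, so no scan is needed. */
static uint32_t total_payload_sketch(const struct op_sketch *ops,
				     unsigned int num_ops)
{
	uint32_t total = 0;
	unsigned int i;

	for (i = 0; i < num_ops; i++)
		total += ops[i].payload_len;
	return total;
}
```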
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add a num_op parameter to rbd_do_request() and rbd_req_sync_op() to
indicate the number of entries in the array. The callers of these
functions always know how many entries are in the array, so just
pass that information down.
This is in anticipation of eliminating the extra zero-filled entry
in these ops arrays.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Only one of the two callers of ceph_osdc_alloc_request() provides
page or bio data for its payload. And essentially all that function
was doing with those arguments was assigning them to fields in the
osd request structure.
Simplify ceph_osdc_alloc_request() by having the caller take care of
making those assignments.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>