The osdc parameter to ceph_calc_raw_layout() is not used, so get rid
of it. Consequently, the corresponding parameter in calc_layout()
becomes unused, so get rid of that as well.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A snapshot id must be provided to ceph_calc_raw_layout() even though
it is not needed at all for calculating the layout.
Where the snapshot id *is* needed is when building the request
message for an osd operation.
Drop the snapid parameter from ceph_calc_raw_layout() and pass
that value instead in ceph_osdc_build_request().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The len argument to ceph_osdc_build_request() is set up to be
passed by address, but that function never updates its value
so there's no need to do this. Tighten up the interface by
passing the length directly.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
For some reason, the snapid field of the osd request header is
explicitly set to CEPH_NOSNAP in rbd_do_request(). Just a few lines
later--with no code that would access this field in between--a call
is made to ceph_calc_raw_layout() passing the snapid provided to
rbd_do_request(), which encodes the snapid value it is provided into
that field instead.
In other words, there is no need to fill in CEPH_NOSNAP, and doing
so suggests it might be necessary. Don't do that any more.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The snapc and snapid parameters to rbd_req_sync_op() always take
the values NULL and CEPH_NOSNAP, respectively. So just get rid
of them and use those values where needed.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
All callers of rbd_req_sync_exec() pass CEPH_OSD_FLAG_READ as their
flags argument. Delete that parameter and use CEPH_OSD_FLAG_READ
within the function. If we find a need to support write operations
we can add it back again.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is only one caller of rbd_req_sync_read(), and it passes
CEPH_NOSNAP as the snapshot id argument. Delete that parameter
and just use CEPH_NOSNAP within the function.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The last two parameters to ceph_osd_build_request() describe the
object id, but the values passed always come from the osd request
structure whose address is also provided. Get rid of those last
two parameters.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull a block of code that initializes the layout structure in an osd
request into its own function so it can be reused.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Right now we get the snapshot context for an rbd image (under
protection of the header semaphore) for every request processed.
There's no need to get the snap context if we're doing a read,
so avoid doing so in that case.
Note that we no longer need to hold the header semaphore to
check the rbd_dev's existence flag.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd_device->exists field can be updated asynchronously, changing
from set to clear if a mapped snapshot disappears from the base
image's snapshot context.
Currently, value of the "exists" flag is only read and modified
under protection of the header semaphore, but that will change with
the next patch. Making it atomic ensures this won't be a problem
because the a the non-existence of device will be immediately known.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Now that a big hunk in the middle of rbd_rq_fn() has been moved
into its own routine we can simplify it a little more.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Only one of the three callers of rbd_do_request() provide a
collection structure to aggregate status.
If an error occurs in rbd_do_request(), have the caller
take care of calling rbd_coll_end_req() if necessary in
that one spot.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In rbd_rq_fn(), requests are fetched from the block layer and each
request is processed, looping through the request's list of bio's
until they've all been consumed.
Separate the handling for a single request into its own function to
make it a bit easier to see what's going on.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The result field in a ceph osd reply header is a signed 32-bit type,
but rbd code often casually uses int to represent it.
The following changes the types of variables that handle this result
value to be "s32" instead of "int" to be completely explicit about
it. Only at the point we pass that result to __blk_end_request()
does the type get converted to the plain old int defined for that
interface.
There is almost certainly no binary impact of this change, but I
prefer to show the exact size and signedness of the value since we
know it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
There are spots where a ceph_osds_request pointer variable is given
the name "req". Since we're dealing with (at least) three types of
requests (block layer, rbd, and osd), I find this slightly
distracting.
Change such instances to use "osd_req" consistently to make the
abstraction represented a little more obvious.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are two names used for items of rbd_request structure type:
"req" and "req_data". The former name is also used to represent
items of pointers to struct ceph_osd_request.
Change all variables that have these names so they are instead
called "rbd_req" consistently.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Josh suggested adding warnings to this function to help users
diagnose problems.
Other than memory allocatino errors, there are two places where
errors can be returned. Both represent problems that should
have been caught earlier, and as such might well have been
handled with BUG_ON() calls. But if either ever did manage to
happen, it will be reported.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add a warning in bio_chain_clone_range() to help a user determine
what exactly might have led to a failure. There is only one; please
say something if you disagree with the following reasoning.
There are three places this can return abnormally:
- Initially, if there is nothing to clone. It turns out that
right now this cannot happen anyway. The test is in place
because the code below it doesn't work if those conditions
don't hold. As such they could be assertions but since I can
return a null to indicate an error I just do that instead.
I have not added a warning here because it won't happen.
- While processing bio's, if none remain but there are supposed
to be more bytes to clone. Here I have added a warning.
- If bio_clone_range() returns a null pointer. That function
will have already produced a warning (at least the first
time, via WARN_ON_ONCE()) to distinguish the cause of the
error. The only exception is memory exhaustion, and I'd
rather not pepper the code with warnings in all those spots.
So no warning is added in that place.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Tell the user (via dmesg) what was wrong with the arguments provided
via /sys/bus/rbd/add.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Define a new function rbd_warn() that produces a boilerplate warning
message, identifying in the resulting message the affected rbd
device in the best way available. Use it in a few places that now
use pr_warning().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This replaces two kmalloc()/memcpy() combinations with a single
call to kmemdup().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There is no real benefit to keeping the length of an image id, so
get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There may have been a benefit to hanging on to the length of an
image name before, but there is really none now. The only time it's
used is when probing for rbd images, so we can just compute the
length then.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
I promised Josh I would document whether there were any restrictions
needed for accessing fields of an rbd_spec structure. This adds a
big block of comments that documents the structure and how it is
used--including the fact that we don't attempt to synchronize access
to it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Hi Asai,
FYI, there are new sparse warnings show up in
tree: git://git.kernel.dk/linux-block.git for-3.9/drivers
head: 3d6a87430e
commit: 152834694d [2/3] mtip32xx: add trim support
>> drivers/block/mtip32xx/mtip32xx.c:1726:5: sparse: symbol 'mtip_send_trim' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:3348:17: sparse: cast to restricted __le32
drivers/block/mtip32xx/mtip32xx.c:4125:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:4126:1: sparse: symbol 'mtip_workq_sdbf1' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:4127:1: sparse: symbol 'mtip_workq_sdbf2' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:4128:1: sparse: symbol 'mtip_workq_sdbf3' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:4129:1: sparse: symbol 'mtip_workq_sdbf4' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:4130:1: sparse: symbol 'mtip_workq_sdbf5' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:4131:1: sparse: symbol 'mtip_workq_sdbf6' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c:4132:1: sparse: symbol 'mtip_workq_sdbf7' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_flags':
drivers/block/mtip32xx/mtip32xx.c:2804:1: warning: the frame size of 1036 bytes is larger than 1024 bytes [-Wframe-larger-than=]
drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_registers':
drivers/block/mtip32xx/mtip32xx.c:2781:1: warning: the frame size of 1044 bytes is larger than 1024 bytes [-Wframe-larger-than=]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hi Asai,
FYI, there are new sparse warnings show up in
tree: git://git.kernel.dk/linux-block.git for-3.9/drivers
head: 3d6a87430e
commit: 16c906e51c [1/3] mtip32xx: Add workqueue and NUMA support
drivers/block/mtip32xx/mtip32xx.c:3267:17: sparse: cast to restricted __le32
>> drivers/block/mtip32xx/mtip32xx.c:4029:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4030:1: sparse: symbol 'mtip_workq_sdbf1' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4031:1: sparse: symbol 'mtip_workq_sdbf2' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4032:1: sparse: symbol 'mtip_workq_sdbf3' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4033:1: sparse: symbol 'mtip_workq_sdbf4' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4034:1: sparse: symbol 'mtip_workq_sdbf5' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4035:1: sparse: symbol 'mtip_workq_sdbf6' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4036:1: sparse: symbol 'mtip_workq_sdbf7' was not declared. Should it be static?
drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_flags':
drivers/block/mtip32xx/mtip32xx.c:2723:1: warning: the frame size of 1036 bytes is larger than 1024 bytes [-Wframe-larger-than=]
drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_registers':
drivers/block/mtip32xx/mtip32xx.c:2700:1: warning: the frame size of 1044 bytes is larger than 1024 bytes [-Wframe-larger-than=]
Please consider folding the below diff :-)
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a missing break statement here. This used to return directly
but we re-worked it in 2008 to add locking as part of the BKL push down.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
TRIM support added through vendor unique command.
Signed-off-by: Sam Bradshaw < sbradshaw@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch contains
* parallel command completion using workers
* bind the workers to the chosen numa node
* bind isr to the chosen numa node
* allocating memory in the chosen numa node
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When rebuild is in progress, disk->queue is yet to be created. Surprise
removing the device will call remove()-> del_gendisk(). del_gendisk()
expect disk->queue be not NULL. Fix is to call put_disk() when disk_queue
is NULL.
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If an I/O command times out when a PIO command is active,
MTIP_PF_EH_ACTIVE_BIT is not cleared. This results in I/O
hang in the driver. Fix is to clear this bit.
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This driver was for the 8 bit ISA cards that were installed in
the PC-XT machines of 1980 vintage. They supported the dual
ribbon cable MFM drives of 10-20MB capacity, and ran at a 3:1
interleave, giving performance on the order of 128kB/s.
By the introduction of the PC-AT (286) these controllers were
already scrapped in favour of 16 bit controllers with some onboard
RAM that could support a 1:1 interleave.
The git history doesn't show any evidence of runtime fixes that
would reflect active usage; instead just the usual tree-wide API
type changes/cleanups. Going back to in-source changelogs, the
last "runtime" fix that is evident is something I did over a
dozen years ago[1] -- and even back then, the hardware was long
since unavailable, so that ancient fix was also not runtime tested.
The time is long overdue for this to get flushed, so lets get
rid of it before anyone wastes more time doing builds and sparse
checks etc. on long since dead code.
[1] http://lkml.indiana.edu/hypermail/linux/kernel/0102.2/0027.html
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.
This change removes the use of __devinit, __devexit_p, __devinitdata,
__devinitconst, and __devexit from these drivers.
Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Chirag Kantharia <chirag.kantharia@hp.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tao Guo <Tao.Guo@emc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a file system is mounted on a virtio-blk disk, we then remove it
and then reattach it, the reattached disk gets the same disk name and
ids as the hot removed one.
This leads to very nasty effects - mostly rendering the newly attached
device completely unusable.
Trying what happens when I do the same thing with a USB device, I saw
that the sd node simply doesn't get free'd when a device gets forcefully
removed.
Imitate the same behavior for vd devices. This way broken vd devices
simply are never free'd and newly attached ones keep working just fine.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
Pull Ceph update from Sage Weil:
"There are a few different groups of commits here. The largest is
Alex's ongoing work to enable the coming RBD features (cloning,
striping). There is some cleanup in libceph that goes along with it.
Cyril and David have fixed some problems with NFS reexport (leaking
dentries and page locks), and there is a batch of patches from Yan
fixing problems with the fs client when running against a clustered
MDS. There are a few bug fixes mixed in for good measure, many of
which will be going to the stable trees once they're upstream.
My apologies for the late pull. There is still a gremlin in the rbd
map/unmap code and I was hoping to include the fix for that as well,
but we haven't been able to confirm the fix is correct yet; I'll send
that in a separate pull once it's nailed down."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
rbd: get rid of rbd_{get,put}_dev()
libceph: register request before unregister linger
libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
libceph: init event->node in ceph_osdc_create_event()
libceph: init osd->o_node in create_osd()
libceph: report connection fault with warning
libceph: socket can close in any connection state
rbd: don't use ENOTSUPP
rbd: remove linger unconditionally
rbd: get rid of RBD_MAX_SEG_NAME_LEN
libceph: avoid using freed osd in __kick_osd_requests()
ceph: don't reference req after put
rbd: do not allow remove of mounted-on image
libceph: Unlock unprocessed pages in start_read() error path
ceph: call handle_cap_grant() for cap import message
ceph: Fix __ceph_do_pending_vmtruncate
ceph: Don't add dirty inode to dirty list if caps is in migration
ceph: Fix infinite loop in __wake_requests
ceph: Don't update i_max_size when handling non-auth cap
bdi_register: add __printf verification, fix arg mismatch
...
The functions rbd_get_dev() and rbd_put_dev() are trivial wrappers
that add no value, and their existence suggests they may do more
than what they do.
Get rid of them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Konrad writes:
Please git pull the following branch:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.8
which has a bug-fix to the xen-blkfront and xen-blkback driver
when using the persistent mode. An issue was discovered where LVM
disks could not be read correctly and this fixes it. There
is also a change in llist.h which has been blessed by akpm.
Merge misc patches from Andrew Morton:
"Incoming:
- lots of misc stuff
- backlight tree updates
- lib/ updates
- Oleg's percpu-rwsem changes
- checkpatch
- rtc
- aoe
- more checkpoint/restart support
I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc/<pid>/fdinfo/<fd> output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...
Currently blkfront fails to handle cases in blkif_completion like the
following:
1st loop in rq_for_each_segment
* bv_offset: 3584
* bv_len: 512
* offset += bv_len
* i: 0
2nd loop:
* bv_offset: 0
* bv_len: 512
* i: 0
In the second loop i should be 1, since we assume we only wanted to
read a part of the previous page. This patches fixes this cases where
only a part of the shared page is read, and blkif_completion assumes
that if the bv_offset of a bvec is less than the previous bv_offset
plus the bv_size we have to switch to the next shared page.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: linux-kernel@vger.kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Implement a safe version of llist_for_each_entry, and use it in
blkif_free. Previously grants where freed while iterating the list,
which lead to dereferences when trying to fetch the next item.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
[v2: Move the llist_for_each_entry_safe in llist.h]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We should return NULL on failure instead of returning a freed pointer.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This version number is printed to the console on module initialization
and is available in sysfs, which is where the userland aoe-version tool
looks for it.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change only affects experimental AoE storage networks.
It modifies the console message about runt packets detected so that the
AoE major and minor addresses of the AoE target that generated the runt
are mentioned.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By default, the aoe driver uses any ethernet interface for AoE, but the
aoe_iflist module parameter provides a convenient way to limit AoE
traffic to a specific list of local network interfaces.
This change allows a list to be specified using the comma character as a
separator. For example,
modprobe aoe aoe_iflist=eth2,eth3
Before, it was inconvenient to get the quoting right in shell scripts
when setting aoe_iflist to have more than one network interface.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this change, the aoe driver treats the value zero as special for
the aoe_deadsecs module parameter. Normally, this value specifies the
number of seconds during which the driver will continue to attempt
retransmits to an unresponsive AoE target. After aoe_deadsecs has
elapsed, the aoe driver marks the aoe device as "down" and fails all
I/O.
The new meaning of an aoe_deadsecs of zero is for the driver to
retransmit commands indefinitely.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many AoE targets have four or fewer network ports, but some existing
storage devices have many, and the AoE protocol sets no limit.
This patch allows the use of more than eight remote MAC addresses per AoE
target, while reducing the amount of memory used by the aoe driver in
cases where there are many AoE targets with fewer than eight MAC addresses
each.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change avoids a race that could result in a NULL pointer derference
following a WARNing from kobject_add_internal, "don't try to register
things with the same name in the same directory."
The problem was found with a test that forgets and discovers an
aoe device in a loop:
while test ! -r /tmp/stop; do
aoe-flush -a
aoe-discover
done
The race was between aoedev_flush taking aoedevs out of the devlist,
allowing a new discovery of the same AoE target to take place before the
driver gets around to calling sysfs_remove_group. Fixing that one
revealed another race between do_open and add_disk, and this patch avoids
that, too.
The fix required some care, because for flushing (forgetting) an aoedev,
some of the steps must be performed under lock and some must be able to
sleep. Also, for discovering a new aoedev, some steps might sleep.
The check for a bad aoedev pointer remains from a time when about half of
this patch was done, and it was possible for the
bdev->bd_disk->private_data to become corrupted. The check should be
removed eventually, but it is not expected to add significant overhead,
occurring in the aoeblk_open routine.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An AoE target can have multiple network ports used for AoE, and in the
aoe driver, those are tracked by the aoetgt struct. These changes allow
the aoe driver to handle network paths, or aoetgts, that are not working
well, compared to the others.
Paths that do not get responses despite the retransmission of AoE
commands are marked as "tainted", and non-tainted paths are preferred.
Meanwhile, the aoe driver attempts to "probe" the tainted path in the
background by issuing reads of LBA 0 that are padded out to full
(possibly jumbo-frame) size. If the probes get responses, then the path
is "redeemed", and its taint is removed.
This mechanism has been shown to be helpful in transparently handling
and recovering from real-world network "brown outs" in ways that the
earlier "shoot the help-needing target in the head" mechanism could not.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The value returned by the static minor device number number allocator is
the real minor number, so it must be multiplied by the supported number
of partitions per aoedev.
Without this fix the support for systems without udev is incomplete, and
the few users of aoe on such systems will have surprising results when
device nodes names do not match the AoE target.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because the minor_get and related functions use the return values for
errors, the compiler doesn't know that sysminor will always either 1) be
initialized in aoedev_by_aoeaddr by the call to minor_get, or 2) be
unused as the "goto out" is executed.
This patch avoids the compiler warning.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For some special-purpose systems where udev isn't present, static
allocation of minor numbers is desirable. This update distinguishes
different failure scenarios, to help the user understand what went
wrong.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to call the request handler function in the I/O
completion routine. The user impact of not doing it is a more "nice" aoe
driver that is less susceptible to causing soft lockups.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A misplaced comment was attached to the nout member of the aoetgt. This
change corrects the comment.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The aoe driver will never be waiting for more than aoe_maxout AoE
commands from a given remote network port on an AoE target. Increasing
the cap increases performance. Users can tighten the setting to reduce
the amount of memory used for handling AoE traffic or the network
bandwidth used for AoE.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before the aoe driver was an I/O request handler, it was a
make_request-style block driver. Even so, there was a problem where
sysfs expected a request queue to exist, so one was provided in commit
7135a71b19 ("aoe: allocate unused request_queue for sysfs").
During the transition to the request-handler style, a patch was merged
that was based on a driver without the noop queue, and the noop queue
remained in place after the patch was merged, even though a new
functional queue was introduced by the patch, allocated through
blk_init_queue.
The user impact is a memory leak proportional to the number of AoE
targets discovered. This patch removes the memory leak and cleans up
vestiges of the old do-nothing queue from the aoeblk_gdalloc function.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f3b8e07af774 ("aoe: commands in retransmit queue use new
destination on failure") omits the copying of the coarse-grained time
when an AoE command was sent during the failover from one destination
MAC address on the AoE target to another.
The coarse-grained timing is only used when the system time changes or
an unlikely length of time has passed since the sending of the AoE
command. Users will not be impacted unless their system clock is very
inaccurate or something unusual (e.g., 10 GbE link reset) happens during
the period when the aoe driver is handling the failure of a port on the
AoE target. Being effected will mean that an AoE target could be
considered "down" too eagerly.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When one remote MAC address isn't working as a destination for AoE
commands, the frames used to track information associated with the AoE
commands are moved to a new aoetgt (defined by the tuple of {AoE major,
AoE minor, target MAC address}).
This patch makes sure that the frames on the queue for retransmits that
need to be done are updated to use the new destination, so that
retransmits will be sent through a working network path.
Without this change, packets on the retransmit queue will be needlessly
retransmitted to the unresponsive destination MAC, possibly causing
premature target failure before there's time for the retransmit timer to
run again, decide to retransmit again, and finally update the destination
to a working MAC address on the AoE target.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These changes improve the accuracy of the decision about whether it's time
to retransmit an AoE command by using the microsecond-resolution
gettimeofday instead of jiffies.
Because the system time can jump suddenly, the decision reverts to using
jiffies if the high-resolution time difference is relatively large.
Otherwise the AoE targets could be considered failed inappropriately.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this bugfix in place the calculation of the criterion for "lateness"
is performed under lock. Without the lock, there is a chance that one of
the non-atomic operations performed on the round trip time statistics
could be incomplete, such that an incorrect lateness criterion would be
calculated.
Without this change, the effect of the bug would be rare unecessary but
benign retransmissions.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /dev/etherd/err character device provides low-level information about
normal but sometimes interesting AoE command retransmits and "unexpected
responses", i.e., responses for packets that have already been
retransmitted.
This change adds MAC addresses to the messages about unexpected responses,
so that when they occur, it's more easy to determine the network paths to
which they belong.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The aoe driver already had some congestion handling, but it was limited in
its ability to cope with the kind of congestion that can arise on more
complex networks such as those involving paths through multiple ethernet
switches.
Some of the lessons from TCP's history of development can be applied to
improving the congestion control and avoidance on AoE storage networks.
These changes use familar concepts from Van Jacobson's "Congestion
Avoidance and Control" paper from '88, without adding significant
overhead.
This patch depends on an upcoming patch that covers the failover case when
AoE commands being retransmitted are transferred from one retransmit queue
to another. Another upcoming patch increases the timing accuracy.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the aoe driver follow expected behavior when the user uses ioctl to
get the ATA device identify information, allowing access to model, serial
number, etc.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The userland aoetools package includes an "aoe-stat" command that can
display a "payload size" column when the aoe driver exports this
information. Users can quickly see what amount of user data is
transferred inside each AoE command on the network, network headers
excluded.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The GPFS filesystem is an example of an aoe user that requires the aoe
driver to support I/O request sizes larger than the default. Most users
will not need large I/O request sizes, because they would need to be split
up into multiple AoE commands anyway.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users sometimes want to cause the aoe driver to forget a particular
previously discovered device when it is no longer online. The aoetools
provide an "aoe-flush" command that users run to perform this
administrative task. The changes below provide the support needed in the
driver.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ATA over Ethernet config query response contains a "buffer count"
field reflecting the AoE target's capacity to buffer incoming AoE
commands.
By taking the current value of this field into accound, we increase
performance throughput or avoid network congestion, when the value
has increased or decreased, respectively.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dropped transmits are not common, but when they do occur, increasing
the transmit queue length often helps.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
ENOTSUPP is not a standard errno (it shows up as "Unknown error 524"
in an error message). This is what was getting produced when the
the local rbd code does not implement features required by a
discovered rbd image.
Change the error code returned in this case to ENXIO.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
RBD_MAX_SEG_NAME_LEN represents the maximum length of an rbd object
name (i.e., one of the objects providing storage backing an rbd
image).
Another symbol, MAX_OBJ_NAME_SIZE, is used in the osd client code to
define the maximum length of any object name in an osd request.
Right now they disagree, with RBD_MAX_SEG_NAME_LEN being too big.
There's no real benefit at this point to defining the rbd object
name length limit separate from any other object name, so just
get rid of RBD_MAX_SEG_NAME_LEN and use MAX_OBJ_NAME_SIZE in its
place.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
There is no check in rbd_remove() to see if anybody holds open the
image being removed. That's not cool.
Add a simple open count that goes up and down with opens and closes
(releases) of the device, and don't allow an rbd image to be removed
if the count is non-zero.
Protect the updates of the open count value with ctl_mutex to ensure
the underlying rbd device doesn't get removed while concurrently
being opened.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Change foreach_grant iterator to a safe version, that allows freeing
the element while iterating. Also move the free code in
free_persistent_gnts to prevent freeing the element before the rb_next
call.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We no longer need the connector.
But we need libcrc32c.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This was introduces when moving the code over from the 8.3 codebase
with commit 328e0f125b
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_set_role(, R_PRIMARY, ) does the state change to Primary,
some more housekeeping, and possibly generates a new UUID set.
All of this holding the "state_mutex".
The connection handshake involves sending of various state information,
including the current data generation UUID set, and two connection
state changes from C_WF_CONNECTION to C_WF_REPORT_PARAMS further to
a number of different outcomes, resync being one of them.
If the connection handshake happens between the state change to Primary
and the generation of the new UUIDs, the resync decision based on the
old UUID set may be confused, depending on circumstances.
Make sure that, before we do the handshake, any promotion to Primary
role will either be complete (including the housekeeping stuff), or can
see, and serialize with, the ongoing handshake, based on the
"STATE_SENT" bit, which is set when we start the handshake, and cleared
only when we leave C_WF_REPORT_PARAMS again.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We need to propagate the configuration into the flag bits,
or it won't be effective.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Smatch complained about it this redundanct check.
The check was introduced in 2006-09-13. On 2007-07-24 the body of the
function was enclosed by get_ldev()/put_ldev() reference counting.
Since then the check is useless and miss leading.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Compiling drbd yields:
drivers/block/drbd/drbd_state.c: In function ‘_conn_request_state’:
drivers/block/drbd/drbd_state.c:1804:5: error: macro "wait_event_lock_irq" passed 4 arguments, but takes just 3
drivers/block/drbd/drbd_state.c:1801:3: error: ‘wait_event_lock_irq’ undeclared (first use in this function)
drivers/block/drbd/drbd_state.c:1801:3: note: each undeclared identifier is reported only once for each function it appears in
drivers/block/drbd/drbd_state.c: At top level:
drivers/block/drbd/drbd_state.c:1734:1: warning: ‘_conn_rq_cond’ defined but not used [-Wunused-function]
Due to drbd having copied the MD definition for wait_event_lock_irq()
as well. Kill them.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently there is not limitation of number of requests in the loop bio
list. This can lead into some nasty situations when the caller spawns
tons of bio requests taking huge amount of memory. This is even more
obvious with discard where blkdev_issue_discard() will submit all bios
for the range and wait for them to finish afterwards. On really big loop
devices and slow backing file system this can lead to OOM situation as
reported by Dave Chinner.
With this patch we will wait in loop_make_request() if the number of
bios in the loop bio list would exceed 'nr_congestion_on'.
We'll wake up the process as we process the bios form the list. Some
threshold hysteresis is in place to avoid high frequency oscillation.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Free the page allocated for the persistent grant.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Move the code that frees persistent grants from the red-black tree
to a function. This will make it easier for other consumers to move
this to a common place.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Hi Jens,
Another tiny patch.
Removed __packed before the struct smart_attr and added __packed at end of
the structure to fix padding issue.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Calling the request handler directly on a plugged queue defeats
the performance improvements provided by the plugging mechanism.
Use the __blk_run_queue function instead of calling the request
handler directly, so that we don't interfere with the block
layer's ability to plug the queue.
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The dereference to port should be moved below the NULL test.
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we're building a 32-bit kernel and CONFIG_LBADF isn't set,
sector_t is 32-bits wide. The shifts by 32 and 40 are thus
larger than we support.
Cast the sector offset to a u64 to avoid these warnings.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previous commit use value 3 for erasemode mask.
Changing the mask to correct value to 2
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Earlier lba address was assigned directly to lba_low and lba_low_ex,
which would result in a different number (bytes reversed) in
big-endian systems. Now assigning lba address byte-by-byte to fis.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The mtip driver lifted this code from elsewhere and then added a special
handling check for SEC_ERASE_UNIT. If the caller tries to do a security
erase but passes no output data for the command then outbuf is not
allocated and the driver duly explodes.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use check_signature() to find a signature in the mmio address.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Remove unnecessary correction of bit and address
- Use BITS_TO_LONGS macro to calculate bitmap size
- Use bitmap_zero()
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use copy_highpage() to copy from one page to another.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The 8.3.12 commit drbd: Bugfix for the connection behavior fixes a
"wasted established connection", if a former connection attempt failed
during its early stages.
However it opened a window for a regression, if a connection attempt
fails during its last stages. The result was a terminated receiver
thread, that left behind the supposedly transient "C_UNCONNECTED" state.
Any later requests to change the connection state fail, as they wait for
the connection state to "stabilize".
Fix: short circuit and keep retrying to restablish a new connection,
if we don't reach C_WF_REPORT_PARAMS.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jing Wang <windsdaemon@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the disk has failed already, there is no point trying to change the
bitmap. drbd_set_out_of_sync() already had this safeguard,
time to add it to drbd_set_in_sync() as well.
This also prevents some warning messages, like
FIXME asender in bm_change_bits_to, bitmap locked for 'detach' by worker
if our disk fails during resync, while there are some resync acks queued up.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
recent commit
drbd: always write bitmap on detach
introduced a bitmap writeout during detach,
which obviously needs some meta data device to write to.
Unfortunately, that same error path may be taken if we fail to attach,
e.g. due to UUID mismatch, after we changed state to D_ATTACHING,
but before the lower level device pointer is even assigned.
We need to test for presence of mdev->ldev.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If we detach due to local read-error (which sets a bit in the bitmap),
stay Primary, and then re-attach (which re-reads the bitmap from disk),
we potentially lost the "out-of-sync" (or, "bad block") information in
the bitmap.
Always (try to) write out the changed bitmap pages before going diskless.
That way, we don't lose the bit for the bad block,
the next resync will fetch it from the peer, and rewrite
it locally, which may result in block reallocation in some
lower layer (or the hardware), and thereby "heal" the bad blocks.
If the bitmap writeout errors out as well, we will (again: try to)
mark the "we need a full sync" bit in our super block,
if it was a READ error; writes are covered by the activity log already.
If that superblock does not make it to disk either, we are sorry.
Maybe we just lost an entire disk or controller (or iSCSI connection),
and there actually are no bad blocks at all, so we don't need to
re-fetch from the peer, there is no "auto-healing" necessary.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The intention of force-detach is to be able to deal with a completely
unresponsive lower level IO stack, which does not even deliver error
completions anymore, but no completion at all.
In all other cases, we must still wait for the meta data IO completion.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
This has not yet been observed, but conceivably, when using GFP_KERNEL
allocations from drbd_md_sync(), drbd_flush_after_epoch() or
receive_SyncParam(), we could trigger additional IO to our own device,
or an other device in a criss-cross setup, and end up in a local
deadlock, or potentially a distributed deadlock in a criss-cross setup
involving the peer blocked in a similar way waiting for us to make
progress.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The former comment arguing that GFP_KERNEL was good enough was wrong: it
did not take resize into account at all, and assumed the only path
leading here was the normal attach on a still secondary device, so no
deadlock would be possible.
Both resize on a Primary, or attach on a diskless Primary,
could potentially deadlock.
drbd_bm_resize() is called while IO to the respective device is
suspended, so we must use GFP_NOIO to avoid potential deadlock.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Using list_move_tail() instead of list_del() + list_add_tail().
spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
"aborting" requests, or force-detaching the disk, is intended for
completely blocked/hung local backing devices which do no longer
complete requests at all, not even do error completions. In this
situation, usually a hard-reset and failover is the only way out.
By "aborting", basically faking a local error-completion,
we allow for a more graceful swichover by cleanly migrating services.
Still the affected node has to be rebooted "soon".
By completing these requests, we allow the upper layers to re-use
the associated data pages.
If later the local backing device "recovers", and now DMAs some data
from disk into the original request pages, in the best case it will
just put random data into unused pages; but typically it will corrupt
meanwhile completely unrelated data, causing all sorts of damage.
Which means delayed successful completion,
especially for READ requests,
is a reason to panic().
We assume that a delayed *error* completion is OK,
though we still will complain noisily about it.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
is_valid_transition() might return SS_NOTHING_TO_DO.
The condition function _req_st_cond() returned SS_NOTHING_TO_DO, which
caused the wait_event to abort too early. Therefore drbd_req_state()
did not consume the next CL_ST_CHG_SUCCESS or SS_CW_FAILED_BY_PEER
causing serve disruption of the state machine logic...
Detaching from a single volue was one way to trigger this bug.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We use the RQ_POSTPONED flag to mark a request for several reasons.
It may be a conflicting request in a dual-primary setup,
where conflict detection and resolution on the peer decided that
this request needs to be re-submitted, it needs to re-enter
drbd_make_request() to fix the data divergence caused by these
conflicting, partially overlapping, quasi-simultaneous requests.
In this case we need to mark the corresponding area as out-of-sync,
before we call drbd_al_complete_io().
We also use the RQ_POSTPONED flag to just "push back" a request,
before even processing it, if IO is suspended for some reason.
In this case, as this request was neither submitted nor sent yet,
we must not touch the bitmap.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
A postponed request might has RQ_IN_ACT_LOG already set, but
is POSTPONED before it gets something in the RQ_LOCAL_MASK
set. Up to now this caused a left-over active extent.
Fix that by only testing for the RQ_IN_ACT_LOG bit in drbd_req_destroy()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Without this, the meta-data gets updates after 5 seconds by the
md_sync_timer. Better to do it immeditaly after a state change.
If the asender detects a network failure, it may take a bit until
the worker processes the according after-conn-state-change work item.
The worker might be blocked in sending something, i.e. it
takes until it gets into its timeout. That is 6 seconds by
default which is longer than the 5 seconds of the md_sync_timer.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* Postponed requests should not set or clear out-of-sync marks
* When a request gets postponed we need to drop its reference
mdev->local_cnt (put_ldev()).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
With merging the commit
'drbd: Delay/reject other state changes while establishing a connection'
the condition check for clearing the flag was wrong.
Move the bit clearing to the __drbd_set_state() function
in order to have it already cleared for the other parts of
the function. I.e. clearing the susp_fen in the after_state_ch() function.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When _conn_requests_state() is used to change other parts of the state
than the connection, do not check for a valid connection transition.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The previous way of doing the state change was also okay since the
state change on the susp flag gets propagated from the mdev
to the tconn.
Fortunately all this goes away in drbd-9.0
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the md_sync_timer triggers a second time,
while the work queued during the first time is still pending,
this could result in list_add() of an already added item,
and corrupt the work item list.
This likely only triggered because of the erroneous
batch-dequeueing of work items fixed with
drbd: dequeue single work items in wait_for_work()
Still, skip queueing if md_sync_work is already queued.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
As long as we still use drbd_queue_work_front(),
we must only dequeue the single first item during normal operation.
The comment in drbd_worker() even says so,
but bc8a5a1 drbd: remove struct drbd_tl_epoch objects (barrier works)
introduced the batch dequeueing again via list_splice_init() in
wait_for_work().
Change back to list_move() of the first item, if any.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Documentation of mutex_unlock says
we must not use it in interrupt context.
So do not call it while holding the spin_lock_irq,
but give up the spinlock temporarily.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the preconditions for a state change change after the wait_event() we
might hit the BUG() statement in conn_set_state().
With holding the spin_lock while evaluating the condition AND until the
actual state change we ensure the the preconditions can not change anymore.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_adm_disk_opts() does
wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
drbd_al_shrink(mdev);
If the device is very busy, this can take a very long time to succeed.
Fix this by temporarily suspending IO,
then quickly change the settings, and resume.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We must only send P_BARRIER for epochs we actually sent P_DATA in.
If we (re-)establish a connection, we reinitialized the
send.current_epoch_nr, but forgot to reset send.current_epoch_writes.
This could result in a spurious P_BARRIER with stale epoch information,
and a disconnect/reconnect cycle once the then "unexpected"
P_BARRIER_ACK is received:
BAD! BarrierAck #28823 received, expected #28829!
Introduce re_init_if_first_write() and maybe_send_barrier() helpers,
and call them appropriately for read/write/set-out-of-sync requests.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
drbd_disconnected() is supposed to clear the resync lru cache,
by calling drbd_rs_cancel_all().
We must do so before we call drbd_flush_workqueue(), as at least the
callback w_restart_disk_io() may wait for resync progres, and would
otherwise deadlock.
drbd_finish_peer_reqs() may again populate that cache, which will
then potentially be stale after the next resync handshake and bitmap
exchange, we have to do it again after that.
A stale resync lru cache causes no harm but ugly messages like this:
BAD! sector=196608s enr=6 rs_left=-256 rs_failed=0 count=256 cstate=SyncTarget
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.
In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.
At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
There is at least the worker context, the receiver context, the context of
receiving netlink packts.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We need to write the whole bitmap after we moved the meta data
due to an online resize operation.
With the support for one peta byte devices bitmap IO was optimized
to only write out touched pages. This optimization must be turned
off when writing the bitmap after an online resize.
This issue was introduced with drbd-8.3.10.
The impact of this bug is that after an online resize, the next
resync could become larger than expected.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
In various places (E.g. CONNECTION_LOST_WHILE_PENDING) the
RQ_COMPLETION_SUSP mask is passed in the clear set to mod_rq_state().
The issue was that it tried to clear the RQ_COMPLETION_SUSP bit
out of the state mask first, and eventuelly set it afterwards,
in the drbd_req_put_completion_ref() function.
Fixed that by moving the reference getting out of
drbd_req_put_completion_ref() into the mod_rq_state(), before the place
where the extra reference might be put.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If for some reason (typically "split-brained" cluster manager)
drbd replica data has diverged, we can chose a victim,
and reconnect using "--discard-my-data", causing the victim
to become sync-target, fetching all changed blocks from the peer.
If we are Primary, we are potentially in use, and we refuse to
"roll back" changes to the data below the page cache and other users.
Rename the error symbol for this to ERR_DISCARD_IMPOSSIBLE.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We don't discard anything here, really.
We resolve conflicting, concurrent writes to overlapping data blocks.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
To avoid confusion with REQ_DISCARD aka TRIM, rename our
"discard concurrent write acks" from P_DISCARD_WRITE to P_SUPERSEDED.
At the same time, rename the drbd request event DISCARD_WRITE
to CONFLICT_RESOLVED. It already triggers both successful completion
or restart of the request, depending on our RQ_POSTPONED flag.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Don't drop a request from the transfer log just because it was NEG_ACKED.
We need it around to be able to verify P_BARRIER_ACKs against the
transver log.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Almost all code paths calling start_new_tl_epoch() guarded it with
if (... current_tle_writes > 0 ... ).
Just move that inside start_new_tl_epoch().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Requests of an acked epoch are stored on the barrier_acked_requests list. In
case the private bio of such a request completes while IO on the drbd device
is suspended [req_mod(completed_ok)] then the request stays there.
When thawing IO because the fence_peer handler returned, then we use
tl_clear() to apply the connection_lost_while_pending event to all requests
on the transfer-log and the barrier_acked_requests list.
Up to now the connection_lost_while_pending event was not applied
on requests on the barrier_acked_requests list. Fixed that.
I.e. now the connection_lost_while_pending and resend events are
applied to requests on the barrier_acked_requests list. For that
it is necessary that the resend event finishes (local only)
READS correctly.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The DISCARD_CONCURRENT flag should be set on one node and cleared on the
other node.
As the code was before it was theoretical possible that a node accepts the
meta socket, but has to close it later on, and keeps the DISCARD_CONCURRENT
flag.
Correct this by moving the clear_bit(DISCARD_CONCURRENT) where the packet
gets sent.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
DRBD has a concept of request epochs or reorder-domains,
which are separated on the wire by P_BARRIER packets.
Older DRBD is not able to handle zero-sized requests at all,
so we need to map empty flushes to these drbd barriers.
These are the equivalent of empty flushes, and
by default trigger flushes on the receiving side anyways
(unless not supported or explicitly disabled),
so there is no need to handle this differently in newer drbd either.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since the listening socket is open all the time, it was possible to
get into stable "initial packet S crossed" loops.
* when both sides realize in the drbd_socket_okay() call at the end
of the loop that the other side closed the main socket you had
the chance to get into a stable loop with repeated "packet S crossed"
messages.
* when both sides do not realize with the drbd_socket_okay() call at the end
of the loop that the other side closed the main socket you had
the chance to get into a stable loop with alternating "packet S crossed"
"packet M crossed" messages.
In order to break out these stable loops randomize the behaviour if
such a crossing of P_INITIAL_DATA or P_INITIAL_META packets is detected.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since now our listening socket is open all the time we will get
connection tries of the peer always in. No need to try it three
times.
This is valid when connecting to older peers as well, it simply
increases the probability that the new version DRBD will accept
a connection instead that it will establish one.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Since the drbd_socket_okay() function itself tests if the the
socket is NULL, the explicit test "if (sock.socket && &msock.socket)"
was redundent.
Apart from that the address opperator ('&') before msock.socket rendered
the test pointless.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>