We used to write these with BIO_RW_BARRIER aka REQ_HARDBARRIER (unless
disabled in the configuration). The correct semantic now would be to
write with FLUSH/FUA.
For example, with activity log transactions, FUA alone is not enough, we
need the corresponding bitmap update (and all related application
updates) on stable storage as well.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If we have an asymetrically congested network, we may send P_PING,
but due to congestion, the corresponding P_PING_ACK would time out,
and we would drop a (congested, but otherwise) healthy connection
("PingAck did not arrive in time.")
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If we have a good resync rate, we will frequently update the on-disk
bitmap, which, if not accounted for as resync io, may let an otherwise
idle device appear to be "busy", and cause us to throttle resync.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The last commit, drbd: add missing spinlock to bitmap receive,
introduced a cond_resched_lock(), where the lock in question is taken
with irqs disabled.
As we must not schedule with IRQs disabled,
and cond_resched_lock_irq() does not exist, yet,
we re-aquire the spin_lock_irq() for each bitmap page processed in turn.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
During bitmap exchange, when using the RLE bitmap compression scheme,
we have a code path that can set the whole bitmap at once.
To avoid holding spin_lock_irq() for too long, we used to lock out other
bitmap modifications during bitmap exchange by other means, and then,
knowing we have exclusive access to the bitmap, modify it without
the spinlock, and with IRQs enabled.
Since we now allow local IO to continue, potentially setting additional
bits during the bitmap receive phase, this is no longer true, and we get
uncoordinated updates of bitmap members, causing bm_set to no longer
accurately reflect the total number of set bits.
To actually see this, you'd need to have a large bitmap, use RLE bitmap
compression, and have busy IO during sync handshake and bitmap exchange.
Fix this by taking the spin_lock_irq() in this code path as well, but
calling cond_resched_lock() after each page worth of bits processed.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Use hlist_entry() for io_context.cic_list.first
cfq-iosched: Remove bogus check in queue_fail path
xen/blkback: potential null dereference in error handling
xen/blkback: don't call vbd_size() if bd_disk is NULL
block: blkdev_get() should access ->bd_disk only after success
CFQ: Fix typo and remove unnecessary semicolon
block: remove unwanted semicolons
Revert "block: Remove extra discard_alignment from hd_struct."
nbd: adjust 'max_part' according to part_shift
nbd: limit module parameters to a sane value
nbd: pass MSG_* flags to kernel_recvmsg()
block: improve the bio_add_page() and bio_add_pc_page() descriptions
Jens' back-merge commit 698567f3fa ("Merge commit 'v2.6.39' into
for-2.6.40/core") was incorrectly done, and re-introduced the
DISK_EVENT_MEDIA_CHANGE lines that had been removed earlier in commits
- 9fd097b149 ("block: unexport DISK_EVENT_MEDIA_CHANGE for
legacy/fringe drivers")
- 7eec77a181 ("ide: unexport DISK_EVENT_MEDIA_CHANGE for ide-gd
and ide-cd")
because of conflicts with the "g->flags" updates near-by by commit
d4dc210f69 ("block: don't block events on excl write for non-optical
devices")
As a result, we re-introduced the hanging behavior due to infinite disk
media change reports.
Tssk, tssk, people! Don't do back-merges at all, and *definitely* don't
do them to hide merge conflicts from me - especially as I'm likely
better at merging them than you are, since I do so many merges.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
blkbk->pending_pages can be NULL here so I added a check for it.
Signed-off-by: Dan Carpenter <error27@gmail.com>
[v1: Redid the loop a bit]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
It is easier to figure out the context by reading SCSI_SENSE_BUFFERSIZE
instead of plain '96'.
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Wire up the virtio_driver config_changed method to get notified about
config changes raised by the host. For now we just re-read the device
size to support online resizing of devices, but once we add more
attributes that might be changeable they could be added as well.
Note that the config_changed method is called from irq context, so
we'll have to use the workqueue infrastructure to provide us a proper
user context for our changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
x86 idle: deprecate "no-hlt" cmdline param
x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
x86 idle floppy: deprecate disable_hlt()
x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
x86 idle: clarify AMD erratum 400 workaround
idle governor: Avoid lock acquisition to read pm_qos before entering idle
cpuidle: menu: fixed wrapping timers at 4.294 seconds
Plan to remove floppy_disable_hlt in 2012, an ancient
workaround with comments that it should be removed.
This allows us to remove clutter and a run-time branch
from the idle code.
WARN_ONCE() on invocation until it is removed.
cc: x86@kernel.org
cc: stable@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
The 'max_part' parameter determines how many partitions are supported
on each nbd device. However the actual number can be changed to the
power of 2 minus 1 form during the module initialization as
alloc_disk() is called with (1 << part_shift) for some reason.
So adjust 'max_part' also at least for consistency with loop and brd.
It is exported via sysfs already, and a user should check this value
after module loading if [s]he wants to use that number correctly
(i.e. fdisk or something).
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Unlike kernel_sendmsg(), kernel_recvmsg() requires passing flags explicitly
via last parameter instead of struct msghdr.msg_flags. Therefore calls to
sock_xmit(lo, 0, ..., MSG_WAITALL) have not been processed properly by tcp
layer wrt. the flag. Fix it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Export 'max_loop' and 'max_part' parameters to sysfs so user can know
that how many devices are allowed and how many partitions are supported.
If 'max_loop' is 0, there is no restriction on the number of loop devices.
User can create/use the devices as many as minor numbers available. If
'max_part' is 0, it means simply the device doesn't support partitioning.
Also note that 'max_part' can be adjusted to power of 2 minus 1 form if
needed. User should check this value after the module loading if he/she
want to use that number correctly (i.e. fdisk, mknod, etc.).
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Export 'rd_nr', 'rd_size' and 'max_part' parameters to sysfs so user can
know that how many devices are allowed, how big each device is and how
many partitions are supported. If 'max_part' is 0, it means simply the
device doesn't support partitioning.
Also note that 'max_part' can be adjusted to power of 2 minus 1 form if
needed. User should check this value after the module loading if he/she
want to use that number correctly (i.e. fdisk, mknod, etc.).
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If 'rd_nr' param was not specified, 16 (can be adjusted via
CONFIG_BLK_DEV_RAM_COUNT) devices would be created by default
but comment said 1. Fix it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When finding or allocating a ram disk device, brd_probe() did not take
partition numbers into account so that it can result to a different
device. Consider following example (I set CONFIG_BLK_DEV_RAM_COUNT=4
for simplicity) :
$ sudo modprobe brd max_part=15
$ ls -l /dev/ram*
brw-rw---- 1 root disk 1, 0 2011-05-25 15:41 /dev/ram0
brw-rw---- 1 root disk 1, 16 2011-05-25 15:41 /dev/ram1
brw-rw---- 1 root disk 1, 32 2011-05-25 15:41 /dev/ram2
brw-rw---- 1 root disk 1, 48 2011-05-25 15:41 /dev/ram3
$ sudo mknod /dev/ram4 b 1 64
$ sudo dd if=/dev/zero of=/dev/ram4 bs=4k count=256
256+0 records in
256+0 records out
1048576 bytes (1.0 MB) copied, 0.00215578 s, 486 MB/s
namhyung@leonhard:linux$ ls -l /dev/ram*
brw-rw---- 1 root disk 1, 0 2011-05-25 15:41 /dev/ram0
brw-rw---- 1 root disk 1, 16 2011-05-25 15:41 /dev/ram1
brw-rw---- 1 root disk 1, 32 2011-05-25 15:41 /dev/ram2
brw-rw---- 1 root disk 1, 48 2011-05-25 15:41 /dev/ram3
brw-r--r-- 1 root root 1, 64 2011-05-25 15:45 /dev/ram4
brw-rw---- 1 root disk 1, 1024 2011-05-25 15:44 /dev/ram64
After this patch, /dev/ram4 - instead of /dev/ram64 - was
accessed correctly.
In addition, 'range' passed to blk_register_region() should
include all range of dev_t that RAMDISK_MAJOR can address.
It does not need to be limited by partition numbers unless
'rd_nr' param was specified.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
brd_refcnt, brd_offset, brd_sizelimit and brd_blocksize in struct
brd_device seem to be copied from struct loop_device but they're
not used anywhere. Let get rid of them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
ceph: fix cap flush race reentrancy
libceph: subscribe to osdmap when cluster is full
libceph: handle new osdmap down/state change encoding
rbd: handle online resize of underlying rbd image
ceph: avoid inode lookup on nfs fh reconnect
ceph: use LOOKUPINO to make unconnected nfs fh more reliable
rbd: use snprintf for disk->disk_name
rbd: cleanup: make kfree match kmalloc
rbd: warn on update_snaps failure on notify
ceph: check return value for start_request in writepages
ceph: remove useless check
libceph: add missing breaks in addr_set_port
libceph: fix TAG_WAIT case
ceph: fix broken comparison in readdir loop
libceph: fix osdmap timestamp assignment
ceph: fix rare potential cap leak
libceph: use snprintf for unknown addrs
libceph: use snprintf for formatting object name
ceph: use snprintf for dirstat content
libceph: fix uninitialized value when no get_authorizer method is set
...
* 'for-2.6.40/drivers' of git://git.kernel.dk/linux-2.6-block: (110 commits)
loop: handle on-demand devices correctly
loop: limit 'max_part' module param to DISK_MAX_PARTS
drbd: fix warning
drbd: fix warning
drbd: Fix spelling
drbd: fix schedule in atomic
drbd: Take a more conservative approach when deciding max_bio_size
drbd: Fixed state transitions after async outdate-peer-handler returned
drbd: Disallow the peer_disk_state to be D_OUTDATED while connected
drbd: Fix for the connection problems on high latency links
drbd: fix potential activity log refcount imbalance in error path
drbd: Only downgrade the disk state in case of disk failures
drbd: fix disconnect/reconnect loop, if ping-timeout == ping-int
drbd: fix potential distributed deadlock
lru_cache.h: fix comments referring to ts_ instead of lc_
drbd: Fix for application IO with the on-io-error=pass-on policy
xen/p2m: Add EXPORT_SYMBOL_GPL to the M2P override functions.
xen/p2m/m2p/gnttab: Support GNTMAP_host_map in the M2P override.
xen/blkback: don't fail empty barrier requests
xen/blkback: fix xenbus_transaction_start() hang caused by double xenbus_transaction_end()
...
* 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
cfq-iosched: free cic_index if cfqd allocation fails
cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
cfq-iosched: reduce bit operations in cfq_choose_req()
cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
block: move bd_set_size() above rescan_partitions() in __blkdev_get()
block: call elv_bio_merged() when merged
cfq-iosched: Make IO merge related stats per cpu
cfq-iosched: Fix a memory leak of per cpu stats for root group
backing-dev: Kill set but not used var in bdi_debug_stats_show()
block: get rid of on-stack plugging debug checks
blk-throttle: Make no throttling rule group processing lockless
blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
blk-throttle: Make dispatch stats per cpu
blk-throttle: Free up a group only after one rcu grace period
blk-throttle: Use helper function to add root throtl group to lists
blk-throttle: Introduce a helper function to fill in device details
blk-throttle: Dynamically allocate root group
blk-cgroup: Allow sleeping while dynamically allocating a group
...
When finding or allocating a loop device, loop_probe() did not take
partition numbers into account so that it can result to a different
device. Consider following example:
$ sudo modprobe loop max_part=15
$ ls -l /dev/loop*
brw-rw---- 1 root disk 7, 0 2011-05-24 22:16 /dev/loop0
brw-rw---- 1 root disk 7, 16 2011-05-24 22:16 /dev/loop1
brw-rw---- 1 root disk 7, 32 2011-05-24 22:16 /dev/loop2
brw-rw---- 1 root disk 7, 48 2011-05-24 22:16 /dev/loop3
brw-rw---- 1 root disk 7, 64 2011-05-24 22:16 /dev/loop4
brw-rw---- 1 root disk 7, 80 2011-05-24 22:16 /dev/loop5
brw-rw---- 1 root disk 7, 96 2011-05-24 22:16 /dev/loop6
brw-rw---- 1 root disk 7, 112 2011-05-24 22:16 /dev/loop7
$ sudo mknod /dev/loop8 b 7 128
$ sudo losetup /dev/loop8 ~/temp/disk-with-3-parts.img
$ sudo losetup -a
/dev/loop128: [0805]:278201 (/home/namhyung/temp/disk-with-3-parts.img)
$ ls -l /dev/loop*
brw-rw---- 1 root disk 7, 0 2011-05-24 22:16 /dev/loop0
brw-rw---- 1 root disk 7, 16 2011-05-24 22:16 /dev/loop1
brw-rw---- 1 root disk 7, 2048 2011-05-24 22:18 /dev/loop128
brw-rw---- 1 root disk 7, 2049 2011-05-24 22:18 /dev/loop128p1
brw-rw---- 1 root disk 7, 2050 2011-05-24 22:18 /dev/loop128p2
brw-rw---- 1 root disk 7, 2051 2011-05-24 22:18 /dev/loop128p3
brw-rw---- 1 root disk 7, 32 2011-05-24 22:16 /dev/loop2
brw-rw---- 1 root disk 7, 48 2011-05-24 22:16 /dev/loop3
brw-rw---- 1 root disk 7, 64 2011-05-24 22:16 /dev/loop4
brw-rw---- 1 root disk 7, 80 2011-05-24 22:16 /dev/loop5
brw-rw---- 1 root disk 7, 96 2011-05-24 22:16 /dev/loop6
brw-rw---- 1 root disk 7, 112 2011-05-24 22:16 /dev/loop7
brw-r--r-- 1 root root 7, 128 2011-05-24 22:17 /dev/loop8
After this patch, /dev/loop8 - instead of /dev/loop128 - was
accessed correctly.
In addition, 'range' passed to blk_register_region() should
include all range of dev_t that LOOP_MAJOR can address. It does
not need to be limited by partition numbers unless 'max_loop'
param was specified.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Laurent Vivier <Laurent.Vivier@bull.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
In file included from drivers/block/drbd/drbd_main.c:54: drivers/block/drbd/drbd_int.h:1190: warning: parameter has incomplete type
Forward declarations of enums do not work.
Fix it unpleasantly by moving the prototype.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Lars Ellenberg <drbd-dev@lists.linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Found these with the help of ispell -l.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
An administrative detach used to request a state change directly to D_DISKLESS,
first suspending IO to avoid the last put_ldev() occuring from an endio handler,
potentially in irq context.
This is not enough on the receiving side (typically secondary), we may miss
some peer_req on the way to local disk, which then may do the last put_ldev()
from their drbd_peer_request_endio().
This patch makes the detach always go through the intermediate D_FAILED state.
We may consider to rename it D_DETACHING.
Alternative approach would be to create yet an other work item to be scheduled
on the worker, do the destructor work from there, and get the timing right.
manually picked commit 564040f from the drbd 8.4 branch.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The old (optimistic) implementation could shrink the bio size
on an primary device.
Shrinking the bio size on a primary device is bad. Since there
we might get BIOs with the old (bigger) size shortly after
we published the new size.
The new implementation is more conservative, and eventually
increases the max_bio_size on a primary device (which is valid).
It does so, when it knows the local limit AND the remote limit.
We cache the last seen max_bio_size of the peer in the meta
data, and rely on that, to make the operation of single
nodes more efficient.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
It seems that the real cause of all the issues where that
we did not noticed in drbd_try_connect() when the other
guy closes one socket if the round trip time gets higher
than 100ms. There were that 100ms hard coded!
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
It is no longer sufficient to trigger on local WRITE,
we need to check on (rq_state & RQ_IN_ACT_LOG)
before calling drbd_al_complete_io also in the error path.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If there is no replication traffic within the idle timeout
(ping-int seconds), DRBD will send a P_PING,
and adjust the timeout to ping-timeout.
If there is no P_PING_ACK received within this ping-timeout,
DRBD finally drops the connection, and tries to re-establish it.
To decide which timeout was active, we compared the current timeout
with the ping-timeout, and dropped the connection, if that was the case.
By default, ping-int is 10 seconds, ping-timeout is 500 ms.
Unfortunately, if you configure ping-timeout to be the same as ping-int,
expiry of the idle-timeout had been mistaken for a missing ping ack,
and caused an immediate reconnection attempt.
Fix:
Allow both timeouts to be equal, use a local variable
to store which timeout is active.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We limit ourselves to a configurable maximum number of pages used as
temporary bio pages.
If the configured "max_buffers" is not big enough to match the bandwidth
of the respective deployment, a distributed deadlock could be triggered
by e.g. fast online verify and heavy application IO.
TCP connections would block on congestion, because both receivers
would wait on pages to become available.
Fortunately the respective senders in this case would be able to give
back some pages already. So do that.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
In case a write failes on the local disk, go into D_INCONSISTENT
disk state. That causes future reads of that block to be shipped
to the peer.
Read retry remote was already in place.
Actually the documentation needs to get fixed now. Since the
application is still shielded from the error. (as long as we have
only a single disk failing) The difference to detach is that
we keep the disk. And therefore might keep all the other, still
working sectors up to date.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
After discovering that wide use of prefetch on modern CPUs
could be a net loss instead of a win, net drivers which were
relying on the implicit inclusion of prefetch.h via the list
headers showed up in the resulting cleanup fallout. Give
them an explicit include via the following $0.02 script.
=========================================
#!/bin/bash
MANUAL=""
for i in `git grep -l 'prefetch(.*)' .` ; do
grep -q '<linux/prefetch.h>' $i
if [ $? = 0 ] ; then
continue
fi
( echo '?^#include <linux/?a'
echo '#include <linux/prefetch.h>'
echo .
echo w
echo q
) | ed -s $i > /dev/null 2>&1
if [ $? != 0 ]; then
echo $i needs manual fixup
MANUAL="$i $MANUAL"
fi
done
echo ------------------- 8\<----------------------
echo vi $MANUAL
=========================================
Signed-off-by: Paul <paul.gortmaker@windriver.com>
[ Fixed up some incorrect #include placements, and added some
non-network drivers and the fib_trie.c case - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.
Also fixes up conflicts in the below files.
Conflicts:
drivers/block/paride/pcd.c
drivers/cdrom/viocd.c
drivers/ide/ide-cd.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The sector number on empty barrier requests may (will?) be -1, which,
given that it's being treated as unsigned 64-bit quantity, will almost
always exceed the actual (virtual) disk's size.
Inspired by Konrad's "When writting barriers set the sector number to
zero...".
While at it also add overflow checking to the math in vbd_translate().
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>