Sufficient L2 cache can noticeably improve the performance when using
large images with frequent I/O.
Previously, unless 'cache-size' was specified and was large enough, the
L2 cache was set to a certain size without taking the virtual image size
into account.
Now, the L2 cache assignment is aware of the virtual size of the image,
and will cover the entire image, unless the cache size needed for that is
larger than a certain maximum. This maximum is set to 1 MB by default
(enough to cover an 8 GB image with the default cluster size) but can
be increased or decreased using the 'l2-cache-size' option. This option
was previously documented as the *maximum* L2 cache size, and this patch
makes it behave as such, instead of as a constant size. Also, the
existing option 'cache-size' can limit the sum of both L2 and refcount
caches, as previously.
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The refcount cache size does not need to be set to its minimum value in
read_cache_sizes(), as it is set to at least its minimum value in
qcow2_update_options_prepare().
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Leonid Bloch <lbloch@janustech.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The file-posix code is used for the "file", "host_device" and
"host_cdrom" drivers, and it allows reopening images. However the only
option that is actually processed is "x-check-cache-dropped", and
changes in all other options (e.g. "filename") are silently ignored:
(qemu) qemu-io virtio0 "reopen -o file.filename=no-such-file"
While we could allow changing some of the other options, let's keep
things as they are for now but return an error if the user tries to
change any of them.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The default value of x-check-cache-dropped is false. There's no reason
to use the previous value as a default in raw_reopen_prepare() because
bdrv_reopen_queue_child() already takes care of putting the old
options in the BDRVReopenState.options QDict.
If x-check-cache-dropped was previously set but is now missing from
the reopen QDict then it should be reset to false.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Image locking errors happening at device initialization time doesn't say
which file cannot be locked, for instance,
-device scsi-disk,drive=drive-1: Failed to get shared "write" lock
Is another process using the image?
could refer to either the overlay image or its backing image.
Hoist the error_append_hint to the caller of raw_check_lock_bytes where
file name is known, and include it in the error hint.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is a rare case which the size of last compressed cluster
is larger than the cluster size, which will cause the file is
not aligned at the sector boundary.
There are three reasons to do it. First, if vmdk doesn't align at
the sector boundary, there may be many undefined behaviors,
such as, in vbox it will show VMDK: Compressed image is corrupted
'syno-vm-disk1.vmdk' (VERR_ZIP_CORRUPTED) when we try to import an
ova with unaligned vmdk. Second, all the cluster_sector is aligned
to sector, the last one should be like this, too. Third, it ease
reading with sector based I/Os.
Signed-off-by: yuchenlin <yuchenlin@synology.com>
Message-Id: <20180913082952.3675-1-yuchenlin@synology.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
- Deprecate the "enforce-config-section" machine parameter
- Re-enable the wdt_ib700, endianness and vmxnet3 qtests
- Some trivial fixes and doc update patches that crossed my way
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJbqlsyAAoJEC7Z13T+cC21RbAP/3IvGfBxuRm6rBWoghjQgbl8
KU8nPnlZUtqjxmfUTILO/h+pJ3na5MQ8hh7v8JHi+xlQ2DPkECW21DtnfdxntVjw
+b+N5Ap6J22GHyEq4HJXPWAk2rDInqkU966DvL40RiMvOTfXdg9EO0TDX0VsVgZv
BR1r7/t3T0P7hiQ0XWb9U2JchRIC+Zgk34gXZPSTpoIv89fUhzNoK5LvAA6yV1FQ
TvE8VTKJm4wkqThH1ShtbJCBKjHjW/W8LYZr3YMothcs8vGjEdEcDL4BoJZDn3bF
h4VTkU+k8lp7W9LmlnPnu1WH/5ezhzdwJTeFaPJt4U10WKJptAS4vbK03DXlds9O
9d2BOXKrima2kSr1ejSe1f0kcE8fis1XFmSuhF61Nbw6ngT5+pP2JSc1XwFazd2K
zQwV4GXBLzAGnd4F2Ec+5TKzbGFVfczxeBDiBkkVmG+XdX/UXJpkpPYGAaw7DDiK
JwKVVYIPk1ll6MAbR6qEGsvE/adHNEm8lUdjXqwgbQlIeUZ2H0hCu9lJ0X81mtoQ
WZP+nMa/87COnlPX6VPVgxM2TXQOH/UbGz/WmYzZ6/gPKTX+gfwrHQGdp7Tjl33U
KxFKWioFnoqGuyWasvTtKEK67/IlrY+w1nXuuqKJg8J2/qx1SVtx45FHkRkxkIDx
4boRpx0XUqpDVdf8VhRB
=dXgp
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/huth-gitlab/tags/pull-request-2018-09-25' into staging
- Deprecate the usage of a network backend via "name" instead of "id"
- Deprecate the "enforce-config-section" machine parameter
- Re-enable the wdt_ib700, endianness and vmxnet3 qtests
- Some trivial fixes and doc update patches that crossed my way
# gpg: Signature made Tue 25 Sep 2018 16:58:42 BST
# gpg: using RSA key 2ED9D774FE702DB5
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>"
# gpg: aka "Thomas Huth <thuth@redhat.com>"
# gpg: aka "Thomas Huth <huth@tuxfamily.org>"
# gpg: aka "Thomas Huth <th.huth@posteo.de>"
# Primary key fingerprint: 27B8 8847 EEE0 2501 18F3 EAB9 2ED9 D774 FE70 2DB5
* remotes/huth-gitlab/tags/pull-request-2018-09-25:
Revert "check: Move VMXNET3 test to common"
Revert "check: Move endianess test to common"
Revert "check: Move wdt_ib700 test to common"
tests/migration: Speed up the test on ppc64
hw/qdev-core: Fix description of instance_init
qdev: fix a typo in comment
docs: Fix some typos (most found by codespell)
trivial: Make bios files and source files non-executable
memfd: fix possible usage of the uninitialized file descriptor
hw/core/machine: Officially deprecate the enforce-config-section parameter
net/slirp: Deprecate the [hub_id name] parameter tuple
net: Deprecate the "name" parameter of -net
Makefile: Add missing dependency for qemu-deprecated.texi
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
These files can not be executed on the host, so they should not be
marked as executable.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
When draining a block node, we recurse to its parent and for subtree
drains also to its children. A single AIO_WAIT_WHILE() is then used to
wait for bdrv_drain_poll() to become true, which depends on all of the
nodes we recursed to. However, if the respective child or parent becomes
quiescent and calls bdrv_wakeup(), only the AioWait of the child/parent
is checked, while AIO_WAIT_WHILE() depends on the AioWait of the
original node.
Fix this by using a single AioWait for all callers of AIO_WAIT_WHILE().
This may mean that the draining thread gets a few more unnecessary
wakeups because an unrelated operation got completed, but we already
wake it up when something _could_ have changed rather than only if it
has certainly changed.
Apart from that, drain is a slow path anyway. In theory it would be
possible to use wakeups more selectively and still correctly, but the
gains are likely not worth the additional complexity. In fact, this
patch is a nice simplification for some places in the code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bdrv_drain_poll_top_level() was buggy because it didn't release the
AioContext lock of the node to be drained before calling aio_poll().
This way, callbacks called by aio_poll() would possibly take the lock a
second time and run into a deadlock with a nested AIO_WAIT_WHILE() call.
However, it turns out that the aio_poll() call isn't actually needed any
more. It was introduced in commit 91af091f92, which is effectively
reverted by this patch. The cases it was supposed to fix are now covered
by bdrv_drain_poll(), which waits for block jobs to reach a quiescent
state.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Request callbacks can do pretty much anything, including operations that
will yield from the coroutine (such as draining the backend). In that
case, a decreased in_flight would be visible to other code and could
lead to a drain completing while the callback hasn't actually completed
yet.
Note that reordering these operations forbids calling drain directly
inside an AIO callback. As Paolo explains, indirectly calling it is
okay:
- Calling it through a coroutine is okay, because then
bdrv_drained_begin() goes through bdrv_co_yield_to_drain() and you
have in_flight=2 when bdrv_co_yield_to_drain() yields, then soon
in_flight=1 when the aio_co_wake() in the AIO callback completes, then
in_flight=0 after the bottom half starts.
- Calling it through a bottom half would be okay too, as long as the AIO
callback remembers to do inc_in_flight/dec_in_flight just like
bdrv_co_yield_to_drain() and bdrv_co_drain_bh_cb() do
A few more important cases that come to mind:
- A coroutine that yields because of I/O is okay, with a sequence
similar to bdrv_co_yield_to_drain().
- A coroutine that yields with no I/O pending will correctly decrease
in_flight to zero before yielding.
- Calling more AIO from the callback won't overflow the counter just
because of mutual recursion, because AIO functions always yield at
least once before invoking the callback.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
blk_unref() first decreases the refcount of the BlockBackend and calls
blk_delete() if the refcount reaches zero. Requests can still be in
flight at this point, they are only drained during blk_delete():
At this point, arbitrary callbacks can run. If any callback takes a
temporary BlockBackend reference, it will first increase the refcount to
1 and then decrease it to 0 again, triggering another blk_delete(). This
will cause a use-after-free crash in the outer blk_delete().
Fix it by draining the BlockBackend before decreasing to refcount to 0.
Assert in blk_ref() that it never takes the first refcount (which would
mean that the BlockBackend is already being deleted).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
A bdrv_drain operation must ensure that all parents are quiesced, this
includes BlockBackends. Otherwise, callbacks called by requests that are
completed on the BDS layer, but not quite yet on the BlockBackend layer
could still create new requests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
bdrv_do_drained_begin/end() assume that they are called with the
AioContext lock of bs held. If we call drain functions from a coroutine
with the AioContext lock held, we yield and schedule a BH to move out of
coroutine context. This means that the lock for the home context of the
coroutine is released and must be re-acquired in the bottom half.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
In qemu_laio_process_completions_and_submit, the AioContext is acquired
before the ioq_submit iteration and after qemu_laio_process_completions,
but the latter is not thread safe either.
This change avoids a number of random crashes when the Main Thread and
an IO Thread collide processing completions for the same AioContext.
This is an example of such crash:
- The IO Thread is trying to acquire the AioContext at aio_co_enter,
which evidences that it didn't lock it before:
Thread 3 (Thread 0x7fdfd8bd8700 (LWP 36743)):
#0 0x00007fdfe0dd542d in __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fdfe0dd0de6 in _L_lock_870 () at /lib64/libpthread.so.0
#2 0x00007fdfe0dd0cdf in __GI___pthread_mutex_lock (mutex=mutex@entry=0x5631fde0e6c0)
at ../nptl/pthread_mutex_lock.c:114
#3 0x00005631fc0603a7 in qemu_mutex_lock_impl (mutex=0x5631fde0e6c0, file=0x5631fc23520f "util/async.c", line=511) at util/qemu-thread-posix.c:66
#4 0x00005631fc05b558 in aio_co_enter (ctx=0x5631fde0e660, co=0x7fdfcc0c2b40) at util/async.c:493
#5 0x00005631fc05b5ac in aio_co_wake (co=<optimized out>) at util/async.c:478
#6 0x00005631fbfc51ad in qemu_laio_process_completion (laiocb=<optimized out>) at block/linux-aio.c:104
#7 0x00005631fbfc523c in qemu_laio_process_completions (s=s@entry=0x7fdfc0297670)
at block/linux-aio.c:222
#8 0x00005631fbfc5499 in qemu_laio_process_completions_and_submit (s=0x7fdfc0297670)
at block/linux-aio.c:237
#9 0x00005631fc05d978 in aio_dispatch_handlers (ctx=ctx@entry=0x5631fde0e660) at util/aio-posix.c:406
#10 0x00005631fc05e3ea in aio_poll (ctx=0x5631fde0e660, blocking=blocking@entry=true)
at util/aio-posix.c:693
#11 0x00005631fbd7ad96 in iothread_run (opaque=0x5631fde0e1c0) at iothread.c:64
#12 0x00007fdfe0dcee25 in start_thread (arg=0x7fdfd8bd8700) at pthread_create.c:308
#13 0x00007fdfe0afc34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
- The Main Thread is also processing completions from the same
AioContext, and crashes due to failed assertion at util/iov.c:78:
Thread 1 (Thread 0x7fdfeb5eac80 (LWP 36740)):
#0 0x00007fdfe0a391f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fdfe0a3a8e8 in __GI_abort () at abort.c:90
#2 0x00007fdfe0a32266 in __assert_fail_base (fmt=0x7fdfe0b84e68 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5631fc238ccb "offset == 0", file=file@entry=0x5631fc23698e "util/iov.c", line=line@entry=78, function=function@entry=0x5631fc236adc <__PRETTY_FUNCTION__.15220> "iov_memset")
at assert.c:92
#3 0x00007fdfe0a32312 in __GI___assert_fail (assertion=assertion@entry=0x5631fc238ccb "offset == 0", file=file@entry=0x5631fc23698e "util/iov.c", line=line@entry=78, function=function@entry=0x5631fc236adc <__PRETTY_FUNCTION__.15220> "iov_memset") at assert.c:101
#4 0x00005631fc065287 in iov_memset (iov=<optimized out>, iov_cnt=<optimized out>, offset=<optimized out>, offset@entry=65536, fillc=fillc@entry=0, bytes=15515191315812405248) at util/iov.c:78
#5 0x00005631fc065a63 in qemu_iovec_memset (qiov=<optimized out>, offset=offset@entry=65536, fillc=fillc@entry=0, bytes=<optimized out>) at util/iov.c:410
#6 0x00005631fbfc5178 in qemu_laio_process_completion (laiocb=0x7fdd920df630) at block/linux-aio.c:88
#7 0x00005631fbfc523c in qemu_laio_process_completions (s=s@entry=0x7fdfc0297670)
at block/linux-aio.c:222
#8 0x00005631fbfc5499 in qemu_laio_process_completions_and_submit (s=0x7fdfc0297670)
at block/linux-aio.c:237
#9 0x00005631fbfc54ed in qemu_laio_poll_cb (opaque=<optimized out>) at block/linux-aio.c:272
#10 0x00005631fc05d85e in run_poll_handlers_once (ctx=ctx@entry=0x5631fde0e660) at util/aio-posix.c:497
#11 0x00005631fc05e2ca in aio_poll (blocking=false, ctx=0x5631fde0e660) at util/aio-posix.c:574
#12 0x00005631fc05e2ca in aio_poll (ctx=0x5631fde0e660, blocking=blocking@entry=false)
at util/aio-posix.c:604
#13 0x00005631fbfcb8a3 in bdrv_do_drained_begin (ignore_parent=<optimized out>, recursive=<optimized out>, bs=<optimized out>) at block/io.c:273
#14 0x00005631fbfcb8a3 in bdrv_do_drained_begin (bs=0x5631fe8b6200, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at block/io.c:390
#15 0x00005631fbfbcd2e in blk_drain (blk=0x5631fe83ac80) at block/block-backend.c:1590
#16 0x00005631fbfbe138 in blk_remove_bs (blk=blk@entry=0x5631fe83ac80) at block/block-backend.c:774
#17 0x00005631fbfbe3d6 in blk_unref (blk=0x5631fe83ac80) at block/block-backend.c:401
#18 0x00005631fbfbe3d6 in blk_unref (blk=0x5631fe83ac80) at block/block-backend.c:449
#19 0x00005631fbfc9a69 in commit_complete (job=0x5631fe8b94b0, opaque=0x7fdfcc1bb080)
at block/commit.c:92
#20 0x00005631fbf7d662 in job_defer_to_main_loop_bh (opaque=0x7fdfcc1b4560) at job.c:973
#21 0x00005631fc05ad41 in aio_bh_poll (bh=0x7fdfcc01ad90) at util/async.c:90
#22 0x00005631fc05ad41 in aio_bh_poll (ctx=ctx@entry=0x5631fddffdb0) at util/async.c:118
#23 0x00005631fc05e210 in aio_dispatch (ctx=0x5631fddffdb0) at util/aio-posix.c:436
#24 0x00005631fc05ac1e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#25 0x00007fdfeaae44c9 in g_main_context_dispatch (context=0x5631fde00140) at gmain.c:3201
#26 0x00007fdfeaae44c9 in g_main_context_dispatch (context=context@entry=0x5631fde00140) at gmain.c:3854
#27 0x00005631fc05d503 in main_loop_wait () at util/main-loop.c:215
#28 0x00005631fc05d503 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#29 0x00005631fc05d503 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#30 0x00005631fbd81412 in main_loop () at vl.c:1866
#31 0x00005631fbc18ff3 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at vl.c:4647
- A closer examination shows that s->io_q.in_flight appears to have
gone backwards:
(gdb) frame 7
#7 0x00005631fbfc523c in qemu_laio_process_completions (s=s@entry=0x7fdfc0297670)
at block/linux-aio.c:222
222 qemu_laio_process_completion(laiocb);
(gdb) p s
$2 = (LinuxAioState *) 0x7fdfc0297670
(gdb) p *s
$3 = {aio_context = 0x5631fde0e660, ctx = 0x7fdfeb43b000, e = {rfd = 33, wfd = 33}, io_q = {plugged = 0,
in_queue = 0, in_flight = 4294967280, blocked = false, pending = {sqh_first = 0x0,
sqh_last = 0x7fdfc0297698}}, completion_bh = 0x7fdfc0280ef0, event_idx = 21, event_max = 241}
(gdb) p/x s->io_q.in_flight
$4 = 0xfffffff0
Signed-off-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180906130225.5118-8-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
For purposes of minimum code movement, refactor the mirror_exit
callback to use the post-finalization callbacks in a trivial way.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180906130225.5118-7-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
[mreitz: Added comment for the mirror_exit() function]
Signed-off-by: Max Reitz <mreitz@redhat.com>
In cases where we abort the block/mirror job, there's no point in
installing the new backing chain before we finish aborting.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180906130225.5118-6-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Use the component callbacks; prepare, abort, and clean.
NB: prepare is only called when the job has not yet failed;
and abort can be called after prepare.
complete -> prepare -> abort -> clean
complete -> abort -> clean
During refactor, a potential problem with bdrv_drop_intermediate
was identified, the patched behavior is no worse than the pre-patch
behavior, so leave a FIXME for now to be fixed in a future patch.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180906130225.5118-5-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-4-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-3-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add support for taking and passing forward job creation flags.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Message-id: 20180906130225.5118-2-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
The sslverify setting is supposed to turn off all TLS certificate
checks in libcurl. However because of the way we use it, it only
turns off peer certificate authenticity checks
(CURLOPT_SSL_VERIFYPEER). This patch makes it also turn off the check
that the server name in the certificate is the same as the server
you're connecting to (CURLOPT_SSL_VERIFYHOST).
We can use Google's server at 8.8.8.8 which happens to have a bad TLS
certificate to demonstrate this:
$ ./qemu-img create -q -f qcow2 -b 'json: { "file.sslverify": "off", "file.driver": "https", "file.url": "https://8.8.8.8/foo" }' /var/tmp/file.qcow2
qemu-img: /var/tmp/file.qcow2: CURL: Error opening file: SSL: no alternative certificate subject name matches target host name '8.8.8.8'
Could not open backing image to determine size.
With this patch applied, qemu-img connects to the server regardless of
the bad certificate:
$ ./qemu-img create -q -f qcow2 -b 'json: { "file.sslverify": "off", "file.driver": "https", "file.url": "https://8.8.8.8/foo" }' /var/tmp/file.qcow2
qemu-img: /var/tmp/file.qcow2: CURL: Error opening file: The requested URL returned error: 404 Not Found
(The 404 error is expected because 8.8.8.8 is not actually serving a
file called "/foo".)
Of course the default (without sslverify=off) remains to always check
the certificate:
$ ./qemu-img create -q -f qcow2 -b 'json: { "file.driver": "https", "file.url": "https://8.8.8.8/foo" }' /var/tmp/file.qcow2
qemu-img: /var/tmp/file.qcow2: CURL: Error opening file: SSL: no alternative certificate subject name matches target host name '8.8.8.8'
Could not open backing image to determine size.
Further information about the two settings is available here:
https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.htmlhttps://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Message-id: 20180914095622.19698-1-rjones@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
When we converted rbd to get rid of the older key/value-centric
encoding format, we broke compatibility with image files with backing
file strings encoded in the old format.
This leaves a bit of an ugly conundrum, and a hacky solution.
If the initial attempt to parse the "proper" options fails, it assumes
that we may have an older key/value encoded filename. Fall back to
attempting to parse the filename, and extract the required options from
it. If that fails, pass along the original error message.
We do not support mixed modern usage alongside legacy keyvalue pair
usage.
A deprecation warning has been added, although care should be taken
when actually deprecating since the impact is not limited to
commandline or qapi usage, but also opening existing images.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: 15b332e5432ad069441f7275a46080f465d789a0.1536704901.git.jcody@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Code movement to pull the conversion from Qdict to BlockdevOptionsRbd
into a helper function.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Message-id: 5b49a980f2cde6610ab1df41bb0277d00b5db893.1536704901.git.jcody@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
Rename opaque_job to job to be consistent with other job implementations.
Rename 'job', the BackupBlockJob object, to 's' to also be consistent.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-8-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Utilize the job_exit shim by not calling job_defer_to_main_loop, and
where applicable, converting the deferred callback into the job_exit
callback.
This converts backup, stream, create, and the unit tests all at once.
Most of these jobs do not see any changes to the order in which they
clean up their resources, except the test-blockjob-txn test, which
now puts down its bs before job_completed is called.
This is safe for the same reason the reordering in the mirror job is
safe, because job_completed no longer runs under two locks, making
the unref safe even if it causes a flush.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-7-jsnow@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Change the manual deferment to mirror_exit into the implicit
callback to job_exit and the mirror_exit callback.
This does change the order of some bdrv_unref calls and job_completed,
but thanks to the new context in which we call .exit, this is safe to
defer the possible flushing of any nodes to the job_finalize_single
cleanup stage.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180830015734.19765-6-jsnow@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Change the manual deferment to commit_complete into the implicit
callback to job_exit, renaming commit_complete to commit_exit.
This conversion does change the timing of when job_completed is
called to after the bdrv_replace_node and bdrv_unref calls, which
could have implications for bjob->blk which will now be put down
after this cleanup.
Kevin highlights that we did not take any permissions for that backend
at job creation time, so it is safe to reorder these operations.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-5-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Jobs presently use both an Error object in the case of the create job,
and char strings in the case of generic errors elsewhere.
Unify the two paths as just j->err, and remove the extra argument from
job_completed. The integer error code for job_completed is kept for now,
to be removed shortly in a separate patch.
Signed-off-by: John Snow <jsnow@redhat.com>
Message-id: 20180830015734.19765-3-jsnow@redhat.com
[mreitz: Dropped a superfluous g_strdup()]
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Presently we codify the entry point for a job as the "start" callback,
but a more apt name would be "run" to clarify the idea that when this
function returns we consider the job to have "finished," except for
any cleanup which occurs in separate callbacks later.
As part of this clarification, change the signature to include an error
object and a return code. The error ptr is not yet used, and the return
code while captured, will be overwritten by actions in the job_completed
function.
Signed-off-by: John Snow <jsnow@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180830015734.19765-2-jsnow@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
The generated qapi_event_send_FOO() take an Error ** argument. They
can't actually fail, because all they do with the argument is passing it
to functions that can't fail: the QObject output visitor, and the
@qmp_emit callback, which is either monitor_qapi_event_queue() or
event_test_emit().
Drop the argument, and pass &error_abort to the QObject output visitor
and @qmp_emit instead.
Suggested-by: Eric Blake <eblake@redhat.com>
Suggested-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20180815133747.25032-4-peterx@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Commit message rewritten, update to qapi-code-gen.txt corrected]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
.bdrv_close handler is optional after previous commit, no needs to keep
empty functions more.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
blockdev-mirror with the same node for source and target segfaults
today: A node is in its own backing chain, so mirror_start_job() decides
that this is an active commit. When adding the intermediate nodes with
block_job_add_bdrv(), it starts the iteration through the subchain with
the backing file of source, though, so it never reaches target and
instead runs into NULL at the base.
While we could fix that by starting with source itself, there is no
point in allowing mirroring a node into itself and I wouldn't be
surprised if this caused more problems later.
So just check for this scenario and error out.
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
This reinstates commit b008326744,
which was temporarily reverted for the 3.0 release so that libvirt gets
some extra time to update their command lines.
The -drive option serial was deprecated in QEMU 2.10. It's time to
remove it.
Tests need to be updated to set the serial number with -global instead
of using the -drive option.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Format drivers such as qcow2 don't allow sharing the same image between
two QEMU instances in order to prevent image corruptions, because of
metadata cache. LUKS driver don't modify metadata except for when
creating image, so it is safe to relax the permission. This makes
share-rw=on property work on virtual devices.
Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 6fccbb475b fixed a bug caused by
QEMU attempting to remove a throttle group member with no pending
requests but an active timer set. This was the result of a previous
bdrv_drained_begin() call processing the throttled requests but
leaving the timer untouched.
Although the commit does solve the problem, the situation shouldn't
happen in the first place. If we try to drain a throttle group member
which has a timer set, we should cancel the timer instead of ignoring
it.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the throttling code after an I/O request has been completed the
next one is selected from a different member using a round-robin
algorithm. This ensures that all members get a chance to finish their
pending I/O requests.
However, if a group member has its I/O limits disabled (because it's
being drained) then we should always give it priority in order to have
all its pending requests finished as soon as possible.
If we don't do this we could have a member in the process of being
drained waiting for the throttled requests of other members, for which
the I/O limits still apply.
This can have additional consequences: if we're running in qtest mode
(with QEMU_CLOCK_VIRTUAL) then timers can only fire if we advance the
clock manually, so attempting to drain a block device can hang QEMU in
the BDRV_POLL_WHILE() loop at the end of bdrv_do_drained_begin().
Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For BlockBackends that are skipped in query-blockstats, we would leak
info since commit 567dcb31. Allocate info only later to avoid the memory
leak.
Fixes: CID 1394727
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
bdrv_io_plug/bdrv_io_unplug take care of keeping a nesting count,
so change s->plugged to just a bool.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20180813144320.12382-2-pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
It is wrong to leave this field as 1, as nvme_close() called in the
error handling code in nvme_file_open() will use it and try to free
s->queues again.
Another problem is the cleaning ups are duplicated between the fail*
labels of nvme_init() and nvme_file_open(), which calls nvme_close().
A third problem is nvme_close() misses g_free() and
event_notifier_cleanup().
Fix all of them.
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-Id: <20180712025420.4932-1-famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Consistent with query-block, query-blockstats should not only include
named BlockBackends, but also those that are anonymous, but belong to a
device model.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Like for query-block, the client needs to identify which BlockBackend
the returned data is for. Anonymous BlockBackends are identified by the
device model they are attached to. Add a 'qdev' field that contains the
qdev ID or QOM path of the attached device model.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In
other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.
This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since 42a3e1ab36 qemu asserts when using the
vvfat driver:
git clone git://qemu.org/qemu.git
cd qemu
./configure --target-list=ppc-softmmu --enable-debug
make -j8
mkdir foo
touch foo/hello
./ppc-softmmu/qemu-system-ppc -M prep --nographic --monitor null \
-hda fat:rw:./foo
"Ctrl-C"
qemu-system-ppc: block.c:3368: bdrv_close_all: Assertion \
`((&all_bdrv_states)->tqh_first == ((void *)0))' failed.
This is because we reference bs twice in qcow_co_create(..) one time in
bdrv_open_blockdev_ref(..) and in blk_insert_bs(..) but we unref it only once
in blk_unref which leads to the reference leak.
Note that I didn't tested much QCOW after this change as I don't use it much.
Signed-off-by: KONRAD Frederic <frederic.konrad@adacore.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qstring_from_substr() takes the index of the substring's first and
last character. qstring_from_substr(s, 0, SIZE_MAX) denotes an empty
substring. Awkward.
Shift the end index one to the right. This simplifies both
qstring_from_substr() and its callers.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180727062204.10401-3-armbru@redhat.com>
In ed6e2161 ("linux-aio: properly bubble up errors from initialzation"),
I only added a bdrv_attach_aio_context callback for the bdrv_file
driver. There are several other drivers that use the shared
aio_plug callback, though, and they will trip the assertion added to
aio_get_linux_aio because they did not call aio_setup_linux_aio first.
Add the appropriate callback definition to the affected driver
definitions.
Fixes: ed6e2161 ("linux-aio: properly bubble up errors from initialization")
Reported-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180718211256.29774-1-naravamudan@digitalocean.com
Cc: Eric Blake <eblake@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: John Snow <jsnow@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <famz@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-devel@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It's annoying to see this debug message every time you use vvfat.
Disable it with the DLOG() macro by default, as it is done with the
other debug messages in this file.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: John Snow <jsnow@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Throttle groups consist of members sharing one throttling state
(including bps/iops limits). Round-robin scheduling is used to ensure
fairness. If a group member already has a timer pending then other
groups members do not schedule their own timers. The next group member
will have its turn when the existing timer expires.
A hang may occur when a group member leaves while it had a timer
scheduled. Although the code carefully removes the group member from
the round-robin list, it does not schedule the next member. Therefore
remaining members continue to wait for the removed member's timer to
expire.
This patch schedules the next request if a timer is pending.
Unfortunately the actual bug is a race condition that I've been unable
to capture in a test case.
Sometimes drive2 hangs when drive1 is removed from the throttling group:
$ qemu ... -drive if=none,id=drive1,cache=none,format=qcow2,file=data1.qcow2,iops=100,group=foo \
-device virtio-blk-pci,id=virtio-blk-pci0,drive=drive1 \
-drive if=none,id=drive2,cache=none,format=qcow2,file=data2.qcow2,iops=10,group=foo \
-device virtio-blk-pci,id=virtio-blk-pci1,drive=drive2
(guest-console1)# fio -filename /dev/vda 4k-seq-read.job
(guest-console2)# fio -filename /dev/vdb 4k-seq-read.job
(qmp) {"execute": "block_set_io_throttle", "arguments": {"device": "drive1","bps": 0,"bps_rd": 0,"bps_wr": 0,"iops": 0,"iops_rd": 0,"iops_wr": 0}}
Reported-by: Nini Gu <ngu@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20180704145410.794-1-stefanha@redhat.com
RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1535914
Cc: Alberto Garcia <berto@igalia.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Adjust each caller of raw_open_common to specify if they are expecting
host and character devices or not. Tighten expectations of file types upon
open in the common code and refuse types that are not expected.
This has two effects:
(1) Character and block devices are now considered deprecated for the
'file' driver, which expects only S_IFREG, and
(2) no file-posix driver (file, host_cdrom, or host_device) can open
directories now.
I don't think there's a legitimate reason to open directories as if
they were files. This prevents QEMU from opening and attempting to probe
a directory inode, which can break in exciting ways. One of those ways
is lseek on ext4/xfs, which will return 0x7fffffffffffffff as the file
size instead of EISDIR. This can coax QEMU into responding with a
confusing "file too big" instead of "Hey, that's not a file".
See: https://bugs.launchpad.net/qemu/+bug/1739304/
Signed-off-by: John Snow <jsnow@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Truncation is the last to convert from open coded req handling to
reusing helpers. This time the permission check in prepare has to adapt
to the new caller: it checks a different permission bit, and doesn't
trigger the before write notifier.
Also, truncation should always trigger a bs->total_sectors update and in
turn call parent resize_cb. Update the condition in finish helper, too.
It's intended to do a duplicated bs->read_only check before calling
bdrv_co_write_req_prepare() so that we can be more informative with the
error message, as bdrv_co_write_req_prepare() doesn't have Error
parameter.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we are growing the image and potentially using preallocation for the
new area, we need to make sure that no write requests are made to the
"preallocated" area which is [@old_size, @offset), not
[@offset, offset * 2 - @old_size).
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This brings the request handling logic inline with write and discard,
fixing write_gen, resize_cb, dirty bitmaps and image size refreshing.
The last of these issues broke iotest case 222, which is now fixed.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reuse the new bdrv_co_write_req_prepare/finish helpers. The variation
here is that discard requests don't affect bs->wr_highest_offset, and it
cannot extend the image.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Two problems exist when a write request that enlarges the image (i.e.
write beyond EOF) finishes:
1) parent is not notified about size change;
2) dirty bitmap is not resized although we try to set the dirty bits;
Fix them just like how bdrv_co_truncate works.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As a mechanical refactoring patch, this is the first step towards
unified and more correct write code paths. This is helpful because
multiple BlockDriverState fields need to be updated after modifying
image data, and it's hard to maintain in multiple places such as copy
offload, discard and truncate.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This matches the types used for bytes in the rest parts of block layer.
In the case of bdrv_co_truncate, new_bytes can be the image size which
probably doesn't fit in a 32 bit int.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Other I/O functions are already using a BdrvChild pointer in the API, so
make discard do the same. It makes it possible to initiate the same
permission checks before doing I/O, and much easier to share the
helper functions for this, which will be added and used by write,
truncate and copy range paths.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A few trace points that can help reveal what is happening in a copy
offloading I/O path.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With in one module, trace points usually have a common prefix named
after the module name. paio_submit and paio_submit_co are the only two
trace points so far in the two file protocol drivers. As we are adding
more, having a common prefix here is better so that trace points can be
enabled with a glob. Rename them.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This reverts commit b008326744.
Hold off removing this for one more QEMU release (current libvirt
release still uses it.)
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The sector size needs to be large enough to accommodate the data
structures for the log super block and log write entries. This was
previously not properly checked, which made it possible to cause
QEMU to badly misbehave.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Fleecing scheme works as follows: we want a kind of temporary snapshot
of active drive A. We create temporary image B, with B->backing = A.
Then we start backup(sync=none) from A to B. From this point, B reads
as point-in-time snapshot of A (A continues to be active drive,
accepting guest IO).
This scheme needs some additional synchronization between reads from B
and backup COW operations, otherwise, the following situation is
theoretically possible:
(assume B is qcow2, client is NBD client, reading from B)
1. client starts reading and take qcow2 mutex in qcow2_co_preadv, and
goes up to l2 table loading (assume cache miss)
2) guest write => backup COW => qcow2 write =>
try to take qcow2 mutex => waiting
3. l2 table loaded, we see that cluster is UNALLOCATED, go to
"case QCOW2_CLUSTER_UNALLOCATED" and unlock mutex before
bdrv_co_preadv(bs->backing, ...)
4) aha, mutex unlocked, backup COW continues, and we finally finish
guest write and change cluster in our active disk A
5. actually, do bdrv_co_preadv(bs->backing, ...) and read
_new updated_ data.
To avoid this, let's make backup writes serializing, to not intersect
with reads from B.
Note: we expand range of handled cases from (sync=none and
B->backing = A) to just (A in backing chain of B), to finally allow
safe reading from B during backup for all cases when A in backing chain
of B, i.e. B formally looks like point-in-time snapshot of A.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Serialized writes should be used in copy-on-write of backup(sync=none)
for image fleecing scheme.
We need to change an assert in bdrv_aligned_pwritev, added in
28de2dcd88. The assert may fail now, because call to
wait_serialising_requests here may become first call to it for this
request with serializing flag set. It occurs if the request is aligned
(otherwise, we should already set serializing flag before calling
bdrv_aligned_pwritev and correspondingly waited for all intersecting
requests). However, for aligned requests, we should not care about
outdating of previously read data, as there no such data. Therefore,
let's just update an assert to not care about aligned requests.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Pass read flags and write flags separately. This is needed to handle
coming BDRV_REQ_NO_SERIALISING clearly in following patches.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Here two things are fixed:
1. Architecture
On each recursion step, we go to the child of src or dst, only for one
of them. So, it's wrong to create tracked requests for both on each
step. It leads to tracked requests duplication.
2. Wait for serializing requests on write path independently of
BDRV_REQ_NO_SERIALISING
Before commit 9ded4a0114 "backup: Use copy offloading",
BDRV_REQ_NO_SERIALISING was used for only one case: read in
copy-on-write operation during backup. Also, the flag was handled only
on read path (in bdrv_co_preadv and bdrv_aligned_preadv).
After 9ded4a0114, flag is used for not waiting serializing operations
on backup target (in same case of copy-on-write operation). This
behavior change is unsubstantiated and potentially dangerous, let's
drop it and add additional asserts and documentation.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the virtual disk size isn't aligned to full clusters,
bdrv_co_do_copy_on_readv() may get pnum == 0 before having the full
cluster completed, which will let it run into an assertion failure:
qemu-io: block/io.c:1203: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed.
Check for EOF, assert that we read at least as much as the read request
originally wanted to have (which is true at EOF because otherwise
bdrv_check_byte_request() would already have returned an error) and
return success early even though we couldn't copy the full cluster.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit dcf94a23b1 ('block: Don't poll in parent drain callbacks')
removed polling in bdrv_child_cb_drained_begin() on the grounds that the
original bdrv_drain() already will poll and BdrvChildRole.drained_begin
calls must not cause graph changes (and therefore must not call
aio_poll() or the recursion through the graph will break.
This reasoning is correct for calls through bdrv_do_drained_begin().
However, BdrvChildRole.drained_begin is also called when a node that is
already in a drained section (i.e. bdrv_do_drained_begin() has already
returned and therefore can't poll any more) is attached to a new parent.
In this case, we must explicitly poll to have all requests completed
before the drained new child can be attached to the parent.
In bdrv_replace_child_noperm(), we know that we're not inside the
recursion of bdrv_do_drained_begin() because graph changes are not
allowed there, and bdrv_replace_child_noperm() is a graph change. The
call of BdrvChildRole.drained_begin() must therefore be followed by a
BDRV_POLL_WHILE() that waits for the completion of requests.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
VMDK performs a probing check in vmdk_co_create_opts() to prevent the
user from assigning non-VMDK files as a backing file, because it only
supports VMDK backing files. However, with the @backing runtime option,
it is possible to assign arbitrary nodes as backing nodes, regardless of
what the image header says. Therefore, VMDK may not just access backing
nodes assuming they are VMDK nodes -- which it does, because it needs to
compare the backing file's CID with the overlay's parentCID value, and
naturally the backing file only has a CID when it's a VMDK file.
Instead, it should report the CID of non-VMDK backing files not to match
the overlay because clearly a non-present CID does not match.
Without this change, vmdk_read_cid() reads from the backing file's
bs->file, which may be NULL (in which case we get a segfault). Also, it
interprets bs->opaque as a BDRVVmdkState and then reads from the
.desc_offset field, which usually will just return some arbitrary value
which then results in either garbage to be read, or bdrv_pread() to
return an error, both of which result in a non-matching CID to be
reported.
(In a very unlikely case, we could read something that looks like a
VMDK descriptor, and then get a CID which might actually match. But
that is highly unlikely, and the only result would be that VMDK accepts
the backing file which is not too bad (albeit unintentional).)
((And in theory, the seek to .desc_offset might leak data from another
block driver's opaque object. But then again, the user should realize
very quickly that a non-VMDK backing file does not work (because the
read will very likely fail, due to the reasons given above), so this
should not be exploitable.))
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-id: 20180702210721.4847-2-mreitz@redhat.com
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
One of them is a typo. But update both to be more readable.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180702025836.20957-3-famz@redhat.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
- qcow2: Use worker threads for compression to improve performance of
'qemu-img convert -W' and compressed backup jobs
- blklogwrites: New filter driver to log write requests to an image in
the dm-log-writes format
- file-posix: Fix image locking during image creation
- crypto: Fix memory leak in error path
- Error out instead of silently truncating node names
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbPfHhAAoJEH8JsnLIjy/Wxp8P/jeEJ/jJ0N2doix5KJSvtkdP
dYx4cJxHg9tMSnmBfA49qc1ilgbzZqjbF/wH3Ko3zPsmFP0xsxx+9zE4c3iL7HA6
6E4/r3szznO8DIxdgzAIeHjZc7YPpsY3alrthT55eR/FnDyc8zofi/iUMgvTKPYA
8TnWqgi2fdaVwNy+lIRJFdhzzQNtPxOEO+0DFHaOZvQ9vlc5DGPC+Hj3Qc11GK8M
u7orkeYqQPknxy0hJKPxtWHvzCQUJMcSSs6PbuslnzOerYiQLmx+RVIwMhXfVZjV
lXe2SppAszBujtcIENhZlj1cECs7MrWTmFDcWvBA+Mh/JhFEWwykmlQHYrjCqdvw
QyWhVLgP/6jQDaBls6LADSZDxX7i0sV27DzVGN4gYUX/KcyVPEDROm90MyE8nxN0
Y5hhujTkzu94Zvd/OlKZRs4tPxmbRrd2SWy0id8Kj5/gO/+UWECOZCrPrlB5uzec
bsxHoeXLVe2/56JkCIiOw3hLILMY6gPLJgeaQjz6hu1oZJqYuoof8grVFjUCSH0C
BBlz8O4upHdPIKNRl4rmdgwasg5YIiJ7SFl4h7KMCD0elkKmo3SyTKosagvqrpVN
JvW1YuosCedcfYIKNPnZdEGWDKuTaZtcnZnM9IRY6hr8ySMw1LloFsgXsJ3tqTwe
wb1Wjm4Qlz5BaoB7o/Os
=z5AC
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block layer patches:
- qcow2: Use worker threads for compression to improve performance of
'qemu-img convert -W' and compressed backup jobs
- blklogwrites: New filter driver to log write requests to an image in
the dm-log-writes format
- file-posix: Fix image locking during image creation
- crypto: Fix memory leak in error path
- Error out instead of silently truncating node names
# gpg: Signature made Thu 05 Jul 2018 11:24:33 BST
# gpg: using RSA key 7F09B272C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* remotes/kevin/tags/for-upstream:
file-posix: Unlock FD after creation
file-posix: Fix creation locking
block/blklogwrites: Add an option for the update interval of the log superblock
block/blklogwrites: Add an option for appending to an old log
block/blklogwrites: Change log_sector_size from int64_t to uint64_t
block/crypto: Fix memory leak in create error path
block: Don't silently truncate node names
block: Add blklogwrites
block: Move two block permission constants to the relevant enum
qcow2: add compress threads
qcow2: refactor data compression
qemu-img: allow compressed not-in-order writes
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Closing the FD does not necessarily mean that it is unlocked. Fix this
by relinquishing all permission locks before qemu_close().
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
raw_apply_lock_bytes() takes a bit mask of "permissions that are NOT
shared".
Also, make the "perm" and "shared" variables uint64_t, because I do not
particularly like using ~ on signed integers (and other permission masks
are usually uint64_t, too).
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a way to ensure that the log superblock is periodically
updated. Before, this was only done on flush requests, which may
not be enough if the VM exits abnormally, omitting the final flush.
The default interval is 4096 write requests.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Suggested by Kevin Wolf. May be useful when testing multiple batches
of writes or doing long-term testing involving restarts of the VM.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This was a simple oversight when working on intermediate versions
of the original patch which introduced blklogwrites.
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Implements a block device write logging system, similar to Linux kernel
device mapper dm-log-writes. The write operations that are performed
on a block device are logged to a file or another block device. The
write log format is identical to the dm-log-writes format. Currently,
log markers are not supported.
This functionality can be used for crash consistency and fs consistency
testing. By implementing it in qemu, tests utilizing write logs can be
be used to test non-Linux drivers and older kernels.
The driver accepts an optional parameter to set the sector size used
for logging. This makes the driver require all requests to be aligned
to this sector size and also makes offsets and sizes of writes in the
log metadata to be expressed in terms of this value (the log format has
a granularity of one sector for offsets and sizes). This allows
accurate logging of writes to guest block devices that have unusual
sector sizes.
The implementation is based on the blkverify and blkdebug block
drivers.
Signed-off-by: Aapo Vienamo <aapo@tuxera.com>
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Do data compression in separate threads. This significantly improve
performance for qemu-img convert with -W (allow async writes) and -c
(compressed) options.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make a separate function for compression to be parallelized later.
- use .avail_out field instead of .next_out to calculate size of
compressed data. It looks more natural and it allows to keep dest to
be void pointer
- set avail_out to be at least one byte less than input, to be sure
avoid inefficient compression earlier
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Bitmap lock/unlock were added to bdrv_enable_dirty_bitmap in
8b1402ce80, but some places were not updated correspondingly, which
leads to trying to take this lock twice, which is dead-lock. Fix this.
Actually, iotest 199 (about dirty bitmap postcopy migration) is broken
now, and this fixes it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180625165745.25259-3-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Add _locked version of bdrv_enable_dirty_bitmap, to fix dirty bitmap
migration in the following patch.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-id: 20180625165745.25259-2-vsementsov@virtuozzo.com
Signed-off-by: John Snow <jsnow@redhat.com>
Bug fixes and iotest exposure of fleecing via NBD (serving a
read-only point-in-time view via blockdev-backup sync:none,
as well as serving dirty bitmaps over NBD), including a new
x-dirty-bitmap parameter when opening NBD clients as the
counterpart to x-nbd-server-add-bitmap. Also a random fix
for iscsi block_status spotted by Coverity that missed other
miscellaneous trees.
- Eric Blake: nbd/server: Fix dirty bitmap logic regression
- Eric Blake: iscsi: Avoid potential for get_status overflow
- John Snow/Vladimir Sementsov-Ogievskiy: 0/2 block: formalize and test fleecing
- Eric Blake: 0/2 test NBD bitmap export
-----BEGIN PGP SIGNATURE-----
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
iQEcBAABCAAGBQJbOtJPAAoJEKeha0olJ0NqEvwH/3FwWnlBdBvdYGgPjzGE1Atm
ofCKcyxE/2VJtxeWlZQHzs1VqSq81s7am5SdzOrIWnQekvHFcLu6/71RABiauzMd
neCvVOrXOVdktj1i1Z2Gg4BgjDmqbTDlo5ssVh/oXP0Zebi6OZFfQrB7y3cGBvui
4XI7lW9qJxt6F1FlKloXnofWRDENyo5vgdz6QjQXfauthw2T5045RIPTfiz03FCp
fbs+6K0+bKxfPdNLrqxxOZo/loYnEXbDYv6VBAIWBqztXVnMHxCqnh0YN05jwsfF
TRW0/YT8lpWarOZ1soIC6a/OGXQZbxgRhZ+Zr+Wa2jw0YNHJanU9isxi37aUqQo=
=Xivx
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2018-07-02' into staging
nbd patches for 2018-07-02
Bug fixes and iotest exposure of fleecing via NBD (serving a
read-only point-in-time view via blockdev-backup sync:none,
as well as serving dirty bitmaps over NBD), including a new
x-dirty-bitmap parameter when opening NBD clients as the
counterpart to x-nbd-server-add-bitmap. Also a random fix
for iscsi block_status spotted by Coverity that missed other
miscellaneous trees.
- Eric Blake: nbd/server: Fix dirty bitmap logic regression
- Eric Blake: iscsi: Avoid potential for get_status overflow
- John Snow/Vladimir Sementsov-Ogievskiy: 0/2 block: formalize and test fleecing
- Eric Blake: 0/2 test NBD bitmap export
# gpg: Signature made Tue 03 Jul 2018 02:33:03 BST
# gpg: using RSA key A7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>"
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>"
# gpg: aka "[jpeg image of size 6874]"
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-nbd-2018-07-02:
iotests: New test 223 for exporting dirty bitmap over NBD
nbd/client: Add x-dirty-bitmap to query bitmap from server
iotests: add 222 to test basic fleecing
blockdev: enable non-root nodes for backup source
iscsi: Avoid potential for get_status overflow
nbd/server: Fix dirty bitmap logic regression
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The implementation is similar to the 'qemu-img convert'. In the
beginning of the job, offloaded copy is attempted. If it fails, further
I/O will go through the existing bounce buffer code path.
Then, as Kevin pointed out, both this and qemu-img convert can benefit
from a local check if one request fails because of, for example, the
offset is beyond EOF, but another may well be accepted by the protocol
layer. This will be implemented separately.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-4-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
This semantics is needed by drive-backup so implement it before using
this API there.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-3-famz@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
src may be NULL if BDRV_REQ_ZERO_WRITE flag is set, in this case only
check dst and dst->bs. This bug was introduced when moving in the
request tracking code from bdrv_co_copy_range, in 37aec7d75e.
This especially fixes the possible segfault when initializing src_bs
with a NULL src.
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 20180703023758.14422-2-famz@redhat.com
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
In order to test that the NBD server is properly advertising
dirty bitmaps, we need a bare minimum client that can request
and read the context. Since feature freeze for 3.0 is imminent,
this is the smallest workable patch, which replaces the qemu
block status report with the results of the NBD server's dirty
bitmap (making it very easy to use 'qemu-img map --output=json'
to learn where the dirty portions are). Note that the NBD
protocol defines a dirty section with the same bit but opposite
sense that normal "base:allocation" uses to report an allocated
section; so in qemu-img map output, "data":true corresponds to
clean, "data":false corresponds to dirty.
A more complete solution that allows dirty bitmaps to be queried
at the same time as normal block status will be required before
this addition can lose the x- prefix. Until then, the fact that
this replaces normal status with dirty status means actions
like 'qemu-img convert' will likely misbehave due to treating
dirty regions of the file as if they are unallocated.
The next patch adds an iotest to exercise this new code.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180702191458.28741-2-eblake@redhat.com>
Detected by Coverity: Multiplying two 32-bit int and assigning
the result to a 64-bit number is a risk of overflow. Prior to
the conversion to byte-based interfaces, the block layer took
care of ensuring that a status request never exceeded 2G in
the driver; but after that conversion, the block layer expects
drivers to deal with any size request (the driver can always
truncate the request size back down, as long as it makes
progress). So, in the off-chance that someone makes a large
request, we are at the mercy of whether iscsi_get_lba_status_task()
will cap things to at most INT_MAX / iscsilun->block_size when
it populates lbasd->num_blocks; since I could not easily audit
that, it's better to be safe than sorry by just forcing a 64-bit
multiply.
Fixes: 92809c36
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180508212718.1482663-1-eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Message-Id: <20180625124238.25339-3-f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that all callers of vectored I/O have been converted
to use our preferred byte-based bdrv_co_p{read,write}v(), we can
delete the unused bdrv_co_{read,write}v().
Furthermore, this gets rid of the signature difference between the
public bdrv_co_writev() and the callback .bdrv_co_writev (the
latter still exists, because some drivers still need more work
before they are fully byte-based).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based calls
into the block layer from the vhdx driver.
Ideally, the vhdx driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based calls
into the block layer from the replication driver.
Ideally, the replication driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. The qcow driver is now ready to fully utilize the
byte-based callback interface, as long as we override the default
alignment to still be 512 (needed at least for asserts present
because of encryption, but easier to do everywhere than to audit
which sub-sector requests are handled correctly, especially since
we no longer recommend qcow for new disk images).
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the internals of the qcow
driver write function, by iterating over offset/bytes instead of
sector_num/nb_sectors, and with a rename of index_in_cluster and
repurposing of n to track bytes instead of sectors.
A later patch will then switch the qcow driver as a whole over
to byte-based operation.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the internals of the qcow
driver read function, by iterating over offset/bytes instead of
sector_num/nb_sectors, and with a rename of index_in_cluster and
repurposing of n to track bytes instead of sectors.
A later patch will then switch the qcow driver as a whole over
to byte-based operation.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the internal helper function
get_cluster_offset(), by changing n_start and n_end to be byte
offsets rather than sector indices within the cluster being
allocated. However, assert that these values are still
sector-aligned (at least qcrypto_block_encrypt() still wants that).
For now we get that alignment for free because we still use
sector-based driver callbacks.
A later patch will then switch the qcow driver as a whole over
to byte-based operation; but will still leave things at sector
alignments as it is not worth auditing the qcow image format
to worry about sub-sector requests.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are gradually moving away from sector-based interfaces, towards
byte-based. Make the change for the last few sector-based calls
into the block layer from the parallels driver.
Ideally, the parallels driver should switch to doing everything
byte-based, but that's a more invasive change that requires a
bit more auditing.
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
EINTR should be checked against errno, not ret. While fixing the bug,
collect the branches with a switch block.
Also, change the return value from -ENOSTUP to -ENOSPC when the actual
issue is request range passes EOF, which should be distinguishable from
the case of error == ENOSYS by the caller, so that it could still retry
with other byte ranges, whereas it shouldn't retry anymore upon ENOSYS.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Per SCSI definition the designator_length we receive from INQUIRY is 8,
12 or at most 16, but we should be careful because the remote iscsi
target may misbehave, otherwise we could have a buffer overflow.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Not updating src_offset will result in wrong data being written to dst
image.
Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This simplifies file-posix by implementing the coroutine variants of
the discard and flush BlockDriver callbacks. These were the last
remaining users of paio_submit(), which can be removed now.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
If we managed to allocate the clusters, but then failed to write the
data, there's a good chance that we'll still be able to free the
clusters again in order to avoid cluster leaks (the refcounts are
cached, so even if we can't write them out right now, we may be able to
do so when the VM is resumed after a werror=stop/enospc pause).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Eric Blake <eblake@redhat.com>
block_crypto_open_opts_init() and block_crypto_create_opts_init()
contain a virtual visit of QCryptoBlockOptions and
QCryptoBlockCreateOptions less member "format", respectively.
Change their callers to put member "format" in the QDict, so they can
use the generated visitors for these types instead.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
in_flight and tracked requests need to be tracked in every layer during
recursion. For now the only user is qemu-img convert where overlapping
requests and IOThreads don't exist, therefore this change doesn't make
much difference form user point of view, but it is incorrect as part of
the API. Fix it.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the beginning of the function, we initialize the local variable to 0,
and in the body of the function, we check the assigned values and exit
the loop immediately. So here it can never be non-zero.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This moves the code to resize an image file to the thread pool to avoid
blocking.
Creating large images with preallocation with blockdev-create is now
actually a background job instead of blocking the monitor (and most
other things) until the preallocation has completed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
When growing an image, block drivers (especially protocol drivers) may
initialise the newly added area. I/O requests to the same area need to
wait for this initialisation to be completed so that data writes don't
get overwritten and reads don't read uninitialised data.
To avoid overhead in the fast I/O path by adding new locking in the
protocol drivers and to restrict the impact to requests that actually
touch the new area, reuse the existing tracked request infrastructure in
block/io.c and mark all discard requests as serialising.
With this change, it is safe for protocol drivers to make
.bdrv_co_truncate actually asynchronous.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
This moves the bdrv_truncate() implementation from block.c to block/io.c
so it can have access to the tracked requests infrastructure.
This involves making refresh_total_sectors() public (in block_int.h).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers are coroutine_fns now, so we can just directly call
preallocate_co().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_truncate() is an operation that can block (even for a quite long
time, depending on the PreallocMode) in I/O paths that shouldn't block.
Convert it to a coroutine_fn so that we have the infrastructure for
drivers to make their .bdrv_co_truncate implementation asynchronous.
This change could potentially introduce new race conditions because
bdrv_truncate() isn't necessarily executed atomically any more. Whether
this is a problem needs to be evaluated for each block driver that
supports truncate:
* file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The
protocol drivers are trivially safe because they don't actually yield
yet, so there is no change in behaviour.
* copy-on-read, crypto, raw-format: Essentially just filter drivers that
pass the request to a child node, no problem.
* qcow2: The implementation modifies metadata, so it needs to hold
s->lock to be safe with concurrent I/O requests. In order to avoid
double locking, this requires pulling the locking out into
preallocate_co() and using qcow2_write_caches() instead of
bdrv_flush().
* qed: Does a single header update, this is fine without locking.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
If qcow2_alloc_clusters_at() returns an error, we do need to negate it
to get back the positive errno code for error_setg_errno(), but we still
need to return the negative error code.
Fixes: 772d1f973f
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Coverity can't see that qobject_input_visitor_new_flat_confused()
returns non-null when it doesn't set @local_err. Check the return
value instead, like all the other callers do.
Fixes: CID 1393615
Fixes: CID 1393616
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
laio_init() can fail for a couple of reasons, which will lead to a NULL
pointer dereference in laio_attach_aio_context().
To solve this, add a aio_setup_linux_aio() function which is called
early in raw_open_common. If this fails, propagate the error up. The
signature of aio_get_linux_aio() was not modified, because it seems
preferable to return the actual errno from the possible failing
initialization calls.
Additionally, when the AioContext changes, we need to associate a
LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
callback and call the new aio_setup_linux_aio(), which will allocate a
new AioContext if needed, and return errors on failures. If it fails for
any reason, fallback to threaded AIO with an error message, as the
device is already in-use by the guest.
Add an assert that aio_get_linux_aio() cannot return NULL.
Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com>
Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Flat unions may now have uncovered branches, so it is possible to get rid
of empty types defined for that purpose only.
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1529311206-76847-3-git-send-email-anton.nefedov@virtuozzo.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
This patch allows the user to specify whether to use active or only
background mode for mirror block jobs. Currently, this setting will
remain constant for the duration of the entire block job.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-14-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch implements active synchronous mirroring. In active mode, the
passive mechanism will still be in place and is used to copy all
initially dirty clusters off the source disk; but every write request
will write data both to the source and the target disk, so the source
cannot be dirtied faster than data is mirrored to the target. Also,
once the block job has converged (BLOCK_JOB_READY sent), source and
target are guaranteed to stay in sync (unless an error occurs).
Active mode is completely optional and currently disabled at runtime. A
later patch will add a way for users to enable it.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-13-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This will allow us to access the block job data when the mirror block
driver becomes more complex.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-11-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This new function allows to look for a consecutively dirty area in a
dirty bitmap.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180613181823.13618-10-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This new parameter allows the caller to just query the next dirty
position without moving the iterator.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: John Snow <jsnow@redhat.com>
Message-id: 20180613181823.13618-8-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
With this, the mirror_top_bs is no longer just a technically required
node in the BDS graph but actually represents the block job operation.
Also, drop MirrorBlockJob.source, as we can reach it through
mirror_top_bs->backing.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-6-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
This patch makes the mirror code differentiate between simply waiting
for any operation to complete (mirror_wait_for_free_in_flight_slot())
and specifically waiting for all operations touching a certain range of
the virtual disk to complete (mirror_wait_on_conflicts()).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-5-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Attach a CoQueue to each in-flight operation so if we need to wait for
any we can use it to wait instead of just blindly yielding and hoping
for some operation to wake us.
A later patch will use this infrastructure to allow requests accessing
the same area of the virtual disk to specifically wait for each other.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-4-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
In order to talk to the source BDS (and maybe in the future to the
target BDS as well) directly, we need to convert our existing AIO
requests into coroutine I/O requests.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 20180613181823.13618-3-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
When converting mirror's I/O to coroutines, we are going to need a point
where these coroutines are created. mirror_perform() is going to be
that point.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-id: 20180613181823.13618-2-mreitz@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.
An AIO flush can yield at some point:
blk_aio_flush_entry()
blk_co_flush(blk)
bdrv_co_flush(blk->root->bs)
...
qemu_coroutine_yield()
and let the HMP command to run, free blk->root and give control
back to the AIO flush:
hmp_drive_del()
blk_remove_bs()
bdrv_root_unref_child(blk->root)
child_bs = blk->root->bs
bdrv_detach_child(blk->root)
bdrv_replace_child(blk->root, NULL)
blk->root->bs = NULL
g_free(blk->root) <============== blk->root becomes stale
bdrv_unref(child_bs)
bdrv_delete(child_bs)
bdrv_close()
bdrv_drained_begin()
bdrv_do_drained_begin()
bdrv_drain_recurse()
aio_poll()
...
qemu_coroutine_switch()
and the AIO flush completion ends up dereferencing blk->root:
blk_aio_complete()
scsi_aio_complete()
blk_get_aio_context(blk)
bs = blk_bs(blk)
ie, bs = blk->root ? blk->root->bs : NULL
^^^^^
stale
The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_drain_all_*() used bdrv_next() to iterate over all root nodes and
did a subtree drain for each of them. This works fine as long as the
graph is static, but sadly, reality looks different.
If the graph changes so that root nodes are added or removed, we would
have to compensate for this. bdrv_next() returns each root node only
once even if it's the root node for multiple BlockBackends or for a
monitor-owned block driver tree, which would only complicate things.
The much easier and more obviously correct way is to fundamentally
change the way the functions work: Iterate over all BlockDriverStates,
no matter who owns them, and drain them individually. Compensation is
only necessary when a new BDS is created inside a drain_all section.
Removal of a BDS doesn't require any action because it's gone afterwards
anyway.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In the future, bdrv_drained_all_begin/end() will drain all invidiual
nodes separately rather than whole subtrees. This means that we don't
want to propagate the drain to all parents any more: If the parent is a
BDS, it will already be drained separately. Recursing to all parents is
unnecessary work and would make it an O(n²) operation.
Prepare the drain function for the changed drain_all by adding an
ignore_bds_parents parameter to the internal implementation that
prevents the propagation of the drain to BDS parents. We still (have to)
propagate it to non-BDS parents like BlockBackends or Jobs because those
are not drained separately.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Before we can introduce a single polling loop for all nodes in
bdrv_drain_all_begin(), we must make sure to run it outside of coroutine
context like we already do for bdrv_do_drained_begin().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We cannot allow aio_poll() in bdrv_drain_invoke(begin=true) until we're
done with propagating the drain through the graph and are doing the
single final BDRV_POLL_WHILE().
Just schedule the coroutine with the callback and increase bs->in_flight
to make sure that the polling phase will wait for it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_do_drained_begin() is only safe if we have a single
BDRV_POLL_WHILE() after quiescing all affected nodes. We cannot allow
that parent callbacks introduce a nested polling loop that could cause
graph changes while we're traversing the graph.
Split off bdrv_do_drained_begin_quiesce(), which only quiesces a single
node without waiting for its requests to complete. These requests will
be waited for in the BDRV_POLL_WHILE() call down the call chain.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Anything can happen inside BDRV_POLL_WHILE(), including graph
changes that may interfere with its callers (e.g. child list iteration
in recursive callers of bdrv_do_drained_begin).
Switch to a single BDRV_POLL_WHILE() call for the whole subtree at the
end of bdrv_do_drained_begin() to avoid such effects. The recursion
happens now inside the loop condition. As the graph can only change
between bdrv_drain_poll() calls, but not inside of it, doing the
recursion here is safe.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For bdrv_drain(), recursively waiting for child node requests is
pointless because we didn't quiesce their parents, so new requests could
come in anyway. Letting the function work only on a single node makes it
more consistent.
For subtree drains and drain_all, we already have the recursion in
bdrv_do_drained_begin(), so the extra recursion doesn't add anything
either.
Remove the useless code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
We already requested that block jobs be paused in .bdrv_drained_begin,
but no guarantee was made that the job was actually inactive at the
point where bdrv_drained_begin() returned.
This introduces a new callback BdrvChildRole.bdrv_drained_poll() and
uses it to make bdrv_drain_poll() consider block jobs using the node to
be drained.
For the test case to work as expected, we have to switch from
block_job_sleep_ns() to qemu_co_sleep_ns() so that the test job is even
considered active and must be waited for when draining the node.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 91af091f92 added an additional aio_poll() to BDRV_POLL_WHILE()
in order to make sure that all pending BHs are executed on drain. This
was the wrong place to make the fix, as it is useless overhead for all
other users of the macro and unnecessarily complicates the mechanism.
This patch effectively reverts said commit (the context has changed a
bit and the code has moved to AIO_WAIT_WHILE()) and instead polls in the
loop condition for drain.
The effect is probably hard to measure in any real-world use case
because actual I/O will dominate, but if I run only the initialisation
part of 'qemu-img convert' where it calls bdrv_block_status() for the
whole image to find out how much data there is copy, this phase actually
needs only roughly half the time after this patch.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
All involved nodes are already idle, we called bdrv_do_drain_begin() on
them.
The comment in the code suggested that this was not correct because the
completion of a request on one node could spawn a new request on a
different node (which might have been drained before, so we wouldn't
drain the new request). In reality, new requests to different nodes
aren't spawned out of nothing, but only in the context of a parent
request, and they aren't submitted to random nodes, but only to child
nodes. As long as we still poll for the completion of the parent request
(which we do), draining each root node separately is good enough.
Remove the additional polling code from bdrv_drain_all_begin() and
replace it with an assertion that all nodes are already idle after we
drained them separately.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
All callers pass false for the 'recursive' parameter now. Remove it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>