Commit Graph

1995 Commits

Author SHA1 Message Date
Peter Maydell f3e3b083d4 Block layer core and image format patches
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJVevYFAAoJEH8JsnLIjy/W3jEP/0hiQ3rCRZ/he8s5maTdT+TR
 YSeHkB5rKpz0Uopn1DMn1QrIbUVzX7dyb+uf9zQ0/xRQIzf6k8uxqU/NWrdoF3NK
 qx91dGWedwnG+TEBIMbcR7nMrw4dP6kH7uPz/VWMXDHVLz0HIcD95qhKgs0mSY6J
 dWqex6ACjXM68zJU5IioagU9evV80WZE1S8z7zfixxtTBx5hCaTVbwalkaCxcrXw
 PbZle55rjI8B10+OzgBw0fq10nias+NTndU9CwNBboxmEtAjq8/mQ663vcWlmiFo
 9a/hkda27Z5ut/0Tqk1v4uLHauylp++rrAabPBAuCFMKes6cdkddP15Q/r52aJ29
 5meodQtbet1rGrM+Aq4vuSuWId71PGypEI/3URDdNfYFNISoeLLsk4lcQUu7VrDD
 sRX3Jt8SI3nkIgOnhPyi7NDPmafxFt8yRt5vM8MyR5ynF8NS/2hiAc3wqnbXGjUj
 a5GqDCefb1yM0R5HvksuFFt3OnXlKJQ3J+ksXNUJf9DSAZPauqWD696pcTeg8wyy
 3PIGkczgUuKTVfFWd3THZxJLAo7ZuqvXBHHV8o1SeMBDxwh4FhTd8Kjvm3rUNFfl
 VDox4qwZ1AcxLrxgqazKU7sD9iWBDHURRcpOoUsBys7oxQZnhcmMp1fRlEkTOyrD
 HNiSNByqBrtkfeVzHlSe
 =QgYk
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer core and image format patches

# gpg: Signature made Fri Jun 12 16:08:53 2015 BST using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"

* remotes/kevin/tags/for-upstream: (25 commits)
  block: Fix reopen flag inheritance
  block: Add BlockDriverState.inherits_from
  block: Add list of children to BlockDriverState
  queue.h: Add QLIST_FIX_HEAD_PTR()
  block: Drain requests before swapping nodes in bdrv_swap()
  block: Move flag inheritance to bdrv_open_inherit()
  block: Use QemuOpts in bdrv_open_common()
  block: Use macro for cache option names
  vmdk: Use bdrv_open_image()
  quorum: Use bdrv_open_image()
  check-qdict: Test cases for new functions
  qdict: Add qdict_{set,copy}_default()
  qdict: Add qdict_array_entries()
  iotests: Add tests for overriding BDRV_O_PROTOCOL
  block: driver should override flags in bdrv_open()
  block: Change bitmap truncate conditional to assertion
  block: record new size in bdrv_dirty_bitmap_truncate
  raw-posix: Fix .bdrv_co_get_block_status() for unaligned image size
  vmdk: Use vmdk_find_index_in_cluster everywhere
  vmdk: Fix index_in_cluster calculation in vmdk_co_get_block_status
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2015-06-15 10:43:06 +01:00
Kevin Wolf 67251a3113 block: Fix reopen flag inheritance
When reopening an image, the block layer already takes care to reopen
bs->file as well with recalculated inherited flags. The same must happen
for any other child (most notably missing before this patch: backing
files).

If bs->file (or any other child) didn't originally inherit from bs, e.g.
because it was created separately and then only referenced, it must not
inherit flags on reopen either, so check the inherited_from field before
propagation the reopen down.

VMDK already reopened its extents manually; this code can now be
dropped.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2015-06-12 17:04:59 +02:00
Kevin Wolf f3930ed0bb block: Move flag inheritance to bdrv_open_inherit()
Instead of letting every caller of bdrv_open() determine the right flags
for its child node manually and pass them to the function, pass the
parent node and the role of the newly opened child (like backing file,
protocol layer, etc.).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2015-06-12 17:04:59 +02:00
Kevin Wolf a646836784 vmdk: Use bdrv_open_image()
Besides standardising on a single interface for opening child nodes,
this patch allows the user to specify options to individual extent
nodes. Overriding file names isn't possible with this yet, so it's of
limited usefulness, but still a step forward.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
2015-06-12 16:58:07 +02:00
Kevin Wolf ea6828d81b quorum: Use bdrv_open_image()
Besides standardising on a single interface for opening child nodes,
this simplifies the .bdrv_open() implementation of the quorum block
driver by using block layer functionality for handling BlockdevRefs.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
2015-06-12 16:58:07 +02:00
Kevin Wolf b8684454e1 raw-posix: Fix .bdrv_co_get_block_status() for unaligned image size
Image files with an unaligned image size have a final hole that starts
at EOF, i.e. in the middle of a sector. Currently, *pnum == 0 is
returned when checking the status of this sector. In qemu-img, this
triggers an assertion failure.

In order to fix this, one type for the sector that contains EOF must be
found. Treating a hole as data is safe, so this patch rounds the
calculated number of data sectors up, so that a partial sector at EOF is
treated as a full data sector.

This fixes https://bugzilla.redhat.com/show_bug.cgi?id=1229394

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Cole Robinson <crobinso@redhat.com>
2015-06-12 15:54:01 +02:00
Fam Zheng 90df601f06 vmdk: Use vmdk_find_index_in_cluster everywhere
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-06-12 15:54:01 +02:00
Fam Zheng 61f0ed1d54 vmdk: Fix index_in_cluster calculation in vmdk_co_get_block_status
It has the similar issue with b1649fae49. Since the calculation
is repeated for a few times already, introduce a function so it can be
reused.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-06-12 15:54:01 +02:00
Max Reitz bc85ef265a qcow2: Add DEFAULT_L2_CACHE_CLUSTERS
If a relatively large cluster size is chosen, the default of 1 MB L2
cache is not really appropriate. In this case, unless overridden by the
user, the default cache size should not be determined by its size in
bytes but by the number of L2 tables (clusters) it is supposed to
contain.

Note that without this patch, MIN_L2_CACHE_SIZE will effectively take
over the same role. However, providing space for just two L2 tables is
not enough to be the default.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-06-12 15:54:01 +02:00
Max Reitz 57e2166959 qcow2: Set MIN_L2_CACHE_SIZE to 2
The L2 cache must cover at least two L2 tables, because during COW two
L2 tables are accessed simultaneously.

Reported-by: Alexander Graf <agraf@suse.de>
Cc: qemu-stable <qemu-stable@nongnu.org>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Tested-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-06-12 15:54:00 +02:00
Alberto Garcia b8fe1694e5 throttle: add the name of the ThrottleGroup to BlockDeviceInfo
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 172df91f09c69c6f0440a697bbd1b3f95b077ee4.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-06-12 14:00:00 +01:00
Alberto Garcia db6283385c throttle: acquire the ThrottleGroup lock in bdrv_swap()
bdrv_swap() touches the fields of a BlockDriverState that are
protected by the ThrottleGroup lock. Although those fields end up in
their original place, they are temporarily swapped in the process,
so there's a chance that an operation on a member of the same group
happening on a different thread can try to use them.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: d92dc40d7c4f1fc5cda5cbbf4ffb7a4670b79d17.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-06-12 14:00:00 +01:00
Alberto Garcia 76f4afb40f throttle: Add throttle group support
The throttle group support use a cooperative round robin scheduling
algorithm.

The principles of the algorithm are simple:
- Each BDS of the group is used as a token in a circular way.
- The active BDS computes if a wait must be done and arms the right
  timer.
- If a wait must be done the token timer will be armed so the token
  will become the next active BDS.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: f0082a86f3ac01c46170f7eafe2101a92e8fde39.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-06-12 14:00:00 +01:00
Alberto Garcia 2ff1f2e3a3 throttle: Add throttle group infrastructure
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 2fdb4de17210b733a13eb472c33cd08b45f8fd21.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-06-12 14:00:00 +01:00
Benoît Canet 0e5b0a2d54 throttle: Extract timers from ThrottleState into a separate structure
Group throttling will share ThrottleState between multiple bs.
As a consequence the ThrottleState will be accessed by multiple aio
context.

Timers are tied to their aio context so they must go out of the
ThrottleState structure.

This commit paves the way for each bs of a common ThrottleState to
have its own timer.

Signed-off-by: Benoit Canet <benoit.canet@nodalink.com>
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 6cf9ea96d8b32ae2f8769cead38f68a6a0c8c909.1433779731.git.berto@igalia.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-06-12 14:00:00 +01:00
Kevin Wolf f4a769abaa raw-posix: Fix .bdrv_co_get_block_status() for unaligned image size
Image files with an unaligned image size have a final hole that starts
at EOF, i.e. in the middle of a sector. Currently, *pnum == 0 is
returned when checking the status of this sector. In qemu-img, this
triggers an assertion failure.

In order to fix this, one type for the sector that contains EOF must be
found. Treating a hole as data is safe, so this patch rounds the
calculated number of data sectors up, so that a partial sector at EOF is
treated as a full data sector.

This fixes https://bugzilla.redhat.com/show_bug.cgi?id=1229394

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1433840108-9996-1-git-send-email-kwolf@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-06-12 13:58:33 +01:00
Markus Armbruster 8809cfc38e blkdebug: Simplify passing of Error through qemu_opts_foreach()
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
2015-06-09 07:40:23 +02:00
Markus Armbruster 28d0de7a4f QemuOpts: Convert qemu_opts_foreach() to Error
Retain the function value for now, to permit selective conversion of
its callers.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
2015-06-09 07:37:37 +02:00
Markus Armbruster a4c7367f7d QemuOpts: Drop qemu_opts_foreach() parameter abort_on_failure
When the argument is non-zero, qemu_opts_foreach() stops on callback
returning non-zero, and returns that value.

When the argument is zero, it doesn't stop, and returns the bit-wise
inclusive or of all the return values.  Funky :)

The callers that pass zero could just as well pass one, because their
callbacks can't return anything but zero:

* qemu_add_globals()'s callback qdev_add_one_global()

* qemu_config_write()'s callback config_write_opts()

* main()'s callbacks default_driver_check(), drive_enable_snapshot(),
  vnc_init_func()

Drop the parameter, and always stop.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
2015-06-08 19:33:20 +02:00
Fam Zheng 44f192f364 iscsi: Remove pointless runtime check of macro value
raw_bsd already has QEMU_BUILD_BUG_ON(BDRV_SECTOR_SIZE != 512), so iscsi
should relax.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2015-06-03 14:21:23 +03:00
Daniel P. Berrange 8336aafae1 qcow2/qcow: protect against uninitialized encryption key
When a qcow[2] file is opened, if the header reports an
encryption method, this is used to set the 'crypt_method_header'
field on the BDRVQcow[2]State struct, and the 'encrypted' flag
in the BDRVState struct.

When doing I/O operations, the 'crypt_method' field on the
BDRVQcow[2]State struct is checked to determine if encryption
needs to be applied.

The crypt_method_header value is copied into crypt_method when
the bdrv_set_key() method is called.

The QEMU code which opens a block device is expected to always
do a check

   if (bdrv_is_encrypted(bs)) {
       bdrv_set_key(bs, ....key...);
   }

If code forgets to do this, then 'crypt_method' is never set
and so when I/O is performed, QEMU writes plain text data
into a sector which is expected to contain cipher text, or
when reading, will return cipher text instead of plain
text.

Change the qcow[2] code to consult bs->encrypted when deciding
whether encryption is required, and assert(s->crypt_method)
to protect against cases where the caller forgets to set the
encryption key.

Also put an assert in the set_key methods to protect against
the case where the caller sets an encryption key on a block
device that does not have encryption

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Alberto Garcia d1b4efe5c4 qcow2: style fixes in qcow2-cache.c
Fix pointer declaration to make it consistent with the rest of the
code.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Alberto Garcia a3f1afb43a qcow2: make qcow2_cache_put() a void function
This function never receives an invalid table pointer, so we can make
it void and remove all the error checking code.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Alberto Garcia 812e4082ca qcow2: use a hash to look for entries in the L2 cache
The current cache algorithm traverses the array starting always from
the beginning, so the average number of comparisons needed to perform
a lookup is proportional to the size of the array.

By using a hash of the offset as the starting point, lookups are
faster and independent from the array size.

The hash is computed using the cluster number of the table, multiplied
by 4 to make it perform better when there are collisions.

In my tests, using a cache with 2048 entries, this reduces the average
number of comparisons per lookup from 430 to 2.5.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Alberto Garcia fdfbca82a0 qcow2: remove qcow2_cache_find_entry_to_replace()
A cache miss means that the whole array was traversed and the entry
we were looking for was not found, so there's no need to traverse it
again in order to select an entry to replace.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Alberto Garcia 2693310ecc qcow2: use an LRU algorithm to replace entries from the L2 cache
The current algorithm to evict entries from the cache gives always
preference to those in the lowest positions. As the size of the cache
increases, the chances of the later elements of being removed decrease
exponentially.

In a scenario with random I/O and lots of cache misses, entries in
positions 8 and higher are rarely (if ever) evicted. This can be seen
even with the default cache size, but with larger caches the problem
becomes more obvious.

Using an LRU algorithm makes the chances of being removed from the
cache independent from the position.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Alberto Garcia baf07d60f5 qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()
Since all tables are now stored together, it is possible to obtain
the position of a particular table directly from its address, so the
operation becomes O(1).

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Alberto Garcia 72e80b8901 qcow2: use one single memory block for the L2/refcount cache tables
The qcow2 L2/refcount cache contains one separate table for each cache
entry. Doing one allocation per table adds unnecessary overhead and it
also requires us to store the address of each table separately.

Since the size of the cache is constant during its lifetime, it's
better to have an array that contains all the tables using one single
allocation.

In my tests measuring freshly created caches with sizes 128MB (L2) and
32MB (refcount) this uses around 10MB of RAM less.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Fam Zheng 13c4941cdd vmdk: Fix overflow if l1_size is 0x20000000
Richard Jones caught this bug with afl fuzzer.

In fact, that's the only possible value to overflow (extent->l1_size =
0x20000000) l1_size:

l1_size = extent->l1_size * sizeof(long) => 0x80000000;

g_try_malloc returns NULL because l1_size is interpreted as negative
during type casting from 'int' to 'gsize', which yields a enormous
value. Hence, by coincidence, we get a "not too bad" behavior:

qemu-img: Could not open '/tmp/afl6.img': Could not open
'/tmp/afl6.img': Cannot allocate memory

Values larger than 0x20000000 will be refused by the validation in
vmdk_add_extent.

Values smaller than 0x20000000 will not overflow l1_size.

Cc: qemu-stable@nongnu.org
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:01 +02:00
Fam Zheng 5e82a31eb9 vmdk: Fix next_cluster_sector for compressed write
This fixes the bug introduced by commit c6ac36e (vmdk: Optimize cluster
allocation).

Sometimes, write_len could be larger than cluster size, because it
contains both data and marker.  We must advance next_cluster_sector in
this case, otherwise the image gets corrupted.

Cc: qemu-stable@nongnu.org
Reported-by: Antoni Villalonga <qemu-list@friki.cat>
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2015-05-22 17:08:00 +02:00
Kevin Wolf ecbda7a225 qcow2: Flush pending discards before allocating cluster
Before a freed cluster can be reused, pending discards for this cluster
must be processed.

The original assumption was that this was not a problem because discards
are only cached during discard/write zeroes operations, which are
synchronous so that no concurrent write requests can cause cluster
allocations.

However, the discard/write zeroes operation itself can allocate a new L2
table (and it has to in order to put zero flags there), so make sure we
can cope with the situation.

This fixes https://bugs.launchpad.net/bugs/1349972.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2015-05-22 17:08:00 +02:00
Paolo Bonzini a53f1a95f9 block: get_block_status: use "else" when testing the opposite condition
A bit of Boolean algebra (and common sense) tells us that the
second "if" here is looking for blocks that are not allocated.
This is the opposite of the "if" that sets BDRV_BLOCK_ALLOCATED,
and thus it can use an "else".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1431599702-10431-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:33 +01:00
Fam Zheng 9eeb6dd1b2 block: Fix NULL deference for unaligned write if qiov is NULL
For zero write, callers pass in NULL qiov (qemu-io "write -z" or
scsi-disk "write same").

Commit fc3959e466 fixed bdrv_co_write_zeroes which is the common case
for this bug, but it still exists in bdrv_aio_write_zeroes. A simpler
fix would be in bdrv_co_do_pwritev which is the NULL dereference point
and covers both cases.

So don't access it in bdrv_co_do_pwritev in this case, use three aligned
writes.

[Initialize ret to 0 in bdrv_co_do_zero_pwritev() to avoid uninitialized
variable warning with gcc 4.9.2.
--Stefan]

Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1431522721-3266-3-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:33 +01:00
Fam Zheng d01c07f222 Revert "block: Fix unaligned zero write"
This reverts commit fc3959e466.

The core write code already handles the case, so remove this
duplication.

Because commit 61007b316 moved the touched code from block.c to
block/io.c, the change is manually reverted.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431522721-3266-2-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:33 +01:00
Denis V. Lunev 459b4e6612 block: align bounce buffers to page
The following sequence
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    for (i = 0; i < 100000; i++)
            write(fd, buf, 4096);
performs 5% better if buf is aligned to 4096 bytes.

The difference is quite reliable.

On the other hand we do not want at the moment to enforce bounce
buffering if guest request is aligned to 512 bytes.

The patch changes default bounce buffer optimal alignment to
MAX(page size, 4k). 4k is chosen as maximal known sector size on real
HDD.

The justification of the performance improve is quite interesting.
From the kernel point of view each request to the disk was split
by two. This could be seen by blktrace like this:
  9,0   11  1     0.000000000 11151  Q  WS 312737792 + 1023 [qemu-img]
  9,0   11  2     0.000007938 11151  Q  WS 312738815 + 8 [qemu-img]
  9,0   11  3     0.000030735 11151  Q  WS 312738823 + 1016 [qemu-img]
  9,0   11  4     0.000032482 11151  Q  WS 312739839 + 8 [qemu-img]
  9,0   11  5     0.000041379 11151  Q  WS 312739847 + 1016 [qemu-img]
  9,0   11  6     0.000042818 11151  Q  WS 312740863 + 8 [qemu-img]
  9,0   11  7     0.000051236 11151  Q  WS 312740871 + 1017 [qemu-img]
  9,0    5  1     0.169071519 11151  Q  WS 312741888 + 1023 [qemu-img]
After the patch the pattern becomes normal:
  9,0    6  1     0.000000000 12422  Q  WS 314834944 + 1024 [qemu-img]
  9,0    6  2     0.000038527 12422  Q  WS 314835968 + 1024 [qemu-img]
  9,0    6  3     0.000072849 12422  Q  WS 314836992 + 1024 [qemu-img]
  9,0    6  4     0.000106276 12422  Q  WS 314838016 + 1024 [qemu-img]
and the amount of requests sent to disk (could be calculated counting
number of lines in the output of blktrace) is reduced about 2 times.

Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
does his job well and real requests comes properly aligned (to page).

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431441056-26198-3-git-send-email-den@openvz.org
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:33 +01:00
Denis V. Lunev 4196d2f030 block: minimal bounce buffer alignment
The patch introduces new concept: minimal memory alignment for bounce
buffers. Original so called "optimal" value is actually minimal required
value for aligment. It should be used for validation that the IOVec
is properly aligned and bounce buffer is not required.

Though, from the performance point of view, it would be better if
bounce buffer or IOVec allocated by QEMU will be aligned stricter.

The patch does not change any alignment value yet.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1431441056-26198-2-git-send-email-den@openvz.org
CC: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:33 +01:00
Paolo Bonzini eaf5fe2dd4 block: return EPERM on writes or discards to read-only devices
This is the behavior in the operating system, for example Linux's
blkdev_write_iter has the following:

        if (bdev_read_only(I_BDEV(bd_inode)))
                return -EPERM;

This does not apply to opening a device for read/write, when the
device only supports read-only operation.  In this case any of
EACCES, EPERM or EROFS is acceptable depending on why writing is
not possible.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1431013548-22492-1-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:33 +01:00
Denis V. Lunev ddd2ef2ce8 block/parallels: improve image writing performance further
Try to perform IO for the biggest continuous block possible.
All blocks abscent in the image are accounted in the same type
and preallocation is made for all of them at once.

The performance for sequential write is increased from 200 Mb/sec to
235 Mb/sec on my SSD HDD.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-28-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 19f5dc1591 block/parallels: optimize linear image expansion
Plain image expansion spends a lot of time to update image file size.
This seriously affects the performance. The following simple test
  qemu_img create -f parallels -o cluster_size=64k ./1.hds 64G
  qemu_io -n -c "write -P 0x11 0 1024M" ./1.hds
could be improved if the format driver will pre-allocate some space
in the image file with a reasonable chunk.

This patch preallocates 128 Mb using bdrv_write_zeroes, which should
normally use fallocate() call inside. Fallback to older truncate()
could be used as a fallback using image open options thanks to the
previous patch.

The benefit is around 15%.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Karan <rkagan@parallels.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-27-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev d61790112f block/parallels: add prealloc-mode and prealloc-size open paramemets
This is preparational commit for tweaks in Parallels image expansion.
The idea is that enlarge via truncate by one data block is slow. It
would be much better to use fallocate via bdrv_write_zeroes and
expand by some significant amount at once.

Original idea with sequential file writing to the end of the file without
fallocate/truncate would be slower than this approach if the image is
expanded with several operations:
- each image expanding means file metadata update, i.e. filesystem
  journal write. Truncate/write to newly truncated space update file
  metadata twice thus truncate removal helps. With fallocate call
  inside bdrv_write_zeroes file metadata is updated only once and
  this should happen infrequently thus this approach is the best one
  for the image expansion
- tail writes are ordered, i.e. the guest IO queue could not be sent
  immediately to the host introducing additional IO delays

This patch just adds proper parameters into BDRVParallelsState and
performs options parsing in parallels_open.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-26-git-send-email-den@openvz.org
CC: Roman Kagan <rkagan@parallels.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 0d31c7c200 block/parallels: delay writing to BAT till bdrv_co_flush_to_os
The idea is that we do not need to immediately sync BAT to the image as
from the guest point of view there is a possibility that IO is lost
even in the physical controller until flush command was finished.
bdrv_co_flush_to_os is exactly the right place for this purpose.

Technically the patch uses loaded BAT data as a cache and performs
actual on-disk metadata updates in parallels_co_flush_to_os callback.

This patch speed ups
  qemu-img create -f parallels -o cluster_size=64k ./1.hds 64G
  qemu-io -f parallels -c "write -P 0x11 0 1024k" 1.hds
writing from 50-60 Mb/sec to 80-90 Mb/sec on rotational media and
from 160 Mb/sec to 190 Mb/sec on SSD disk.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-25-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 2d68e22e94 block/parallels: create bat_entry_off helper
calculate offset of the BAT entry in the image file.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-24-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 6953d92078 block/parallels: improve image reading performance
Try to perform IO for the biggest continuous block possible.
The performance for sequential read is increased from 220 Mb/sec to
360 Mb/sec for continous image on my SSD HDD.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-23-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 6dd6b9f144 block/parallels: implement incorrect close detection
The software driver must set inuse field in Parallels header to
0x746F6E59 when the image is opened in read-write mode. The presence of
this magic in the header on open forces image consistency check.

There is an unfortunate trick here. We can not check for inuse in
parallels_check as this will happen too late. It is possible to do
that for simple check, but during the fix this would always report
an error as the image was opened in BDRV_O_RDWR mode. Thus we save
the flag in BDRVParallelsState for this.

On the other hand, nothing should be done to clear inuse in
parallels_check. Generic close will do the job right.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-21-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 49ad646731 block/parallels: implement parallels_check method of block driver
The check is very simple at the moment. It calculates necessary stats
and fix only the following errors:
- space leak at the end of the image. This would happens due to
  preallocation
- clusters outside the image are zeroed. Nothing else could be done here

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-20-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 23d6bd3bd1 block/parallels: move parallels_open/probe to the very end of the file
This will help to avoid forward declarations for upcoming parallels_check

Some very obvious formatting fixes were made to the moved code to make
checkpatch happy.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-19-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 9eae9cca95 block/parallels: read parallels image header and BAT into single buffer
This metadata cache would allow to properly batch BAT updates to disk
in next patches. These updates will be properly aligned to avoid
read-modify-write transactions on block level.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-18-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev dd97cdc064 block/parallels: keep BAT bitmap data in little endian in memory
This will allow to use this data as buffer to BAT update directly
without any intermediate buffers.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-17-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 555cc9d9fc block/parallels: create bat2sect helper
deduplicate copy/paste arithmetcs

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-16-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00
Denis V. Lunev 369f7de9d5 block/parallels: rename catalog_ names to bat_
BAT means 'block allocation table'. Thus this name is clean and shorter
on writing.

Some obvious formatting fixes in the old code were made to make checkpatch
happy.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@parallels.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Roman Kagan <rkagan@parallels.com>
Message-id: 1430207220-24458-15-git-send-email-den@openvz.org
CC: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2015-05-22 09:37:32 +01:00