Align the 'rsvd1' reserved field declaration in NvmeBar with existing
style.
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: minor commit message fixup]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
"qemu/typedefs.h" is enough for include/block/write-threshold.h header
with forward declaration of BlockDriverState. Also drop extra includes
from block/write-threshold.c and tests/unit/test-write-threshold.c
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210506090621.11848-9-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
bdrv_write_threshold_exceeded() is unused.
bdrv_write_threshold_is_set() is used only to double check the value of
bs->write_threshold_offset in tests. No real sense in it (both tests do
check real value with help of bdrv_write_threshold_get())
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted commit message as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
They are unused now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-3-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
write-notifiers are used only for write-threshold. New code for such
purpose should create filters.
Let's better special-case write-threshold and drop write notifiers at
all. (Actually, write-threshold is special-cased anyway, as the only
user of write-notifiers)
So, create a new direct interface for bdrv_co_write_req_prepare() and
drop all write-notifier related logic from write-threshold.c.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted comment as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
If mirror is READY than cancel operation is not discarding the whole
result of the operation, but instead it's a documented way get a
point-in-time snapshot of source disk.
So, we should not cancel any requests if mirror is READ and
force=false. Let's fix that case.
Note, that bug that we have before this commit is not critical, as the
only .bdrv_cancel_in_flight implementation is nbd_cancel_in_flight()
and it cancels only requests waiting for reconnection, so it should be
rare case.
Fixes: 521ff8b779
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210421075858.40197-1-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Normally, blk_new_open() just shares all permissions. This was fine
originally when permissions only protected against uses in the same
process because no other part of the code would actually get to access
the block nodes opened with blk_new_open(). However, since we use it for
file locking now, unsharing permissions becomes desirable.
Add a new BDRV_O_NO_SHARE flag that is used in blk_new_open() to unshare
any permissions that can be unshared.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210422164344.283389-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Old interfaces dropped, nobody directly calls
bdrv_child_set_perm_abort() and bdrv_child_set_perm_commit(), so we can
use personal state structure for the action and stop exploiting
BdrvChild structure. Also, drop "_safe" suffix which is redundant now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-35-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move bdrv_reopen_multiple to new paradigm of permission update:
first update graph relations, then do refresh the permissions.
We have to modify reopen process in file-posix driver: with new scheme
we don't have prepared permissions in raw_reopen_prepare(), so we
should reconfigure fd in raw_check_perm(). Still this seems more native
and simple anyway.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-31-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
To be used in further commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-28-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Using bdrv_replace_node() for removing filter is not good enough: it
keeps child reference of the filter, which may conflict with original
top node during permission update.
Instead let's create new interface, which will do all graph
modifications first and then update permissions.
Let's modify bdrv_replace_node_common(), allowing it additionally drop
backing chain child link pointing to new node. This is quite
appropriate for bdrv_drop_intermediate() and makes possible to add
new bdrv_drop_filter() as a simple wrapper.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-24-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These functions are called only from bdrv_reopen_multiple() in block.c.
No reason to publish them.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-8-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Passing parent aio context is redundant, as child_class and parent
opaque pointer are enough to retrieve it. Drop the argument and use new
bdrv_child_get_parent_aio_context() interface.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-7-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add new handler to get aio context and implement it in all child
classes. Add corresponding public interface to be used soon.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-6-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Prior to this patch, if a private nvme-ns device (that is, a namespace
that is not linked to a subsystem) is wired up to an nvme-subsys linked
nvme controller device, the device fails to verify that the namespace id
is unique within the subsystem. NVM Express v1.4b, Section 6.1.6 ("NSID
and Namespace Usage") states that because the device supports Namespace
Management, "NSIDs *shall* be unique within the NVM subsystem".
Additionally, prior to this patch, private namespaces are not known to
the subsystem and the namespace is considered exclusive to the
controller with which it is initially wired up to. However, this is not
the definition of a private namespace; per Section 1.6.33 ("private
namespace"), a private namespace is just a namespace that does not
support multipath I/O or namespace sharing, which means "that it is only
able to be attached to one controller at a time".
Fix this by always allocating namespaces in the subsystem (if one is
linked to the controller), regardless of the shared/private status of
the namespace. Whether or not the namespace is shareable is controlled
by a new `shared` nvme-ns parameter.
Finally, this fix allows the nvme-ns `subsys` parameter to be removed,
since the `shared` parameter now serves the purpose of attaching the
namespace to all controllers in the subsystem upon device realization.
It is invalid to have an nvme-ns namespace device with a linked
subsystem without the parent nvme controller device also being linked to
one and since the nvme-ns devices will unconditionally be "attached" (in
QEMU terms that is) to an nvme controller device through an NvmeBus, the
nvme-ns namespace device can always get a reference to the subsystem of
the controller it is explicitly (using 'bus=' parameter) or implicitly
attaching to.
Fixes: e570768566 ("hw/block/nvme: support for shared namespace in subsystem")
Cc: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
* fix for a bug in zone management receive (me)
* metadata and end-to-end data protection support (me & Gollu Appalanaidu)
* verify support (Gollu Appalanaidu)
* multiple lba formats and format nvm support (Minwoo Im)
and a couple of misc refactorings from me.
v2:
- remove an unintended submodule update. Argh.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmBTP0wACgkQTeGvMW1P
DelZvAf+Ijw77YLz2t0P97CiwoOAG4FADwo61WQJo2AHyS3JMYPVdgNUXF7UGt/S
cNhU1JjdEylgZPKi5t9o/qgy7+vtSL0KKqoXIS0z9ZWZkdFsgObNetGULhaqXgaX
4KcAt7PyJS/33uFqYGSVZxJTO4GtCy34hGw6XrVs388tQD1S+eI+oS5EYBN57rl9
MrV5Z72iMCBx8Y074u81SD/u7n34b5WWbUOJADVI1rAKhilDhnkAhBQ/gK/UBodB
trUUQY/KU+eCnXEqfBeMD6fL75UOAFOjnm7Kfk/Q/g3di3H2fu0RGLrJuSS1dG0n
zmwgE7pQ1jT1UcB3QwAJLR3tGtd6sA==
=HwqP
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging
emulated nvme updates and fixes
* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
* fix for a bug in zone management receive (me)
* metadata and end-to-end data protection support (me & Gollu Appalanaidu)
* verify support (Gollu Appalanaidu)
* multiple lba formats and format nvm support (Minwoo Im)
and a couple of misc refactorings from me.
v2:
- remove an unintended submodule update. Argh.
# gpg: Signature made Thu 18 Mar 2021 11:53:48 GMT
# gpg: using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg: aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468 4272 63D5 6FC5 E55D A838
# Subkey fingerprint: 5228 33AA 75E2 DCE6 A247 66C0 4DE1 AF31 6D4F 0DE9
* remotes/nvme/tags/nvme-next-pull-request:
hw/block/nvme: add support for the format nvm command
hw/block/nvme: pull lba format initialization
hw/block/nvme: prefer runtime helpers instead of device parameters
hw/block/nvme: support multiple lba formats
hw/block/nvme: add non-mdts command size limit for verify
hw/block/nvme: add verify command
hw/block/nvme: end-to-end data protection
hw/block/nvme: add metadata support
hw/block/nvme: fix zone management receive reporting too many zones
hw/block/nvme: assert namespaces array indices
hw/block/nvme: fix potential overflow
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Format NVM admin command can make a namespace or namespaces to be
with different LBA size and metadata size with protection information
types.
This patch introduces Format NVM command with LBA format, Metadata, and
Protection Information for the device. The secure erase operation things
and support for formatting zoned namespaces are yet to be added.
The parameter checks inside of this patch has been referred from
Keith's old branch.
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
[anaidu.gollu: rebased on e2e]
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: rebased for reworked aio tracking]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Add support for namespaces formatted with protection information. The
type of end-to-end data protection (i.e. Type 1, Type 2 or Type 3) is
selected with the `pi` nvme-ns device parameter. If the number of
metadata bytes is larger than 8, the `pil` nvme-ns device parameter may
be used to control the location of the 8-byte DIF tuple. The default
`pil` value of '0', causes the DIF tuple to be transferred as the last
8 bytes of the metadata. Set to 1 to store this in the first eight bytes
instead.
Co-authored-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
The same information is available via the 'recording' and 'busy' fields.
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
- Add Vladimir as NBD co-maintainer
- Fix reporting of holes in NBD_CMD_BLOCK_STATUS
- Improve command-line parsing accuracy of large numbers (anything going
through qemu_strtosz), including the deprecation of hex+suffix
- Improve some error reporting in the block layer
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmBHlmIACgkQp6FrSiUn
Q2q2cQgAqJWNb4J/ShjvzocDDPzJ0iBitFbg0huFPfbt4DScubEZo5wBJG7vOhOW
hIHrWCRzGvRgsn0tcSfrgFaegmHKrLgjkibM7ou8ni9NC1kUBd3R/3FBNIMxhYf7
Q8Kfspl0LRfMJDKF9jdCnQ4Gxcd6h2OIYZqiWVg8V4Tc8WdCpIVOah7e7wjuW8bT
vgZvfboUWm5AmIF9j/MxuMn+HFZ4ArSuFVL80ZaXlD00vRra7u3HZ8pUfcOlOujg
7HeouM1E5j3NNE6aZSN++x/EQ3sg0zmirbWUCcgAyRfdRkAmB15uh2PUzPxEIJKH
UHUIW5LvNtz2+yzOAz2yK29OE523Yg==
=blE1
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2021-03-09' into staging
nbd patches for 2021-03-09
- Add Vladimir as NBD co-maintainer
- Fix reporting of holes in NBD_CMD_BLOCK_STATUS
- Improve command-line parsing accuracy of large numbers (anything going
through qemu_strtosz), including the deprecation of hex+suffix
- Improve some error reporting in the block layer
# gpg: Signature made Tue 09 Mar 2021 15:38:10 GMT
# gpg: using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg: aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-nbd-2021-03-09:
block/qcow2: refactor qcow2_update_options_prepare error paths
block/qed: bdrv_qed_do_open: deal with errp
block/qcow2: simplify qcow2_co_invalidate_cache()
block/qcow2: read_cache_sizes: return status value
block/qcow2-bitmap: return status from qcow2_store_persistent_dirty_bitmaps
block/qcow2-bitmap: improve qcow2_load_dirty_bitmaps() interface
block/qcow2: qcow2_get_specific_info(): drop error propagation
blockjob: return status from block_job_set_speed()
block/mirror: drop extra error propagation in commit_active_start()
block: drop extra error propagation for bdrv_set_backing_hd
blockdev: fix drive_backup_prepare() missed error
block: check return value of bdrv_open_child and drop error propagation
utils: Deprecate hex-with-suffix sizes
utils: Improve qemu_strtosz() to have 64 bits of precision
utils: Enhance testsuite for do_strtosz()
nbd: server: Report holes for raw images
MAINTAINERS: add Vladimir as co-maintainer of NBD
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Support Identify command for Namespace attached controller list. This
command handler will traverse the controller instances in the given
subsystem to figure out whether the specified nsid is attached to the
controllers or not.
The 4096bytes Identify data will return with the first entry (16bits)
indicating the number of the controller id entries. So, the data can
hold up to 2047 entries for the controller ids.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
[k.jensen: rebased for dma refactor]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
If namespace inventory is changed due to some reasons (e.g., namespace
attachment/detachment), controller can send out event notifier to the
host to manage namespaces.
This patch sends out the AEN to the host after either attach or detach
namespaces from controllers. To support clear of the event from the
controller, this patch also implemented Get Log Page command for Changed
Namespace List log type. To return namespace id list through the
command, when namespace inventory is updated, id is added to the
per-controller list (changed_ns_list).
To indicate the support of this async event, this patch set
OAES(Optional Asynchronous Events Supported) in Identify Controller data
structure.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
This patch supports Namespace Attachment command for the pre-defined
nvme-ns device nodes. Of course, attach/detach namespace should only be
supported in case 'subsys' is given. This is because if we detach a
namespace from a controller, somebody needs to manage the detached, but
allocated namespace in the NVMe subsystem.
As command effect for the namespace attachment command is registered,
the host will be notified that namespace inventory is changed so that
host will rescan the namespace inventory after this command. For
example, kernel driver manages this command effect via passthru IOCTL.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
[k.jensen: rebased for dma refactor]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Dataset Management is not subject to MDTS, but exceeded a certain size
per range causes internal looping. Report this limit (DMRSL) in the NVM
command set specific identify controller data structure.
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Add support for using the broadcast nsid to issue a flush on all
namespaces through a single command.
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Add new data structures and types for the Simple Copy command.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Added Namespace Multi-path I/O and Namespace Sharing Capabilities (NMIC)
field to support shared namespace from controller(s).
This field is in Identify Namespace data structure in [30].
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Added Controller Multi-path I/O and Namespace Sharing Capabilities
(CMIC) field to support multi-controller in the following patches.
This field is in Identify Controller data structure in [76].
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Better to return status together with setting errp. It allows to avoid
error propagation in the caller.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20210202124956.63146-8-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Rename bytes_covered_by_bitmap_cluster() to
bdrv_dirty_bitmap_serialization_coverage() and make it public.
It is needed as we are going to share it with bitmap loading in
parallels format.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <20210224104707.88430-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function wraps bdrv_co_delete_file for the common case of removing a file,
which was just created by format driver, on an error condition.
It hides the -ENOTSUPP error, and reports all other errors otherwise.
Use it in luks driver
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201217170904.946013-3-mlevitsk@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
- add 'transform' member to manipulate bitmaps across migration
- work towards better error handling during bdrv_open
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmAnDQsACgkQp6FrSiUn
Q2qc5Qf/SKVdpX4j7OnHF6sBuf/8LVWz4KazSqEU0ohazBJmafgJpH2EA5pXMXR4
frZDWeanGmhj1MjMkta/++uvEBU/TMpW2z98mZvjErteXdnRQAlII/hOCI+QZJvg
viQ5t1EyrkyXzUePOjs+AwqA5KHWbCKt6QqyItQ78HvI23sw/fuvHj0G67KbVzXZ
VcSrVr0J7PXnZV/hWfg+C+Nn9Ro9tsVdn79awLYVQ7/SDro3hzylpcHMQaHMK2oe
mX4D2kNq7s21E27Zb6vlknUhQPkMdETk0gfEbpn7sTVMEc58GRLC7Tqfx7l0JIFK
5izVyA5vndKVxDGYPkbDK6VL2uDg4A==
=+Epy
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/ericb/tags/pull-bitmaps-2021-02-12' into staging
bitmaps patches for 2021-02-12
- add 'transform' member to manipulate bitmaps across migration
- work towards better error handling during bdrv_open
# gpg: Signature made Fri 12 Feb 2021 23:19:39 GMT
# gpg: using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg: aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* remotes/ericb/tags/pull-bitmaps-2021-02-12:
block: use return status of bdrv_append()
block: return status from bdrv_append and friends
qemu-iotests: 300: Add test case for modifying persistence of bitmap
migration: dirty-bitmap: Allow control of bitmap persistence
migration: dirty-bitmap: Use struct for alias map inner members
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The recommended use of qemu error api assumes returning status together
with setting errp and avoid void functions with errp parameter. Let's
improve bdrv_append and some friends to reduce error-propagation
overhead in further patches.
Choose int return status, because bdrv_replace_node_common() has call
to bdrv_check_update_perm(), which reports int status, which seems
correct to propagate.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210202124956.63146-2-vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
It will be used to stop retrying NBD requests on mirror cancel.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210205163720.887197-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
With the new CMB logic in place, bump the implemented specification
version to v1.4 by default.
This requires adding the setting the CNTRLTYPE field and modifying the
VWC field since 0x00 is no longer a valid value for bits 2:1.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Implement v1.4 logic for configuring the Controller Memory Buffer. By
default, the v1.4 scheme will be used (CMB must be explicitly enabled by
the host), so drivers that only support v1.3 will not be able to use the
CMB anymore.
To retain the v1.3 behavior, set the boolean 'legacy-cmb' nvme device
parameter.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Padmakar Kalghatgi <p.kalghatgi@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Use the correct field names.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
This patch sets CMBS bit in controller capabilities register when user
configures NVMe driver with CMB support, so capabilites are correctly
reported to guest OS.
Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsky@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
During smart critical warning injection by setting property from QMP
command, also try to trigger asynchronous event.
Suggested by Keith, if a event has already been raised, there is no
need to enqueue the duplicate event any more.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: fix typo in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
There is a very low probability that hitting physical NVMe disk
hardware critical warning case, it's hard to write & test a monitor
agent service.
For debugging purposes, add a new 'smart_critical_warning' property
to emulate this situation.
The orignal version of this change is implemented by adding a fixed
property which could be initialized by QEMU command line. Suggested
by Philippe & Klaus, rework like current version.
Test with this patch:
1, change smart_critical_warning property for a running VM:
#virsh qemu-monitor-command nvme-upstream '{ "execute": "qom-set",
"arguments": { "path": "/machine/peripheral-anon/device[0]",
"property": "smart_critical_warning", "value":16 } }'
2, run smartctl in guest
#smartctl -H -l error /dev/nvme0n1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- volatile memory backup device has failed
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
According to NVM Express v1.4, Section 5.14.1.2 ("SMART / Health
Information"), introduce bit 5 for "Persistent Memory Region has become
read-only or unreliable".
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: minor brush ups in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Align with existing style and use a typedef for header-file enums.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053).
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Many CNS commands have "allocated" command variants. These include
a namespace as long as it is allocated, that is a namespace is
included regardless if it is active (attached) or not.
While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is more complete by actually providing support for these CNS values.
However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the
same result as the active CNS command variants.
The reason for not hooking up this command completely is because the
NVMe specification requires the namespace management command to be
supported if the namespace attachment command is supported.
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Define the structures and constants required to implement
Namespace Types support.
Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set and enable
detection for the NVM Command Set.
The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.
Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
This log page becomes necessary to implement to allow checking for
Zone Append command support in Zoned Namespace Command Set.
This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.
All incoming admin and i/o commands are now only processed if their
corresponding support bits are set in this log. This provides an
easy way to control what commands to support and what not to
depending on set CC.CSS.
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
This adds the NPWG, NPWA, NPDG, NPDA and NOWS family of fields to the
shared nvme.h header for use by later patches.
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <fam@euphon.net>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).
Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.
Multiple factors affect when a Write Zeroes command result in
deallocation of blocks.
* the underlying file system block size
* the blockdev format
* the 'discard' and 'logical_block_size' parameters
format | discard | wz (512B) wz (4KiB) wz (64KiB)
-----------------------------------------------------
qcow2 ignore n n y
qcow2 unmap n n y
raw ignore n y y
raw unmap n y y
So, this works best with an image in raw format and 4KiB LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4kb block size, YMMV). A qcow2 image, uses a cluster size
of 64KiB by default and blocks will only be marked deallocated if a full
cluster is zeroed or discarded. However, this *is* consistent with the
spec since Write Zeroes "should" deallocate the block if the Deallocate
attribute is set and "may" deallocate if the Deallocate attribute is not
set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
always set).
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Currently bdrv_all_find_snapshot() will return 0 if it finds
a snapshot, -1 if an error occurs, or if it fails to find a
snapshot. New callers to be added want to distinguish between
the error scenario and failing to find a snapshot.
Rename it to bdrv_all_has_snapshot and make it return -1 on
error, 0 if no snapshot is found and 1 if snapshot is found.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-7-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Currently the vmstate will be stored in the first block device that
supports snapshots. Historically this would have usually been the
root device, but with UEFI it might be the variable store. There
needs to be a way to override the choice of block device to store
the state in.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-6-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
When running snapshot operations, there are various rules for which
blockdevs are included/excluded. While this provides reasonable default
behaviour, there are scenarios that are not well handled by the default
logic. Some of the conditions do not have a single correct answer.
Thus there needs to be a way for the mgmt app to provide an explicit
list of blockdevs to perform snapshots across. This can be achieved
by passing a list of node names that should be used.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-5-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
The bdrv_all_*_snapshot functions return a BlockDriverState pointer
for the invalid backend, which the callers then use to report an
error message. In some cases multiple callers are reporting the
same error message, but with slightly different text. In the future
there will be more error scenarios for some of these methods, which
will benefit from fine grained error message reporting. So it is
helpful to push error reporting down a level.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
[PMD: Initialize variables]
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210204124834.774401-2-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
NBD reconnect logic considers the error code from the functions that
read NBD messages to tell if reconnect should be attempted or not: it is
attempted on -EIO, otherwise the client transitions to NBD_CLIENT_QUIT
state (see nbd_channel_error). This error code is propagated from the
primitives like nbd_read.
The problem, however, is that nbd_read itself turns every error into -1
rather than -EIO. As a result, if the NBD server happens to die while
sending the message, the client in QEMU receives less data than it
expects, considers it as a fatal error, and wouldn't attempt
reestablishing the connection.
Fix it by turning every negative return from qio_channel_read_all into
-EIO returned from nbd_read. Apparently that was the original behavior,
but got broken later. Also adjust nbd_readXX to follow.
Fixes: e6798f06a6 ("nbd: generalize usage of nbd_read")
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210129073859.683063-4-rvkagan@yandex-team.ru>
Signed-off-by: Eric Blake <eblake@redhat.com>
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert now copy_range parameters which are already 64bit to signed
type.
It's safe as we don't work with requests overflowing BDRV_MAX_LENGTH
(which is less than INT64_MAX), and do check the requests in
bdrv_co_copy_range_internal() (by bdrv_check_request32(), which calls
bdrv_check_request()).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-17-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
Now, since bdrv_co_preadv_part() and bdrv_co_pwritev_part() have been
updated, update all their wrappers.
For all of them type of 'bytes' is widening, so callers are safe. We
have update request_fn in blkverify.c simultaneously. Still it's just a
pointer to one of bdrv_co_pwritev() or bdrv_co_preadv(), and type is
widening for callers of the request_fn anyway.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-16-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, prepare bdrv_co_preadv_part() and bdrv_co_pwritev_part() and their
remaining dependencies now.
bdrv_pad_request() is updated simultaneously, as pointer to bytes passed
to it both from bdrv_co_pwritev_part() and bdrv_co_preadv_part().
So, all callers of bdrv_pad_request() are updated to pass 64bit bytes.
bdrv_pad_request() is already good for 64bit requests, add
corresponding assertion.
Look at bdrv_co_preadv_part() and bdrv_co_pwritev_part().
Type is widening, so callers are safe. Let's look inside the functions.
In bdrv_co_preadv_part() and bdrv_aligned_pwritev() we only pass bytes
to other already int64_t interfaces (and some obviously safe
calculations), it's OK.
In bdrv_co_do_zero_pwritev() aligned_bytes may become large now, still
it's passed to bdrv_aligned_pwritev which supports int64_t bytes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-15-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
All requests in block/io must not overflow BDRV_MAX_LENGTH, all
external users of BdrvTrackedRequest already have corresponding
assertions, so we are safe. Add some assertions still.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-9-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
The function is called from 64bit io handlers, and bytes is just passed
to throttle_account() which is 64bit too (unsigned though). So, let's
convert intermediate argument to 64bit too.
This patch is a first in the 64-bit-blocklayer series, so we are
generally moving to int64_t for both offset and bytes parameters on all
io paths. Main motivation is realization of 64-bit write_zeroes
operation for fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
Patch-correctness audit by Eric Blake:
Caller has 32-bit, this patch now causes widening which is safe:
block/block-backend.c: blk_do_preadv() passes 'unsigned int'
block/block-backend.c: blk_do_pwritev_part() passes 'unsigned int'
block/throttle.c: throttle_co_pwrite_zeroes() passes 'int'
block/throttle.c: throttle_co_pdiscard() passes 'int'
Caller has 64-bit, this patch fixes potential bug where pre-patch
could narrow, except it's easy enough to trace that callers are still
capped at 2G actions:
block/throttle.c: throttle_co_preadv() passes 'uint64_t'
block/throttle.c: throttle_co_pwritev() passes 'uint64_t'
Implementation in question: block/throttle-groups.c
throttle_group_co_io_limits_intercept() takes 'unsigned int bytes'
and uses it: argument to util/throttle.c throttle_account(uint64_t)
All safe: it patches a latent bug, and does not introduce any 64-bit
gotchas once throttle_co_p{read,write}v are relaxed, and assuming
throttle_account() is not buggy.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201211183934.169161-7-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
It's better to pass &error_abort than just assert that result is 0: on
crash, we'll immediately see the reason in the backtrace.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix iotest 206 fallout]
Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-21-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Drop unused code.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-20-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are going to use async block-copy call in backup, so we'll need to
passthrough setting backup speed to block-copy call.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-9-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add function to cancel running async block-copy call. It will be used
in backup.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are going to directly use one async block-copy operation for backup
job, so we need rate limiter.
We want to maintain current backup behavior: only background copying is
limited and copy-before-write operations only participate in limit
calculation. Therefore we need one rate limiter for block-copy state
and boolean flag for block-copy call state for actual limitation.
Note, that we can't just calculate each chunk in limiter after
successful copying: it will not save us from starting a lot of async
sub-requests which will exceed limit too much. Instead let's use the
following scheme on sub-request creation:
1. If at the moment limit is not exceeded, create the request and
account it immediately.
2. If at the moment limit is already exceeded, drop create sub-request
and handle limit instead (by sleep).
With this approach we'll never exceed the limit more than by one
sub-request (which pretty much matches current backup behavior).
Note also, that if there is in-flight block-copy async call,
block_copy_kick() should be used after set-speed to apply new setup
faster. For that block_copy_kick() published in this patch.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
They will be used for backup.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-5-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We'll need async block-copy invocation to use in backup directly.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-4-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Experiments show, that copy_range is not always making things faster.
So, to make experimentation simpler, let's add a parameter. Some more
perf parameters will be added soon, so here is a new struct.
For now, add new backup qmp parameter with x- prefix for the following
reasons:
- We are going to add more performance parameters, some will be
related to the whole block-copy process, some only to background
copying in backup (ignored for copy-before-write operations).
- On the other hand, we are going to use block-copy interface in other
block jobs, which will need performance options as well.. And it
should be the same structure or at least somehow related.
So, there are too much unclean things about how the interface and now
we need the new options mostly for testing. Let's keep them
experimental for a while.
In do_backup_common() new x-perf parameter handled in a way to
make further options addition simpler.
We add use-copy-range with default=true, and we'll change the default
in further patch, after moving backup to use block-copy.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-2-vsementsov@virtuozzo.com>
[mreitz: s/5\.2/6.0/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
The code already don't freeze base node and we try to make it prepared
for the situation when base node is changed during the operation. In
other words, block-stream doesn't own base node.
Let's introduce a new interface which should replace the current one,
which will in better relations with the code. Specifying bottom node
instead of base, and requiring it to be non-filter gives us the
following benefits:
- drop difference between above_base and base_overlay, which will be
renamed to just bottom, when old interface dropped
- clean way to work with parallel streams/commits on the same backing
chain, which otherwise become a problem when we introduce a filter
for stream job
- cleaner interface. Nobody will surprised the fact that base node may
disappear during block-stream, when there is no word about "base" in
the interface.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201216061703.70908-11-vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
If the flag BDRV_REQ_PREFETCH was set, skip idling read/write
operations in COR-driver. It can be taken into account for the
COR-algorithms optimization. That check is being made during the
block stream job by the moment.
Add the BDRV_REQ_PREFETCH flag to the supported_read_flags of the
COR-filter.
block: Modify the comment for the flag BDRV_REQ_PREFETCH as we are
going to use it alone and pass it to the COR-filter driver for further
processing.
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-9-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add the new member supported_read_flags to the BlockDriverState
structure. It will control the flags set for copy-on-read operations.
Make the block generic layer evaluate supported read flags before they
go to a block driver.
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[vsementsov: use assert instead of abort]
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Provide the possibility to pass the 'filter-node-name' parameter to the
block-stream job as it is done for the commit block job.
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[vsementsov: comment indentation, s/Since: 5.2/Since: 6.0/]
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-5-vsementsov@virtuozzo.com>
[mreitz: s/commit/stream/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Provide API for insertion a node to backing chain.
Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-3-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Add flag to make serialising request no wait: if there are conflicting
requests, just return error immediately. It's will be used in upcoming
preallocate filter.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201021145859.11201-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We'll need a separate function, which will only "mark" request
serialising with specified align but not wait for conflicting
requests. So, it will be like old bdrv_mark_request_serialising(),
before merging bdrv_wait_serialising_requests_locked() into it.
To reduce the possible mess, let's do the following:
Public function that does both marking and waiting will be called
bdrv_make_request_serialising, and private function which will only
"mark" will be called tracked_request_set_serialising().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201021145859.11201-6-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
1. BDRV_REQ_NO_SERIALISING doesn't exist already, don't mention it.
2. We are going to add one more user of BDRV_REQ_SERIALISING, so
comment about backup becomes a bit confusing here. The use case in
backup is documented in block/backup.c, so let's just drop
duplication here.
3. The fact that BDRV_REQ_SERIALISING is only for write requests is
omitted. Add a note.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201021145859.11201-2-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
We are going to modify block layer to work with 64bit requests. And
first step is moving to int64_t type for both offset and bytes
arguments in all block request related functions.
It's mostly safe (when widening signed or unsigned int to int64_t), but
switching from uint64_t is questionable.
So, let's first establish the set of requests we want to work with.
First signed int64_t should be enough, as off_t is signed anyway. Then,
obviously offset + bytes should not overflow.
And most interesting: (offset + bytes) being aligned up should not
overflow as well. Aligned to what alignment? First thing that comes in
mind is bs->bl.request_alignment, as we align up request to this
alignment. But there is another thing: look at
bdrv_mark_request_serialising(). It aligns request up to some given
alignment. And this parameter may be bdrv_get_cluster_size(), which is
often a lot greater than bs->bl.request_alignment.
Note also, that bdrv_mark_request_serialising() uses signed int64_t for
calculations. So, actually, we already depend on some restrictions.
Happily, bdrv_get_cluster_size() returns int and
bs->bl.request_alignment has 32bit unsigned type, but defined to be a
power of 2 less than INT_MAX. So, we may establish, that INT_MAX is
absolute maximum for any kind of alignment that may occur with the
request.
Note, that bdrv_get_cluster_size() is not documented to return power
of 2, still bdrv_mark_request_serialising() behaves like it is.
Also, backup uses bdi.cluster_size and is not prepared to it not being
power of 2.
So, let's establish that Qemu supports only power-of-2 clusters and
alignments.
So, alignment can't be greater than 2^30.
Finally to be safe with calculations, to not calculate different
maximums for different nodes (depending on cluster size and
request_alignment), let's simply set QEMU_ALIGN_DOWN(INT64_MAX, 2^30)
as absolute maximum bytes length for Qemu. Actually, it's not much less
than INT64_MAX.
OK, then, let's apply it to block/io.
Let's consider all block/io entry points of offset/bytes:
4 bytes/offset interface functions: bdrv_co_preadv_part(),
bdrv_co_pwritev_part(), bdrv_co_copy_range_internal() and
bdrv_co_pdiscard() and we check them all with bdrv_check_request().
We also have one entry point with only offset: bdrv_co_truncate().
Check the offset.
And one public structure: BdrvTrackedRequest. Happily, it has only
three external users:
file-posix.c: adopted by this patch
write-threshold.c: only read fields
test-write-threshold.c: sets obviously small constant values
Better is to make the structure private and add corresponding
interfaces.. Still it's not obvious what kind of interface is needed
for file-posix.c. Let's keep it public but add corresponding
assertions.
After this patch we'll convert functions in block/io.c to int64_t bytes
and offset parameters. We can assume that offset/bytes pair always
satisfy new restrictions, and make
corresponding assertions where needed. If we reach some offset/bytes
point in block/io.c missing bdrv_check_request() it is considered a
bug. As well, if block/io.c modifies a offset/bytes request, expanding
it more then aligning up to request_alignment, it's a bug too.
For all io requests except for discard we keep for now old restriction
of 32bit request length.
iotest 206 output error message changed, as now test disk size is
larger than new limit. Add one more test case with new maximum disk
size to cover too-big-L1 case.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block-export-add type=fuse allows mounting block graph nodes via FUSE on
some existing regular file. That file should then appears like a raw
disk image, and accesses to it result in accesses to the exported BDS.
Right now, we only implement the necessary block export functions to set
it up and shut it down. We do not implement any access functions, so
accessing the mount point only results in errors. This will be
addressed by a followup patch.
We keep a hash table of exported mount points, because we want to be
able to detect when users try to use a mount point twice. This is
because we invoke stat() to check whether the given mount point is a
regular file, but if that file is served by ourselves (because it is
already used as a mount point), then this stat() would have to be served
by ourselves, too, which is impossible to do while we (as the caller)
are waiting for it to settle. Therefore, keep track of mount point
paths to at least catch the most obvious instances of that problem.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function is really an internal helper for bdrv_close(). Update its
doc comment to make this clear and make the function private.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160387245480.131299.13430357162209598411.stgit@bahia>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Rename Submission Queue flags with 'Sq' to differentiate
submission queue flags from command queue flags, and introduce
Completion Queue flag definitions.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20201029093306.1063879-13-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE28EdLTc7SjdV9QLsYlFWYQpPbMAFAl+gI74ACgkQYlFWYQpP
bMCBrA/9GXMZZGDfHFenXF+rS6J+ZKxtk29vq9Ly8KZ9YW7CzF9MP8qE/5iyFfmx
d1BknXGQerW2kAzpkOq2/MKDklOc+0BAhaTdUaFR/ao5ZKuv2LQ8uFnKVoTrhTx9
+HVkTVUTnez6ReCZVIrtN4+XVdyQTeQotJg6H2m5Q/BxQKcj6OMOlneuSGDn5vFN
EWgDvEmfFEkzbN8FMXtkT35bg3vA5TGmfQRMk1SMMREOPxF04CaTVTxYscCpS0WC
Cl+62mx4XLjscK7hwXuTNTrxeOLxZ2xLK5dhDd/qxBveio07mIM5X2psdKR0t5qX
HLtm437T9CAYmyo8jgvM4KL8f+rbJnLd579qyVwIMsue28Qisj9nuWCTcaEpjfck
4krhxJwxenRtqQ9wYrnbnQI5yQDIE6iUGf0toXwCNdJIr+FvyIcT7vJtTzZXtRI8
sxwK5wfJ/WSey9uNLZGFbQuv4vjOMV+Nk3mEi1gUV8ujogo+2U6WUAE3NhqFLKn1
YT6AJhDZvqL1f8gFrbiqR8xwvPrYmwK/tK38X1exSDOqiB7UNzR/apAb1oniul0e
rS5xWzIs9APvkdWQssCHvrVDdh6VISXQ5bnT8lkfmvYrCTn2gUGAFXDrxZjXIaL9
scCr8N9STkHmoYpc2ACRKIpfK3E1sDjGA8mAPemkxsLakNwBS4o=
=s4KC
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/nvme/tags/pull-nvme-20201102' into staging
nvme pull 2 Nov 2020
# gpg: Signature made Mon 02 Nov 2020 15:20:30 GMT
# gpg: using RSA key DBC11D2D373B4A3755F502EC625156610A4F6CC0
# gpg: Good signature from "Keith Busch <kbusch@kernel.org>" [unknown]
# gpg: aka "Keith Busch <keith.busch@gmail.com>" [unknown]
# gpg: aka "Keith Busch <keith.busch@intel.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DBC1 1D2D 373B 4A37 55F5 02EC 6251 5661 0A4F 6CC0
* remotes/nvme/tags/pull-nvme-20201102: (30 commits)
hw/block/nvme: fix queue identifer validation
hw/block/nvme: fix create IO SQ/CQ status codes
hw/block/nvme: fix prp mapping status codes
hw/block/nvme: report actual LBA data shift in LBAF
hw/block/nvme: add trace event for requests with non-zero status code
hw/block/nvme: add nsid to get/setfeat trace events
hw/block/nvme: reject io commands if only admin command set selected
hw/block/nvme: support for admin-only command set
hw/block/nvme: validate command set selected
hw/block/nvme: support per-namespace smart log
hw/block/nvme: fix log page offset check
hw/block/nvme: remove pointless rw indirection
hw/block/nvme: update nsid when registered
hw/block/nvme: change controller pci id
pci: allocate pci id for nvme
hw/block/nvme: support multiple namespaces
hw/block/nvme: refactor identify active namespace id list
hw/block/nvme: add support for sgl bit bucket descriptor
hw/block/nvme: add support for scatter gather lists
hw/block/nvme: harden cmb access
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
'qemu-img map' provides a way to determine which extents of an image
come from the top layer vs. inherited from a backing chain. This is
useful information worth exposing over NBD. There is a proposal to
add a QMP command block-dirty-bitmap-populate which can create a dirty
bitmap that reflects allocation information, at which point the
qemu:dirty-bitmap:NAME metadata context can expose that information
via the creation of a temporary bitmap, but we can shorten the effort
by adding a new qemu:allocation-depth metadata context that does the
same thing without an intermediate bitmap (this patch does not
eliminate the need for that proposal, as it will have other uses as
well).
While documenting things, remember that although the NBD protocol has
NBD_OPT_SET_META_CONTEXT, the rest of its documentation refers to
'metadata context', which is a more apt description of what is
actually being used by NBD_CMD_BLOCK_STATUS: the user is requesting
metadata by passing one or more context names. So I also touched up
some existing wording to prefer the term 'metadata context' where it
makes sense.
Note that this patch does not actually enable any way to request a
server to enable this context; that will come in the next patch.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20201027050556.269064-10-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
If a BDS gets deleted during blk_drain_all(), it might miss a
call to bdrv_do_drained_end(). This means missing a call to
aio_enable_external() and the AIO context remains disabled for
ever. This can cause a device to become irresponsive and to
disrupt the guest execution, ie. hang, loop forever or worse.
This scenario is quite easy to encounter with virtio-scsi
on POWER when punching multiple blockdev-create QMP commands
while the guest is booting and it is still running the SLOF
firmware. This happens because SLOF disables/re-enables PCI
devices multiple times via IO/MEM/MASTER bits of PCI_COMMAND
register after the initial probe/feature negotiation, as it
tends to work with a single device at a time at various stages
like probing and running block/network bootloaders without
doing a full reset in-between. This naturally generates many
dataplane stops and starts, and thus many drain sections that
can race with blockdev_create_run(). In the end, SLOF bails
out.
It is somehow reproducible on x86 but it requires to generate
articial dataplane start/stop activity with stop/cont QMP
commands. In this case, seabios ends up looping for ever,
waiting for the virtio-scsi device to send a response to
a command it never received.
Add a helper that pairs all previously called bdrv_do_drained_begin()
with a bdrv_do_drained_end() and call it from bdrv_close().
While at it, update the "/bdrv-drain/graph-change/drain_all"
test in test-bdrv-drain so that it can catch the issue.
BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1874441
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160346526998.272601.9045392804399803158.stgit@bahia.lan>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Since commit c8bb23cbdb when a write
request results in a new allocation QEMU first tries to see if the
rest of the cluster outside the written area contains only zeroes.
In that case, instead of doing a normal copy-on-write operation and
writing explicit zero buffers to disk, the code zeroes the whole
cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.
This improves performance very significantly but it only happens when
we are writing to an area that was completely unallocated before. Zero
clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
are therefore slower to allocate.
This happens because the code uses bdrv_is_allocated_above() rather
bdrv_block_status_above(). The former is not as accurate for this
purpose but it is faster. However in the case of qcow2 the underlying
call does already report zero clusters just fine so there is no reason
why we cannot use that information.
After testing 4KB writes on an image that only contains zero clusters
this patch results in almost five times more IOPS.
Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <6d77cab968c501c44d6e1089b9bc91b04170b49e.1603731354.git.berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Address 0 is not an invalid address. Remove those invalikd checks.
Unaligned PRP2 and PRP list entries should result in Invalid PRP Offset
status code and not Invalid Field. Fix that.
See NVMe Express v1.3d, Section 4.3 ("Physical Region Page Entry and
List").
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
If the host sets CC.CSS to 111b, all commands submitted to I/O queues
should be completed with status Invalid Command Opcode.
Note that this is technically a v1.4 feature, but it does not hurt to
implement before we finally bump the reported version implemented.
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Fail to start the controller if the user requests a command set that the
controller does not support.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Let the user specify a specific namespace if they want to get access
stats for a specific namespace.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
For now, support the Data Block, Segment and Last Segment descriptor
types.
See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Inside of coroutine context, we can't directly use aio_context_acquire()
for the AioContext of a block node because we already own the lock of
the current AioContext and we need to avoid double locking to prevent
deadlocks.
This provides helper functions to lock the AioContext of a node only if
it's not the same as the current AioContext.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20201005155855.256490-14-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Add a pair of functions to temporarily move the current coroutine to the
AioContext of a given BlockDriverState.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20201005155855.256490-13-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Add a function that can be used to move the currently running coroutine
to a different AioContext (and therefore potentially a different
thread).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201005155855.256490-12-kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Saving icount as a parameters of the snapshot allows navigation between
them in the execution replay scenario.
This information can be used for finding a specific snapshot for proceeding
the recorded execution to the specific moment of the time.
E.g., 'reverse step' action (introduced in one of the following patches)
needs to load the nearest snapshot which is prior to the current moment
of time.
This patch also updates snapshot test which verifies qemu monitor output.
Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
--
v4 changes:
- squashed format update with test output update
v7 changes:
- introduced the spaces between the fields in snapshot info output
- updated the test to match new field widths
Message-Id: <160174518865.12451.14327573383978752463.stgit@pasha-ThinkPad-X280>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is the only non-ascii character in the file and it doesn't really
needed here. Let's use normal "'" symbol for consistency with the rest
11 occurrences of "'" in the file.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Like for read/write in a previous commit, drop extra indirection layer,
generate directly bdrv_readv_vmstate() and bdrv_writev_vmstate().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20200924185414.28642-8-vsementsov@virtuozzo.com>