This makes all sync commands uninterruptible and schedules without timeout
so the controller either has to post a completion or the timeout recovery
fails the command. This fixes potential memory or data corruption from
a command timing out too early or woken by a signal. Previously any DMA
buffers mapped for that command would have been released even though we
don't know what the controller is planning to do with those addresses.
Signed-off-by: Keith Busch <keith.busch@intel.com>
We don't track queues in a llist, subscribe to hot-cpu notifications,
or internally retry commands. Delete the unused artifacts.
Signed-off-by: Keith Busch <keith.busch@intel.com>
The driver has to end unreturned commands at some point even if the
controller has not provided a completion. The driver tried to be safe by
deleting IO queues prior to ending all unreturned commands. That should
cause the controller to internally abort inflight commands, but IO queue
deletion request does not have to be successful, so all bets are off. We
still have to make progress, so to be extra safe, this patch doesn't
clear a queue to release the dma mapping for a command until after the
pci device has been disabled.
This patch removes the special handling during device initialization
so controller recovery can be done all the time. This is possible since
initialization is not inlined with pci probe anymore.
Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
This performs the longest parts of nvme device probe in scheduled work.
This speeds up probe significantly when multiple devices are in use.
Signed-off-by: Keith Busch <keith.busch@intel.com>
This creates a new class type for nvme devices to register their
management character devices with. This is so we do not rely on miscdev
to provide enough minors for as many nvme devices some people plan to
use. The previous limit was approximately 60 NVMe controllers, depending
on the platform and kernel. Now the limit is 1M, which ought to be enough
for anybody.
Since we have a new device class, it makes sense to attach the block
devices under this as well, so part of this patch moves the management
handle initialization prior to the namespaces discovery.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Adds support for NVMe metadata formats and exposes block devices for
all namespaces regardless of their format. Namespace formats that are
unusable will have disk capacity set to 0, but a handle to the block
device is created to simplify device management. A namespace is not
usable when the format requires host interleave block and metadata in
single buffer, has no provisioned storage, or has better data but failed
to register with blk integrity.
The namespace has to be scanned in two phases to support separate
metadata formats. The first establishes the sector size and capacity
prior to invoking add_disk. If metadata is required, the capacity will
be temporarilly set to 0 until it can be revalidated and registered with
the integrity extenstions after add_disk completes.
The driver relies on the integrity extensions to provide the metadata
buffer. NVMe requires this be a single physically contiguous region,
so only one integrity segment is allowed per command. If the metadata
is used for T10 PI, the driver provides mappings to save and restore
the reftag physical block translation. The driver provides no-op
functions for generate and verify if metadata is not used for protection
information. This way the setup is always provided by the block layer.
If a request does not supply a required metadata buffer, the command
is failed with bad address. This could only happen if a user manually
disables verify/generate on such a disk. The only exception to where
this is okay is if the controller is capable of stripping/generating
the metadata, which is possible on some types of formats.
The metadata scatter gather list now occupies the spot in the nvme_iod
that used to be used to link retryable IOD's, but we don't do that
anymore, so the field was unused.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Pull block driver changes from Jens Axboe:
"This contains:
- The 4k/partition fixes for brd from Boaz/Matthew.
- A few xen front/back block fixes from David Vrabel and Roger Pau
Monne.
- Floppy changes from Takashi, cleaning the device file creation.
- Switching libata to use the new blk-mq tagging policy, removing
code (and a suboptimal implementation) from libata. This will
throw you a merge conflict, since a bug in the original libata
tagging code was fixed since this code was branched. Trivial.
From Shaohua.
- Conversion of loop to blk-mq, from Ming Lei.
- Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
He claims it improves on unreadable code, which will cost him a
beer.
- Maintainer update or NDB, now handled by Markus Pargmann.
- NVMe:
- Optimization from me that avoids a kmalloc/kfree per IO for
smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU
overhead.
- Removal of (now) dead RCU code, a relic from before NVMe was
converted to blk-mq"
* 'for-3.20/drivers' of git://git.kernel.dk/linux-block:
xen-blkback: default to X86_32 ABI on x86
xen-blkfront: fix accounting of reqs when migrating
xen-blkback,xen-blkfront: add myself as maintainer
block: Simplify bsg complete all
floppy: Avoid manual call of device_create_file()
NVMe: avoid kmalloc/kfree for smaller IO
MAINTAINERS: Update NBD maintainer
libata: make sata_sil24 use fifo tag allocator
libata: move sas ata tag allocation to libata-scsi.c
libata: use blk taging
NVMe: within nvme_free_queues(), delete RCU sychro/deferred free
null_blk: suppress invalid partition info
brd: Request from fdisk 4k alignment
brd: Fix all partitions BUGs
axonram: Fix bug in direct_access
loop: add blk-mq.h include
block: loop: don't handle REQ_FUA explicitly
block: loop: introduce lo_discard() and lo_req_flush()
block: loop: say goodby to bio
block: loop: improve performance via blk-mq
Currently we allocate an nvme_iod for each IO, which holds the
sg list, prps, and other IO related info. Set a threshold of
2 pages and/or 8KB of data, below which we can just embed this
in the per-command pdu in blk-mq. For any IO at or below
NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and kfree.
For higher IOPS, this saves up to 1% of CPU time.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Converting from to blk-queue got rid of the driver's RCU
locking-on-queue, so removing unnecessary RCU locking-on-queue
artefacts.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Kelly Nicole Kaoudis <kaoudis@colorado.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was inadvertently dropped from an earlier commit, otherwise
the check against cq_vector == -1 to prevent double free doesn't
make any sense.
Fixes: 2b25d98179
Signed-off-by: Jens Axboe <axboe@fb.com>
The queues and device need to be locked when messing with them.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This freezes and stops all the queues on device shutdown and restarts
them on resume. This fixes hotplug and reset issues when the controller
is actively being used.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Aborts all requeued commands prior to killing the request_queue. For
commands that time out on a dying request queue, set the "Do Not Retry"
bit on the command status so the command cannot be requeued. Finanally, if
the driver is requested to abort a command it did not start, do nothing.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This protects admin queue access on shutdown. When the controller is
disabled, the queue is frozen to prevent new entry, and unfrozen on
resume, and fixes cq_vector signedness to not suspend a queue twice.
Since unfreezing the queue makes it available for commands, it requires
the queue be initialized, so this moves this part after that.
Special handling is done when the device is unresponsive during
shutdown. This can be optimized to not require subsequent commands to
timeout, but saving that fix for later.
This patch also removes the kill signals in this path that were left-over
artifacts from the blk-mq conversion and no longer necessary.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since there is no gendisk associated with the admin queue, the driver
needs to hold a reference to it until all open references to the
controller are closed.
This also combines queue cleanup with freeing the tag set since these
should not be separate.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Once the nvme callback is set for a request, the driver can start it
and make it available for timeout handling. For timed out commands on a
device that is not initialized, this fixes potential deadlocks that can
occur on startup and shutdown when a device is unresponsive since they
can now be cancelled.
Asynchronous requests do not have any expected timeout, so these are
using the new "REQ_NO_TIMEOUT" request flags.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Check IS_ERR_OR_NULL(return value) instead of just return value.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reduced to IS_ERR() by me, we never return NULL.
Signed-off-by: Jens Axboe <axboe@fb.com>
Sets the vector to an invalid value after it's freed so we don't free
it twice.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we have a race between the schedule timing out and the command
completing, we could have the task issuing the command exit
nvme_submit_sync_cmd() while the irq is running sync_completion().
If that happens, we could be corrupting memory, since the stack
that held 'cmdinfo' is no longer valid.
Fix this by always calling nvme_abort_cmd_info(). Once that call
completes, we know that we have either run sync_completion() if
the completion came in, or that we will never run it since we now
have special_completion() as the command callback handler.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The logic around retrying and erroring IO in nvme_queue_rq() is broken
in a few ways:
- If we fail allocating dma memory for a discard, we return retry. We
have the 'iod' stored in ->special, but we free the 'iod'.
- For a normal request, if we fail dma mapping of setting up prps, we
have the same iod situation. Additionally, we haven't set the callback
for the request yet, so we also potentially leak IOMMU resources.
Get rid of the ->special 'iod' store. The retry is uncommon enough that
it's not worth optimizing for or holding on to resources to attempt to
speed it up. Additionally, it's usually best practice to free any
request related resources when doing retries.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
After Hot-remove of a device with a mounted partition,
when the device is hot-added again, the new node reappears
as nvme0n1. Mounting this new node fails with the error:
mount: mount /dev/nvme0n1p1 on /mnt failed: File exists.
The old nodes's FS entries still exist and the kernel can't re-create
procfs and sysfs entries for the new node with the same name.
The patch fixes this issue.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Indraneel M <indraneel.m@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We return an error pointer or the request, not NULL. Half
the call paths got it right, the others didn't. Fix those up.
Signed-off-by: Jens Axboe <axboe@fb.com>
We allocate 'abort_req', but free 'req' in case of an error
submitting the IO.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
On retry, the req->special is pointing to an already setup IOD, but we
still need to setup the command context and callback, otherwise you'll
see false twice completed errors and leak requests.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It's already near impossible to tell what bits someone is running based on
a 'modinfo nvme', and I don't want to try guessing if someone is running
blk-mq or bio-based. Let's make it obvious with the module version that
the blk-mq conversion is a major change. Future bio-based versions can
increment to 0.10 in a fork if revisions occur.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The PCI init of NVMe doesn't check for valid bars before proceeding
to map and use BAR 0. If the device is hosed (or firmware is), then
we should catch this case and give up early.
This fixes a:
[ 1662.035778] WARNING: CPU: 0 PID: 4 at arch/x86/mm/ioremap.c:63 __ioremap_check_ram+0xa7/0xc0()
and later badness on such a device.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we do teardown and setup of the queue and block related parts
of the driver, then we should clear nvmeq->hctx once we kill the
hardware queue.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The setup/probe part currently relies on INTx being there and
working, that's not always the case. For devices that don't
advertise INTx, enable a single MSIx vector early on and disable
it again before we ask for our full range of queue vecs.
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We are called for async event notification issues, and the
nvmeq lock is already held. If we fail the request allocation,
we'll just retry next time.
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>
No point in using blk_put_request(), since we know we are blk-mq.
This only makes sense in core code where we could be dealing with
either legacy or blk-mq drivers. Additionally, use
blk_mq_free_hctx_request() for the request completion fast path,
where we already know the mapping from request to hardware queue.
Signed-off-by: Jens Axboe <axboe@fb.com>
drivers/block/nvme-core.c:865:5: sparse: symbol '__nvme_submit_admin_cmd' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We recently converted this to blk_mq but the error checks have to be
updated to check for IS_ERR() instead of NULL.
Fixes: a4aea5623d ('NVMe: Convert to blk-mq')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This converts the NVMe driver to a blk-mq request-based driver.
The NVMe driver is currently bio-based and implements queue logic within
itself. By using blk-mq, a lot of these responsibilities can be moved
and simplified.
The patch is divided into the following blocks:
* Per-command data and cmdid have been moved into the struct request
field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id
maintenance are now handled by blk-mq through the rq->tag field.
* The logic for splitting bio's has been moved into the blk-mq layer.
The driver instead notifies the block layer about limited gap support in
SG lists.
* blk-mq handles timeouts and is reimplemented within nvme_timeout().
This both includes abort handling and command cancelation.
* Assignment of nvme queues to CPUs are replaced with the blk-mq
version. The current blk-mq strategy is to assign the number of
mapped queues and CPUs to provide synergy, while the nvme driver
assign as many nvme hw queues as possible. This can be implemented in
blk-mq if needed.
* NVMe queues are merged with the tags structure of blk-mq.
* blk-mq takes care of setup/teardown of nvme queues and guards invalid
accesses. Therefore, RCU-usage for nvme queues can be removed.
* IO tracing and accounting are handled by blk-mq and therefore removed.
* Queue suspension logic is replaced with the logic from the block
layer.
Contributions in this patch from:
Sam Bradshaw <sbradshaw@micron.com>
Jens Axboe <axboe@fb.com>
Keith Busch <keith.busch@intel.com>
Robert Nelson <rlnelson@google.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Jens Axboe <axboe@fb.com>
Updated for new ->queue_rq() prototype.
Signed-off-by: Jens Axboe <axboe@fb.com>
Discard requests are often for very large ranges. The discard size is not
representative of the data transfer size so we don't need to allocate
for such a large prp list. This patch requests allocating only enough
for the memory needed for the data transfer and saves a little over 8k
of memory per max discard request.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is possible the block layer will request to open a block device after
the driver deleted it. Subsequent releases will cause a double free,
or the disk's private_data is pointing to freed memory. This patch
protects the driver's freed disks from being opened and accessed: the
nvme namespaces are freed only when the device's refcount is 0, so at
that moment there were no active openers and no more should be allowed,
and it is safe to clear the disk's private_data that is about to be freed.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Henry Chow <henry.chow@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The nvme namespace request_queue's flags are initialized to
QUEUE_FLAG_DEFAULT, which currently sets QUEUE_FLAG_STACKABLE. The
device-mapper indicates this flag means the block driver is requset
based, though this driver is bio-based and problems will occur if an nvme
namespace is used with a request based dm device. This patch clears the
stackable flag.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we ever do parallel device probing, we need to wake up all processes
waiting for nvme kthread to start, not just one. This is currently
serialized so the bug is not reachable today, but fixing this anyway in
the hopes we implement parallel or asynchronous probe in the future.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The NVME_IOCTL_SUBMIT_IO only works for IO commands with block data
transfers and isn't usable for other NVMe commands like flush,
data set management, or any sort of vendor unique command. The
NVME_IOCTL_ADMIN_CMD, however, can easily be modified to accept arbitrary
IO commands in addition to arbitrary admin commands without breaking
backward compatibility. This patch just adds a new IOCTL to distinguish
if the driver should submit the command on an IO or Admin queue.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a callback to revalidate the disk and change its block size
and capacity if needed. Before, a user would have to remove + rescan
an entire device if they changed the logical block size using an NVMe
Format or other vendor specific command; now they can just run something
that issues the BLKRRPART IOCTL, like
# hdparm -z /dev/nvmeXnY
This can also be used in response to the 1.2 Spec's Namespace Attribute
Change asynchronous event.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We need to update the nvme queue's wait_queue_t entry during each
initialization since the nvme_thread may be ended and restarted when
the device is reset. If a device reset occurs during a large amount
of buffered IO, it would take a lot longer to complete the outstanding
requests due to the 1 second polling instead of waking up as completions
occur.
Fixes: b9afca3efb
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This returns a more appropriate error for the "capacity exceeded"
status. In case other NVMe statuses have a better errno, this patch adds
a convience function to translate an NVMe status code to an errno for
IO commands, defaulting to the current -EIO.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can return -ENOIOCTLCMD and the ioctl will be handled by
fs/compat_ioctl.c instead. This removes a lot of duplicate code in the
nvme driver.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If an nvme device is removed but user space has an open reference,
the nvme driver would have been holding an invalid reference to its pci
device. You may get a general protection fault on x86 h/w when the driver
uses that reference in dma_map_sg(), as is done in nvme_map_user_pages()
from the IOCTL interface.
This patch fixes the fault by taking a reference on the pci device and
holding it even after device removal until all opens on the nvme device
are closed.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
nvme_submit_io_cmd() uses smp_processor_id() to pick an IO queue index.
This patch fixes the case where there are more cpus from which the ioctl
call can originate than online queues, which can happen when a device
supports or was allocated fewer interrupt vectors than exist cpu cores.
Thanks to Keith Busch for the implementation suggestion.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This changes the order of deleting the gendisks so it happens after the
nvme IO queues are freed. If a device is removed while a filesystem has
associated dirty data, the removal will wait on these to complete before
proceeding from del_gendisk, which could have caused deadlock before.
The implication of this is that an orderly removal of a responsive
device won't necessarily wait for dirty data to be written, but we are
not guaranteed the device is even going to respond at this point either.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rather than relying on call_rcu, this patch directly frees the
nvme_queue's memory after ensuring no readers exist. Some arch specific
dma_free_coherent implementations may not be called from a call_rcu's
soft interrupt context, hence the change.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Matthew Minter <matthew_minter@xyratex.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The current implementation hard-codes the shutdown timeout to 2 seconds.
Some devices take longer than this to complete a normal shutdown.
Changing the shutdown timeout to a module parameter with a default
timeout of 5 seconds.
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rather than skipping shutdown only for devices that have been removed,
skip the orderly shutdown on failed devices to avoid the long timeout
handling that inevitably happens when deleting queues on such a device.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>