Drivers can now tell blk-mq if they take advantage of the deferred
issue through 'last' or not. If they do, don't do queue-direct
for sync IO. This is a preparation patch for the nvme conversion.
Signed-off-by: Jens Axboe <axboe@fb.com>
Since we have the notion of a 'last' request in a chain, we can use
this to have the hardware optimize the issuing of requests. Add
a list_head parameter to queue_rq that the driver can use to
temporarily store hw commands for issue when 'last' is true. If we
are doing a chain of requests, pass in a NULL list for the first
request to force issue of that immediately, then batch the remainder
for deferred issue until the last request has been sent.
Instead of adding yet another argument to the hot ->queue_rq path,
encapsulate the passed arguments in a blk_mq_queue_data structure.
This is passed as a constant, and has been tested as faster than
passing 4 (or even 3) args through ->queue_rq. Update drivers for
the new ->queue_rq() prototype. There are no functional changes
in this patch for drivers - if they don't use the passed in list,
then they will just queue requests individually like before.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
All other allocs are done on the specific node, somehow the
cpumask for hw queue runs was missed. Fix that by using
zalloc_cpumask_var_node() in blk_mq_init_queue().
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch supports to run one single flush machinery for
each blk-mq dispatch queue, so that:
- current init_request and exit_request callbacks can
cover flush request too, then the buggy copying way of
initializing flush request's pdu can be fixed
- flushing performance gets improved in case of multi hw-queue
In fio sync write test over virtio-blk(4 hw queues, ioengine=sync,
iodepth=64, numjobs=4, bs=4K), it is observed that througput gets
increased a lot over my test environment:
- throughput: +70% in case of virtio-blk over null_blk
- throughput: +30% in case of virtio-blk over SSD image
The multi virtqueue feature isn't merged to QEMU yet, and patches for
the feature can be found in below tree:
git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4
And simply passing 'num_queues=4 vectors=5' should be enough to
enable multi queue(quad queue) feature for QEMU virtio-blk.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(),
so that this function can find the corresponding blk_flush_queue
bound with current mq context since the flush queue will become
per hw-queue.
For legacy queue, the parameter can be simply 'NULL'.
For multiqueue case, the parameter should be set as the context
from which the related request is originated. With this context
info, the hw queue and related flush queue can be found easily.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now mission of the two helpers is over, and just call
blk_alloc_flush_queue() and blk_free_flush_queue() directly.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch introduces 'struct blk_flush_queue' and puts all
flush machinery related fields into this structure, so that
- flush implementation details aren't exposed to driver
- it is easy to convert to per dispatch-queue flush machinery
This patch is basically a mechanical replacement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
These two temporary functions are introduced for holding flush
initialization and de-initialization, so that we can
introduce 'flush queue' easier in the following patch. And
once 'flush queue' and its allocation/free functions are ready,
they will be removed for sake of code readability.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is reasonable to allocate flush req in blk_mq_init_flush().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Failure of initializing one hctx isn't handled, so this patch
introduces blk_mq_init_hctx() and its pair to handle it explicitly.
Also this patch makes code cleaner.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.
Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take above ten seconds when
scsi-mq is used.
[ 0.949892] scsi host0: Virtio SCSI HBA
[ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5
[ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz
<stall>
[ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0
[ 16.194099] osd: LOADED open-osd 0.2.1
[ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.208478] sd 0:0:0:0: [sda] Write Protect is off
[ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off
[ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
This is also the reason why request_queues start in bypass mode which
is ended on blk_register_queue() as shutting down a fully functional
queue also involves a RCU grace period and the queues for non-existent
SCSI devices never reach registration.
blk-mq basically needs to do the same thing - start the mq in a
degraded mode which is faster to shut down and then make it fully
functional only after the queue reaches registration. percpu_ref
recently grew facilities to force atomic operation until explicitly
switched to percpu mode, which can be used for this purpose. This
patch makes blk-mq initialize q->mq_usage_counter in atomic mode and
switch it to percpu mode only once blk_register_queue() is reached.
Note that this issue was previously worked around by 0a30288da1
("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during
probe") for v3.17. The temp fix was reverted in preparation of adding
persistent atomic mode to percpu_ref by 9eca80461a ("Revert "blk-mq,
percpu_ref: implement a kludge for SCSI blk-mq stall during probe"").
This patch and the prerequisite percpu_ref changes will be merged
during v3.18 devel cycle.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda9 ("blk-mq: use percpu_ref for mq usage count")
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.
Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.
This patch adds @flags to percpu_ref_init() and implements the
following flags.
* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained
These flags should be able to serve the above two use cases.
v2: target_core_tpg.c conversion was missing. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
This reverts commit 0a30288da1, which
was a temporary fix for SCSI blk-mq stall issue. The following
patches will fix the issue properly by introducing atomic mode to
percpu_ref.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
blk-mq uses percpu_ref for its usage counter which tracks the number
of in-flight commands and used to synchronously drain the queue on
freeze. percpu_ref shutdown takes measureable wallclock time as it
involves a sched RCU grace period. This means that draining a blk-mq
takes measureable wallclock time. One would think that this shouldn't
matter as queue shutdown should be a rare event which takes place
asynchronously w.r.t. userland.
Unfortunately, SCSI probing involves synchronously setting up and then
tearing down a lot of request_queues back-to-back for non-existent
LUNs. This means that SCSI probing may take more than ten seconds
when scsi-mq is used.
This will be properly fixed by implementing a mechanism to keep
q->mq_usage_counter in atomic mode till genhd registration; however,
that involves rather big updates to percpu_ref which is difficult to
apply late in the devel cycle (v3.17-rc6 at the moment). As a
stop-gap measure till the proper fix can be implemented in the next
cycle, this patch introduces __percpu_ref_kill_expedited() and makes
blk_mq_freeze_queue() use it. This is heavy-handed but should work
for testing the experimental SCSI blk-mq implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
Fixes: add703fda9 ("blk-mq: use percpu_ref for mq usage count")
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header.
Signed-off-by: Jens Axboe <axboe@fb.com>
It's not uncommon for crash dump kernels to be limited to 128MB or
something low in that area. This is normally not a problem for
devices as we don't use that much memory, but for some shared SCSI
setups with huge queue depths, it can potentially fill most of
memory with tons of request allocations. blk-mq does scale back
when it fails to allocate memory, but it scales back just enough
so that blk-mq succeeds. This could still leave the system with
not enough memory to make any real progress.
Check if we are in a kdump environment and limit the hardware
queues and tag depth.
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch removes two unnecessary blk_clear_rq_complete(),
the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(),
so:
- The blk_clear_rq_complete() in blk_flush_restore_request()
needn't because the request will be freed later, and clearing
it here may open a small race window with timeout.
- The blk_clear_rq_complete() in blk_mq_requeue_request() isn't
necessary too, even though REQ_ATOM_STARTED is cleared in
__blk_mq_requeue_request(), in theory it still may cause a small
race window with timeout since the two clear_bit() may be
reordered.
Signed-off-by: Ming Lei <ming.lei@canoical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allow blk-mq to pass an argument to the timeout handler to indicate
if we're timing out a reserved or regular command. For many drivers
those need to be handled different.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Duplicate the (small) timeout handler in blk-mq so that we can pass
arguments more easily to the driver timeout handler. This enables
the next patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't do a kmalloc from timer to handle timeouts, chances are we could be
under heavy load or similar and thus just miss out on the timeouts.
Fortunately it is very easy to just iterate over all in use tags, and doing
this properly actually cleans up the blk_mq_busy_iter API as well, and
prepares us for the next patch by passing a reserved argument to the
iterator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we've changed the driver API on the submission side use the
opportunity to fix up the name on the completion side to fit into the
general scheme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.
Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pass an explicit parameter for the last request in a batch to ->queue_rq
instead of using a request flag. Besides being a cleaner and non-stateful
interface this is also required for the next patch, which fixes the blk-mq
I/O submission code to not start a time too early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When requests are retried due to hw or sw resource shortages,
we often stop the associated hardware queue. So ensure that we
restart the queues when running the requeue work, otherwise the
queue run will be a no-op.
Signed-off-by: Jens Axboe <axboe@fb.com>
__blk_mq_alloc_rq_maps() can be invoked multiple times, if we scale
back the queue depth if we are low on memory. So don't clear
set->tags when we fail, this is handled directly in
the parent function, blk_mq_alloc_tag_set().
Reported-by: Robert Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We should not insert requests into the flush state machine from
blk_mq_insert_request. All incoming flush requests come through
blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
should only be called for BLOCK_PC requests. All other callers
deal with requests that already went through the flush statemchine
and shouldn't be reinserted into it.
Reported-by: Robert Elliott <Elliott@hp.com>
Debugged-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch should fix the bug reported in
https://lkml.org/lkml/2014/9/11/249.
We have to initialize at least the atomic_flags and the cmd_flags when
allocating storage for the requests.
Otherwise blk_mq_timeout_check() might dereference uninitialized
pointers when racing with the creation of a request.
Also move the reset of cmd_flags for the initializing code to the point
where a request is freed. So we will never end up with pending flush
request indicators that might trigger dereferences of invalid pointers
in blk_mq_timeout_check().
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reported-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com>
Tested-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When we start the request, we set the deadline and flip the bits
marking the request as started and non-complete. However, it's
important that the deadline store is ordered before flipping the
bits, otherwise we could have a small window where the request is
marked started but with an invalid deadline. This can confuse the
timeout handling.
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
A bit of churn on the for-linus side that would be nice to have
in the core bits for 3.18, so pull it in to catch us up and make
forward progress easier.
Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
block/scsi_ioctl.c
If we are running in a kdump environment, resources are scarce.
For some SCSI setups with a huge set of shared tags, we run out
of memory allocating what the drivers is asking for. So implement
a scale back logic to reduce the tag depth for those cases, allowing
the driver to successfully load.
We should extend this to detect low memory situations, and implement
a sane fallback for those (1 queue, 64 tags, or something like that).
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.
This patch doesn't make any functional difference.
v2: blk-mq conversion was missing. Updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>
In blk-mq.c blk_mq_alloc_tag_set, if:
set->tags = kmalloc_node()
succeeds, but one of the blk_mq_init_rq_map() calls fails,
goto out_unwind;
needs to free set->tags so the caller is not obligated
to do so. None of the current callers (null_blk,
virtio_blk, virtio_blk, or the forthcoming scsi-mq)
do so.
set->tags needs to be set to NULL after doing so,
so other tag cleanup logic doesn't try to free
a stale pointer later. Also set it to NULL
in blk_mq_free_tag_set.
Tested with error injection on the forthcoming
scsi-mq + hpsa combination.
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The blk_get_request function may fail in low-memory conditions or during
device removal (even if __GFP_WAIT is set). To distinguish between these
errors, modify the blk_get_request call stack to return the appropriate
ERR_PTR. Verify that all callers check the return status and consider
IS_ERR instead of a simple NULL pointer check.
For consistency, make a similar change to the blk_mq_alloc_request leg
of blk_get_request. It may fail if the queue is dead, or the caller was
unwilling to wait.
Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd]
Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fixes code such as the following with scsi-mq enabled:
rq = blk_get_request(...);
blk_rq_set_block_pc(rq);
rq->cmd = my_cmd_buffer; /* separate CDB buffer */
blk_execute_rq_nowait(...);
Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for
large CDBs only). Without this patch, scsi_mq_prep_fn() will set
rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device.
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
While converting to percpu_ref for freezing, add703fda9 ("blk-mq:
use percpu_ref for mq usage count") incorrectly made
blk_mq_freeze_queue() misbehave when freezing is nested due to
percpu_ref_kill() being invoked on an already killed ref.
Fix it by making blk_mq_freeze_queue() kill and kick the queue only
for the outermost freeze attempt. All the nested ones can simply wait
for the ref to reach zero.
While at it, remove unnecessary @wake initialization from
blk_mq_unfreeze_queue().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time,
to determine whether it should merge IO or not. However, this could
also be disabled by the admin, if merging is switched off through
sysfs. So check the general queue state as well before attempting
to merge IO.
Reported-by: Rob Elliott <Elliott@hp.com>
Tested-by: Rob Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Before doing queue release, the queue has been freezed already
by blk_cleanup_queue(), so needn't to freeze queue for deleting
tag set.
This patch fixes the WARNING of "percpu_ref_kill() called more than once!"
which is triggered during unloading block driver.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight. The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.
This type of code has relatively high chance of subtle bugs which are
extremely difficult to trigger and it's way too hairy to be open coded
in blk-mq. percpu_ref can serve the same purpose after the recent
changes. This patch replaces the open-coded per-cpu usage counting
and draining mechanism with percpu_ref.
blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put. blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero. blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
anything and it's gonna be further simplified. Let's flatten it into
its caller.
This patch doesn't make any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq freezing is entangled with generic bypassing which bypasses
blkcg and io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order. This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO which may
lead to deadlock through memory allocation.
However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypssing to
freezing, which blocks request processing and drains all the in-flight
ones. This causes problems as bypassing assumes that request
processing is online. blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.
Another weirdity is that except for during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones. This shouldn't break
anything but again highlights that something isn't quite right here.
The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms. The only intersecting purpose
that they serve is during queue cleanup. Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.
* request_queue->mq_freeze_depth is added and
blk_mq_[un]freeze_queue() now operate on this counter instead of
->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
but the counter is tested directly. This will be further updated by
later changes.
* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
blk_mq_freeze_queue(). Queue cleanup path now calls
blk_mq_freeze_queue() directly.
* blk_queue_enter()'s fast path condition is simplified to simply
check @q->mq_freeze_depth. Previously, the condition was
!blk_queue_dying(q) &&
(!blk_queue_bypass(q) || !blk_queue_init_done(q))
mq_freeze_depth is incremented right after dying is set and
blk_queue_init_done() exception isn't necessary as blk-mq doesn't
start frozen, which only leaves the blk_queue_bypass() test which
can be replaced by @q->mq_freeze_depth test.
This change simplifies the code and reduces confusion in the area.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
skip queue draining if bypass_depth was already above zero. The
assumption is that the one which bumped the bypass_depth should have
performed draining already; however, there's nothing which prevents a
new instance of bypassing/freezing from starting before the previous
one finishes draining. The current code may allow the later
bypassing/freezing instances to complete while there still are
in-flight requests which haven't finished draining.
Fix it by draining regardless of bypass_depth. We still skip draining
from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot. INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip inbetween.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq uses a percpu_counter to keep track of how many usages are in
flight. The percpu_counter is drained while freezing to ensure that
no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter and
if freezing happens inbetween the caller can slip through and freezing
can be complete while there are active users.
Use smp_mb() instead so that bypass_depth and mq_usage_counter
modifications and tests are properly interlocked.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently it calls __blk_mq_run_hw_queue(), which depends on the
CPU placement being correct. This means it's not possible to call
blk_mq_start_hw_queues(q) from a context that is correct for all
queues, leading to triggering the
WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
in __blk_mq_run_hw_queue().
Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If we need to drain a queue we need to run all queues, even if they
are marked stopped to make sure the driver has a chance to error out
on all queued requests.
This fixes surprise removal with scsi-mq.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
This way will become consistent with non-mq case, also
avoid to update rq->deadline twice for mq.
The comment said: "We do this early, to ensure we are on
the right CPU.", but no percpu stuff is used in blk_add_timer(),
so it isn't necessary. Even when inserting from plug list, there
is no such guarantee at all.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The blk-mq core only initializes this if io stats are enabled, since
blk-mq only reads the field in that case. But drivers could
potentially use it internally, so ensure that we always set it to
the current time when the request is allocated.
Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It'll be used in blk_mq_start_request() to set a potential timeout
for the request, so clear it to zero at alloc time to ensure that
we know if someone has set it or not.
Fixes random early timeouts on NVMe testing.
Signed-off-by: Jens Axboe <axboe@fb.com>
If the queue is going away, don't let new allocs or queueing
happen on it. Go through the normal wait process, and exit with
ENODEV in that case.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For some scsi-mq cases, the tag map can be huge. So increase the
max number of tags we support.
Additionally, don't fail with EINVAL if a user requests too many
tags. Warn that the tag depth has been adjusted down, and store
the new value inside the tag_set passed in.
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently pass in the hardware queue, and get the tags from there.
But from scsi-mq, with a shared tag space, it's a lot more convenient
to pass in the blk_mq_tags instead as the hardware queue isn't always
directly available. So instead of having to re-map to a given
hardware queue from rq->mq_ctx, just pass in the tags structure.
Signed-off-by: Jens Axboe <axboe@fb.com>
When the code was collapsed to avoid duplication, the recent patch
for ensuring that a queue is idled before free was dropped, which was
added by commit 19c5d84f14.
Add back the blk_mq_tag_idle(), to ensure we don't leak a reference
to an active queue when it is freed.
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq_map_request() can return NULL if we fail entering the queue
(dying, or removed), in which case it has already ended IO on the
bio. So nothing more to do, except just return.
Signed-off-by: Jens Axboe <axboe@fb.com>
'struct blk_mq_ctx' is __percpu, so add the annotation
and fix the sparse warning reported from Fengguang:
[block:for-linus 2/3] block/blk-mq.h:75:16: sparse: incorrect
type in initializer (different address spaces)
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq_put_ctx() has to be called before io_schedule() in
bt_get().
This patch fixes the problem by taking similar approach from
percpu_ida allocation for the situation.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We have callers outside of the blk-mq proper (like timeouts) that
want to call __blk_mq_complete_request(), so rename the function
and put the decision code for whether to use ->softirq_done_fn
or blk_mq_endio() into __blk_mq_complete_request().
This also makes the interface more logical again.
blk_mq_complete_request() attempts to atomically mark the request
completed, and calls __blk_mq_complete_request() if successful.
__blk_mq_complete_request() then just ends the request.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 07068d5b8e added a direct-to-hw-queue mode, but this mode
needs to remember to add the request timeout handler as well.
Without it, we don't track timeouts for these requests.
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently blk-mq registers all the hardware queues in sysfs,
regardless of whether it uses them (e.g. they have CPU mappings)
or not. The unused hardware queues lack the cpux/ directories,
and the other sysfs entries (like active, pending, etc) are all
zeroes.
Change this so that sysfs correctly reflects the current mappings
of the hardware queues.
Signed-off-by: Jens Axboe <axboe@fb.com>
flush request is special, which borrows the tag from the parent
request. Hence blk_mq_tag_to_rq needs special handling to return
the flush request from the tag.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently clear a lot more than we need to, so make that a bit
more clever. Make some of the init dependent on features, like
only setting start_time if we are going to use it.
Signed-off-by: Jens Axboe <axboe@fb.com>
If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.
Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
None of the blk-mq files have an explanatory comment at the top
for what that particular file does. Add that and add appropriate
copyright notices as well.
Signed-off-by: Jens Axboe <axboe@fb.com>
We now only have one caller left and can open code it there in a cleaner
way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We already do a non-blocking allocation in blk_mq_map_request, no need
to repeat it. Just call __blk_mq_alloc_request to wait directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The current logic for blocking tag allocation is rather confusing, as we
first allocated and then free again a tag in blk_mq_wait_for_tags, just
to attempt a non-blocking allocation and then repeat if someone else
managed to grab the tag before us.
Instead change blk_mq_alloc_request_pinned to simply do a blocking tag
allocation itself and use the request we get back from it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Both callers if __blk_mq_alloc_request want to initialize the request, so
lift it into the common path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of having two almost identical copies of the same code just let
the callers pass in the reserved flag directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Both the cache flush state machine and the SCSI midlayer want to submit
requests from irq context, and the current per-request requeue_work
unfortunately causes corruption due to sharing with the csd field for
flushes. Replace them with a per-request_queue list of requests to
be requeued.
Based on an earlier test by Ming Lei.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Right now we export two ways of completing a request:
1) blk_mq_complete_request(). This uses an IPI (if needed) and
completes through q->softirq_done_fn(). It also works with
timeouts.
2) blk_mq_end_io(). This completes inline, and ignores any timeout
state of the request.
Let blk_mq_complete_request() handle non-softirq_done_fn completions
as well, by just completing inline. If a driver has enough completion
ports to place completions correctly, it need not define a
mq_ops->complete() and we can avoid an indirect function call by
doing the completion inline.
Signed-off-by: Jens Axboe <axboe@fb.com>
Drivers currently have to figure this out on their own, and they
are missing information to do it properly. The ones that did
attempt to do it, do it wrong.
So just pass in the suggested node directly to the alloc
function.
Signed-off-by: Jens Axboe <axboe@fb.com>
The percpu counter is only used for blk-mq, so move
its allocation and free inside blk-mq, and don't
allocate it for legacy queue device.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq_exit_hw_queues() and blk_mq_free_hw_queues()
are introduced to avoid code duplication.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
hctx->ctx_map should have been freed inside blk_mq_free_queue().
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Without this we can leak the active_queues reference if a command is
freed while it is considered active.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently blk-mq uses the queue timeout for all requests. But
for some commands, drivers may want to set a specific timeout
for special requests. Allow this to be passed in through
request->timeout, and use it if set.
Signed-off-by: Jens Axboe <axboe@fb.com>
We want slightly different behavior from them:
- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.
- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.
Split blk_mq_make_request() into a blk_sq_make_request() for single
queue devices, and retain blk_mq_make_request() for multi queue
devices. Then we don't need multiple checks for q->nr_hw_queues
in the request mapping.
Signed-off-by: Jens Axboe <axboe@fb.com>
Depending on the topology of the machine and the number of queues
exposed by a device, we can end up in a situation where some of
the hardware queues are unused (as in, they don't map to any
software queues). For this case, free up the memory used by the
request map, as we will not use it. This can be a substantial
amount of memory, depending on the number of queues vs CPUs and
the queue depth of the device.
Signed-off-by: Jens Axboe <axboe@fb.com>
Prepare this for the next patch which adds more smarts in the
plugging logic, so that we can save some memory.
Signed-off-by: Jens Axboe <axboe@fb.com>
In blk_mq_make_request(), do the blk_queue_nomerges() check
outside the call to blk_attempt_plug_merge() to eliminate
function call overhead when nomerges=2 (disabled)
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_queue_make_requests() overwrites our set value for q->nr_requests,
turning it into the default of 128. Set this appropriately after
initializing queue values in blk_queue_make_request().
Signed-off-by: Jens Axboe <axboe@fb.com>
For request_fn based devices, the block layer exports a 'nr_requests'
file through sysfs to allow adjusting of queue depth on the fly.
Currently this returns -EINVAL for blk-mq, since it's not wired up.
Wire this up for blk-mq, so that it now also always dynamic
adjustments of the allowed queue depth for any given block device
managed by blk-mq.
Signed-off-by: Jens Axboe <axboe@fb.com>
Each hardware queue has a bitmap of software queues with pending
requests. When new IO is queued on a software queue, the bit is
set, and when IO is pruned on a hardware queue run, the bit is
cleared. This causes a lot of traffic. Switch this from the regular
BITS_PER_LONG bitmap to a sparser layout, similarly to what was
done for blk-mq tagging.
20% performance increase was observed for single threaded IO, and
about 15% performanc increase on multiple threads driving the
same device.
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.
If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.
The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit c6d600c6 opened up a small race where we could attempt to
account IO completion on a request, racing with IO start accounting.
Fix this up by ensuring that we've accounted for IO start before
inserting the request.
Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq currently uses percpu_ida for tag allocation. But that only
works well if the ratio between tag space and number of CPUs is
sufficiently high. For most devices and systems, that is not the
case. The end result if that we either only utilize the tag space
partially, or we end up attempting to fully exhaust it and run
into lots of lock contention with stealing between CPUs. This is
not optimal.
This new tagging scheme is a hybrid bitmap allocator. It uses
two tricks to both be SMP friendly and allow full exhaustion
of the space:
1) We cache the last allocated (or freed) tag on a per blk-mq
software context basis. This allows us to limit the space
we have to search. The key element here is not caching it
in the shared tag structure, otherwise we end up dirtying
more shared cache lines on each allocate/free operation.
2) The tag space is split into cache line sized groups, and
each context will start off randomly in that space. Even up
to full utilization of the space, this divides the tag users
efficiently into cache line groups, avoiding dirtying the same
one both between allocators and between allocator and freeer.
This scheme shows drastically better behaviour, both on small
tag spaces but on large ones as well. It has been tested extensively
to show better performance for all the cases blk-mq cares about.
Signed-off-by: Jens Axboe <axboe@fb.com>
This allows us to avoid a non-atomic memset over ->atomic_flags as well
as killing lots of duplicate initializations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Right now we just pick the first CPU in the mask, but that can
easily overload that one. Add some basic batching and round-robin
all the entries in the mask instead.
Signed-off-by: Jens Axboe <axboe@fb.com>
We already issue a blktrace requeue event in
__blk_mq_requeue_request(), don't do it from the original caller
as well.
Signed-off-by: Jens Axboe <axboe@fb.com>
Refactor the logic around adding a new bio to a software queue,
so we nest the ctx->lock where we really need it (merge and
insertion) and don't hold it when we don't (init and IO start
accounting).
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_mq_wait_for_tags() is only able to wait for "normal" tags,
not reserved tags. Pass in which one we should attempt to get
a tag for, so that waiting for reserved tags will work.
Reserved tags are used for internal commands, which are usually
serialized. Hence no waiting generally takes place, but we should
ensure that it actually works if users need that functionality.
Signed-off-by: Jens Axboe <axboe@fb.com>
The blk-mq code is using it's own version of the I/O completion affinity
tunables, which causes a few issues:
- the rq_affinity sysfs file doesn't work for blk-mq devices, even if it
still is present, thus breaking existing tuning setups.
- the rq_affinity = 1 mode, which is the defauly for legacy request based
drivers isn't implemented at all.
- blk-mq drivers don't implement any completion affinity with the default
flag settings.
This patches removes the blk-mq ipi_redirect flag and sysfs file, as well
as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that
respects the queue-wide rq_affinity flags and also implements the
rq_affinity = 1 mode.
This means I/O completion affinity can now only be tuned block-queue wide
instead of per context, which seems more sensible to me anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>