Commit Graph

74 Commits

Author SHA1 Message Date
Omar Sandoval 54d5329d42 blk-mq-sched: fix crash in switch error path
In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:48 -06:00
Omar Sandoval 93252632e8 blk-mq-sched: set up scheduler tags when bringing up new queues
If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:46 -06:00
Omar Sandoval 6917ff0b5b blk-mq-sched: refactor scheduler initialization
Preparation cleanup for the next couple of fixes, push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:44 -06:00
Omar Sandoval 81380ca107 blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
   request for in blk_mq_get_driver_tag() regardless of whether it
   succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
   blk_mq_hw_queue to make it clear that it must handle multiple
   hardware queues, since I've already messed this up on a couple of
   occasions.

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 08:56:26 -06:00
Omar Sandoval 562bef4259 blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
No functional difference, it just makes a little more sense to update
the tag map where we actually allocate the tag.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
2017-03-02 08:56:04 -07:00
Omar Sandoval 6d2809d51a blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
blk_mq_alloc_request_hctx() allocates a driver request directly, unlike
its blk_mq_alloc_request() counterpart. It also crashes because it
doesn't update the tags->rqs map.

Fix it by making it allocate a scheduler request.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
2017-03-02 08:56:04 -07:00
Sagi Grimberg 415b806de5 blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Modified by me to also check at driver tag allocation time if the
original request was reserved, so we can be sure to allocate a
properly reserved tag at that point in time, too.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:04 -07:00
Omar Sandoval d38d351555 blk-mq-sched: separate mark hctx and queue restart operations
In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart()
after we dispatch requests left over on our hardware queue dispatch
list. This is so we'll go back and dispatch requests from the scheduler.
In this case, it's only necessary to restart the hardware queue that we
are running; there's no reason to run other hardware queues just because
we are using shared tags.

So, split out blk_mq_sched_mark_restart() into two operations, one for
just the hardware queue and one for the whole request queue. The core
code only needs the hctx variant, but I/O schedulers will want to use
both.

This also requires adjusting blk_mq_sched_restart_queues() to always
check the queue restart flag, not just when using shared tags.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-23 11:55:47 -07:00
Jens Axboe b86dd815ff block: get rid of blk-mq default scheduler choice Kconfig entries
The wording in the entries were poor and not understandable
by even deities. Kill the selection for default block scheduler,
and impose a policy with sane defaults.

Architected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:19:45 -07:00
Jens Axboe 64765a75ef blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
Usually we don't ask the scheduler for work, if we already have
leftovers on the dispatch list. This is done to leave work on
the scheduler side for as long as possible, for proper merging.
But if we do have work leftover but didn't dispatch anything,
then we should ask the scheduler since we could potentially
issue requests from that.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-02-17 12:35:47 -07:00
Jens Axboe c7a571b450 blk-mq-sched: don't add flushes to the head of requeue queue
If we are currently out of driver tags, we don't want to add a
new flush (without a tag) to the head of the requeue list. We
want to add it to the back, behind the others that are
potentially also waiting for a tag.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-02-17 12:35:47 -07:00
Paolo Valente f1ba82616c blk-mq: pass bio to blk_mq_sched_get_rq_priv
bio is used in bfq-mq's get_rq_priv, to get the request group. We could
pass directly the group here, but I thought that passing the bio was
more general, giving the possibility to get other pieces of information
if needed.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-10 09:09:59 -07:00
Christoph Hellwig 34fe7c0540 block: enumify ELEVATOR_*_MERGE
Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-08 13:43:06 -07:00
Jens Axboe e4d750c977 block: free merged request in the caller
If we end up doing a request-to-request merge when we have completed
a bio-to-request merge, we free the request from deep down in that
path. For blk-mq-sched, the merge path has to hold the appropriate
lock, but we don't need it for freeing the request. And in fact
holding the lock is problematic, since we are now calling the
mq sched put_rq_private() hook with the lock held. Other call paths
do not hold this lock.

Fix this inconsistency by ensuring that the caller frees a merged
request. Then we can do it outside of the lock, making it both more
efficient and fixing the blk-mq-sched problem of invoking parts of
the scheduler with an unknown lock state.

Reported-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-02-03 09:48:28 -07:00
Omar Sandoval 0cacba6cf8 blk-mq-sched: bypass the scheduler for flushes entirely
There's a weird inconsistency that flushes are mostly hidden from the
scheduler, but it needs to be aware of them in ->insert_requests().
Instead of having every scheduler call blk_mq_sched_bypass_insert(),
let's do it in the common framework.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02 16:57:56 -07:00
Jens Axboe f3a8ab7d55 block: cleanup remaining manual checks for PREFLUSH|FUA
Use op_is_flush() where applicable.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 09:08:23 -07:00
Jens Axboe bd6737f1ae blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()
Instead of letting the caller check this and handle the details
of inserting a flush request, put the logic in the scheduler
insertion function. This fixes direct flush insertion outside
of the usual make_request_fn calls, like from dm via
blk_insert_cloned_request().

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 09:03:14 -07:00
Christoph Hellwig f73f44eb00 block: add a op_is_flush helper
This centralizes the checks for bios that needs to be go into the flush
state machine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-27 09:01:45 -07:00
Jens Axboe c13660a08c blk-mq-sched: change ->dispatch_requests() to ->dispatch_request()
When we invoke dispatch_requests(), the scheduler empties everything
into the passed in list. This isn't always a good thing, since it
means that we remove items that we could have potentially merged
with.

Change the function to dispatch single requests at the time. If
we do that, we can backoff exactly at the point where the device
can't consume more IO, and leave the rest with the scheduler for
better merging and future dispatch decision making.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27 08:20:35 -07:00
Jens Axboe 50e1dab86a blk-mq-sched: fix starvation for multiple hardware queues and shared tags
If we have both multiple hardware queues and shared tag map between
devices, we need to ensure that we propagate the hardware queue
restart bit higher up. This is because we can get into a situation
where we don't have any IO pending on a hardware queue, yet we fail
getting a tag to start new IO. If that happens, it's not enough to
mark the hardware queue as needing a restart, we need to bubble
that up to the higher level queue as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Tested-by: Hannes Reinecke <hare@suse.com>
2017-01-27 08:20:34 -07:00
Jens Axboe b48fda0976 blk-mq-sched: check for successful allocation before assigning tag
We don't trigger this from the normal IO path, since we always use
blocking allocations from there. But Bart saw it testing multipath
dm, since that is a heavy user of atomic request allocations in
the map and clone path.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-26 14:52:20 -07:00
Jens Axboe 5a797e00dc blk-mq: don't lose flags passed in to blk_mq_alloc_request()
If we come in from blk_mq_alloc_requst() with NOWAIT set in flags,
we must ensure that we don't later overwrite that in
blk_mq_sched_get_request(). Initialize alloc_data->flags before
passing it in.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-26 12:22:11 -07:00
Jens Axboe d348499138 blk-mq-sched: allow setting of default IO scheduler
Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't use for normal IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:31 -07:00
Jens Axboe bd166ef183 blk-mq-sched: add framework for MQ capable IO schedulers
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.

We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interfce, and with a separate set of ops hooks for MQ
devices.

We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-01-17 10:04:20 -07:00