Commit Graph

2966 Commits

Author SHA1 Message Date
Linus Torvalds a829a8445f SCSI misc on 20161213
This update includes the usual round of major driver updates (ncr5380,
 lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas).  There's also
 an assortment of minor fixes, mostly in error legs or other not very
 user visible stuff.  The major change is the pci_alloc_irq_vectors
 replacement for the old pci_msix_.. calls; this effectively makes IRQ
 mapping generic for the drivers and allows blk_mq to use the
 information.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYUJOvAAoJEAVr7HOZEZN42B0P/1lj1W2N7y0LOAbR2MasyQvT
 fMD/SSip/v+R/zJiTv+5M/IDQT6ez62JnQGWyX3HZTob9VEfoqagbUuHH6y+fmib
 tHyqiYc9FC/NZSRX/0ib+rpnSVhC/YRSVV7RrAqilbpOKAzeU25FlN/vbz+Nv/XL
 ReVBl+2nGjJtHyWqUN45Zuf74c0zXOWPPUD0hRaNclK5CfZv5wDYupzHzTNSQTkj
 nWvwPYT0OgSMNe7mR+IDFyOe3UQ/OYyeJB0yBNqO63IiaUabT8/hgrWR9qJAvWh8
 LiH+iSQ69+sDUnvWvFjuls/GzcFuuTljwJbm+FyTsmNHONPVY8JRCLNq7CNDJ6Vx
 HwpNuJdTSJpne4lAVBGPwgjs+GhlMvUP/xYVLWAXdaBxU9XGePrwqQDcFu1Rbx3P
 yfMiVaY1+e45OEjLRCbDAwFnMPevC3kyymIvSsTySJxhTbYrOsyrrWt5kwWsvE3r
 SKANsub+xUnpCkyg57nXRQStJSCiSfGIDsydKmMX+pf1SR4k6gCUQZlcchUX0uOa
 dcY6re0c7EJIQQiT7qeGP5TRBblxARocCA/Igx6b5U5HmuQ48tDFlMCps7/TE84V
 JBsBnmkXcEi/ALShL/Tui+3YKA1DfOtEnwHtXx/9Ecx/nxP2Sjr9LJwCKiONv8NY
 RgLpGfccrix34lQumOk5
 =sPXh
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This update includes the usual round of major driver updates (ncr5380,
  lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas).

  There's also an assortment of minor fixes, mostly in error legs or
  other not very user visible stuff. The major change is the
  pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this
  effectively makes IRQ mapping generic for the drivers and allows
  blk_mq to use the information"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits)
  scsi: qla4xxx: switch to pci_alloc_irq_vectors
  scsi: hisi_sas: support deferred probe for v2 hw
  scsi: megaraid_sas: switch to pci_alloc_irq_vectors
  scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices
  scsi: be2iscsi: set errno on error path
  scsi: be2iscsi: set errno on error path
  scsi: hpsa: fallback to use legacy REPORT PHYS command
  scsi: scsi_dh_alua: Fix RCU annotations
  scsi: hpsa: use %phN for short hex dumps
  scsi: hisi_sas: fix free'ing in probe and remove
  scsi: isci: switch to pci_alloc_irq_vectors
  scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI
  scsi: dpt_i2o: double free on error path
  scsi: cxlflash: Migrate scsi command pointer to AFU command
  scsi: cxlflash: Migrate IOARRIN specific routines to function pointers
  scsi: cxlflash: Cleanup queuecommand()
  scsi: cxlflash: Cleanup send_tmf()
  scsi: cxlflash: Remove AFU command lock
  scsi: cxlflash: Wait for active AFU commands to timeout upon tear down
  scsi: cxlflash: Remove private command pool
  ...
2016-12-14 10:49:33 -08:00
Linus Torvalds b92e09bb5b Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:

 - Adam added opt-in ATA command priority support.

 - There are machines which hide multiple nvme devices behind an ahci
   BAR. Dan Williams proposed a solution to force-switch the mode but
   deemed too hackishd. People are gonna discuss the proper way to
   handle the situation in nvme standard meetings. For now, detect and
   warn about the situation.

 - Low level driver specific changes.

Christoph Hellwig pipes in about the hidden nvme warning:
 "I wish that was the case. We've pretty much agreed that we'll want to
  implement it as a virtual PCIe root bridge, similar to Intels other
  'innovation' VMD that we work around that way.

  But Intel management has apparently decided that they don't want to
  spend more cycles on this now that Lenovo has an optional BIOS that
  doesn't force this broken mode anymore, and no one outside of Intel
  has enough information to implement something like this.

  So for now I guess this warning is it, until Intel reconsideres and
  spends resources on fixing up the damage their Chipset people caused"

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ahci: warn about remapped NVMe devices
  ahci-remap.h: add ahci remapping definitions
  nvme: move NVMe class code to pci_ids.h
  pata: imx: support controller modes up to PIO4
  pata: imx: add support of setting timings for PIO modes
  pata: imx: set controller PIO mode with .set_piomode callback
  pata: imx: sort headers out
  ata: set ncq_prio_enabled iff device has support
  ata: ATA Command Priority Disabled By Default
  ata: Enabling ATA Command Priorities
  block: Add iocontext priority to request
  ahci: qoriq: added ls1046a platform support
2016-12-13 13:26:24 -08:00
Linus Torvalds 36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Jens Axboe 9491ae4aad mm: don't cap request size based on read-ahead setting
We ran into a funky issue, where someone doing 256K buffered reads saw
128K requests at the device level.  Turns out it is read-ahead capping
the request size, since we use 128K as the default setting.  This
doesn't make a lot of sense - if someone is issuing 256K reads, they
should see 256K reads, regardless of the read-ahead setting, if the
underlying device can support a 256K read in a single command.

This patch introduces a bdi hint, io_pages.  This is the soft max IO
size for the lower level, I've hooked it up to the bdev settings here.
Read-ahead is modified to issue the maximum of the user request size,
and the read-ahead max size, but capped to the max request size on the
device side.  The latter is done to avoid reading ahead too much, if the
application asks for a huge read.  With this patch, the kernel behaves
like the application expects.

Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Jens Axboe 7cd54aa843 blk-stat: fix a few cases of missing batch flushing
Everytime we need to read ->nr_samples, we should have flushed
the batch first. The non-mq read path also needs to flush the
batch.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-09 13:08:35 -07:00
Jens Axboe c8e52ba5e2 blk-flush: run the queue when inserting blk-mq flush
Currently we pass in to run the queue async, but don't flag the
queue to be run. We don't need to run it async here, but we should
run it. So fixup the parameters.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-09 09:03:02 -07:00
Jens Axboe 70b3ea056f elevator: make the rqhash helpers exported
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-09 09:03:02 -07:00
Jens Axboe f04c3df3ef blk-mq: abstract out blk_mq_dispatch_rq_list() helper
Takes a list of requests, and dispatches it. Moves any residual
requests to the dispatch list.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-09 09:03:02 -07:00
Jens Axboe ae911c5e79 blk-mq: add blk_mq_start_stopped_hw_queue()
We have a variant for all hardware queues, but not one for a single
hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2016-12-09 09:03:02 -07:00
Christoph Hellwig f9d03f96b9 block: improve handling of the magic discard payload
Instead of allocating a single unused biovec for discard requests, send
them down without any payload.  Instead we allow the driver to add a
"special" payload using a biovec embedded into struct request (unioned
over other fields never used while in the driver), and overloading
the number of segments for this case.

This has a couple of advantages:

 - we don't have to allocate the bio_vec
 - the amount of special casing for discard requests in the block
   layer is significantly reduced
 - using this same scheme for other request types is trivial,
   which will be important for implementing the new WRITE_ZEROES
   op on devices where it actually requires a payload (e.g. SCSI)
 - we can get rid of playing games with the request length, as
   we'll never touch it and completions will work just fine
 - it will allow us to support ranged discard operations in the
   future by merging non-contiguous discard bios into a single
   request
 - last but not least it removes a lot of code

This patch is the common base for my WIP series for ranges discards and to
remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
so it would be good to get it in quickly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-09 08:30:51 -07:00
Christoph Hellwig be07e14f96 blk-wbt: don't throttle discard or write zeroes
Both of these are metadata only commands that are not issued by the
writeback code and not directly relevant to the writeback bandwith.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-09 08:29:35 -07:00
Linus Torvalds a0ac402cfc Don't feed anything but regular iovec's to blk_rq_map_user_iov
In theory we could map other things, but there's a reason that function
is called "user_iov".  Using anything else (like splice can do) just
confuses it.

Reported-and-tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-07 08:23:35 -08:00
Jens Axboe 6e85eaf307 blk-mq: blk_account_io_start() takes a bool
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2016-12-05 12:14:28 -07:00
Nicolai Stange 58886785db block: fix unintended fallthrough in generic_make_request_checks()
Since commit e73c23ff73 ("block: add async variant of
blkdev_issue_zeroout") messages like the following show up:

  EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at
                  logical offset 0 with max blocks 1 with error 95
  EXT4-fs (dm-1): This should not happen!! Data will be lost

Due to the following fallthrough introduced with
commit 2d253440b5 ("block: Define zoned block device operations"),
generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only
if the block device supports "write same" *and* is a zoned one:

  switch (bio_op(bio)) {
  [...]
  case REQ_OP_WRITE_SAME:
        if (!bdev_write_same(bio->bi_bdev))
                goto not_supported;
  case REQ_OP_ZONE_REPORT:
  case REQ_OP_ZONE_RESET:
                if (!bdev_is_zoned(bio->bi_bdev))
                        goto not_supported;
                break;
  [...]
  }

Thus, although the bio setup as done by __blkdev_issue_write_same() from
commit e73c23ff73 ("block: add async variant of blkdev_issue_zeroout")
would succeed, its actual submission would not, resulting in the
EOPNOTSUPP == 95.

Fix this by removing the fallthrough which, due to the lack of an explicit
comment, seems to be unintended anyway.

Fixes: e73c23ff73 ("block: add async variant of blkdev_issue_zeroout")
Fixes: 2d253440b5 ("block: Define zoned block device operations")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-05 07:54:39 -07:00
Shaohua Li 209200efa3 blk-stat: fix a typo
Signed-off-by: Shaohua Li <shli@fb.com>
Fixes: cf43e6be86 ("block: add scalable completion tracking of requests")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-02 20:17:43 -07:00
Ritesh Harjani e0c7230009 block: factor out req_set_nomerge
Factor out common code for setting REQ_NOMERGE flag which is being used
out at certain places and make it a helper instead, req_set_nomerge().

Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>

Get rid of the inline.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 08:36:16 -07:00
Chaitanya Kulkarni a6f0788ec2 block: add support for REQ_OP_WRITE_ZEROES
This adds a new block layer operation to zero out a range of
LBAs. This allows to implement zeroing for devices that don't use
either discard with a predictable zero pattern or WRITE SAME of zeroes.
The prominent example of that is NVMe with the Write Zeroes command,
but in the future, this should also help with improving the way
zeroing discards work. For this operation, suitable entry is exported in
sysfs which indicate the number of maximum bytes allowed in one
write zeroes operation by the device.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 07:58:40 -07:00
Chaitanya Kulkarni e73c23ff73 block: add async variant of blkdev_issue_zeroout
Similar to __blkdev_issue_discard this variant allows submitting
the final bio asynchronously and chaining multiple ranges
into a single completion.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 07:58:40 -07:00
Damien Le Moal b02d8aaea1 block: Check partition alignment on zoned block devices
Both blkdev_report_zones and blkdev_reset_zones can operate on a partition of
a zoned block device. However, the first and last zones reported for a
partition make sense only if the partition start sector and size are aligned
on the device zone size. The same applies for zone reset. Resetting the first
or the last zone of a partition straddling zones may impact neighboring
partitions. Finally, if a partition start sector is not at the beginning of a
sequential zone, it will be impossible to write to the first sectors of the
partition on a host-managed device.
Avoid all these problems and incoherencies by ignoring partitions that are not
zone aligned.

Note: Even with CONFIG_BLK_DEV_ZONED disabled, bdev_is_zoned() will report the
correct disk zoning type (host-aware, host-managed or none) but
bdev_zone_size() will always return 0 for zoned block devices (i.e. the zone
size is unknown). So test this as a way to ensure that a zoned block device is
being handled as such. As a result, for a host-aware devices, unaligned zone
partitions will be accepted with CONFIG_BLK_DEV_ZONED disabled. That is, the
disk will be treated as a regular block device (as it should). If zoned block
device support is enabled, only aligned partitions will be accepted.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 07:56:53 -07:00
Gabriel Krisman Bertazi 415d3dab96 blk-mq: Drop explicit timeout sync in hotplug
After commit 287922eb0b ("block: defer timeouts to a workqueue"),
deleting the timeout work after freezing the queue shouldn't be
necessary, since the synchronization is already enforced by the
acquisition of a q_usage_counter reference in blk_mq_timeout_work.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 08:01:08 -07:00
Jens Axboe d62118b6dd blk-wbt: allow wbt to be enabled always through sysfs
Currently there's no way to enable wbt if it's not enabled in the
kernel config by default for a device. Allow a write to the
'wbt_lat_usec' queue sysfs file to enable wbt.

This is useful for both the kernel config case, but also if the
device is CFQ managed and it was turned off by default.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-28 10:27:03 -07:00
Jens Axboe fa224eed2b blk-wbt: cleanup disable-by-default for CFQ
Make it clear that we are disabling wbt for the specified queued,
if it was enabled by default. This is in preparation for allowing
users to re-enable wbt, and not have it disabled automatically
again.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-28 10:27:03 -07:00
Jens Axboe 80e091d10e blk-wbt: allow reset of default latency through sysfs
Allow a write of '-1' to reset the default latency target for
a given device. This removes knowledge of the different default
settings for rotational vs non-rotational from user space.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-28 10:27:03 -07:00
Tejun Heo e00f4f4d0f block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg
blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
when that fails falls back to operations which aren't specific to the
cgroup.  Occassional failures are expected under pressure and falling
back to non-cgroup operation is the right thing to do.

Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
these expected failures end up creating a lot of noise.  Add
__GFP_NOWARN.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marc MERLIN <marc@merlins.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:59:49 -07:00
Ming Lei 3a83f46775 block: bio: pass bvec table to bio_init()
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.

After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Fixed up the new O_DIRECT cases.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-22 08:57:21 -07:00
Shaun Tancheff 778889d841 block: apply blk_partition_remap to REQ_OP_ZONE_RESET
If a ZBC device is partitioned and operations are performed on the partition
the zone information is rebased to the partition, however the zone reset
is not mapped from the partition to device as are other operations.

This causes the API (report zones / reset zone) to be unbalanced in this
regard. Checking for the zone reset op code explicitly will balance the
API.

Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-21 15:08:24 -07:00
Johannes Thumshirn a0f4bd7f2a scsi: fc: move FC transport's bsg code to bsg-lib
Now that all conversions are done, move the FibreChannel bsg code over
to the bsg library.

This patch is derived from work done by Mike Christie in 2011 [1] but
only the iscsi parts got merged back then.

[1] http://marc.info/?l=linux-scsi&m=131149780921009&w=2

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-11-17 20:15:26 -05:00
Johannes Thumshirn fb6f7c8d8a block: add bsg_job_put() and bsg_job_get()
Add bsg_job_put() and bsg_job_get() so don't need to export
bsg_destroy_job() any more.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-11-17 20:15:26 -05:00
Johannes Thumshirn 6aa858cd33 scsi: fc: use bsg_softirq_done
bsg_softirq_done() and fc_bsg_softirq_done() are copies of each other, so
ditch the fc specific one.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-11-17 20:15:26 -05:00
Johannes Thumshirn c00da4c90f scsi: fc: Use bsg_destroy_job
fc_destroy_bsgjob() and bsg_destroy_job() are now 1:1 copies, so use the
latter. As bsg_destroy_job() comes from bsg-lib we need to select it in
Kconfig once CONFOG_SCSI_FC_ATTRS is active.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-11-17 20:15:25 -05:00
Johannes Thumshirn bf0f2d380f block: add reference counting for struct bsg_job
Add reference counting to 'struct bsg_job' so we can implement a reuqest
timeout handler for bsg_jobs, which is needed for Fibre Channel.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-11-17 20:15:25 -05:00
Jens Axboe 64f1c21e86 blk-mq: make the polling code adaptive
The previous commit introduced the hybrid sleep/poll mode. Take
that one step further, and use the completion latencies to
automatically sleep for half the mean completion time. This is
a good approximation.

This changes the 'io_poll_delay' sysfs file a bit to expose the
various options. Depending on the value, the polling code will
behave differently:

-1	Never enter hybrid sleep mode
 0	Use half of the completion mean for the sleep delay
>0	Use this specific value as the sleep delay

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:57 -07:00
Jens Axboe 06426adf07 blk-mq: implement hybrid poll mode for sync O_DIRECT
This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>
Tested-By: Stephen Bates <sbates@raithlin.com>
Reviewed-By: Stephen Bates <sbates@raithlin.com>
2016-11-17 13:34:51 -07:00
Arnd Bergmann 4121d385f1 blk-wbt: fix old-style function declaration
The newly added driver causes a harmless warning in some configurations:

block/blk-wbt.c:250:1: error: ‘inline’ is not at beginning of declaration [-Werror=old-style-declaration]
 static bool inline stat_sample_valid(struct blk_rq_stat *stat)

This makes it use the expected format for the declaration.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 08:32:40 -07:00
Ming Lei 0a6219a95f block: deal with stale req count of plug list
In both legacy and mq path, req count of plug list is computed
before allocating request, so the number can be stale when falling
back to slept allocation, also the new introduced wbt can sleep
too.

This patch deals with the case by checking if plug list becomes
empty, and fixes the KASAN report of 'BUG: KASAN: stack-out-of-bounds'
which is introduced by Shaohua's patches of dispatching big request.

Fixes: 600271d900002(blk-mq: immediately dispatch big size request)
Fixes: 50d24c34403c6(block: immediately dispatch big size request)
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-16 08:09:51 -07:00
Bart Van Assche dbb3ab0356 bsg: Add sparse annotations to bsg_request_fn()
Avoid that sparse complains about unbalanced lock actions.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-14 09:57:03 -07:00
Jens Axboe 382cf633ed blk-wbt: use BLK_STAT_{READ,WRITE} instead of 0/1
Since we have proper enums for the stats directions, use them.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 16:18:29 -07:00
Jens Axboe 8054b89f8f blk-wbt: remove stat ops
Again a leftover from when the throttling code was generic. Now that we
just have the block user, get rid of the stat ops and indirections.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 16:18:24 -07:00
Jens Axboe d8a0cbfd73 blk-wbt: store queue instead of bdi
The bdi was a leftover from when the code was block layer agnostic.
Now that we just support a block layer user, store the queue directly.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 16:18:18 -07:00
Jens Axboe bbd7bb7017 block: move poll code to blk-mq
The poll code is blk-mq specific, let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-11-11 13:40:25 -07:00
Jens Axboe 066a4a73ce blk-mq: blk_mq_try_issue_directly() should lookup hardware queue
A previous commit changed this to pass in the hardware queue, but
it was using the wrong hardware queue. Hence a request that was
allocated on one hardware queue ended up being issued on another
one, and that caused IO timeouts and oopses on some drivers. Since
the request holds hardware queue private resources, like a tag,
we can't just issue it on a different hardware queue.

Fixes: 2253efc850 ("blk-mq: Move more code into blk_mq_direct_issue_request()")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 12:24:46 -07:00
Jens Axboe 87760e5eef block: hook up writeback throttling
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:40 -07:00
Jens Axboe e34cbd3074 blk-wbt: add general throttling mechanism
We can hook this up to the block layer, to help throttle buffered
writes.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
               wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:32 -07:00
Jens Axboe cf43e6be86 block: add scalable completion tracking of requests
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 13:53:26 -07:00
Tejun Heo ebc4ff661f block: cfq_cpd_alloc() should use @gfp
cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
incorrectly hard coding GFP_KERNEL instead of using the mask specified
through the @gfp parameter.  This currently doesn't cause any actual
issues because all current callers specify GFP_KERNEL.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e4a9bde958 ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 10:10:04 -07:00
Jens Axboe ae5b2ec8ad block: set REQ_SYNC if we clear REQ_FUA|REQ_PREFLUSH
If we insert a flush request, we clear REQ_PREFLUSH and/or REQ_FUA,
depending on flush settings. Since op_is_sync() factors those flags
in for deciding whether this request is sync or not, we should
set REQ_SYNC to avoid screwing up this accounting.

This should be less fragile.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Fixes: b685d3d65a ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-08 19:39:28 -07:00
Christoph Hellwig 9e5a7e2295 blk-mq: export blk_mq_map_queues
This will allow SCSI to have a single blk_mq_ops structure that either
lets the LLDD map the queues to PCIe MSIx vectors or use the default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2016-11-08 17:30:00 -05:00
Gabriel Krisman Bertazi c02ebfdddb blk-mq: Always schedule hctx->next_cpu
Commit 0e87e58bf6 ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead.  Problem is, in the
last batch execution before round robin, blk_mq_hctx_next_cpu can
schedule a dead CPU and also update next_cpu to the next alive CPU in
the mask, which will trigger the WARN_ON despite the previous
workaround.

The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu.  This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.

Fixes: 0e87e58bf6 ("blk-mq: improve warning for running a queue on the wrong CPU")
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-06 14:14:41 -07:00
Jens Axboe d278d4a889 block: add code to track actual device queue depth
For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.

Fill a hole in struct request with a 'queue_depth' member, that
drivers can call to more closely inform the block layer of the
real queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-05 17:09:53 -06:00
Shaohua Li 600271d900 blk-mq: immediately dispatch big size request
This is corresponding part for blk-mq. Disk with multiple hardware
queues doesn't need this as we only hold 1 request at most.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-03 22:00:38 -06:00