linux

Commit Graph

Author	SHA1	Message	Date
Yufen Yu	510a405d94	nvme: fix memory leak for power latency tolerance Unconditionally hide device pm latency tolerance when uninitializing the controller to ensure all qos resources are released so that we're not leaking this memory. This is safe to call if none were allocated in the first place, or were previously freed. Fixes: c5552fde102fc("nvme: Enable autonomous power state transitions") Suggested-by: Keith Busch <keith.busch@intel.com> Tested-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> [changelog] Signed-off-by: Keith Busch <keith.busch@intel.com>	2019-05-17 11:08:09 -06:00
Christoph Hellwig	5fb4aac756	nvme: release namespace SRCU protection before performing controller ioctls Holding the SRCU critical section protecting the namespace list can cause deadlocks when using the per-namespace admin passthrough ioctl to delete as namespace. Release it earlier when performing per-controller ioctls to avoid that. Reported-by: Kenneth Heitke <kenneth.heitke@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-17 11:07:11 -06:00
Christoph Hellwig	90ec611adc	nvme: merge nvme_ns_ioctl into nvme_ioctl Merge the two functions to make future changes a little easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-05-17 11:07:10 -06:00
Christoph Hellwig	3f98bcc58c	nvme: remove the ifdef around nvme_nvm_ioctl We already have a proper stub if lightnvm is not enabled, so don't bother with the ifdef. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-05-17 11:07:08 -06:00
Christoph Hellwig	100c815cbd	nvme: fix srcu locking on error return in nvme_get_ns_from_disk If we can't get a namespace don't leak the SRCU lock. nvme_ioctl was working around this, but nvme_pr_command wasn't handling this properly. Just do what callers would usually expect. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-05-17 11:06:59 -06:00
Keith Busch	6fa0321a96	nvme: Fix known effects We're trying to append known effects to the ones reported in the controller's log. The original patch accomplished this, but something went wrong when patch was merged causing the effects log to override the known effects. Link: http://lists.infradead.org/pipermail/linux-nvme/2019-May/023710.html Fixes: `f4524cc456` ("nvme-pci: add known admin effects to augument admin effects log page") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com>	2019-05-17 11:05:35 -06:00
Keith Busch	d6135c3a1e	nvme-pci: Sync queues on reset A controller with multiple namespaces may have multiple request_queues with their own timeout work. If a controller fails with IO outstanding to diffent namespaces, each request queue may attempt to handle it, so ensure there is no previously scheduled timeout work executing prior to starting controller initialization by synchronizing with each queue. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>	2019-05-17 11:04:34 -06:00
Keith Busch	2036f7263d	nvme-pci: Unblock reset_work on IO failure The reset_work waits for queued IO to complete before setting the controller to live. If any of these times out and requeues, we won't be able to restart the controller because the reset_work is already running. Flush all entered requests to a failed completion if a timeout occurs in the connecting state, and ensure the controller can't transition to the live state after we've unblocked it from waiting for completions. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>	2019-05-17 11:04:04 -06:00
Keith Busch	39a9dd81f8	nvme-pci: Don't disable on timeout in reset state The reset state doesn't dispatch commands that it needs to wait for anymore. If a timeout occurs in this state, the reset work is already disabling the controller, so just reset the request's timer. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>	2019-05-17 11:03:00 -06:00
Keith Busch	e43269e6e5	nvme-pci: Fix controller freeze wait disabling If a controller disabling didn't start a freeze, don't wait for the operation to complete. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>	2019-05-17 11:01:02 -06:00
Linus Torvalds	1718de78e6	for-5.2/block-post-20190516 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzd7PYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpggWD/46Hmn6FuiXQ30HTJd9WKtJzenAAIdUpjq8 +U985q7vvcqIUotMcG9VUOlCaxk79D5XbptInzLo5CRSn9vMv0sXmAHIFkoj201K gW3sHqajnWFFj60Eq5IVdHBZekvD8+bBZMvnX+S53QHOfwY+D1Nx/CtjkxNeq+48 98kMA/Q1d87Ied6oMW6Nyc7UEN3SanTnntYRIeSrXOJPiwxVWT6SsPUC01VZcwrt NSt6IVoW2vFgU0sg8VetzCSfJyTzI0YytjTj/WKGQzuBiKFAvChWrrYZiZ/Z4587 6W4SFR94nYkW5U1BKgrMp64KUEn20m+jk0IHRYApsFwutSBHJCeB9m2sddxur/GQ G/IyXZxv5jKFNBhUEiSedfml9OF+nBbwJGJCKF64Wnybk/gqFgxM1gzyw4fMAXr+ qYQdETv02W0rDqUG9i3/CaXlN4Lf1IvLR8al4ao0LfDJ0TSXw+UviNsuHEHAv8ey sioREF8JacSj1q42TsRGckn3k4HVmaGyFwI3ceLT5bRq8VAhJ+cp7WqML1lUEmY0 2iIz+PKPDSyigqrh1wvo8ZqhqHifo+0TbRkCOCi5j+PRX6GiYlrvShGevZXEZPqC lOFNDgCH3VBTvrcx3j05jJK1qvL4QWAwb/rDUsHZVbsnSVTEHxs/3BsIFQNZpE9/ AoXCH/ye0Q== =ZKv1 -----END PGP SIGNATURE----- Merge tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "This is mainly some late lightnvm changes that came in just before the merge window, as well as fixes that have been queued up since the initial pull request was frozen. This contains: - lightnvm changes, fixing race conditions, improving memory utilization, and improving pblk compatability (Chansol, Igor, Marcin) - NVMe pull request with minor fixes all over the map (via Christoph) - remove redundant error print in sata_rcar (Geert) - struct_size() cleanup (Jackie) - dasd CONFIG_LBADF warning fix (Ming) - brd cond_resched() improvement (Mikulas)" * tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits) block/bio-integrity: use struct_size() in kmalloc() nvme: validate cntlid during controller initialisation nvme: change locking for the per-subsystem controller list nvme: trace all async notice events nvme: fix typos in nvme status code values nvme-fabrics: remove unused argument nvme-multipath: avoid crash on invalid subsystem cntlid enumeration nvme-fc: use separate work queue to avoid warning nvme-rdma: remove redundant reference between ib_device and tagset nvme-pci: mark expected switch fall-through nvme-pci: add known admin effects to augument admin effects log page nvme-pci: init shadow doorbell after each reset brd: add cond_resched to brd_free_pages sata_rcar: Remove ata_host_alloc() error printing s390/dasd: fix build warning in dasd_eckd_build_cp_raw lightnvm: pblk: use nvm_rq_to_ppa_list() lightnvm: pblk: simplify partial read path lightnvm: do not remove instance under global lock lightnvm: track inflight target creations lightnvm: pblk: recover only written metadata ...	2019-05-16 19:08:15 -07:00
Christoph Hellwig	1b1031ca63	nvme: validate cntlid during controller initialisation The CNTLID value is required to be unique, and we do rely on this for correct operation. So reject any controller for which a non-unique CNTLID has been detected. Based on a patch from Hannes Reinecke. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com>	2019-05-14 17:19:50 +02:00
Christoph Hellwig	32fd90c407	nvme: change locking for the per-subsystem controller list Life becomes a lot simpler if we just use the global nvme_subsystems_lock to protect this list. Given that it is only accessed during controller probing and removal that isn't a scalability problem either. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com>	2019-05-14 17:19:50 +02:00
Chaitanya Kulkarni	521cfb8e5a	nvme: trace all async notice events This patch removes the tracing of the NVMe Async events out of the switch so that it can trace all the events including the ones which are not handled in the nvme_handle_aen_notice(). The events which are not handled in the nvme_handle_aen_notice() such as NVME_AER_NOTICE_DISC_CHANGED corresponding event identifier needs to be added in the drivers/nvme/host/trace.h so that it can stringify the AER . Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-14 17:19:47 +02:00
Minwoo Im	94e970b674	nvme-fabrics: remove unused argument The variable 'count' is not currently used by nvmf_create_ctrl(), so remove it. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-14 17:19:43 +02:00
Hannes Reinecke	8a03b27ea6	nvme-multipath: avoid crash on invalid subsystem cntlid enumeration A process holding an open reference to a removed disk prevents it from completing deletion, so its name continues to exist. A subsequent gendisk creation may have the same cntlid which risks collision when using that for the name. Use the unique ctrl->instance instead. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-13 16:00:03 +02:00
Hannes Reinecke	8730c1ddb6	nvme-fc: use separate work queue to avoid warning When tearing down a controller the following warning is issued: WARNING: CPU: 0 PID: 30681 at ../kernel/workqueue.c:2418 check_flush_dependency This happens as the err_work workqueue item is scheduled on the system workqueue (which has WQ_MEM_RECLAIM not set), but is flushed from a workqueue which has WQ_MEM_RECLAIM set. Fix this by providing an FC-NVMe specific workqueue. Fixes: `4cff280a5f` ("nvme-fc: resolve io failures during connect") Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-13 16:00:03 +02:00
Max Gurtovoy	87fd125344	nvme-rdma: remove redundant reference between ib_device and tagset In the past, before adding `f41725bb` ("nvme-rdma: Use mr pool") commit, we needed a reference on the ib_device as long as the tagset was alive, as the MRs in the request structures needed a valid ib_device. Now, we allocate/deallocate MR pool per QP and consume on demand. Also remove nvme_rdma_free_tagset function and use blk_mq_free_tag_set instead, as it unneeded anymore. This commit also fixes a memory leakage and possible segmentation fault. When configuring the system with NIC teaming (aka bonding), we use 1 network interface to create an HA connection to the target side. In case one connection breaks down, nvme-rdma driver will get notification from rdma-cm layer that underlying address was change and will start error recovery process. During this process, we'll reconnect to the target via the second interface in the bond without destroying the tagset. This will cause a leakage of the initial rdma device (ndev) and miscount in the reference count of the new created rdma device (new ndev). In the final destruction (or in another error flow), we'll get a warning dump from the ib_dealloc_pd that we still have inflight MR's related to that pd. This happens becasue of the miscount of the reference tag of the rdma device and causing access violation to it's elements (some queues are not destroyed yet). Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-13 16:00:03 +02:00
Gustavo A. R. Silva	3b7dffb971	nvme-pci: mark expected switch fall-through In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: drivers/nvme/host/pci.c: In function ‘nvme_timeout’: drivers/nvme/host/pci.c:1298:12: warning: this statement may fall through [-Wimplicit-fallthrough=] shutdown = true; ~~~~~~~~~^~~~~~ drivers/nvme/host/pci.c:1299:2: note: here case NVME_CTRL_CONNECTING: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-13 16:00:03 +02:00
Maxim Levitsky	f4524cc456	nvme-pci: add known admin effects to augument admin effects log page Add known admin effects even if hardware has known admin effects page, since hardware can't be ever trusted to report sane values. (on my Intel DC P3700, it reports no side effects for namespace format) Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-13 16:00:03 +02:00
Maxim Levitsky	e8fd41bb3c	nvme-pci: init shadow doorbell after each reset The spec states: "The settings are not retained across a Controller Level Reset" Therefore the driver must enable the shadow doorbell, after each reset. This was caught while testing the nvme driver over upcoming nvme-mdev device. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-13 16:00:03 +02:00
Linus Torvalds	d1cd7c85f9	SCSI misc on 20190507 This is mostly update of the usual drivers: qla2xxx, qedf, smartpqi, hpsa, lpfc, ufs, mpt3sas, ibmvfc and hisi_sas. Plus number of minor changes, spelling fixes and other trivia. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXNIK0yYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishbeFAP4wsOm6 BNqDvLbF7gHAH+oSVAeENd9tepXG2xKbUHmLRgEA1wPaUxon8L8v/SSsqwewqMXP y6CnU6aO2iaViTNBsPs= =RX74 -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This is mostly update of the usual drivers: qla2xxx, qedf, smartpqi, hpsa, lpfc, ufs, mpt3sas, ibmvfc and hisi_sas. Plus number of minor changes, spelling fixes and other trivia" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (298 commits) scsi: qla2xxx: Avoid that lockdep complains about unsafe locking in tcm_qla2xxx_close_session() scsi: qla2xxx: Avoid that qlt_send_resp_ctio() corrupts memory scsi: qla2xxx: Fix hardirq-unsafe locking scsi: qla2xxx: Complain loudly about reference count underflow scsi: qla2xxx: Use __le64 instead of uint32_t[2] for sending DMA addresses to firmware scsi: qla2xxx: Introduce the dsd32 and dsd64 data structures scsi: qla2xxx: Check the size of firmware data structures at compile time scsi: qla2xxx: Pass little-endian values to the firmware scsi: qla2xxx: Fix race conditions in the code for aborting SCSI commands scsi: qla2xxx: Use an on-stack completion in qla24xx_control_vp() scsi: qla2xxx: Make qla24xx_async_abort_cmd() static scsi: qla2xxx: Remove unnecessary locking from the target code scsi: qla2xxx: Remove qla_tgt_cmd.released scsi: qla2xxx: Complain if a command is released that is owned by the firmware scsi: qla2xxx: target: Fix offline port handling and host reset handling scsi: qla2xxx: Fix abort handling in tcm_qla2xxx_write_pending() scsi: qla2xxx: Fix error handling in qlt_alloc_qfull_cmd() scsi: qla2xxx: Simplify qlt_send_term_imm_notif() scsi: qla2xxx: Fix use-after-free issues in qla2xxx_qpair_sp_free_dma() scsi: qla2xxx: Fix a qla24xx_enable_msix() error path ...	2019-05-08 10:12:46 -07:00
Linus Torvalds	67a2422239	for-5.2/block-20190507 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX RcCuPeLAVg== =Ot0g -----END PGP SIGNATURE----- Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Nothing major in this series, just fixes and improvements all over the map. This contains: - Series of fixes for sed-opal (David, Jonas) - Fixes and performance tweaks for BFQ (via Paolo) - Set of fixes for bcache (via Coly) - Set of fixes for md (via Song) - Enabling multi-page for passthrough requests (Ming) - Queue release fix series (Ming) - Device notification improvements (Martin) - Propagate underlying device rotational status in loop (Holger) - Removal of mtip32xx trim support, which has been disabled for years (Christoph) - Improvement and cleanup of nvme command handling (Christoph) - Add block SPDX tags (Christoph) - Cleanup/hardening of bio/bvec iteration (Christoph) - A few NVMe pull requests (Christoph) - Removal of CONFIG_LBDAF (Christoph) - Various little fixes here and there" * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits) block: fix mismerge in bvec_advance block: don't drain in-progress dispatch in blk_cleanup_queue() blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release blk-mq: always free hctx after request queue is freed blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: free hw queue's resource in hctx's release handler blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: grab .q_usage_counter when queuing request from plug code path block: fix function name in comment nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting ...	2019-05-07 18:14:36 -07:00
Igor Konopko	a14669ebc0	lightnvm: Inherit mdts from the parent nvme device Current lightnvm and pblk implementation does not care about NVMe max data transfer size, which can be smaller than 64*K=256K. There are existing NVMe controllers which NVMe max data transfer size is lower that 256K (for example 128K, which happens for existing NVMe controllers which are NVMe spec compliant). Such a controllers are not able to handle command which contains 64 PPAs, since the the size of DMAed buffer will be above the capabilities of such a controller. Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-05-06 10:19:17 -06:00
Christoph Hellwig	893a74b7a7	nvme: mark nvme_core_init and nvme_core_exit static Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-05-01 09:18:47 -04:00
Christoph Hellwig	811015409f	nvme: move command size checks to the core Most command aren't PCIe specific, so move the size checking for them to core.c Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-05-01 09:18:36 -04:00
Minwoo Im	a2faf94e57	nvme-fabrics: check more command sizes struct common_command provides a common structure for NVMe-oF command format. It also needs to be checked for unintended size growth. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:27 -04:00
Minwoo Im	a97234e1ff	nvme-pci: check more command sizes All the NVMe command has 64bytes fixed size so that it has been assured with BUILD_BUG_ON(). The remaining command structures in linux/nvme.h also need to be checked here. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:27 -04:00
Minwoo Im	665648673e	nvme-pci: remove an unneeded variable initialization Variable "n" will be assigned once kstrtoint() succeeds, otherwise it will not be referred because kstrtoint() will return an error which means go out from this function. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:27 -04:00
Keith Busch	c8e9e9b764	nvme-pci: unquiesce admin queue on shutdown Just like IO queues, the admin queue also will not be restarted after a controller shutdown. Unquiesce this queue so that we do not block request dispatch on a permanently disabled controller. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:22 -04:00
Keith Busch	9dc1a38ef1	nvme-pci: shutdown on timeout during deletion We do not restart a controller in a deleting state for timeout errors. When in this state, unblock potential request dispatchers with failed completions by shutting down the controller on timeout detection. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:18 -04:00
Klaus Birkelund Jensen	049bf37262	nvme-pci: fix psdt field for single segment sgls The shortcut for single segment SGL requests did not set the PSDT field to mark the request as using SGLs. Fixes: `297910571f` ("nvme-pci: optimize mapping single segment requests using SGLs") Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:16 -04:00
Hannes Reinecke	592b6e7b02	nvme-multipath: don't print ANA group state by default Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:16 -04:00
Hannes Reinecke	525aa5a705	nvme-multipath: split bios with the ns_head bio_set before submitting If the bio is moved to a different queue via blk_steal_bios() and the original queue is destroyed in nvme_remove_ns() we'll be ending with a crash in bio_endio() as the mempool for the split bio bvecs had already been destroyed. So split the bio using the original queue (which will remain during the lifetime of the bio) before sending it down to the underlying device. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:15 -04:00
Sagi Grimberg	f34e25898a	nvme-tcp: fix possible null deref on a timed out io queue connect If I/O queue connect times out, we might have freed the queue socket already, so check for that on the error path in nvme_tcp_start_queue. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-05-01 09:17:15 -04:00
Sagi Grimberg	01fa017484	nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE If our target exposed a namespace with a block size that is greater than PAGE_SIZE, set 0 capacity on the namespace as we do not support it. This issue encountered when the nvmet namespace was backed by a tempfile. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-25 16:51:42 +02:00
Sagi Grimberg	efb973b19b	nvme-tcp: rename function to have nvme_tcp prefix usually nvme_ prefix is for core functions. While we're cleaning up, remove redundant empty lines Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-25 16:51:41 +02:00
Sagi Grimberg	1007709d7d	nvme-rdma: fix a NULL deref when an admin connect times out If we timeout the admin startup sequence we might not yet have an I/O tagset allocated which causes the teardown sequence to crash. Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags if the tagset wasn't allocated. Fixes: `4c174e6366` ("nvme-rdma: fix timeout handler") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-25 16:51:39 +02:00
Sagi Grimberg	7a42589654	nvme-tcp: fix a NULL deref when an admin connect times out If we timeout the admin startup sequence we might not yet have an I/O tagset allocated which causes the teardown sequence to crash. Make nvme_tcp_teardown_io_queues safe by not iterating inflight tags if the tagset wasn't allocated. Fixes: `39d5775746` ("nvme-tcp: fix timeout handler") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-25 16:51:37 +02:00
Jens Axboe	5c61ee2cd5	Linux 5.1-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAly8rGYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGmZMH/1IRB0E1Qmzz8yzw wj79UuRGYPqxDDSWW+wNc8sU4Ic7iYirn9APHAztCdQqsjmzU/OVLfSa3JhdBe5w THo7pbGKBqEDcWnKfNk/21jXFNLZ1vr9BoQv2DGU2MMhHAyo/NZbalo2YVtpQPmM OCRth5n+LzvH7rGrX7RYgWu24G9l3NMfgtaDAXBNXesCGFAjVRrdkU5CBAaabvtU 4GWh/nnutndOOLdByL3x+VZ3H3fIBnbNjcIGCglvvqzk7h3hrfGEl4UCULldTxcM IFsfMUhSw1ENy7F6DHGbKIG90cdCJcrQ8J/ziEzjj/KLGALluutfFhVvr6YCM2J6 2RgU8CY= =CfY1 -----END PGP SIGNATURE----- Merge tag 'v5.1-rc6' into for-5.2/block Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a comment, and is trivial. The other one is a conflict due to a later fix in the bio multi-page work, and needs a bit more care. * tag 'v5.1-rc6': (770 commits) Linux 5.1-rc6 block: make sure that bvec length can't be overflow block: kill all_q_node in request_queue x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping mm/kmemleak.c: fix unused-function warning init: initialize jump labels before command line option parsing kernel/watchdog_hld.c: hard lockup message should end with a newline kcov: improve CONFIG_ARCH_HAS_KCOV help text mm: fix inactive list balancing between NUMA nodes and cgroups mm/hotplug: treat CMA pages as unmovable proc: fixup proc-pid-vm test proc: fix map_files test on F29 mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock mm: swapoff: shmem_unuse() stop eviction without igrab() mm: swapoff: take notice of completion sooner mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES mm: swapoff: shmem_find_swap_entries() filter out other types slab: store tagged freelist for off-slab slabmgmt ... Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-22 09:47:36 -06:00
Ingo Molnar	94e4dcc75a	Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU and LKMM commits from Paul E. McKenney: - An LKMM commit adding support for synchronize_srcu_expedited() - A couple of straggling RCU flavor consolidation updates - Documentation updates. - Miscellaneous fixes - SRCU updates - RCU CPU stall-warning updates - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org>	2019-04-18 14:42:24 +02:00
Hannes Reinecke	a6a6d0589a	scsi: scsi_transport_fc: nvme: display FC-NVMe port roles Currently the FC-NVMe driver is leverating the SCSI FC transport class to access the remote ports. Which means that all FC-NVMe remote ports will be visible to the fc transport layer, but due to missing definitions the port roles will always be 'unknown'. This patch adds the missing definitions to the fc transport class to that the port roles are correctly displayed. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: Giridhar Malavali <gmalavali@marvell.com> Reviewed-by: Himanshu Madhani <hmadhani@marvell.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2019-04-12 20:09:34 -04:00
James Smart	67f471b6ed	nvme-fc: correct csn initialization and increments on error This patch fixes a long-standing bug that initialized the FC-NVME cmnd iu CSN value to 1. Early FC-NVME specs had the connection starting with CSN=1. By the time the spec reached approval, the language had changed to state a connection should start with CSN=0. This patch corrects the initialization value for FC-NVME connections. Additionally, in reviewing the transport, the CSN value is assigned to the new IU early in the start routine. It's possible that a later dma map request may fail, causing the command to never be sent to the controller. Change the location of the assignment so that it is immediately prior to calling the lldd. Add a comment block to explain the impacts if the lldd were to additionally fail sending the command. Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-11 17:28:30 +02:00
Ming Lei	eb3afb75b5	nvme: cancel request synchronously nvme_cancel_request() is used in error handler, and it is always reliable to cancel request synchronously, and avoids possible race in which request may be completed after real hw queue is destroyed. One issue is reported by our customer on NVMe RDMA, in which freed ib queue pair may be used in nvme_rdma_complete_rq(). Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Bart Van Assche <bvanassche@acm.org> Cc: James Smart <james.smart@broadcom.com> Cc: linux-nvme@lists.infradead.org Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-04-10 09:57:35 -06:00
Kenneth Heitke	d0de579c04	nvme: log the error status on Identify Namespace failure Identify Namespace failures are logged as a warning but there is not an indication of the cause for the failure. Update the log message to include the error status. Signed-off-by: Kenneth Heitke <kenneth.heitke@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-05 08:07:58 +02:00
Christoph Hellwig	70479b71bc	nvme-pci: tidy up nvme_map_data Remove two pointless local variables, remove ret assignment that is never used, move the use_sgl initialization closer to where it is used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:58 +02:00
Christoph Hellwig	297910571f	nvme-pci: optimize mapping single segment requests using SGLs If the controller supports SGLs we can take another short cut for single segment request, given that we can always map those without another indirection structure, and thus don't need to create a scatterlist structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:58 +02:00
Christoph Hellwig	dff824b2aa	nvme-pci: optimize mapping of small single segment requests If a request is single segment and fits into one or two PRP entries we do not have to create a scatterlist for it, but can just map the bio_vec directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:58 +02:00
Christoph Hellwig	d43f1ccfad	nvme-pci: remove the inline scatterlist optimization We'll have a better way to optimize for small I/O that doesn't require it soon, so remove the existing inline_sg case to make that optimization easier to implement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:58 +02:00
Christoph Hellwig	4aedb70543	nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data This prepares for some bigger changes to the data mapping helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:58 +02:00
Christoph Hellwig	783b94bd92	nvme-pci: do not build a scatterlist to map metadata We always have exactly one segment, so we can simply call dma_map_bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:57 +02:00
Christoph Hellwig	b15c592de3	nvme-pci: only call nvme_unmap_data for requests transferring data This mirrors how nvme_map_pci is called and will allow simplifying some checks in nvme_unmap_pci later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:57 +02:00
Christoph Hellwig	7fe07d14f7	nvme-pci: merge nvme_free_iod into nvme_unmap_data This means we now have a function that undoes everything nvme_map_data does and we can simplify the error handling a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:57 +02:00
Christoph Hellwig	915f04c93d	nvme-pci: move the call to nvme_cleanup_cmd out of nvme_unmap_data Cleaning up the command setup isn't related to unmapping data, and disentangling them will simplify error handling a bit down the road. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:57 +02:00
Christoph Hellwig	9b048119a1	nvme-pci: remove nvme_init_iod nvme_init_iod should really be split into two parts: initialize a few general iod fields, which can easily be done at the beginning of nvme_queue_rq, and allocating the scatterlist if needed, which logically belongs into nvme_map_data with the code making use of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2019-04-05 08:07:57 +02:00
Keith Busch	39f8e36401	nvme-pci: remove unused nvme_iod member Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-05 08:07:57 +02:00
Keith Busch	88a041f4c1	nvme-pci: remove q_dmadev from nvme_queue We don't need to save the dma device as it's not used in the hot path and hasn't in a long time. Shrink the struct nvme_queue removing this unnecessary member. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-05 08:07:57 +02:00
Keith Busch	7c349dde26	nvme-pci: use a flag for polled queues A negative value for the cq_vector used to mean the queue is either disabled or a polled queue. However, we have a queue enabled flag, so the cq_vector had been serving double duty. Don't overload the meaning of cq_vector. Use a flag specific to the polled queues instead. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-05 08:07:57 +02:00
Max Gurtovoy	43e2d08d07	nvme: avoid double dereference to convert le to cpu Use le16_to_cpu instead of le16_to_cpup and le64_to_cpu instead of le64_to_cpup. This will also align the code to nvme-core driver convention. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-04-05 08:07:56 +02:00
Martin George	cc2278c413	nvme-multipath: relax ANA state check When undergoing state transitions I/O might be requeued, hence we should always call nvme_mpath_set_live() to schedule requeue_work whenever the nvme device is live, independent on whether the old state was live or not. Signed-off-by: Martin George <marting@netapp.com> Signed-off-by: Gargi Srinivas <sring@netapp.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-03-28 18:15:02 +01:00
Christoph Hellwig	988aef9e8b	nvme-tcp: fix an endianess miss-annotation nvme_tcp_end_request just takes the status value and the converts it to little endian as well as shifting for the phase bit. Fixes: 43ce38a6d823 ("nvme-tcp: support C2HData with SUCCESS flag") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-03-28 18:15:02 +01:00
Paul E. McKenney	f5ad399149	srcu: Remove cleanup_srcu_struct_quiesced() The cleanup_srcu_struct_quiesced() function was added because NVME used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that NVME workqueues waiting on SRCU workqueues could result in deadlocks during low-memory conditions. However, SRCU now also has WQ_MEM_RECLAIM workqueues, so there is no longer a potential for deadlock. Furthermore, it turns out to be extremely hard to use cleanup_srcu_struct_quiesced() correctly due to the fact that SRCU callback invocation accesses the srcu_struct structure's per-CPU data area just after callbacks are invoked. Therefore, the usual practice of using srcu_barrier() to wait for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced() fails because SRCU's callback-invocation workqueue handler might be delayed, which can result in cleanup_srcu_struct_quiesced() being invoked (and thus freeing the per-CPU data) before the SRCU's callback-invocation workqueue handler is finished using that per-CPU data. Nor is this a theoretical problem: KASAN emitted use-after-free warnings because of this problem on actual runs. In short, NVME can now safely invoke cleanup_srcu_struct(), which avoids the use-after-free scenario. And cleanup_srcu_struct_quiesced() is quite difficult to use safely. This commit therefore removes cleanup_srcu_struct_quiesced(), switching its sole user back to cleanup_srcu_struct(). This effectively reverts the following pair of commits: `f7194ac32c` ("srcu: Add cleanup_srcu_struct_quiesced()") `4317228ad9` ("nvme: Avoid flush dependency in delete controller flow") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org>	2019-03-26 14:39:24 -07:00
Linus Torvalds	11efae3506	for-5.1/block-post-20190315 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyL124QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptsxD/42slmoE5TC3vwXcgMBEilrjIHCns6O4Leo 0r8Awdwil8QkVDphfAWsgkTBjRPUNKv4cCg2kG4VEzAy62YSutUWPeqJZwLOpGDI kji9XI6WLqwQ/VhDFwEln9G+xWDUQxds5PZDomlzLpjiNqkFArwwsPFnJbshH4fB U6kZrhVSLfvJHIJmC9H4RIWuTEwUH1yFSvzzMqDOOyvRon2g/A2YlHb2KhSCaJPq 1b0jbhyR0GVP0EH1FdeKvNYFZfvXXSPAbxDN1CEtW/Lq8WxXeoaCj390tC+gL7yQ WWHntvUoVU/weWudbT3tVsYgpI91KfPM5OuWTDGod6lFwHrI5X91Pao3KYUGPb9d cwvNBOlkNqR1ENZOGTgxLeKwiwV7G1DIjvsaijRQJhGy4Uw4RkM/YEct9JHxWBIF x4ZuSVUVZ5Y3zNPC945iJ6Z5feOz/UO9bQL00oimu0c0JhAp++3pHWAFJEMQ8q1a 0IRifkeUyhf0p9CIVPDnUzmNgSBglFkAVTPVAWySBVDU+v0/GoNcYwTzPq4cgPrF UJEIlx+RdDpKKmCqBvKjtx4w7BC1lCebL/1ZJrbARNO42djt8xeuyvKw0t+MYVTZ UsvLX72tXwUIbj0IZZGuz+8uSGD4ddDs8+x486FN4oaCPf36FUnnkOZZkhjV/KQA vsZNrNNZpw== =qBae -----END PGP SIGNATURE----- Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block Pull more block layer changes from Jens Axboe: "This is a collection of both stragglers, and fixes that came in after I finalized the initial pull. This contains: - An MD pull request from Song, with a few minor fixes - Set of NVMe patches via Christoph - Pull request from Konrad, with a few fixes for xen/blkback - pblk fix IO calculation fix (Javier) - Segment calculation fix for pass-through (Ming) - Fallthrough annotation for blkcg (Mathieu)" * tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits) blkcg: annotate implicit fall through nvme-tcp: support C2HData with SUCCESS flag nvmet: ignore EOPNOTSUPP for discard nvme: add proper write zeroes setup for the multipath device nvme: add proper discard setup for the multipath device nvme: remove nvme_ns_config_oncs nvme: disable Write Zeroes for qemu controllers nvmet-fc: bring Disconnect into compliance with FC-NVME spec nvmet-fc: fix issues with targetport assoc_list list walking nvme-fc: reject reconnect if io queue count is reduced to zero nvme-fc: fix numa_node when dev is null nvme-fc: use nr_phys_segments to determine existence of sgl nvme-loop: init nvmet_ctrl fatal_err_work when allocate nvme: update comment to make the code easier to read nvme: put ns_head ref if namespace fails allocation nvme-trace: fix cdw10 buffer overrun nvme: don't warn on block content change effects nvme: add get-feature to admin cmds tracer md: Fix failed allocation of md_register_thread It's wrong to add len to sector_nr in raid10 reshape twice ...	2019-03-16 12:36:39 -07:00
Sagi Grimberg	602d674ce9	nvme-tcp: support C2HData with SUCCESS flag A C2HData PDU with the SUCCESS flag set indicates that the I/O was completed by the controller successfully and means that a subsequent completion response capsule PDU will be ommitted. If we see this flag, fisrt we check that LAST_PDU flag is set as well, and then we complete the request when the data transfer (and data digest verification if its on) is done. While we're at it, reuse a bit of code with nvme_fail_request. Reported-by: Steve Blightman <steve.blightman@oracle.com> Suggested-by: Oliver Smith-Denny <osmithde@cisco.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Oliver Smith-Denny <osmithde@cisco.com> Tested-by: Oliver Smith-Denny <osmithde@cisco.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:57:34 -06:00
Christoph Hellwig	9f0916ab93	nvme: add proper write zeroes setup for the multipath device Add a gendisk argument to nvme_config_write_zeroes so that the call to nvme_update_disk_info for the multipath device node updates the proper request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:57:34 -06:00
Christoph Hellwig	2631857160	nvme: add proper discard setup for the multipath device Add a gendisk argument to nvme_config_discard so that the call to nvme_update_disk_info for the multipath device node updates the proper request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:57:34 -06:00
Christoph Hellwig	b1aafb35b4	nvme: remove nvme_ns_config_oncs Just opencode the two function calls in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:57:34 -06:00
Christoph Hellwig	7b210e4ed5	nvme: disable Write Zeroes for qemu controllers Qemu started out with a broken implementation of Write Zeroes written by yours truly. Disable Write Zeroes on qemu for now, eventually we need to go back and make all the qemu quirks version specific, but that is left for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:57:34 -06:00
James Smart	834d3710a0	nvme-fc: reject reconnect if io queue count is reduced to zero If: - A successful connect has occurred with an io queue count greater than zero and namespaces detected and running. - An error or something occurs which causes a termination of the prior association and then starts a reconnect, - The reconnect then creates a new controller, but for whatever reason, nvme_set_queue_count() results in io queue count set to zero. This will skip io queue and tag set changes. - But... the controller will transition to live, calling nvme_start_ctrl, which calls nvme_start_queues(), which then releases I/Os into the transport which then sends them to the driver. As there are no queues, things eventually hit the driver looking for a handle, which was cleared when the original controller was reset, and it can't proceed. Worst case, things progress, but everything fails. In the failing scenario, the nvme_set_features(NVME_FEAT_NUM_QUEUES) command actually failed with a NVME_SC_INTERNAL error. For some reason, although nvme_set_queue_count() saw the error and set io queue count to zero, it doesn't return a failure status to the transport, which allows the transport to continue using the controller. Fix the problem by simply rejecting the new association if at least 1 I/O queue can't be created. The association reject will fail the reconnect attempt and fall into the reconnect retry policy. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:40 -06:00
James Smart	06f3d71ea0	nvme-fc: fix numa_node when dev is null A recent change added a numa_node field to the nvme controller and has the transport assign the node using dev_to_node(). However, fcloop registers with a NULL device struct, so the dev_to_node() call oops. Revise the assignment to assign no node when device struct is null. Fixes: `103e515efa` ("nvme: add a numa_node field to struct nvme_ctrl") Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> [hch: small coding style fixup] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:40 -06:00
James Smart	9f7d8ae2f7	nvme-fc: use nr_phys_segments to determine existence of sgl For some nvme command, when issued by the nvme core layer, there is an internal buffer which can cause blk_rq_payload_bytes() to return a non-zero value yet there is no actual/real command payload and sg list. An example is the WRITE ZEROES command. To address this, when making choices on whether to dma map an sgl, use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes(). When there is a sgl, blk_rq_payload_bytes() will return the amount of data to be transferred by the sgl. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:39 -06:00
Yufen Yu	01fc08ff1f	nvme: update comment to make the code easier to read After commit `a686ed75c0` ("nvme: introduce a helper function for controller deletion), nvme_delete_ctrl_sync no longer use flush_work. Update comment, accordingly. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:39 -06:00
Sagi Grimberg	a63b83700b	nvme: put ns_head ref if namespace fails allocation In case nvme_alloc_ns fails after we initialize ns_head but before we add the ns to the controller namespaces list we need to explicitly put the ns_head reference because when we teardown the controller we won't find it, causing us to leak a dangling subsystem eventually. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:39 -06:00
Keith Busch	81fe928499	nvme-trace: fix cdw10 buffer overrun The field is defined to be a 24 byte array, we don't need to multiply the sizeof() that field by the number of dwords it covers. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:39 -06:00
Keith Busch	415df90b43	nvme: don't warn on block content change effects A write or flush IO passthrough command is expected to change the logical block content, so don't warn on these as no additional handling is necessary. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:39 -06:00
Max Gurtovoy	d9d53ed3f7	nvme: add get-feature to admin cmds tracer This will print get-feature cmd in more informative way. For example, run "nvme get-feature /dev/nvme0 -n 1 -f 0x9 -c 10" will trace: nvme-3907 [008] .... 1763.635054: nvme_setup_cmd: nvme0: qid=0, cmdid=6, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_admin_get_features fid=0x9 sel=0x0 cdw11=0xa) <idle>-0 [001] d.h. 1763.635112: nvme_sq: nvme0: qid=0, head=27, tail=27 <idle>-0 [008] ..s. 1763.635121: nvme_complete_rq: nvme0: qid=0, cmdid=6, res=10, retries=0, flags=0x2, status=0 Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-03-13 12:05:39 -06:00
Linus Torvalds	80201fe175	for-5.1/block-20190302 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/ ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl 3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q 8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE DRVVhYpocw== =VBaU -----END PGP SIGNATURE----- Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "Not a huge amount of changes in this round, the biggest one is that we finally have Mings multi-page bvec support merged. Apart from that, this pull request contains: - Small series that avoids quiescing the queue for sysfs changes that match what we currently have (Aleksei) - Series of bcache fixes (via Coly) - Series of lightnvm fixes (via Mathias) - NVMe pull request from Christoph. Nothing major, just SPDX/license cleanups, RR mp policy (Hannes), and little fixes (Bart, Chaitanya). - BFQ series (Paolo) - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection for the fast path (Jianchao) - fops->iopoll() added for async IO polling, this is a feature that the upcoming io_uring interface will use (Christoph, me) - Partition scan loop fixes (Dongli) - mtip32xx conversion from managed resource API (Christoph) - cdrom registration race fix (Guenter) - MD pull from Song, two minor fixes. - Various documentation fixes (Marcos) - Multi-page bvec feature. This brings a lot of nice improvements with it, like more efficient splitting, larger IOs can be supported without growing the bvec table size, and so on. (Ming) - Various little fixes to core and drivers" * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits) block: fix updating bio's front segment size block: Replace function name in string with __func__ nbd: propagate genlmsg_reply return code floppy: remove set but not used variable 'q' null_blk: fix checking for REQ_FUA block: fix NULL pointer dereference in register_disk fs: fix guard_bio_eod to check for real EOD errors blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map block: optimize bvec iteration in bvec_iter_advance block: introduce mp_bvec_for_each_page() for iterating over page block: optimize blk_bio_segment_split for single-page bvec block: optimize __blk_segment_map_sg() for single-page bvec block: introduce bvec_nth_page() iomap: wire up the iopoll method block: add bio_set_polled() helper block: wire up block device iopoll method fs: add an iopoll method to struct file_operations loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() loop: do not print warn message if partition scan is successful block: bounce: make sure that bvec table is updated ...	2019-03-08 14:12:17 -08:00
Linus Torvalds	78f8601354	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The interrupt departement delivers this time: - New infrastructure to manage NMIs on platforms which have a sane NMI delivery, i.e. identifiable NMI vectors instead of a single lump. - Simplification of the interrupt affinity management so drivers don't have to implement ugly loops around the PCI/MSI enablement. - Speedup for interrupt statistics in /proc/stat - Provide a function to retrieve the default irq domain - A new interrupt controller for the Loongson LS1X platform - Affinity support for the SiFive PLIC - Better support for the iMX irqsteer driver - NUMA aware memory allocations for GICv3 - The usual small fixes, improvements and cleanups all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) irqchip/imx-irqsteer: Add multi output interrupts support irqchip/imx-irqsteer: Change to use reg_num instead of irq_group dt-bindings: irq: imx-irqsteer: Add multi output interrupts support dt-binding: irq: imx-irqsteer: Use irq number instead of group number irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables irqdomain: Allow the default irq domain to be retrieved irqchip/sifive-plic: Implement irq_set_affinity() for SMP host irqchip/sifive-plic: Differentiate between PLIC handler and context irqchip/sifive-plic: Add warning in plic_init() if handler already present irqchip/sifive-plic: Pre-compute context hart base and enable base PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets genirq/affinity: Remove the leftovers of the original set support nvme-pci: Simplify interrupt allocation genirq/affinity: Add new callback for (re)calculating interrupt sets genirq/affinity: Store interrupt sets size in struct irq_affinity genirq/affinity: Code consolidation irqchip/irq-sifive-plic: Check and continue in case of an invalid cpuid. irqchip/i8259: Fix shutdown order by moving syscore_ops registration dt-bindings: interrupt-controller: loongson ls1x intc ...	2019-03-05 12:21:47 -08:00
Chaitanya Kulkarni	34e08191b1	nvme-rdma: use nr_phys_segments when map rq to sgl Use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes() to check if a command contains data to be mapped. This fixes the case where a struct request contains LBAs, but it has no payload, such as Write Zeroes support. Fixes: `6e02318eae` ("nvme: add support for the Write Zeroes command") Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-21 06:39:20 -07:00
Christoph Hellwig	bc50ad7501	nvme: convert to SPDX identifiers Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-02-20 07:22:28 -07:00
Christoph Hellwig	5f37396dff	nvme-pci: convert to SPDX identifiers Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-02-20 07:22:25 -07:00
Christoph Hellwig	115aa7abd7	nvme-lightnvm: convert to SPDX identifiers Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-02-20 07:22:22 -07:00
Christoph Hellwig	5d8762d568	nvme-rdma: convert to SPDX identifiers Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-02-20 07:22:19 -07:00
Christoph Hellwig	8638b24614	nvme-fc: convert to SPDX identifiers Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-02-20 07:22:17 -07:00
Christoph Hellwig	9002c4e5ff	nvme-fabrics: convert to SPDX identifiers Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-02-20 07:22:13 -07:00
Hannes Reinecke	ab4ab09cbd	nvme: return error from nvme_alloc_ns() nvme_alloc_ns() might fail, so we should be returning an error code. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-20 07:21:00 -07:00
Bart Van Assche	b9c77583b0	nvme: avoid that deleting a controller triggers a circular locking complaint Rework nvme_delete_ctrl_sync() such that it does not have to wait for queued work. This patch avoids that test nvme/008 triggers the following complaint: WARNING: possible circular locking dependency detected 5.0.0-rc6-dbg+ #10 Not tainted ------------------------------------------------------ nvme/7918 is trying to acquire lock: 000000009a1a7b69 ((work_completion)(&ctrl->delete_work)){+.+.}, at: __flush_work+0x379/0x410 but task is already holding lock: 00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (kn->count#389){++++}: lock_acquire+0xc5/0x1e0 __kernfs_remove+0x42a/0x4a0 kernfs_remove_by_name_ns+0x45/0x90 remove_files.isra.1+0x3a/0x90 sysfs_remove_group+0x5c/0xc0 sysfs_remove_groups+0x39/0x60 device_remove_attrs+0x68/0xb0 device_del+0x24d/0x570 cdev_device_del+0x1a/0x50 nvme_delete_ctrl_work+0xbd/0xe0 process_one_work+0x4f1/0xa40 worker_thread+0x67/0x5b0 kthread+0x1cf/0x1f0 ret_from_fork+0x24/0x30 -> #0 ((work_completion)(&ctrl->delete_work)){+.+.}: __lock_acquire+0x1323/0x17b0 lock_acquire+0xc5/0x1e0 __flush_work+0x399/0x410 flush_work+0x10/0x20 nvme_delete_ctrl_sync+0x65/0x70 nvme_sysfs_delete+0x4f/0x60 dev_attr_store+0x3e/0x50 sysfs_kf_write+0x87/0xa0 kernfs_fop_write+0x186/0x240 __vfs_write+0xd7/0x430 vfs_write+0xfa/0x260 ksys_write+0xab/0x130 __x64_sys_write+0x43/0x50 do_syscall_64+0x71/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(kn->count#389); lock((work_completion)(&ctrl->delete_work)); lock(kn->count#389); lock((work_completion)(&ctrl->delete_work)); * DEADLOCK * 3 locks held by nvme/7918: #0: 00000000e2223b44 (sb_writers#6){.+.+}, at: vfs_write+0x1eb/0x260 #1: 000000003404976f (&of->mutex){+.+.}, at: kernfs_fop_write+0x128/0x240 #2: 00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210 stack backtrace: CPU: 4 PID: 7918 Comm: nvme Not tainted 5.0.0-rc6-dbg+ #10 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x86/0xca print_circular_bug.isra.36.cold.54+0x173/0x1d5 check_prev_add.constprop.45+0x996/0x1110 __lock_acquire+0x1323/0x17b0 lock_acquire+0xc5/0x1e0 __flush_work+0x399/0x410 flush_work+0x10/0x20 nvme_delete_ctrl_sync+0x65/0x70 nvme_sysfs_delete+0x4f/0x60 dev_attr_store+0x3e/0x50 sysfs_kf_write+0x87/0xa0 kernfs_fop_write+0x186/0x240 __vfs_write+0xd7/0x430 vfs_write+0xfa/0x260 ksys_write+0xab/0x130 __x64_sys_write+0x43/0x50 do_syscall_64+0x71/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-20 07:19:42 -07:00
Bart Van Assche	a686ed75c0	nvme: introduce a helper function for controller deletion This patch does not change any functionality but makes the next patch in this series easier to read. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-20 07:18:21 -07:00
Bart Van Assche	d84c4b024a	nvme: unexport nvme_delete_ctrl_sync() Since nvme_delete_ctrl_sync() is not called from any other kernel module, unexport it. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-20 07:18:15 -07:00
Bart Van Assche	e895fedf12	nvme-pci: check kstrtoint() return value in queue_count_set() This patch avoids that the compiler complains about 'ret' being set but not being used when building with W=1. Fixes: `3b6592f70a` ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-20 07:18:07 -07:00
Bart Van Assche	a467fc55fc	nvme-fabrics: document the poll function argument This patch avoids that the kernel-doc tool reports a warning when building with W=1. Fixes: `26c682274e` ("nvme-fabrics: allow nvmf_connect_io_queue to poll") # v5.0-rc1 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-20 07:18:01 -07:00
Hannes Reinecke	75c10e7327	nvme-multipath: round-robin I/O policy Implement a simple round-robin I/O policy for multipathing. Path selection is done in two rounds, first iterating across all optimized paths, and if that doesn't return any valid paths, iterate over all optimized and non-optimized paths. If no paths are found, use the existing algorithm. Also add a sysfs attribute 'iopolicy' to switch between the current NUMA-aware I/O policy and the 'round-robin' I/O policy. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-20 07:17:49 -07:00
Ming Lei	612b72862b	nvme-pci: Simplify interrupt allocation The NVME PCI driver contains a tedious mechanism for interrupt allocation, which is necessary to adjust the number and size of interrupt sets to the maximum available number of interrupts which depends on the underlying PCI capabilities and the available CPU resources. It works around the former short comings of the PCI and core interrupt allocation mechanims in combination with interrupt sets. The PCI interrupt allocation function allows to provide a maximum and a minimum number of interrupts to be allocated and tries to allocate as many as possible. This worked without driver interaction as long as there was only a single set of interrupts to handle. With the addition of support for multiple interrupt sets in the generic affinity spreading logic, which is invoked from the PCI interrupt allocation, the adaptive loop in the PCI interrupt allocation did not work for multiple interrupt sets. The reason is that depending on the total number of interrupts which the PCI allocation adaptive loop tries to allocate in each step, the number and the size of the interrupt sets need to be adapted as well. Due to the way the interrupt sets support was implemented there was no way for the PCI interrupt allocation code or the core affinity spreading mechanism to invoke a driver specific function for adapting the interrupt sets configuration. As a consequence the driver had to implement another adaptive loop around the PCI interrupt allocation function and calling that with maximum and minimum interrupts set to the same value. This ensured that the allocation either succeeded or immediately failed without any attempt to adjust the number of interrupts in the PCI code. The core code now allows drivers to provide a callback to recalculate the number and the size of interrupt sets during PCI interrupt allocation, which in turn allows the PCI interrupt allocation function to be called in the same way as with a single set of interrupts. The PCI code handles the adaptive loop and the interrupt affinity spreading mechanism invokes the driver callback to adapt the interrupt set configuration to the current loop value. This replaces the adaptive loop in the driver completely. Implement the NVME specific callback which adjusts the interrupt sets configuration and remove the adaptive allocation loop. [ tglx: Simplify the callback further and restore the dropped adjustment of number of sets ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de	2019-02-18 11:21:28 +01:00
Ming Lei	9cfef55bb5	genirq/affinity: Store interrupt sets size in struct irq_affinity The interrupt affinity spreading mechanism supports to spread out affinities for one or more interrupt sets. A interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queus of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilites and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires to have a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback will be added to struct affinity_desc, which will be invoked by the core code. The callback will get the number of available interupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. To support this, two modifications for the handling of struct irq_affinity are required: 1) The (optional) interrupt sets size information is contained in a separate array of integers and struct irq_affinity contains a pointer to it. This is cumbersome and as the maximum number of interrupt sets is small, there is no reason to have separate storage. Moving the size array into struct affinity_desc avoids indirections and makes the code simpler. 2) At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const'. With the upcoming callback to recalculate the number and size of interrupt sets, it's necessary to remove the 'const' qualifier. Otherwise the callback would not be able to update the data. Implement #1 and store the interrupt sets size in 'struct irq_affinity'. No functional change. [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the source. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de	2019-02-18 11:21:27 +01:00
Jens Axboe	6fb845f0e7	Linux 5.0-rc6 -----BEGIN PGP SIGNATURE----- iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR A6Ktjw== =scY4 -----END PGP SIGNATURE----- Merge tag 'v5.0-rc6' into for-5.1/block Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c. This is needed since io_uring is now based on the block branch, to avoid a conflict between the multi-page bvecs and the bits of io_uring that touch the core block parts. * tag 'v5.0-rc6': (525 commits) Linux 5.0-rc6 x86/mm: Make set_pmd_at() paravirt aware MAINTAINERS: Update the ocores i2c bus driver maintainer, etc blk-mq: remove duplicated definition of blk_mq_freeze_queue Blk-iolatency: warn on negative inflight IO counter blk-iolatency: fix IO hang due to negative inflight counter MAINTAINERS: unify reference to xen-devel list x86/mm/cpa: Fix set_mce_nospec() futex: Handle early deadlock return correctly futex: Fix barrier comment net: dsa: b53: Fix for failure when irq is not defined in dt blktrace: Show requests without sector mips: cm: reprime error cause mips: loongson64: remove unreachable(), fix loongson_poweroff(). sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach() geneve: should not call rt6_lookup() when ipv6 was disabled KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221) KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) signal: Better detection of synchronous signals ...	2019-02-15 08:43:59 -07:00
Keith Busch	4726bcf30f	nvme-pci: add missing unlock for reset error The reset work holds a mutex to prevent races with removal modifying the same resources, but was unlocking only on success. Unlock on failure too. Fixes: `5c959d73db` ("nvme-pci: fix rapid add remove sequence") Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-12 09:29:07 +01:00
Keith Busch	5c959d73db	nvme-pci: fix rapid add remove sequence A surprise removal may fail to tear down request queues if it is racing with the initial asynchronous probe. If that happens, the remove path won't see the queue resources to tear down, and the controller reset path may create a new request queue on a removed device, but will not be able to make forward progress, deadlocking the pci removal. Protect setting up non-blocking resources from a shutdown by holding the same mutex, and transition to the CONNECTING state after these resources are initialized so the probe path may see the dead controller state before dispatching new IO. Link: https://bugzilla.kernel.org/show_bug.cgi?id=202081 Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-06 16:35:33 +01:00
Keith Busch	e7ad43c3ed	nvme: lock NS list changes while handling command effects If a controller supports the NS Change Notification, the namespace scan_work is automatically triggered after attaching a new namespace. Occasionally the namespace scan_work may append the new namespace to the list before the admin command effects handling is completed. The effects handling unfreezes namespaces, but if it unfreezes the newly attached namespace, its request_queue freeze depth will be off and we'll hit the warning in blk_mq_unfreeze_queue(). On the next namespace add, we will fail to freeze that queue due to the previous bad accounting and deadlock waiting for frozen. Fix that by preventing scan work from altering the namespace list while command effects handling needs to pair freeze with unfreeze. Reported-by: Wen Xiong <wenxiong@us.ibm.com> Tested-by: Wen Xiong <wenxiong@us.ibm.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-06 16:35:33 +01:00
Sagi Grimberg	794a4cb3d2	nvme: remove the .stop_ctrl callout It is used now just to flush error recovery and reconnect work items in the RDMA and TCP transports, which can simply be moved to the corresponding teardown routines. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-04 15:41:25 +01:00
Chaitanya Kulkarni	6e02318eae	nvme: add support for the Write Zeroes command Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block device, if the device supports an optional command bit set for write zeroes. Add support to setup write zeroes command. Set maximum possible write zeroes sectors in one write zeroes command according to nvme write zeroes command definition. This patch was posted as a part of block-write-zeroes support implementation (https://patchwork.kernel.org/patch/9454859/), but did not make into mainline kernel as it got reverted due to failure on the Linus's machine. In this patch in order to be more cautious, we use NVMe controller's maximum hardware sector size which is calculated based on the controller's MDTS (Maximum Data Transfer Size) field to calculate the maximum sectors for the write zeroes request. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [folded a fix from Keith Busch to properly respect NVME_QUIRK_DEALLOCATE_ZEROES] Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-02-04 15:41:25 +01:00

1 2 3 4 5 ...

1243 Commits