linux

Commit Graph

Author	SHA1	Message	Date
Christoph Hellwig	a07b4970f4	nvmet: add a generic NVMe target This patch introduces a implementation of NVMe subsystems, controllers and discovery service which allows to export NVMe namespaces across fabrics such as Ethernet, FC etc. The implementation conforms to the NVMe 1.2.1 specification and interoperates with NVMe over fabrics host implementations. Configuration works using configfs, and is best performed using the nvmetcli tool from http://git.infradead.org/users/hch/nvmetcli.git, which also has a detailed explanation of the required steps in the README file. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com> Signed-off-by: Anthony Knapp <anthony.j.knapp@intel.com> Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:30:33 -06:00
Sagi Grimberg	9645c1a233	block: Export blk_poll The new NVMe over fabrics target will make use of this outside from a module. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:30:31 -06:00
Sagi Grimberg	038bd4cb67	nvme: add keep-alive support Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and optional in NVMe 1.2.1 for PCIe. This patch adds periodic keep-alive sent from the host to verify that the controller is still responsive and vice-versa. The keep-alive timeout is user-defined (with keep_alive_tmo connection parameter) and defaults to 5 seconds. In order to avoid a race condition where the host sends a keep-alive competing with the target side keep-alive timeout expiration, the host adds a grace period of 10 seconds when publishing the keep-alive timeout to the target. In case a keep-alive failed (or timed out), a transport specific error recovery kicks in. For now only NVMe over Fabrics is wired up to support keep alive, but we can add PCIe support easily once controllers actually supporting it become available. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Steve Wise <swise@chelsio.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:20 -06:00
Sagi Grimberg	7b89eae29e	nvme.h: Add keep-alive opcode and identify controller attribute KAS: keep-alive support and granularity of kato in units of 100 ms nvme_admin_keep_alive opcode: 0x18 Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:18 -06:00
Christoph Hellwig	07bfcd09a2	nvme-fabrics: add a generic NVMe over Fabrics library The NVMe over Fabrics library provides an interface for both transports and the nvme core to handle fabrics specific commands and attributes independent of the underlying transport. In addition, the fabrics library adds a misc device interface that allow actually creating a fabrics controller, as we can't just autodiscover it like in the PCI case. The nvme-cli utility has been enhanced to use this interface to support fabric connect and discovery. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>, Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>, Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:16 -06:00
Christoph Hellwig	eb793e2c92	nvme.h: add NVMe over Fabrics definitions The NVMe over Fabrics specification defines a protocol interface and related extensions to NVMe that enable operation over network protocols. The NVMe over Fabrics specification has an NVMe Transport binding for each NVMe Transport. This patch adds the fabrics related definitions: - fabric specific command set and error codes - transport addressing and binding definitions - fabrics sgl extensions - controller identification fabrics enhancements - discovery log page definition Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:14 -06:00
Ming Lin	1a353d85b0	nvme: add fabrics sysfs attributes - delete_controller: This attribute allows to delete a controller. A driver is not obligated to support it (pci doesn't) so it is created only if the driver supports it. The new fabrics drivers will support it (essentialy a disconnect operation). Usage: echo > /sys/class/nvme/nvme0/delete_controller - subsysnqn: This attribute shows the subsystem nqn of the configured device. If a driver does not implement the get_subsysnqn method, the file will not appear in sysfs. - transport: This attribute shows the transport name. Added a "name" field to struct nvme_ctrl_ops. For loop, cat /sys/class/nvme/nvme0/transport loop For RDMA, cat /sys/class/nvme/nvme0/transport rdma For PCIe, cat /sys/class/nvme/nvme0/transport pcie - address: This attributes shows the controller address. The fabrics drivers that will implement get_address can show the address of the connected controller. example: cat /sys/class/nvme/nvme0/address traddr=192.168.2.2,trsvcid=1023 Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:12 -06:00
Christoph Hellwig	eb71f43557	nvme: Modify and export sync command submission for fabrics NVMe over fabrics will use __nvme_submit_sync_cmd in the the transport and require a few tweaks to it. For that we export it and add a few more paramters: 1. allow passing a queue ID to the block layer For the NVMe over Fabrics connect command we need to able to specify a queue ID that we want to send the command on. Add a qid parameter to the relevant functions to enable this behavior. 2. allow submitting at_head commands In cases where we want to (re)connect to a controller where we have inflight queued commands we want to first connect and only then allow the other queued commands to be kicked. This will prevents failures in controller resets and reconnects. 3. allow passing flags to blk_mq_allocate_request Both for Fabrics connect the the keep-alive feature in NVMe 1.2.1 we want to be able to use reserved requests. Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:11 -06:00
Christoph Hellwig	7d2e80080d	nvme: allow transitioning from NEW to LIVE state For Fabrics we're not going through an intermediate reset state (at least for now). Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:09 -06:00
Ming Lin	1f5bd336b9	blk-mq: add blk_mq_alloc_request_hctx For some protocols like NVMe over Fabrics we need to be able to send initialization commands to a specific queue. Based on an earlier patch from Christoph Hellwig <hch@lst.de>. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> [hch: disallow sleeping allocation, req_op fixes] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:07 -06:00
Bartlomiej Zolnierkiewicz	9e2d23f19e	mg_disk: fix error path in mg_probe() MG_DISK_MAJ is defined as 0 so dynamic block major number allocation is used by the driver and the assigned major number is stored in host->major. This patch fixes error path in mg_probe() to use host->major instead of using MG_DISK_MAJ. Cc: unsik Kim <donari75@gmail.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-28 11:01:27 -06:00
Lars Ellenberg	1b57e66384	drbd: correctly handle failed crypto_alloc_hash crypto_alloc_hash returns an ERR_PTR(), not NULL. Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash() on an errno in the cleanup path. Reported-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:08 -06:00
Lars Ellenberg	27ea1d876e	drbd: al_write_transaction: skip re-scanning of bitmap page pointer array For larger devices, the array of bitmap page pointers can grow very large (8000 pointers per TB of storage). For each activity log transaction, we need to flush the associated bitmap pages to stable storage. Currently, we just "mark" the respective pages while setting up the transaction, then tell the bitmap code to write out all marked pages, but skip unchanged pages. But one such transaction can affect only a small number of bitmap pages, there is no need to scan the full array of several (ten-)thousand page pointers to find the few marked ones. Instead, remember the index numbers of the few affected pages, and later only re-check those to skip duplicates and unchanged ones. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:08 -06:00
Lars Ellenberg	13c2088d41	drbd: finally report ms, not jiffies, in log message Also skip the message unless bitmap IO took longer than 5 ms. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:08 -06:00
Roland Kammerer	4e526a0046	drbd: get rid of empty statement in is_valid_state This should silence a warning about an empty statement. Thanks to Fabian Frederick <fabf@skynet.be> who sent a patch I modified to be smaller and avoids an additional indent level. Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Fabian Frederick	7e5fec3168	drbd: code cleanups without semantic changes This contains various cosmetic fixes ranging from simple typos to const-ifying, and using booleans properly. Original commit messages from Fabian's patch set: drbd: debugfs: constify drbd_version_fops drbd: use seq_put instead of seq_print where possible drbd: include linux/uaccess.h instead of asm/uaccess.h drbd: use const char * const for drbd strings drbd: kerneldoc warning fix in w_e_end_data_req() drbd: use unsigned for one bit fields drbd: use bool for peer is_ states drbd: fix typo drbd: use \| for bitmask combination drbd: use true/false for bool drbd: fix drbd_bm_init() comments drbd: introduce peer state union drbd: fix maybe_pull_ahead() locking comments drbd: use bool for growing drbd: remove redundant declarations drbd: replace if/BUG by BUG_ON Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	20004e2435	drbd: bump current uuid when resuming IO with diskless peer Scenario, starting with normal operation Connected Primary/Secondary UpToDate/UpToDate NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen) ... more failures happen, secondary loses it's disk, but eventually is able to re-establish the replication link ... Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!) We used to just resume/resent suspended requests, without bumping the UUID. Which will lead to problems later, when we want to re-attach the disk on the peer, without first disconnecting, or if we experience additional failures, because we now have diverging data without being able to recognize it. Make sure we also bump the current data generation UUID, if we notice "peer disk unknown" -> "peer disk known bad". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	31d646042d	drbd: disallow promotion during resync handshake, avoid deadlock and hard reset We already serialize connection state changes, and other, non-connection state changes (role changes) while we are establishing a connection. But if we have an established connection, then trigger a resync handshake (by primary --force or similar), until now we just had to be "lucky". Consider this sequence (e.g. deployment scenario): create-md; up; -> Connected Secondary/Secondary Inconsistent/Inconsistent then do a racy primary --force on both peers. block drbd0: drbd_sync_handshake: block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent ) block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate ) * HERE things go wrong. * block drbd0: role( Secondary -> Primary ) block drbd0: drbd_sync_handshake: block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: Becoming sync target due to disk states. block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake. block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s) drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) The problem here is that the local promotion happens before the sync handshake triggered by the remote promotion was completed. Some assumptions elsewhere become wrong, and when the expected resync handshake is then received and processed, we get stuck in a deadlock, which can only be recovered by reboot :-( Fix: if we know the peer has good data, and our own disk is present, but NOT good, and there is no resync going on yet, we expect a sync handshake to happen "soon". So reject a racy promotion with SS_IN_TRANSIENT_STATE. Result: ... as above ... block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate ) * local promotion being postponed until ... * block drbd0: drbd_sync_handshake: block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0 block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0 ... block drbd0: conn( WFBitMapT -> WFSyncUUID ) block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000 block drbd0: conn( WFSyncUUID -> SyncTarget ) * ... after the resync handshake * block drbd0: role( Secondary -> Primary ) Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	f2d3d75b66	drbd: sync_handshake: handle identical uuids with current (frozen) Primary If in a two-primary scenario, we lost our peer, freeze IO, and are still frozen (no UUID rotation) when the peer comes back as Secondary after a hard crash, we will see identical UUIDs. The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as arbitration, but that would cause the still running (but frozen) Primary to become SyncTarget (which it typically refuses), and the handshake is declined. Fix: check current roles. If we have one current primary, the Primary wins. (rule_nr = 41) Since that is a protocol change, use the newly introduced DRBD_FF_WSAME to determine if rule_nr = 41 can be applied. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	9104d31a75	drbd: introduce WRITE_SAME support We will support WRITE_SAME, if * all peers support WRITE_SAME (both in kernel and DRBD version), * all peer devices support WRITE_SAME * logical_block_size is identical on all peers. We may at some point introduce a fallback on the receiving side for devices/kernels that do not support WRITE_SAME, by open-coding a submit loop. But not yet. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:07 -06:00
Lars Ellenberg	60bac04012	drbd: report sizes if rejecting too small peer disk Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	65f5be3579	drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0 Even if discard_zeroes_data != 0, if discard_zeroes_if_aligned is set, we assume we can reliably zero-out/discard using the drbd_issue_peer_discard() helper. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	af61494ad4	drbd: only restart frozen disk io when D_UP_TO_DATE When re-attaching the local backend device to a C_STANDALONE D_DISKLESS R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the backend that is being attached as D_UP_TO_DATE. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	0ead5cca3d	drbd: if there is no good data accessible, writes should be IO errors If DRBD lost all path to good data, and the on-no-data-accessible policy is OND_SUSPEND_IO, all pending and new IO requests are suspended (will block). If that setting is OND_IO_ERROR, IO will still be completed. READ to "clean" areas (e.g. on an D_INCONSISTENT device, and bitmap indicates a block is already in sync) will succeed. READ to "unclean" areas (bitmap indicates block is out-of-sync), will return EIO. If we are already D_DISKLESS (or D_FAILED), we also return EIO. Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT, after replication link loss, new WRITE requests still went through OK. The would also set the "out-of-sync" bit on their way, so READ after WRITE would still return EIO. Also, the data generation UUIDs had not been bumped, we would cause data divergence, without being able to detect it on the next sync handshake, given the right sequence of events in a multiple error scenario and "improper" order of recovery actions. The right thing to do is to return EIO for all new writes, unless we have access to good, current, D_UP_TO_DATE data. The "established best practices" way to avoid these situations in the first place is to set OND_SUSPEND_IO, or even do a hard-reset from the pri-on-incon-degr policy helper hook. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	7bd000cb0c	drbd: don't forget error completion when "unsuspending" IO Possibly sequence of events: SyncTarget is made Primary, then loses replication link (only path to good data on SyncSource). Behavior is then controlled by the on-no-data-accessible policy, which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO). If OND_IO_ERROR is in fact the current policy, we clear the susp_fen (IO suspended due to fencing policy) flag, do NOT set the susp_nod (IO suspended due to no data) flag. But we forgot to call the IO error completion for all pending, suspended, requests. While at it, also add a race check for a theoretically possible race with a new handshake (network hickup), we may be able to re-send requests, and can avoid passing IO errors up the stack. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	26a96110ab	drbd: introduce unfence-peer handler When resync is finished, we already call the "after-resync-target" handler (on the former sync target, obviously), once per volume. Paired with the before-resync-target handler, you can create snapshots, before the resync causes the volumes to become inconsistent, and discard those snapshots again, once they are no longer needed. It was also overloaded to be paired with the "fence-peer" handler, to "unfence" once the volumes are up-to-date and known good. This has some disadvantages, though: we call "fence-peer" for the whole connection (once for the group of volumes), but would call unfence as side-effect of after-resync-target once for each volume. Also, we fence on a (current, or about to become) Primary, which will later become the sync-source. Calling unfence only as a side effect of the after-resync-target handler opens a race window, between a new fence on the Primary (SyncTarget) and the unfence on the SyncTarget, which is difficult to close without some kind of "cluster wide lock" in those handlers. We would not need those handlers if we could still communicate. Which makes trying to aquire a cluster wide lock from those handlers seem like a very bad idea. This introduces the "unfence-peer" handler, which will be called per connection (once for the group of volumes), just like the fence handler, only once all volumes are back in sync, and on the SyncSource. Which is expected to be the node that previously called "fence", the node that is currently allowed to be Primary, and thus the only node that could trigger a new "fence" that could race with this unfence. Which makes us not need any cluster wide synchronization here, serializing two scripts running on the same node is trivial. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:06 -06:00
Lars Ellenberg	5052fee2c7	drbd: finish resync on sync source only by notification from sync target If the replication link breaks exactly during "resync finished" detection, finishing too early on the sync source could again lead to UUIDs rotated too fast, and potentially a spurious full resync on next handshake. Always wait for explicit resync finished state change notification from the sync target. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:05 -06:00
Lars Ellenberg	505675f96c	drbd: allow larger max_discard_sectors Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION) al-extents available, and allow up to half of that to be discarded in one bio. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:05 -06:00
Lars Ellenberg	7435e9018f	drbd: zero-out partial unaligned discards on local backend For consistency, also zero-out partial unaligned chunks of discard requests on the local backend. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:05 -06:00
Lars Ellenberg	69ba1ee936	drbd: possibly disable discard support, if backend has discard_zeroes_data=0 Now that we have the discard_zeroes_if_aligned setting, we should also check it when setting up our queue parameters on the primary, not only on the receiving side. We announce discard support, UNLESS * we are connected to a peer that does not support TRIM on the DRBD protocol level. Otherwise, it would either discard, or do a fallback to zero-out, depending on its backend and configuration. * our local backend does not support discards, or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:05 -06:00
Lars Ellenberg	dd4f699da6	drbd: when receiving P_TRIM, zero-out partial unaligned chunks We can avoid spurious data divergence caused by partially-ignored discards on certain backends with discard_zeroes_data=0, if we translate partial unaligned discard requests into explicit zero-out. The relevant use case is LVM/DM thin. If on different nodes, DRBD is backed by devices with differing discard characteristics, discards may lead to data divergence (old data or garbage left over on one backend, zeroes due to unmapped areas on the other backend). Online verify would now potentially report tons of spurious differences. While probably harmless for most use cases (fstrim on a file system), DRBD cannot have that, it would violate our promise to upper layers that our data instances on the nodes are identical. To be correct and play safe (make sure data is identical on both copies), we would have to disable discard support, if our local backend (on a Primary) does not support "discard_zeroes_data=true". We'd also have to translate discards to explicit zero-out on the receiving (typically: Secondary) side, unless the receiving side supports "discard_zeroes_data=true". Which both would allocate those blocks, instead of unmapping them, in contrast with expectations. LVM/DM thin does set discard_zeroes_data=0, because it silently ignores discards to partial chunks. We can work around this by checking the alignment first. For unaligned (wrt. alignment and granularity) or too small discards, we zero-out the initial (and/or) trailing unaligned partial chunks, but discard all the aligned full chunks. At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1". Arguably it should behave this way internally, by default, and we'll try to make that happen. But our workaround is still valid for already deployed setups, and for other devices that may behave this way. Setting discard-zeroes-if-aligned=yes will allow DRBD to use discards, and to announce discard_zeroes_data=true, even on backends that announce discard_zeroes_data=false. Setting discard-zeroes-if-aligned=no will cause DRBD to always fall-back to zero-out on the receiving side, and to not even announce discard capabilities on the Primary, if the respective backend announces discard_zeroes_data=false. We used to ignore the discard_zeroes_data setting completely. To not break established and expected behaviour, and suddenly cause fstrim on thin-provisioned LVs to run out-of-space, instead of freeing up space, the default value is "yes". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:05 -06:00
Lars Ellenberg	f9ff0da564	drbd: allow parallel flushes for multi-volume resources To maintain write-order fidelity accros all volumes in a DRBD resource, the receiver of a P_BARRIER needs to issue flushes to all volumes. We used to do this by calling blkdev_issue_flush(), synchronously, one volume at a time. We now submit all flushes to all volumes in parallel, then wait for all completions, to reduce worst-case latencies on multi-volume resources. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:05 -06:00
Lars Ellenberg	0982368bfd	drbd: fix for truncated minor number in callback command line The command line parameter the kernel module uses to communicate the device minor to userland helper is flawed in a way that the device indentifier "minor-%d" is being truncated to minors with a maximum of 5 digits. But DRBD 8.4 allows 2^20 == 1048576 minors, thus a minimum of 7 digits must be supported. Reported by Veit Wahlich on drbd-dev. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:04 -06:00
Lars Ellenberg	1b228c98ce	drbd: fix regression: protocol A sometimes synchronous, C sometimes double-latency Regression introduced with 8.4.5 drbd: application writes may set-in-sync in protocol != C Overwriting the same block (LBA) while a former version is still "in-flight" to the peer (to be exact: we did not receive the P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that former version to be acknowledged by the peer. In synchronous and quasi-synchronous protocols C and B, this may double the latency on overwrites. With protocol A, which is supposed to be asynchronous and only wait for local completion, it is even worse: it would make overwrites quasi-synchronous, they would be hit by the full RTT, which protocol A was specifically meant to avoid, and possibly the additional time it takes to drain the buffers first. Particularly bad for databases, or anything else that does frequent updates to the same blocks (various file system meta data). No impact if >= rtt passes between updates to the same block. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:04 -06:00
Lars Ellenberg	bca1cbaeac	drbd: adjust assert in w_bitmap_io to account for BM_LOCKED_CHANGE_ALLOWED Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:04 -06:00
Philipp Reisner	92d94ae66a	drbd: Create the protocol feature THIN_RESYNC If thinly provisioned volumes are used, during a resync the sync source tries to find out if a block is deallocated. If it is deallocated, then the resync target uses block_dev_issue_zeroout() on the range in question. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:04 -06:00
Philipp Reisner	a5ca66c419	drbd: Introduce new disk config option rs-discard-granularity As long as the value is 0 the feature is disabled. With setting it to a positive value, DRBD limits and aligns its resync requests to the rs-discard-granularity setting. If the sync source detects all zeros in such a block, the resync target discards the range on disk. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:04 -06:00
Philipp Reisner	700ca8c04a	drbd: Implement handling of thinly provisioned storage on resync target nodes If during resync we read only zeroes for a range of sectors assume that these secotors can be discarded on the sync target node. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:04 -06:00
Philipp Reisner	c5c2385481	drbd: Kill code duplication Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:03 -06:00
Lars Ellenberg	be115b69f1	drbd: change bitmap write-out when leaving resync states When leaving resync states because of disconnect, do the bitmap write-out synchronously in the drbd_disconnected() path. When leaving resync states because we go back to AHEAD/BEHIND, or because resync actually finished, or some disk was lost during resync, trigger the write-out from after_state_ch(). The bitmap write-out for resync -> ahead/behind was missing completely before. Note that this is all only an optimization to avoid double-resyncs of already completed blocks in case this node crashes. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:03 -06:00
Lars Ellenberg	c0065f98d5	drbd: bitmap bulk IO: do not always suspend IO The intention was to only suspend IO if some normal bitmap operation is supposed to be locked out, not always. If the bulk operation is flaged as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-13 21:43:03 -06:00
Christoph Hellwig	f5fa90dc0a	nvme: move the workaround for I/O queue-less controllers from PCIe to core We want to apply this to Fabrics drivers as well, so move it to common code. Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Christoph Hellwig	7a5abb4b48	nvme: factor out a add nvme_is_write helper Centralize the check if a given NVMe command reads or writes data. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Christoph Hellwig	a229dbf61e	nvme: allow for size limitations from transport drivers Some transport drivers may have a lower transfer size than the controller. So allow the transport to set it in the controller max_hw_sectors. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
James Smart	3972be23bd	nvme.h: add constants for PSDT and FUSE values Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Christoph Hellwig	79f370eac6	nvme.h: add AER constants Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Christoph Hellwig	69cd27e251	nvme.h: add NVM command set SQE/CQE size defines Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Armen Baloyan	725b358836	nvme.h: Add get_log_page command strucure Add get_log_page command structure and a corresponding entry in nvme_command union Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed--by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Christoph Hellwig	14e974a84e	nvme.h: add RTD3R, RTD3E and OAES fields These have been added in NVMe 1.2 and we'll need at least oaes for the NVMe target driver. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Bhaktipriya Shridhar	81baf90af2	bcache: Remove deprecated create_workqueue alloc_workqueue replaces deprecated create_workqueue(). Dedicated workqueues have been used since bcache_wq and moving_gc_wq are workqueues for writes and are being used on a memory reclaim path. WQ_MEM_RECLAIM has been set to ensure forward progress under memory pressure. Since there are only a fixed number of work items, explicit concurrency limit is unnecessary here. Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-11 20:03:04 -06:00

1 2 3 4 5 ...

602041 Commits All Branches Search

602041 Commits

All Branches