Commit Graph

602059 Commits

Author SHA1 Message Date
Matias Bjørling 8c39eddbf2 lightnvm: remove unused lists from struct rrpc_block
The ->list, ->open_list, and ->closed_list lists were previously used
for statistics. However, their usage have been removed, and thus these
can safely be removed.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Matias Bjørling 5cd907853d lightnvm: remove nested lock conflict with mm
If a media manager tries to initialize it targets upon media manager
initialization, the media manager will need to know which target types
are available in LightNVM. The lists of which managers and target types
are available shares the same lock.

Therefore, on initialization, the nvm_lock is taken by LightNVM core,
which later leads to a deadlock when target types are enumerated by the
media manager.

Add an exclusive lock for target types to resolve this conflict.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Matias Bjørling b76eb20bb0 lightnvm: move target mgmt into media mgr
To enable persistent block management to easily control creation and
removal of targets, we move target management into the media
manager. The LightNVM core continues to maintain which target types are
registered, while the media manager now keeps track of its initialized
targets.

Two new callbacks for the media manager are introduced. create_tgt and
remove_tgt. Note that remove_tgt returns 0 on successfully removing a
target, and returns 1 if the target was not found.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Matias Bjørling 5e60edb7dc lightnvm: rename gennvm and update description
The generic manager should be called the general media manager, and
instead of using the rather long name of "gennvm" in front of each data
structures, use "gen" instead to shorten it. Update the description of
the media manager as well to make the media manager purpose clearer.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Matias Bjørling 077d238999 lightnvm: remove open/close statistics for gennvm
The responsibility of the media manager is not to keep track of
open/closed blocks. This is better maintained within a target,
that already manages this information on writes.

Remove the statistics and merge the states NVM_BLK_ST_OPEN and
NVM_BLK_ST_CLOSED.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Matias Bjørling 12624af26e lightnvm: fix checkpatch terse errors
A couple of small checkpatch fixups to stop it from complaining.

./drivers/lightnvm/core.c:360: WARNING: line over 80 characters
./drivers/lightnvm/core.c:360: ERROR: trailing statements should be on
next line
./drivers/lightnvm/core.c:503: WARNING: Block comments use a trailing */
on a separate line

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Matias Bjørling 5114e2773a lightnvm: remove checkpatch warning for unsigned ints
Checkpatch found two incidents where the type was preferred to be
written out in full.

./drivers/lightnvm/rrpc.h:184: WARNING: Prefer 'unsigned int' to bare
use of 'unsigned'
./drivers/lightnvm/rrpc.h:209: WARNING: Prefer 'unsigned int' to bare
use of 'unsigned'
./drivers/lightnvm/rrpc.c:51: WARNING: Prefer 'unsigned int' to bare use
of 'unsigned'

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Johannes Thumshirn 58eaaf9b6c lightnvm: Make functions not used by ouside static
Mark functions not used by ouside of thier implementing file as static.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Johannes Thumshirn 6f92970219 nvme: lightnvm: make MLC num_pairs little endian
According to the OpenChannel SSD interface specification the NAND flash
MLC page pairing information's number of page page pairings field is the
first two bytes in the MLC Page Pairing data structure. The hardware's
data structure itself is little endian so annotate it as such, like the
rest of lighnvm's data structures.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Javier González 5389a1dfb3 lightnvm: initialize ppa_addr in dev_to_generic_addr()
The ->reserved bit is not initialized when allocated on stack.
This may lead targets to misinterpret the PPA as cached.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Javier González 529435e818 lightnvm: add media manager mark_blk helper
Expose media manager mark_blk() to targets, as done for the rest of the
media manager callback functions.

Signed-off-by: Javier González <javier@cnexlabs.com>
Updated description
Signed-off-by: Matias Bjørling <m@bjorling.me>

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Wenwei Tao 0de2415bb7 lightnvm: break the loop when rqd is not null
Break the loop when rqd is not null to reduce
an unnecessary schedule.

Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:51:52 -06:00
Dan Carpenter f98d9ca17f nvmet: fix an error code
We accidentally return zero here when ERR_PTR(-ENOMEM) is intended.

Fixes: a07b4970f4 ('nvmet: add a generic NVMe target')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:37:36 -06:00
Arnd Bergmann 1fb4704084 nvme-loop: add configfs dependency
CONFIG_NVME_TARGET has a correct CONFIG_CONFIGFS_FS dependency, but the
newly added NVME_TARGET_LOOP is missing this, resulting in a link
failure:

drivers/nvme/built-in.o: In function `nvmet_init_configfs':
loop.c:(.init.text+0x2a0): undefined reference to `config_group_init'
loop.c:(.init.text+0x2c0): undefined reference to `config_group_init_type_name'
loop.c:(.init.text+0x318): undefined reference to `configfs_register_subsystem'
drivers/nvme/built-in.o: In function `nvmet_exit_configfs':
loop.c:(.exit.text+0x9c): undefined reference to `configfs_unregister_subsystem'

This adds the same dependency here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 3a85a5de29 ("nvme-loop: add a NVMe loopback host driver")
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-07 08:34:45 -06:00
Yijing Wang 89b920e003 bcache: Remove redundant block_size assignment
We have assigned sb->block_size before the switch,
so remove the redundant one.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Eric Wheeler <bcache@lists.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:50 -06:00
Yijing Wang 7abc70d700 bcache: update document info
There is no return in continue_at(), update the documentation.

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:49 -06:00
Yijing Wang c50d4d5dd3 bcache: Remove redundant parameter for cache_alloc()
Cache_sb is not used in cache_alloc, and we have copied
sb info to cache->sb already, remove it.

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:34:47 -06:00
Christoph Hellwig 3a85a5de29 nvme-loop: add a NVMe loopback host driver
This patch implements adds nvme-loop which allows to access local devices
exported as NVMe over Fabrics namespaces. This module can be useful for
easy evaluation, testing and also feature experimentation.

To createa nvme-loop device you need to configure the NVMe target to
export a loop port (see the nvmetcli documentaton for that) and then
connect to it using

	nvme connect-all -t loop

which requires the very latest nvme-cli version with Fabrics support.

Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:30:36 -06:00
Christoph Hellwig a07b4970f4 nvmet: add a generic NVMe target
This patch introduces a implementation of NVMe subsystems,
controllers and discovery service which allows to export
NVMe namespaces across fabrics such as Ethernet, FC etc.

The implementation conforms to the NVMe 1.2.1 specification
and interoperates with NVMe over fabrics host implementations.

Configuration works using configfs, and is best performed using
the nvmetcli tool from http://git.infradead.org/users/hch/nvmetcli.git,
which also has a detailed explanation of the required steps in the
README file.

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>
Signed-off-by: Anthony Knapp <anthony.j.knapp@intel.com>
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:30:33 -06:00
Sagi Grimberg 9645c1a233 block: Export blk_poll
The new NVMe over fabrics target will make use of this outside from a
module.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:30:31 -06:00
Sagi Grimberg 038bd4cb67 nvme: add keep-alive support
Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and
optional in NVMe 1.2.1 for PCIe.  This patch adds periodic keep-alive
sent from the host to verify that the controller is still responsive
and vice-versa.  The keep-alive timeout is user-defined (with
keep_alive_tmo connection parameter) and defaults to 5 seconds.

In order to avoid a race condition where the host sends a keep-alive
competing with the target side keep-alive timeout expiration, the host
adds a grace period of 10 seconds when publishing the keep-alive timeout
to the target.

In case a keep-alive failed (or timed out), a transport specific error
recovery kicks in.

For now only NVMe over Fabrics is wired up to support keep alive, but
we can add PCIe support easily once controllers actually supporting it
become available.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@chelsio.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:20 -06:00
Sagi Grimberg 7b89eae29e nvme.h: Add keep-alive opcode and identify controller attribute
KAS: keep-alive support and granularity of kato in units of 100 ms
nvme_admin_keep_alive opcode: 0x18

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:18 -06:00
Christoph Hellwig 07bfcd09a2 nvme-fabrics: add a generic NVMe over Fabrics library
The NVMe over Fabrics library provides an interface for both transports
and the nvme core to handle fabrics specific commands and attributes
independent of the underlying transport.

In addition, the fabrics library adds a misc device interface that allow
actually creating a fabrics controller, as we can't just autodiscover
it like in the PCI case.  The nvme-cli utility has been enhanced to use
this interface to support fabric connect and discovery.

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>,
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>,
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:16 -06:00
Christoph Hellwig eb793e2c92 nvme.h: add NVMe over Fabrics definitions
The NVMe over Fabrics specification defines a protocol interface and
related extensions to NVMe that enable operation over network protocols.
The NVMe over Fabrics specification has an NVMe Transport binding for
each NVMe Transport.

This patch adds the fabrics related definitions:
- fabric specific command set and error codes
- transport addressing and binding definitions
- fabrics sgl extensions
- controller identification fabrics enhancements
- discovery log page definition

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:14 -06:00
Ming Lin 1a353d85b0 nvme: add fabrics sysfs attributes
- delete_controller: This attribute allows to delete a controller.
  A driver is not obligated to support it (pci doesn't) so it is
  created only if the driver supports it. The new fabrics drivers
  will support it (essentialy a disconnect operation).

  Usage:
  echo > /sys/class/nvme/nvme0/delete_controller

- subsysnqn: This attribute shows the subsystem nqn of the configured
  device. If a driver does not implement the get_subsysnqn method, the
  file will not appear in sysfs.

- transport: This attribute shows the transport name. Added a "name"
  field to struct nvme_ctrl_ops.

  For loop,
  cat /sys/class/nvme/nvme0/transport
  loop

  For RDMA,
  cat /sys/class/nvme/nvme0/transport
  rdma

  For PCIe,
  cat /sys/class/nvme/nvme0/transport
  pcie

- address: This attributes shows the controller address. The fabrics
  drivers that will implement get_address can show the address of the
  connected controller.

  example:
  cat /sys/class/nvme/nvme0/address
  traddr=192.168.2.2,trsvcid=1023

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:12 -06:00
Christoph Hellwig eb71f43557 nvme: Modify and export sync command submission for fabrics
NVMe over fabrics will use __nvme_submit_sync_cmd in the the
transport and require a few tweaks to it.  For that we export it
and add a few more paramters:

1. allow passing a queue ID to the block layer

   For the NVMe over Fabrics connect command we need to able to specify a
   queue ID that we want to send the command on.  Add a qid parameter to
   the relevant functions to enable this behavior.

2. allow submitting at_head commands

   In cases where we want to (re)connect to a controller
   where we have inflight queued commands we want to first
   connect and only then allow the other queued commands to
   be kicked. This will prevents failures in controller resets
   and reconnects.

3. allow passing flags to blk_mq_allocate_request

   Both for Fabrics connect the the keep-alive feature in NVMe 1.2.1 we
   want to be able to use reserved requests.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:11 -06:00
Christoph Hellwig 7d2e80080d nvme: allow transitioning from NEW to LIVE state
For Fabrics we're not going through an intermediate reset state
(at least for now).

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:09 -06:00
Ming Lin 1f5bd336b9 blk-mq: add blk_mq_alloc_request_hctx
For some protocols like NVMe over Fabrics we need to be able to send
initialization commands to a specific queue.

Based on an earlier patch from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
[hch: disallow sleeping allocation, req_op fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:07 -06:00
Bartlomiej Zolnierkiewicz 9e2d23f19e mg_disk: fix error path in mg_probe()
MG_DISK_MAJ is defined as 0 so dynamic block major number
allocation is used by the driver and the assigned major
number is stored in host->major.  This patch fixes error
path in mg_probe() to use host->major instead of using
MG_DISK_MAJ.

Cc: unsik Kim <donari75@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-28 11:01:27 -06:00
Lars Ellenberg 1b57e66384 drbd: correctly handle failed crypto_alloc_hash
crypto_alloc_hash returns an ERR_PTR(), not NULL.

Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.

Reported-by: Insu Yun <wuninsu@gmail.com>

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:08 -06:00
Lars Ellenberg 27ea1d876e drbd: al_write_transaction: skip re-scanning of bitmap page pointer array
For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).

For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.

But one such transaction can affect only a small number of bitmap pages,
there is no need to scan the full array of several (ten-)thousand
page pointers to find the few marked ones.

Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:08 -06:00
Lars Ellenberg 13c2088d41 drbd: finally report ms, not jiffies, in log message
Also skip the message unless bitmap IO took longer than 5 ms.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:08 -06:00
Roland Kammerer 4e526a0046 drbd: get rid of empty statement in is_valid_state
This should silence a warning about an empty statement. Thanks to Fabian
Frederick <fabf@skynet.be> who sent a patch I modified to be smaller and
avoids an additional indent level.

Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Fabian Frederick 7e5fec3168 drbd: code cleanups without semantic changes
This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.

Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 20004e2435 drbd: bump current uuid when resuming IO with diskless peer
Scenario, starting with normal operation
 Connected Primary/Secondary UpToDate/UpToDate
 NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
 ... more failures happen, secondary loses it's disk,
 but eventually is able to re-establish the replication link ...
 Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)

We used to just resume/resent suspended requests,
without bumping the UUID.

Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.

Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 31d646042d drbd: disallow promotion during resync handshake, avoid deadlock and hard reset
We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.

But if we have an established connection,
then trigger a resync handshake (by primary --force or similar),
until now we just had to be "lucky".

Consider this sequence (e.g. deployment scenario):
create-md; up;
  -> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.

 block drbd0: drbd_sync_handshake:
 block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** HERE things go wrong. ***
 block drbd0: role( Secondary -> Primary )
 block drbd0: drbd_sync_handshake:
 block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: Becoming sync target due to disk states.
 block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
 block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
 drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed.  Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(

Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.

Result:
 ... as above ...
 block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
  *** local promotion being postponed until ... ***
 block drbd0: drbd_sync_handshake:
 block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
 block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
  ...
 block drbd0: conn( WFBitMapT -> WFSyncUUID )
 block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
 block drbd0: conn( WFSyncUUID -> SyncTarget )
  *** ... after the resync handshake ***
 block drbd0: role( Secondary -> Primary )

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg f2d3d75b66 drbd: sync_handshake: handle identical uuids with current (frozen) Primary
If in a two-primary scenario, we lost our peer, freeze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.

The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.

Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)

Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 9104d31a75 drbd: introduce WRITE_SAME support
We will support WRITE_SAME, if
 * all peers support WRITE_SAME (both in kernel and DRBD version),
 * all peer devices support WRITE_SAME
 * logical_block_size is identical on all peers.

We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:07 -06:00
Lars Ellenberg 60bac04012 drbd: report sizes if rejecting too small peer disk
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 65f5be3579 drbd: discard_zeroes_if_aligned allows "thin" resync for discard_zeroes_data=0
Even if discard_zeroes_data != 0,
if discard_zeroes_if_aligned is set, we assume we can reliably
zero-out/discard using the drbd_issue_peer_discard() helper.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg af61494ad4 drbd: only restart frozen disk io when D_UP_TO_DATE
When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
backend that is being attached as D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 0ead5cca3d drbd: if there is no good data accessible, writes should be IO errors
If DRBD lost all path to good data,
and the on-no-data-accessible policy is OND_SUSPEND_IO,
all pending and new IO requests are suspended (will block).

If that setting is OND_IO_ERROR, IO will still be completed.
READ to "clean" areas (e.g. on an D_INCONSISTENT device,
and bitmap indicates a block is already in sync) will succeed.
READ to "unclean" areas (bitmap indicates block is out-of-sync),
will return EIO.

If we are already D_DISKLESS (or D_FAILED), we also return EIO.

Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
after replication link loss, new WRITE requests still went through OK.

The would also set the "out-of-sync" bit on their way, so READ after
WRITE would still return EIO. Also, the data generation UUIDs had not
been bumped, we would cause data divergence, without being able to
detect it on the next sync handshake, given the right sequence of events
in a multiple error scenario and "improper" order of recovery actions.

The right thing to do is to return EIO for all new writes,
unless we have access to good, current, D_UP_TO_DATE data.

The "established best practices" way to avoid these situations in the
first place is to set OND_SUSPEND_IO, or even do a hard-reset from
the pri-on-incon-degr policy helper hook.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 7bd000cb0c drbd: don't forget error completion when "unsuspending" IO
Possibly sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).

Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).

If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, do NOT set the susp_nod
(IO suspended due to no data) flag.

But we forgot to call the IO error completion for all pending,
suspended, requests.

While at it, also add a race check for a theoretically possible
race with a new handshake (network hickup), we may be able to
re-send requests, and can avoid passing IO errors up the stack.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 26a96110ab drbd: introduce unfence-peer handler
When resync is finished, we already call the "after-resync-target"
handler (on the former sync target, obviously), once per volume.

Paired with the before-resync-target handler, you can create snapshots,
before the resync causes the volumes to become inconsistent,
and discard those snapshots again, once they are no longer needed.

It was also overloaded to be paired with the "fence-peer" handler,
to "unfence" once the volumes are up-to-date and known good.

This has some disadvantages, though: we call "fence-peer" for the whole
connection (once for the group of volumes), but would call unfence as
side-effect of after-resync-target once for each volume.

Also, we fence on a (current, or about to become) Primary,
which will later become the sync-source.

Calling unfence only as a side effect of the after-resync-target
handler opens a race window, between a new fence on the Primary
(SyncTarget) and the unfence on the SyncTarget, which is difficult to
close without some kind of "cluster wide lock" in those handlers.

We would not need those handlers if we could still communicate.
Which makes trying to aquire a cluster wide lock from those handlers
seem like a very bad idea.

This introduces the "unfence-peer" handler, which will be called
per connection (once for the group of volumes), just like the fence
handler, only once all volumes are back in sync, and on the SyncSource.

Which is expected to be the node that previously called "fence", the
node that is currently allowed to be Primary, and thus the only node
that could trigger a new "fence" that could race with this unfence.

Which makes us not need any cluster wide synchronization here,
serializing two scripts running on the same node is trivial.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:06 -06:00
Lars Ellenberg 5052fee2c7 drbd: finish resync on sync source only by notification from sync target
If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.

Always wait for explicit resync finished state change notification from
the sync target.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg 505675f96c drbd: allow larger max_discard_sectors
Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg 7435e9018f drbd: zero-out partial unaligned discards on local backend
For consistency, also zero-out partial unaligned chunks of discard
requests on the local backend.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg 69ba1ee936 drbd: possibly disable discard support, if backend has discard_zeroes_data=0
Now that we have the discard_zeroes_if_aligned setting, we should also
check it when setting up our queue parameters on the primary,
not only on the receiving side.

We announce discard support,
UNLESS

 * we are connected to a peer that does not support TRIM
   on the DRBD protocol level.  Otherwise, it would either discard, or
   do a fallback to zero-out, depending on its backend and configuration.

 * our local backend does not support discards,
   or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg dd4f699da6 drbd: when receiving P_TRIM, zero-out partial unaligned chunks
We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.

The relevant use case is LVM/DM thin.

If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.

While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.

To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".

We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".

Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.

LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.

We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.

At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".

Arguably it should behave this way internally, by default,
and we'll try to make that happen.

But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.

Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.

Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.

We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00
Lars Ellenberg f9ff0da564 drbd: allow parallel flushes for multi-volume resources
To maintain write-order fidelity accros all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.

We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-13 21:43:05 -06:00