Commit Graph

382 Commits

Author SHA1 Message Date
Chuck Lever 9468431962 svcrdma: Add backward direction service for RPC/RDMA transport
On NFSv4.1 mount points, the Linux NFS client uses this transport
endpoint to receive backward direction calls and route replies back
to the NFSv4.1 server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever 63cae47005 xprtrdma: Handle incoming backward direction RPC calls
Introduce a code path in the rpcrdma_reply_handler() to catch
incoming backward direction RPC calls and route them to the ULP's
backchannel server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever 83128a60ca xprtrdma: Add support for sending backward direction RPC replies
Backward direction RPC replies are sent via the client transport's
send_request method, the same way forward direction RPC calls are
sent.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever 124fa17d3e xprtrdma: Pre-allocate Work Requests for backchannel
Pre-allocate extra send and receive Work Requests needed to handle
backchannel receives and sends.

The transport doesn't know how many extra WRs to pre-allocate until
the xprt_setup_backchannel() call, but that's long after the WRs are
allocated during forechannel setup.

So, use a fixed value for now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever f531a5dbc4 xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
xprtrdma's backward direction send and receive buffers are the same
size as the forechannel's inline threshold, and must be pre-
registered.

The consumer has no control over which receive buffer the adapter
chooses to catch an incoming backwards-direction call. Any receive
buffer can be used for either a forward reply or a backward call.
Thus both types of RPC message must all be the same size.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever a5b027e189 xprtrdma: Saving IRQs no longer needed for rb_lock
Now that RPC replies are processed in a workqueue, there's no need
to disable IRQs when managing send and receive buffers. This saves
noticeable overhead per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever 2da9ab3008 xprtrdma: Remove reply tasklet
Clean up: The reply tasklet is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever fe97b47cd6 xprtrdma: Use workqueue to process RPC/RDMA replies
The reply tasklet is fast, but it's single threaded. After reply
traffic saturates a single CPU, there's no more reply processing
capacity.

Replace the tasklet with a workqueue to spread reply handling across
all CPUs.  This also moves RPC/RDMA reply handling out of the soft
IRQ context and into a context that allows sleeps.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever 1e465fd4ff xprtrdma: Replace send and receive arrays
The rb_send_bufs and rb_recv_bufs arrays are used to implement a
pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep
structs. Replace those arrays with free lists.

To allow more than 512 RPCs in-flight at once, each of these arrays
would be larger than a page (assuming 8-byte addresses and 4KB
pages). Allowing up to 64K in-flight RPCs (as TCP now does), each
buffer array would have to be 128 pages. That's an order-6
allocation. (Not that we're going there.)

A list is easier to expand dynamically. Instead of allocating a
larger array of pointers and copying the existing pointers to the
new array, simply append more buffers to each list.

This also makes it simpler to manage receive buffers that might
catch backwards-direction calls, or to post receive buffers in
bulk to amortize the overhead of ib_post_recv.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever b0e178a2d8 xprtrdma: Refactor reply handler error handling
Clean up: The error cases in rpcrdma_reply_handler() almost never
execute. Ensure the compiler places them out of the hot path.

No behavior change expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever 4220a07264 xprtrdma: Prevent loss of completion signals
Commit 8301a2c047 ("xprtrdma: Limit work done by completion
handler") was supposed to prevent xprtrdma's upcall handlers from
starving other softIRQ work by letting them return to the provider
before all CQEs have been polled.

The logic assumes the provider will call the upcall handler again
immediately if the CQ is re-armed while there are still queued CQEs.

This assumption is invalid. The IBTA spec says that after a CQ is
armed, the hardware must interrupt only when a new CQE is inserted.
xprtrdma can't rely on the provider calling again, even though some
providers do.

Therefore, leaving CQEs on queue makes sense only when there is
another mechanism that ensures all remaining CQEs are consumed in a
timely fashion. xprtrdma does not have such a mechanism. If a CQE
remains queued, the transport can wait forever to send the next RPC.

Finally, move the wcs array back onto the stack to ensure that the
poll array is always local to the CPU where the completion upcall is
running.

Fixes: 8301a2c047 ("xprtrdma: Limit work done by completion ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever 7b3d770c67 xprtrdma: Re-arm after missed events
ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns a positive
value if WCs were added to a CQ after the last completion upcall
but before the CQ has been re-armed.

Commit 7f23f6f6e3 ("xprtrmda: Reduce lock contention in
completion handlers") assumed that when ib_req_notify_cq() returned
a positive RC, the CQ had also been successfully re-armed, making
it safe to return control to the provider without losing any
completion signals. That is an invalid assumption.

Change both completion handlers to continue polling while
ib_req_notify_cq() returns a positive value.

Fixes: 7f23f6f6e3 ("xprtrmda: Reduce lock contention in ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Chuck Lever a045178887 xprtrdma: Enable swap-on-NFS/RDMA
After adding a swapfile on an NFS/RDMA mount and removing the
normal swap partition, I was able to push the NFS client well
into swap without any issue.

I forgot to swapoff the NFS file before rebooting. This pinned
the NFS mount and the IB core and provider, causing shutdown to
hang. I think this is expected and safe behavior. Probably
shutdown scripts should "swapoff -a" before unmounting any
filesystems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Steve Wise 8610586d82 xprtrdma: don't log warnings for flushed completions
Unsignaled send WRs can get flushed as part of normal unmount, so don't
log them as warnings.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 13:45:15 -05:00
Linus Torvalds 58bd6e0602 Changes for 4.3-rc5
- Work around connection namespace lookup bug related to RoCE
 - Change usnic license to Dual GPL/BSD (was intended to be that way
   all along, but wasn't clear, permission from contributors was
   chased down)
 - Fix an issue between NFSoRDMA and mlx5 that could cause an oops
 - Fix leak of sendonly multicast groups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWHoT/AAoJELgmozMOVy/deK4QALETCToLcR5RRDR+QleFUvby
 FnP91Pu9zGOoiuP25FT5Ny0YAmTHd1KiDQBQHRe/NrYDCH2M/q8jFJSWZLwGrG6q
 8GYc1ieozGQMZvId3ZJnqUJUTEyJu9QtpiFFZJYJHriP6OShP1GiHJ/XTN9dvJ/u
 xcmViAYYIjjScjaY1MuYpxKITFwfZE0HtdvK7zzq+F9cpfmC//Zc0Po4V4o4Y9V3
 14WgbWZyhehmECKwN95hIY1pLySadgcCxoeUDHclQ3efKLar4tEC3SOM2QZsnNRc
 qlCHEZYeB5TRo0dF/2CYUMLfUHkMjnUpW2BiVDbQfmPio7lGUjh2SBAQjI5i1dEQ
 Wg69JH1TV7BYfRiwe7n49P/BJ2vIhCR2UjQrHjilZ/h6DPSfKy29hVSvOzb5xLeJ
 mwl/KSKxlfT2Z1SZy0yMlJfCm8tjPwf6WhOVwkFRAhYHD3Yf31EMVzD7gTtW2MXO
 n5S80k5ccJlXniPWjaqerhjOZHmwHViBmHNlN4zlDCRZeT9IuKDj5mi31f7HC4gx
 WqJtSjRxydpbGPKROHI4vrmfARPAKNrKhj8BiqxO5Cja+TiS2QeXXr+fbRwETrLS
 TjXWNfS3Boy564AJ8Gfug2wfBcHwY+31Uv2a6nrMmKi+wwVexF/ENOb/x9WHZrVo
 VqQVI2lUBH2LsmzadD9c
 =usb1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "We have four batched up patches for the current rc kernel.

  Two of them are small fixes that are obvious.

  One of them is larger than I would like for a late stage rc pull, but
  we found an issue in the namespace lookup code related to RoCE and
  this works around the issue for now (we allow a lookup with a
  namespace to succeed on RoCE since RoCE namespaces aren't implemented
  yet).  This will go away in 4.4 when we put in support for namespaces
  in RoCE devices.

  The last one is large in terms of lines, but is all legal and no
  functional changes.  Cisco needed to update their files to be more
  specific about their license.  They had intended the files to be dual
  licensed as GPL/BSD all along, and specified that in their module
  license tag, but their file headers were not up to par.  They
  contacted all of the contributors to get agreement and then submitted
  a patch to update the license headers in the files.

  Summary:

   - Work around connection namespace lookup bug related to RoCE

   - Change usnic license to Dual GPL/BSD (was intended to be that way
     all along, but wasn't clear, permission from contributors was
     chased down)

   - Fix an issue between NFSoRDMA and mlx5 that could cause an oops

   - Fix leak of sendonly multicast groups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/ipoib: For sendonly join free the multicast group on leave
  IB/cma: Accept connection without a valid netdev on RoCE
  xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg
  usnic: add missing clauses to BSD license
2015-10-15 13:44:35 -07:00
Linus Torvalds 5b5f145527 Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol
bug.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWHUj5AAoJECebzXlCjuG+KIoP/RW5zigAEKqUiD7ycKR91BxD
 9Nt0fqTTrbkGJhKM1/DN4YEjogAHeFW5OnGiLQRUNI/qdy+I1Gyr1kgwGmCCVDt9
 d8AhnxcnXR5SmsQHk7eeUd/rnODetf0bW5YJ8PfFbnC6cmM013nR9ujEccUuCl9M
 hHTp+690Doab00PtWtsjmZv5d+eT1bktY/R2PuQhyQM2CKWh1u4FeNTd1lWE551D
 b1wSvhAGMYVEsQv8+HICDrIQ8loGfH2gpBILERLM2yJlhN1IPU3RmNSAcQpZSaql
 veJYVmHdpMACCLp0Dd3hwWKDYvcQ2lCqKk+Cpd0vLpvZ8J5OjCLC+a2dh0PRIYuf
 pwFCvbWz6dn27/9eXEKbyT2JIeBIl4qwrFjfiRKlNX0c4HGKXaE2gJrY7bxnDxe1
 BatAbEFZ+rxHyPmycaj3JdyOxafmw94XzbT8q2g7tmUCj+pvAI+Pbv6PlwN6W2r7
 aGBZzgd8Y9pT6ZbCB0e413d/t5ulxwkt6vVz9Jze4gfcUrWcqHaqt7AadMl7obUx
 AYPLAVGeHybdKlLvqv42IF2QM8ZhizM0+EnxkjfWLrsa7WbstWX5KLPpm3K80dM7
 98p1ToNQDFcNU8WBZw8AkBpFz4j32RVOkvzWFWbhCo+T3is4BmP16uEEjH90aCCY
 skQKMrq8J1ox33gz5gT7
 =Pkuy
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol
  bug"

* tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE
  nfsd/blocklayout: accept any minlength
2015-10-13 11:31:03 -07:00
Chuck Lever 3be7f32878 svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE
Now that the NFS server advertises a maximum payload size of 1MB
for RPC/RDMA again, it crashes in svc_process_common() when NFS
client sends a 1MB NFS WRITE on an NFS/RDMA mount.

The server has set up a 259 element array of struct page pointers
in rq_pages[] for each incoming request. The last element of the
array is NULL.

When an incoming request has been completely received,
rdma_read_complete() attempts to set the starting page of the
incoming page vector:

  rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];

and the page to use for the reply:

  rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];

But the value of page_no has already accounted for head->hdr_count.
Thus rq_respages now points past the end of the incoming pages.

For NFS WRITE operations smaller than the maximum, this is harmless.
But when the NFS WRITE operation is as large as the server's max
payload size, rq_respages now points at the last entry in rq_pages,
which is NULL.

Fixes: cc9a903d91 ('svcrdma: Change maximum server payload . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-12 11:55:43 -04:00
Linus Torvalds 38aa0a59a6 Just one RDMA bugfix.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWFW4mAAoJECebzXlCjuG+YQ8P/2cfPRV2QZHK0BxlHooM6WII
 ZyIOMYU9KHxtoolC7UWfTy6y+ohDzisByYS59Tpd9k0d2NWqtMgUTLHS1UbjcekF
 RBMkhqv8VLDMupiBVElaO4/FvSqhP4YTpB/YvFHn8K4i2+NnfwL4c707SlxAk2tA
 SKhvgZVIS/N+VYpQo5hFZ1RofTQ7zWsvzPEsAOJR0pbBhEFE0WemZ12nQwkdkmRI
 2/R5XbT0ngSpCBRo2OcUoCHTozJG90gVfsu8IGzs/QeqlYZ9dVxWOUh8WDP2gmDF
 iB/KrUnv+gsMg4pLKrN9pbBMi8o6zvrbe7IMNjZEhA7qqcEwgf94hViYgrGdIDlS
 pqWWf/YMYWZzT0K1U8DuqjzQyeuTjRNv7RkALBFi54kQC6T49PIDbJruerhVVdzZ
 sgmDB/4kaSJF8yutetuRogskC+E7BaqhnAqu+VDin0UCFMl2GUb+3yof7GawbQcD
 uhPNMhn94LI6zXEzd86dKCc2ZwwNRfJYpfy5gYUmRHSHllZUSQdCqT4s3oIa4eFB
 RNqd0/AulHNgRJuXX/wMPZh5IWr9AnLp1WfJXRbY6hu5Q8+btsFG1wEBuQr3USTZ
 D5yJexpVQRNSmPWllLwfXkGFY4tiJA/TNDZxwrgocamnvxdrRw82HoFNvpRKVFEn
 AZFB4UR4JbqCe4LmBV/r
 =Jent
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Just one RDMA bugfix"

* tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: handle rdma read with a non-zero initial page offset
2015-10-09 16:34:45 -07:00
Sagi Grimberg f022fa88ce xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg
There is no need to require LOCAL_DMA_LKEY support as the
PD allocation makes sure that there is a local_dma_lkey. Also
correctly set a return value in error path.

This caused a NULL pointer dereference in mlx5 which removed
the support for LOCAL_DMA_LKEY.

Fixes: bb6c96d728 ("xprtrdma: Replace global lkey with lkey local to PD")
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-06 14:23:00 -04:00
Trond Myklebust 8dbb09570d NFS: NFSoRDMA bugfix
Fixes a use-after-free bug.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWCvocAAoJENfLVL+wpUDrNlYP/29jjHedFJk5ApUhmI4I4EJ0
 CVXvGP4aT4wtH16kKsZ95rozIU9C0s4VwelcCc9hnE+x8zHKp1axeOi2cesOBb88
 K8X0zo8gW7h7Eee/0hEAc8brPT6lcK/GFdjguiywDupRI9I1ByN9p5B/xRB8dbN6
 Je43Ro989aA4Qm1ChepiYbW+0DQ88rh/3sHM6j91lOQL8AQ7M7IcDxTMCHYj8d4Q
 TBYNwiA/tDMGMcdpe0yRGANayM12LbwEQXDldYg9+cawdfbL1FM91kjwdAyjYscD
 QTL21bCPt2Uk52fG/rLoiDjJ/3xDVwEndWiGFfXAicrIF/3YliwXeHxgJO8Ah5bP
 Ma2IIlNRStXTdK3QUhF8lKTyDNZuLyY4CUYHNGkMpVLrC4W6NAdTWOcbxBKJA15y
 sGRCq4uce3JM4W0ZamzZiUkVzEE10jxH61StahzKB0Q5ep2CwMp37komZpeca88F
 gOkWlrnA+sjlcDCxR3oZlwAP1mv5s1E1FkuXdkazQyAj8I3UX8t/4k3Jkrnq9HmA
 WblV+eUPz6Z/K1aM/LLx11RXpnLID0JCF7xOo6IKZ8Ik7OPGHGQK96L8pZ9VYZgl
 Oz62pH7ipri9SMJRBDJi4vshxeXKj1Ufhj3cU6dGl2RSyHrvyQX/DKfMYgrYlp3g
 xPxY3NHwLHYWK645NvLH
 =Dd3t
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.3-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA bugfix

Fixes a use-after-free bug.

Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
2015-10-02 15:49:33 -04:00
Steve Wise c91aed9896 svcrdma: handle rdma read with a non-zero initial page offset
The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
were not taking into account the initial page_offset when determining
the rdma read length.  This resulted in a read who's starting address
and length exceeded the base/bounds of the frmr.

The server gets an async error from the rdma device and kills the
connection, and the client then reconnects and resends.  This repeats
indefinitely, and the application hangs.

Most work loads don't tickle this bug apparently, but one test hit it
every time: building the linux kernel on a 16 core node with 'make -j
16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.

This bug seems to only be tripped with devices having small fastreg page
list depths.  I didn't see it with mlx4, for instance.

Fixes: 0bf4828983 ('svcrdma: refactor marshalling logic')
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-09-29 12:55:44 -04:00
Steve Wise 72c0217382 xprtrdma: disconnect and flush cqs before freeing buffers
Otherwise a FRMR completion can cause a touch-after-free crash.

In xprt_rdma_destroy(), call rpcrdma_buffer_destroy() only after calling
rpcrdma_ep_destroy().

In rpcrdma_ep_destroy(), disconnect the cm_id first which should flush the
qp, then drain the cqs, then destroy the qp, and finally destroy the cqs.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-09-28 10:42:35 -04:00
Chuck Lever bb6c96d728 xprtrdma: Replace global lkey with lkey local to PD
The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 10:46:51 -04:00
Linus Torvalds 26d2177e97 Changes for 4.3
- Create drivers/staging/rdma
 - Move amso1100 driver to staging/rdma and schedule for deletion
 - Move ipath driver to staging/rdma and schedule for deletion
 - Add hfi1 driver to staging/rdma and set TODO for move to regular tree
 - Initial support for namespaces to be used on RDMA devices
 - Add RoCE GID table handling to the RDMA core caching code
 - Infrastructure to support handling of devices with differing
   read and write scatter gather capabilities
 - Various iSER updates
 - Kill off unsafe usage of global mr registrations
 - Update SRP driver
 - Misc. mlx4 driver updates
 - Support for the mr_alloc verb
 - Support for a netlink interface between kernel and user space cache
   daemon to speed path record queries and route resolution
 - Ininitial support for safe hot removal of verbs devices
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV7v8wAAoJELgmozMOVy/d2dcP/3PXnGFPgFGJODKE6VCZtTvj
 nooNXRKXjxv470UT5DiAX7SNcBxzzS7Zl/Lj+831H9iNXUyzuH31KtBOAZ3W03vZ
 yXwCB2caOStSldTRSUUvPe2aIFPnyNmSpC4i6XcJLJMCFijKmxin5pAo8qE44BQU
 yjhT+wC9P6LL5wZXsn/nFIMLjOFfu0WBFHNp3gs5j59paxlx5VeIAZk16aQZH135
 m7YCyicwrS8iyWQl2bEXRMon2vlCHlX2RHmOJ4f/P5I0quNcGF2+d8Yxa+K1VyC5
 zcb3OBezz+wZtvh16yhsDfSPqHWirljwID2VzOgRSzTJWvQjju8VkwHtkq6bYoBW
 egIxGCHcGWsD0R5iBXLYr/tB+BmjbDObSm0AsR4+JvSShkeVA1IpeoO+19162ixE
 n6CQnk2jCee8KXeIN4PoIKsjRSbIECM0JliWPLoIpuTuEhhpajftlSLgL5hf1dzp
 HrSy6fXmmoRj7wlTa7DnYIC3X+ffwckB8/t1zMAm2sKnIFUTjtQXF7upNiiyWk4L
 /T1QEzJ2bLQckQ9yY4v528SvBQwA4Dy1amIQB7SU8+2S//bYdUvhysWPkdKC4oOT
 WlqS5PFDCI31MvNbbM3rUbMAD8eBAR8ACw9ZpGI/Rffm5FEX5W3LoxA8gfEBRuqt
 30ZYFuW8evTL+YQcaV65
 =EHLg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull inifiniband/rdma updates from Doug Ledford:
 "This is a fairly sizeable set of changes.  I've put them through a
  decent amount of testing prior to sending the pull request due to
  that.

  There are still a few fixups that I know are coming, but I wanted to
  go ahead and get the big, sizable chunk into your hands sooner rather
  than waiting for those last few fixups.

  Of note is the fact that this creates what is intended to be a
  temporary area in the drivers/staging tree specifically for some
  cleanups and additions that are coming for the RDMA stack.  We
  deprecated two drivers (ipath and amso1100) and are waiting to hear
  back if we can deprecate another one (ehca).  We also put Intel's new
  hfi1 driver into this area because it needs to be refactored and a
  transfer library created out of the factored out code, and then it and
  the qib driver and the soft-roce driver should all be modified to use
  that library.

  I expect drivers/staging/rdma to be around for three or four kernel
  releases and then to go away as all of the work is completed and final
  deletions of deprecated drivers are done.

  Summary of changes for 4.3:

   - Create drivers/staging/rdma
   - Move amso1100 driver to staging/rdma and schedule for deletion
   - Move ipath driver to staging/rdma and schedule for deletion
   - Add hfi1 driver to staging/rdma and set TODO for move to regular
     tree
   - Initial support for namespaces to be used on RDMA devices
   - Add RoCE GID table handling to the RDMA core caching code
   - Infrastructure to support handling of devices with differing read
     and write scatter gather capabilities
   - Various iSER updates
   - Kill off unsafe usage of global mr registrations
   - Update SRP driver
   - Misc  mlx4 driver updates
   - Support for the mr_alloc verb
   - Support for a netlink interface between kernel and user space cache
     daemon to speed path record queries and route resolution
   - Ininitial support for safe hot removal of verbs devices"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
  IB/ipoib: Suppress warning for send only join failures
  IB/ipoib: Clean up send-only multicast joins
  IB/srp: Fix possible protection fault
  IB/core: Move SM class defines from ib_mad.h to ib_smi.h
  IB/core: Remove unnecessary defines from ib_mad.h
  IB/hfi1: Add PSM2 user space header to header_install
  IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
  mlx5: Fix incorrect wc pkey_index assignment for GSI messages
  IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
  IB/uverbs: reject invalid or unknown opcodes
  IB/cxgb4: Fix if statement in pick_local_ip6adddrs
  IB/sa: Fix rdma netlink message flags
  IB/ucma: HW Device hot-removal support
  IB/mlx4_ib: Disassociate support
  IB/uverbs: Enable device removal when there are active user space applications
  IB/uverbs: Explicitly pass ib_dev to uverbs commands
  IB/uverbs: Fix race between ib_uverbs_open and remove_one
  IB/uverbs: Fix reference counting usage of event files
  IB/core: Make ib_dealloc_pd return void
  IB/srp: Create an insecure all physical rkey only if needed
  ...
2015-09-09 08:33:31 -07:00
Linus Torvalds 4e4adb2f46 NFS client updates for Linux 4.3
Highlights include:
 
 Stable patches:
 - Fix atomicity of pNFS commit list updates
 - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
 - nfs_set_pgio_error sometimes misses errors
 - Fix a thinko in xs_connect()
 - Fix borkage in _same_data_server_addrs_locked()
 - Fix a NULL pointer dereference of migration recovery ops for v4.2 client
 - Don't let the ctime override attribute barriers.
 - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
 - Ensure flexfiles pNFS driver updates the inode after write finishes
 - flexfiles must not pollute the attribute cache with attrbutes from the DS
 - Fix a protocol error in layoutreturn
 - Fix a protocol issue with NFSv4.1 CLOSE stateids
 
 Bugfixes + cleanups
 - pNFS blocks bugfixes from Christoph
 - Various cleanups from Anna
 - More fixes for delegation corner cases
 - Don't fsync twice for O_SYNC/IS_SYNC files
 - Fix pNFS and flexfiles layoutstats bugs
 - pnfs/flexfiles: avoid duplicate tracking of mirror data
 - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation issues.
 - pnfs/flexfiles: error handling retries a layoutget before fallback to MDS
 
 Features:
 - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from Kinglong
 - More RDMA client transport improvements from Chuck
 - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr() verbs
   from the SUNRPC, Lustre and core infiniband tree.
 - Optimise away the close-to-open getattr if there is no cached data
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJV7chgAAoJEGcL54qWCgDyqJQP/3kto9VXnXcatC382jF9Pfj5
 F55XeSnviOXH7CyiKA4nSBhnxg/sLuWOTpbkVI/4Y+VyWhLby9h+mtcKURHOlBnj
 d5BFoPwaBVDnUiKlHFQDkRjIyxjj2Sb6/uEb2V/u3v+3znR5AZZ4lzFx4cD85oaz
 mcru7yGiSxaQCIH6lHExcCEKXaDP5YdvS9YFsyQfv2976JSaQHM9ZG04E0v6MzTo
 E5wwC4CLMKmhuX9kmQMj85jzs1ASAKZ3N7b4cApTIo6F8DCDH0vKQphq/nEQC497
 ECjEs5/fpxtNJUpSBu0gT7G4LCiW3PzE7pHa+8bhbaAn9OzxIR5+qWujKsfGYQhO
 Oomp3K9zO6omshAc5w4MkknPpbImjoZjGAj/q/6DbtrDpnD7DzOTirwYY2yX0CA8
 qcL81uJUb8+j4jJj4RTO+lTUBItrM1XTqTSd/3eSMr5DDRVZj+ERZxh17TaxRBZL
 YrbrLHxCHrcbdSbPlovyvY+BwjJUUFJRcOxGQXLmNYR9u92fF59rb53bzVyzcRRO
 wBozzrNRCFL+fPgfNPLEapIb6VtExdM3rl2HYsJGckHj4DPQdnoB3ytIT9iEFZEN
 +/rE14XEZte7kuH3OP4el2UsP/hVsm7A49mtwrkdbd7rxMWD6XfQUp8DAggWUEtI
 1H6T7RC1Y6wsu0X1fnVz
 =knJA
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix atomicity of pNFS commit list updates
   - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
   - nfs_set_pgio_error sometimes misses errors
   - Fix a thinko in xs_connect()
   - Fix borkage in _same_data_server_addrs_locked()
   - Fix a NULL pointer dereference of migration recovery ops for v4.2
     client
   - Don't let the ctime override attribute barriers.
   - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
   - Ensure flexfiles pNFS driver updates the inode after write finishes
   - flexfiles must not pollute the attribute cache with attrbutes from
     the DS
   - Fix a protocol error in layoutreturn
   - Fix a protocol issue with NFSv4.1 CLOSE stateids

  Bugfixes + cleanups
   - pNFS blocks bugfixes from Christoph
   - Various cleanups from Anna
   - More fixes for delegation corner cases
   - Don't fsync twice for O_SYNC/IS_SYNC files
   - Fix pNFS and flexfiles layoutstats bugs
   - pnfs/flexfiles: avoid duplicate tracking of mirror data
   - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
     issues
   - pnfs/flexfiles: error handling retries a layoutget before fallback
     to MDS

  Features:
   - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
     Kinglong
   - More RDMA client transport improvements from Chuck
   - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
     verbs from the SUNRPC, Lustre and core infiniband tree.
   - Optimise away the close-to-open getattr if there is no cached data"

* tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
  NFSv4: Respect the server imposed limit on how many changes we may cache
  NFSv4: Express delegation limit in units of pages
  Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
  NFS: Optimise away the close-to-open getattr if there is no cached data
  NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
  NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
  nfs: Remove unneeded checking of the return value from scnprintf
  nfs: Fix truncated client owner id without proto type
  NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
  NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
  NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
  NFSv4.1/flexfiles: Fix freeing of mirrors
  NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
  NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
  NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
  NFSv4.1: Fix a protocol issue with CLOSE stateids
  NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
  SUNRPC: Prevent SYN+SYNACK+RST storms
  SUNRPC: xs_reset_transport must mark the connection as disconnected
  NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
  ...
2015-09-07 14:02:24 -07:00
Jason Gunthorpe 7dd78647a2 IB/core: Make ib_dealloc_pd return void
The majority of callers never check the return value, and even if they
did, they can't do anything about a failure.

All possible failure cases represent a bug in the caller, so just
WARN_ON inside the function instead.

This fixes a few random errors:
 net/rd/iw.c infinite loops while it fails. (racing with EBUSY?)

This also lays the ground work to get rid of error return from the
drivers. Most drivers do not error, the few that do are broken since
it cannot be handled.

Since uverbs can legitimately make use of EBUSY, open code the
check.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:12:39 -04:00
Steve Wise 9ac07501e1 svcrdma: limit FRMR page list lengths to device max
Svcrdma was incorrectly allocating fastreg MRs and page lists using
RPCSVC_MAXPAGES, which can exceed the device capabilities.  So limit
the depth to the minimum of RPCSVC_MAXPAGES and xprt->sc_frmr_pg_list_len.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:46 -04:00
Sagi Grimberg 0410e38eca xprtrdma, svcrdma: Convert to ib_alloc_mr
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 18:08:45 -04:00
Steve Wise bc3fe2e376 svcrdma: Use max_sge_rd for destination read depths
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-28 23:02:11 -04:00
Chuck Lever cc9a903d91 svcrdma: Change maximum server payload back to RPCSVC_MAXPAYLOAD
Both commit 0380a3f375 ("svcrdma: Add a separate "max data segs"
macro for svcrdma") and commit 7e5be28827 ("svcrdma: advertise
the correct max payload") are incorrect. This commit reverts both
changes, restoring the server's maximum payload size to 1MB.

Commit 7e5be28827 based the server's maximum payload on the
_client's_ RPCRDMA_MAX_DATA_SEGS value. That was wrong.

Commit 0380a3f375 tried to fix this so that the client maximum
payload size could be raised without affecting the server, but
managed to confuse matters more on the server side.

More importantly, limiting the advertised maximum payload size was
meant to be a workaround, not the actual fix. We need to revisit

  https://bugzilla.linux-nfs.org/show_bug.cgi?id=270

A Linux client on a platform with 64KB pages can overrun and crash
an x86_64 NFS/RDMA server when the r/wsize is 1MB. An x86/64 Linux
client seems to work fine using 1MB reads and writes when the Linux
server's maximum payload size is restored to 1MB.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
Fixes: 0380a3f375 ("svcrdma: Add a separate "max data segs" macro")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:04:57 -04:00
Devesh Sharma d0f36c46de xprtrdma: take HCA driver refcount at client
This is a rework of the following patch sent almost a year back:
http://www.mail-archive.com/linux-rdma%40vger.kernel.org/msg20730.html

In presence of active mount if someone tries to rmmod vendor-driver, the
command remains stuck forever waiting for destruction of all rdma-cm-id.
in worst case client can crash during shutdown with active mounts.

The existing code assumes that ia->ri_id->device cannot change during
the lifetime of a transport. xprtrdma do not have support for
DEVICE_REMOVAL event either. Lifting that assumption and adding support
for DEVICE_REMOVAL event is a long chain of work, and is in plan.

The community decided that preventing the hang right now is more
important than waiting for architectural changes.

Thus, this patch introduces a temporary workaround to acquire HCA driver
module reference count during the mount of a nfs-rdma mount point.

Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:22:56 -04:00
Chuck Lever 860477d1ff xprtrdma: Count RDMA_NOMSG type calls
RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
calls so administrators can tell if they happen to be used more than
expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 763f7e4e4b xprtrdma: Clean up xprt_rdma_print_stats()
checkpatch.pl complained about the seq_printf() format string split
across lines and the use of %Lu.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 2fcc213a18 xprtrdma: Fix large NFS SYMLINK calls
Repair how rpcrdma_marshal_req() chooses which RDMA message type
to use for large non-WRITE operations so that it picks RDMA_NOMSG
in the correct situations, and sets up the marshaling logic to
SEND only the RPC/RDMA header.

Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
server XDR decoder for NFSv2 SYMLINK does not handle having the
pathname argument arrive in a separate buffer. The decoder could be
fixed, but this is simpler and RDMA_NOMSG can be used in a variety
of other situations.

Ensure that the Linux client continues to use "RDMA_MSG + read
list" when sending large NFSv3 SYMLINK requests, which is more
efficient than using RDMA_NOMSG.

Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
these did not work at all.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 677eb17e94 xprtrdma: Fix XDR tail buffer marshalling
Currently xprtrdma appends an extra chunk element to the RPC/RDMA
read chunk list of each NFSv4 WRITE compound. The extra element
contains the final GETATTR operation in the compound.

The result is an extra RDMA READ operation to transfer a very short
piece of each NFS WRITE compound (typically 16 bytes). This is
inefficient.

It is also incorrect.

The client is sending the trailing GETATTR at the same Position as
the preceding WRITE data payload. Whether or not RFC 5667 allows
the GETATTR to appear in a read chunk, RFC 5666 requires that these
two separate RPC arguments appear at two distinct Positions.

It can also be argued that the GETATTR operation is not bulk data,
and therefore RFC 5667 forbids its appearance in a read chunk at
all.

Although RFC 5667 is not precise about when using a read list with
NFSv4 COMPOUND is allowed, the intent is that only data arguments
not touched by NFS (ie, read and write payloads) are to be sent
using RDMA READ or WRITE.

The NFS client constructs GETATTR arguments itself, and therefore is
required to send the trailing GETATTR operation as additional inline
content, not as a data payload.

NB: This change is not backwards compatible. Some older servers do
not accept inline content following the read list. The Linux NFS
server should handle this content correctly as of commit
a97c331f9a ("svcrdma: Handle additional inline content").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 33943b2974 xprtrdma: Don't provide a reply chunk when expecting a short reply
Currently Linux always offers a reply chunk, even when the reply
can be sent inline (ie. is smaller than 1KB).

On the client, registering a memory region can be expensive. A
server may choose not to use the reply chunk, wasting the cost of
the registration.

This is a change only for RPC replies smaller than 1KB which the
server constructs in the RPC reply send buffer. Because the elements
of the reply must be XDR encoded, a copy-free data transfer has no
benefit in this case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever 02eb57d8f4 xprtrdma: Always provide a write list when sending NFS READ
The client has been setting up a reply chunk for NFS READs that are
smaller than the inline threshold. This is not efficient: both the
server and client CPUs have to copy the reply's data payload into
and out of the memory region that is then transferred via RDMA.

Using the write list, the data payload is moved by the device and no
extra data copying is necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever 5457ced0b5 xprtrdma: Account for RPC/RDMA header size when deciding to inline
When the size of the RPC message is near the inline threshold (1KB),
the client would allow messages to be sent that were a few bytes too
large.

When marshaling RPC/RDMA requests, ensure the combined size of
RPC/RDMA header and RPC header do not exceed the inline threshold.
Endpoints typically reject RPC/RDMA messages that exceed the size
of their receive buffers.

The two server implementations I test with (Linux and Solaris) use
receive buffers that are larger than the client’s inline threshold.
Thus so far this has been benign, observed only by code inspection.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever b3221d6a53 xprtrdma: Remove logic that constructs RDMA_MSGP type calls
RDMA_MSGP type calls insert a zero pad in the middle of the RPC
message to align the RPC request's data payload to the server's
alignment preferences. A server can then "page flip" the payload
into place to avoid a data copy in certain circumstances. However:

1. The client has to have a priori knowledge of the server's
   preferred alignment

2. Requests eligible for RDMA_MSGP are requests that are small
   enough to have been sent inline, and convey a data payload
   at the _end_ of the RPC message

Today 1. is done with a sysctl, and is a global setting that is
copied during mount. Linux does not support CCP to query the
server's preferences (RFC 5666, Section 6).

A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
compound fits bullet 2.

Thus the Linux client currently leaves RDMA_MSGP disabled. The
Linux server handles RDMA_MSGP, but does not use any special
page flipping, so it confers no benefit.

Clean up the marshaling code by removing the logic that constructs
RDMA_MSGP type calls. This also reduces the maximum send iovec size
from four to just two elements.

/proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
thus is left in place.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever d1ed857e57 xprtrdma: Clean up rpcrdma_ia_open()
Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which
is different for each registration method, to the .ro_open functions.

This is refactoring only. No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever e531dcabec xprtrdma: Remove last ib_reg_phys_mr() call site
All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr(). If ib_get_dma_mr()
fails, rpcrdma_ia_open() fails and no transport is created.

Therefore execution never reaches the ib_reg_phys_mr() call site in
rpcrdma_register_internal(), so it can be removed.

The remaining logic in rpcrdma_{de}register_internal() is folded
into rpcrdma_{alloc,free}_regbuf().

This is clean up only. No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever d231093023 xprtrdma: Don't fall back to PHYSICAL memory registration
PHYSICAL memory registration uses a single rkey for all of the
client's memory, thus is insecure. It is still useful in some cases
for testing.

Retain the ability to select PHYSICAL memory registration capability
via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it
if the HCA does not support FRWR or FMR.

This means amso1100 no longer works out of the box with NFS/RDMA.
When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before
performing NFS/RDMA mounts.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever 864be126fe xprtrdma: Raise maximum payload size to one megabyte
The point of larger rsize and wsize is to reduce the per-byte cost
of memory registration and deregistration. Modern HCAs can typically
handle a megabyte or more with a single registration operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever 5231eb9773 xprtrdma: Make xprt_setup_rdma() agnostic to family of server address
In particular, recognize when an IPv6 connection is bound.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever 31193fe5f6 svcrdma: Remove svc_rdma_fastreg()
Commit 0bf4828983 ("svcrdma: refactor marshalling logic") removed
the last call site for svc_rdma_fastreg().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Chuck Lever 10dc451218 svcrdma: Clean up svc_rdma_get_reply_array()
Kernel coding conventions frown upon having large nontrivial
functions in header files, and the preference these days is to
allow the compiler to make inlining decisions if possible.

As these functions are re-homed into a .c file, be sure that
comparisons with fields in struct rpcrdma_msg are with be32
constants.

This is a refactoring change; no behavior change is intended.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Chuck Lever 9d11b51ce7 svcrdma: Fix send_reply() scatter/gather set-up
The Linux NFS server returns garbage in the data payload of inline
NFS/RDMA READ replies. These are READs of under 1000 bytes or so
where the client has not provided either a reply chunk or a write
list.

The NFS server delivers the data payload for an NFS READ reply to
the transport in an xdr_buf page list. If the NFS client did not
provide a reply chunk or a write list, send_reply() is supposed to
set up a separate sge for the page containing the READ data, and
another sge for XDR padding if needed, then post all of the sges via
a single SEND Work Request.

The problem is send_reply() does not advance through the xdr_buf
when setting up scatter/gather entries for SEND WR. It always calls
dma_map_xdr with xdr_off set to zero. When there's more than one
sge, dma_map_xdr() sets up the SEND sge's so they all point to the
xdr_buf's head.

The current Linux NFS/RDMA client always provides a reply chunk or
a write list when performing an NFS READ over RDMA. Therefore, it
does not exercise this particular case. The Linux server has never
had to use more than one extra sge for building RPC/RDMA replies
with a Linux client.

However, an NFS/RDMA client _is_ allowed to send small NFS READs
without setting up a write list or reply chunk. The NFS READ reply
fits entirely within the inline reply buffer in this case. This is
perhaps a more efficient way of performing NFS READs that the Linux
NFS/RDMA client may some day adopt.

Fixes: b432e6b3d9 ('svcrdma: Change DMA mapping logic to . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=285
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Shirley Ma ff79c74dca NFS/RDMA Release resources in svcrdma when device is removed
When removing underlying RDMA device, the rmmod will hang forever if there
are any outstanding NFS/RDMA client mounts. The outstanding NFS/RDMA counts
could also prevent the server from shutting down. Further debugging shows
that the existing connections are not teared down and resource are not
released when receiving RDMA_CM_EVENT_DEVICE_REMOVAL event. It seems the
original code missing svc_xprt_put() in RDMA_CM_EVENT_REMOVAL event handler
thus svc_xprt_free is never invoked to release the existing connection
resources.

The patch has been passed removing, adding device back and forth without
stopping NFS/RDMA service. This will also allow a device to be unplugged
and swapped out without shutting down NFS service.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=252
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Linus Torvalds 8688d9540c NFS client updates for Linux 4.2
Highlights include:
 
 Stable patches:
 - Fix a crash in the NFSv4 file locking code.
 - Fix an fsync() regression, where we were failing to retry I/O in some
   circumstances.
 - Fix an infinite loop in NFSv4.0 OPEN stateid recovery
 - Fix a memory leak when an attempted pnfs fails.
 - Fix a memory leak in the backchannel code
 - Large hostnames were not supported correctly in NFSv4.1
 - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O.
 - Fix a couple of credential issues in pNFS/flexfiles
 
 Bugfixes + cleanups:
 - Open flag sanity checks in the NFSv4 atomic open codepath
 - More NFSv4 delegation related bugfixes
 - Various NFSv4.1 backchannel bugfixes and cleanups
 - Fix the NFS swap socket code
 - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code
 - Fix a UDP transport deadlock issue
 
 Features:
 - More RDMA client transport improvements
 - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVlWQgAAoJEGcL54qWCgDyXtcP/2Y3HJ9xu5qU3Bo/jzCAw4E1
 jPPMSFAz4kqy/LGoslyc1cNDEiKGzJYWU8TtCGI3KAyNxb6n3pT1mEE1tvIsSdis
 D8bpV13M452PPpZYrBawIf4+OuohXmuYHpFiVNSpLbH3Uo7dthvFFnbqCGaGlnqY
 rXYZHAnx637OGBcJsT4AXCUz12ILvxMYRnqwW6Xn+j9JmwR1coQX3v8W8e7SMf6i
 J+zOny7Uetjrg1U9C9uQB6ZvIoxUMo9QOVmtGCwsBl8lM3fLmzaQfcUf9fm76pMT
 yTrKJs4jBLvVf00bRHFDv9EHWCy97oqCkeQEw1EY2lnxp/lmM5SiI4zQqjbf0QTW
 5VQScT1MK6xwHoUbuI/sYdXXR8KGDVT1xCFFHUNcg69CvgqdgWslPQY7xLJMvUJZ
 vBWfWDd8ppdCw2ZVX4ae/bnhfc+/mVh4wRPF7tgVAjT0pobBV9xMOeMkF4mo76Wa
 pvo/nTRMt68hpESVSvq9dYEMVhy5haqFhPrSbyAGOpT4SE2V3RCCZQfhu15TMKdW
 BdvItG+mdAVPbIHqhx7vRdAudcOEZKyxbFA+l3E5FyCAXLV7XS3M8CEl3P1w7gmm
 Ccr8DW9abKFJf1RAKdX3stexIoJLGTwciSMR5smsbup/xNcx/fRgx2f1w31JMPxb
 kG3Izfk25w9uGSsbR39D
 =AREr
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a crash in the NFSv4 file locking code.
   - Fix an fsync() regression, where we were failing to retry I/O in
     some circumstances.
   - Fix an infinite loop in NFSv4.0 OPEN stateid recovery
   - Fix a memory leak when an attempted pnfs fails.
   - Fix a memory leak in the backchannel code
   - Large hostnames were not supported correctly in NFSv4.1
   - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O.
   - Fix a couple of credential issues in pNFS/flexfiles

  Bugfixes + cleanups:
   - Open flag sanity checks in the NFSv4 atomic open codepath
   - More NFSv4 delegation related bugfixes
   - Various NFSv4.1 backchannel bugfixes and cleanups
   - Fix the NFS swap socket code
   - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code
   - Fix a UDP transport deadlock issue

  Features:
   - More RDMA client transport improvements
   - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles"

* tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (87 commits)
  nfs: Remove invalid tk_pid from debug message
  nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh
  nfs: Drop bad comment in nfs41_walk_client_list()
  nfs: Remove unneeded micro checking of CONFIG_PROC_FS
  nfs: Don't setting FILE_CREATED flags always
  nfs: Use remove_proc_subtree() instead remove_proc_entry()
  nfs: Remove unused argument in nfs_server_set_fsinfo()
  nfs: Fix a memory leak when meeting an unsupported state protect
  nfs: take extra reference to fl->fl_file when running a LOCKU operation
  NFSv4: When returning a delegation, don't reclaim an incompatible open mode.
  NFSv4.2: LAYOUTSTATS is optional to implement
  NFSv4.2: Fix up a decoding error in layoutstats
  pNFS/flexfiles: Fix the reset of struct pgio_header when resending
  pNFS/flexfiles: Turn off layoutcommit for servers that don't need it
  pnfs/flexfiles: protect ktime manipulation with mirror lock
  nfs: provide pnfs_report_layoutstat when NFS42 is disabled
  nfs: verify open flags before allowing open
  nfs: always update creds in mirror, even when we have an already connected ds
  nfs: fix potential credential leak in ff_layout_update_mirror_cred
  pnfs/flexfiles: report layoutstat regularly
  ...
2015-07-02 11:32:23 -07:00
Linus Torvalds d2c3ac7e7e Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "A relatively quiet cycle, with a mix of cleanup and smaller bugfixes"

* 'for-4.2' of git://linux-nfs.org/~bfields/linux: (24 commits)
  sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
  nfsd: wrap too long lines in nfsd4_encode_read
  nfsd: fput rd_file from XDR encode context
  nfsd: take struct file setup fully into nfs4_preprocess_stateid_op
  nfsd: refactor nfs4_preprocess_stateid_op
  nfsd: clean up raparams handling
  nfsd: use swap() in sort_pacl_range()
  rpcrdma: Merge svcrdma and xprtrdma modules into one
  svcrdma: Add a separate "max data segs macro for svcrdma
  svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL
  svcrdma: Keep rpcrdma_msg fields in network byte-order
  svcrdma: Fix byte-swapping in svc_rdma_sendto.c
  nfsd: Update callback sequnce id only CB_SEQUENCE success
  nfsd: Reset cb_status in nfsd4_cb_prepare() at retrying
  svcrdma: Remove svc_rdma_xdr_decode_deferred_req()
  SUNRPC: Move EXPORT_SYMBOL for svc_process
  uapi/nfs: Add NFSv4.1 ACL definitions
  nfsd: Remove dead declarations
  nfsd: work around a gcc-5.1 warning
  nfsd: Checking for acl support does not require fetching any acls
  ...
2015-06-27 10:14:39 -07:00