If FRMR registration fails, it's likely to transition the QP to the
error state. Or, registration may have failed because the QP is
_already_ in ERROR.
Thus calling rpcrdma_deregister_external() in
rpcrdma_create_chunks() is useless in FRMR mode: the LOCAL_INVs just
get flushed.
It is safe to leave existing registrations: when FRMR registration
is tried again, rpcrdma_register_frmr_external() checks if each FRMR
is already/still VALID, and knocks it down first if it is.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Various reports of:
rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0
ep ffff8800bfd3e848
Ensure that rkeys in already-marshalled RPC/RDMA headers are
refreshed after the QP has been replaced by a reconnect.
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=249
Suggested-by: Selvin Xavier <Selvin.Xavier@Emulex.Com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If rpcrdma_register_external() fails during request marshaling, the
current RPC request is killed. Instead, this RPC should be retried
after reconnecting the transport instance.
The most likely reason for registration failure with FRMR is a
failed post_send, which would be due to a remote transport
disconnect or memory exhaustion. These issues can be recovered
by a retry.
Problems encountered in the marshaling logic itself will not be
corrected by trying again, so these should still kill a request.
Now that we've added a clean exit for marshaling errors, take the
opportunity to defang some BUG_ON's.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Update the cwnd while processing the server's reply. Otherwise the
next task on the xprt_sending queue is still subject to the old
credit window. Currently, no task is awoken if the old congestion
window is still exceeded, even if the new window is larger, and a
deadlock results.
This is an issue during a transport reconnect. Servers don't
normally shrink the credit window, but the client does reset it to
1 when reconnecting so the server can safely grow it again.
As a minor optimization, remove the hack of grabbing the initial
cwnd size (which happens to be RPC_CWNDSCALE) and using that value
as the congestion scaling factor. The scaling value is invariant,
and we are better off without the multiplication operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the new connection is able to make forward progress, reset the
re-establish timeout. Otherwise it keeps growing even if disconnect
events are rare.
The same behavior as TCP is adopted: reconnect immediately if the
transport instance has been able to make some forward progress.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
GETACL relies on transport layer to alloc memory for reply buffer.
However xprtrdma assumes that the reply buffer (pagelist) has been
pre-allocated in upper layer. This problem was reported by IOL OFA lab
test on PPC.
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Edward Mossman <emossman@iol.unh.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: All remaining callers of rpcrdma_deregister_external()
pass NULL as the last argument, so remove that argument.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
All kernel RDMA providers except amso1100 support either MTHCAFMR
or FRMR, both of which are faster than REGISTER. amso1100 can
continue to use ALLPHYSICAL.
The only other ULP consumer in the kernel that uses the reg_phys_mr
verb is Lustre.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The MEMWINDOWS and MEMWINDOWS_ASYNC memory registration modes were
intended as stop-gap modes before the introduction of FRMR. They
are now considered obsolete.
MEMWINDOWS_ASYNC is also considered unsafe because it can leave
client memory registered and exposed for an indeterminant time after
each I/O.
At this point, the MEMWINDOWS modes add needless complexity, so
remove them.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: This memory registration mode is slow and was never
meant for use in production environments. Remove it to reduce
implementation complexity.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
An IB provider can invoke rpcrdma_conn_func() in an IRQ context,
thus rpcrdma_conn_func() cannot be allowed to directly invoke
generic RPC functions like xprt_wake_pending_tasks().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Some rdma devices don't support a fast register page list depth of
at least RPCRDMA_MAX_DATA_SEGS. So xprtrdma needs to chunk its fast
register regions according to the minimum of the device max supported
depth or RPCRDMA_MAX_DATA_SEGS.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
After commit a11a2bf4, "SUNRPC: Optimise away unnecessary data moves
in xdr_align_pages", Thu Aug 2 13:21:43 2012, READs larger than a
few hundred bytes via NFS/RDMA no longer work. This commit exposed
a long-standing bug in rpcrdma_inline_fixup().
I reproduce this with an rsize=4096 mount using the cthon04 basic
tests. Test 5 fails with an EIO error.
For my reproducer, kernel log shows:
NFS: server cheating in read reply: count 4096 > recvd 0
rpcrdma_inline_fixup() is zeroing the xdr_stream::page_len field,
and xdr_align_pages() is now returning that value to the READ XDR
decoder function.
That field is set up by xdr_inline_pages() by the READ XDR encoder
function. As far as I can tell, it is supposed to be left alone
after that, as it describes the dimensions of the reply xdr_stream,
not the contents of that stream.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=68391
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
tk_xprt is just a shortcut for tk_client->cl_xprt, however cl_xprt is
defined as an __rcu variable. Replace dereferences of tk_xprt with
non-rcu dereferences where it is safe to do so.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
New features include:
- Add NFS client support for containers.
This should enable most of the necessary functionality, including
lockd support, and support for rpc.statd, NFSv4 idmapper and
RPCSEC_GSS upcalls into the correct network namespace from
which the mount system call was issued.
- NFSv4 idmapper scalability improvements
Base the idmapper cache on the keyring interface to allow concurrent
access to idmapper entries. Start the process of migrating users from
the single-threaded daemon-based approach to the multi-threaded
request-key based approach.
- NFSv4.1 implementation id.
Allows the NFSv4.1 client and server to mutually identify each other
for logging and debugging purposes.
- Support the 'vers=4.1' mount option for mounting NFSv4.1 instead of
having to use the more counterintuitive 'vers=4,minorversion=1'.
- SUNRPC tracepoints.
Start the process of adding tracepoints in order to improve debugging
of the RPC layer.
- pNFS object layout support for autologin.
Important bugfixes include:
- Fix a bug in rpc_wake_up/rpc_wake_up_status that caused them to fail
to wake up all tasks when applied to priority waitqueues.
- Ensure that we handle read delegations correctly, when we try to
truncate a file.
- A number of fixes for NFSv4 state manager loops (mostly to do with
delegation recovery).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJPalZbAAoJEGcL54qWCgDyCi4P+QHcmzQhJO7HWx3Pzjs67bFT
xMSYaKHGWS4AJKUBVl5OKBxUExfrMHBNbElV3IKUIwBlDx8RVtnwfptKSe146iki
dn4TrRO5es8nmI4hRDcGMlzJDZq4y0Qg//qiUFmojiNW/Avw0ljfMoVUejJJ09FV
oeDk4EGtcxkEyH+g48ZjYbyspRnG8qtD3atf70Z3lYE0ELdG/B5Dyzw1RDrA5p73
xJX3lqy8p/4ROzw/dmNoxdAXOrr3Q4/T58Bvp/lUglPy/EHyPmWzFoH0MU0C/PFu
5VnAl6QDbNCTcIw9FvJlX/mIyErpNG9eKzUskUc9L9SA+B+J/i4rIap4KATRN3nH
7QhE5qUacPuJnvxml7MPmlQTuft3fkAQ7NhKIWrbRi1QS9FmJC5NxctIb8loqlFn
yIXdKeLfMshB+NyuFS9uzStX7SmV3eMgVd+5ZxRjYxm+PKJLw2KXeudArL6M5mHK
3QeKZpqwaYQ3RfaTNpvAp0doiXHCO5UbWfI0Pe8xQs/QcMCNReffqV2G4IJKFAu6
WpoN2UDQC9LCBifLw2nS7kku8+ZVXLQU8OC1NVl3TG15xD9cNLXuk3/y5llPGq4O
odo52uLFpJohbDaHMj5RTKOfchTQCm2iyuVmxZEeAySypMSiAXmW7COSKHs/HxI1
VBm+EI00Pvmm5+fUjIlp
=LuHE
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates for Linux 3.4 from Trond Myklebust:
"New features include:
- Add NFS client support for containers.
This should enable most of the necessary functionality, including
lockd support, and support for rpc.statd, NFSv4 idmapper and
RPCSEC_GSS upcalls into the correct network namespace from which
the mount system call was issued.
- NFSv4 idmapper scalability improvements
Base the idmapper cache on the keyring interface to allow
concurrent access to idmapper entries. Start the process of
migrating users from the single-threaded daemon-based approach to
the multi-threaded request-key based approach.
- NFSv4.1 implementation id.
Allows the NFSv4.1 client and server to mutually identify each
other for logging and debugging purposes.
- Support the 'vers=4.1' mount option for mounting NFSv4.1 instead of
having to use the more counterintuitive 'vers=4,minorversion=1'.
- SUNRPC tracepoints.
Start the process of adding tracepoints in order to improve
debugging of the RPC layer.
- pNFS object layout support for autologin.
Important bugfixes include:
- Fix a bug in rpc_wake_up/rpc_wake_up_status that caused them to
fail to wake up all tasks when applied to priority waitqueues.
- Ensure that we handle read delegations correctly, when we try to
truncate a file.
- A number of fixes for NFSv4 state manager loops (mostly to do with
delegation recovery)."
* tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (224 commits)
NFS: fix sb->s_id in nfs debug prints
xprtrdma: Remove assumption that each segment is <= PAGE_SIZE
xprtrdma: The transport should not bug-check when a dup reply is received
pnfs-obj: autologin: Add support for protocol autologin
NFS: Remove nfs4_setup_sequence from generic rename code
NFS: Remove nfs4_setup_sequence from generic unlink code
NFS: Remove nfs4_setup_sequence from generic read code
NFS: Remove nfs4_setup_sequence from generic write code
NFS: Fix more NFS debug related build warnings
SUNRPC/LOCKD: Fix build warnings when CONFIG_SUNRPC_DEBUG is undefined
nfs: non void functions must return a value
SUNRPC: Kill compiler warning when RPC_DEBUG is unset
SUNRPC/NFS: Add Kbuild dependencies for NFS_DEBUG/RPC_DEBUG
NFS: Use cond_resched_lock() to reduce latencies in the commit scans
NFSv4: It is not safe to dereference lsp->ls_state in release_lockowner
NFS: ncommit count is being double decremented
SUNRPC: We must not use list_for_each_entry_safe() in rpc_wake_up()
Try using machine credentials for RENEW calls
NFSv4.1: Fix a few issues in filelayout_commit_pagelist
NFSv4.1: Clean ups and bugfixes for the pNFS read/writeback/commit code
...
The client side RDMA transport will bug check if it receives a duplicate
reply, instead we should simply drop the duplicate reply.
Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPCRDMA marshalling logic assumed that xdr->page_base was an
offset into the first page of xdr->page_list. It is in fact an
offset into the xdr->page_list itself, that is, it selects the
first page in the page_list and the offset into that page.
The symptom depended in part on the rpc_memreg_strategy, if it was
FRMR, or some other one-shot mapping mode, the connection would get
torn down on a base and bounds error. When the badly marshalled RPC
was retransmitted it would reconnect, get the error, and tear down the
connection again in a loop forever. This resulted in a hung-mount. For
the other modes, it would result in silent data corruption. This bug is
most easily reproduced by writing more data than the filesystem
has space for.
This fix corrects the page_base assumption and otherwise simplifies
the iov mapping logic.
Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch updates the computation to include the worst case situation
where three FRMR are required to map a single RPC REQ.
Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Certain client rpc's which contain both lengthy page-contained
metadata and a non-empty xdr_tail buffer require careful handling
to avoid overlapped memory copying. Rearranging of existing rpcrdma
marshaling code avoids it; this fixes an NFSv4 symlink creation error
detected with connectathon basic/test8 to multiple servers.
Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The send marshaling code split a particular dprintk across two
lines, which makes it hard to extract from logfiles.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The xprt_connect call path does not expect such errors as ECONNREFUSED
to be returned from failed transport connection attempts, otherwise it
translates them to EIO and signals fatal errors. For example, mount.nfs
prints simply "internal error". Translate all such errors to ENOTCONN
from RPC/RDMA to match sockets behavior.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RPC/RDMA protocol allows clients and servers to avoid RDMA
operations for data which is purely the result of XDR padding.
On the client, automatically insert the necessary padding for
such server replies, and optionally don't marshal such chunks.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
An RPC/RDMA client cannot retransmit on an unbroken connection,
doing so violates its flow control with the server.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Nit: rl_nchunks is an unsigned integer, so pass it into
rpcrdma_count_chunks() via an unsigned integer argument. This eliminates
a harmless mixed sign comparison in rpcrdma_count_chunks()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Thomas Talpey <Thomas.Talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Keep the type of the buffer position the same during iovec conversion to
reduce the likelihood of unexpected results from comparisons and length
computations.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Thomas Talpey <Thomas.Talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
rpcrdma_convert_iovs is passed an xdr_buf representing either an RPC
request or an RPC reply. In the case of a request, several
calculations and tests involving pos are unnecessary. In the case of a
reply, several calculations and tests involving pos are incorrect (the
code tests pos against the reply xdr buf's len field, which is always
0 at the time rpcrdma_convert_iovs is executed). This change removes
the incorrect/unnecessary calculations and tests involving pos.
This fixes an observed problem when reading certain file sizes over
NFS/RDMA.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: James Lentini <jlentini@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
as some architectures have unsigned long for u64.
net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_create_chunks':
net/sunrpc/xprtrdma/rpc_rdma.c:222: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/sunrpc/xprtrdma/rpc_rdma.c:234: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_count_chunks':
net/sunrpc/xprtrdma/rpc_rdma.c:577: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64
Noticed on PowerPC pseries_defconfig build.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
rpcrdma stuff lacks endianness annotations for on-the-wire data.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This implements the marshaling and unmarshaling of the rpcrdma transport
headers. Connection management is also addressed.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This implements the configuration and building of the core transport
switch implementation of the rpcrdma transport. Stubs are provided for
the rpcrdma protocol handling, and the infiniband/iwarp verbs interface.
These are provided in following patches.
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>