docs: networking: convert rds.txt to ReST
- add SPDX header;
- add a document title;
- mark code blocks and literals as such;
- mark tables as such;
- mark lists as such;
- adjust indentation, whitespaces and blank lines where needed;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
parent 8c6e172002
commit bad5b6e223
@@ -97,6 +97,7 @@ Contents:
 proc_net_tcp
 radiotap-headers
 ray_cs
+rds

 .. only:: subproject and html

@@ -1,3 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+RDS
+===

 Overview
 ========
||||||
|
@ -24,36 +29,39 @@ as IB.
|
||||||
The high-level semantics of RDS from the application's point of view are
|
The high-level semantics of RDS from the application's point of view are
|
||||||
|
|
||||||
* Addressing
|
* Addressing
|
||||||
RDS uses IPv4 addresses and 16bit port numbers to identify
|
|
||||||
the end point of a connection. All socket operations that involve
|
|
||||||
passing addresses between kernel and user space generally
|
|
||||||
use a struct sockaddr_in.
|
|
||||||
|
|
||||||
The fact that IPv4 addresses are used does not mean the underlying
|
RDS uses IPv4 addresses and 16bit port numbers to identify
|
||||||
transport has to be IP-based. In fact, RDS over IB uses a
|
the end point of a connection. All socket operations that involve
|
||||||
reliable IB connection; the IP address is used exclusively to
|
passing addresses between kernel and user space generally
|
||||||
locate the remote node's GID (by ARPing for the given IP).
|
use a struct sockaddr_in.
|
||||||
|
|
||||||
The port space is entirely independent of UDP, TCP or any other
|
The fact that IPv4 addresses are used does not mean the underlying
|
||||||
protocol.
|
transport has to be IP-based. In fact, RDS over IB uses a
|
||||||
|
reliable IB connection; the IP address is used exclusively to
|
||||||
|
locate the remote node's GID (by ARPing for the given IP).
|
||||||
|
|
||||||
|
The port space is entirely independent of UDP, TCP or any other
|
||||||
|
protocol.
|
||||||
|
|
||||||
* Socket interface
|
* Socket interface
|
||||||
RDS sockets work *mostly* as you would expect from a BSD
|
|
||||||
socket. The next section will cover the details. At any rate,
|
|
||||||
all I/O is performed through the standard BSD socket API.
|
|
||||||
Some additions like zerocopy support are implemented through
|
|
||||||
control messages, while other extensions use the getsockopt/
|
|
||||||
setsockopt calls.
|
|
||||||
|
|
||||||
Sockets must be bound before you can send or receive data.
|
RDS sockets work *mostly* as you would expect from a BSD
|
||||||
This is needed because binding also selects a transport and
|
socket. The next section will cover the details. At any rate,
|
||||||
attaches it to the socket. Once bound, the transport assignment
|
all I/O is performed through the standard BSD socket API.
|
||||||
does not change. RDS will tolerate IPs moving around (eg in
|
Some additions like zerocopy support are implemented through
|
||||||
a active-active HA scenario), but only as long as the address
|
control messages, while other extensions use the getsockopt/
|
||||||
doesn't move to a different transport.
|
setsockopt calls.
|
||||||
|
|
||||||
|
Sockets must be bound before you can send or receive data.
|
||||||
|
This is needed because binding also selects a transport and
|
||||||
|
attaches it to the socket. Once bound, the transport assignment
|
||||||
|
does not change. RDS will tolerate IPs moving around (eg in
|
||||||
|
a active-active HA scenario), but only as long as the address
|
||||||
|
doesn't move to a different transport.
|
||||||
|
|
||||||
* sysctls
|
* sysctls
|
||||||
RDS supports a number of sysctls in /proc/sys/net/rds
|
|
||||||
|
RDS supports a number of sysctls in /proc/sys/net/rds
|
||||||
|
|
||||||
|
|
||||||
Socket Interface
|
Socket Interface
|
||||||
|
@ -66,89 +74,88 @@ Socket Interface
|
||||||
options.
|
options.
|
||||||
|
|
||||||
fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
|
fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
|
||||||
This creates a new, unbound RDS socket.
|
This creates a new, unbound RDS socket.
|
||||||
|
|
||||||
setsockopt(SOL_SOCKET): send and receive buffer size
|
setsockopt(SOL_SOCKET): send and receive buffer size
|
||||||
RDS honors the send and receive buffer size socket options.
|
RDS honors the send and receive buffer size socket options.
|
||||||
You are not allowed to queue more than SO_SNDSIZE bytes to
|
You are not allowed to queue more than SO_SNDSIZE bytes to
|
||||||
a socket. A message is queued when sendmsg is called, and
|
a socket. A message is queued when sendmsg is called, and
|
||||||
it leaves the queue when the remote system acknowledges
|
it leaves the queue when the remote system acknowledges
|
||||||
its arrival.
|
its arrival.
|
||||||
|
|
||||||
The SO_RCVSIZE option controls the maximum receive queue length.
|
The SO_RCVSIZE option controls the maximum receive queue length.
|
||||||
This is a soft limit rather than a hard limit - RDS will
|
This is a soft limit rather than a hard limit - RDS will
|
||||||
continue to accept and queue incoming messages, even if that
|
continue to accept and queue incoming messages, even if that
|
||||||
takes the queue length over the limit. However, it will also
|
takes the queue length over the limit. However, it will also
|
||||||
mark the port as "congested" and send a congestion update to
|
mark the port as "congested" and send a congestion update to
|
||||||
the source node. The source node is supposed to throttle any
|
the source node. The source node is supposed to throttle any
|
||||||
processes sending to this congested port.
|
processes sending to this congested port.
|
||||||
|
|
||||||
bind(fd, &sockaddr_in, ...)
|
bind(fd, &sockaddr_in, ...)
|
||||||
This binds the socket to a local IP address and port, and a
|
This binds the socket to a local IP address and port, and a
|
||||||
transport, if one has not already been selected via the
|
transport, if one has not already been selected via the
|
||||||
SO_RDS_TRANSPORT socket option
|
SO_RDS_TRANSPORT socket option
|
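The bind() step described above takes an ordinary `struct sockaddr_in`. A minimal sketch of filling one in is shown here; `rds_fill_addr` is a hypothetical helper name, not something defined by RDS, and the address and port in the usage note are placeholders.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the RDS API): prepare the
 * struct sockaddr_in that bind() is given for an RDS socket.
 * Returns 0 on success, -1 if `ip` is not a dotted-quad IPv4 string.
 */
static int rds_fill_addr(struct sockaddr_in *sin, const char *ip,
                         unsigned short port)
{
    memset(sin, 0, sizeof(*sin));
    sin->sin_family = AF_INET;      /* RDS addressing reuses IPv4 addresses */
    sin->sin_port = htons(port);    /* 16-bit port, network byte order */
    return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}
```

With a descriptor from `socket(PF_RDS, SOCK_SEQPACKET, 0)`, the filled structure would then be passed as `bind(fd, (struct sockaddr *)&sin, sizeof(sin))`. Whether that call succeeds depends on an RDS transport being available for the chosen local address.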
||||||
|
|
||||||
sendmsg(fd, ...)
|
sendmsg(fd, ...)
|
||||||
Sends a message to the indicated recipient. The kernel will
|
Sends a message to the indicated recipient. The kernel will
|
||||||
transparently establish the underlying reliable connection
|
transparently establish the underlying reliable connection
|
||||||
if it isn't up yet.
|
if it isn't up yet.
|
||||||
|
|
||||||
An attempt to send a message that exceeds SO_SNDSIZE will
|
An attempt to send a message that exceeds SO_SNDSIZE will
|
||||||
return with -EMSGSIZE
|
return with -EMSGSIZE
|
||||||
|
|
||||||
An attempt to send a message that would take the total number
|
An attempt to send a message that would take the total number
|
||||||
of queued bytes over the SO_SNDSIZE threshold will return
|
of queued bytes over the SO_SNDSIZE threshold will return
|
||||||
EAGAIN.
|
EAGAIN.
|
||||||
|
|
||||||
An attempt to send a message to a destination that is marked
|
An attempt to send a message to a destination that is marked
|
||||||
as "congested" will return ENOBUFS.
|
as "congested" will return ENOBUFS.
|
||||||
|
|
||||||
recvmsg(fd, ...)
|
recvmsg(fd, ...)
|
||||||
Receives a message that was queued to this socket. The sockets
|
Receives a message that was queued to this socket. The sockets
|
||||||
recv queue accounting is adjusted, and if the queue length
|
recv queue accounting is adjusted, and if the queue length
|
||||||
drops below SO_SNDSIZE, the port is marked uncongested, and
|
drops below SO_SNDSIZE, the port is marked uncongested, and
|
||||||
a congestion update is sent to all peers.
|
a congestion update is sent to all peers.
|
||||||
|
|
||||||
Applications can ask the RDS kernel module to receive
|
Applications can ask the RDS kernel module to receive
|
||||||
notifications via control messages (for instance, there is a
|
notifications via control messages (for instance, there is a
|
||||||
notification when a congestion update arrived, or when a RDMA
|
notification when a congestion update arrived, or when a RDMA
|
||||||
operation completes). These notifications are received through
|
operation completes). These notifications are received through
|
||||||
the msg.msg_control buffer of struct msghdr. The format of the
|
the msg.msg_control buffer of struct msghdr. The format of the
|
||||||
messages is described in manpages.
|
messages is described in manpages.
|
||||||
|
|
||||||
poll(fd)
|
poll(fd)
|
||||||
RDS supports the poll interface to allow the application
|
RDS supports the poll interface to allow the application
|
||||||
to implement async I/O.
|
to implement async I/O.
|
||||||
|
|
||||||
POLLIN handling is pretty straightforward. When there's an
|
POLLIN handling is pretty straightforward. When there's an
|
||||||
incoming message queued to the socket, or a pending notification,
|
incoming message queued to the socket, or a pending notification,
|
||||||
we signal POLLIN.
|
we signal POLLIN.
|
||||||
|
|
||||||
POLLOUT is a little harder. Since you can essentially send
|
POLLOUT is a little harder. Since you can essentially send
|
||||||
to any destination, RDS will always signal POLLOUT as long as
|
to any destination, RDS will always signal POLLOUT as long as
|
||||||
there's room on the send queue (ie the number of bytes queued
|
there's room on the send queue (ie the number of bytes queued
|
||||||
is less than the sendbuf size).
|
is less than the sendbuf size).
|
||||||
|
|
||||||
However, the kernel will refuse to accept messages to
|
However, the kernel will refuse to accept messages to
|
||||||
a destination marked congested - in this case you will loop
|
a destination marked congested - in this case you will loop
|
||||||
forever if you rely on poll to tell you what to do.
|
forever if you rely on poll to tell you what to do.
|
||||||
This isn't a trivial problem, but applications can deal with
|
This isn't a trivial problem, but applications can deal with
|
||||||
this - by using congestion notifications, and by checking for
|
this - by using congestion notifications, and by checking for
|
||||||
ENOBUFS errors returned by sendmsg.
|
ENOBUFS errors returned by sendmsg.
|
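The three sendmsg() error cases and the POLLOUT caveat above suggest that a sender should branch on errno rather than rely on poll() alone. The following is a sketch of that decision only; the enum and function names are illustrative assumptions, and the errno meanings are taken from the text of this document.

```c
#include <errno.h>

/* Hypothetical names: what a sender might do after sendmsg() fails. */
enum rds_send_next {
    RDS_NEXT_FAIL,           /* unrelated error: report it to the caller */
    RDS_NEXT_SHRINK,         /* EMSGSIZE: message exceeds the send buffer size */
    RDS_NEXT_WAIT_POLLOUT,   /* EAGAIN: send queue full, wait for POLLOUT */
    RDS_NEXT_WAIT_UNCONGEST  /* ENOBUFS: destination congested, POLLOUT will not help */
};

/* Map an errno value from a failed sendmsg() to the next action. */
static enum rds_send_next rds_classify_send_errno(int err)
{
    switch (err) {
    case EMSGSIZE:
        return RDS_NEXT_SHRINK;
    case EAGAIN:
        return RDS_NEXT_WAIT_POLLOUT;
    case ENOBUFS:
        return RDS_NEXT_WAIT_UNCONGEST;
    default:
        return RDS_NEXT_FAIL;
    }
}
```

Keeping the ENOBUFS case separate matters because POLLOUT stays asserted while the send queue has room, so a loop that retries on POLLOUT alone would spin against a congested destination. In that case the sender would instead wait for a congestion notification, as the text describes.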
||||||
|
|
||||||
setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
|
setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
|
||||||
This allows the application to discard all messages queued to a
|
This allows the application to discard all messages queued to a
|
||||||
specific destination on this particular socket.
|
specific destination on this particular socket.
|
||||||
|
|
||||||
This allows the application to cancel outstanding messages if
|
This allows the application to cancel outstanding messages if
|
||||||
it detects a timeout. For instance, if it tried to send a message,
|
it detects a timeout. For instance, if it tried to send a message,
|
||||||
and the remote host is unreachable, RDS will keep trying forever.
|
and the remote host is unreachable, RDS will keep trying forever.
|
||||||
The application may decide it's not worth it, and cancel the
|
The application may decide it's not worth it, and cancel the
|
||||||
operation. In this case, it would use RDS_CANCEL_SENT_TO to
|
operation. In this case, it would use RDS_CANCEL_SENT_TO to
|
||||||
nuke any pending messages.
|
nuke any pending messages.
|
||||||
|
|
||||||
setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
|
``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
|
||||||
getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
|
|
||||||
Set or read an integer defining the underlying
|
Set or read an integer defining the underlying
|
||||||
encapsulating transport to be used for RDS packets on the
|
encapsulating transport to be used for RDS packets on the
|
||||||
socket. When setting the option, integer argument may be
|
socket. When setting the option, integer argument may be
|
||||||
|
@ -180,32 +187,39 @@ RDS Protocol
|
||||||
Message header
|
Message header
|
||||||
|
|
||||||
The message header is a 'struct rds_header' (see rds.h):
|
The message header is a 'struct rds_header' (see rds.h):
|
||||||
|
|
||||||
Fields:
|
Fields:
|
||||||
|
|
||||||
h_sequence:
|
h_sequence:
|
||||||
per-packet sequence number
|
per-packet sequence number
|
||||||
h_ack:
|
h_ack:
|
||||||
piggybacked acknowledgment of last packet received
|
piggybacked acknowledgment of last packet received
|
||||||
h_len:
|
h_len:
|
||||||
length of data, not including header
|
length of data, not including header
|
||||||
h_sport:
|
h_sport:
|
||||||
source port
|
source port
|
||||||
h_dport:
|
h_dport:
|
||||||
destination port
|
destination port
|
||||||
h_flags:
|
h_flags:
|
||||||
CONG_BITMAP - this is a congestion update bitmap
|
Can be:
|
||||||
ACK_REQUIRED - receiver must ack this packet
|
|
||||||
RETRANSMITTED - packet has previously been sent
|
============= ==================================
|
||||||
|
CONG_BITMAP this is a congestion update bitmap
|
||||||
|
ACK_REQUIRED receiver must ack this packet
|
||||||
|
RETRANSMITTED packet has previously been sent
|
||||||
|
============= ==================================
|
||||||
|
|
||||||
h_credit:
|
h_credit:
|
||||||
indicate to other end of connection that
|
indicate to other end of connection that
|
||||||
it has more credits available (i.e. there is
|
it has more credits available (i.e. there is
|
||||||
more send room)
|
more send room)
|
||||||
h_padding[4]:
|
h_padding[4]:
|
||||||
unused, for future use
|
unused, for future use
|
||||||
h_csum:
|
h_csum:
|
||||||
header checksum
|
header checksum
|
||||||
h_exthdr:
|
h_exthdr:
|
||||||
optional data can be passed here. This is currently used for
|
optional data can be passed here. This is currently used for
|
||||||
passing RDMA-related information.
|
passing RDMA-related information.
|
||||||
|
|
||||||
ACK and retransmit handling
|
ACK and retransmit handling
|
||||||
|
|
@@ -260,7 +274,7 @@ RDS Protocol


 RDS Transport Layer
-==================
+===================

 As mentioned above, RDS is not IB-specific. Its code is divided
 into a general RDS layer and a transport layer.
@ -281,19 +295,25 @@ RDS Kernel Structures
|
||||||
be sent and sets header fields as needed, based on the socket API.
|
be sent and sets header fields as needed, based on the socket API.
|
||||||
This is then queued for the individual connection and sent by the
|
This is then queued for the individual connection and sent by the
|
||||||
connection's transport.
|
connection's transport.
|
||||||
|
|
||||||
struct rds_incoming
|
struct rds_incoming
|
||||||
a generic struct referring to incoming data that can be handed from
|
a generic struct referring to incoming data that can be handed from
|
||||||
the transport to the general code and queued by the general code
|
the transport to the general code and queued by the general code
|
||||||
while the socket is awoken. It is then passed back to the transport
|
while the socket is awoken. It is then passed back to the transport
|
||||||
code to handle the actual copy-to-user.
|
code to handle the actual copy-to-user.
|
||||||
|
|
||||||
struct rds_socket
|
struct rds_socket
|
||||||
per-socket information
|
per-socket information
|
||||||
|
|
||||||
struct rds_connection
|
struct rds_connection
|
||||||
per-connection information
|
per-connection information
|
||||||
|
|
||||||
struct rds_transport
|
struct rds_transport
|
||||||
pointers to transport-specific functions
|
pointers to transport-specific functions
|
||||||
|
|
||||||
struct rds_statistics
|
struct rds_statistics
|
||||||
non-transport-specific statistics
|
non-transport-specific statistics
|
||||||
|
|
||||||
struct rds_cong_map
|
struct rds_cong_map
|
||||||
wraps the raw congestion bitmap, contains rbnode, waitq, etc.
|
wraps the raw congestion bitmap, contains rbnode, waitq, etc.
|
||||||
|
|
||||||
|
@ -317,53 +337,58 @@ The send path
|
||||||
=============
|
=============
|
||||||
|
|
||||||
rds_sendmsg()
|
rds_sendmsg()
|
||||||
struct rds_message built from incoming data
|
- struct rds_message built from incoming data
|
||||||
CMSGs parsed (e.g. RDMA ops)
|
- CMSGs parsed (e.g. RDMA ops)
|
||||||
transport connection alloced and connected if not already
|
- transport connection alloced and connected if not already
|
||||||
rds_message placed on send queue
|
- rds_message placed on send queue
|
||||||
send worker awoken
|
- send worker awoken
|
||||||
|
|
||||||
rds_send_worker()
|
rds_send_worker()
|
||||||
calls rds_send_xmit() until queue is empty
|
- calls rds_send_xmit() until queue is empty
|
||||||
|
|
||||||
rds_send_xmit()
|
rds_send_xmit()
|
||||||
transmits congestion map if one is pending
|
- transmits congestion map if one is pending
|
||||||
may set ACK_REQUIRED
|
- may set ACK_REQUIRED
|
||||||
calls transport to send either non-RDMA or RDMA message
|
- calls transport to send either non-RDMA or RDMA message
|
||||||
(RDMA ops never retransmitted)
|
(RDMA ops never retransmitted)
|
||||||
|
|
||||||
rds_ib_xmit()
|
rds_ib_xmit()
|
||||||
allocs work requests from send ring
|
- allocs work requests from send ring
|
||||||
adds any new send credits available to peer (h_credits)
|
- adds any new send credits available to peer (h_credits)
|
||||||
maps the rds_message's sg list
|
- maps the rds_message's sg list
|
||||||
piggybacks ack
|
- piggybacks ack
|
||||||
populates work requests
|
- populates work requests
|
||||||
post send to connection's queue pair
|
- post send to connection's queue pair
|
||||||
|
|
||||||
The recv path
|
The recv path
|
||||||
=============
|
=============
|
||||||
|
|
||||||
rds_ib_recv_cq_comp_handler()
|
rds_ib_recv_cq_comp_handler()
|
||||||
looks at write completions
|
- looks at write completions
|
||||||
unmaps recv buffer from device
|
- unmaps recv buffer from device
|
||||||
no errors, call rds_ib_process_recv()
|
- no errors, call rds_ib_process_recv()
|
||||||
refill recv ring
|
- refill recv ring
|
||||||
|
|
||||||
rds_ib_process_recv()
|
rds_ib_process_recv()
|
||||||
validate header checksum
|
- validate header checksum
|
||||||
copy header to rds_ib_incoming struct if start of a new datagram
|
- copy header to rds_ib_incoming struct if start of a new datagram
|
||||||
add to ibinc's fraglist
|
- add to ibinc's fraglist
|
||||||
if competed datagram:
|
- if competed datagram:
|
||||||
update cong map if datagram was cong update
|
- update cong map if datagram was cong update
|
||||||
call rds_recv_incoming() otherwise
|
- call rds_recv_incoming() otherwise
|
||||||
note if ack is required
|
- note if ack is required
|
||||||
|
|
||||||
rds_recv_incoming()
|
rds_recv_incoming()
|
||||||
drop duplicate packets
|
- drop duplicate packets
|
||||||
respond to pings
|
- respond to pings
|
||||||
find the sock associated with this datagram
|
- find the sock associated with this datagram
|
||||||
add to sock queue
|
- add to sock queue
|
||||||
wake up sock
|
- wake up sock
|
||||||
do some congestion calculations
|
- do some congestion calculations
|
||||||
rds_recvmsg
|
rds_recvmsg
|
||||||
copy data into user iovec
|
- copy data into user iovec
|
||||||
handle CMSGs
|
- handle CMSGs
|
||||||
return to application
|
- return to application
|
||||||
|
|
||||||
Multipath RDS (mprds)
|
Multipath RDS (mprds)
|
||||||
=====================
|
=====================
|
@@ -14219,7 +14219,7 @@ L: linux-rdma@vger.kernel.org
 L: rds-devel@oss.oracle.com (moderated for non-subscribers)
 S: Supported
 W: https://oss.oracle.com/projects/rds/
-F: Documentation/networking/rds.txt
+F: Documentation/networking/rds.rst
 F: net/rds/

 RDT - RESOURCE ALLOCATION