mirror of https://gitee.com/openkylin/linux.git
RDMA/rtrs: a bit of documentation
README with description of major sysfs entries, sysfs documentation has been moved to ABI dir as suggested by Bart. Link: https://lore.kernel.org/r/20200511135131.27580-15-danil.kipnis@cloud.ionos.com Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com> Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This commit is contained in:
parent
c013fbc1fd
commit
745b6a3d4a
|
@ -0,0 +1,131 @@
|
|||
What: /sys/class/rtrs-client
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: When a user of RTRS API creates a new session, a directory entry with
|
||||
the name of that session is created under /sys/class/rtrs-client/<session-name>/
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/add_path
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RW, adds a new path (connection) to an existing session. Expected format is the
|
||||
following:
|
||||
|
||||
<[source addr,]destination addr>
|
||||
*addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/max_reconnect_attempts
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Maximum number reconnect attempts the client should make before giving up
|
||||
after connection breaks unexpectedly.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/mp_policy
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Multipath policy specifies which path should be selected on each IO:
|
||||
|
||||
round-robin (0):
|
||||
select path in per CPU round-robin manner.
|
||||
|
||||
min-inflight (1):
|
||||
select path with minimum inflights.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Each path belonging to a given session is listed here by its source and
|
||||
destination address. When a new path is added to a session by writing to
|
||||
the "add_path" entry, a directory <src@dst> is created.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/state
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains "connected" if the session is connected to the peer and fully
|
||||
functional. Otherwise the file contains "disconnected"
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/reconnect
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Write "1" to the file in order to reconnect the path.
|
||||
Operation is blocking and returns 0 if reconnect was successful.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/disconnect
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Write "1" to the file in order to disconnect the path.
|
||||
Operation blocks until RTRS path is disconnected.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/remove_path
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Write "1" to the file in order to disconnected and remove the path
|
||||
from the session. Operation blocks until the path is disconnected
|
||||
and removed from the session.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_name
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the the name of HCA the connection established on.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_port
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the port number of active port traffic is going through.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/src_addr
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the source address of the path
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/dst_addr
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the destination address of the path
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/reset_all
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RW, Read will return usage help, write 0 will clear all the statistics.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/cpu_migration
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RTRS expects that each HCA IRQ is pinned to a separate CPU. If it's
|
||||
not the case, the processing of an I/O response could be processed on a
|
||||
different CPU than where it was originally submitted. This file shows
|
||||
how many interrupts where generated on a non expected CPU.
|
||||
"from:" is the CPU on which the IRQ was expected, but not generated.
|
||||
"to:" is the CPU on which the IRQ was generated, but not expected.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/reconnects
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Contains 2 unsigned int values, the first one records number of successful
|
||||
reconnects in the path lifetime, the second one records number of failed
|
||||
reconnects in the path lifetime.
|
||||
|
||||
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/rdma
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Contains statistics regarding rdma operations and inflight operations.
|
||||
The output consists of 6 values:
|
||||
|
||||
<read-count> <read-total-size> <write-count> <write-total-size> \
|
||||
<inflights> <failovered>
|
|
@ -0,0 +1,53 @@
|
|||
What: /sys/class/rtrs-server
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: When a user of RTRS API creates a new session on a client side, a
|
||||
directory entry with the name of that session is created in here.
|
||||
|
||||
What: /sys/class/rtrs-server/<session-name>/paths/
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: When new path is created by writing to "add_path" entry on client side,
|
||||
a directory entry named as <source address>@<destination address> is created
|
||||
on server.
|
||||
|
||||
What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/disconnect
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: When "1" is written to the file, the RTRS session is being disconnected.
|
||||
Operations is non-blocking and returns control immediately to the caller.
|
||||
|
||||
What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_name
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the the name of HCA the connection established on.
|
||||
|
||||
What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_port
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the port number of active port traffic is going through.
|
||||
|
||||
What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/src_addr
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the source address of the path
|
||||
|
||||
What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/dst_addr
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: RO, Contains the destination address of the path
|
||||
|
||||
What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/stats/rdma
|
||||
Date: Feb 2020
|
||||
KernelVersion: 5.7
|
||||
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
|
||||
Description: Contains statistics regarding rdma operations and inflight operations.
|
||||
The output consists of 5 values:
|
||||
<read-count> <read-total-size> <write-count> <write-total-size> <inflights>
|
|
@ -0,0 +1,213 @@
|
|||
****************************
|
||||
RDMA Transport (RTRS)
|
||||
****************************
|
||||
|
||||
RTRS (RDMA Transport) is a reliable high speed transport library
|
||||
which provides support to establish optimal number of connections
|
||||
between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
|
||||
transport. It is optimized to transfer (read/write) IO blocks.
|
||||
|
||||
In its core interface it follows the BIO semantics of providing the
|
||||
possibility to either write data from an sg list to the remote side
|
||||
or to request ("read") data transfer from the remote side into a given
|
||||
sg list.
|
||||
|
||||
RTRS provides I/O fail-over and load-balancing capabilities by using
|
||||
multipath I/O (see "add_path" and "mp_policy" configuration entries in
|
||||
Documentation/ABI/testing/sysfs-class-rtrs-client).
|
||||
|
||||
RTRS is used by the RNBD (RDMA Network Block Device) modules.
|
||||
|
||||
==================
|
||||
Transport protocol
|
||||
==================
|
||||
|
||||
Overview
|
||||
--------
|
||||
An established connection between a client and a server is called rtrs
|
||||
session. A session is associated with a set of memory chunks reserved on the
|
||||
server side for a given client for rdma transfer. A session
|
||||
consists of multiple paths, each representing a separate physical link
|
||||
between client and server. Those are used for load balancing and failover.
|
||||
Each path consists of as many connections (QPs) as there are cpus on
|
||||
the client.
|
||||
|
||||
When processing an incoming write or read request, rtrs client uses memory
|
||||
chunks reserved for him on the server side. Their number, size and addresses
|
||||
need to be exchanged between client and server during the connection
|
||||
establishment phase. Apart from the memory related information client needs to
|
||||
inform the server about the session name and identify each path and connection
|
||||
individually.
|
||||
|
||||
On an established session client sends to server write or read messages.
|
||||
Server uses immediate field to tell the client which request is being
|
||||
acknowledged and for errno. Client uses immediate field to tell the server
|
||||
which of the memory chunks has been accessed and at which offset the message
|
||||
can be found.
|
||||
|
||||
Module parameter always_invalidate is introduced for the security problem
|
||||
discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we
|
||||
invalidate each rdma buffer before we hand it over to RNBD server and
|
||||
then pass it to the block layer. A new rkey is generated and registered for the
|
||||
buffer after it returns back from the block layer and RNBD server.
|
||||
The new rkey is sent back to the client along with the IO result.
|
||||
The procedure is the default behaviour of the driver. This invalidation and
|
||||
registration on each IO causes performance drop of up to 20%. A user of the
|
||||
driver may choose to load the modules with this mechanism switched off
|
||||
(always_invalidate=N), if he understands and can take the risk of a malicious
|
||||
client being able to corrupt memory of a server it is connected to. This might
|
||||
be a reasonable option in a scenario where all the clients and all the servers
|
||||
are located within a secure datacenter.
|
||||
|
||||
|
||||
Connection establishment
|
||||
------------------------
|
||||
|
||||
1. Client starts establishing connections belonging to a path of a session one
|
||||
by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
|
||||
Those include uuid of the session and uuid of the path to be
|
||||
established. They are used by the server to find a persisting session/path or
|
||||
to create a new one when necessary. The message also contains the protocol
|
||||
version and magic for compatibility, total number of connections per session
|
||||
(as many as cpus on the client), the id of the current connection and
|
||||
the reconnect counter, which is used to resolve the situations where
|
||||
client is trying to reconnect a path, while server is still destroying the old
|
||||
one.
|
||||
|
||||
2. Server accepts the connection requests one by one and attaches
|
||||
RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from magic and
|
||||
protocol version, the messages include error code, queue depth supported by
|
||||
the server (number of memory chunks which are going to be allocated for that
|
||||
session) and the maximum size of one io, RTRS_MSG_NEW_RKEY_F flags is set
|
||||
when always_invalidate=Y.
|
||||
|
||||
3. After all connections of a path are established client sends to server the
|
||||
RTRS_MSG_INFO_REQ message, containing the name of the session. This message
|
||||
requests the address information from the server.
|
||||
|
||||
4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
|
||||
which contains the addresses and keys of the RDMA buffers allocated for that
|
||||
session.
|
||||
|
||||
5. Session becomes connected after all paths to be established are connected
|
||||
(i.e. steps 1-4 finished for all paths requested for a session)
|
||||
|
||||
6. Server and client exchange periodically heartbeat messages (empty rdma
|
||||
messages with an immediate field) which are used to detect a crash on remote
|
||||
side or network outage in an absence of IO.
|
||||
|
||||
7. On any RDMA related error or in the case of a heartbeat timeout, the
|
||||
corresponding path is disconnected, all the inflight IO are failed over to a
|
||||
healthy path, if any, and the reconnect mechanism is triggered.
|
||||
|
||||
CLT SRV
|
||||
*for each connection belonging to a path and for each path:
|
||||
RTRS_MSG_CON_REQ ------------------->
|
||||
<------------------- RTRS_MSG_CON_RSP
|
||||
...
|
||||
*after all connections are established:
|
||||
RTRS_MSG_INFO_REQ ------------------->
|
||||
<------------------- RTRS_MSG_INFO_RSP
|
||||
*heartbeat is started from both sides:
|
||||
-------------------> [RTRS_HB_MSG_IMM]
|
||||
[RTRS_HB_MSG_ACK] <-------------------
|
||||
[RTRS_HB_MSG_IMM] <-------------------
|
||||
-------------------> [RTRS_HB_MSG_ACK]
|
||||
|
||||
IO path
|
||||
-------
|
||||
|
||||
* Write (always_invalidate=N) *
|
||||
|
||||
1. When processing a write request client selects one of the memory chunks
|
||||
on the server side and rdma writes there the user data, user header and the
|
||||
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
|
||||
contains size of the user header. The client tells the server which chunk has
|
||||
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
|
||||
using the IMM field.
|
||||
|
||||
2. When confirming a write request server sends an "empty" rdma message with
|
||||
an immediate field. The 32 bit field is used to specify the outstanding
|
||||
inflight IO and for the error code.
|
||||
|
||||
CLT SRV
|
||||
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
|
||||
[RTRS_IO_RSP_IMM] <----------------- (id + errno)
|
||||
|
||||
* Write (always_invalidate=Y) *
|
||||
|
||||
1. When processing a write request client selects one of the memory chunks
|
||||
on the server side and rdma writes there the user data, user header and the
|
||||
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
|
||||
contains size of the user header. The client tells the server which chunk has
|
||||
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
|
||||
using the IMM field, Server invalidate rkey associated to the memory chunks
|
||||
first, when it finishes, pass the IO to RNBD server module.
|
||||
|
||||
2. When confirming a write request server sends an "empty" rdma message with
|
||||
an immediate field. The 32 bit field is used to specify the outstanding
|
||||
inflight IO and for the error code. The new rkey is sent back using
|
||||
SEND_WITH_IMM WR, client When it recived new rkey message, it validates
|
||||
the message and finished IO after update rkey for the rbuffer, then post
|
||||
back the recv buffer for later use.
|
||||
|
||||
CLT SRV
|
||||
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
|
||||
[RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP)
|
||||
[RTRS_IO_RSP_IMM] <----------------- (id + errno)
|
||||
|
||||
|
||||
* Read (always_invalidate=N)*
|
||||
|
||||
1. When processing a read request client selects one of the memory chunks
|
||||
on the server side and rdma writes there the user header and the
|
||||
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
|
||||
the user header, flags (specifying if memory invalidation is necessary) and the
|
||||
list of addresses along with keys for the data to be read into.
|
||||
|
||||
2. When confirming a read request server transfers the requested data first,
|
||||
attaches an invalidation message if requested and finally an "empty" rdma
|
||||
message with an immediate field. The 32 bit field is used to specify the
|
||||
outstanding inflight IO and the error code.
|
||||
|
||||
CLT SRV
|
||||
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
|
||||
[RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno)
|
||||
or in case client requested invalidation:
|
||||
[RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno)
|
||||
|
||||
* Read (always_invalidate=Y)*
|
||||
|
||||
1. When processing a read request client selects one of the memory chunks
|
||||
on the server side and rdma writes there the user header and the
|
||||
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
|
||||
the user header, flags (specifying if memory invalidation is necessary) and the
|
||||
list of addresses along with keys for the data to be read into.
|
||||
Server invalidate rkey associated to the memory chunks first, when it finishes,
|
||||
passes the IO to RNBD server module.
|
||||
|
||||
2. When confirming a read request server transfers the requested data first,
|
||||
attaches an invalidation message if requested and finally an "empty" rdma
|
||||
message with an immediate field. The 32 bit field is used to specify the
|
||||
outstanding inflight IO and the error code. The new rkey is sent back using
|
||||
SEND_WITH_IMM WR, client When it recived new rkey message, it validates
|
||||
the message and finished IO after update rkey for the rbuffer, then post
|
||||
back the recv buffer for later use.
|
||||
|
||||
CLT SRV
|
||||
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
|
||||
[RTRS_IO_RSP_IMM] <-------------- usr_data + (id + errno)
|
||||
[RTRS_MSG_RKEY_RSP] <----------------- (RTRS_MSG_RKEY_RSP)
|
||||
or in case client requested invalidation:
|
||||
[RTRS_IO_RSP_IMM_W_INV] <-------------- usr_data + (INV) + (id + errno)
|
||||
=========================================
|
||||
Contributors List(in alphabetical order)
|
||||
=========================================
|
||||
Danil Kipnis <danil.kipnis@profitbricks.com>
|
||||
Fabian Holler <mail@fholler.de>
|
||||
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
|
||||
Jack Wang <jinpu.wang@profitbricks.com>
|
||||
Kleber Souza <kleber.souza@profitbricks.com>
|
||||
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
|
||||
Milind Dumbare <Milind.dumbare@gmail.com>
|
||||
Roman Penyaev <roman.penyaev@profitbricks.com>
|
Loading…
Reference in New Issue