The Subnet Administrator (SA) is not a component of the RoCE spec.
Therefore, it should not be a capability of a RoCE port.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support constant callers of agent_send_response we add const
specifiers to the its pointer arguments.
Adjust the call tree accordingly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The process_mad device function declares some parameters as "in". Make those
parameters const and adjust the call tree under process_mad in the various
drivers accordingly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ib_device passed to the new RDMA helpers is constant. Declare the
ib_device as const in the following functions.
rdma_protocol_ib
rdma_protocol_roce
rdma_protocol_iwarp
rdma_ib_or_roce
rdma_cap_ib_mad
rdma_cap_ib_smi
rdma_cap_ib_cm
rdma_cap_iw_cm
rdma_cap_ib_sa
rdma_cap_ib_mcast
rdma_cap_af_ib
rdma_cap_eth_ah
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
After discussion upstream, it was agreed to transition the usage of iboe
in the kernel to roce. This keeps our terminology consistent with what
was finalized in the IBTA Annex 16 and IBTA Annex 17 publications.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As of commit 5eb620c81c "IB/core: Add helpers for uncached GID and P_Key
searches"; pkey_tbl_len and gid_tbl_len are immutable data which are stored in
the ib_device.
The per port core capability flags to be added later are also immutable data to
be stored in the ib_device object.
In preparation for this create a structure for per port immutable data and
place the pkey and gid table lengths within this structure.
"get_port_immutable" is added as a mandatory device function to allow the
drivers to fill in this data.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some of us keep revisiting the code to decode enumerations that
appear in out logs. Let's borrow the nice logging helpers that
exists in xprtrdma and rds for CMA events, IB events and WC statuses.
Reviewd-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously start_port and end_port were defined in 2 places, cache.c and
device.c and this prevented their use in other modules.
Make these common functions, change the name to reflect the rdma
name space, and update existing users.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Increase the level of documentation for the rdma_cap_* helpers
introduced by Michael Wang <yun.wang@profitbricks.com>.
This patch is loosely based on a patch Michael wrote to enhance the
documentation of these functions, but has been significantly modified
in terms of verbiage. In addition, the comments were moved from a kernel
Documentation/infiniband/ file to being inline in the header file itself
for the functions in question. Finally, the documentation was formated
in proper kdoc format.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_eth_ah() to help us check if the port of an
IB device support Ethernet Address Handler.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_af_ib() to help us check if the port of an
IB device support Native Infiniband Address.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_read_multi_sge() to help us check if the port of an
IB device support RDMA Read Multiple Scatter-Gather Entries.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_mcast() to help us check if the port of an
IB device support Infiniband Multicast.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_sa() to help us check if the port of an
IB device support Infiniband Subnet Administration.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_iw_cm() to help us check if the port of an
IB device support IWARP Communication Manager.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_cm() to help us check if the port of an
IB device support Infiniband Communication Manager.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_smi() to help us check if the port of an
IB device support Infiniband Subnet Management Interface.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_mad() to help us check if the port of an
IB device support Infiniband Management Datagrams.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add raw helpers:
rdma_protocol_ib
rdma_protocol_iboe
rdma_protocol_iwarp
rdma_ib_or_iboe (transition, clean up later)
To help us detect which technology the port supported.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While commit 7e36ef8205 ("IB/core: Temporarily disable
ex_query_device uverb") is correct as it makes the extended
QUERY_DEVICE uverb (which came as part of commit 5a77abf9a9
("IB/core: Add support for extended query device caps") and commit
860f10a799 ("IB/core: Add flags for on demand paging support")) not
available to userspace, it doesn't address the initial issue regarding
ib_copy_to_udata() [1][2].
Additionally, further discussions around this new uverb seems to
conclude it would require a different data structure than the one
currently described in <rdma/ib_user_verbs.h> [3].
Both of these issues require a revert of the changes, so this patch
partially reverts commit 8cdd312cfe ("IB/mlx5: Implement the ODP
capability query verb") and commit 860f10a799 ("IB/core: Add flags
for on demand paging support") and fully reverts commit 5a77abf9a9
("IB/core: Add support for extended query device caps").
[1] "Re: [PATCH v3 06/17] IB/core: Add support for extended query device caps"
http://mid.gmane.org/1418733236.2779.26.camel@opteya.com
[2] "Re: [PATCH] IB/core: Temporarily disable ex_query_device uverb"
http://mid.gmane.org/1423067503.3030.83.camel@opteya.com
[3] "RE: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask"
http://mid.gmane.org/2807E5FD2F6FDA4886F6618EAC48510E0CC12C30@CRSMSX101.amr.corp.intel.com
Cc: Eli Cohen <eli@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Add an interval tree implementation for ODP umems. Create an
interval tree for each ucontext (including a count of the number of
ODP MRs in this context, semaphore, etc.), and register ODP umems in
the interval tree.
* Add MMU notifiers handling functions, using the interval tree to
notify only the relevant umems and underlying MRs.
* Register to receive MMU notifier events from the MM subsystem upon
ODP MR registration (and unregister accordingly).
* Add a completion object to synchronize the destruction of ODP umems.
* Add mechanism to abort page faults when there's a concurrent invalidation.
The way we synchronize between concurrent invalidations and page
faults is by keeping a counter of currently running invalidations, and
a sequence number that is incremented whenever an invalidation is
caught. The page fault code checks the counter and also verifies that
the sequence number hasn't progressed before it updates the umem's
page tables. This is similar to what the kvm module does.
In order to prevent the case where we register a umem in the middle of
an ongoing notifier, we also keep a per ucontext counter of the total
number of active mmu notifiers. We only enable new umems when all the
running notifiers complete.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yuval Dagan <yuvalda@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Extend the umem struct to keep the ODP related data.
* Allocate and initialize the ODP related information in the umem
(page_list, dma_list) and freeing as needed in the end of the run.
* Store a reference to the process PID struct in the ucontext. Used to
safely obtain the task_struct and the mm during fault handling,
without preventing the task destruction if needed.
* Add 2 helper functions: ib_umem_odp_map_dma_pages and
ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses
of specific pages of the umem (and, currently, pin them).
* Support for page faults only - IB core will keep the reference on
the pages used and call put_page when freeing an ODP umem
area. Invalidations support will be added in a later patch.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Add a configuration option for enable on-demand paging support in
the infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a
later patch, this configuration option will select the MMU_NOTIFIER
configuration option to enable mmu notifiers.
* Add a flag for on demand paging (ODP) support in the IB device capabilities.
* Add a flag to request ODP MR in the access flags to reg_mr.
* Fail registrations done with the ODP flag when the low-level driver
doesn't support this.
* Change the conditions in which an MR will be writable to explicitly
specify the access flags. This is to avoid making an MR writable just
because it is an ODP MR.
* Add a ODP capabilities to the extended query device verb.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add extensible query device capabilities verb to allow adding new features.
ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to
copy capability fields to be used by both ib_uverbs_query_device and
ib_uverbs_ex_query_device.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Expose more signature setting parameters. We modify the signature API
to allow usage of some new execution parameters relevant to data
integrity feature.
This patch modifies ib_sig_domain structure by:
- Deprecate DIF type in signature API (operation will
be determined by the parameters alone, no DIF type awareness)
- Add APPTAG check bitmask (for input domain)
- Add REFTAG remap (increment) flag for each domain
- Add APPTAG/REFTAG escape options for each domain
The mlx5 driver is modified to follow the new parameters in HW
signature setup.
At the moment the callers (iser/isert) hard-code new parameters (by
DIF type). In the future, callers will retrieve them from the scsi
command structure.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Memory re-registration is a feature that enables changing the
attributes of a memory region registered by user-space, including PD,
translation (address and length) and access flags.
Add the required support in uverbs and the kernel verbs API.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix a few functions that are declared with __attribute_const__ in the
ib_verbs.h header file but defined without it in verbs.c. This gets rid
of the following sparse warnings:
drivers/infiniband/core/verbs.c:51:5: error: symbol 'ib_rate_to_mult' redeclared with different type (originally declared at include/rdma/ib_verbs.h:469) - different modifiers
drivers/infiniband/core/verbs.c:68:14: error: symbol 'mult_to_ib_rate' redeclared with different type (originally declared at include/rdma/ib_verbs.h:607) - different modifiers
drivers/infiniband/core/verbs.c:85:5: error: symbol 'ib_rate_to_mbps' redeclared with different type (originally declared at include/rdma/ib_verbs.h:476) - different modifiers
drivers/infiniband/core/verbs.c:111:1: error: symbol 'rdma_node_get_transport' redeclared with different type (originally declared at include/rdma/ib_verbs.h:84) - different modifiers
Signed-off-by: Roland Dreier <roland@purestorage.com>
This addresses a problem where NFS client writes over IPoIB connected
mode may deadlock on memory allocation/writeback.
The problem is not directly memory reclamation. There is an indirect
dependency between network filesystems writing back pages and
ipoib_cm_tx_init() due to how a kworker is used. Page reclaim cannot
make forward progress until ipoib_cm_tx_init() succeeds and it is
stuck in page reclaim itself waiting for network transmission.
Ordinarily this situation may be avoided by having the caller use
GFP_NOFS but ipoib_cm_tx_init() does not have that information.
To address this, take a general approach and add a new QP creation
flag that tells the low-level hardware driver to use GFP_NOIO for the
memory allocations related to the new QP.
Use the new flag in the ipoib connected mode path, and if the driver
doesn't support it, re-issue the QP creation without the flag.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code is replaced by driver specific changes and avoids the pointer
NULL test for drivers that don't overload these operations.
Suggested-by: <Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Tested-by: Vinod Kumar <vinod.kumar@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Introduce a verbs interface for signature-related operations. A
signature handover operation configures the layouts of data and
protection attributes both in memory and wire domains.
Signature operations are:
- INSERT:
Generate and insert protection information when handing over
data from input space to output space.
- validate and STRIP:
Validate protection information and remove it when handing over
data from input space to output space.
- validate and PASS:
Validate protection information and pass it when handing over
data from input space to output space.
Once the signature handover opration is done, the HCA will offload
data integrity generation/validation while performing the actual data
transfer.
Additions:
1. HCA signature capabilities in device attributes
Verbs provider supporting signature handover operations fills
relevant fields in device attributes structure returned by
ib_query_device.
2. QP creation flag IB_QP_CREATE_SIGNATURE_EN
Creating a QP that will carry signature handover operations may
require some special preparations from the verbs provider. So we
add QP creation flag IB_QP_CREATE_SIGNATURE_EN to declare that the
created QP may carry out signature handover operations. Expose
signature support to verbs layer (no support for now).
3. New send work request IB_WR_REG_SIG_MR
Signature handover work request. This WR will define the signature
handover properties of the memory/wire domains as well as the
domains layout. The purpose of this work request is to bind all
the needed information for the signature operation:
- data to be transferred: wr->sg_list (ib_sge).
* The raw data, pre-registered to a single MR (normally, before
signature, this MR would have been used directly for the data
transfer)
- data protection guards: sig_handover.prot (ib_sge).
* The data protection buffer, pre-registered to a single MR, which
contains the data integrity guards of the raw data blocks.
Note that it may not always exist, only in cases where the user is
interested in storing protection guards in memory.
- signature operation attributes: sig_handover.sig_attrs.
* Tells the HCA how to validate/generate the protection information.
Once the work request is executed, the memory region that will
describe the signature transaction will be the sig_mr. The
application can now go ahead and send the sig_mr.rkey or use the
sig_mr.lkey for data transfer.
4. New Verb ib_check_mr_status
check_mr_status verb checks the status of the memory region post
transaction. The first check that may be used is
IB_MR_CHECK_SIG_STATUS, which will indicate if any signature
errors are pending for a specific signature-enabled ib_mr. This
verb is a lightwight check and is allowed to be taken from
interrupt context. An application must call this verb after it is
known that the actual data transfer has finished.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This commit introduces verbs for creating/destoying memory
regions which will allow new types of memory key operations such
as protected memory registration.
Indirect memory registration is registering several (one
of more) pre-registered memory regions in a specific layout.
The Indirect region may potentialy describe several regions
and some repitition format between them.
Protected Memory registration is registering a memory region
with various data integrity attributes which will describe protection
schemes that will be handled by the HCA in an offloaded manner.
These memory regions will be applicable for a new REG_SIG_MR
work request introduced later in this patchset.
In the future these routines may replace or implement current memory
regions creation routines existing today:
- ib_reg_user_mr
- ib_alloc_fast_reg_mr
- ib_get_dma_mr
- ib_dereg_mr
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For userspace RoCE UD QPs we need to know the GID format that the
kernel uses, e.g when working over older kernels. For that end, add a
new port capability IB_PORT_IP_BASED_GIDS and report it when query
port is issued.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add the complementary RDMA_NODE_USNIC_UDP for RDMA_TRANSPORT_USNIC_UDP.
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch add the support for Ethernet L2 attributes in the
verbs/cm/cma structures.
When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and priority
in a similar manner that the IB L2 (and the L4 PKEY) attributes are used.
Thus, those attributes were added to the following structures:
* ib_ah_attr - added dmac
* ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
* ib_wc - added smac, vlan_id
* ib_sa_path_rec - added smac, dmac, vlan_id
* cm_av - added smac and vlan_id
For the path record structure, extra care was taken to avoid the new
fields when packing it into wire format, so we don't break the IB CM
and SA wire protocol.
On the active side, the CM fills. its internal structures from the
path provided by the ULP. We add there taking the ETH L2 attributes
and placing them into the CM Address Handle (struct cm_av).
On the passive side, the CM fills its internal structures from the WC
associated with the REQ message. We add there taking the ETH L2
attributes from the WC.
When the HW driver provides the required ETH L2 attributes in the WC,
they set the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core
code checks for the presence of these flags, and in their absence does
address resolution from the ib_init_ah_from_wc() helper function.
ib_modify_qp_is_ok is also updated to consider the link layer. Some
parameters are mandatory for Ethernet link layer, while they are
irrelevant for IB. Vendor drivers are modified to support the new
function signature.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds preliminary support for IB L2 device-managed steering,
currently exposed only in the kernel.
This flow spec can be used by low-level drivers that need to indicate
the link layer type when creating device-managed flow rules.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When creating an IPoIB UD QP, provide a hint to the low level driver
that the QP should support flow-steering. This means that privileged
user space applications can steer TCP/IP IPoIB traffic from the
network stack, in a similar manner done with Ethernet RAW_PACKET QPs.
The hint is provided through new QP creation flag called NETIF_QP.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add RDMA_TRANSPORT_USNIC_UDP which will be used by usNIC.
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Userspace input buffer is not modified by kernel, so it can be 'const'.
This is also a prerequisite to remove the implicit cast
from INIT_UDATA().
Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 400dbc9658 ("IB/core: Infrastructure for extensible uverbs
commands") added an infrastructure for extensible uverbs commands
while later commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow
through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
using this new infrastructure.
According to the commit 400dbc9658, the purpose of this
infrastructure is to support passing around provider (eg. hardware)
specific buffers when userspace issue commands to the kernel, so that
it would be possible to extend uverbs (eg. core) buffers independently
from the provider buffers.
But the new kernel command function prototypes were not modified to
take advantage of this extension. This issue was exposed by Roland
Dreier in a previous review[1].
So the following patch is an attempt to a revised extensible command
infrastructure.
This improved extensible command infrastructure distinguish between
core (eg. legacy)'s command/response buffers from provider
(eg. hardware)'s command/response buffers: each extended command
implementing function is given a struct ib_udata to hold core
(eg. uverbs) input and output buffers, and another struct ib_udata to
hold the hw (eg. provider) input and output buffers.
Having those buffers identified separately make it easier to increase
one buffer to support extension without having to add some code to
guess the exact size of each command/response parts: This should make
the extended functions more reliable.
Additionally, instead of relying on command identifier being greater
than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure rely on
unused bits in command field: on the 32 bits provided by command
field, only 6 bits are really needed to encode the identifier of
commands currently supported by the kernel. (Even using only 6 bits
leaves room for about 23 new commands).
So this patch makes use of some high order bits in command field to
store flags, leaving enough room for more command identifiers than one
will ever need (eg. 256).
The new flags are used to specify if the command should be processed
as an extended one or a legacy one. While designing the new command
format, care was taken to make usage of flags itself extensible.
Using high order bits of the commands field ensure that newer
libibverbs on older kernel will properly fail when trying to call
extended commands. On the other hand, older libibverbs on newer kernel
will never be able to issue calls to extended commands.
The extended command header includes the optional response pointer so
that output buffer length and output buffer pointer are located
together in the command, allowing proper parameters checking. This
should make implementing functions easier and safer.
Additionally the extended header ensure 64bits alignment, while making
all sizes multiple of 8 bytes, extending the maximum buffer size:
legacy extended
Maximum command buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
Maximum response buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
For the purpose of doing proper buffer size accounting, the headers
size are no more taken in account in "in_words".
One of the odds of the current extensible infrastructure, reading
twice the "legacy" command header, is fixed by removing the "legacy"
command header from the extended command header: they are processed as
two different parts of the command: memory is read once and
information are not duplicated: it's making clear that's an extended
command scheme and not a different command scheme.
The proposed scheme will format input (command) and output (response)
buffers this way:
- command:
legacy header +
extended header +
command data (core + hw):
+----------------------------------------+
| flags | 00 00 | command |
| in_words | out_words |
+----------------------------------------+
| response |
| response |
| provider_in_words | provider_out_words |
| padding |
+----------------------------------------+
| |
. <uverbs input> .
. (in_words * 8) .
| |
+----------------------------------------+
| |
. <provider input> .
. (provider_in_words * 8) .
| |
+----------------------------------------+
- response, if present:
+----------------------------------------+
| |
. <uverbs output space> .
. (out_words * 8) .
| |
+----------------------------------------+
| |
. <provider output space> .
. (provider_out_words * 8) .
| |
+----------------------------------------+
The overall design is to ensure that the extensible infrastructure is
itself extensible while begin more reliable with more input and bound
checking.
Note:
The unused field in the extended header would be perfect candidate to
hold the command "comp_mask" (eg. bit field used to handle
compatibility). This was suggested by Roland Dreier in a previous
review[2]. But "comp_mask" field is likely to be present in the uverb
input and/or provider input, likewise for the response, as noted by
Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
header.
[1]:
http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
[2]:
http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
[3]:
http://marc.info/?i=525C1149.6000701@mellanox.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
[ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Enforce the rule that when requesting remote write or atomic permissions, local
write must be indicated as well. See IB spec 11.2.8.2.
Spotted by: Hagay Abramovsky <hagaya@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds new rdma node and new rdma transport, and supporting
code used by Cisco's low latency driver called usNIC. usNIC uses its
own transport, distinct from IB and iWARP.
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Don't allow unsupported comp_mask values, user should check
ibv_query_device to know which features are supported.
- Add a check in ib_uverbs_create_flow() to verify the size passed
from the user space.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement ib_uverbs_create_flow() and ib_uverbs_destroy_flow() to
support flow steering for user space applications.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The RDMA stack allows for applications to create IB_QPT_RAW_PACKET
QPs, which receive plain Ethernet packets, specifically packets that
don't carry any QPN to be matched by the receiving side. Applications
using these QPs must be provided with a method to program some
steering rule with the HW so packets arriving at the local port can be
routed to them.
This patch adds ib_create_flow(), which allow providing a flow
specification for a QP. When there's a match between the
specification and a received packet, the packet is forwarded to that
QP, in a the same way one uses ib_attach_multicast() for IB UD
multicast handling.
Flow specifications are provided as instances of struct ib_flow_spec_yyy,
which describe L2, L3 and L4 headers. Currently specs for Ethernet, IPv4,
TCP and UDP are defined. Flow specs are made of values and masks.
The input to ib_create_flow() is a struct ib_flow_attr, which contains
a few mandatory control elements and optional flow specs.
struct ib_flow_attr {
enum ib_flow_attr_type type;
u16 size;
u16 priority;
u32 flags;
u8 num_of_specs;
u8 port;
/* Following are the optional layers according to user request
* struct ib_flow_spec_yyy
* struct ib_flow_spec_zzz
*/
};
As these specs are eventually coming from user space, they are defined and
used in a way which allows adding new spec types without kernel/user ABI
change, just with a little API enhancement which defines the newly added spec.
The flow spec structures are defined with TLV (Type-Length-Value)
entries, which allows calling ib_create_flow() with a list of variable
length of optional specs.
For the actual processing of ib_flow_attr the driver uses the number
of specs and the size mandatory fields along with the TLV nature of
the specs.
Steering rules processing order is according to the domain over which
the rule is set and the rule priority. All rules set by user space
applicatations fall into the IB_FLOW_DOMAIN_USER domain, other domains
could be used by future IPoIB RFS and Ethetool flow-steering interface
implementation. Lower numerical value for the priority field means
higher priority.
The returned value from ib_create_flow() is a struct ib_flow, which
contains a database pointer (handle) provided by the HW driver to be
used when calling ib_destroy_flow().
Applications that offload TCP/IP traffic can also be written over IB
UD QPs. The ib_create_flow() / ib_destroy_flow() API is designed to
support UD QPs too. A HW driver can set IB_DEVICE_MANAGED_FLOW_STEERING
to denote support for flow steering.
The ib_flow_attr enum type supports usage of flow steering for promiscuous
and sniffer purposes:
IB_FLOW_ATTR_NORMAL - "regular" rule, steering according to rule specification
IB_FLOW_ATTR_ALL_DEFAULT - default unicast and multicast rule, receive
all Ethernet traffic which isn't steered to any QP
IB_FLOW_ATTR_MC_DEFAULT - same as IB_FLOW_ATTR_ALL_DEFAULT but only for multicast
IB_FLOW_ATTR_SNIFFER - sniffer rule, receive all port traffic
ALL_DEFAULT and MC_DEFAULT rules options are valid only for Ethernet link type.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix a potential race when event occurrs on a target XRC QP and in the
middle of reporting that on its shared qps, one of them is destroyed
by user space application. Also add note for kernel consumers in
ib_verbs.h that they must not destroy the QP from within the handler.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Continue the approach taken by commit d2b57063e4 ("IB/core: Reserve
bits in enum ib_qp_create_flags for low-level driver use") and add
reserved entries to the ib_qp_type and ib_wr_opcode enums. Low-level
drivers can then define macros to use these reserved values, giving
proper names to the macros for readability. Also add a range of
reserved flags to enum ib_send_flags.
The mlx5 IB driver uses the new additions.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch enhances the IB core support for Memory Windows (MWs).
MWs allow an application to have better/flexible control over remote
access to memory.
Two types of MWs are supported, with the second type having two flavors:
Type 1 - associated with PD only
Type 2A - associated with QPN only
Type 2B - associated with PD and QPN
Applications can allocate a MW once, and then repeatedly bind the MW
to different ranges in MRs that are associated to the same PD. Type 1
windows are bound through a verb, while type 2 windows are bound by
posting a work request.
The 32-bit memory key is composed of a 24-bit index and an 8-bit
key. The key is changed with each bind, thus allowing more control
over the peer's use of the memory key.
The changes introduced are the following:
* add memory window type enum and a corresponding parameter to ib_alloc_mw.
* type 2 memory window bind work request support.
* create a struct that contains the common part of the bind verb struct
ibv_mw_bind and the bind work request into a single struct.
* add the ib_inc_rkey helper function to advance the tag part of an rkey.
Consumer interface details:
* new device capability flags IB_DEVICE_MEM_WINDOW_TYPE_2A and
IB_DEVICE_MEM_WINDOW_TYPE_2B are added to indicate device support
for these features.
Devices can set either IB_DEVICE_MEM_WINDOW_TYPE_2A or
IB_DEVICE_MEM_WINDOW_TYPE_2B if it supports type 2A or type 2B
memory windows. It can set neither to indicate it doesn't support
type 2 windows at all.
* modify existing provides and consumers code to the new param of
ib_alloc_mw and the ib_mw_bind_info structure
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reserve bits 26-31 for internal use by low-level drivers. Two such
bits are used in the mlx4_b driver SR-IOV implementation.
These enum additions guarantee that the core layer will never use
these bits, so that low level drivers may safely make use of them.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
IB_QPT_RAW_PACKET allows applications to build a complete packet,
including L2 headers, when sending; on the receive side, the HW will
not strip any headers.
This QP type is designed for userspace direct access to Ethernet; for
example by applications that do TCP/IP themselves. Only processes
with the NET_RAW capability are allowed to create raw packet QPs (the
name "raw packet QP" is supposed to suggest an analogy to AF_PACKET /
SOL_RAW sockets).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Just as we don't allow PDs, CQs, etc. to be destroyed if there are QPs
that are attached to them, don't let a QP be destroyed if there are
multicast group(s) attached to it. Use the existing usecnt field of
struct ib_qp which was added by commit 0e0ec7e ("RDMA/core: Export
ib_open_qp() to share XRC TGT QPs") to track this.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use a bit in wc_flags rather then a whole integer to hold the
"checksum OK" flag. By itself, this change doesn't reduce the size of
struct ib_wc on 64bit machines -- it stays on 56 bytes because of
padding. However, it will allow to add more fields in the future
without enlarging the struct. Also, it will let us have a unified
approach with future libibverbs checksum offload reporting, because a
bit flag doesn't break the library ABI.
This patch was suggested during conversation with Liran Liss
<liranl@mellanox.com>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The kernel IB stack uses one enumeration for IB speed, which wasn't
explicitly specified in the verbs header file. Add that enum, and use
it all over the code.
The IB speed/width notation is also used by iWARP and IBoE HW drivers,
which use the convention of rate = speed * width to advertise their
port link rate.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC TGT QPs are shared resources among multiple processes. Since the
creating process may exit, allow other processes which share the same
XRC domain to open an existing QP. This allows us to transfer
ownership of an XRC TGT QP to another process.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow user space to create XRC domains. Because XRCDs are expected to
be shared among multiple processes, we use inodes to identify an XRCD.
Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC TGT QPs are intended to be shared among multiple users and
processes. Allow the destruction of an XRC TGT QP to be done explicitly
through ib_destroy_qp() or when the XRCD is destroyed.
To support destroying an XRC TGT QP, we need to track TGT QPs with the
XRCD. When the XRCD is destroyed, all tracked XRC TGT QPs are also
cleaned up.
To avoid stale reference issues, if a user is holding a reference on a
TGT QP, we increment a reference count on the QP. The user releases the
reference by calling ib_release_qp. This releases any access to the QP
from a user above verbs, but allows the QP to continue to exist until
destroyed by the XRCD.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).
XRC communication is between an initiator (INI) QP and a target (TGT)
QP. Target QPs are associated with SRQs through an XRCD. An XRC TGT QP
behaves like a receive-only RD QP. XRC INI QPs behave similarly to RC
QPs, except that work requests posted to an XRC INI QP must specify the
remote SRQ that is the target of the work request.
We define two new QP types for XRC, to distinguish between INI and TGT
QPs, and update the core layer to support XRC QPs.
This patch is derived from work by Jack Morgenstein
<jackm@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).
XRC defines SRQs that are specifically used by XRC connections. Expand
the SRQ code to support XRC SRQs. An XRC SRQ is currently restricted to
only XRC use according to the IB XRC Annex.
Portions of this patch were derived from work by
Jack Morgenstein <jackm@dev.mellanox.co.il>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, there is only a single ("basic") type of SRQ, but with XRC
support we will add a second. Prepare for this by defining an SRQ type
and setting all current users to IB_SRQT_BASIC.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).
A few new concepts are introduced to support this. This patch adds:
- A new device capability flag, IB_DEVICE_XRC, which low-level
drivers set to indicate that a device supports XRC.
- A new object type, XRC domains (struct ib_xrcd), and new verbs
ib_alloc_xrcd()/ib_dealloc_xrcd(). XRCDs are used to limit which
XRC SRQs an incoming message can target.
This patch is derived from work by Jack Morgenstein <jackm@dev.mellanox.co.il>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Introduce support for the following extended speeds:
FDR-10: a Mellanox proprietary link speed which is 10.3125 Gbps with
64b/66b encoding rather than 8b/10b encoding.
FDR: IBA extended speed 14.0625 Gbps.
EDR: IBA extended speed 25.78125 Gbps.
Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add IB GID change event type. This is needed for IBoE when the HW
driver updates the GID (e.g when new VLANs are added/deleted) table
and the change should be reflected to the IB core cache.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* ib_wq is added, which is used as the common workqueue for infiniband
instead of the system workqueue. All system workqueue usages
including flush_scheduled_work() callers are converted to use and
flush ib_wq.
* cancel_delayed_work() + flush_scheduled_work() converted to
cancel_delayed_work_sync().
* qib_wq is removed and ib_wq is used instead.
This is to prepare for deprecation of flush_scheduled_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows ports to have different link layers:
IB_LINK_LAYER_INFINIBAND or IB_LINK_LAYER_ETHERNET. This is required
for adding IBoE (InfiniBand-over-Ethernet, aka RoCE) support. For
devices that do not provide an implementation for querying the link
layer property of a port, we return a default value based on the
transport: RMA_TRANSPORT_IB nodes will return IB_LINK_LAYER_INFINIBAND
and RDMA_TRANSPORT_IWARP nodes will return IB_LINK_LAYER_ETHERNET.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change abbreviated IB_QPT_RAW_ETY to IB_QPT_RAW_ETHERTYPE to make
the special QP type easier to understand.
cf http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg04530.html
Signed-off-by: Aleksey Senin <alekseys@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter to ib_register_device() so that low-level device
drivers can pass in a pointer to a callback function that will be
called for each port that is registered in sysfs. This allows
low-level device drivers to create files in
/sys/class/infiniband/<hca>/ports/<N>/
without having to poke through the internals of the RDMA sysfs handling.
There is no need for an unregister function since the kobject
reference will go to zero when ib_unregister_device() is called.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Add new IB_WR_MASKED_ATOMIC_CMP_AND_SWP and IB_WR_MASKED_ATOMIC_FETCH_AND_ADD
send opcodes that can be used to post "masked atomic compare and
swap" and "masked atomic fetch and add" work request respectively.
- Add masked_atomic_cap capability.
- Add mask fields to atomic struct of ib_send_wr
- Add new opcodes to ib_wc_opcode
The new operations are described more precisely below:
* Masked Compare and Swap (MskCmpSwap)
The MskCmpSwap atomic operation is an extension to the CmpSwap
operation defined in the IB spec. MskCmpSwap allows the user to
select a portion of the 64 bit target data for the “compare” check as
well as to restrict the swap to a (possibly different) portion. The
pseudo code below describes the operation:
| atomic_response = *va
| if (!((compare_add ^ *va) & compare_add_mask)) then
| *va = (*va & ~(swap_mask)) | (swap & swap_mask)
|
| return atomic_response
The additional operands are carried in the Extended Transport Header.
Atomic response generation and packet format for MskCmpSwap is as for
standard IB Atomic operations.
* Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the
standard IB FetchAdd by allowing the user to split the target into
multiple fields of selectable length. The atomic add is done
independently on each one of this fields. A bit set in the
field_boundary parameter specifies the field boundaries. The pseudo
code below describes the operation:
| bit_adder(ci, b1, b2, *co)
| {
| value = ci + b1 + b2
| *co = !!(value & 2)
|
| return value & 1
| }
|
| #define MASK_IS_SET(mask, attr) (!!((mask)&(attr)))
| bit_position = 1
| carry = 0
| atomic_response = 0
|
| for i = 0 to 63
| {
| if ( i != 0 )
| bit_position = bit_position << 1
|
| bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position),
| MASK_IS_SET(compare_add, bit_position), &new_carry)
| if (bit_add_res)
| atomic_response |= bit_position
|
| carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
| }
|
| return atomic_response
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A small change to reduce the size of ib_device to 1112 bytes
(from 1128).
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Clarify the behavior of ib_post_send() when a list of work requests is
passed in and an immediate error is returned.
Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Base versions handle constant folding now. For headers exposed to
userspace, we must only expose the __ prefixed versions.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:
This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread). So I
CC'ed this to KVM camp. Comments are appreciated.
A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's
NULL, the system-wide dma_ops pointer is used as before.
If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging). It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.
The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations. So x86 can't have dma_mapping_ops per
device. Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.
The first patch adds the device argument to dma_mapping_error. The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architecture.
This patch:
dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations. So we can't have dma_mapping_ops per device.
Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function. x86 IOMMUs use device
argument.
[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
STag support and IB HCAs to indicate reserved L_Key support.
- Add a u32 local_dma_lkey member to struct ib_device. Drivers fill
this in with the appropriate local DMA L_Key (if they support it).
- Fix up the drivers using this flag.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch also adds a creation flag for QPs,
IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK, which when set means that
multicast sends from the QP to a group that the QP is attached to will
not be looped back to the QP's receive queue. This can be used to
save receive resources when a consumer does not want a local copy of
multicast traffic; for example IPoIB must waste CPU time throwing away
such local copies of multicast traffic.
This patch also adds a device capability flag that shows whether a
device supports this feature or not.
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds a sysfs attribute group called "proto_stats" under
/sys/class/infiniband/$device/ and populates this group with protocol
statistics if they exist for a given device. Currently, only iWARP
stats are defined, but the code is designed to allow InfiniBand
protocol stats if they become available. These stats are per-device
and more importantly -not- per port.
Details:
- Add union rdma_protocol_stats in ib_verbs.h. This union allows
defining transport-specific stats. Currently only iwarp stats are
defined.
- Add struct iw_protocol_stats to define the current set of iwarp
protocol stats.
- Add new ib_device method called get_proto_stats() to return protocol
statistics.
- Add logic in core/sysfs.c to create iwarp protocol stats attributes
if the device is an RNIC and has a get_proto_stats() method.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds support for the IB "base memory management extension"
(BMME) and the equivalent iWARP operations (which the iWARP verbs
mandates all devices must implement). The new operations are:
- Allocate an ib_mr for use in fast register work requests.
- Allocate/free a physical buffer lists for use in fast register work
requests. This allows device drivers to allocate this memory as
needed for use in posting send requests (eg via dma_alloc_coherent).
- New send queue work requests:
* send with remote invalidate
* fast register memory region
* local invalidate memory region
* RDMA read with invalidate local memory region (iWARP only)
Consumer interface details:
- A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
to indicate device support for these features.
- New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
IB_WR_RDMA_READ_WITH_INV are added.
- A new consumer API function, ib_alloc_mr() is added to allocate
fast register memory regions.
- New consumer API functions, ib_alloc_fast_reg_page_list() and
ib_free_fast_reg_page_list() are added to allocate and free
device-specific memory for fast registration page lists.
- A new consumer API function, ib_update_fast_reg_key(), is added to
allow the key portion of the R_Key and L_Key of a fast registration
MR to be updated. Consumers call this if desired before posting
a IB_WR_FAST_REG_MR work request.
Consumers can use this as follows:
- MR is allocated with ib_alloc_mr().
- Page list memory is allocated with ib_alloc_fast_reg_page_list().
- MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
- MR made VALID and bound to a specific page list via
ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
invalidate operation.
- MR is deallocated with ib_dereg_mr()
- page lists dealloced via ib_free_fast_reg_page_list().
Applications can allocate a fast register MR once, and then can
repeatedly bind the MR to different physical block lists (PBLs) via
posting work requests to a send queue (SQ). For each outstanding
MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
allocated (the fast_reg_page_list is owned by the low-level driver
from the consumer posting a work request until the request completes).
Thus pipelining can be achieved while still allowing device-specific
page_list processing.
The 32-bit fast register memory key/STag is composed of a 24-bit index
and an 8-bit key. The application can change the key each time it
fast registers thus allowing more control over the peer's use of the
key/STag (ie it can effectively be changed each time the rkey is
rebound to a page list).
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove subversion $Id lines and improve readability by fixing other
coding style problems pointed out by checkpatch.pl.
Signed-off-by: Dotan Barak <dotanba@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In 2.6.26, we added some support for send with invalidate work
requests, including a device capability flag to indicate whether a
device supports such requests. However, the support was incomplete:
the completion structure was not extended with a field for the key
contained in incoming send with invalidate requests.
Full support for memory management extensions (send with invalidate,
local invalidate, fast register through a send queue, etc) is planned
for 2.6.27. Since send with invalidate is not very useful by itself,
just remove the IB_DEVICE_SEND_W_INV bit before the 2.6.26 final
release; we will add an IB_DEVICE_MEM_MGT_EXTENSIONS bit in 2.6.27,
which makes things simpler for applications, since they will not have
quite as confusing an array of fine-grained bits to check.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts the main ib_device to use struct device instead of struct
class_device as class_device is going away.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add support for modifying CQ parameters for controlling event
generation moderation.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The values chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
LSO (large send offload) allows the networking stack to pass SKBs with
data size larger than the MTU to the IPoIB driver and have the HCA HW
fragment the data to multiple MSS-sized packets. Add a device
capability flag IB_DEVICE_UD_TSO for devices that can perform TCP
segmentation offload, a new send work request opcode IB_WR_LSO,
header, hlen and mss fields for the work request structure, and a new
IB_WC_LSO completion type.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a create_flags member to struct ib_qp_init_attr that will allow a
kernel verbs consumer to create a pass special flags when creating a QP.
Add a flag value for telling low-level drivers that a QP will be used
for IPoIB UD LSO. The create_flags member will also be useful for XRC
and ehca low-latency QP support.
Since no create_flags handling is implemented yet, add code to all
low-level drivers to return -EINVAL if create_flags is non-zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
IDR IDs are signed, so struct ib_uobject.id should be signed. This
avoids some sparse pointer signedness warnings.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a device capability to show when it can handle checksum offload.
Also add a send flag for inserting checksums and a csum_ok field to
the completion record.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Stop using kobject_register, as this way we can control the sending of
the uevent properly, after everything is properly initialized.
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <mshefty@ichips.intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Not architecture specific code should not #include <asm/scatterlist.h>.
This patch therefore either replaces them with
#include <linux/scatterlist.h> or simply removes them if they were
unused.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
After moving the definition of struct ib_umem_chunk from ib_verbs.h to
ib_umem.h there isn't any reason for the macro IB_UMEM_MAX_PAGE_CHUNK
to stay in ib_verbs.h. Move the macro to umem.c, the only place where
it is used.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>