Commit Graph

822 Commits

Author SHA1 Message Date
Roland Dreier f7eaa7ed8f Merge branches 'core', 'cxgb4', 'ip-roce', 'iser', 'misc', 'mlx4', 'nes', 'ocrdma', 'qib', 'sgwrapper', 'srp' and 'usnic' into for-next 2014-04-03 08:30:17 -07:00
Moni Shoua b2853fd6c2 IB/core: Don't resolve passive side RoCE L2 address in CMA REQ handler
The code that resolves the passive side source MAC within the rdma_cm
connection request handler was both redundant and buggy, so remove it.

It was redundant since later, when an RC QP is modified to RTR state,
the resolution will take place in the ib_core module.  It was buggy
because this callback also deals with UD SIDR exchange, for which we
incorrectly looked at the REQ member of the CM event and dereferenced
a random value.

Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-04-01 14:05:26 -07:00
Yan Burman 2c34e68f42 IB/mad: Check and handle potential DMA mapping errors
Running with DMA_API_DEBUG enabled and not checking for DMA mapping
errors triggers a kernel stack trace with "DMA-API: device driver
failed to check map error" message.  Add these checks to the MAD
module, both to be be more robust and also eliminate these
false-positive stack traces.

Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-04-01 10:36:07 -07:00
Sagi Grimberg 1b01d33560 IB/core: Introduce signature verbs API
Introduce a verbs interface for signature-related operations.  A
signature handover operation configures the layouts of data and
protection attributes both in memory and wire domains.

Signature operations are:

- INSERT:
  Generate and insert protection information when handing over
  data from input space to output space.
- validate and STRIP:
  Validate protection information and remove it when handing over
  data from input space to output space.
- validate and PASS:
  Validate protection information and pass it when handing over
  data from input space to output space.

Once the signature handover opration is done, the HCA will offload
data integrity generation/validation while performing the actual data
transfer.

Additions:

1. HCA signature capabilities in device attributes
    Verbs provider supporting signature handover operations fills
    relevant fields in device attributes structure returned by
    ib_query_device.

2. QP creation flag IB_QP_CREATE_SIGNATURE_EN
    Creating a QP that will carry signature handover operations may
    require some special preparations from the verbs provider.  So we
    add QP creation flag IB_QP_CREATE_SIGNATURE_EN to declare that the
    created QP may carry out signature handover operations.  Expose
    signature support to verbs layer (no support for now).

3. New send work request IB_WR_REG_SIG_MR
    Signature handover work request. This WR will define the signature
    handover properties of the memory/wire domains as well as the
    domains layout. The purpose of this work request is to bind all
    the needed information for the signature operation:

    - data to be transferred:  wr->sg_list (ib_sge).
      * The raw data, pre-registered to a single MR (normally, before
        signature, this MR would have been used directly for the data
        transfer)
    - data protection guards: sig_handover.prot (ib_sge).
      * The data protection buffer, pre-registered to a single MR, which
        contains the data integrity guards of the raw data blocks.
        Note that it may not always exist, only in cases where the user is
        interested in storing protection guards in memory.
    - signature operation attributes: sig_handover.sig_attrs.
      * Tells the HCA how to validate/generate the protection information.

    Once the work request is executed, the memory region that will
    describe the signature transaction will be the sig_mr.  The
    application can now go ahead and send the sig_mr.rkey or use the
    sig_mr.lkey for data transfer.

4. New Verb ib_check_mr_status
    check_mr_status verb checks the status of the memory region post
    transaction.  The first check that may be used is
    IB_MR_CHECK_SIG_STATUS, which will indicate if any signature
    errors are pending for a specific signature-enabled ib_mr.  This
    verb is a lightwight check and is allowed to be taken from
    interrupt context.  An application must call this verb after it is
    known that the actual data transfer has finished.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-03-07 11:26:49 -08:00
Sagi Grimberg 17cd3a2db8 IB/core: Introduce protected memory regions
This commit introduces verbs for creating/destoying memory
regions which will allow new types of memory key operations such
as protected memory registration.

Indirect memory registration is registering several (one
of more) pre-registered memory regions in a specific layout.
The Indirect region may potentialy describe several regions
and some repitition format between them.

Protected Memory registration is registering a memory region
with various data integrity attributes which will describe protection
schemes that will be handled by the HCA in an offloaded manner.
These memory regions will be applicable for a new REG_SIG_MR
work request introduced later in this patchset.

In the future these routines may replace or implement current memory
regions creation routines existing today:
- ib_reg_user_mr
- ib_alloc_fast_reg_mr
- ib_get_dma_mr
- ib_dereg_mr

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-03-07 11:26:49 -08:00
Yishai Hadas eeb8461e36 IB: Refactor umem to use linear SG table
This patch refactors the IB core umem code and vendor drivers to use a
linear (chained) SG table instead of chunk list.  With this change the
relevant code becomes clearer—no need for nested loops to build and
use umem.

Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-03-04 10:34:28 -08:00
Linus Torvalds 4ba9920e5e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) BPF debugger and asm tool by Daniel Borkmann.

 2) Speed up create/bind in AF_PACKET, also from Daniel Borkmann.

 3) Correct reciprocal_divide and update users, from Hannes Frederic
    Sowa and Daniel Borkmann.

 4) Currently we only have a "set" operation for the hw timestamp socket
    ioctl, add a "get" operation to match.  From Ben Hutchings.

 5) Add better trace events for debugging driver datapath problems, also
    from Ben Hutchings.

 6) Implement auto corking in TCP, from Eric Dumazet.  Basically, if we
    have a small send and a previous packet is already in the qdisc or
    device queue, defer until TX completion or we get more data.

 7) Allow userspace to manage ipv6 temporary addresses, from Jiri Pirko.

 8) Add a qdisc bypass option for AF_PACKET sockets, from Daniel
    Borkmann.

 9) Share IP header compression code between Bluetooth and IEEE802154
    layers, from Jukka Rissanen.

10) Fix ipv6 router reachability probing, from Jiri Benc.

11) Allow packets to be captured on macvtap devices, from Vlad Yasevich.

12) Support tunneling in GRO layer, from Jerry Chu.

13) Allow bonding to be configured fully using netlink, from Scott
    Feldman.

14) Allow AF_PACKET users to obtain the VLAN TPID, just like they can
    already get the TCI.  From Atzm Watanabe.

15) New "Heavy Hitter" qdisc, from Terry Lam.

16) Significantly improve the IPSEC support in pktgen, from Fan Du.

17) Allow ipv4 tunnels to cache routes, just like sockets.  From Tom
    Herbert.

18) Add Proportional Integral Enhanced packet scheduler, from Vijay
    Subramanian.

19) Allow openvswitch to mmap'd netlink, from Thomas Graf.

20) Key TCP metrics blobs also by source address, not just destination
    address.  From Christoph Paasch.

21) Support 10G in generic phylib.  From Andy Fleming.

22) Try to short-circuit GRO flow compares using device provided RX
    hash, if provided.  From Tom Herbert.

The wireless and netfilter folks have been busy little bees too.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2064 commits)
  net/cxgb4: Fix referencing freed adapter
  ipv6: reallocate addrconf router for ipv6 address when lo device up
  fib_frontend: fix possible NULL pointer dereference
  rtnetlink: remove IFLA_BOND_SLAVE definition
  rtnetlink: remove check for fill_slave_info in rtnl_have_link_slave_info
  qlcnic: update version to 5.3.55
  qlcnic: Enhance logic to calculate msix vectors.
  qlcnic: Refactor interrupt coalescing code for all adapters.
  qlcnic: Update poll controller code path
  qlcnic: Interrupt code cleanup
  qlcnic: Enhance Tx timeout debugging.
  qlcnic: Use bool for rx_mac_learn.
  bonding: fix u64 division
  rtnetlink: add missing IFLA_BOND_AD_INFO_UNSPEC
  sfc: Use the correct maximum TX DMA ring size for SFC9100
  Add Shradha Shah as the sfc driver maintainer.
  net/vxlan: Share RX skb de-marking and checksum checks with ovs
  tulip: cleanup by using ARRAY_SIZE()
  ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called
  net/cxgb4: Don't retrieve stats during recovery
  ...
2014-01-25 11:17:34 -08:00
Roland Dreier fb1b5034e4 Merge branch 'ip-roce' into for-next
Conflicts:
	drivers/infiniband/hw/mlx4/main.c
2014-01-22 23:24:21 -08:00
Roland Dreier 8f399921ea Merge branches 'cma', 'cxgb4', 'flowsteer', 'ipoib', 'misc', 'mlx4', 'mlx5', 'ocrdma', 'qib', 'srp' and 'usnic' into for-next 2014-01-22 23:24:13 -08:00
Or Gerlitz 05633102d8 IB/core: Fix unused variable warning
Fix the below "make W=1" build warning:

    drivers/infiniband/core/iwcm.c: In function ‘destroy_cm_id’:
    drivers/infiniband/core/iwcm.c:330: warning: variable ‘ret’ set but not used

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-22 22:55:04 -08:00
Somnath Kotur 5462eddd7a RDMA/cma: Handle global/non-linklocal IPv6 addresses in cma_check_linklocal()
If addr is not a linklocal address, the code incorrectly fails to
return and ends up assigning the scope ID to the scope id of the
address, which is wrong.  Fix by checking if it's a link local address
first, and immediately return 0 if not.

Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-22 22:49:17 -08:00
Wei Yongjun 990acea616 IB/cm: Fix missing unlock on error in cm_init_qp_rtr_attr()
Add the missing unlock before return from function cm_init_qp_rtr_attr()
in the error handling case.

Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-19 15:14:04 -08:00
Matan Barak 2f85d24e60 IB/core: Make ib_addr a core IB module
IP based addressing introduces the usage of rdma_addr_find_dmac_by_grh()
within ib_core.  Since this function is declared in ib_addr, ib_addr
should be a part of the core INFINIBAND modules, rather than
INFINIBAND_ADDR_TRANS.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-19 15:14:04 -08:00
Or Gerlitz ed4c54e5b4 IB/core: Resolve Ethernet L2 addresses when modifying QP
Existing user space applications provide only IBoE L3 address
attributes to the kernel when they issue a modify QP modify.  To work
with them and let such apps (plus kernel consumers which don't use the
RDMA-CM) keep working transparently under the IBoE GID IP addressing
changes, add an Eth L2 address resolution helper.

Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-19 15:14:04 -08:00
Moni Shoua 7b85627b9f IB/cma: IBoE (RoCE) IP-based GID addressing
Currently, the IB core and specifically the RDMA-CM assumes that IBoE
(RoCE) gids encode related Ethernet netdevice interface MAC address
and possibly VLAN id.

Change GIDs to be treated as they encode interface IP address.

Since Ethernet layer 2 address parameters are not longer encoded
within gids, we have to extend the Infiniband address structures (e.g.
ib_ah_attr) with layer 2 address parameters, namely mac and vlan.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-18 14:12:35 -08:00
Upinder Malhi 5db5765e25 IB/core: Add support for RDMA_NODE_USNIC_UDP
Add the complementary RDMA_NODE_USNIC_UDP for RDMA_TRANSPORT_USNIC_UDP.

Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-18 13:48:54 -08:00
Aruna-Hewapathirane 63862b5bef net: replace macros net_random and net_srandom with direct calls to prandom
This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.

Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-14 15:15:25 -08:00
Matan Barak dd5f03beb4 IB/core: Ethernet L2 attributes in verbs/cm structures
This patch add the support for Ethernet L2 attributes in the
verbs/cm/cma structures.

When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and priority
in a similar manner that the IB L2 (and the L4 PKEY) attributes are used.

Thus, those attributes were added to the following structures:

* ib_ah_attr - added dmac
* ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
* ib_wc - added smac, vlan_id
* ib_sa_path_rec - added smac, dmac, vlan_id
* cm_av - added smac and vlan_id

For the path record structure, extra care was taken to avoid the new
fields when packing it into wire format, so we don't break the IB CM
and SA wire protocol.

On the active side, the CM fills. its internal structures from the
path provided by the ULP.  We add there taking the ETH L2 attributes
and placing them into the CM Address Handle (struct cm_av).

On the passive side, the CM fills its internal structures from the WC
associated with the REQ message.  We add there taking the ETH L2
attributes from the WC.

When the HW driver provides the required ETH L2 attributes in the WC,
they set the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core
code checks for the presence of these flags, and in their absence does
address resolution from the ib_init_ah_from_wc() helper function.

ib_modify_qp_is_ok is also updated to consider the link layer. Some
parameters are mandatory for Ethernet link layer, while they are
irrelevant for IB.  Vendor drivers are modified to support the new
function signature.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14 14:20:54 -08:00
Upinder Malhi 248567f793 IB/core: Add RDMA_TRANSPORT_USNIC_UDP
Add RDMA_TRANSPORT_USNIC_UDP which will be used by usNIC.

Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-01-14 00:44:45 -08:00
Roland Dreier 22f12c60e1 Merge branches 'cxgb4', 'flowsteer' and 'misc' into for-linus 2013-12-23 09:19:02 -08:00
Yann Droneaud 6cc3df840a IB/uverbs: Check access to userspace response buffer in extended command
This patch adds a check on the output buffer with access_ok(VERIFY_WRITE, ...)
to ensure the whole buffer is in userspace memory before using the
pointer in uverbs functions.  If the buffer or a subset of it is not
valid, returns -EFAULT to the caller.

This will also catch invalid buffer before the final call to
copy_to_user() which happen late in most uverb functions.

Just like the check in read(2) syscall, it's a sanity check to detect
invalid parameters provided by userspace. This particular check was added
in vfs_read() by Linus Torvalds for v2.6.12 with following commit message:

https://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/?id=fd770e66c9a65b14ce114e171266cf6f393df502

  Make read/write always do the full "access_ok()" tests.

  The actual user copy will do them too, but only for the
  range that ends up being actually copied. That hides
  bugs when the range has been clamped by file size or other
  issues.

Note: there's no need to check input buffer since vfs_write() already does
access_ok(VERIFY_READ, ...) as part of write() syscall.

Link: http://marc.info/?i=cover.1387273677.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20 10:54:34 -08:00
Yann Droneaud 6bcca3d4a3 IB/uverbs: Check input length in flow steering uverbs
Since ib_copy_from_udata() doesn't check yet the available input data
length before accessing userspace memory, an explicit check of this
length is required to prevent:

- reading past the user provided buffer,
- underflow when subtracting the expected command size from the input
  length.

This will ensure the newly added flow steering uverbs don't try to
process truncated commands.

Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20 10:54:33 -08:00
Yann Droneaud 98a37510ec IB/uverbs: Set error code when fail to consume all flow_spec items
If the flow_spec items parsed count does not match the number of items
declared in the flow_attr command, or if not all bytes are used for
flow_spec items (eg. trailing garbage), a log message is reported and
the function leave through the error path. Unfortunately the error
code is currently not set.

This patch set error code to -EINVAL in such cases, so that the error
is reported to userspace instead of silently fail.

Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20 10:54:33 -08:00
Yann Droneaud c780d82a74 IB/uverbs: Check reserved fields in create_flow
As noted by Daniel Vetter in its article "Botching up ioctls"[1]

  "Check *all* unused fields and flags and all the padding for whether
   it's 0, and reject the ioctl if that's not the case.  Otherwise
   your nice plan for future extensions is going right down the
   gutters since someone *will* submit an ioctl struct with random
   stack garbage in the yet unused parts. Which then bakes in the ABI
   that those fields can never be used for anything else but garbage."

It's important to ensure that reserved fields are set to known value,
so that it will be possible to use them latter to extend the ABI.

The same reasonning apply to comp_mask field present in newer uverbs
command: per commit 22878dbc91 ("IB/core: Better checking of
userspace values for receive flow steering"), unsupported values in
comp_mask are rejected.

[1] http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20 10:54:32 -08:00
Yann Droneaud 2782c2d302 IB/uverbs: Check comp_mask in destroy_flow
Just like the check added to create_flow in 22878dbc91 ("IB/core:
Better checking of userspace values for receive flow steering"),
comp_mask must be checked in destroy_flow too.

Since only empty comp_mask is currently supported, any other value
must be rejected.

This check was silently added in a previous patch[1] to move comp_mask
in extended command header, part of previous patchset[2] against
create/destroy_flow uverbs. The idea of moving comp_mask to the header
was discarded for the final patchset[3].

Unfortunately the check added in destroy_flow uverb was not integrated
in the final patchset.

[1] http://marc.info/?i=40175eda10d670d098204da6aa4c327a0171ae5f.1381510045.git.ydroneaud@opteya.com
[2] http://marc.info/?i=cover.1381510045.git.ydroneaud@opteya.com
[3] http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com

Cc: Matan Barak <matanb@mellanox.com>
Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20 10:54:31 -08:00
Yann Droneaud 7efb1b19b3 IB/uverbs: Check reserved field in extended command header
As noted by Daniel Vetter in its article "Botching up ioctls"[1]

  "Check *all* unused fields and flags and all the padding for whether
   it's 0, and reject the ioctl if that's not the case.  Otherwise
   your nice plan for future extensions is going right down the
   gutters since someone *will* submit an ioctl struct with random
   stack garbage in the yet unused parts. Which then bakes in the ABI
   that those fields can never be used for anything else but garbage."

It's important to ensure that reserved fields are set to known value,
so that it will be possible to use them latter to extend the ABI.

The same reasonning apply to comp_mask field present in newer uverbs
command: per commit 22878dbc91 ("IB/core: Better checking of
userspace values for receive flow steering"), unsupported values in
comp_mask are rejected.

[1] http://blog.ffwll.ch/2013/11/botching-up-ioctls.html

Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20 10:54:30 -08:00
Roland Dreier a96e4e2ffe IB/uverbs: New macro to set pointers to NULL if length is 0 in INIT_UDATA()
Trying to have a ternary operator to choose between NULL (or 0) and the
real pointer value in invocations leads to an impossible choice between
a sparse error about a literal 0 used as a NULL pointer, and a gcc
warning about "pointer/integer type mismatch in conditional expression."

Rather than clutter the source with more casts, move the ternary
operator into a new INIT_UDATA_BUF_OR_NULL() macro, which makes it
easier to use and simplifies its callers.

Reported-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-20 10:53:44 -08:00
Yann Droneaud 309243ec14 IB/core: const'ify inbuf in struct ib_udata
Userspace input buffer is not modified by kernel, so it can be 'const'.

This is also a prerequisite to remove the implicit cast
from INIT_UDATA().

Link: http://marc.info/?i=cover.1386798254.git.ydroneaud@opteya.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-16 10:38:28 -08:00
Steve Wise 6b59ba609b RDMA/iwcm: Don't touch cm_id after deref in rem_ref
rem_ref() calls iwcm_deref_id(), which will wake up any blockers on
cm_id_priv->destroy_comp if the refcnt hits 0.  That will unblock
someone in iw_destroy_cm_id() which will free the cmid.  If that
happens before rem_ref() calls test_bit(IWCM_F_CALLBACK_DESTROY,
&cm_id_priv->flags), then the test_bit() will touch freed memory.

The fix is to read the bit first, then deref.  We should never be in
iw_destroy_cm_id() with IWCM_F_CALLBACK_DESTROY set, and there is a
BUG_ON() to make sure of that.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-12-15 16:47:47 -08:00
Linus Torvalds 1ea406c0e0 Main batch of InfiniBand/RDMA changes for 3.13:
- Re-enable flow steering verbs with new improved userspace ABI
  - Fixes for slow connection due to GID lookup scalability
  - IPoIB fixes
  - Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
  - Further improvements to SRP error handling
  - Add new transport type for Cisco usNIC
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABCAAGBQJSil7BAAoJEENa44ZhAt0hbtgP/A+AmUalbOX6ZKzuOFxsrtY2
 r55CX9b1JBeFM/Zhn2o6y+81lpCjkckJSggESMe4izNgocGw0nW4vYGN4SBynatj
 y8sR9OSn+G3ihuENrzG41MJUGEa5WbcNMy4boN+Oa+qyTlV/WjLR7Fv4WbikK7Wm
 o8FNlXiiDhMoGfHHG5J0MD0EQsnxuLDk2XP+ciu4tLtTs+wBka+gFK8WnMvztle3
 gTeMNna5ilvCS2fdBxteuPA3KeDnJE9AgJSMJ2a4Rh+DR8uTgWYQ6n7amjmOc546
 yhAKkoBkxPE10+Yj82WOPhCFxSeWcuSwJvpgv5dTVZ1XqUUcC1V3TEcZDHmyyHQ7
 uPXgS1A+erBW3OYPBjZqtKvnHObscV12fL+rId3vIhcAQIbFroci08ZwPidEYRkn
 fvwlEKcrIsBIpRXEyjlFCxsiiDnfq1wC1VayMR3jrIK0P6idf1SXf/geiRp9+RGT
 wKUc0j51jvEx29qc65xuhEP9FQV9pCMxyd+FEE0d0KkjMz5hsIkjmcUcBbgF0CGg
 GEyDPlgRLv+vmWDGpT8XraaV/0CJOEQDIgB4WSN87/AZ4UoNt7spW2xqsLsp1toy
 5e0100tpWUleTPLe/Wig5GtBdagQ2jAUK1+186CP93pFPtkwc4/7X3hyp7qPIPTz
 VDvT9DEy6zjSMCLpMcdo
 =xxC+
 -----END PGP SIGNATURE-----

Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull infiniband/rdma updates from Roland Dreier:
 - Re-enable flow steering verbs with new improved userspace ABI
 - Fixes for slow connection due to GID lookup scalability
 - IPoIB fixes
 - Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
 - Further improvements to SRP error handling
 - Add new transport type for Cisco usNIC

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
  IB/core: Re-enable create_flow/destroy_flow uverbs
  IB/core: extended command: an improved infrastructure for uverbs commands
  IB/core: Remove ib_uverbs_flow_spec structure from userspace
  IB/core: Use a common header for uverbs flow_specs
  IB/core: Make uverbs flow structure use names like verbs ones
  IB/core: Rename 'flow' structs to match other uverbs structs
  IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
  IB/ucma: Convert use of typedef ctl_table to struct ctl_table
  IB/cm: Convert to using idr_alloc_cyclic()
  IB/mlx5: Fix page shift in create CQ for userspace
  IB/mlx4: Fix device max capabilities check
  IB/mlx5: Fix list_del of empty list
  IB/mlx5: Remove dead code
  IB/core: Encorce MR access rights rules on kernel consumers
  IB/mlx4: Fix endless loop in resize CQ
  RDMA/cma: Remove unused argument and minor dead code
  RDMA/ucma: Discard events for IDs not yet claimed by user space
  IB/core: Add Cisco usNIC rdma node and transport types
  RDMA/nes: Remove self-assignment from nes_query_qp()
  IB/srp: Report receive errors correctly
  ...
2013-11-18 15:36:04 -08:00
Roland Dreier b4fdf52b3f Merge branches 'cma', 'cxgb4', 'flowsteer', 'ipoib', 'misc', 'mlx4', 'mlx5', 'nes', 'ocrdma', 'qib' and 'srp' into for-next 2013-11-17 08:22:19 -08:00
Matan Barak 69ad5da41b IB/core: Re-enable create_flow/destroy_flow uverbs
This commit reverts commit 7afbddfae9 ("IB/core: Temporarily disable
create_flow/destroy_flow uverbs").  Since the uverbs extensions
functionality was experimental for v3.12, this patch re-enables the
support for them and flow-steering for v3.13.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17 08:22:09 -08:00
Yann Droneaud f21519b23c IB/core: extended command: an improved infrastructure for uverbs commands
Commit 400dbc9658 ("IB/core: Infrastructure for extensible uverbs
commands") added an infrastructure for extensible uverbs commands
while later commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow
through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
using this new infrastructure.

According to the commit 400dbc9658, the purpose of this
infrastructure is to support passing around provider (eg. hardware)
specific buffers when userspace issue commands to the kernel, so that
it would be possible to extend uverbs (eg. core) buffers independently
from the provider buffers.

But the new kernel command function prototypes were not modified to
take advantage of this extension. This issue was exposed by Roland
Dreier in a previous review[1].

So the following patch is an attempt to a revised extensible command
infrastructure.

This improved extensible command infrastructure distinguish between
core (eg. legacy)'s command/response buffers from provider
(eg. hardware)'s command/response buffers: each extended command
implementing function is given a struct ib_udata to hold core
(eg. uverbs) input and output buffers, and another struct ib_udata to
hold the hw (eg. provider) input and output buffers.

Having those buffers identified separately make it easier to increase
one buffer to support extension without having to add some code to
guess the exact size of each command/response parts: This should make
the extended functions more reliable.

Additionally, instead of relying on command identifier being greater
than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure rely on
unused bits in command field: on the 32 bits provided by command
field, only 6 bits are really needed to encode the identifier of
commands currently supported by the kernel. (Even using only 6 bits
leaves room for about 23 new commands).

So this patch makes use of some high order bits in command field to
store flags, leaving enough room for more command identifiers than one
will ever need (eg. 256).

The new flags are used to specify if the command should be processed
as an extended one or a legacy one. While designing the new command
format, care was taken to make usage of flags itself extensible.

Using high order bits of the commands field ensure that newer
libibverbs on older kernel will properly fail when trying to call
extended commands. On the other hand, older libibverbs on newer kernel
will never be able to issue calls to extended commands.

The extended command header includes the optional response pointer so
that output buffer length and output buffer pointer are located
together in the command, allowing proper parameters checking. This
should make implementing functions easier and safer.

Additionally the extended header ensure 64bits alignment, while making
all sizes multiple of 8 bytes, extending the maximum buffer size:

                             legacy      extended

   Maximum command buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)
  Maximum response buffer:  256KBytes   1024KBytes (512KBytes + 512KBytes)

For the purpose of doing proper buffer size accounting, the headers
size are no more taken in account in "in_words".

One of the odds of the current extensible infrastructure, reading
twice the "legacy" command header, is fixed by removing the "legacy"
command header from the extended command header: they are processed as
two different parts of the command: memory is read once and
information are not duplicated: it's making clear that's an extended
command scheme and not a different command scheme.

The proposed scheme will format input (command) and output (response)
buffers this way:

- command:

  legacy header +
  extended header +
  command data (core + hw):

    +----------------------------------------+
    | flags     |   00      00    |  command |
    |        in_words    |   out_words       |
    +----------------------------------------+
    |                 response               |
    |                 response               |
    | provider_in_words | provider_out_words |
    |                 padding                |
    +----------------------------------------+
    |                                        |
    .              <uverbs input>            .
    .              (in_words * 8)            .
    |                                        |
    +----------------------------------------+
    |                                        |
    .             <provider input>           .
    .          (provider_in_words * 8)       .
    |                                        |
    +----------------------------------------+

- response, if present:

    +----------------------------------------+
    |                                        |
    .          <uverbs output space>         .
    .             (out_words * 8)            .
    |                                        |
    +----------------------------------------+
    |                                        |
    .         <provider output space>        .
    .         (provider_out_words * 8)       .
    |                                        |
    +----------------------------------------+

The overall design is to ensure that the extensible infrastructure is
itself extensible while begin more reliable with more input and bound
checking.

Note:

The unused field in the extended header would be perfect candidate to
hold the command "comp_mask" (eg. bit field used to handle
compatibility).  This was suggested by Roland Dreier in a previous
review[2].  But "comp_mask" field is likely to be present in the uverb
input and/or provider input, likewise for the response, as noted by
Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
header.

[1]:
http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com

[2]:
http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com

[3]:
http://marc.info/?i=525C1149.6000701@mellanox.com

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com

[ Convert "ret ? ret : 0" to the equivalent "ret".  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17 08:22:09 -08:00
Yann Droneaud 2490f20be4 IB/core: Remove ib_uverbs_flow_spec structure from userspace
The structure holding any types of flow_spec is of no use to
userspace.  It would be wrong for userspace to do:

  struct ib_uverbs_flow_spec flow_spec;

  flow_spec.type = IB_FLOW_SPEC_TCP;
  flow_spec.size = sizeof(flow_spec);

Instead, userspace should use the dedicated flow_spec structure for
  - Ethernet : struct ib_uverbs_flow_spec_eth,
  - IPv4     : struct ib_uverbs_flow_spec_ipv4,
  - TCP/UDP  : struct ib_uverbs_flow_spec_tcp_udp.

In other words, struct ib_uverbs_flow_spec is a "virtual" data
structure that can only be use by the kernel as an alias to the other.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17 08:22:08 -08:00
Yann Droneaud b68c956021 IB/core: Make uverbs flow structure use names like verbs ones
This patch adds "flow" prefix to most of data structure added as part
of commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow through
uverbs") to keep those names in sync with the data structures added in
commit 319a441d13 ("IB/core: Add receive flow steering support").

It's just a matter of translating 'ib_flow' to 'ib_uverbs_flow'.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17 08:22:08 -08:00
Yann Droneaud d82693dad0 IB/core: Rename 'flow' structs to match other uverbs structs
Commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow through
uverbs") added public data structures to support receive flow
steering.  The new structs are not following the 'uverbs' pattern:
they're lacking the common prefix 'ib_uverbs'.

This patch replaces ib_kern prefix by ib_uverbs.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17 08:22:08 -08:00
Matan Barak f884827438 IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
This patch fixes the following issues:

1. Unneeded checks were removed

2. Removed the fixed size out of flow_attr.size, thus simplifying the checks.

3. Remove a 32bit hole on 64bit systems with strict alignment in
   struct ib_kern_flow_att by adding a reserved field.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-17 08:22:07 -08:00
Joe Perches f3a5e3e37e IB/ucma: Convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-16 10:32:24 -08:00
Zhao Hongjiang ab626d1a68 IB/cm: Convert to using idr_alloc_cyclic()
Commit 3e6628c4b3 ("idr: introduce idr_alloc_cyclic()") adds a new
idr_alloc_cyclic() routine and converts several of these users to it.
This is just a missed one - add it.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-16 10:30:42 -08:00
Eli Cohen 1c636f8016 IB/core: Encorce MR access rights rules on kernel consumers
Enforce the rule that when requesting remote write or atomic permissions, local
write must be indicated as well. See IB spec 11.2.8.2.

Spotted by: Hagay Abramovsky <hagaya@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-15 10:25:32 -08:00
Michal Nazarewicz 352b905635 RDMA/cma: Remove unused argument and minor dead code
The dev variable is never assigned after being initialised.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-11 10:46:54 -08:00
Sean Hefty c6b21824c9 RDMA/ucma: Discard events for IDs not yet claimed by user space
Problem reported by Avneesh Pant <avneesh.pant@oracle.com>:

    It looks like we are triggering a bug in RDMA CM/UCM interaction.
    The bug specifically hits when we have an incoming connection
    request and the connecting process dies BEFORE the passive end of
    the connection can process the request i.e. it does not call
    rdma_get_cm_event() to retrieve the initial connection event.  We
    were able to triage this further and have some additional
    information now.

    In the example below when P1 dies after issuing a connect request
    as the CM id is being destroyed all outstanding connects (to P2)
    are sent a reject message. We see this reject message being
    received on the passive end and the appropriate CM ID created for
    the initial connection message being retrieved in cm_match_req().
    The problem is in the ucma_event_handler() code when this reject
    message is delivered to it and the initial connect message itself
    HAS NOT been delivered to the client. In fact the client has not
    even called rdma_cm_get_event() at this stage so we haven't
    allocated a new ctx in ucma_get_event() and updated the new
    connection CM_ID to point to the new UCMA context.

    This results in the reject message not being dropped in
    ucma_event_handler() for the new connection request as the
    (if (!ctx->uid)) block is skipped since the ctx it refers to is
    the listen CM id context which does have a valid UID associated
    with it (I believe the new CMID for the connection initially
    uses the listen CMID -> context when it is created in
    cma_new_conn_id). Thus the assumption that new events for a
    connection can get dropped in ucma_event_handler() is incorrect
    IF the initial connect request has not been retrieved in the
    first case. We end up getting a CM Reject event on the listen CM
    ID and our upper layer code asserts (in fact this event does not
    even have the listen_id set as that only gets set up librdmacm
    for connect requests).

The solution is to verify that the cm_id being reported in the event
is the same as the cm_id referenced by the ucma context.  A mismatch
indicates that the ucma context corresponds to the listen.  This fix
was validated by using a modified version of librdmacm that was able
to verify the problem and see that the reject message was indeed
dropped after this patch was applied.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-11 10:45:06 -08:00
Upinder Malhi \(umalhi\) 180771a370 IB/core: Add Cisco usNIC rdma node and transport types
This patch adds new rdma node and new rdma transport, and supporting
code used by Cisco's low latency driver called usNIC.  usNIC uses its
own transport, distinct from IB and iWARP.

Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-09 02:36:25 -08:00
Mathias Krause 5476781bb9 IB/netlink: Remove superfluous RDMA_NL_GET_OP() masking
'op' is the already RDMA_NL_GET_OP() masked 'type'.  No need to mask it again.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:54 -08:00
Latchesar Ionkov 6b7d103c1b IB/core: Pass imm_data from ib_uverbs_send_wr to ib_send_wr correctly
Currently, we don't copy the immediate data from the userspace struct
to the kernel one when UD messages are being sent.

This patch makes sure that the immediate data is set correctly.

Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:54 -08:00
Doug Ledford be9130cc92 IB/cma: Check for GID on listening device first
As a simple optimization that should speed up the vast majority of
connect attemps on IB devices, when we are searching for the GID of an
incoming connection in the cached GID lists of devices, search the
device that received the incoming connection request first.  If we
don't find it there, then move on to other devices.

This reduces the time to perform 10,000 connections considerably.
Prior to this patch, a bad run of cmtime would look like this:

connect      :    12399.26   12351.10    8609.00    1239.93

With this patch, it looks more like this:

connect      :     5864.86    5799.80    8876.00     586.49

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:24 -08:00
Doug Ledford 29f27e8477 IB/cma: Use cached gids
The cma_acquire_dev function was changed by commit 3c86aa70bf
("RDMA/cm: Add RDMA CM support for IBoE devices") to use find_gid_port()
because multiport devices might have either IB or IBoE formatted gids.
The old function assumed that all ports on the same device used the
same GID format.

However, when it was changed to use find_gid_port(), we inadvertently
lost usage of the GID cache.  This turned out to be a very costly
change.  In our testing, each iteration through each index of the GID
table takes roughly 35us.  When you have multiple devices in a system,
and the GID you are looking for is on one of the later devices, the
code loops through all of the GID indexes on all of the early devices
before it finally succeeds on the target device.  This pathological
search behavior combined with 35us per GID table index retrieval
results in results such as the following from the cmtime application
that's part of the latest librdmacm git repo:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       29.42       0.04       1.00       2.94
bind addr    :   186705.66      19.00   18556.00   18670.57
resolve addr :       41.93       9.68     619.00       4.19
resolve route:      486.93       0.48     101.00      48.69
create qp    :     4021.95       6.18     330.00     402.20
connect      :    68350.39   68588.17   24632.00    6835.04
disconnect   :     1460.43     252.65-1862269.00     146.04
destroy      :       41.16       0.04       2.00       4.12

ib0:
step              total ms     max ms     min us  us / conn
create id    :       28.61       0.68       1.00       2.86
bind addr    :     2178.86       2.95     201.00     217.89
resolve addr :       51.26      16.85     845.00       5.13
resolve route:      620.08       0.43      92.00      62.01
create qp    :     3344.40       6.36     273.00     334.44
connect      :     6435.99    6368.53    7844.00     643.60
disconnect   :     5095.38     321.90     757.00     509.54
destroy      :       37.13       0.02       2.00       3.71

Clearly, both the bind address and connect operations suffer
a huge penalty for being anything other than the default
GID on the first port in the system.

After applying this patch, the numbers now look like this:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       30.15       0.03       1.00       3.01
bind addr    :       80.27       0.04       7.00       8.03
resolve addr :       43.02      13.53     589.00       4.30
resolve route:      482.90       0.45     100.00      48.29
create qp    :     3986.55       5.80     330.00     398.66
connect      :     7141.53    7051.29    5005.00     714.15
disconnect   :     5038.85     193.63     918.00     503.88
destroy      :       37.02       0.04       2.00       3.70

ib0:
step              total ms     max ms     min us  us / conn
create id    :       34.27       0.05       1.00       3.43
bind addr    :       26.45       0.04       1.00       2.64
resolve addr :       38.25      10.54     760.00       3.82
resolve route:      604.79       0.43      97.00      60.48
create qp    :     3314.95       6.34     273.00     331.49
connect      :    12399.26   12351.10    8609.00    1239.93
disconnect   :     5096.76     270.72    1015.00     509.68
destroy      :       37.10       0.03       2.00       3.71

It's worth noting that we still suffer a bit of a penalty on
connect to the wrong device, but the penalty is much less than
it used to be.  Follow on patches deal with this penalty.

Many thanks to Neil Horman for helping to track the source of
slow function that allowed us to track down the fact that
the original patch I mentioned above backed out cache usage
and identify just how much that impacted the system.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:24 -08:00
Eyal Perry eb072c4b8d RDMA/cma: Set IBoE SL (user-priority) by egress map when using vlans
On top of commit 366cddb40 "IB/rdma_cm: TOS <=> UP mapping for IBoE", add
support for case vlan egress map is used.

When the IBoE session is being set over a vlan, inherit the socket priority
to vlan priority mapping which was configured for the vlan device egress map.

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07 19:09:44 -05:00
David S. Miller c3fa32b976 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/qmi_wwan.c
	include/net/dst.h

Trivial merge conflicts, both were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-23 16:49:34 -04:00
Yann Droneaud 7afbddfae9 IB/core: Temporarily disable create_flow/destroy_flow uverbs
The create_flow/destroy_flow uverbs and the associated extensions to
the user-kernel verbs ABI are under review and are too experimental to
freeze at this point.

So userspace is not exposed to experimental features and an uinstable
ABI, temporarily disable this for v3.12 (with a Kconfig option behind
staging to reenable it if desired).

The feature will be enabled after proper cleanup for v3.13.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1381351016.git.ydroneaud@opteya.com
Link: http://marc.info/?i=cover.1381177342.git.ydroneaud@opteya.com

[ Add a Kconfig option to reenable these verbs.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-10-21 09:44:17 -07:00