The retured value from ib_dma_map_sg saved in dma_nents variable. To avoid
future mismatch between types, define dma_nents as an integer instead of
unsigned.
Fixes: 57b26497fa ("IB/iser: Pass the correct number of entries for dma mapped SGL")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The write access of an implicit MR is inherited to all of its children.
Therefore we must set the correct write access to the parent MR.
Pass full access_flags when creating umem to let it calculate write access
correctly.
Fixes: da6a496a34 ("IB/mlx5: Ranges in implicit ODP MR inherit its write access")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Kernel space provider driver should clean the CQs belonging to kernel
space consumers only. The current implementation is doing reverse of it.
Fixing the same by avoiding the call to __clean_cq on a kernel qp during
destroy.
Fixes: c50866e285 ("bnxt_re: fix the regression due to changes in alloc_pbl")
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no need to call kfree(pd) because ib_dealloc_pd() internally
frees PD.
Fixes: 21a428a019 ("RDMA: Handle PD allocations by IB/core")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Following the PD conversion patch, do the same for ucontext allocations.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The first parameter of WARN_ONCE() is a condition, then following
parameters are the message. In this case, we left out the condition so it
will just print the ops->type string.
Fixes: 3856ec4b93 ("RDMA/core: Add RDMA_NLDEV_CMD_NEWLINK/DELLINK support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
While adding the use of for_each_sg_dma_page iterator for Brodcom's rdma
driver, there was a regression added in the __alloc_pbl path. The change
left bnxt_re in DOA state in for-next branch.
Fixing the regression to avoid the host crash when a user space object is
created. Restricting the unconditional access to hwq.pg_arr when hwq is
initialized for user space objects.
Fixes: 161ebe2498 ("RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL")
Reported-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.
Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the book keeping
in a cache.
Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to evicted form the cache. But life is not perfect,
remote peers may die or be rebooted. Hence, it's a timeout to wipe out
a cache entry, when the PF driver assumes the remote peer has gone.
During workloads where a high number of QPs are destroyed concurrently,
excessive amount of CM DREQ retries has been observed
The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each. This using dual ported HCAs, so we
have 16 vPorts per physical server.
64 processes are associated with each vPort and creates and destroys
one QP for each of the remote 64 processes. That is, 1024 QPs per
vPort, all in all 16K QPs. The QPs are created/destroyed using the
CM.
When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe as sum of the 16 vPorts on one of the
nodes:
cm_rx_duplicates:
dreq 2102
cm_rx_msgs:
drep 1989
dreq 6195
rep 3968
req 4224
rtu 4224
cm_tx_msgs:
drep 4093
dreq 27568
rep 4224
req 3968
rtu 3968
cm_tx_retries:
dreq 23469
Note that the active/passive side is equally distributed between the
two nodes.
Enabling pr_debug in cm.c gives tons of:
[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!
By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:
cm_rx_duplicates:
dreq 2460
[]
cm_tx_retries:
dreq 3010
req 47
Increasing the timeout further didn't help, as these duplicates and
retries stems from a too short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell down to about 10 for both of them.
Adjustment of the CMA timeout is not part of this commit.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It is possible that during a page fault handling, the process that owns
the MR is terminating. The indication for it is failure to get the
task_struct or take reference on the mm_struct. In this case just abort
the page-fault handler with error but without a warning to the kernel log.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When prefetching odp mr it is required to verify that pd of the mr is
identical to the pd for which the advise_mr request arrived with.
This check was missing from synchronous flow and is added now.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When deferring a prefetch request we need to protect against MR or PD
being destroyed while the request is still enqueued.
The first step is to validate that PD owns the lkey that describes the MR
and that the MR that the lkey refers to is owned by that PD.
The second step is to dequeue all requests when MR is destroyed.
Since PD can't be destroyed while it owns MRs it is guaranteed that when a
worker wakes up the request it refers to is still valid.
Now, it is possible to refrain from taking a reference on the device since
it is assured to be present as pd.
While that, replace the dedicated ordered workqueue with the system
unbound workqueue to reuse an existing resource and improve
performance. This will also fix a bug of queueing to the wrong workqueue.
Fixes: 813e90b1ae ("IB/mlx5: Add advise_mr() support")
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix the following warning by adding a missing break:
drivers/infiniband/hw/hfi1/tid_rdma.c: In function ‘hfi1_tid_rdma_wqe_interlock’:
drivers/infiniband/hw/hfi1/tid_rdma.c:3251:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
switch (prev->wr.opcode) {
^~~~~~
drivers/infiniband/hw/hfi1/tid_rdma.c:3259:2: note: here
case IB_WR_RDMA_READ:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Fixes: c6c231175c ("IB/hfi1: Add interlock between TID RDMA WRITE and other requests")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Kaike Wan <Kaike.wan@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
From
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
To resolve conflicts with net-next and pick up the first patch.
* branch 'mlx5-next':
net/mlx5: Factor out HCA capabilities functions
IB/mlx5: Add support for 50Gbps per lane link modes
net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register
net/mlx5: Add new fields to Port Type and Speed register
net/mlx5: Refactor queries to speed fields in Port Type and Speed register
net/mlx5: E-Switch, Avoid magic numbers when initializing offloads mode
net/mlx5: Relocate vport macros to the vport header file
net/mlx5: E-Switch, Normalize the name of uplink vport number
net/mlx5: Provide an alternative VF upper bound for ECPF
net/mlx5: Add host params change event
net/mlx5: Add query host params command
net/mlx5: Update enable HCA dependency
net/mlx5: Introduce Mellanox SmartNIC and modify page management logic
IB/mlx5: Use unified register/load function for uplink and VF vports
net/mlx5: Use consistent vport num argument type
net/mlx5: Use void pointer as the type in address_of macro
net/mlx5: Align ODP capability function with netdev coding style
mlx5: use RCU lock in mlx5_eq_cq_get()
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The current check does not take into account the previous value of
pinned_vm; thus it is quite bogus as is. Fix this by checking the
new value after the (optimistic) atomic inc.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Before calling the provider's alloc_mw function, verify that the
given memory type is either IB_MW_TYPE_1 or IB_MW_TYPE_2.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The strlen() check at the beginning of iw_cm_map() ensures that devname
and ifname strings are less than destinations to which they are supposed
to be copied. Change strncpy() call to be strcpy(), because we are
protected from overflow. Zero the entire string buffer to avoid copying
uninitialized kernel stack memory to userspace.
This fixes the compilation warning below:
In file included from ./include/linux/dma-mapping.h:6,
from drivers/infiniband/core/iwcm.c:38:
In function _strncpy_,
inlined from _iw_cm_map_ at drivers/infiniband/core/iwcm.c:519:2:
./include/linux/string.h:253:9: warning: ___builtin_strncpy_ specified
bound 32 equals destination size [-Wstringop-truncation]
return __builtin_strncpy(p, q, size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: d53ec8af56 ("RDMA/iwcm: Don't copy past the end of dev_name() string")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
old_pd is used only if IB_MR_REREG_PD flags is set.
For readability move it's initialization to where it is used.
While there rewrite the whole 'if-else' block so on error jump directly
to label and no need for 'else'
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fixes the following sparse warning:
drivers/infiniband/hw/cxgb4/cm.c:658:6: warning:
symbol 'read_tcb' was not declared. Should it be static?
Fixes: 11a27e2121 ("iw_cxgb4: complete the cached SRQ buffers")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Accroding to hip08's limitation, qp&cq specification is 1M, mtpt
specification 1M in kernel space.
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Do not assume irq_poll_sched() is called from an interrupt context only.
So use raise_softirq_irqoff() instead of __raise_softirq_irqoff() so it
will kick the ksoftirqd if the schedule is from a non-interrupt context.
This is required for RDMA drivers, like soft iwarp, that generate cq
completion notifications in a workqueue or kthread context. Without this
change, siw completion notifications to the ULP can take several hundred
usecs, depending on the system load.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add support for the RDMA_NLDEV_CMD_NEWLINK/DELLINK messages which allow
dynamically adding new RXE links. Deprecate the old module options for
now.
Cc: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add support for new LINK messages to allow adding and deleting rdma
interfaces. This will be used initially for soft rdma drivers which
instantiate device instances dynamically by the admin specifying a netdev
device to use. The rdma_rxe module will be the first user of these
messages.
The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers register
with the rdma core if they provide link add/delete functions. Each driver
registers with a unique "type" string, that is used to dispatch messages
coming from user space. A new RDMA_NLDEV_ATTR is defined for the "type"
string. User mode will pass 3 attributes in a NEWLINK message:
RDMA_NLDEV_ATTR_DEV_NAME for the desired rdma device name to be created,
RDMA_NLDEV_ATTR_LINK_TYPE for the "type" of link being added, and
RDMA_NLDEV_ATTR_NDEV_NAME for the net_device interface to use for this
link. The DELLINK message will contain the RDMA_NLDEV_ATTR_DEV_INDEX of
the device to delete.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a dead lock in usnic ib_register and netdev_notify path.
usnic_ib_discover_pf()
| mutex_lock(&usnic_ib_ibdev_list_lock);
| usnic_ib_device_add();
| ib_register_device()
| usnic_ib_query_port()
| mutex_lock(&us_ibdev->usdev_lock);
| ib_get_eth_speed()
| rtnl_lock()
order of lock: &usnic_ib_ibdev_list_lock -> usdev_lock -> rtnl_lock
rtnl_lock()
| usnic_ib_netdevice_event()
| mutex_lock(&usnic_ib_ibdev_list_lock);
order of lock: rtnl_lock -> &usnic_ib_ibdev_list_lock
Solution is to use the core's lock-free ib_device_get_by_netdev() scheme
to lookup ib_dev while handling netdev & inet events.
Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Reviewed-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Reviewed-by: Tanmay Inamdar <tinamdar@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since rxe allows unregistration from other threads the rxe pointer can
become invalid any moment after ib_register_driver returns. This could
cause a user triggered use after free.
Add another driver callback to be called right after the device becomes
registered to complete any device setup required post-registration. This
callback has enough core locking to prevent the device from becoming
unregistered.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rxe has an open coded version of this that is not as safe as the core
version. This lets us eliminate the internal device list entirely from
rxe.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rxe does not have correct locking for its registration/unregistration
paths, use the core code to handle it instead. In this mode
ib_unregister_device will also do the dealloc, so rxe is required to do
clean up from a callback.
The core code ensures that unregistration is done only once, and generally
takes care of locking and concurrency problems for rxe.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These APIs are intended to support drivers that exist outside the usual
driver core probe()/remove() callbacks. Normally the driver core will
prevent remove() from running concurrently with probe(), once this safety
is lost drivers need more support to get the locking and lifetimes right.
ib_unregister_driver() is intended to be used during module_exit of a
driver using these APIs. It unregisters all the associated ib_devices.
ib_unregister_device_and_put() is to be used by a driver-specific removal
function (ie removal by name, removal from a netdev notifier, removal from
netlink)
ib_unregister_queued() is to be used from netdev notifier chains where
RTNL is held.
The locking is tricky here since once things become async it is possible
to race unregister with registration. This is largely solved by relying on
the registration refcount, unregistration will only ever work on something
that has a positive registration refcount - and then an unregistration
mutex serializes all competing unregistrations of the same device.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Several drivers need to find the ib_device from a given netdev. rxe needs
this at speed in an unsleepable context, so choose to implement the
translation using a RCU safe hash table.
The hash table can have a many to one mapping. This is intended to support
some future case where multiple IB drivers (ie iWarp and RoCE) connect to
the same netdevs. driver_ids will need to be different to support this.
In the process this makes the struct ib_device and ib_port_data RCU safe
by deferring their kfrees.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The associated netdev should not actually be very dynamic, so for most
drivers there is no reason for a callback like this. Provide an API to
inform the core code about the net dev affiliation and use a core
maintained data structure instead.
This allows the core code to be more aware of the ndev relationship which
will allow some new APIs based around this.
This also uses locking that makes some kind of sense, many drivers had a
confusing RCU lock, or missing locking which isn't right.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Like the other cases there no real reason to have another array just for
the cache. This larger conversion gets its own patch.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no reason to have three allocations of per-port data. Combine
them together and make the lifetime for all the per-port data match the
struct ib_device.
Following patches will require more port-specific data, now there is a
good place to put it.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We have many loops iterating over all of the end port numbers on a struct
ib_device, simplify them with a for_each helper.
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Netlink dumpit handshake exchanges the index from which kernel should
start to return its value, in current code, this index included
not-visible in this PID items too and indirectly revealed the number of
entries.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch adds ability to query specific QP based on its LQPN (local
QPN), which is assigned by HW and needs special treatment while inserting
into restrack DB.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
PD, MR and QP objects have parents objects: contexts and PDs. The exposed
parent IDs allow to correlate various objects and simplify debug
investigation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Give to the user space tools unique identifier for PD, MR, CQ and CM_ID
objects, so they can be able to query on them with .doit callbacks.
QP .doit is not supported yet, till all drivers will be updated to provide
their LQPN to be equal to their restrack ID.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As a preparation to extension of rdma_restrack_root to provide software
IDs, which will be per-type too. We convert the rdma_restrack_root from
struct with arrays to array of structs.
Such conversion allows us to drop rwsem lock in favour of internal XArray
lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Combine all HCA capabilities setters under one function
and compile out the ODP related function in case kernel
was compiled without ODP support.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
There is no need to expose internals of restrack DB to IB/core.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
XArray uses internal lock for updates to XArray. This means that our
external RW lock is needed to ensure that entry is not deleted while we
are performing iteration over list.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Implement doit callbacks and ensure that users won't provide port values
on resource entry allocated in per-device mode needed for .doit callback.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add new general helper to get restrack entry given by ID and their
respective type.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The additions of .doit callbacks posses new access pattern to the resource
entries by some user visible index. Back then, the legacy DB was
implemented as hash because per-index access wasn't needed and XArray
wasn't accepted yet.
Acceptance of XArray together with per-index access requires the refresh
of DB implementation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move core device addition and removal from sysfs.c to device.c as device.c
is more appropriate place for device management.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Refactor code for device and port sysfs attributes for reuse.
While at it, rename counter part free function to ib_free_port_attrs.
Also attribute setup sequence is:
(a) port specific init.
(b) device stats alloc/init.
So for cleanup, follow reverse sequence:
(a) device stats dealloc
(b) port specific cleanup
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of holding extra reference using get_device() that
device_unregister() releases, simplify it as below.
device_add() balances with device_del(). device_initialize() balances
with put_device(), always via ib_dealloc_device().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Ucontext allocation and release aren't async events and don't need kref
accounting. The common layer of RDMA subsystem ensures that dealloc
ucontext will be called after all other objects are released.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>