There are few issues with validation of netdevice and listen id lookup
for IB (IPoIB) while processing incoming CM request as below.
1. While performing lookup of bind_list in cma_ps_find(), net namespace
of the netdevice can get deleted in cma_exit_net(), resulting in use
after free access of idr and/or net namespace structures.
This lookup occurs from the workqueue context (and not userspace
context where net namespace is always valid).
CPU0 CPU1
==== ====
bind_list = cma_ps_find();
move netdevice to new namespace
delete net namespace
cma_exit_net()
idr_destroy(idr);
[..]
cma_find_listener(bind_list, ..);
2. While netdevice is validated for IP address in given net namespace,
netdevice's net namespace and/or ifindex can change in
cma_get_net_dev() and cma_match_net_dev().
Above issues are overcome by using rcu lock along with netdevice
UP/DOWN state as described below.
When a net namespace is getting deleted, netdevice is closed and
shutdown before moving it back to init_net namespace.
change_net_namespace() synchronizes with any existing use of netdevice
before changing the netdev properties such as net or ifindex.
Once netdevice IFF_UP flags is cleared, such fields are not guaranteed
to be valid.
Therefore, rcu lock along with netdevice state check ensures that,
while route lookup and cm_id lookup is in progress, netdevice of
interest won't migrate to any other net namespace.
This ensures that associated net namespace of netdevice won't get
deleted while rcu lock is held for netdevice which is in IFF_UP state.
Fixes: fa20105e09 ("IB/cma: Add support for network namespaces")
Fixes: 4be74b42a6 ("IB/cma: Separate port allocation to network namespaces")
Fixes: f887f2ac87 ("IB/cma: Validate routing of incoming requests")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, if a method contained mandatory attributes in a namespace
that wasn't given by the user, these attributes weren't validated.
Fixing this by iterating over all specification namespaces.
Fixes: fac9658cab ("IB/core: Add new ioctl interface")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Before [1], When MAC address of the netdevice is changed, default GID is
supposed to get deleted and added back which affects the node and/or port
GUID in below sequence.
netdevice_event()
-> NETDEV_CHANGEADDR
default_del_cmd()
del_netdev_default_ips()
bond_delete_netdev_default_gids()
ib_cache_gid_set_default_gid()
ib_cache_gid_del()
add_cmd()
[..]
However, ib_cache_gid_del() was not getting invoked in non bonding
scenarios because event_ndev and rdma_ndev are same.
Therefore, fix such condition to ignore checking upper device when event
ndev and rdma_dev are same; similar to bond_set_netdev_default_gids().
Which this fix ib_cache_gid_del() is invoked correctly; however
ib_cache_gid_del() doesn't find the default GID for deletion because
find_gid() was given default_gid = false with
GID_ATTR_FIND_MASK_DEFAULT set.
But it was getting overwritten by ib_cache_gid_set_default_gid() later
on as part of add_cmd().
Therefore, mac address change used to work for default GID.
With refactor series [1], this incorrect behavior is detected.
Therefore,
when deleting default GID, set default_gid and set MASK flag.
when deleting IP based GID, clear default_gid and set MASK flag.
[1] https://patchwork.kernel.org/patch/10319151/
Fixes: 238fdf48f2 ("IB/core: Add RoCE table bonding support")
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When IPv6 link local address is removed, if it matches with the default
GID, default GID(s)s gets removed which may not be a desired behavior.
This behavior is introduced by refactor work in Fixes tag.
When IPv6 link address is removed, removing its equivalent RoCEv2 GID
which exactly matches with default RoCEv2 GID, is right thing to do.
However achieving it correctly requires lot more changes, likely in
roce_gid_mgmt.c and core/cache.c. This should be done as independent
patch.
Therefore, this patch preserves behavior of not deleteing default GIDs.
This is done by providing explicit hint to consider default GID property
using mask and default_gid; similar to add_gid().
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Default GIDs are marked reserved at the start of the GID table at index
0 and 1 by gid_table_reserve_default(). Currently when default GID is
requested, it can still allocates an empty slot which was not marked as
RESERVED for default GID, which is incorrect.
At least in current code flow of roce_gid_mgmt.c, in theory we can
still request to allocate more than one/two default GIDs depending
on how upper devices are setup.
Therefore, it is better for cache layer to only allow our reserved slots
to be used by default GID allocation requests.
Fixes: 598ff6bae6 ("IB/core: Refactor GID modify code for RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The RDMA CM will select a source device and address by consulting
the routing table if no source address is passed into
rdma_resolve_address(). Userspace will ask for this by passing an
all-zero source address in the RESOLVE_IP command. Unfortunately
the new check for non-zero address size rejects this with EINVAL,
which breaks valid userspace applications.
Fix this by explicitly allowing a zero address family for the source.
Fixes: 2975d5de64 ("RDMA/ucma: Check AF family prior resolving address")
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This is done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than half of the call sites.
The 11 remaining call sites to ucma_get_ctx() were manually audited.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
With gcc-4.1.2:
drivers/infiniband/core/uverbs_std_types_flow_action.c:366: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: missing braces around initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>.<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘min_len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: error: unknown field ‘flags’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:376: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: missing braces around initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>.<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:383: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:384: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘min_len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: error: unknown field ‘flags’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
Add the missing braces to fix this.
Fixes: 2eb9beaee5 ("IB/uverbs: Add flow_action create and destroy verbs")
Fixes: 7d12f8d5a1 ("IB/uverbs: Add modify ESP flow_action")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The only thing it does is block module unload while work is posted from
rdma_resolve_ip().
However, this is not the right place to do this. The users of
rdma_resolve_ip() must ensure their own module does not unload until
rdma_resolve_ip() calls the callback, or until rdma_addr_cancel() is
called.
Similarly callers to rdma_addr_find_l2_eth_by_grh() must ensure their
module does not unload while they are calling code.
The only two users are already safe, so there is no need for this.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently rdma_addr_cancel does not prevent the callback from being used,
this is surprising and hard to reason about. There does not appear to be a
bug here as the only user of this API does refcount properly, fixing it
only to increase clarity.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that the work queue is used directly to launch and track the work
there is no need for the second processing function to do 'all list
entries'. Just schedule all entries onto the main work queue directly.
We can also drop all of the useless list sorting now, as the workqueue
sorts by expiration time automatically.
This change requires switching lock to a spinlock as netdev notifiers
are called in an atomic context, this is now easy since the lock does
not need to be held across the lookup, that is already single
threaded due to the work queue.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Validating input parameters should be done before getting the cm_id
otherwise it can leak a cm_id reference.
Fixes: 6a21dfc0d0 ("RDMA/ucma: Limit possible option size")
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
- Fix RDMA uapi headers to actually compile in userspace and be more
complete
- Three shared with netdev pull requests from Mellanox:
* 7 patches, mostly to net with 1 IB related one at the back). This
series addresses an IRQ performance issue (patch 1), cleanups related to
the fix for the IRQ performance problem (patches 2-6), and then extends
the fragmented completion queue support that already exists in the net
side of the driver to the ib side of the driver (patch 7).
* Mostly IB, with 5 patches to net that are needed to support the remaining
10 patches to the IB subsystem. This series extends the current
'representor' framework when the mlx5 driver is in switchdev mode from
being a netdev only construct to being a netdev/IB dev construct. The IB
dev is limited to raw Eth queue pairs only, but by having an IB dev of
this type attached to the representor for a switchdev port, it enables
DPDK to work on the switchdev device.
* All net related, but needed as infrastructure for the rdma driver
- Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers
- SRP performance updates
- IB uverbs write path cleanup patch series from Leon
- Add RDMA_CM support to ib_srpt. This is disabled by default. Users need to
set the port for ib_srpt to listen on in configfs in order for it to be
enabled (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
- TSO and Scatter FCS support in mlx4
- Refactor of modify_qp routine to resolve problems seen while working on new
code that is forthcoming
- More refactoring and updates of RDMA CM for containers support from Parav
- mlx5 'fine grained packet pacing', 'ipsec offload' and 'device memory'
user API features
- Infrastructure updates for the new IOCTL interface, based on increased usage
- ABI compatibility bug fixes to fully support 32 bit userspace on 64 bit
kernel as was originally intended. See the commit messages for
extensive details
- Syzkaller bugs and code cleanups motivated by them
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJax5Z0AAoJEDht9xV+IJsacCwQAJBIgmLCvVp5fBu2kJcXMMVI
y3l2YNzAUJvDDKv1r5yTC9ugBXEkDtgzi/W/C2/5es2yUG/QeT/zzQ3YPrtsnN68
5FkiXQ35Tt7+PBHMr0cacGRmF4M3Td3MeW0X5aJaBKhqlNKwA+aF18pjGWBmpVYx
URYCwLb5BZBKVh4+1Leebsk4i0/7jSauAqE5M+9notuAUfBCoY1/Eve3DipEIBBp
EyrEnMDIdujYRsg4KHlxFKKJ1EFGItknLQbNL1+SEa0Oe0SnEl5Bd53Yxfz7ekNP
oOWQe5csTcs3Yr4Ob0TC+69CzI71zKbz6qPDILTwXmsPFZJ9ipJs4S8D6F7ra8tb
D5aT1EdRzh/vAORPC9T3DQ3VsHdvhwpUMG7knnKrVT9X/g7E+gSji1BqaQaTr/xs
i40GepHT7lM/TWEuee/6LRpqdhuOhud7vfaRFwn2JGRX9suqTcvwhkBkPUDGV5XX
5RkHcWOb/7KvmpG7S1gaRGK5kO208LgmAZi7REaJFoZB74FqSneMR6NHIH07ha41
Zou7rnxV68CT2bgu27m+72EsprgmBkVDeEzXgKxVI/+PZ1oadUFpgcZ3pRLOPWVx
rEqjHu65rlA/YPog4iXQaMfSwt/oRD3cVJS/n8EdJKXi4Qt2RDDGdyOmt74w4prM
QuLEdvJIFmwrND1KDoqn
=Ku8g
-----END PGP SIGNATURE-----
Merge tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Doug and I are at a conference next week so if another PR is sent I
expect it to only be bug fixes. Parav noted yesterday that there are
some fringe case behavior changes in his work that he would like to
fix, and I see that Intel has a number of rc looking patches for HFI1
they posted yesterday.
Parav is again the biggest contributor by patch count with his ongoing
work to enable container support in the RDMA stack, followed by Leon
doing syzkaller inspired cleanups, though most of the actual fixing
went to RC.
There is one uncomfortable series here fixing the user ABI to actually
work as intended in 32 bit mode. There are lots of notes in the commit
messages, but the basic summary is we don't think there is an actual
32 bit kernel user of drivers/infiniband for several good reasons.
However we are seeing people want to use a 32 bit user space with 64
bit kernel, which didn't completely work today. So in fixing it we
required a 32 bit rxe user to upgrade their userspace. rxe users are
still already quite rare and we think a 32 bit one is non-existing.
- Fix RDMA uapi headers to actually compile in userspace and be more
complete
- Three shared with netdev pull requests from Mellanox:
* 7 patches, mostly to net with 1 IB related one at the back).
This series addresses an IRQ performance issue (patch 1),
cleanups related to the fix for the IRQ performance problem
(patches 2-6), and then extends the fragmented completion queue
support that already exists in the net side of the driver to the
ib side of the driver (patch 7).
* Mostly IB, with 5 patches to net that are needed to support the
remaining 10 patches to the IB subsystem. This series extends
the current 'representor' framework when the mlx5 driver is in
switchdev mode from being a netdev only construct to being a
netdev/IB dev construct. The IB dev is limited to raw Eth queue
pairs only, but by having an IB dev of this type attached to the
representor for a switchdev port, it enables DPDK to work on the
switchdev device.
* All net related, but needed as infrastructure for the rdma
driver
- Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers
- SRP performance updates
- IB uverbs write path cleanup patch series from Leon
- Add RDMA_CM support to ib_srpt. This is disabled by default. Users
need to set the port for ib_srpt to listen on in configfs in order
for it to be enabled
(/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port)
- TSO and Scatter FCS support in mlx4
- Refactor of modify_qp routine to resolve problems seen while
working on new code that is forthcoming
- More refactoring and updates of RDMA CM for containers support from
Parav
- mlx5 'fine grained packet pacing', 'ipsec offload' and 'device
memory' user API features
- Infrastructure updates for the new IOCTL interface, based on
increased usage
- ABI compatibility bug fixes to fully support 32 bit userspace on 64
bit kernel as was originally intended. See the commit messages for
extensive details
- Syzkaller bugs and code cleanups motivated by them"
* tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (199 commits)
IB/rxe: Fix for oops in rxe_register_device on ppc64le arch
IB/mlx5: Device memory mr registration support
net/mlx5: Mkey creation command adjustments
IB/mlx5: Device memory support in mlx5_ib
net/mlx5: Query device memory capabilities
IB/uverbs: Add device memory registration ioctl support
IB/uverbs: Add alloc/free dm uverbs ioctl support
IB/uverbs: Add device memory capabilities reporting
IB/uverbs: Expose device memory capabilities to user
RDMA/qedr: Fix wmb usage in qedr
IB/rxe: Removed GID add/del dummy routines
RDMA/qedr: Zero stack memory before copying to user space
IB/mlx5: Add ability to hash by IPSEC_SPI when creating a TIR
IB/mlx5: Add information for querying IPsec capabilities
IB/mlx5: Add IPsec support for egress and ingress
{net,IB}/mlx5: Add ipsec helper
IB/mlx5: Add modify_flow_action_esp verb
IB/mlx5: Add implementation for create and destroy action_xfrm
IB/uverbs: Introduce ESP steering match filter
IB/uverbs: Add modify ESP flow_action
...
Adding new ioctl method for the MR object - REG_DM_MR.
This command can be used by users to register an allocated
device memory buffer as an MR and receive lkey and rkey
to be used within work requests.
It is added as a new method under the MR object and using a new
ib_device callback - reg_dm_mr.
The command creates a standard ib_mr object which represents the
registered memory.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This change adds uverbs support for allocation/freeing
of device memory commands.
A new uverbs object is defined of type idr to represent
and track the new resource type allocation per context.
The API requires provider driver to implement 2 new ib_device
callbacks - one for allocation and one for deallocation which
return and accept (respectively) the ib_dm object which represents
the allocated memory on the device.
The support is added via the ioctl command infrastructure
only.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This change allows vendors to report device memory capability
max_dm_size - to user via uverbs command.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adding a new ESP steering match filter that could match against
spi and seq used in IPSec protocol.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
flow_actions of ESP type could be modified during runtime. This could be
common for example when ESN should be changed. Adding a new
UVERBS_FLOW_ACTION_ESP_MODIFY method for changing ESP parameters of an
existing ESP flow_action.
The new method uses the UVERBS_FLOW_ACTION_ESP_CREATE attributes, but
adds a new IB_FLOW_ACTION_ESP_FLAGS_MOD_ESP_ATTRS which means ESP_ATTRS
should be changed.
In addition, we add a new FLOW_ACTION_ESP_REPLAY_NONE replay type that
could be used when one wants to disable a replay protection over a
specific flow_action.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Binding a flow_action to flow steering rule requires using a new
specification. Therefore, adding such an IB_FLOW_SPEC_ACTION_HANDLE flow
specification.
Flow steering rules could use flow_action(s) and as of that we need to
avoid deleting flow_action(s) as long as they're being used.
Moreover, when the attached rules are deleted, action_handle reference
count should be decremented. Introducing a new mechanism of flow
resources to keep track on the attached action_handle(s). Later on, this
mechanism should be extended to other attached flow steering resources
like flow counters.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
A verbs application may receive and transmits packets using a data
path pipeline. Sometimes, the first stage in the receive pipeline or
the last stage in the transmit pipeline involves transforming a
packet, either in order to make it easier for later stages to process
it or to prepare it for transmission over the wire. Such transformation
could be stripping/encapsulating the packet (i.e. vxlan),
decrypting/encrypting it (i.e. ipsec), altering headers, doing some
complex FPGA changes, etc.
Some hardware could do such transformations without software data path
intervention at all. The flow steering API supports steering a
packet (either to a QP or dropping it) and some simple packet
immutable actions (i.e. tagging a packet). Complex actions, that may
change the packet, could bloat the flow steering API extensively.
Sometimes the same action should be applied to several flows.
In this case, it's easier to bind several flows to the same action and
modify it than change all matching flows.
Introducing a new flow_action object that abstracts any packet
transformation (out of a standard and well defined set of actions).
This flow_action object could be tied to a flow steering rule via a
new specification.
Currently, we support esp flow_action, which encrypts or decrypts a
packet according to the given parameters. However, we present a
flexible schema that could be used to other transformation actions tied
to flow rules.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The current implementation of kern_spec_to_ib_spec_filter, which takes
a uAPI based flow steering specification and creates the respective kernel
API flow steering structure, gets a ib_uverbs_flow_spec structure.
The new flow_action uAPI gets a match mask and filter from user-space
which aren't encoded in the flow steering's ib_uverbs_flow_spec structure.
Exporting the logic out of kern_spec_to_ib_spec_filter to get user-space
blobs rather than ib_uverbs_flow_spec structure.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Methods sometimes need to get one attribute out of a group of
pre-defined attributes. This is an enum-like behavior. Since
this is a common requirement, we add a new ENUM attribute to the
generic uverbs ioctl() layer. This attribute is embedded in methods,
like any other attributes we currently have. ENUM attributes point to
an array of standard UVERBS_ATTR_PTR_IN. The user-space encodes the
enum's attribute id in the id field and the internal PTR_IN attr id in
the enum_data.elem_id field. This ENUM attribute could be shared by
several attributes and it can get UVERBS_ATTR_SPEC_F_MANDATORY flag,
stating this attribute must be supported by the kernel, like any other
attribute.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Now that ib_gid_attr contains device, port and index, simplify the
provider APIs add_gid() and del_gid() to use device, port and index
fields from the ib_gid_attr attributes structure.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Code is refactored to prepare separate functions for RoCE which can do more
complex operations related to reference counting, while still
maintainining code readability. This includes
(a) Simplification to not perform netdevice checks and modifications
for IB link layer.
(b) Do not add RoCE GID entry which has NULL netdevice; instead return
an error.
(c) If GID addition fails at provider level add_gid(), do not add the
entry in the cache and keep the entry marked as INVALID.
(d) Simplify and reuse the ib_cache_gid_add()/del() routines so that they
can be used even for modifying default GIDs. This avoid some code
duplication in modifying default GIDs.
(e) find_gid() routine refers to the data entry flags to qualify a GID
as valid or invalid GID rather than depending on attributes and zeroness
of the GID content.
(f) gid_table_reserve_default() sets the GID default attribute at
beginning while setting up the GID table. There is no need to use
default_gid flag in low level functions such as write_gid(), add_gid(),
del_gid(), as they never need to update the DEFAULT property of the GID
entry while during GID table update.
As as result of this refactor, reserved GID 0:0:0:0:0:0:0:0 is no longer
searchable as described below.
A unicast GID entry of 0:0:0:0:0:0:0:0 is Reserved GID as per the IB
spec version 1.3 section 4.1.1, point (6) whose snippet is below.
"The unicast GID address 0:0:0:0:0:0:0:0 is reserved - referred to as
the Reserved GID. It shall never be assigned to any endport. It shall
not be used as a destination address or in a global routing header
(GRH)."
GID table cache now only stores valid GID entries. Before this patch,
Reserved GID 0:0:0:0:0:0:0:0 was searchable in the GID table using
ib_find_cached_gid_by_port() and other similar find routines.
Zero GID is no longer searchable as it shall not to be present in GRH or
path recored entry as described in IB spec version 1.3 section 4.1.1,
point (6), section 12.7.10 and section 12.7.20.
ib_cache_update() is simplified to check link layer once, use unified
locking scheme for all link layers, removed temporary gid table
allocation/free logic.
Additionally,
(a) Expand ib_gid_attr to store port and index so that GID query
routines can get port and index information from the attribute structure.
(b) Expand ib_gid_attr to store device as well so that in future code when
GID reference counting is done, device is used to reach back to the GID
table entry.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently following inconsistencies exist.
1. ib_query_gid() returns GID from the software cache for a RoCE port
and returns GID from the HCA for an IB port.
This is incorrect because software GID cache is maintained regardless
of HCA port type.
2. GID is queries from the HCA via ib_query_gid and updated in the
software cache for IB link layer. Both of them might not be in sync.
ULPs such as SRP initiator, SRP target, IPoIB driver have historically
used ib_query_gid() API to query the GID. However CM used cached version
during CM processing, When software cache was introduced, this
inconsitency remained.
In order to simplify, improve readability and avoid link layer
specific above inconsistencies, this patch brings following changes.
1. ib_query_gid() always refers to the cache layer regardless of link
layer.
2. cache module who reads the GID entry from HCA and builds the cache,
directly invokes the HCA provider verb's query_gid() callback function.
3. ib_query_port() is being called in early stage where GID cache is not
yet build while reading port immutable property. Therefore it needs to
read the default GID from the HCA for IB link layer to publish the
subnet prefix.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_query_gid() fetches the GID from the software cache maintained in
ib_core for RoCE ports.
Therefore, simplify the provider drivers for RoCE to treat query_gid()
callback as never called for RoCE, and only require non-RoCE devices to
implement it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Check to make sure that ctx->cm_id->device is set before we use it.
Otherwise userspace can trigger a NULL dereference by doing
RDMA_USER_CM_CMD_SET_OPTION on an ID that is not bound to a device.
Cc: <stable@vger.kernel.org>
Reported-by: <syzbot+a67bc93e14682d92fc2f@syzkaller.appspotmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
rdma_cm_state enum is internal to rdma_cm kernel module.
It is not required to expose state enums to ULP modules.
So lets keep its scope limited to rdma_cm module in cma_priv.h file.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make dst_entry pointer as const struct dst_entry* to improve code
readablity to make sure that dst structure fields are not modified by
various functions which are using it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is already used in many places, get the rest of them too, only
to make the code a bit clearer & simpler.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Export the net device name and index to easily find connection
between IB devices and relevant net devices.
We also updated the comment regarding the devices without FW.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
he/she has no a possibility to do that without exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
and this can't be sleepable. Also, sometimes it may be need
really prevent net_namespace_list growth, so for_each_net_rcu()
is not fit there.
This patch introduces new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows not-exclusive iterations over net
namespaces list. It allows to stop using rtnl_lock()
in several places (what is made in next patches) and makes
less the time, we keep rtnl_mutex. Here we just add new lock,
while the explanation of we can remove rtnl_lock() there are
in next patches.
Fine grained locks generally are better, then one big lock,
so let's do that with net_namespace_list, while the situation
allows that.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the rdma_port_space enum is being passed between user and kernel for
user cm_id setup, we need it in a UAPI header. So add it to
rdma_user_cm.h.
This also fixes the cm_id restrack changes which pass up the port space
value via the RDMA_NLDEV_ATTR_RES_PS attribute.
Fixes: 00313983cd ("RDMA/nldev: provide detailed CM_ID information")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There are several places in the ucma ABI where userspace can pass in a
sockaddr but set the address family to AF_IB. When that happens,
rdma_addr_size() will return a size bigger than sizeof struct sockaddr_in6,
and the ucma kernel code might end up copying past the end of a buffer
not sized for a struct sockaddr_ib.
Fix this by introducing new variants
int rdma_addr_size_in6(struct sockaddr_in6 *addr);
int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr);
that are type-safe for the types used in the ucma ABI and return 0 if the
size computed is bigger than the size of the type passed in. We can use
these new variants to check what size userspace has passed in before
copying any addresses.
Reported-by: <syzbot+6800425d54ed3ed8135d@syzkaller.appspotmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IB core maintains the GID cache entries for the GID table.
This cache table has to be maintained regardless of HCA's
support of GID table.
For IB and iWarp ports, cache is created by querying the HCA.
For RoCE cache is created based on netdev events.
Therefore just refer to the RoCE port property of the {device, port} to
decide whether to build cache by querying HCA or from netdev events.
There is no need to check if HCA support GID table or not.
ib_cache_update() referred to RoCE attribute before validating
port. Though in all current callers port is valid, it is incorrect
to query RoCE port property before validating the port. Therefore,
rdma_protocol_roce() check is done after rdma_is_port_valid() verifies
that port is valid.
Fixes: 115b68aa6e ("IB/ocrdma: Removed GID add/del null routines")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Even though API is only used by IPoIB driver, its incorrect to refer
RoCE GID table property to search for GID.
Look for only IB link layer to search for the GID.
Fixes: dbb12562f7 ("IB/{core, ipoib}: Simplify ib_find_gid to search only for IB link layer")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_find_gid_by_filter() searches GID with filter only for RoCE link
layer regardless of HCA's support for GID table.
Therefore, right way to lookup is compare RoCE port property and not
the GID table property.
Fixes: 99b27e3b5d ("IB/cache: Add ib_find_gid_by_filter cache API")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to following reasons, GID table event is generated regardless of GID
table property.
1. GID table cache is maintained at ib core layer regardless of link layer.
2. GID change event has no relation with IB link layer.
3. GID change event also doesn't depend on whether HCA supports GID table
or not.
Fixes: f3906bd360 ("IB/core: Refactor GID cache's ib_dispatch_event")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to below reasons, it is better to not support alternate path receive
messages for RoCE in near term.
1. Alternate path for RoCE is not supported at rdmacm layer.
2. It is not supported in uverbs/core layer for RoCE.
3. Alternate path for IPv6 for link local address cannot resolve route
determinstically without a valid incoming interface id whose usecase
make sense only with dual port mode.
4. init_av_from_path while processing LAP messages for IB and RoCE can
lead to adding duplicate entry of AV into the port list, leads to list
corruption.
5. rdma-core userspace a well known userspace implementation has removed
support of libucm which use ucm.ko module, which is the only module that
can trigger alternate path related messages.
6. ucm kernel module is requested to be removed from the IB core in
patch [1].
[1] https://patchwork.kernel.org/patch/10268503/
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently access to hardware stats buffer isn't protected, this can
result in multiple writes and reads at the same time to the same
memory location. This can lead to providing an incorrect value to
the user. Add a mutex to protect against it.
Fixes: b40f4757da ("IB/core: Make device counter infrastructure dynamic")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rdma_ucm_event_resp is a different length on 32 and 64 bit compiles.
The kernel requires it to be the expected length or longer so 32 bit
builds running on a 64 bit kernel will not work.
Retain full compat by having all kernels accept a struct with or without
the trailing reserved field.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Ensure that device exists prior to accessing its properties.
Reported-by: <syzbot+71655d44855ac3e76366@syzkaller.appspotmail.com>
Fixes: 7521663857 ("RDMA/cma: Export rdma cm interface to userspace")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently CM request for RoCE follows following flow.
rdma_create_id()
rdma_resolve_addr()
rdma_resolve_route()
For RC QPs:
rdma_connect()
->cma_connect_ib()
->ib_send_cm_req()
->cm_init_av_by_path()
->ib_init_ah_attr_from_path()
For UD QPs:
rdma_connect()
->cma_resolve_ib_udp()
->ib_send_cm_sidr_req()
->cm_init_av_by_path()
->ib_init_ah_attr_from_path()
In both the flows, route is already resolved before sending CM requests.
Therefore, code is refactored to avoid resolving route second time in
ib_cm layer.
ib_init_ah_attr_from_path() is extended to resolve route when it is not
yet resolved for RoCE link layer. This is achieved by caller setting
route_resolved field in path record whenever it has route already
resolved.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ib_query_gid() in commit [1] refers to RoCE GID table capability of
the HCA using rdma_cap_roce_gid_table().
ib_core maintains the GID table cache regardless of the HCA provider
drivers capability to maintain RoCE GID table.
Therefore, whether to return a GID table entry from the software cache or
from HCA should be done based on whether the port is RoCE or not.
[1] commit 03db3a2d81 ("IB/core: Add RoCE GID table management")
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The restrack clean routine had simple, but powerful WARN_ON check
to see if all resources are cleared prior to releasing device.
The WARN_ON check performed very well, but lack of information
which device caused to resource leak, the object type and origin
made debug to be fun and challenging at the same time.
The fact that all dumps were the same because restrack_clean() is
called in dealloc() didn't help either.
So let's fix spelling error and convert WARN_ON to be more debug
friendly. The dmesg cut below gives example of how the output
will look output for the case fixed in patch [1]
[ 438.421372] restrack: ------------[ cut here ]------------
[ 438.423448] restrack: BUG: RESTRACK detected leak of resources on mlx5_2
[ 438.425600] restrack: Kernel PD object allocated by mlx5_ib is not freed
[ 438.427753] restrack: Kernel CQ object allocated by mlx5_ib is not freed
[ 438.429660] restrack: ------------[ cut here ]------------
[1] https://patchwork.kernel.org/patch/10298695/
Cc: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The option size check is using optval instead of optlen
causing the set option call to fail. Use the correct
field, optlen, for size check.
Fixes: 6a21dfc0d0 ("RDMA/ucma: Limit possible option size")
Signed-off-by: Chien Tin Tung <chien.tin.tung@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Enable the ioctl() uAPI for IB by default if the standard write()
uAPI (INFINIBAND_USER_ACCESS) is enabled. Verbs that are
also available under the old write() uAPI are put inside a new
INFINIBAND_EXP_LEGACY_VERBS_NEW_UAPI Kconfig.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently, all objects are declared in uverbs_std_types. This could lead
to a huge file once we implement all objects, methods and handlers.
Moving each object to its own file to keep the files smaller and more
readable. uverbs_std_types.c will only contain the parsing tree
definition and objects without any methods.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ioctl() based uverbs is based on merging feature trees. This teaches
the generic parser how to parse methods according to the provider's
support. In order to support merging with the common objects, exporting
the common-object-tree to the provider drivers.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Previously, we've used UVERBS_ATTR_SPEC_F_MIN_SZ for extending existing
attributes. The behavior of this flag was the kernel accepts anything
bigger than the minimum size it specified. This is unsafe, since in
order to safely extend an attribute, we need to make sure unknown size
is zeroed. Replacing UVERBS_ATTR_SPEC_F_MIN_SZ with
UVERBS_ATTR_SPEC_F_MIN_SZ_OR_ZERO, which essentially checks that the
unknown size is zero. In addition, attributes are now decorated with
UVERBS_ATTR_TYPE and UVERBS_ATTR_STRUCT, so we can provide the minimum
and known length.
Users of this flag needs to use copy_from_or_zero functions/macros.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Downstream patches extend uverbs_attr_spec with new fields.
In order to save space, we move the type and flags fields to
the various attribute flavors contained in the union.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Extending uverbs_ioctl header with driver_id and another reserved
field. driver_id should be used in order to identify the driver.
Since every driver could have its own parsing tree, this is necessary
for strace support.
Downstream patches take off the EXPERIMENTAL flag from the ioctl() IB
support and thus we add some reserved fields for future usage.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use macros to make names consistent in ioctl() uAPI:
The ioctl() uAPI works with object-method hierarchy. The method part
also states which handler should be executed when this method is called
from user-space. Therefore, we need to tie method, method's id, method's
handler and the object owning this method together.
Previously, this was done through explicit developer chosen names.
This makes grepping the code harder. Changing the method's name,
method's handler and object's name to be automatically generated based
on the ids.
The headers are split in a way so they be included and used by
user-space. One header strictly contains structures that are used
directly by user-space applications, where another header is used for
internal library (i.e. libibverbs) to form the ioctl() commands.
Other header simply contains the required general command structure.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The error in ucma_create_id() left ctx in the list of contexts belong
to ucma file descriptor. The attempt to close this file descriptor causes
to use-after-free accesses while iterating over such list.
Fixes: 7521663857 ("RDMA/cma: Export rdma cm interface to userspace")
Reported-by: <syzbot+dcfd344365a56fbebd0f@syzkaller.appspotmail.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use rdma_is_port_valid() which performs port validity check instead of
open coding the same check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Before commit f1b65df5a2 ("IB/mlx5: Add support for active_width and
active_speed in RoCE"), the mlx5_ib driver set default active_width and
active_speed to IB_WIDTH_4X and IB_SPEED_QDR.
Now, the active_width and active_speed are zeros if the RoCE port
is in DOWN state. The speed string should be set to " SDR" instead of
a blank string when active_speed is zero.
Signed-off-by: Honggang Li <honli@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Before commit [1], rdma_addr_find_l2_eth_by_grh() was an exported function
and therefore declaration in include/rdma/ib_addr.h was fine.
But now that its scope is limited to ib_core module, its better to have it
in core_priv.h.
[1] commit 1060f86534 ("IB/{core/cm}: Fix generating a return AH for
RoCEE")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce and use helper function get_cm_port_from_path() to get
cm_port based on the the path record entry.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resolving route for RoCE for a path record is needed only for the
received CM requests.
Therefore,
(a) ib_init_ah_attr_from_path() is refactored first to isolate the
code of resolving route.
(b) Setting dlid, path bits is not needed for RoCE.
Additionally ah attribute initialization is done from the path record
entry, so it is better to refer to path record entry type for
different link layer instead of ah attribute type while initializing
ah attribute itself.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add and use helper function add_cm_id_to_port_list() to attach
cm_id to port list.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_resolve_ip_route() is used only by ib_core module. Therefore it is
removed as an exported symbol.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_protocol_roce() API from the ib_core already provides a way to
detect whether a given device+port is RoCE or not.
Therefore, make use of it and avoid implementing it again in rdmacm
module.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ah_attr contains the port number to which cm_id is bound. However, while
searching for GID table for matching GID entry, the port number is
ignored.
This could cause the wrong GID to be used when the ah_attr is converted to
an AH.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The return status of ib_init_ah_from_mcmember() is ignored by
cma_ib_mc_handler(). Honor it and return error event if ah attribute
initialization failed.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_find_gid() is only used by IPoIB driver. For IB link layer, GID table
entries are not based on netdevice. Netdevice parameter is unused here.
Therefore, it is removed.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Exported symbol's comments should be with function definition and not in
the header file. Therefore comments of ib_find_cached_gid() and
ib_find_cached_gid_by_port() functions are moved closer to their
definitions.
The function name in then comment is different than the actual function
name, fix it to be same as ib_cache_gid_find_by_filter().
Also current comment section of ib_find_cached_gid_by_port() contains the
desciption of ib_find_cached_gid(), fix that as well.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
Signed-off-by: Doug Ledford <dledford@redhat.com>
gcc-4.4.4 has issues with initialization of anonymous unions.
drivers/infiniband/core/verbs.c: In function '__ib_drain_sq':
drivers/infiniband/core/verbs.c:2204: error: unknown field 'wr_cqe' specified in initializer
drivers/infiniband/core/verbs.c:2204: warning: initialization makes integer from pointer without a cast
Work around this.
Fixes: a1ae7d0345 ("RDMA/core: Avoid that ib_drain_qp() triggers an out-of-bounds stack access")
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
cma_port_is_unique() allows local port reuse if the quad (source
address and port, destination address and port) for this connection
is unique. However, if the destination info is zero or unspecified, it
can't make a correct decision but still allows port reuse. For example,
sometimes rdma_bind_addr() is called with unspecified destination and
reusing the port can lead to creating a connection with a duplicate quad,
after the destination is resolved. The issue manifests when MPI scale-up
tests hang after the duplicate quad is used.
Set the destination address family and add checks for zero destination
address and port to prevent source port reuse based on invalid destination.
Fixes: 19b752a19d ("IB/cma: Allow port reuse for rdma_id")
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
All callers to ib_modify_qp_is_ok() provides enum ib_qp_state
makes the checks of out-of-scope redundant. Let's remove them
together with updating function signature to return boolean result.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The QP state is internal enum which is checked at the driver
level by calling to ib_modify_qp_is_ok(). Move this check closer
to user and leave kernel users to be checked by compiler.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Implement RDMA nldev netlink interface to get detailed CM_ID information.
Because cm_id's are attached to rdma devices in various work queue
contexts, the pid and task information at restrak_add() time is sometimes
not useful. For example, an nvme/f host connection cm_id ends up being
bound to a device in a work queue context and the resulting pid at attach
time no longer exists after connection setup. So instead we mark all
cm_id's created via the rdma_ucm as "user", and all others as "kernel".
This required tweaking the restrack code a little. It also required
wrapping some rdma_cm functions to allow passing the module name string.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move struct rdma_id_private to a new header cma_priv.h so the resource
tracking services in core/nldev.c can read useful information about cm_ids.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Create a common dumpit function that can be used by all common resource
types. This reduces code replication and simplifies the code as we add
more resource types.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Simplify res_to_dev() to make it easier to read/maintain.
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The QP state is limited and declared in enum ib_qp_state,
but ucma user was able to supply any possible (u32) value.
Reported-by: syzbot+0df1ab766f8924b1edba@syzkaller.appspotmail.com
Fixes: 7521663857 ("RDMA/cma: Export rdma cm interface to userspace")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Users of ucma are supposed to provide size of option level,
in most paths it is supposed to be equal to u8 or u16, but
it is not the case for the IB path record, where it can be
multiple of struct ib_path_rec_data.
This patch takes simplest possible approach and prevents providing
values more than possible to allocate.
Reported-by: syzbot+a38b0e9f694c379ca7ce@syzkaller.appspotmail.com
Fixes: 7ce86409ad ("RDMA/ucma: Allow user space to set service type")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
resolved_dev returned might be NULL as ifindex is transient number.
Ignoring NULL check of resolved_dev might crash the kernel.
Therefore perform NULL check before accessing resolved_dev.
Additionally rdma_resolve_ip_route() invokes addr_resolve() which
performs check and address translation for loopback ifindex.
Therefore, checking it again in rdma_resolve_ip_route() is not helpful.
Therefore, the code is simplified to avoid IFF_LOOPBACK check.
Fixes: 200298326b ("IB/core: Validate route when we init ah")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix warning limit for kernel stack consumption:
drivers/infiniband/core/cq.c: In function 'ib_process_cq_direct':
drivers/infiniband/core/cq.c:78:1: error: the frame size of 1032 bytes
is larger than 1024 bytes [-Werror=frame-larger-than=]
Using smaller ib_wc array on the stack brings us comfortably below that
limit again.
Fixes: 246d8b184c ("IB/cq: Don't force IB_POLL_DIRECT poll context for ib_process_cq_direct")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Sergey Gorenko <sergeygo@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IPv6 does path selection for multipath routes deep in the lookup
functions. The next patch adds L4 hash option and needs the skb
for the forward path. To get the skb to the relevant FIB lookup
functions it needs to go through the fib rules layer, so add a
lookup_data argument to the fib_lookup_arg struct.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Maintaining the uobjects list is mandatory, hoist it into the common
rdma_alloc_commit_uobject() function and inline it as there is now
only one caller.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
During IB device registration process, if query_device() fails or if
ib_core fails to registers sysfs entries, rdma cgroup cleanup is
skipped.
Cc: <stable@vger.kernel.org> # v4.2+
Fixes: 4be3a4fa51 ("IB/core: Fix kernel crash during fail to initialize device")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These pernet_operations just create and destroy IDR.
So, we mark them as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The proper return error is -EOPNOTSUPP and not -ENOSYS, so update
all places in verbs.c to match this semantics.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Simplify the code by directly checking the availability of extended
command flog instead of doing multiple shift operations.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The internal to kernel variable declarations don't need to be
declared with user types. This patch converts such occurrences
appeared in ib_uverbs_write().
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move all header validation logic to be performed before SRCU read lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The SRCU read lock protects the IB device pointer
and doesn't need to be called before copying user
provided header.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is no need to take SRCU lock before checking
file->ucontext, so move it do it before it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The check based on index is not sufficient because
IB_USER_VERBS_EX_CMD_CREATE_CQ = IB_USER_VERBS_CMD_CREATE_CQ
and IB_USER_VERBS_CMD_CREATE_CQ <= IB_USER_VERBS_CMD_OPEN_QP,
so if we execute IB_USER_VERBS_EX_CMD_CREATE_CQ this code checks
ib_dev->uverbs_cmd_mask not ib_dev->uverbs_ex_cmd_mask.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move all command header processing into separate function
and perform those checks before acquiring SRCU read lock.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The non-existing command is supposed to return -EOPNOTSUPP, but the
current code returns different errors for different flows for the
same failure. This patch unifies those flows.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Command that doesn't exist means that it is not supported,
so update code to return -EOPNOTSUPP in case of failure.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fail as early as possible if not enough header data
was provided.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since commit f21519b23c ("IB/core: extended command: an
improved infrastructure for uverbs commands"), the uverbs
supports extra flags as an input to the command interface.
However actually, there is only one flag available and used,
so it is better to refactor the code, so the resolution and
report to the users is done as early as possible.
As part of this change, we changed the return value of failure case
from ENOSYS to be EINVAL to be consistent with the rest flags checks.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update sizeof() users to be consistent with coding style.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The function validate_command_mask() returns only two results: success
or failure, so convert it to return bool instead of 0 and -1.
Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
uaccess_kernel() isn't sufficient to determine if an rdma resource is
user-mode or not. For example, resources allocated in the add_one()
function of an ib_client get falsely labeled as user mode, when they
are kernel mode allocations. EG: mad qps.
The result is that these qps are skipped over during a nldev query
because of an erroneous namespace mismatch.
So now we determine if the resource is user-mode by looking at the object
struct's uobject or similar pointer to know if it was allocated for user
mode applications.
Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Update all the flows to ensure that function pointer exists prior
to accessing it.
This is much safer than checking the uverbs_ex_mask variable, especially
since we know that test isn't working properly and will be removed
in -next.
This prevents a user triggereable oops.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is no matching lock for this mutex. Git history suggests this is
just a missed remnant from an earlier version of the function before
this locking was moved into uverbs_free_xrcd.
Originally this lock was protecting the xrcd_table_delete()
=====================================
WARNING: bad unlock balance detected!
4.15.0+ #87 Not tainted
-------------------------------------
syzkaller223405/269 is trying to release lock (&uverbs_dev->xrcd_tree_mutex) at:
[<00000000b8703372>] ib_uverbs_close_xrcd+0x195/0x1f0
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syzkaller223405/269:
#0: (&uverbs_dev->disassociate_srcu){....}, at: [<000000005af3b960>] ib_uverbs_write+0x265/0xef0
stack backtrace:
CPU: 0 PID: 269 Comm: syzkaller223405 Not tainted 4.15.0+ #87
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
? ib_uverbs_write+0x265/0xef0
? console_unlock+0x502/0xbd0
? ib_uverbs_close_xrcd+0x195/0x1f0
print_unlock_imbalance_bug+0x131/0x160
lock_release+0x59d/0x1100
? ib_uverbs_close_xrcd+0x195/0x1f0
? lock_acquire+0x440/0x440
? lock_acquire+0x440/0x440
__mutex_unlock_slowpath+0x88/0x670
? wait_for_completion+0x4c0/0x4c0
? rdma_lookup_get_uobject+0x145/0x2f0
ib_uverbs_close_xrcd+0x195/0x1f0
? ib_uverbs_open_xrcd+0xdd0/0xdd0
ib_uverbs_write+0x7f9/0xef0
? cyc2ns_read_end+0x10/0x10
? ib_uverbs_open_xrcd+0xdd0/0xdd0
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? cyc2ns_read_end+0x10/0x10
? sched_clock_cpu+0x18/0x200
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? __fget+0x358/0x5d0
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x1e/0x8b
RIP: 0033:0x4335c9
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: fd3c7904db ("IB/core: Change idr objects to use the new schema")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Once the uobj is committed it is immediately possible another thread
could destroy it, which worst case, can result in a use-after-free
of the restrack objects.
Cc: syzkaller <syzkaller@googlegroups.com>
Fixes: 08f294a152 ("RDMA/core: Add resource tracking for create and destroy CQs")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The command number is not bounds checked against the command mask before it
is shifted, resulting in an ubsan hit. This does not cause malfunction since
the command number is eventually bounds checked, but we can make this ubsan
clean by moving the bounds check to before the mask check.
================================================================================
UBSAN: Undefined behaviour in
drivers/infiniband/core/uverbs_main.c:647:21
shift exponent 207 is too large for 64-bit type 'long long unsigned int'
CPU: 0 PID: 446 Comm: syz-executor3 Not tainted 4.15.0-rc2+ #61
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0xde/0x164
? dma_virt_map_sg+0x22c/0x22c
ubsan_epilogue+0xe/0x81
__ubsan_handle_shift_out_of_bounds+0x293/0x2f7
? debug_check_no_locks_freed+0x340/0x340
? __ubsan_handle_load_invalid_value+0x19b/0x19b
? lock_acquire+0x440/0x440
? lock_acquire+0x19d/0x440
? __might_fault+0xf4/0x240
? ib_uverbs_write+0x68d/0xe20
ib_uverbs_write+0x68d/0xe20
? __lock_acquire+0xcf7/0x3940
? uverbs_devnode+0x110/0x110
? cyc2ns_read_end+0x10/0x10
? sched_clock_cpu+0x18/0x200
? sched_clock_cpu+0x18/0x200
__vfs_write+0x10d/0x700
? uverbs_devnode+0x110/0x110
? kernel_read+0x170/0x170
? __fget+0x35b/0x5d0
? security_file_permission+0x93/0x260
vfs_write+0x1b0/0x550
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f033f567c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f033f5686bc RCX: 0000000000448e29
RDX: 0000000000000060 RSI: 0000000020001000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000056a0 R14: 00000000006e8740 R15: 0000000000000000
================================================================================
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.5
Fixes: 2dbd5186a3 ("IB/core: IB/core: Allow legacy verbs through extended interfaces")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If remove_commit fails then the lock is left locked while the uobj still
exists. Eventually the kernel will deadlock.
lockdep detects this and says:
test/4221 is leaving the kernel with locks still held!
1 lock held by test/4221:
#0: (&ucontext->cleanup_rwsem){.+.+}, at: [<000000001e5c7523>] rdma_explicit_destroy+0x37/0x120 [ib_uverbs]
Fixes: 4da70da23e ("IB/core: Explicitly destroy an object while keeping uobject")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is really being used as an assert that the expected usecnt
is being held and implicitly that the usecnt is valid. Rename it to
assert_uverbs_usecnt and tighten the checks to only accept valid
values of usecnt (eg 0 and < -1 are invalid).
The tigher checkes make the assertion cover more cases and is more
likely to find bugs via syzkaller/etc.
Fixes: 3832125624 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The race is between lookup_get_idr_uobject and
uverbs_idr_remove_uobj -> uverbs_uobject_put.
We deliberately do not call sychronize_rcu after the idr_remove in
uverbs_idr_remove_uobj for performance reasons, instead we call
kfree_rcu() during uverbs_uobject_put.
However, this means we can obtain pointers to uobj's that have
already been released and must protect against krefing them
using kref_get_unless_zero.
==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
Read of size 4 at addr ffff88005fda1ac8 by task syz-executor2/441
CPU: 1 PID: 441 Comm: syz-executor2 Not tainted 4.15.0-rc2+ #56
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xd4
print_address_description+0x73/0x290
kasan_report+0x25c/0x370
? copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
? uverbs_try_lock_object+0x68/0xc0
? modify_qp.isra.7+0xdc4/0x10e0
modify_qp.isra.7+0xdc4/0x10e0
ib_uverbs_modify_qp+0xfe/0x170
? ib_uverbs_query_qp+0x970/0x970
? __lock_acquire+0xa11/0x1da0
ib_uverbs_write+0x55a/0xad0
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_query_qp+0x970/0x970
? ib_uverbs_open+0x760/0x760
? futex_wake+0x147/0x410
? sched_clock_cpu+0x18/0x180
? check_prev_add+0x1680/0x1680
? do_futex+0x3b6/0xa30
? sched_clock_cpu+0x18/0x180
__vfs_write+0xf7/0x5c0
? ib_uverbs_open+0x760/0x760
? kernel_read+0x110/0x110
? lock_acquire+0x370/0x370
? __fget+0x264/0x3b0
vfs_write+0x18a/0x460
SyS_write+0xc7/0x1a0
? SyS_read+0x1a0/0x1a0
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL_64_fastpath+0x18/0x85
RIP: 0033:0x448e29
RSP: 002b:00007f443fee0c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f443fee16bc RCX: 0000000000448e29
RDX: 0000000000000078 RSI: 00000000209f8000 RDI: 0000000000000012
RBP: 000000000070bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000008e98 R14: 00000000006ebf38 R15: 0000000000000000
Allocated by task 1:
kmem_cache_alloc_trace+0x16c/0x2f0
mlx5_alloc_cmd_msg+0x12e/0x670
cmd_exec+0x419/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30
Freed by task 1:
kfree+0xeb/0x2f0
mlx5_free_cmd_msg+0xcd/0x140
cmd_exec+0xeba/0x1810
mlx5_cmd_exec+0x40/0x70
mlx5_core_mad_ifc+0x187/0x220
mlx5_MAD_IFC+0xd7/0x1b0
mlx5_query_mad_ifc_gids+0x1f3/0x650
mlx5_ib_query_gid+0xa4/0xc0
ib_query_gid+0x152/0x1a0
ib_query_port+0x21e/0x290
mlx5_port_immutable+0x30f/0x490
ib_register_device+0x5dd/0x1130
mlx5_ib_add+0x3e7/0x700
mlx5_add_device+0x124/0x510
mlx5_register_interface+0x11f/0x1c0
mlx5_ib_init+0x56/0x61
do_one_initcall+0xa3/0x250
kernel_init_freeable+0x309/0x3b8
kernel_init+0x14/0x180
ret_from_fork+0x24/0x30
The buggy address belongs to the object at ffff88005fda1ab0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 24 bytes inside of
32-byte region [ffff88005fda1ab0, ffff88005fda1ad0)
The buggy address belongs to the page:
page:00000000d5655c19 count:1 mapcount:0 mapping: (null)
index:0xffff88005fda1fc0
flags: 0x4000000000000100(slab)
raw: 4000000000000100 0000000000000000 ffff88005fda1fc0 0000000180550008
raw: ffffea00017f6780 0000000400000004 ffff88006c803980 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88005fda1980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb
ffff88005fda1a00: fb fb fc fc fb fb fb fb fc fc 00 00 00 00 fc fc
ffff88005fda1a80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb
ffff88005fda1b00: fc fc 00 00 00 00 fc fc fb fb fb fb fc fc fb fb
ffff88005fda1b80: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc
==================================================================@
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org> # 4.11
Fixes: 3832125624 ("IB/core: Add support for idr types")
Reported-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This clarifies the design intention that time between allocate and
commit has the uobj exclusive to the caller. We already guarantee
this by delaying publishing the uobj pointer via idr_insert,
fd_install, list_add, etc.
Additionally holding the usecnt lock during this period provides
extra clarity and more protection against future mistakes.
Fixes: 3832125624 ("IB/core: Add support for idr types")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If the same attribute is listed twice by the user in the ioctl attribute
list then error unwind can cause the kernel to deref garbage.
This happens when an object with WRITE access is sent twice. The second
parse properly fails but corrupts the state required for the error unwind
it triggers.
Fixing this by making duplicates in the attribute list invalid. This is
not something we need to support.
The ioctl interface is currently recommended to be disabled in kConfig.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
32 bit processes running on a 64 bit kernel call compat_ioctl so that
implementations can revise any structure layout issues. Point compat_ioctl
at our normal ioctl because:
- All our structures are designed to be the same on 32 and 64 bit, ie we
use __aligned_u64 when required and are careful to manage padding.
- Any pointers are stored in u64's and userspace is expected
to prepare them properly.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix a bug in uverbs_ioctl_merge that looked at the object's iterator
number instead of the method's iterator number when merging methods.
While we're at it, make the uverbs_ioctl_merge code a bit more clear
and faster.
Fixes: 118620d368 ('IB/core: Add uverbs merge trees functionality')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The union approach will get the endianness wrong sometimes if the kernel's
pointer size is 32 bits resulting in EFAULTs when trying to copy to/from
user.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rule for the API is pointers less than 8 bytes are inlined into
the .data field of the attribute. Fix the creation of the driver udata
struct to follow this rule and point to the .data itself when the size
is less than 8 bytes.
Otherwise if the UHW struct is less than 8 bytes the driver will get
EFAULT during copy_from_user.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This fixes several bugs around the copy_to/from user path:
- copy_to used the user provided size of the attribute
and could copy data beyond the end of the kernel buffer into
userspace.
- copy_from didn't know the size of the kernel buffer and
could have left kernel memory unexpectedly un-initialized.
- copy_from did not use the user length to determine if the
attribute data is inlined or not.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resource tracking of XRCD objects is not implemented in current
version of restrack and hence can be removed.
Fixes: 02d8883f52 ("RDMA/restrack: Add general infrastructure to track RDMA resources")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Clean up some function signatures in rxe for clarity
- Tidy the RDMA netlink header to remove unimplemented constants
- bnxt_re driver fixes, one is a regression this window.
- Minor hns driver fixes
- Various fixes from Dan Carpenter and his tool
- Fix IRQ cleanup race in HFI1
- HF1 performance optimizations and a fix to report counters in the right units
- Fix for an IPoIB startup sequence race with the external manager
- Oops fix for the new kabi path
- Endian cleanups for hns
- Fix for mlx5 related to the new automatic affinity support
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaePL/AAoJELgmozMOVy/dqhsQALUhzDuuJ+/F6supjmyqZG53
Ak/PoFjTmHToGQfDq/1TRzyKwMx12aB2l6WGZc31FzhvCw4daPWkoEVKReNWUUJ+
fmESxjLgo8ZRGSqpNxn9Q8agE/I/5JZQoA8bCFCYgdZPKTPNKdtAVBphpdhmrOX4
ygjABikWf/wBsNF1A8lnX9xkfPO21cPHrFQLTnuOzOT/hc6U+PPklHSQCnS91svh
1+Pqjtssg54rxYkJqiFq3giSnfwvmAXO8WyVGmRRPFGLpB0nIjq0Sl6ZgLLClz7w
YJdiBGr7rlnNMgGCjlPU2ZO3lO6J0ytXQzFNqRqvKryXQOv+uVeJgep7WqHTcdQU
UN30FCKQMgLL/F6NF8wKaKcK4X0VgXQa7gpuH2fVSXF0c3LO3/mmWNjixbGSzT2c
Wj+EW3eOKlTddhRLhgbMOdwc32tIGhaD85z2F4+FZO+XI9ZQtJaDewWVDjYoumP/
RlDIFw+KCgSq7+UZL8CoXuh0BuS1nu9TGfkx1HW0DLMF1+yigNiswpUfksV4cISP
JqE2I3yH0A4UobD/a+f9IhIfk2MjxO0tJWNjU8IA9LXgUFlskQ6MpH/AcE9G8JNv
tlfLGR3s4PJa/7j/Iy2F84og/b/KH8v7vyj4Eknq/hLq63/BiM5wj0AUBRrGulN6
HhAMOegxGZ7IKP/y0L7I
=xwZz
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull more rdma updates from Doug Ledford:
"Items of note:
- two patches fix a regression in the 4.15 kernel. The 4.14 kernel
worked fine with NVMe over Fabrics and mlx5 adapters. That broke in
4.15. The fix is here.
- one of the patches (the endian notation patch from Lijun) looks
like a lot of lines of change, but it's mostly mechanical in
nature. It amounts to the biggest chunk of change in it (it's about
2/3rds of the overall pull request).
Summary:
- Clean up some function signatures in rxe for clarity
- Tidy the RDMA netlink header to remove unimplemented constants
- bnxt_re driver fixes, one is a regression this window.
- Minor hns driver fixes
- Various fixes from Dan Carpenter and his tool
- Fix IRQ cleanup race in HFI1
- HF1 performance optimizations and a fix to report counters in the right units
- Fix for an IPoIB startup sequence race with the external manager
- Oops fix for the new kabi path
- Endian cleanups for hns
- Fix for mlx5 related to the new automatic affinity support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (38 commits)
net/mlx5: increase async EQ to avoid EQ overrun
mlx5: fix mlx5_get_vector_affinity to start from completion vector 0
RDMA/hns: Fix the endian problem for hns
IB/uverbs: Use the standard kConfig format for experimental
IB: Update references to libibverbs
IB/hfi1: Add 16B rcvhdr trace support
IB/hfi1: Convert kzalloc_node and kcalloc to use kcalloc_node
IB/core: Avoid a potential OOPs for an unused optional parameter
IB/core: Map iWarp AH type to undefined in rdma_ah_find_type
IB/ipoib: Fix for potential no-carrier state
IB/hfi1: Show fault stats in both TX and RX directions
IB/hfi1: Remove blind constants from 16B update
IB/hfi1: Convert PortXmitWait/PortVLXmitWait counters to flit times
IB/hfi1: Do not override given pcie_pset value
IB/hfi1: Optimize process_receive_ib()
IB/hfi1: Remove unnecessary fecn and becn fields
IB/hfi1: Look up ibport using a pointer in receive path
IB/hfi1: Optimize packet type comparison using 9B and bypass code paths
IB/hfi1: Compute BTH only for RDMA_WRITE_LAST/SEND_LAST packet
IB/hfi1: Remove dependence on qp->s_hdrwords
...
The ev_file is an optional parameter for CQ creation. If the parameter
is not passed, the ev_file pointer will be NULL. Using that pointer
to set the cq_context will result in an OOPs.
Verify that ev_file is not NULL before using.
Cc: <stable@vger.kernel.org> # 4.14.x
Fixes: 9ee79fce36 ("IB/core: Add completion queue (cq) object actions")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
We should return -ENOMEM if the allocation fails. The current code
accidentally returns success.
Fixes: bf3c5a93c5 ("RDMA/nldev: Provide global resource utilization")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
- Misc small driver fixups to
bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes
- Several major feature adds to bnxt_re driver: SRIOV VF RoCE support,
HugePages support, extended hardware stats support, and SRQ support
- A notable number of fixes to the i40iw driver from debugging scale up
testing
- More work to enable the new hip08 chip in the hns driver
- Misc small ULP fixups to srp/srpt//ipoib
- Preparation for srp initiator and target to support the RDMA-CM
protocol for connections
- Add RDMA-CM support to srp initiator, srp target is still a WIP
- Fixes for a couple of places where ipoib could spam the dmesg log
- Fix encode/decode of FDR/EDR data rates in the core
- Many patches from Parav with ongoing work to clean up inconsistencies
and bugs in RoCE support around the rdma_cm
- mlx5 driver support for the userspace features 'thread domain', 'wallclock
timestamps' and 'DV Direct Connected transport'. Support for the firmware
dual port rocee capability
- Core support for more than 32 rdma devices in the char dev allocation
- kernel doc updates from Randy Dunlap
- New netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'
- One minor change to the kobject code acked by GKH
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJacfljAAoJEDht9xV+IJsaUnwP+QFJvfIDEfRlfU2rTmcfymPs
Rz9bW1KLgETcJx/XOE2ba2DOaqdFr56TLflsDfEfOSIL8AtzBQqH3vTqEj49bBP7
4JZAkzWllUS/qoYD2XmvOM0IrIfFXzZtLM/lzLi+5dwK26x3GAB9hHXpKzUrJ1vj
I1Naq14qOFXoNBndEtZJqtIKOhR/Pnd6YtxAiNCmViZGdqm3DIU3D4VJhU5B7pO9
j6ovJs16wfJl/gV1iiz9xO49ViVFpwzSIzYE/Q2ZCegcrsF3EEVN2J4vZHkKgDuN
0/Ar/WOvkPzKBFR8hJ7M4kwp0Fy/69/U49s7kpGNxdhML9sU3+Qfse6JYGj0M9L8
01gTM0SShyAZMNAvjVFbIKLQPg806OAit4cooMwlObbwJ6b7B8K0uN17/uVIkIqp
gXqertyl1BLhUtTOby/8Fox/f/oEvaZksKiwcTKSb7D1Y5jGZZUPRknJ5SwAFWQB
RiTPJ6mY7BUsM9zuYQtRE8x2mpgIezYXFcrAz7iT76WuoZQgo1QLIyYRM1+MlhnC
wNrp5BtqoVfW2Ps0CbSdxJ9vDtDf3cwLg0RzcCB8+NJJccsRD9IVMDev/TDY5k9U
M9LxxtW3WuulRWgliU0Q9VaswUQoIao16vBMVL7GwUm+ClLvbRVoPe8jxgtfk+W3
GAANAI7Kv/vUoV/6CFfP
=sMXV
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull RDMA subsystem updates from Jason Gunthorpe:
"Overall this cycle did not have any major excitement, and did not
require any shared branch with netdev.
Lots of driver updates, particularly of the scale-up and performance
variety. The largest body of core work was Parav's patches fixing and
restructing some of the core code to make way for future RDMA
containerization.
Summary:
- misc small driver fixups to
bnxt_re/hfi1/qib/hns/ocrdma/rdmavt/vmw_pvrdma/nes
- several major feature adds to bnxt_re driver: SRIOV VF RoCE
support, HugePages support, extended hardware stats support, and
SRQ support
- a notable number of fixes to the i40iw driver from debugging scale
up testing
- more work to enable the new hip08 chip in the hns driver
- misc small ULP fixups to srp/srpt//ipoib
- preparation for srp initiator and target to support the RDMA-CM
protocol for connections
- add RDMA-CM support to srp initiator, srp target is still a WIP
- fixes for a couple of places where ipoib could spam the dmesg log
- fix encode/decode of FDR/EDR data rates in the core
- many patches from Parav with ongoing work to clean up
inconsistencies and bugs in RoCE support around the rdma_cm
- mlx5 driver support for the userspace features 'thread domain',
'wallclock timestamps' and 'DV Direct Connected transport'. Support
for the firmware dual port rocee capability
- core support for more than 32 rdma devices in the char dev
allocation
- kernel doc updates from Randy Dunlap
- new netlink uAPI for inspecting RDMA objects similar in spirit to 'ss'
- one minor change to the kobject code acked by Greg KH"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (259 commits)
RDMA/nldev: Provide detailed QP information
RDMA/nldev: Provide global resource utilization
RDMA/core: Add resource tracking for create and destroy PDs
RDMA/core: Add resource tracking for create and destroy CQs
RDMA/core: Add resource tracking for create and destroy QPs
RDMA/restrack: Add general infrastructure to track RDMA resources
RDMA/core: Save kernel caller name when creating PD and CQ objects
RDMA/core: Use the MODNAME instead of the function name for pd callers
RDMA: Move enum ib_cq_creation_flags to uapi headers
IB/rxe: Change RDMA_RXE kconfig to use select
IB/qib: remove qib_keys.c
IB/mthca: remove mthca_user.h
RDMA/cm: Fix access to uninitialized variable
RDMA/cma: Use existing netif_is_bond_master function
IB/core: Avoid SGID attributes query while converting GID from OPA to IB
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
IB/umad: Fix use of unprotected device pointer
IB/iser: Combine substrings for three messages
IB/iser: Delete an unnecessary variable initialisation in iser_send_data_out()
IB/iser: Delete an error message for a failed memory allocation in iser_send_data_out()
...
Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.
Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.
Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.
The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.
As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"
* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...
Implement RDMA nldev netlink interface to get detailed information on each
QP in the system. This includes the owning process or kernel ULP and
detailed information from the qp_attrs.
Currently only the dumpit variant is implemented.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Expose through the netlink interface the global per-device utilization of
the supported object types.
Provide both dumpit and doit callbacks.
As an example of possible output from rdmatool for system with 5
mlx5 cards:
$ rdma res
1: mlx5_0: qp 4 cq 5 pd 3
2: mlx5_1: qp 4 cq 5 pd 3
3: mlx5_2: qp 4 cq 5 pd 3
4: mlx5_3: qp 2 cq 3 pd 2
5: mlx5_4: qp 4 cq 5 pd 3
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Track create and destroy operations of PD objects.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Track create and destroy operations of CQ objects.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Track create and destroy operations of QP objects.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The RDMA subsystem has very strict set of objects to work with, but it
completely lacks tracking facilities and has no visibility of resource
utilization.
The following patch adds such infrastructure to keep track of RDMA
resources to help with debugging of user space applications. The primary
user of this infrastructure is RDMA nldev netlink (following patches), to
be exposed to userspace via rdmatool, but it is not limited too that.
At this stage, the main three objects (PD, CQ and QP) are added, and more
will be added later.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The KBUILD_MODNAME variable contains the module name and it is known for
kernel users during compilation, so let's reuse it to track the owners.
Followup patches will store this for resource tracking.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ndev will be initialized and held only for successful
ib_get_cached_gid(), otherwise it is garbage stack memory.
Calling dev_put() in failure path is wrong.
Fixes: 16c72e4028 ("IB/cm: Refactor to avoid setting path record software only fields")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When checking whatever the current netdev is the bond master interface,
use kernel API netif_is_bond_master() instead of hardcoding the check.
No functionality is changed.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
SGID attributes are not used during OPA to IB GID conversion.
Therefore don't query it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The ib_write_umad() is protected by taking the umad file mutex.
However, it accesses file->port->ib_dev -- which is protected only by the
port's mutex (field file_mutex).
The ib_umad_remove_one() calls ib_umad_kill_port() which sets
port->ib_dev to NULL under the port mutex (NOT the file mutex).
It then sets the mad agent to "dead" under the umad file mutex.
This is a race condition -- because there is a window where
port->ib_dev is NULL, while the agent is not "dead".
As a result, we saw stack traces like:
[16490.678059] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
[16490.678246] IP: ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.678333] PGD 0 P4D 0
[16490.678404] Oops: 0000 [#1] SMP PTI
[16490.678466] Modules linked in: rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx4_en(OE) ptp pps_core mlx4_ib(OE-) ib_core(OE) mlx4_core(OE) mlx_compat
(OE) memtrack(OE) devlink mst_pciconf(OE) mst_pci(OE) netconsole nfsv3 nfs_acl nfs lockd grace fscache cfg80211 rfkill esp6_offload esp6 esp4_offload esp4 sunrpc kvm_intel kvm ppdev parport_pc irqbypass
parport joydev i2c_piix4 virtio_balloon cirrus drm_kms_helper ttm drm e1000 serio_raw virtio_pci virtio_ring virtio ata_generic pata_acpi qemu_fw_cfg [last unloaded: mlxfw]
[16490.679202] CPU: 4 PID: 3115 Comm: sminfo Tainted: G OE 4.14.13-300.fc27.x86_64 #1
[16490.679339] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[16490.679477] task: ffff9cf753890000 task.stack: ffffaf70c26b0000
[16490.679571] RIP: 0010:ib_umad_write+0x29c/0xa3a [ib_umad]
[16490.679664] RSP: 0018:ffffaf70c26b3d90 EFLAGS: 00010202
[16490.679747] RAX: 0000000000000010 RBX: ffff9cf75610fd80 RCX: 0000000000000000
[16490.679856] RDX: 0000000000000001 RSI: 00007ffdf2bfd714 RDI: ffff9cf6bb2a9c00
In the above trace, ib_umad_write is trying to dereference the NULL
file->port->ib_dev pointer.
Fix this by using the agent's device pointer (the device field
in struct ib_mad_agent) -- which IS protected by the umad file mutex.
Cc: <stable@vger.kernel.org> # v4.11
Fixes: 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Returning EOPNOTSUPP is problematic because it can also be
returned by the method function, and we use it in quite a few
places in drivers these days.
Instead, dedicate EPROTONOSUPPORT to indicate that the ioctl framework
is enabled but the requested object and method are not supported by
the kernel. No other case will return this code, and it lets userspace
know to fall back to write().
grep says we do not use it today in drivers/infiniband subsystem.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rdma_dev_addr contains the net namespace pointer, while referring
bound_dev_if of the rdma_dev_addr, refer to the net namespace of
rdma_cm_id stored in rdma_dev_addr.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cma_validate_port uses rdma_dev_addr to validate the port of the cm_id.
It needs to honor the net namespace which is setup during cm_id creation
when finding netdevice.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Pass the rdma_cm_id so that multiple fields of the rdma_dev_addr
structure can be accessed, instead of passing each individual fields.
This is needed to access some additional fields in followup patches.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If valid netdevice is not found for RoCE, GID table should not be
searched with NULL netdevice.
Doing so causes the search routines to ignore the netdev argument and may
match the wrong GID table entry if the netdev is deleted.
Fixes: abae1b71dd ("IB/cma: cma_validate_port should verify the port and netdevice")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of rdma_read_gids() API to read SGID and DGID which returns
correct GIDs for RoCE and other transports.
rdma_addr_get_dgid() for RoCE for client side connections returns MAC
address, instead of DGID.
rdma_addr_get_sgid() for RoCE doesn't return correct SGID for IPv6 and
when more than one IP address is assigned to the netdevice.
Therefore use transport agnostic rdma_read_gids() API provided by rdma_cm
module.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch introduces an API that allows legacy applications to query
GIDs for a rdma_cm_id which is used during connection establishment.
GIDs are stored and created differently for iWarp, IB and RoCE transports.
Therefore rdma_read_gids() returns GID for all the transports hiding
such internal details to caller.
It is usable for client side and server side connections.
In general continued use of GID based addressing outside of IB is
discouraged, so rdma_read_gids() should not be used by any new ULPs.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
polling the completion queue directly does not interfere
with the existing polling logic, hence drop the requirement.
Be aware that running ib_process_cq_direct with non IB_POLL_DIRECT
CQ may trigger concurrent CQ processing.
This can be used for polling mode ULPs.
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[maxg: added wcs array argument to __ib_process_cq]
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need to initialize completion and WR in case we fail
during QP modification.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
gcc-8 reports
drivers/infiniband/core/cma_configfs.c: In function 'make_cma_dev':
./include/linux/string.h:245:9: warning: '__builtin_strncpy' specified
bound 64 equals destination size [-Wstringop-truncation]
We need to use strlcpy() to make sure the string is nul-terminated.
Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This matches what the userspace copy of this header has been doing
for a while. imm_data is an opaque 4 byte array carried over the network,
and invalidate_rkey is in CPU byte order.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Resolving DMAC for RoCE is applicable to only Connected mode QPs.
So resolve DMAC for only for Connected mode QPs.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Instead of returning 0 (success) for RoCE scenarios where DMAC should
not be resolved, avoid such attempt and make code consistent with
ib_create_user_ah().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently ah_attr is initialized by the ib_cm layer for rdma_cm
based applications. For RoCE transport ah_attr.roce.dmac is already
initialized by ib_cm, rdma_cm either from wc, path record, route
resolve, explicit path record setting depending on active or passive
side QP. Therefore avoid resolving DMAC for QP of kernel consumers.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently qp->port stores the port number whenever IB_QP_PORT
QP attribute mask is set (during QP state transition to INIT state).
This port number should be stored for the real QP when XRC target QP
is used.
Follow the ib_modify_qp() implementation and hide the access to ->real_qp.
Fixes: a512c2fbef ("IB/core: Introduce modify QP operation with udata")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix kernel-doc warning for ib_fmr_pool_map_phys() and also format it
with function description and text spacing.
../drivers/infiniband/core/fmr_pool.c:404: warning: Excess function parameter 'pool' description in 'ib_fmr_pool_map_phys'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Change function parameter name in kernel-doc notation and other comments
to eliminate a kernel-doc warning.
../drivers/infiniband/core/verbs.c:1790: warning: Excess function parameter 'wq_init_attr' description in 'ib_create_wq'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The 'if' logic in ucma_query_path was broken with OPA was introduced
and started to treat RoCE paths as as OPA paths. Invert the logic
of the 'if' so only OPA paths are treated as OPA paths.
Otherwise the path records returned to rdma_cma users are mangled
when in RoCE mode.
Fixes: 5752075144 ("IB/SA: Add OPA path record type")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_set_ib_path() missed setting path record fields for RoCE
transport when RoCE support was added.
This results in setting incorrect ndev, destination mac address,
incorrect GID type etc errors when user space attempts to set a raw
IB path using the roce IB path compatibility mapping from userspace.
Fixes: 3c86aa70bf ("RDMA/cm: Add RDMA CM support for IBoE devices")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since 2006 there has been no user of rdmacm based application to make use
of setting multiple path records using rdma_set_ib_paths API.
Therefore code is simplified to allow setting one path record entry.
Now that it sets only single path, it is renamed to reflect the same.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce a helper function to set path record L2 fields for RoCE.
This includes setting GID type, destination mac address and netdev
ifindex.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The net namespace is set in addr during create_rdma_id(),
cma_resolve_iboe_route() should use that instead of the
init namespace.
The original code was added in commit fa20105e09 ("IB/cma: Add support
for network namespaces"), but this path wasn't in use back then.
This patch updates the code to use right namespace, as preparation
for improving namespace support.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is a need to increase number of possible char devices to support
large number of SR-IOV instances. The current limit is in the range of
64-128 devices/ports. Increase it to support up to 1024.
The patch performs the following steps to refactor the code:
1. Removes the split bitmap for fixed and overflow dev numbers.
2. Pre-allocates the non-legacy major number range during driver
initialization, choosen for simplicity.
3. Add new define (RDMA_MAX_PORTS) that is shared between all drivers.
This is the maximum total number of ports on all struct ib_devices.
4. Set RDMA_MAX_PORTS to 1024.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Remove the locks that protect character device bitmaps of
uverbs, umad and issm.
The character device bitmaps are accessed in "client->add" and
"client->remove" calls from ib_register_device and ib_unregister_device
respectively. These calls are already protected by the "device_mutex"
mutex. Thus, the spinlocks are not needed.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
- One line fix to mlx4 error flow (same as mlx5 fix in last pull request,
just in the mlx4 driver)
- Fix a race condition in the IPoIB driver. This patch is larger than
just a one line fix, but resolves a race condition in a fairly
straight forward manner
- Fix a locking issue in the RDMA netlink code. This patch is also
larger than I would like for a late -rc. It has, however, had a week
to bake in the rdma tree prior to this pull request
- One line fix to fix granting remote machine access to memory that they
don't need and shouldn't have
- One line fix to correct the fact that our sgid/dgid pair is swapped
from what you would expect when receiving an incoming connection
request
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaU+ZkAAoJELgmozMOVy/dLw8P/1f27k9c7Bg91VfuyQeIcSxA
kyRDdzlkRzuI/6QJ4ErK+IkOH8ADG6UGmQa+fOv1dxG8do+YwVflcY7gEgjJA7fP
k0oPuGjiq8wrEWZrFGinln38ou0KALYd4F2C32unVYrsIohQLHSr1D6Ttw0W5FA6
NQG4nVn9FzmilgjqtkW2zOGKw4jdAn57J47tUp49KufuPBTUcxjmZCdaV5AmiuzN
5JpZUieL49Zoc18pcm1OreqDPZcj5LV1XquDNV+AZgU9+uGKoIb932k6hQjBRuml
FSePxpPjdN8zX/KVaa4HQHX4U4uMBp0HcRHYME1bDsKwTh/d9xKM/yTPzzCtJz+r
wmGJ9TPr2nq8blJJq17nSXbaJ4LmzlScCwork3LomdZJi880JwWJlvjFG3M/Yir9
HvS2zIOUJm+xZBNCDVEayYcBMkXew5XjxETtDwOvfYX8FM419LLk1WOp2y/4LKDD
hIR8QYkZMl37lMYqWZUghNjR7Rov6jdd30KDiCGdOAO/qszlNyTSL+icWyzc1t/X
VT4ai7vc0RTicPWwb8H8o8/dQNj8Ed8w5NnMq3hjen+KrTKShkZTMuW+or/E9jZN
ha9jIzSPLRfOvX6mZRrQVe6hiY3fOWMZXdw7gtehUy2hX7LCSwwbn2v6FcsDxyMQ
UW6ZVG3ccP9YSY+tBWKg
=kUnv
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Doug Ledford:
- One line fix to mlx4 error flow (same as mlx5 fix in last pull
request, just in the mlx4 driver)
- Fix a race condition in the IPoIB driver. This patch is larger than
just a one line fix, but resolves a race condition in a fairly
straight forward manner
- Fix a locking issue in the RDMA netlink code. This patch is also
larger than I would like for a late -rc. It has, however, had a week
to bake in the rdma tree prior to this pull request
- One line fix to fix granting remote machine access to memory that
they don't need and shouldn't have
- One line fix to correct the fact that our sgid/dgid pair is swapped
from what you would expect when receiving an incoming connection
request
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/srpt: Fix ACL lookup during login
IB/srpt: Disable RDMA access by the initiator
RDMA/netlink: Fix locking around __ib_get_device_by_index
IB/ipoib: Fix race condition in neigh creation
IB/mlx4: Fix mlx4_ib_alloc_mr error flow
Merging in 12 patch series from Bart that required changes in the
current for-rc branch in order to apply cleanly.
Signed-off-by: Doug Ledford <dledford@redhat.com>
When mlx5_ib_add is called determine if the mlx5 core device being
added is capable of dual port RoCE operation. If it is, determine
whether it is a master device or a slave device using the
num_vhca_ports and affiliate_nic_vport_criteria capabilities.
If the device is a slave, attempt to find a master device to affiliate it
with. Devices that can be affiliated will share a system image guid. If
none are found place it on a list of unaffiliated ports. If a master is
found bind the port to it by configuring the port affiliation in the NIC
vport context.
Similarly when mlx5_ib_remove is called determine the port type. If it's
a slave port, unaffiliate it from the master device, otherwise just
remove it from the unaffiliated port list.
The IB device is registered as a multiport device, even if a 2nd port is
not available for affiliation. When the 2nd port is affiliated later the
GID cache must be refreshed in order to get the default GIDs for the 2nd
port in the cache. Export roce_rescan_device to provide a mechanism to
refresh the cache after a new port is bound.
In a multiport configuration all IB object (QP, MR, PD, etc) related
commands should flow through the master mlx5_core_dev, other commands
must be sent to the slave port mlx5_core_mdev, an interface is provide
to get the correct mdev for non IB object commands.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
It always returns 0. Change return type to void.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The cases for FDR/EDR signalling speed, were missing in
ib_rate_to_mult and mult_to_ib_rate giving wrong return values when
drivers convert static rate to/from inter-packet-delay.
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The commit 1a1c116f3d ("RDMA/netlink: Simplify the put_msg and put_attr")
removes nlmsg_len calculation in ibnl_put_attr causing netlink messages and
caused to miss source and destination addresses.
Fixes: 1a1c116f3d ("RDMA/netlink: Simplify the put_msg and put_attr")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Holding locks is mandatory when calling __ib_device_get_by_index,
otherwise there are races during the list iteration with device removal.
Since the locks are static to device.c, __ib_device_get_by_index can
never be called correctly by any user out side the file.
Make the function static and provide a safe function that gets the
correct locks and returns a kref'd pointer. Fix all callers.
Fixes: e5c9469efc ("RDMA/netlink: Add nldev device doit implementation")
Fixes: c3f66f7b00 ("RDMA/netlink: Implement nldev port doit callback")
Fixes: 7d02f605f0 ("RDMA/netlink: Add nldev port dumpit implementation")
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The NLDEV commands are using IB device indexes and names as a handle
for netlink communications. Put all relevant code into one function
to remove code duplication in followup patches.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
There is an existing function to decrease reference counter
of the device, let's use it.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The request_module() call is internally wrapped by CONFIG_MODULE,
so there is no need to check it in our RDMA code too.
Refactor to simplify the code.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
- cxgb4 fix for an iser testing failure as debugged by Steve and Sagi.
The problem was a driver bug in the handling of shutting down a QP.
- Various vmw_pvrdma fixes for bogus WARN_ON, missed resource free on error
unwind and a use after free bug
- Improper congestion counter values on mlx5 when link aggregation is enabled
- ipoib lockdep regression introduced in this merge window
- hfi1 regression supporting the device in a VM introduced in a recent patch
- Typo that breaks future uAPI compatibility in the verbs core
- More SELinux related oops fixing
- Fix an oops during error unwind in mlx5
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJaRIC/AAoJEDht9xV+IJsaJfQP/1Z97/kDlJGIJQ4vBJ52xdHV
LfRdmCBqU5nrAihEBpFLRc2S+kaSJbYAY48tRn28Jx6s9dmSvU6v2J2IqhmnM6p6
ruWLR0Yqjg+xHcw+eaEoscJjRw+jDUEeVOgfbYc0HViWwvMNTrBB32HpAV48HuAl
aCbM/qrQYXdYuJBImM4glERIpjlvYKoxv4D9xCJhJRRQvTnKOymHzZpKbqNujWxl
dzCmZeOrw+HVxNW9MHHtUxClBoLNnykfRVKzMcdDjsqJ+Fdo2bY3ksgMvgiatRwY
NxGfixhouhOz9vjN/ljpWXxTV5TTm6Nrib8XcHuOWjcYn/AFwJMMRsM+1w1AuCKs
Zviq7QVApZzYuvHw1ewupRGvDX+P13sufD5sbc6cfVUT3w6ZX0Clpspl4++JN4ER
WvBZikozaviL3w9ir0drlZ6k9BDnjQ6P7wZcBjDZC/j0zXKM65rISZrTsK7TeiTk
lBNdLCkwZhO0dvafCNwA910tTaXEPhqqAh8Okob2A5U5lUAewd0AEHJusL/iCmSl
uXnnxu8ik61QzOqwneEHSyVMkOSLEC+kk13fiFAq/LjPUSm9N/MihZd4JNxwSa6W
4Rah7IKdh9F6qEnaKLPEfHxPhfghhb7O51zCA8mwA/JNCneqc4Gqi0U2JXkuloml
395aK2aZSShIkZvIwbI8
=IkGi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This is the next batch of for-rc patches from RDMA. It includes the
fix for the ipoib regression I mentioned last time, and the result of
a fairly major debugging effort to get iser working reliably on cxgb4
hardware - it turns out the cxgb4 driver was not handling QP error
flushing properly causing iser to fail.
- cxgb4 fix for an iser testing failure as debugged by Steve and
Sagi. The problem was a driver bug in the handling of shutting down
a QP.
- Various vmw_pvrdma fixes for bogus WARN_ON, missed resource free on
error unwind and a use after free bug
- Improper congestion counter values on mlx5 when link aggregation is
enabled
- ipoib lockdep regression introduced in this merge window
- hfi1 regression supporting the device in a VM introduced in a
recent patch
- Typo that breaks future uAPI compatibility in the verbs core
- More SELinux related oops fixing
- Fix an oops during error unwind in mlx5"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/mlx5: Fix mlx5_ib_alloc_mr error flow
IB/core: Verify that QP is security enabled in create and destroy
IB/uverbs: Fix command checking as part of ib_uverbs_ex_modify_qp()
IB/mlx5: Serialize access to the VMA list
IB/hfi: Only read capability registers if the capability exists
IB/ipoib: Fix lockdep issue found on ipoib_ib_dev_heavy_flush
IB/mlx5: Fix congestion counters in LAG mode
RDMA/vmw_pvrdma: Avoid use after free due to QP/CQ/SRQ destroy
RDMA/vmw_pvrdma: Use refcount_dec_and_test to avoid warning
RDMA/vmw_pvrdma: Call ib_umem_release on destroy QP path
iw_cxgb4: when flushing, complete all wrs in a chain
iw_cxgb4: reflect the original WR opcode in drain cqes
iw_cxgb4: Only validate the MSN for successful completions
Patches for 4.16 that are dependent on patches sent to 4.15-rc.
These are small clean ups for the vmw_pvrdma and i40iw drivers.
* 'from-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git:
RDMA/vmw_pvrdma: Remove usage of BIT() from UAPI header
RDMA/vmw_pvrdma: Use refcount_t instead of atomic_t
RDMA/vmw_pvrdma: Use more specific sizeof in kcalloc
RDMA/vmw_pvrdma: Clarify QP and CQ is_kernel logic
RDMA/vmw_pvrdma: Add UAR SRQ macros in ABI header file
i40iw: Change accelerated flag to bool
Delete ibnl_chk_listeners() and its kernel-doc comments from the
core_priv.h header file. There is no such function.
Fixes: 233c195583 ("RDMA/netlink: Reduce exposure of RDMA netlink functions")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The rq/sq->psn is 24 bits as defined in the IB spec, therefore we mask
out the 8 most significant bits to avoid overflow in modify_qp.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The XRC target QP create flow sets up qp_sec only if there is an IB link with
LSM security enabled. However, several other related uAPI entry points blindly
follow the qp_sec NULL pointer, resulting in a possible oops.
Check for NULL before using qp_sec.
Cc: <stable@vger.kernel.org> # v4.12
Fixes: d291f1a652 ("IB/core: Enforce PKey security on QPs")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
If the input command length is larger than the kernel supports an error should
be returned in case the unsupported bytes are not cleared, instead of the
other way aroudn. This matches what all other callers of ib_is_udata_cleared
do and will avoid user ABI problems in the future.
Cc: <stable@vger.kernel.org> # v4.10
Fixes: 189aba99e7 ("IB/uverbs: Extend modify_qp and support packet pacing")
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Use rdma_cap_opa_mad() to check for OPA to promote code reuse.
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
SA queries SM for class port info when there is a LID_CHANGE event.
When a base lid is configured before fm is started ie when smlid is
not yet assigned, SA handles the LID_CHANGE event and tries query SM
with lid 0. This will cause an hang.
[ 1106.958820] INFO: task kworker/2:0:23 blocked for more than 120 seconds.
[ 1106.965082] Tainted: G O 4.12.0+ #1
[ 1106.969602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[ 1106.977227] kworker/2:0 D 0 23 2 0x00000000
[ 1106.977250] Workqueue: infiniband update_ib_cpi [ib_core]
[ 1106.977261] Call Trace:
[ 1106.977273] __schedule+0x28e/0x860
[ 1106.977285] schedule+0x36/0x80
[ 1106.977298] schedule_timeout+0x1a3/0x2e0
[ 1106.977310] ? radix_tree_iter_tag_clear+0x1b/0x20
[ 1106.977322] ? idr_alloc+0x64/0x90
[ 1106.977334] wait_for_completion+0xe3/0x140
[ 1106.977347] ? wake_up_q+0x80/0x80
[ 1106.977369] update_ib_cpi+0x163/0x210 [ib_core]
[ 1106.977381] process_one_work+0x147/0x370
[ 1106.977394] worker_thread+0x4a/0x390
[ 1106.977406] kthread+0x109/0x140
[ 1106.977418] ? process_one_work+0x370/0x370
[ 1106.977430] ? kthread_park+0x60/0x60
[ 1106.977443] ret_from_fork+0x22/0x30
Always ensure a proper smlid is assigned before querying SM for cpi.
Fixes: ee1c60b1bf ("IB/SA: Modify SA to implicitly cache Class Port info")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.
Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When path ah_attr initialization from path record
fails, ib_cm_send_rej() uses av.ah_attr fields to send out reject
message. In such cases initialization of path record software fields
is not needed. Code is simplified for same.
Additionally in current code in cm_req_handler, when ib_get_cached_gid
fails for a given sgid_index of the GID of the GRH of the incoming CM MAD,
error code 12 is sent. This error code refers to primary GID in incoming
CM REQ and not for the GID in in MAD packet.
Therefore code is refactored to send code 5 (unsupported request) for such
error.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently ib_init_ah_from_wc initializes address handle attributes and
not the address handle object itself.
To avoid confusion between ah_attr vs ah, ib_init_ah_from_wc is
renamed to ib_init_ah_attr_from_wc to reflect that its initialzes
ah_attr.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since ib_init_ah_from_path initializes the address handle attribute, it is
renamed to reflect so.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In case of LAP are used for RoCE, it can lead to a problem of sleeping a
context while spin lock is held in below flow.
cm_lap_handler
->spin_lock
-> <..switch_case..>
-> cm_init_av_for_response
-> ib_init_ah_from_wc
-> rdma_addr_find_l2_eth_by_grh
wait_for_completion()
Therefore ah attribute initialization is done for incoming lap requests
outside of the lock context.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cm_init_av_by_path depends on ib_init_ah_from_path to initialize ah
attribute and ib_init_ah_from_path() can fail, such error should not
be ignored.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
cm_init_av_for_response depends on ib_init_ah_from_wc() whose return
status is ignored.
ib_init_ah_from_wc() can fail and its return status should be handled as
done in this patch.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently there are no users of ib_find_gid for RoCE transport. It is
only used by IPoIB.
Therefore its simplified to ignore RoCE ports and GID type check which
was previously done for every port.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_copy_addr copies the ifndex to bound_dev_if.
Therefore avoid copying it again after rdma_copy_addr call is completed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Since no caller needs vlan, rdma_translate_ip is simplified to avoid
vlan pointer.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_addr_find_smac_by_sgid() is exported symbol not used by any kernel
module. Therefore its removed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_resolve_ip already copies 'addr' to its dev_addr argument.
Remove the duplicate memcpy and since it was the only user, remove the
'addr' member from resolve_cb_context.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_find_gid_by_filter() is used only by ib_core, therefore avoid
exporting it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently on every gid entry comparison miss found variable is checked;
which is not needed as those two comparison fail already indicate that
GID is not found yet.
So refactor to avoid such check and copy the GID index when found.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Type cast from void to struct find_gid_index_context is not needed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Introduce and user helper functions to initialize work for address
resolved and route resolved event that avoid code duplication at few
places.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Avoid setting path record type twice for RoCE.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Current code checks for NULL ndev twice where 2nd check is always
invalid given the fact that during route resolving stage, device address
must be bound to netdevice interface.
This patch simplifies such check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
As the function name suggests cma_resolve_iboe_route() resolves RoCE
route. However, its default GID type is IB_GID_TYPE_IB and not
IB_GID_TYPE_ROCE, even though both are mapped to the same enum value.
Change default GID type to IB_GID_TYPE_ROCE.
cma_iboe_set_mgid() is updated to reflect the RoCEv2 GID check.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In ib_umem structure npages holds original number of sg entries, while
nmap is number of DMA blocks returned by dma_map_sg.
Fixes: c5d76f130b ('IB/core: Add umem function to read data from user-space')
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add debug prints to the error paths in the connection manager control
flows, to help debug connection management problems.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The code was using the src size when formatting the dst. They are almost
certainly the same value but it reads wrong.
Fixes: ce117ffac2 ("RDMA/cma: Export AF_IB statistics")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ib_security_modify_qp and ib_security_pkey_access are core internal
function. So avoid exporting them.
ib_security_pkey_access is used only when secuirty hooks are enabled so
avoid defining it otherwise.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RoCEv1 does not use the IPv6 stack to resolve the link local DGID since it
uses GID address. It forms the DMAC directly from the DGID.
The code became confused and also tried to use this bypass for RoCEv2
packets, however RoCEv2 always uses a IP address in the GID and must
always use ARP or neighbor discovery to get the DMAC address.
Now that rdma_addr_find_l2_eth_by_grh() supports resolving link local
address to find destination mac address, lets make use of it.
This aligns it to how the rest of the IPv6 stack resolves link local
destination IPv6 address.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
When computing a UD reverse path (return AH) from a WC the code was not
doing a route lookup anchored in a specific netdevice. This caused several
bugs, including broken IPv6 link-local address support in RoCEv2. [1]
This fixes the lookup by determining the GID table entry that the HW
matched to the SGID for the WC and then using the netdevice from that
entry to perform the route and ND lookup for the 'DGID' to build a return
AH.
RoCE GID table management ensures that right upper netdevices of the
physical netdevices are added. Therefore init_ah_from_wc doesn't need to
perform such check.
Now that route lookup is done based on the netdevice of the GID entry,
simplify code to not have ifindex and vlan pointers. As part of that,
refactor to have netdevice as input parameter. This is already discussed
at [2].
Finally ib_init_ah_from_wc resolves dmac for unicast GID in similar way as
what ib_resolve_eth_dmac() does. So ib_resolve_eth_dmac is refactored to
split for unicast and non unicast GIDs, so that it can be reused by
ib_init_ah_from_wc.
While we are at refactoring ib_resolve_eth_dmac(), it is further
simplified
(a) to avoid hoplimit as optional parameter, as there is only one
user who always queries hoplimit.
(b) for empty line.
(c) avoided zero initialization of ret.
(d) removed as exported symbol as only ib core uses it.
For IPv6, this is tested using simple rping test as below.
rping -sv -a ::0
rping -c -a fe80::268a:7ff:fe55:4661%ens2f1 -C 1 -v -d
[1] https://www.spinics.net/lists/linux-rdma/msg45690.html
[2] https://www.spinics.net/lists/linux-rdma/msg45710.html
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reported-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
- Fix for SELinux on the umad SMI path. Some old hardware
does not fill the PKey properly exposing another bug in the newer
SELinux code.
- Check the input port as we can exceed array bounds from this
user supplied value
- Users are unable to use the hash field support as they want due to
incorrect checks on the field restrictions, correct that so the
feature works as intended
- User triggerable oops in the NETLINK_RDMA handler
- cxgb4 driver fix for a bad interaction with CQ flushing in iser
caused by patches in this merge window, and bad CQ flushing during
normal close.
- Unbalanced memalloc_noio in ipoib in an error path.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJaNGJ9AAoJEDht9xV+IJsawK4P/iVlUR8DReXKVPkxYOQk15bI
GEKG3t2Ce1GaeFZY7TBmsdHBRTHhxf2osEM57TbBmWv6N/pG83GLresE6xxOhRHz
3s2hzWElJXpYnM0QttHCNjJvySIzjzLZiaQyhiWFqs0+cPVUM9zQd0G77LwHngRf
1gO7toTMYk8eZkJt0ClQwHMeH6qR892o+zDUtorX/Ez4Ly4tT3I/RwRpbZ1HHpsA
uWMYcsge7lRzFbZnC+lDoeqozcv20B7n9UBEcAHJkVSh5JFC+TByRmCAZ/hPzjXb
Pr2E4gTYT+ULUsPRECtIwupT30xfFdByFYBAl+EQ+fiJvGgBxcgVjdLDQ3Ddlb6n
ga5UEverYKivizitKowtpMCJ0nVH6R4qLt5vcPwxuoHKQmUtXQFeg/haZPWCiPwr
B4Ahm371yRx8xo4AITBFX4L4PdtmdAueyrrjz/MxJm5YM2eRy08OONFVlBqXTuqT
EdbtHFCbXtE3aAIiWGmUA0jbswKN9fkUct/wMwkny8T3h/XPKhBqA/SWN5SX1KC3
EHAjczAcX+MOS52pyhf07C3Z/oq4gpXSCQQHSat9es8oxst4w0CWcCGqIhyqyu2q
s5CZG3Ok+OvTmKWRJkaEAJXHRoTB1OjkgEod13xRDQmONW/cabeBKe/BC1zmvctM
g2eyl4amP8MGaRTldpou
=oyIl
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"More fixes from testing done on the rc kernel, including more SELinux
testing. Looking forward, lockdep found regression today in ipoib
which is still being fixed.
Summary:
- Fix for SELinux on the umad SMI path. Some old hardware does not
fill the PKey properly exposing another bug in the newer SELinux
code.
- Check the input port as we can exceed array bounds from this user
supplied value
- Users are unable to use the hash field support as they want due to
incorrect checks on the field restrictions, correct that so the
feature works as intended
- User triggerable oops in the NETLINK_RDMA handler
- cxgb4 driver fix for a bad interaction with CQ flushing in iser
caused by patches in this merge window, and bad CQ flushing during
normal close.
- Unbalanced memalloc_noio in ipoib in an error path"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/ipoib: Restore MM behavior in case of tx_ring allocation failure
iw_cxgb4: only insert drain cqes if wq is flushed
iw_cxgb4: only clear the ARMED bit if a notification is needed
RDMA/netlink: Fix general protection fault
IB/mlx4: Fix RSS hash fields restrictions
IB/core: Don't enforce PKey security on SMI MADs
IB/core: Bound check alternate path port number
With gcc-4.1.2:
drivers/infiniband/core/iwpm_util.c: In function ‘iwpm_send_mapinfo’:
drivers/infiniband/core/iwpm_util.c:647: warning: ‘ret’ may be used uninitialized in this function
Indeed, if nl_client is not found in any of the scanned has buckets, ret
will be used uninitialized.
Preinitialize ret to -EINVAL to fix this.
Fixes: 30dc5e63d6 ("RDMA/core: Add support for iWARP Port Mapper user space service")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Fix ptr_ret.cocci warnings:
drivers/infiniband/core/uverbs_cmd.c:1156:1-3: WARNING: PTR_ERR_OR_ZERO can be used
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
Generated by: scripts/coccinelle/api/ptr_ret.cocci
Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
ULPs do not understand OPA GIDs and will reject CM requests
if the sgid does not match the local_gid. In order to
fix this behavior we convert the OPA GID back to an IB GID.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Per the infiniband spec an SMI MAD can have any PKey. Checking the pkey
on SMI MADs is not necessary, and it seems that some older adapters
using the mthca driver don't follow the convention of using the default
PKey, resulting in false denials, or errors querying the PKey cache.
SMI MAD security is still enforced, only agents allowed to manage the
subnet are able to receive or send SMI MADs.
Reported-by: Chris Blake <chrisrblake93@gmail.com>
Cc: <stable@vger.kernel.org> # v4.12
Fixes: 47a2b338fe ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The alternate port number is used as an array index in the IB
security implementation, invalid values can result in a kernel panic.
Cc: <stable@vger.kernel.org> # v4.12
Fixes: d291f1a652 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
fix for a regression in iWarp if SELinux is enabled, a fix for a compilation
regression introduced in this merge window, and one obscure kconfig
combination that oops's the kernel.
For drivers, we have hns fixes needed to make their devices work on certain
ARM IOMMU configurations, a stack data leak for hfi1, and various testing
discovered -rc bug fixes for i40iw.
This cycle we pushed back on the driver maintainers to have better commit
messages for -rc material.
You may need to pull my latest PGP key from the GPG key servers for this, I am
not certain if the subkey update will make it to kernel.org's WKD before you
need it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJaJcu6AAoJEDht9xV+IJsa8aoP/Rh3tHT3mqN6v/p9HgUyUNRS
gyDJ4BHg3A+O1UrnBjFjrAhpX9bqik/96n9t14Er7kVM9gxaDzOaCNd9+ASsKjDf
fXRusCnKS5RP5CpQ16e6qurkOsBXghsJKTL+zpqGSmDf0yUBQCJUkmRNJNhiaUtW
YEp92dfZytTK+iEmuXW4fJoIKWK3N5aOkttiK8BFb6XvmsUnWSp1wlBS2FhRzDq9
PPwfM2EE/x46dFF1/w04M5hVDPO6Bngq0Tvo+EdOlAMwKN3Zmun+fSOLKaxg44Of
dyN6dsu5tKi200Nbdq6cBkehWL6CukSGdJnepeI+xW+8hve9Eu9O6j6O3pMb/dYn
/vvqE14KhrR1B3F5LFkJLcxxKRl97S2uPhOY2j3oU4L93s9B4X6geXX2oLVIos1r
41YPu1/7OQyQffp4eKgsz4eA38TpdG6DoOlFMXgdIboJ8bASuRuyfLISVviMc8dx
SKQTZTY54FK7uJMRw4rkcOlVUpJ2tyuVZr+Lt8p80IpnySCdJsEgAZkJngCPOKRT
2h8VdfFwzhdlf3Ni5tZRZdMtE6oMD5BMa0jri7xtyKYa0o3gUqvHDGMTSVlQ1maF
qXMP2mApTcpdFuFvdbnxIeLzP8zigJVkvIsqeKHGS8gt+dxF/934rTY1NTj3rQN5
zmClIoiVg7NvHlDzvwg+
=YaUB
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Here is the first rc pull request for RDMA. This includes an important
core fix for a regression in iWarp if SELinux is enabled, a fix for a
compilation regression introduced in this merge window, and one
obscure kconfig combination that oops's the kernel.
For drivers, we have hns fixes needed to make their devices work on
certain ARM IOMMU configurations, a stack data leak for hfi1, and
various testing discovered -rc bug fixes for i40iw.
This cycle we pushed back on the driver maintainers to have better
commit messages for -rc material"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/core: Only enforce security for InfiniBand
RDMA/hns: Get rid of page operation after dma_alloc_coherent
RDMA/hns: Get rid of virt_to_page and vmap calls after dma_alloc_coherent
RDMA/hns: Fix the issue of IOVA not page continuous in hip08
IB/core: Init subsys if compiled to vmlinuz-core
RDMA/cma: Make sure that PSN is not over max allowed
i40iw: Notify user of established connection after QP in RTS
i40iw: Move MPA request event for loopback after connect
i40iw: Correct ARP index mask
i40iw: Do not free sqbuf when event is I40IW_TIMER_TYPE_CLOSE
i40iw: Allocate a sdbuf per CQP WQE
IB: INFINIBAND should depend on HAS_DMA
IB/hfi1: Initialize bth1 in 16B rc ack builder
For now the only LSM security enforcement mechanism available is
specific to InfiniBand. Bypass enforcement for non-IB link types.
This fixes a regression where modify_qp fails for iWARP because
querying the PKEY returns -EINVAL.
Cc: Paul Moore <paul@paul-moore.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: stable@vger.kernel.org
Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
Fixes: d291f1a65232("IB/core: Enforce PKey security on QPs")
Fixes: 47a2b338fe63("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Once infiniband is compiled as a core component its subsystem must be
enabled before device initialization. Otherwise there is a NULL pointer
dereference during mlx4_core init, calltrace:
->device_add
if (dev->class) {
deref dev->class->p =>NULLPTR
#Config
CONFIG_NET_DEVLINK=y
CONFIG_MAY_USE_DEVLINK=y
CONFIG_MLX4_EN=y
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This patch limits the initial value for PSN to 24 bits as
spec requires.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow RDMA to create long standing memory registrations
against filesytem-dax vmas.
Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 3565fce3a6 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add iWARP support to qedr driver
- Lots of misc fixes across subsystem
- Multiple update series to hns roce driver
- Multiple update series to hfi1 driver
- Updates to vnic driver
- Add kref to wait struct in cxgb4 driver
- Updates to i40iw driver
- Mellanox shared pull request
- timer_setup changes
- massive cleanup series from Bart Van Assche
- Two series of SRP/SRPT changes from Bart Van Assche
- Core updates from Mellanox
- i40iw updates
- IPoIB updates
- mlx5 updates
- mlx4 updates
- hns updates
- bnxt_re fixes
- PCI write padding support
- Sparse/Smatch/warning cleanups/fixes
- CQ moderation support
- SRQ support in vmw_pvrdma
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJaDF9JAAoJELgmozMOVy/dDXUP/i92g+G4OJ+4hHMh4KCjQMHT
eMr/w9l1C033HrtsU1afPhqHOsKSxwCuJSiTgN4uXIm67/2kPK5Vlx+ir7mbOLwB
3ukVK6Q/aFdigWCUhIaJSlDpjbd2sEj7JwKtM3rucvMWJlBJ4mAbcVQVfU96CCsv
V9mO7dpR3QtYWDId9DukfnAfPUPFa3SMZnD7tdl6mKNRg/MjWGYLAL4nJoBfex5f
b4o+MTrbuFWXYsfDru1m9BpHgyul20ldfcnbe8C/sVOQmOgkX7ngD5Sdi1FLeRJP
GF/DnAqInC9N7cAxZHx4kH9x6mLMmEdfnwQ9VTVqGUHBsj3H4hQTVIAFfHUhWUbG
TP5ZHgZG2CewZ0rf092cWlDZwp6n0BalnbQJr+QN4MzPmYbofs3AccSKUwrle+e+
E6yYf4XxJdt7wRr4F1QKygtUEXSnNkNYUDQ4ZFbpJS/D4Sq80R1ZV/WZ7PJxm1D/
EIKoi7NU9cbPMIlbCzn8kzgfjS7Pe4p0WW/Xxc/IYmACzpwNPkZuFGSND79ksIpF
jhHqwZsOWFuXISjvcR4loc8wW6a5w5vjOiX0lLVz0NSdXSzVqav/2at7ZLDx/PT+
Lh9YVL51akA3hiD+3X6iOhfOUu6kskjT9HijE5T8rJnf0V+C6AtIRpwrQ7ONmjJm
3JMrjjLxtCIvpUyzCvDW
=A1oL
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"This is a fairly plain pull request. Lots of driver updates across the
stack, a huge number of static analysis cleanups including a close to
50 patch series from Bart Van Assche, and a number of new features
inside the stack such as general CQ moderation support.
Nothing really stands out, but there might be a few conflicts as you
take things in. In particular, the cleanups touched some of the same
lines as the new timer_setup changes.
Everything in this pull request has been through 0day and at least two
days of linux-next (since Stephen doesn't necessarily flag new
errors/warnings until day2). A few more items (about 30 patches) from
Intel and Mellanox showed up on the list on Tuesday. I've excluded
those from this pull request, and I'm sure some of them qualify as
fixes suitable to send any time, but I still have to review them
fully. If they contain mostly fixes and little or no new development,
then I will probably send them through by the end of the week just to
get them out of the way.
There was a break in my acceptance of patches which coincides with the
computer problems I had, and then when I got things mostly back under
control I had a backlog of patches to process, which I did mostly last
Friday and Monday. So there is a larger number of patches processed in
that timeframe than I was striving for.
Summary:
- Add iWARP support to qedr driver
- Lots of misc fixes across subsystem
- Multiple update series to hns roce driver
- Multiple update series to hfi1 driver
- Updates to vnic driver
- Add kref to wait struct in cxgb4 driver
- Updates to i40iw driver
- Mellanox shared pull request
- timer_setup changes
- massive cleanup series from Bart Van Assche
- Two series of SRP/SRPT changes from Bart Van Assche
- Core updates from Mellanox
- i40iw updates
- IPoIB updates
- mlx5 updates
- mlx4 updates
- hns updates
- bnxt_re fixes
- PCI write padding support
- Sparse/Smatch/warning cleanups/fixes
- CQ moderation support
- SRQ support in vmw_pvrdma"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (296 commits)
RDMA/core: Rename kernel modify_cq to better describe its usage
IB/mlx5: Add CQ moderation capability to query_device
IB/mlx4: Add CQ moderation capability to query_device
IB/uverbs: Add CQ moderation capability to query_device
IB/mlx5: Exposing modify CQ callback to uverbs layer
IB/mlx4: Exposing modify CQ callback to uverbs layer
IB/uverbs: Allow CQ moderation with modify CQ
iw_cxgb4: atomically flush the qp
iw_cxgb4: only call the cq comp_handler when the cq is armed
iw_cxgb4: Fix possible circular dependency locking warning
RDMA/bnxt_re: report vlan_id and sl in qp1 recv completion
IB/core: Only maintain real QPs in the security lists
IB/ocrdma_hw: remove unnecessary code in ocrdma_mbx_dealloc_lkey
RDMA/core: Make function rdma_copy_addr return void
RDMA/vmw_pvrdma: Add shared receive queue support
RDMA/core: avoid uninitialized variable warning in create_udata
RDMA/bnxt_re: synchronize poll_cq and req_notify_cq verbs
RDMA/bnxt_re: Flush CQ notification Work Queue before destroying QP
RDMA/bnxt_re: Set QP state in case of response completion errors
RDMA/bnxt_re: Add memory barriers when processing CQ/EQ entries
...
- proper use of the bool type (Thomas Meyer)
- constification of struct config_item_type (Bhumika Goyal)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAloLSTALHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNxfhAAv3cunxiEPEAvs+1xuGd3cZYaxz7qinvIODPxIKoF
kRWiuy5PUklRMnJ8seOgJ1p1QokX6Sk4cZ8HcctDJVByqODjOq4K5eaKVN1ZqJoz
BUzO/gOqfs64r9yaFIlKfe8nFA+gpUftSeWyv3lThxAIJ1iSbue7OZ/A10tTOS1m
RWp9FPepFv+nJMfWqeQU64BsoDQ4kgZ2NcEA+jFxNx5dlmIbLD49tk0lfddvZQXr
j5WyAH73iugilLtNUGVOqSzHBY4kUvfCKUV7leirCegyMoGhFtA87m6Wzwbo6ZUI
DwQLzWvuPaGv1P2PpNEHfKiNbfIEp75DRyyyf87DD3lc5ffAxQSm28mGuwcr7Rn5
Ow/yWL6ERMzCLExoCzEkXYJISy7T5LIzYDgNggKMpeWxysAduF7Onx7KfW1bTuhK
mHvY7iOXCjEvaIVaF8uMKE6zvuY1vCMRXaJ+kC9jcIE3gwhg+2hmQvrdJ2uAFXY+
rkeF2Poj/JlblPU4IKWAjiPUbzB7Lv0gkypCB2pD4riaYIN5qCAgF8ULIGQp2hsO
lYW1EEgp5FBop85oSO/HAGWeH9dFg0WaV7WqNRVv0AGXhKjgy+bVd7iYPpvs7mGw
z9IqSQDORcG2ETLcFhZgiJpCk/itwqXBD+wgMOjJPP8lL+4kZ8FcuhtY9kc9WlJE
Tew=
=+tMO
-----END PGP SIGNATURE-----
Merge tag 'configfs-for-4.15' of git://git.infradead.org/users/hch/configfs
Pull configfs updates from Christoph Hellwig:
"A couple of configfs cleanups:
- proper use of the bool type (Thomas Meyer)
- constification of struct config_item_type (Bhumika Goyal)"
* tag 'configfs-for-4.15' of git://git.infradead.org/users/hch/configfs:
RDMA/cma: make config_item_type const
stm class: make config_item_type const
ACPI: configfs: make config_item_type const
nvmet: make config_item_type const
usb: gadget: configfs: make config_item_type const
PCI: endpoint: make config_item_type const
iio: make function argument and some structures const
usb: gadget: make config_item_type structures const
dlm: make config_item_type const
netconsole: make config_item_type const
nullb: make config_item_type const
ocfs2/cluster: make config_item_type const
target: make config_item_type const
configfs: make ci_type field, some pointers and function arguments const
configfs: make config_item_type const
configfs: Fix bool initialization/comparison
Current ib_modify_cq() is used to set CQ moderation parameters.
This patch renames ib_modify_cq() to be rdma_set_cq_moderation(),
because the kernel version of RDMA API doesn't need to follow already
exposed to user's API pattern (create_XXX/modify_XXX/query_XXX/destroy_XXX)
and better to have more accurate name which describes the actual usage.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The query_device function can now obtain the maximum values for
cq_max_count and cq_period, needed for CQ moderation.
cq_max_count is a 16 bits number that determines the number
of CQEs to accumulate before generating an event.
cq_period is a 16 bits number that determines the timeout in micro
seconds from the last event generated, upon which a new event will
be generated even if cq_max_count was not reached.
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Uverbs support in modify_cq for CQ moderation only.
Gives ability to change cq_max_count and cq_period.
CQ moderation enhance performance by moderating the number
of CQEs needed to create an event instead of application
having to suffer from event per-CQE.
To achieve CQ moderation the application needs to set cq_max_count
and cq_period.
cq_max_count - defines the number of CQEs needed to create an event.
cq_period - defines the timeout (micro seconds) between last
event and a new one that will occur even if
cq_max_count was not satisfied
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When modify QP is called on a shared QP update the security context for
the real QP. When security is subsequently enforced the shared QP
handles will be checked as well.
Without this change shared QP handles get added to the port/pkey lists,
which is a bug, because not all shared QP handles will be checked for
access. Also the shared QP security context wouldn't get removed from
the port/pkey lists causing access to free memory and list corruption
when they are destroyed.
Cc: stable@vger.kernel.org
Fixes: d291f1a652 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Function returns zero - make it void.
While there make struct net_device const.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As Dan pointed out, the rework I did makes it harder for smatch and other
static checkers to figure out what is going on with the uninitialized
pointers.
By open-coding the call in create_udata(), we make it more readable for
both humans and tools.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 12f727721e ("IB/uverbs: clean up INIT_UDATA_BUF_OR_NULL usage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When deciding to convert an OPA AH to IB we were incorrectly
including the IB multicast range. At this layer, all Extended
LIDs will be larger than IB_LID_PERMISSIVE. Change comparison
accordingly.
Fixes: d541e45500 ("IB/core: Convert ah_attr from OPA to IB when copying to user")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since there is nothing done with non zero return value, such check is
avoided.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There are root complexes that are able to optimize their
performance when incoming data is multiple full cache lines.
PCI write end padding is the device's ability to pad the ending of
incoming packets (scatter) to full cache line such that the last
upstream write generated by an incoming packet will be a full cache
line.
Add a relevant entry to ib_device_cap_flags to report such capability
of an RDMA device.
Add the QP and WQ create flags:
* A QP/WQ created with a scatter end padding flag will cause
HW to pad the last upstream write generated by a packet to cache line.
User should consider several factors before activating this feature:
- In case of high CPU memory load (which may cause PCI back pressure in
turn), if a large percent of the writes are partial cache line, this
feature should be checked as an optional solution.
- This feature might reduce performance if most packets are between one
and two cache lines and PCIe throughput has reached its maximum
capacity. E.g. 65B packet from the network port will lead to 128B
write on PCIe, which may cause traffic on PCIe to reach high
throughput.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The RDMA/umem uses generic RB-trees macros to generate various ib_umem
access functions. The generation is performed with INTERVAL_TREE_DEFINE
macro, which allows one of two modes: declare all functions as static or
declare none of the function to be static.
The second mode of operation produces the following sparse errors:
drivers/infiniband/core/umem_rbtree.c:69:1:
warning: symbol 'rbt_ib_umem_iter_first' was not declared.
Should it be static?
drivers/infiniband/core/umem_rbtree.c:69:1:
warning: symbol 'rbt_ib_umem_iter_next' was not declared.
Should it be static?
Code relocation together with declaration of such functions to be
"static" solves the issue.
Because there is no need to have separate file for two functions,
let's consolidate umem_rtree.c and umem_odp.c into one file.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWfswbQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykvEwCfXU1MuYFQGgMdDmAZXEc+xFXZvqgAoKEcHDNA
6dVh26uchcEQLN/XqUDt
=x306
-----END PGP SIGNATURE-----
Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull initial SPDX identifiers from Greg KH:
"License cleanup: add SPDX license identifiers to some files
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the
'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
binding shorthand, which can be used instead of the full boiler plate
text.
This patch is based on work done by Thomas Gleixner and Kate Stewart
and Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset
of the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to
license had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied
to a file was done in a spreadsheet of side by side results from of
the output of two independent scanners (ScanCode & Windriver)
producing SPDX tag:value files created by Philippe Ombredanne.
Philippe prepared the base worksheet, and did an initial spot review
of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537
files assessed. Kate Stewart did a file by file comparison of the
scanner results in the spreadsheet to determine which SPDX license
identifier(s) to be applied to the file. She confirmed any
determination that was not immediately clear with lawyers working with
the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained
>5 lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that
became the concluded license(s).
- when there was disagreement between the two scanners (one detected
a license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply
(and which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases,
confirmation by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.
The Windriver scanner is based on an older version of FOSSology in
part, so they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot
checks in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect
the correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial
patch version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch
license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
applied SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
License cleanup: add SPDX license identifier to uapi header files with a license
License cleanup: add SPDX license identifier to uapi header files with no license
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
IB device index is nldev's handler and it should be checked always.
Fixes: c3f66f7b00 ("RDMA/netlink: Implement nldev port doit callback")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
[ Applying directly, since Doug fried his SSD's and is rebuilding - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In recent code, two path record entries are alwasy cleared while
allocated could be either one or two path record entries.
This leads to zero out of unallocated memory.
This fix initializes alternative path record only when alternative path
is set.
While we are at it, path record allocation doesn't check for OPA
alternative path, but rest of the code checks for OPA alternative path.
Path record allocation code doesn't check for OPA alternative LID.
This can further lead to memory corruption when only one path record is
allocated, but there is actually alternative OPA path record present in CM
request.
Cc: <stable@vger.kernel.org> # v4.12+
Fixes: 9fdca4da4d ("IB/SA: Split struct sa_path_rec based on IB and ROCE specific fields")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Make these structures const as they are either passed to the functions
having the argument as const or stored as a reference in the "ci_type"
const field of a config_item structure.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The early for-next branch was based on v4.14-rc2, while the shared pull
request I got from Mellanox used a v4.14-rc4 base. I'm making the
branch that was the shared Mellanox pull request the new for-next branch
and merging the early for-next branch into it.
Signed-off-by: Doug Ledford <dledford@redhat.com>
The IB/core provides address resolution service and invokes callback
handler when address resolve request completes of requester in worker
thread context.
Such caller might allocate or free memory in callback handler
depending on the completion status to make further progress or to
terminate a connection. Most ULPs resolve route which involves
allocating route entry and path record elements in callback event handler.
It has been noticed that WQ_MEM_RECLAIM flag should not be used for
workers that tend to allocate memory in this [1] thread discussion.
In order to mitigate this situation, WQ_MEM_RECLAIM flag was dropped for
other such WQs in this [2] patch.
Similar problem might arise with address resolution path, though its not
yet noticed. The ib_addr workqueue is not memory reclaim path due to its
nature of invoking callback that might allocate memory or don't free any
memory under memory pressure.
[1] https://www.spinics.net/lists/linux-rdma/msg53239.html
[2] https://www.spinics.net/lists/linux-rdma/msg53416.html
Fixes: f54816261c ("IB/addr: Remove deprecated create_singlethread_workqueue")
Fixes: 5fff41e1f8 ("IB/core: Fix race condition in resolving IP to MAC")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch fixes the case where 'lifespan' entry of the hw_counters
is not writable. Currently write callback is not exposed for for
the hw_counters sysfs operation. Due to this, modifying lifespan
value results into permission denied error in below example.
echo 10 > /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan
-bash: /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan:
Permission denied
This patch adds the hook to modify any attribute which implements
store() operation.
Fixes: b40f4757da ("IB/core: Make device counter infrastructure dynamic")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since IB/core resolves the destination mac address for user and kernel
consumers, avoid resolving in multiple provider drivers.
Only ib_core resolves DMAC now, therefore resolve_eth_dmac is removed as
exported symbol.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce rdma_create_user_ah API which allows passing udata to
provider driver and additionally which resolves DMAC for RoCE.
ib_resolve_eth_dmac() resolves destination mac address for unicast,
multicast, link local ipv4 mapped ipv6 and ipv6 destination gid entry.
This allows all RoCE provider drivers to avoid duplicating such code.
Such change brings consistency where IB core always resolves dmac and pass
it to RoCE provider drivers for user and kernel consumers, with this
ah_attr->roce.dmac is always an input field for provider drivers.
This uniformity avoids exporting ib_resolve_eth_dmac symbol to providers
or other modules. Therefore its removed as exported symbol at later in
the patch series.
Now uverbs and umad both makes use of rdma_create_user_ah API which
fixes the issue where umad has invalid DMAC for address.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch reduces the number of #ifdefs and also avoids that
smatch reports the following:
drivers/infiniband/core/uverbs_ioctl.c:276: ib_uverbs_cmd_verbs() warn: if statement not indented
drivers/infiniband/core/uverbs_ioctl.c:280: ib_uverbs_cmd_verbs() warn: possible memory leak of 'ctx'
drivers/infiniband/core/uverbs_ioctl.c:315: ib_uverbs_cmd_verbs() warn: if statement not indented
References: commit fac9658cab ("IB/core: Add new ioctl interface")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Matan Barak <matanb@mellanox.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
According to the C standard the behavior of computations with
integer operands is as follows:
* A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting
unsigned integer type is reduced modulo the number that is one
greater than the largest value that can be represented by the
resulting type.
* The behavior for signed integer underflow and overflow is
undefined.
Hence only use unsigned integers when checking for integer
overflow.
This patch is what I came up with after having analyzed the
following smatch warnings:
drivers/infiniband/core/cma.c:3448: cma_resolve_ib_udp() warn: signed overflow undefined. 'offset + conn_param->private_data_len < conn_param->private_data_len'
drivers/infiniband/core/cma.c:3505: cma_connect_ib() warn: signed overflow undefined. 'offset + conn_param->private_data_len < conn_param->private_data_len'
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Avoid that gcc 7 reports the following warning when building with W=1:
warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
prot_sg_cnt cannot be zero as a previous check on ret (from which
prot_sg_cnt is assigned) returns -ENOMEM if is it zero. Since
it cannot be zero we can simplify the code by removing the non
-zero check on prot_sg_cnt and redundant else statement.
Detected by CoverityScan, COD#1357188 ("Logically dead code")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of making every caller convert the second argument of
sa_path_set_slid() and sa_path_set_dlid() to big endian format,
make these two functions accept LIDs in CPU endian format.
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Cc: Don Hiatt <don.hiatt@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Commit 1a1c116f3d removes nlmsg_len calculation in
ibnl_put_attr causing netlink messages to be rejected due
to incorrect length.
Add nlmsg_end after all attributes are appended to calculate
the nlmsg_len.
Fixes: 1a1c116f3d ("RDMA/netlink: Simplify the put_msg and put_attr")
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
After changing INIT_UDATA_BUF_OR_NULL() to an inline function,
this does the same change to INIT_UDATA for consistency.
I'm keeping it separate as this part is much larger and
we wouldn't want to backport this to stable kernels if we
ever want to address the gcc warnings by backporting the
first patch.
Again, using an inline function gives us better type
safety here among other issues with macros. I'm using
u64_to_user_ptr() to convert the user pointer to simplify
the logic rather than adding lots of new type casts.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We get a harmless warning about the fact that we use the result of a
multiplication as a condition:
drivers/infiniband/core/uverbs_main.c: In function 'ib_uverbs_write':
drivers/infiniband/core/uverbs_main.c:787:40: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
drivers/infiniband/core/uverbs_main.c:787:117: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
drivers/infiniband/core/uverbs_main.c:790:50: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
drivers/infiniband/core/uverbs_main.c:790:151: error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]
This avoids the problem by using an inline function in place of
the macro.
Fixes: a96e4e2ffe ("IB/uverbs: New macro to set pointers to NULL if length is 0 in INIT_UDATA()")
Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://patchwork.kernel.org/patch/9940777/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Trivial fix to spelling mistake in WARN message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When security_ib_alloc_security fails, qp->qp_sec memory is freed.
However ib_destroy_qp still tries to access this memory which result
in kernel crash. So its initialized to NULL to avoid such access.
Fixes: d291f1a652 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The tag matching functionality is implemented by mlx5 driver
by extending XRQ, however this internal kernel information was
exposed to user space applications with *xrq* name instead of *tm*.
This patch renames *xrq* to *tm* to handle that.
Fixes: 8d50505ada ("IB/uverbs: Expose XRQ capabilities")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
- Smattering of miscellanous fixes
- A five patch series for i40iw that had a patch (5/5) that was larger
than I would like, but I took it because it's needed for large scale
users
- An 8 patch series for bnxt_re that landed right as I was leaving on
PTO and so had to wait until now...they are all appropriate fixes for
-rc IMO
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZxU+WAAoJELgmozMOVy/dQwEP/ja5+3zNbkX69T/ch5Q9koKO
7O1Onw/ePn9va/hC0IJm910syeyUcnkl+0GJH9JhS/Q/7bd9S97TjdSMjZpOSTjA
qCkFWOJ2zZPsGVijsiFF+BQa1jPgUc2VRwbuC4sWm19Ma8iLZ86aXKot9prBPoU7
dEnpwX5LrUIQCcNmWaudXoctiqN3y6oQzIobzGJXXQzlT5VPudIPYKUZMixuLYH2
XXJ5MtrHlvB+aKIURcHey03q8Vah5HQ6P467249fNBsLoYbycx7aPYhR7NyFDEEX
IkucBT7FOZUqcklxIXQHRQOTvj8dru91TvsZ6aNVPuS6SvYTf95cSFu7yBBP+DNd
g3UWpuRXwvJYQosXbpHhGNevq2M3XLZmzEvOBul8j7Fq/4rw6HxFYtA9um/8V4h9
UxJjjAu59gbkmnrG2cGJCLwnC75BId84cZ4Nc8vfB/mhShE3n8YjRXfb1clS9DB7
CTNLp7AtFujTdWc4iQ3vMZ9cCILQtKnSXvnETHq65WDnqfaPT7NfwIrFxGHDUa5N
m94l+Neg3rNrsxcRFxXQ9HzmG2ZTiGK956Nvpxn6/cDD6ZVd6RQBOYjZ4QxVd+lS
jdkA0gImS88HlupyosILMPjQm+BCqmDjpZx/yWyRRCBe7XP1MgX9S2ySDqFgiy1j
J9KGzXFIV73DA8nVfNtM
=iiKF
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
- Smattering of miscellanous fixes
- A five patch series for i40iw that had a patch (5/5) that was larger
than I would like, but I took it because it's needed for large scale
users
- An 8 patch series for bnxt_re that landed right as I was leaving on
PTO and so had to wait until now...they are all appropriate fixes for
-rc IMO
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (22 commits)
bnxt_re: Don't issue cmd to delete GID for QP1 GID entry before the QP is destroyed
bnxt_re: Fix memory leak in FRMR path
bnxt_re: Remove RTNL lock dependency in bnxt_re_query_port
bnxt_re: Fix race between the netdev register and unregister events
bnxt_re: Free up devices in module_exit path
bnxt_re: Fix compare and swap atomic operands
bnxt_re: Stop issuing further cmds to FW once a cmd times out
bnxt_re: Fix update of qplib_qp.mtu when modified
i40iw: Add support for port reuse on active side connections
i40iw: Add missing VLAN priority
i40iw: Call i40iw_cm_disconn on modify QP to disconnect
i40iw: Prevent multiple netdev event notifier registrations
i40iw: Fail open if there are no available MSI-X vectors
RDMA/vmw_pvrdma: Fix reporting correct opcodes for completion
IB/bnxt_re: Fix frame stack compilation warning
IB/mlx5: fix debugfs cleanup
IB/ocrdma: fix incorrect fall-through on switch statement
IB/ipoib: Suppress the retry related completion errors
iw_cxgb4: remove the stid on listen create failure
iw_cxgb4: drop listen destroy replies if no ep found
...
and a small cleanup to our xdr encoding.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZst0LAAoJECebzXlCjuG+o30QALbchoIvs7BiDrUxYMfJ2nCa
7UW69STwX79B3NZTg7RrScFTLPEFW9DMpb/Og7AYTH3/wdgGYQNM1UxGUYe7IxSN
xemH7BSmQzJ7ryaxouO/jskUw5nvNRXhY0PMxJApjrCs837vTjduIVw9zUa8EDeH
9toxpTM4k3z/1myj60PuHnuQF9EyLDL6W581loDF04nQB3pVRbAZOh1lUeqMgLUd
7IF+CDECFcjL7oZSA3wDGpsVySLdZ+GYxloFIDO/d8kHEsZD3OaN2MdfRki8EOSQ
qibTYO0284VeyNLUOIHjspqbDh0Lr2F7VolMmlM5GF1IuApih0/QYidqsH6/As3U
JIAK53vgqZfK2qI0ud7dGGFEnT/vlE7pQiXiza36xI8YZu4Xz6uGbM41p38RU8jO
3fr38xdPqqO7YE6F7ZUHYyrmW81Vi0lFdQkw1DBEipHV8UquuCmdtAeR9xgDsdQ/
LsMVevM1mF+19krOIGbBnENq1GX78ecfHEYGxlTjf/MeO4JYl+8/x7Ow2e/ZbwSa
7hpUeCiVuVmy1hqOEtraBl5caAG0hCE8PeGRrdr5dA6ZS9YTm0ANgtxndKabwDh2
CjXF3gRnQNUGdFGCi/fmvfb89tVNj1tL52pbQqfgOb/VFrrL328vyNNg/1p2VY4Q
qzmKtxZhi/XBewQjaSQl
=E3UQ
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"More RDMA work and some op-structure constification from Chuck Lever,
and a small cleanup to our xdr encoding"
* tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux:
svcrdma: Estimate Send Queue depth properly
rdma core: Add rdma_rw_mr_payload()
svcrdma: Limit RQ depth
svcrdma: Populate tail iovec when receiving
nfsd: Incoming xdr_bufs may have content in tail buffer
svcrdma: Clean up svc_rdma_build_read_chunk()
sunrpc: Const-ify struct sv_serv_ops
nfsd: Const-ify NFSv4 encoding and decoding ops arrays
sunrpc: Const-ify instances of struct svc_xprt_ops
nfsd4: individual encoders no longer see error cases
nfsd4: skip encoder in trivial error cases
nfsd4: define ->op_release for compound ops
nfsd4: opdesc will be useful outside nfs4proc.c
nfsd4: move some nfsd4 op definitions to xdr4.h
Allow interval trees to quickly check for overlaps to avoid unnecesary
tree lookups in interval_tree_iter_first().
As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available. While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().
[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fix in the parent made me look at that function, and react to how
illogical and illegible the array initializer was.
Use named array indexes to make it clearer what is going on, and make
the initializer not depend silently on the exact index numbers.
[ The initializer now also shows an odd inconsistency in the naming:
note the IWCM vs IWPM.. - Linus ]
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The netlink message sent with type == 0, which doesn't have any client
behind it, caused to the overflow in max_num_ops array.
Fix it by declaring zero number of ops for the first client.
Fixes: c9901724a2 ("RDMA/netlink: Remove netlink clients infrastructure")
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The amount of payload per MR depends on device capabilities and
the memory registration mode in use. The new rdma_rw API hides both,
making it difficult for ULPs to determine how large their transport
send queues need to be.
Expose the MR payload information via a new API.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
- Lots of hfi1 driver updates (mixed with a few qib and core updates as
well)
- rxe updates
- various mlx updates
- Set default roce type to RoCEv2
- Several larger fixes for bnxt_re that were too big for -rc
- Several larger fixes for qedr that, likewise, were too big for -rc
- Misc core changes
- Make the hns_roce driver compilable on arches other than aarch64 so we
can more easily debug build issues related to it
- Add rdma-netlink infrastructure updates
- Add automatic IRQ affinity infrastructure
- Add 32bit lid support
- Lots of misc fixes across the subsystem from random people
- Autoloading of RDMA netlink modules
- PCI pool cleanups from Romain Perier
- mlx5 driver feature additions and fixes
- Hardware tag matchine feature
- Fix sleeping in atomic when resolving roce ah
- Add experimental ioctl interface as posted to linux-api@
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZqBDtAAoJELgmozMOVy/dNlcQAJhYNRGaNUBx0L6+8t2xwUrt
7ndP6qlMar30DJY9FjTQCzRBw0CRMWkXdJD8rYlyaHy07pjWDKG8LZtxEXu1FLdZ
oNRvQX6ZJh8Bz7db2SQFBCTF2uWGZZFqWQCrSbQwjj9xxjMDs59u/knmwHVY9dKk
egjPG4IQBDmcTeNY7h1otG2hXpx7QPIOilQW2EFN5SWAuBAazdF2JKxjjxqhnUfp
gD2pSdgsm3VSMoo0zpMa6qOP+9GcOu8J97fYFhasRYWCavPdWHyq+XNu9S/eicRd
xbv+seCYM+9jPb2dsNdjEKll7w3yyWdu7h6tSCMPYv54eN9sDDiO1w2L2ZnESMZa
JRnSfB+HXru1r4RyHOTPO8peaNhYlR1V4u8bTS5G2dffbHis9BajkWoAR/oSiUcB
AIjIIDcdJFVGfpF9KIt/pEl+adHNgESibSijzOUYkyw6RNbPqDmdd7YakPHcQhKN
clE3zQfIsPRLWsToP/nkBE0tUd3tQocRuLy7ote7hXQK+0p7TBz0a6Kkj87MvX33
8dVbUI+q6WRlEY90l71y0ZdXy/AvkxkFxAc4Y7FQZyJxhEArTaKgfa5fmpRwVxBm
yi9baoYCspHNRNv6AO4IL86ZCJqmWBuch8CBY1n2X3h8IGfKYEZUAZ+T/mnTTeUq
A4joXduz94ZD4w23leD1
=2ntC
-----END PGP SIGNATURE-----
Merge tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"This is a big pull request.
Of note is that I'm sending you the new ioctl API for the rdma
subsystem. We put it up on linux-api@, but didn't get much response.
The API is complex, but it solves two different problems in one go:
1) The bi-directional nature of the RDMA file write calls, which
created the security hole we had to handle (and for which the fix
is now causing problems for systems in production, we were a bit
over zealous in the fix and the ability to open a device, then
fork, then create new queue pairs on the device and use them is
broken).
2) The bloat caused by different vendors implementing extensions to
the base verbs API. Each vendor's hardware is slightly different,
and the hardware might be suitable for one extension but not
another.
By the time we add generic extensions for all the different ways
that the different hardware can offload things, the API becomes
bloated. Things like our completion structs have started to exceed
a cache line in size because of all the elements needed to support
this. That in turn shows up heavily in the performance graphs with
a noticable drop in performance on 100Gigabit links as our
completion structs go from occupying one cache line to 1+.
This API makes things like the completion structs modular in a
very similar way to netlink so that your structs can only include
the items needed for the offloads/features you are actually using
on a given queue pair. In that way we support everything, but only
use what we need, and our structs stay smaller.
The ioctl API is better explained by the posting on linux-api@ than I
can explain it here, so I'll just leave it at that.
The rest of the pull request is typical stuff.
Updates for 4.14 kernel merge window
- Lots of hfi1 driver updates (mixed with a few qib and core updates
as well)
- rxe updates
- various mlx updates
- Set default roce type to RoCEv2
- Several larger fixes for bnxt_re that were too big for -rc
- Several larger fixes for qedr that, likewise, were too big for -rc
- Misc core changes
- Make the hns_roce driver compilable on arches other than aarch64 so
we can more easily debug build issues related to it
- Add rdma-netlink infrastructure updates
- Add automatic IRQ affinity infrastructure
- Add 32bit lid support
- Lots of misc fixes across the subsystem from random people
- Autoloading of RDMA netlink modules
- PCI pool cleanups from Romain Perier
- mlx5 driver feature additions and fixes
- Hardware tag matchine feature
- Fix sleeping in atomic when resolving roce ah
- Add experimental ioctl interface as posted to linux-api@"
* tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (328 commits)
IB/core: Expose ioctl interface through experimental Kconfig
IB/core: Assign root to all drivers
IB/core: Add completion queue (cq) object actions
IB/core: Add legacy driver's user-data
IB/core: Export ioctl enum types to user-space
IB/core: Explicitly destroy an object while keeping uobject
IB/core: Add macros for declaring methods and attributes
IB/core: Add uverbs merge trees functionality
IB/core: Add DEVICE object and root tree structure
IB/core: Declare an object instead of declaring only type attributes
IB/core: Add new ioctl interface
RDMA/vmw_pvrdma: Fix a signedness
RDMA/vmw_pvrdma: Report network header type in WC
IB/core: Add might_sleep() annotation to ib_init_ah_from_wc()
IB/cm: Fix sleeping in atomic when RoCE is used
IB/core: Add support to finalize objects in one transaction
IB/core: Add a generic way to execute an operation on a uobject
Documentation: Hardware tag matching
IB/mlx5: Support IB_SRQT_TM
net/mlx5: Add XRQ support
...
Calls to mmu_notifier_invalidate_page() were replaced by calls to
mmu_notifier_invalidate_range() and are now bracketed by calls to
mmu_notifier_invalidate_range_start()/end()
Remove now useless invalidate_page callback.
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Leon Romanovsky <leonro@mellanox.com>
Cc: linux-rdma@vger.kernel.org
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add CONFIG_INFINIBAND_EXP_USER_ACCESS that enables the ioctl
interface. This interface is experimental and is subject to change.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to use the parsing tree, we need to assign the root
to all drivers. Currently, we just assign the default parsing
tree via ib_uverbs_add_one. The driver could override this by
assigning a parsing tree prior to registering the device.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Adding CQ ioctl actions:
1. create_cq
2. destroy_cq
This requires adding the following:
1. A specification describing the method
a. Handler
b. Attributes specification
Each attribute is one of the following:
a. PTR_IN - input data
Note: This could be encoded inlined for
data < 64bit
b. PTR_OUT - response data
c. IDR - idr based object
d. FD - fd based object
Blobs attributes (clauses a and b) contain their type,
while objects specifications (clauses c and d)
contains the expected object type (for example, the
given id should be UVERBS_TYPE_PD) and the required
access (READ, WRITE, NEW or DESTROY). If a NEW is
required, the new object's id will be assigned to this
attribute. All attributes could get UA_FLAGS
attribute. Currently we support stating that an
attribute is mandatory or that the specification size
corresponds to a lower bound (and that this attribute
could be extended).
We currently add both default attributes and the two
generic UHW_IN and UHW_OUT driver specific attributes.
2. Handler
A handler gets a uverbs_attr_bundle. The handler developer uses
uverbs_attr_get to fetch an attribute of a given id.
Each of these attribute groups correspond to the specification
group defined in the action (clauses 1.b and 1.c respectively).
The indices of these arrays corresponds to the attribute ids
declared in the specifications (clause 2).
The handler is quite simple. It assumes the infrastructure fetched
all objects and locked, created or destroyed them as required by
the specification. Pointer (or blob) attributes were validated to
match their required sizes. After the handler finished, the
infrastructure commits or rollbacks the objects.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In this phase, we don't want to change all the drivers to use
flexible driver's specific attributes. Therefore, we add two default
attributes: UHW_IN and UHW_OUT. These attributes are optional in some
methods and they encode the driver specific command data. We add
a function that extract this data and creates the legacy udata over
it.
Driver's data should start from UVERBS_UDATA_DRIVER_DATA_FLAG. This
turns on the first bit of the namespace, indicating this attribute
belongs to the driver's namespace.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When some objects are destroyed, we need to extract their status at
destruction. After object's destruction, this status
(e.g. events_reported) relies in the uobject. In order to have the
latest and correct status, the underlying object should be destroyed,
but we should keep the uobject alive and read this information off the
uobject. We introduce a rdma_explicit_destroy function. This function
destroys the class type object (for example, the IDR class type which
destroys the underlying object as well) and then convert the uobject
to be of a null class type. This uobject will then be destroyed as any
other uobject once uverbs_finalize_object[s] is called.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Different drivers support different features and even subset of the
common uverbs implementation. Currently, this is handled as bitmask
in every driver that represents which kind of methods it supports, but
doesn't go down to attributes granularity. Moreover, drivers might
want to add their specific types, methods and attributes to let
their user-space counter-parts be exposed to some more efficient
abstractions. It means that existence of different features is
validated syntactically via the parsing infrastructure rather than
using a complex in-handler logic.
In order to do that, we allow defining features and abstractions
as parsing trees. These per-feature parsing tree could be merged
to an efficient (perfect-hash based) parsing tree, which is later
used by the parsing infrastructure.
To sum it up, this makes a parse tree unique for a device and
represents only the features this particular device supports.
This is done by having a root specification tree per feature.
Before a device registers itself as an IB device, it merges
all these trees into one parsing tree. This parsing tree
is used to parse all user-space commands.
A future user-space application could read this parse tree. This
tree represents which objects, methods and attributes are
supported by this device.
This is based on the idea of
Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This adds the DEVICE object. This object supports creating the context
that all objects are created from. Moreover, it supports executing
methods which are related to the device itself, such as QUERY_DEVICE.
This is a singleton object (per file instance).
All standard objects are put in the root structure. This root will later
on be used in drivers as the source for their whole parsing tree.
Later on, when new features are added, these drivers could mix this root
with other customized objects.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Switch all uverbs_type_attrs_xxxx with DECLARE_UVERBS_OBJECT
macros. This will be later used in order to embed the object
specific methods in the objects as well.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In this ioctl interface, processing the command starts from
properties of the command and fetching the appropriate user objects
before calling the handler.
Parsing and validation is done according to a specifier declared by
the driver's code. In the driver, all supported objects are declared.
These objects are separated to different object namepsaces. Dividing
objects to namespaces is done at initialization by using the higher
bits of the object ids. This initialization can mix objects declared
in different places to one parsing tree using in this ioctl interface.
For each object we list all supported methods. Similarly to objects,
methods are separated to method namespaces too. Namespacing is done
similarly to the objects case. This could be used in order to add
methods to an existing object.
Each method has a specific handler, which could be either a default
handler or a driver specific handler.
Along with the handler, a bunch of attributes are specified as well.
Similarly to objects and method, attributes are namespaced and hashed
by their ids at initialization too. All supported attributes are
subject to automatic fetching and validation. These attributes include
the command, response and the method's related objects' ids.
When these entities (objects, methods and attributes) are used, the
high bits of the entities ids are used in order to calculate the hash
bucket index. Then, these high bits are masked out in order to have a
zero based index. Since we use these high bits for both bucketing and
namespacing, we get a compact representation and O(1) array access.
This is mandatory for efficient dispatching.
Each attribute has a type (PTR_IN, PTR_OUT, IDR and FD) and a length.
Attributes could be validated through some attributes, like:
(*) Minimum size / Exact size
(*) Fops for FD
(*) Object type for IDR
If an IDR/fd attribute is specified, the kernel also states the object
type and the required access (NEW, WRITE, READ or DESTROY).
All uobject/fd management is done automatically by the infrastructure,
meaning - the infrastructure will fail concurrent commands that at
least one of them requires concurrent access (WRITE/DESTROY),
synchronize actions with device removals (dissociate context events)
and take care of reference counting (increase/decrease) for concurrent
actions invocation. The reference counts on the actual kernel objects
shall be handled by the handlers.
objects
+--------+
| |
| | methods +--------+
| | ns method method_spec +-----+ |len |
+--------+ +------+[d]+-------+ +----------------+[d]+------------+ |attr1+-> |type |
| object +> |method+-> | spec +-> + attr_buckets +-> |default_chain+--> +-----+ |idr_type|
+--------+ +------+ |handler| | | +------------+ |attr2| |access |
| | | | +-------+ +----------------+ |driver chain| +-----+ +--------+
| | | | +------------+
| | +------+
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
+--------+
[d] = Hash ids to groups using the high order bits
The right types table is also chosen by using the high bits from
the ids. Currently we have either default or driver specific groups.
Once validation and object fetching (or creation) completed, we call
the handler:
int (*handler)(struct ib_device *ib_dev, struct ib_uverbs_file *ufile,
struct uverbs_attr_bundle *ctx);
ctx bundles attributes of different namespaces. Each element there
is an array of attributes which corresponds to one namespaces of
attributes. For example, in the usually used case:
ctx core
+----------------------------+ +------------+
| core: +---> | valid |
+----------------------------+ | cmd_attr |
| driver: | +------------+
|----------------------------+--+ | valid |
| | cmd_attr |
| +------------+
| | valid |
| | obj_attr |
| +------------+
|
| drivers
| +------------+
+> | valid |
| cmd_attr |
+------------+
| valid |
| cmd_attr |
+------------+
| valid |
| obj_attr |
+------------+
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For RoCE, ib_init_ah_from_wc() can follow the path
ib_init_ah_from_wc() ->
rdma_addr_find_l2_eth_by_grh() ->
rdma_resolve_ip()
and rdma_resolve_ip() will sleep in kzalloc() and wait_for_completion().
However, developers will not see any warnings if they use ib_init_ah_from_wc()
in an atomic context and test only on IB, because the function doesn't
sleep in that case.
Add a might_sleep() so that lockdep will catch bugs no matter what hardware is
used to test.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A couple of places in the CM do
spin_lock_irq(&cm_id_priv->lock);
...
if (cm_alloc_response_msg(work->port, work->mad_recv_wc, &msg))
However when the underlying transport is RoCE, this leads to a sleeping function
being called with the lock held - the callchain is
cm_alloc_response_msg() ->
ib_create_ah_from_wc() ->
ib_init_ah_from_wc() ->
rdma_addr_find_l2_eth_by_grh() ->
rdma_resolve_ip()
and rdma_resolve_ip() starts out by doing
req = kzalloc(sizeof *req, GFP_KERNEL);
not to mention rdma_addr_find_l2_eth_by_grh() doing
wait_for_completion(&ctx.comp);
to wait for the task that rdma_resolve_ip() queues up.
Fix this by moving the AH creation out of the lock.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The new ioctl based infrastructure either commits or rollbacks
all objects of the method as one transaction. In order to do
that, we introduce a notion of dealing with a collection of
objects that are related to a specific method.
This also requires adding a notion of a method and attribute.
A method contains a hash of attributes, where each bucket
contains several attributes. The attributes are hashed according
to their namespace which resides in the four upper bits of the id.
For example, an object could be a CQ, which has an action of CREATE_CQ.
This action has multiple attributes. For example, the CQ's new handle
and the comp_channel. Each layer in this hierarchy - objects, methods
and attributes is split into namespaces. The basic example for that is
one namespace representing the default entities and another one
representing the driver specific entities.
When declaring these methods and attributes, we actually declare
their specifications. When a method is executed, we actually
allocates some space to hold auxiliary information. This auxiliary
information contains meta-data about the required objects, such
as pointers to their type information, pointers to the uobjects
themselves (if exist), etc.
The specification, along with the auxiliary information we allocated
and filled is given to the finalize_objects function.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ioctl infrastructure treats all user-objects in the same manner.
It gets objects ids from the user-space and by using the object type
and type attributes mentioned in the object specification, it executes
this required method. Passing an object id from the user-space as
an attribute is carried out in three stages. The first is carried out
before the actual handler and the last is carried out afterwards.
The different supported operations are read, write, destroy and create.
In the first stage, the former three actions just fetches the object
from the repository (by using its id) and locks it. The last action
allocates a new uobject. Afterwards, the second stage is carried out
when the handler itself carries out the required modification of the
object. The last stage is carried out after the handler finishes and
commits the result. The former two operations just unlock the object.
Destroy calls the "free object" operation, taking into account the
object's type and releases the uobject as well. Creation just adds the
new uobject to the repository, making the object visible to the
application.
In order to abstract these details from the ioctl infrastructure
layer, we add uverbs_get_uobject_from_context and
uverbs_finalize_object functions which corresponds to the first
and last stages respectively.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add new SRQ type capable of new tag matching feature.
When SRQ receives a message it will search through the matching list
for the corresponding posted receive buffer. The process of searching
the matching list is called tag matching.
In case the tag matching results in a match, the received message will
be placed in the address specified by the receive buffer. In case no
match was found the message will be placed in a generic buffer until the
corresponding receive buffer will be posted. These messages are called
unexpected and their set is called an unexpected list.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Before this change CQ attached to SRQ was part of XRC specific extension.
Moving CQ handle out makes it available to other types extending SRQ
functionality.
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB CM calls ib_modify_port() irrespective of link layer. If the
failure is returned, the mad agent gets unregistered for those
devices. Recently, modify_port() hook was removed from some of the
low level drivers as it was always returning success. This breaks
rdma connection establishment over those devices.
For ethernet devices, Qkey violation and port capabilities are not
applicable. So returning success for RoCE when modify_port hook is
is not implemented.
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The return values from rdma_node_get_transport() are strict
and IB_LINK_LAYER_UNSPECIFIED is unreachable in this flow.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove call to BUG() in case wrong node_type was provided.
This flow is unreachable, because node_types are supplied
from specific enum.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The functions ib_register_event_handler() and
ib_unregister_event_handler() always returned success and they can't fail.
Let's convert those functions to be void, remove redundant checks and
cleanup tons of goto statements.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch introduces two helper functions to copy ah attributes
from uverbs to internal ib_ah_attr structure and the other way
during modify qp and query qp respectively.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When rdma_cm is initializing a cma_device it checks if this device
supports the preferred default GID type. This check was done in a wrong way
and therefore sometimes rdma_cm is coming up with default GID type that is
not supported by the device.
Fix that by checking for supported GID type properly.
Fixes: 3c7f67d188 ("IB/cma: Fix default RoCE type setting")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Commit 44c58487d5 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
introduced the concept of type in ah_attr:
* During ib_register_device, each port is checked for its type which
is stored in ib_device's port_immutable array.
* During uverbs' modify_qp, the type is inferred using the port number
in ib_uverbs_qp_dest struct (address vector) by accessing the
relevant port_immutable array and the type is passed on to
providers.
IB spec (version 1.3) enforces a valid port value only in Reset to
Init. During Init to RTR, the address vector must be valid but port
number is not mentioned as a field in the address vector, so its
value is not validated, which leads to accesses to a non-allocated
memory when inferring the port type.
Save the real port number in ib_qp during modify to Init (when the
comp_mask indicates that the port number is valid) and use this value
to infer the port type.
Avoid copying the address vector fields if the matching bit is not set
in the attr_mask. Address vector can't be modified before the port, so
no valid flow is affected.
Fixes: 44c58487d5 ('IB/core: Define 'ib' and 'roce' rdma_ah_attr types')
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If a message comes in and we do not have the client in the table, then
try to load the module supplying that client using MODULE_ALIAS to find
it.
This duplicates the scheme seen in other netlink muxes (eg nfnetlink).
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Provide a module alias so that if userspace opens a netlink
socket for RDMA the kernel support is loaded automatically.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When address handle attributes are initialized, the LIDs are
transformed to be in the 32 bit LID space.
When constructing the header, hfi1 driver will look at the LID
to determine the packet header to be created.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Most user verbs pass user data to the kernel with the inclusion of the
ib_uverbs_cmd_hdr structure. This is problematic because the vendor has
no ideas if the verb was called by a legacy verb or an extended verb.
Also, the incosistency between the verbs is confusing.
Fixes: 565197dd8f ("IB/core: Extend ib_uverbs_create_cq")
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Initializing cq_context with ev_queue in create_cq(), leads to NULL pointer
dereference in ib_uverbs_comp_handler(), if application doesnot use completion
channel. This patch fixes the cq_context initialization.
Fixes: 1e7710f3f6 ("IB/core: Change completion channel to use the reworked")
Cc: stable@vger.kernel.org # 4.12
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 699a2d5b1b)
This patch series primarily increases sizes of variables that hold
lid values from 16 to 32 bits. Additionally, it adds a check in
the IB mad stack to verify a properly formatted MAD when OPA
extended LIDs are used.
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Merging our (hopefully) final -rc pull branch into our for-next branch
because some of our pending patches won't apply cleanly without having
the -rc patches in our tree.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Conflicts:
drivers/infiniband/core/iwcm.c - The rdma_netlink patches in
HEAD and the iwarp cm workqueue fix (don't use WQ_MEM_RECLAIM,
we aren't safe for that context) touched the same code.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Initializing cq_context with ev_queue in create_cq(), leads to NULL pointer
dereference in ib_uverbs_comp_handler(), if application doesnot use completion
channel. This patch fixes the cq_context initialization.
Fixes: 1e7710f3f6 ("IB/core: Change completion channel to use the reworked")
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Its very likely that iwcm work execution will yield memory
allocations (for example cm connection request).
Reported-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
create_workqueue always creates the workqueue with WQ_MEM_RECLAIM
and silences a flush dependency warn for WQ_LEGACY. Instead, we
want to keep the warn in case the allocator tries to flush the
cm workqueue because its very likely that cm work execution will
yield memory allocations (for example cm connection requests).
Reported-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_clients can indeed fill .add to NULL, but then they will not see
any device removal notifications. The reason is that that
ib_register_client and ib_register_device checked existence of .add
before adding the creating a corresponding client_data and adding
it to the list. Simple condition reverse fixes the issue.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As part of ib_uverbs_remove_one which might be triggered upon
reset flow, we trigger IB_EVENT_DEVICE_FATAL event to userspace
application.
If device was removed after uverbs fd was opened but before
ib_uverbs_get_context was called, the event file will be accessed
before it was allocated, result in NULL pointer dereference:
[ 72.325873] BUG: unable to handle kernel NULL pointer dereference at (null)
...
[ 72.325984] IP: _raw_spin_lock_irqsave+0x22/0x40
[ 72.327123] Call Trace:
[ 72.327168] ib_uverbs_async_handler.isra.8+0x2e/0x160 [ib_uverbs]
[ 72.327216] ? synchronize_srcu_expedited+0x27/0x30
[ 72.327269] ib_uverbs_remove_one+0x120/0x2c0 [ib_uverbs]
[ 72.327330] ib_unregister_device+0xd0/0x180 [ib_core]
[ 72.327373] mlx5_ib_remove+0x74/0x140 [mlx5_ib]
[ 72.327422] mlx5_remove_device+0xfb/0x110 [mlx5_core]
[ 72.327466] mlx5_unregister_interface+0x3c/0xa0 [mlx5_core]
[ 72.327509] mlx5_ib_cleanup+0x10/0x962 [mlx5_ib]
[ 72.327546] SyS_delete_module+0x155/0x230
[ 72.328472] ? exit_to_usermode_loop+0x70/0xa6
[ 72.329370] do_syscall_64+0x54/0xc0
[ 72.330262] entry_SYSCALL64_slow_path+0x25/0x25
Fix it by checking that user context was allocated before
trigger the event.
Fixes: 036b106357 ('IB/uverbs: Enable device removal when there are active user space applications')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Conflicts:
include/rdma/ib_verbs.h - Modified a function signature adjacent
to a newly added function signature from a previous merge
Signed-off-by: Doug Ledford <dledford@redhat.com>
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Both add new code
include/rdma/ib_verbs.h - Both add new code
Signed-off-by: Doug Ledford <dledford@redhat.com>
According to the IB specification, the LID and SM_LID
are 16-bit wide, but to support OmniPath users, export
it as 32-bit value from the beginning.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Add Node GUID and system image GUID to the device properties
exported by RDMA netlink, to be used by RDMAtool.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
There is a need to forward FW version to user space
application through RDMA netlink. In order to make it safe, there
is need to declare nla_policy and limit the size of FW string.
The new define IB_FW_VERSION_NAME_MAX will limit the size of
FW version string. That define was chosen to be equal to
ETHTOOL_FWVERS_LEN, because many drivers anyway are limited
by that value indirectly.
The introduction of this define allows us to remove the string size
from get_fw_str function signature.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The port capability mask is exposed to user space via sysfs interface,
while device capabilities are available for verbs only.
This patch provides those capabilities through netlink interface.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Provide ability to get specific to device and port information.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
This patch implements the query interface to get all
ports data for the specific device.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
This patch adds the ability to return all available devices
together with their properties.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Add nldev init and exit flows to the RDMA/core.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Introduce new defines to rdma_netlink.h, so the RDMA configuration tool
will be able to communicate with RDMA subsystem by using the shared defines.
The addition of new client (NLDEV) revealed the fact that we exposed by
mistake the RDMA_NL_I40IW define which is not backed by any RDMA netlink
by now and it won't be exposed in the future too. So this patch reuses
the value and deletes the old defines.
The NLDEV operates with objects. The struct ib_device has two straightforward
objects: device itself and ports of that device.
This brings us to propose the following commands to work on those objects:
* RDMA_NLDEV_CMD_{GET,SET,NEW,DEL} - works on ib_device itself
* RDMA_NLDEV_CMD_PORT_{GET,SET,NEW,DEL} - works on ports of specific ib_device
Those commands receive/return the device index (RDMA_NLDEV_ATTR_DEV_INDEX)
and port index (RDMA_NLDEV_ATTR_PORT_INDEX). For device object accesses,
the RDMA_NLDEV_ATTR_PORT_INDEX will return the maximum number of ports
for specific ib_device and for port access the actual port index.
The port index starts from 1 to follow RDMA/core internal semantics and
the sysfs exposed knobs.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
RDMA_NL_LS protocol is actually does not dump anything,
but sets data and it should be handled by doit callback.
This patch actually converts RDMA_NL_LS to doit callback, while
preserving IWCM and RDMA_CM flows through netlink_dump_start().
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Introduce intermediate variable to store access to fields
of cb_table.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
The .doit callback is used by netlink core to differentiate
between get and set operations. Common convention is to use
that call for command operations like (SET, ADD, e.t.c.) and/or
access without NLF_M_DUMP flag.
This commit adds proper declaration and implementation
to RDMA netlink.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
This patch adds static device index in similar fashion to
already available in netdev world (struct net->ifindex).
In downstream patches, the RDMA nelink will use this idx-to-ib_device
conversion, so as part of this commit, we are exposing the translation
function to be visible for IB/core users.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
The coming nldev needs to iterate over all IB devices in the system
and in order to not expose the ib_devices list outside the devices.c,
it is necessary to provide function iterator.
Current version is written explicitly for nldev callback to avoid
over-engineering at this stage, but it can be easily extended for
other types.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
The RDMA netlink client infrastructure was removed and made obsolete.
The old infrastructure defined struct ibnl_client_cbs. Now that all
uses of this have been updated to the new infrastructure, rename the
struct to be compliant with the current stack naming standards:
struct rdma_nl_cbs.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Make ibnl_chk_listeners function to be one line by removing
unneeded comparison.
Rename that function to be complaint to other functions in RDMA netlink.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
The pointer to netlink header was not used in the ibnl_multicast
function, so let's remove it and simplify the function
signature.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Netlink message header is not needed for unicast reply, hence remove it.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reuse standard macros to cancel the netlink message
in case of error.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Add ability to provide flags to control RDMA netlink callbacks
and convert addr.c and sa_query.c to be first users of such
infrastructure. It allows to move their CAP_NET_ADMIN checks
into netlink core.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
The iwcm exports functions which are not used outside of ib_core.
This patch simply removes these EXPORT_SYMBOLS.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chien Tin Tung <chien.tin.tung@intel.com>
RDMA netlink implementation guarantees that supplied
client number is in allowed range.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Chien Tin Tung <chien.tin.tung@intel.com>
The standard netlink_rcv_skb function skips messages without
NLM_F_REQUEST flag in it, while SA netlink client issues them.
In commit bc10ed7d3d ("IB/core: Add rdma netlink helper functions")
the local function was introduced to allow such messages.
This led to double pass for every incoming message.
In this patch, we unify that local implementation and netlink_rcv_skb
functions, so there will be no need for double pass anymore.
As a outcome, this combined function gained more strict check
for NLM_F_REQUEST flag and it is now allowed for SA pathquery
client only.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>