Have fast_reg_descriptor hold struct iser_reg_resources
(mr, frpl, valid flag). This will be useful when the
actual buffer registration routines will be passed with
the needed registration resources (i.e. iser_reg_resources)
without being aware of their nature (i.e. data or protection).
In order to achieve this, we remove reg_indicators flags container
and place specific flags (mr_valid) within iser_reg_resources struct.
We also place the sig_mr_valid and sig_protcted flags in iser_pi_context.
This patch also modifies iser_fast_reg_mr to receive the
reg_resources instead of the fast_reg_descriptor and a data/protection
indicator.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We can do it in iser_aligned_data_len instead and
it will save us an argument that is passed to
fall_to_counce_buf just for the print.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We always call iser_initialize_task_headers() and set
the header tx_sg.lkey to the device mr lkey, so no
point in checking it in iser_create_send_desc().
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If iser_initialize_task_headers() routine failed before
dma mapping, we should not attempt to unmap in cleanup_task().
Fixes: 7414dde0a6 (IB/iser: Fix race between iser connection ...)
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We don't update those anywhere in the code and they
seem pretty useless (no one seem to care about those).
qp_tx_queue_full: We never should get this
fmr_map_not_avail: We can never get to this
eh_abort_cnt: We don't monitor aborts
Go ahead and remove them.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since commit "IB/iser: Fix race between iser connection teardown..."
iser_initialize_task_headers() might fail, so we need to check that.
Fixes: 7414dde0a6 (IB/iser: Fix race between iser connection ...)
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While we're at it, use permission defines instead
of octal values and rearrange a little bit.
Signed-off-by: Jenny Derzhavetz <jennyf@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch routes a SA pathrecord query to netlink first and processes the
response appropriately. If a failure is returned, the request will be sent
through IB. The decision whether to route the request to netlink first is
determined by the presence of a listener for the local service netlink
multicast group. If the user-space local service netlink multicast group
listener is not present, the request will be sent through IB, just like
what is currently being done.
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Replace kmalloc with kzalloc so that all uninitialized fields in SA query
will be zero-ed out to avoid unintentional consequence. This prepares the
SA query structure to accept new fields in the future.
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a function to check if listeners for a netlink multicast
group are present. It also adds a function to receive netlink response
messages.
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
scsi_host_alloc() not only allocates memory for a SCSI host but also
creates the scsi_eh_<n> kernel thread and the scsi_tmf_<n> workqueue.
Stop these threads if login fails by calling scsi_host_put().
Reported-by: Konstantin Krotov <kkv@clodo.ru>
Fixes: fb49c8bbaa ("Remove an extraneous scsi_host_put() from an error path")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: <stable@vger.kernel.org> #v3.19
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since version 1.0 e.g. scsi-mq has been added. Since this is
a significant change, bump the driver version and release date.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Avoid that the following kernel warning is reported if the SRP
target system accepts fewer channels per connection than what
was requested by the initiator system:
WARNING: at drivers/infiniband/ulp/srp/ib_srp.c:617 srp_destroy_qp+0xb1/0x120 [ib_srp]()
Call Trace:
[<ffffffff8105d67f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8105d6da>] warn_slowpath_null+0x1a/0x20
[<ffffffffa05419e1>] srp_destroy_qp+0xb1/0x120 [ib_srp]
[<ffffffffa05445fb>] srp_create_ch_ib+0x19b/0x420 [ib_srp]
[<ffffffffa0545257>] srp_create_target+0x7d7/0xa94 [ib_srp]
[<ffffffff8138dac0>] dev_attr_store+0x20/0x30
[<ffffffff812079ef>] sysfs_write_file+0xef/0x170
[<ffffffff81191fc4>] vfs_write+0xb4/0x130
[<ffffffff8119276f>] sys_write+0x5f/0xa0
[<ffffffff815a0a59>] system_call_fastpath+0x16/0x1b
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When handling a device internal error, the driver is responsible to
drain the completion queue with flush errors.
In case a completion queue was assigned to multiple send queues, the
driver iterates over the send queues and generates flush errors of
inflight wqes. The driver must correctly pass the wc array with an
offset as a result of the previous send queue iteration. Not doing so
will overwrite previously set completions and return a wrong number
of polled completions which includes ones which were not correctly set.
Fixes: 35f05dabf9 (IB/mlx4: Reset flow support for IB kernel ULPs)
Signed-off-by: Ariel Nahum <arieln@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The mlx4 IB driver implementation for ib_query_ah used a wrong offset
(28 instead of 29) when link type is Ethernet. Fixed to use the correct one.
Fixes: fa417f7b52 ('IB/mlx4: Add support for IBoE')
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The pkey mapping for RoCE must remain the default mapping:
VFs:
virtual index 0 = mapped to real index 0 (0xFFFF)
All others indices: mapped to a real pkey index containing an
invalid pkey.
PF:
virtual index i = real index i.
Don't allow users to change these mappings using files found in
sysfs.
Fixes: c1e7e46612 ('IB/mlx4: Add iov directory in sysfs under the ib device')
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The mcg "too many pending requests" warning message fills the log
when OpenSM is downed. Demote the message from warning level to
debug level.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
send_mad_to_wire takes the same spinlock that is taken in
the interrupt context. Therefore, it needs irqsave/restore.
Fixes: b9c5d6a643 ('IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV')
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
1.Change query_gid hook to return value from IB/Core GID
management APIs.
2.Get rid of all the netdev notifier chain subscription code as well
as maintenance of SGID Table in memory.
3.Implement get_netdev hook in driver.
Signed-off-by: Somnath Kotur <somnath.kotur@avagotech.com>
Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Manage RoCE gid table with logic in IB/core, which is common to all
vendors, and remove the mechanism from the mlx4 IB driver.
Since management of the GID cache may lead to index mismatch with the
hardware GID table, a translation between indexes is required when
modifying a QP or creating an address handle.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
get_netdev: get the net_device on the physical port of the IB transport port. In
port aggregation mode it is required to return the netdev of the active port.
modify_gid: note for a change in the RoCE gid cache. Handle this by writing to
the harsware GID table. It is possible that indexes in cahce and hardware tables
won't match so a translation is required when modifying a QP or creating an
address handle.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Handling bonding and other devices require us to all all GIDs of the
net-devices which are upper-devices of the RoCE port related
net-device.
Active-backup configurations imposes even more challenges as the
default GID should only be set on the active devices (this is
necessary as otherwise the same MAC could be used for several
slaves and thus several slaves will have identical GIDs).
Managing these configurations are done by listening to:
(a) NETDEV_CHANGEUPPER event
(1) if a related net-device is linked, delete all inactive
slaves default GIDs and add the upper device GIDs.
(2) if a related net-device is unlinked, delete all upper GIDs
and add the default GIDs.
(b) NETDEV_BONDING_FAILOVER:
(1) delete the bond GIDs from inactive slaves
(2) delete the inactive slave's default GIDs
(3) Add the bond GIDs to the active slave.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Smatch says that, based on the indenting, we should probably add curly
braces here.
Fixes: 03db3a2d81 ('IB/core: Add RoCE GID table management')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
RoCE GIDs are based on IP addresses configured on Ethernet net-devices
which relate to the RDMA (RoCE) device port.
Currently, each of the low-level drivers that support RoCE (ocrdma,
mlx4) manages its own RoCE port GID table. As there's nothing which is
essentially vendor specific, we generalize that, and enhance the RDMA
core GID cache to do this job.
In order to populate the GID table, we listen for events:
(a) netdev up/down/change_addr events - if a netdev is built onto
our RoCE device, we need to add/delete its IPs. This involves
adding all GIDs related to this ndev, add default GIDs, etc.
(b) inet events - add new GIDs (according to the IP addresses)
to the table.
For programming the port RoCE GID table, providers must implement
the add_gid and del_gid callbacks.
RoCE GID management requires us to state the associated net_device
alongside the GID. This information is necessary in order to manage
the GID table. For example, when a net_device is removed, its
associated GIDs need to be removed as well.
RoCE mandates generating a default GID for each port, based on the
related net-device's IPv6 link local. In contrast to the GID based on
the regular IPv6 link-local (as we generate GID per IP address),
the default GID is also available when the net device is down (in
order to support loopback).
Locking is done as follows:
The patch modify the GID table code both for new RoCE drivers
implementing the add_gid/del_gid callbacks and for current RoCE and
IB drivers that do not. The flows for updating the table are
different, so the locking requirements are too.
While updating RoCE GID table, protection against multiple writers is
achieved via mutex_lock(&table->lock). Since writing to a table
requires us to find an entry (possible a free entry) in the table and
then modify it, this mutex protects both the find_gid and write_gid
ensuring the atomicity of the action.
Each entry in the GID cache is protected by rwlock. In RoCE, writing
(usually results from netdev notifier) involves invoking the vendor's
add_gid and del_gid callbacks, which could sleep.
Therefore, an invalid flag is added for each entry. Updates for RoCE are
done via a workqueue, thus sleeping is permitted.
In IB, updates are done in write_lock_irq(&device->cache.lock), thus
write_gid isn't allowed to sleep and add_gid/del_gid are not called.
When passing net-device into/out-of the GID cache, the device
is always passed held (dev_hold).
The code uses a single work item for updating all RDMA devices,
following a netdev or inet notifier.
The patch moves the cache from being a client (which was incorrect,
as the cache is part of the IB infrastructure) to being explicitly
initialized/freed when a device is registered/removed.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This gets rid of the weird in-between state where struct ib_device
was allocated but the kobject didn't work.
Consequently ib_device_release is now guaranteed to be called in
all situations and we needn't duplicate its kfrees on error paths.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fully replaced by a more generic and suitable
ib_alloc_mr.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use ib_alloc_mr with specific parameters.
Change the existing callers.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This was added in a thought of uniting all mr allocation
and deallocation routines but the fact is we have a single
deallocation routine already, ib_dereg_mr.
And, move mlx5_ib_destroy_mr specific logic into mlx5_ib_dereg_mr
(includes only signature stuff for now).
And, fixup the only callers (iser/isert) accordingly.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When no matching listening ID is found for a given request, the net_dev
that was used to find the request isn't released.
Fixes: 0b3ca768fc ("IB/cma: Use found net_dev for passive connections")
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Now that there are no ib_cm clients using the compare_data feature for
matching IB CM requests' private data, remove the compare_data parameter of
ib_cm_listen and remove the code implementing the feature.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use ib_cm_insert_listen to create listening IB CM IDs or share existing
ones if needed. When given a request on a specific CM ID, the code now
matches the request to the RDMA CM ID based on the request parameters, so
it no longer needs to rely on the ib_cm's private data matching
capabilities.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When receiving a new connection in cma_req_handler, we actually already
know the net_dev that is used for the connection's creation. Instead of
calling cma_translate_addr to resolve the new connection id's source
address, just use the net_dev that was found.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pass incoming request parameters through the relevant IPv4/IPv6 routing
tables and make sure the network stack is configured to handle such
requests.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of relying on a the ib_cm module to check an incoming CM request's
private data header, add these checks to the RDMA CM module. This allows a
following patch to to clean up the ib_cm interface and remove the code that
looks into the private headers. It will also allow supporting namespaces in
RDMA CM by making these checks namespace aware later on.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The rdma_cm module will later use the P_Key from the BTH to de-mux
requests.
See discussion at:
http://www.spinics.net/lists/netdev/msg336067.html
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Liran Liss <liranl@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add helper functions to access the IDRs by port-space and port number.
Pass around the port-space enum in cma.c instead of using pointers to
port-space IDRs.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When receiving a connection request, rdma_cm needs to associate the request
with a network device, in order to disambiguate requests. To do this, it
needs to know the request's destination IP. For this the module needs to
allow getting this information from the private data in the request packet,
instead of relying on the information already being in the listening RDMA
CM ID.
When creating a new incoming connection ID, the code in
cma_save_ip{4,6}_info can no longer rely on the listener's private data to
find the port number, so it reads it from the requested service ID.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enabling network namespaces for RDMA CM will allow processes on different
namespaces to listen on the same port. In order to leave namespace support
out of the CM layer, this requires that multiple RDMA CM IDs will be able
to share a single CM ID.
This patch adds infrastructure to retrieve an existing listening ib_cm_id,
based on its device and service ID, or create a new one if one does not
already exist. It also adds a reference count for such instances
(cm_id_private.listen_sharecount), and prevents cm_destroy_id from
destroying a CM if it is still shared. See the relevant discussion [1].
[1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
http://www.spinics.net/lists/netdev/msg328860.html
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose the service ID on an incoming CM or SIDR request to the event
handler. This will allow the RDMA CM module to de-multiplex connection
requests based on the information encoded in the service ID.
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Implement the get_net_device_by_port_pkey_ip callback that returns network
device to ib_core according to connection parameters. Check the ipoib
device and iterate over all child devices to look for a match.
For each IPoIB device we iterate through all upper devices when searching
for a matching IP, in order to support bonding.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this
network device to other users of the IB device, let ULPs implement
a callback that returns network device according to connection parameters.
The IB device and port, together with the P_Key and the GID should
be enough to uniquely identify the ULP net device. However, in current
kernels there can be multiple IPoIB interfaces created with the same GID.
Furthermore, such configuration may be desireable to support ipvlan-like
configurations for RDMA CM with IPoIB. To resolve the device in these
cases the code will also take the IP address as an additional input.
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
An ib_client callback that is called with the lists_rwsem locked only for
read is protected from changes to the IB client lists, but not from
ib_unregister_device() freeing its client data. This is because
ib_unregister_device() will remove the device from the device list with
lists_rwsem locked for write, but perform the rest of the cleanup,
including the call to remove() without that lock.
Mark client data that is undergoing de-registration with a new going_down
flag in the client data context. Lock the client data list with lists_rwsem
for write in addition to using the spinlock, so that functions calling the
callback would be able to lock only lists_rwsem for read and let callbacks
sleep.
Since ib_unregister_client() now marks the client data context, no need for
remove() to search the context again, so pass the client data directly to
remove() callbacks.
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding a read-write semaphore to protect the lists. Readers
now don't need the mutex, and are safe just by read-locking the semaphore.
The ib_register_device, ib_register_client, ib_unregister_device, and
ib_unregister_client functions are modified to lock the semaphore for write
during their respective list modification. Also, in order to make sure
client callbacks are called only between add() and remove() calls, the code
is changed to only add items to the lists after the add() calls and remove
from the lists before the remove() calls.
This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.
[1] http://www.spinics.net/lists/linux-rdma/msg24733.html
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Matan Barak <matanb@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Applications must not assume that max_sge and max_sge_rd are the same,
Hence expose max_sge_rd correctly as well.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Applications must not assume that max_sge and max_sge_rd are the same,
Hence expose max_sge_rd correctly as well.
Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds the value of the CNP opcode to the existing list of enumerated
opcodes in ib_pack.h
Add common OPA header definitions for driver
build:
- opa_port_info.h
- opa_smi.h
- hfi1_user.h
Additionally, ib_mad.h, has additional definitions
that are common to ib_drivers including:
- trap support
- cca support
The qib driver has the duplication removed in favor
those in ib_mad.h
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: John, Jubin <jubin.john@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The HW hasn't been sold since 2005, and the SW has definite bit rot.
Its time to remove it. So move it to staging for a few releases and
then remove it after that.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It is now time for the ipath driver to begin to be phased out of the kernel.
This patch moves the ipath driver from the Infiniband sub tree to the staging
area where it will remain until the code is removed from the kernel in a few
releases.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support for ipv6 address handling clip api provided by lld
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Resolving a link-local IPv6 address with an unspecified source address
was broken by commit 5462eddd7a, which prevented the IPv6 stack from
learning the scope id of the link-local IPv6 address, causing random
failures as the IP stack chose a random link to resolve the address on.
This commit 5462eddd7a made us bail out of cma_check_linklocal early if
the address passed in was not an IPv6 link-local address. On the address
resolution path, the address passed in is the source address; if the
source address is the unspecified address, which is not link-local, we
will bail out early.
This is mostly correct, but if the destination address is a link-local
address, then we will be following a link-local route, and we'll need to
tell the IPv6 stack what the scope id of the destination address is.
This used to be done by last line of cma_check_linklocal, which is
skipped when bailing out early:
dev_addr->bound_dev_if = sin6->sin6_scope_id;
(In cma_bind_addr, the sin6_scope_id of the source address is set to the
sin6_scope_id of the destination address, so this is correct)
This line is required in turn for the following line, L279 of
addr6_resolve, to actually inform the IPv6 stack of the scope id:
fl6.flowi6_oif = addr->bound_dev_if;
Since we can only know we are in this failure case when we have access
to both the source IPv6 address and destination IPv6 address, we have to
deal with this further up the stack. So detect this failure case in
cma_bind_addr, and set bound_dev_if to the destination address scope id
to correct it.
Signed-off-by: Spencer Baugh <sbaugh@catern.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Something like this:
CPU A CPU B
Acked-by: Sean Hefty <sean.hefty@intel.com>
======================== ================================
ucma_destroy_id()
wait_for_completion()
.. anything
ucma_put_ctx()
complete()
.. continues ...
ucma_leave_multicast()
mutex_lock(mut)
atomic_inc(ctx->ref)
mutex_unlock(mut)
ucma_free_ctx()
ucma_cleanup_multicast()
mutex_lock(mut)
kfree(mc)
rdma_leave_multicast(mc->ctx->cm_id,..
Fix it by latching the ref at 0. Once it goes to 0 mc and ctx cannot
leave the mutex(mut) protection.
The other atomic_inc in ucma_get_ctx is OK because mutex(mut) protects
it from racing with ucma_destroy_id.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This enables ORD/IRD negotiation and its about time to enable it by
default
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently the sg tablesize, which dictates fast register page list
depth to use, does not take into account the limits of the rdma device.
So adjust it once we discover the device fastreg max depth limit. Also
adjust the max_sectors based on the resulting sg tablesize.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The only place that assigns mr inside the loop already does a break.
So "if (mr)" will never be true here since the function initializes mr
to NULL at the top. We can just drop the extra if and break here.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The lkey table is allocated with with a get_user_pages() with an
order based on a number of index bits from a module parameter.
The underlying kernel code cannot allocate that many contiguous pages.
There is no reason the underlying memory needs to be physically
contiguous.
This patch:
- switches the allocation/deallocation to vmalloc/vfree
- caps the number of bits to 23 to insure at least 1 generation bit
o this matches the module parameter description
Cc: stable@vger.kernel.org
Reviewed-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Should be all the page sizes that are supported by the
device.
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The mlx5 driver exposes device capability IB_DEVICE_LOCAL_DMA_LKEY
but does not set the the device local_dma_lkey. This breaks
rpcrdma drivers.
Query and set this lkey when creating the device resources.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It appears to be what the rest of the kernel does, so let's do it too.
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Very similar to how it went with local_sockaddr.
It was embedded in iscsi_login_stats so some changes there, and we needed
to copy in a sockaddr_storage comparison function. Hopefully the kernel
will get a standard one soon, our implementation makes the 3rd.
isert_set_conn_info() became much smaller.
IPV6_ADDRESS_SPACE define goes away, had to modify a call to in6_pton(),
can just use -1 since we are sure string is null-terminated.
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This is a more natural format that lets us format it with the appropriate
printk specifier as needed.
This also lets us handle v4-mapped ipv6 addresses a little more nicely, by
storing the addr as an actual v4 sockaddr in conn->local_sockaddr.
Finally, we no longer need to maintain variables for port, since this is
contained in sockaddr. Remove iscsi_np.np_port and iscsi_conn.local_port.
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
The verbs are obsolete. The ib_rereg_phys_mr() verb is not used by
kernel ULPs, and the last ib_reg_phys_mr() call site in the kernel
tree has now been removed.
Two staging tree call sites remain in the Lustre client. The Lustre
team has been notified of the deprecation of reg_phys_mr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI target fixes from Nicholas Bellinger:
"This series is larger than what I'd normally be conformable with
sending for a -rc5 PULL request..
However, the bulk of the series is localized to qla2xxx target
specific fixes that address a number of real-world correctness issues,
that have been outstanding on the list for ~6 weeks now. They where
submitted + verified + acked by the HW LLD vendor, contributed by a
major production customer of the code, and are marked for v3.18.y
stable code.
That said, I don't see a good reason to wait another month to get
these fixes into mainline.
Beyond the qla2xx specific fixes, this series also includes:
- bugfix for a long standing use-after-free in iscsi-target during
TPG shutdown + demo-mode sessions.
- bugfix for a >= v4.0 regression OOPs in iscsi-target during a
iscsi_start_kthreads() failure.
- bugfix for a >= v4.0 regression hang in iscsi-target for iser
explicit session/connection logout.
- bugfix for a iser-target bug where a early CMA REJECTED status
during login triggers a NULL pointer dereference OOPs.
- bugfixes for a handful of v4.2-rc1 specific regressions related to
the larger set of recent backend configfs attribute changes.
A big thanks to QLogic + Pure Storage for the qla2xxx target bugfixes"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
Documentation/target: Fix tcm_mod_builder.py build breakage
iser-target: Fix REJECT CM event use-after-free OOPs
iscsi-target: Fix iser explicit logout TX kthread leak
iscsi-target: Fix iscsit_start_kthreads failure OOPs
iscsi-target: Fix use-after-free during TPG session shutdown
qla2xxx: terminate exchange when command is aborted by LIO
qla2xxx: drop cmds/tmrs arrived while session is being deleted
qla2xxx: disable scsi_transport_fc registration in target mode
qla2xxx: added sess generations to detect RSCN update races
qla2xxx: Abort stale cmds on qla_tgt_wq when plogi arrives
qla2xxx: delay plogi/prli ack until existing sessions are deleted
qla2xxx: cleanup cmd in qla workqueue before processing TMR
qla2xxx: kill sessions/log out initiator on RSCN and port down events
qla2xxx: fix command initialization in target mode.
qla2xxx: Remove msleep in qlt_send_term_exchange
qla2xxx: adjust debug flags
qla2xxx: release request queue reservation.
qla2xxx: Add flush after updating ATIOQ consumer index.
qla2xxx: Enable target mode for ISP27XX
qla2xxx: Fix hardware lock/unlock issue causing kernel panic.
...
c4iw_poll_cq_on() shouldn't fail the poll operation just because
the CQE status is unknown. Rather, it should map this to the
"fatal error" status and log the anomaly.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To add Hardware accelerated support in 802.1ad vlan, replace
Current VLAN macros to CVLAN.
Replace:
MLX4_WQE_CTRL_INS_VLAN
MLX4_CQE_VLAN_PRESENT_MASK
With:
MLX4_WQE_CTRL_INS_CVLAN
MLX4_CQE_CVLAN_PRESENT_MASK
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a bug in iser-target code where the REJECT CM event
handler code currently performs a isert_put_conn() for the final
isert_conn->kref put, while iscsi_np process context is still blocked
in isert_get_login_rx().
Once isert_get_login_rx() is awoking due to login timeout, iscsi_np
process context will attempt to invoke iscsi_target_login_sess_out()
to cleanup iscsi_conn as expected, and calls isert_wait_conn() +
isert_free_conn() which triggers the use-after-free OOPs.
To address this bug, move the kref_get_unless_zero() call from
isert_connected_handler() into isert_connect_request() immediately
preceeding isert_rdma_accept() to ensure the CM handler cleanup
paths and isert_free_conn() are always operating with two refs.
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Change of license from GPLv2 to dual-license (GPLv2 and BSD 2-Clause)
All contributors were contacted off-list and permission to make this
change was received. The complete list of contributors are Cc:ed here.
Cc: Tejun Heo <tj@kernel.org>
Cc: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Li RongQing <roy.qing.li@gmail.com>
Cc: Devendra Naga <devendra.aaru@gmail.com>
Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If the above is turned off then ipoib_cm_dev_init unconditionally
returns ENOSYS, and the newly added error handling in
0b3957 prevents ipoib from coming up at all:
kernel: mlx4_0: ipoib_transport_dev_init failed
kernel: mlx4_0: failed to initialize port 1 (ret = -12)
Fixes: 0b39578bcd (IB/ipoib: Use dedicated workqueues per interface)
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
T3 HW only supports 32 bit MRs. If the system uses 64 bit memory
addresses, then a registered 32 bit MR will wrap and write to the
wrong memory when used with addresses > 4GB. To prevent this,
simply fail to allocate an MR on 64 bit machines (other means
of registering memory are still available and software can still
work, we just don't allow this means of memory registration).
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is little chance our memory allocation will fail, so we can
combine initializing the work structs with allocating them instead of
looping through all of them once to allocate and again to initialize.
Then when we need to actually find out if our device is up or in the
process of going down, have all of our work structs batched up, take the
spin_lock once and only once, and do all of the batch under the one
spin_lock invocation instead of incurring all of the locked memory cycles
we would otherwise incur to take/release the spin_lock over and over
again.
Signed-off-by: Doug Ledford <dledford@redhat.com>
We create a number of work structs to be queued up to a workqueue, and
on completion of the workqueue handler, the workqueue handler frees the
allocated memory. If, however, we don't queue the work struct because
the device is going down, then we need to free the memory ourselves.
Signed-off-by: Doug Ledford <dledford@redhat.com>
On failure, we loop through all possible pointers and test them before
calling kfree. But really, why even attempt to free items we didn't
allocate when we can easily loop through exactly and only the devices
for which the original memory allocation succeeded and free just those.
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For IB links, reading HCA flow counters through iboe_process_mad() should
be used when mlx4_ib_process_mad() is invoked only for VFs PMA queries and
exactly nothing else.
Fixes: 7193a141eb ('IB/mlx4: Set VF to read from QP counters')
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In little endian cases, the macros be16_to_cpu and cpu_to_be64
unfolds to __swab{16,64} which provides special case for constants.
In big endian cases, __constant_be16_to_cpu and be16_to_cpu
expand directly to the same expression. The same applies for
__constant_cpu_to_be64 and cpu_to_be64.
So, replace __constant_be16_to_cpu with be16_to_cpu and
__constant_cpu_to_be64 with cpu_to_be64, with the goal of getting
rid of the definition of __constant_be16_to_cpu and
__constant_cpu_to_be64 completely.
Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When switching between modes (datagram / connected) change the MTU
accordingly.
datagram mode up to 4K, connected mode up to (64K - 0x10).
Signed-off-by: ELi Cohen <eli@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better
performance.
This MTU plus overhead puts the memory allocation for IP based packets at
32 4k pages (order 5), which have to be contiguous.
When the system memory under pressure, it was observed that allocating 128k
contiguous physical memory is difficult and causes serious errors (such as
system becomes unusable).
This enhancement resolve the issue by removing the physically contiguous
memory requirement using Scatter/Gather feature that exists in Linux stack.
With this fix Scatter-Gather will be supported also in connected mode.
This change reverts some of the change made in commit e112373fd6
("IPoIB/cm: Reduce connected mode TX object size").
The ability to use SG in IPoIB CM is possible because the coupling
between NETIF_F_SG and NETIF_F_CSUM was removed in commit
ec5f061564 ("net: Kill link between CSUM and SG features.")
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Christian Marie <christian@ponies.io>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_ucm_release_dev clears the wrong bit if devnum is greater
than IB_UCM_MAX_DEVICES.
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix for incorrect recording of the MAC address
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Neighbor resolution doesn't work without this fix
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fixes to allow clients to make remove mapping requests, after
they have provided the user space service with the mapping
information, they are using when the service is restarted.
1) Adding IWPM_REG_VALID, IWPM_REG_INCOMPL and IWPM_REG_UNDEF
registration types for the port mapper clients and functions
to set/check the registration type.
2) If the port mapper user space service is not available to register
the client, then its registration stays IWPM_REG_UNDEF and the
registration isn't checked until the service becomes available
(no mappings are possible, if the user space service isn't running).
3) After the service is restarted, the user space port mapper pid is set
to valid and the client registration is set to IWPM_REG_INCOMPL
to allow the client to make remove mapping requests.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Error values of ib_query_port() and ib_query_device() weren't propagated
correctly. Because of that, ipoib_add_port() could return NULL value,
which escaped the IS_ERR() check in ipoib_add_one() and we crashed.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
mlx4 VFs can provide CQE raw time-stamping services, but they
don't have the hca core clock mapped to their PCI bars.
As such, we should not attempt to query and report the clock offset
to user space for VFs. Doing so causes query_device over VFs to fail
with -ENOSUPP.
Fixes: 4b664c4355 ('IB/mlx4: Add support for CQ time-stamping')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Whenever ib_cm gets remove_one call, like when there is a hot-unplug
event, the driver should mark itself as going_down and confirm that no
new works are going to be queued for that device.
so, the order of the actions are:
1. mark the going_down bit.
2. flush the wq.
3. [make sure no new works for that device.]
4. unregister mad agent.
otherwise, works that are already queued can be scheduled after the mad
agent was freed.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We might return res which is not initialized. Also
reduce code duplication by exporting srp_parse_tmo so
srp_tmo_set can reuse it.
Detected by Coverity.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jenny Falkovich <jennyf@mellanox.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In little endian cases, the macro cpu_to_be{16,32,64} unfolds to
__swab{16,32,64} which provides special case for constants. In
big endian cases, __constant_cpu_to_be{16,32,64} and
cpu_to_be{16,32,64} expand directly to the same expression. So,
replace __constant_cpu_to_be{16,32,64} with cpu_to_be{16,32,64}
with the goal of getting rid of the definitions of
__constant_cpu_to_be{16,32,64} completely.
The Coccinelle semantic patch that performs this transformation
is as follows:
@@expression x;@@
(
- __constant_cpu_to_be16(x)
+ cpu_to_be16(x)
|
- __constant_cpu_to_be32(x)
+ cpu_to_be32(x)
|
- __constant_cpu_to_be64(x)
+ cpu_to_be64(x)
)
Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We recently added BUG_ON's which were inappropriate for a condition which
should never happen. Change these to be WARN_ON_ONCE as a debugging aid.
Fixes: 4cd7c9479a ('IB/mad: Add support for additional MAD info to/from drivers')
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The define OPA_LID_PERMISSIVE is big endian and was compared to the
cpu endian variable opa_drslid.
Problem caught by 0-day build infrastructure.
Fixes: 8e4349d13f (IB/mad: Add final OPA MAD processing)
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: John, Jubin <jubin.john@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Persuant to Liran's comments on node_type on linux-rdma
mailing list:
In an effort to reform the RDMA core and ULPs to minimize use of
node_type in struct ib_device, an additional bit is added to
struct ib_device for is_switch (IB switch). This is needed
to be initialized by any IB switch device driver. This is a
NEW requirement on such device drivers which are all
"out of tree".
In addition, an ib_switch helper was added to ib_verbs.h
based on the is_switch device bit rather than node_type
(although those should be consistent).
The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs)
as well as (IPoIB and SRP) ULPs are updated where
appropriate to use this new helper. In some cases,
the helper is now used under the covers of using
rdma_[start end]_port rather than the open coding
previously used.
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
Pull SCSI target updates from Nicholas Bellinger:
"It's been a busy development cycle for target-core in a number of
different areas.
The fabric API usage for se_node_acl allocation is now within
target-core code, dropping the external API callers for all fabric
drivers tree-wide.
There is a new conversion to RCU hlists for se_node_acl and
se_portal_group LUN mappings, that turns fast-past LUN lookup into a
completely lockless code-path. It also removes the original
hard-coded limitation of 256 LUNs per fabric endpoint.
The configfs attributes for backends can now be shared between core
and driver code, allowing existing drivers to use common code while
still allowing flexibility for new backend provided attributes.
The highlights include:
- Merge sbc_verify_dif_* into common code (sagi)
- Remove iscsi-target support for obsolete IFMarker/OFMarker
(Christophe Vu-Brugier)
- Add bidi support in target/user backend (ilias + vangelis + agover)
- Move se_node_acl allocation into target-core code (hch)
- Add crc_t10dif_update common helper (akinobu + mkp)
- Handle target-core odd SGL mapping for data transfer memory
(akinobu)
- Move transport ID handling into target-core (hch)
- Move task tag into struct se_cmd + support 64-bit tags (bart)
- Convert se_node_acl->device_list[] to RCU hlist (nab + hch +
paulmck)
- Convert se_portal_group->tpg_lun_list[] to RCU hlist (nab + hch +
paulmck)
- Simplify target backend driver registration (hch)
- Consolidate + simplify target backend attribute implementations
(hch + nab)
- Subsume se_port + t10_alua_tg_pt_gp_member into se_lun (hch)
- Drop lun_sep_lock for se_lun->lun_se_dev RCU usage (hch + nab)
- Drop unnecessary core_tpg_register TFO parameter (nab)
- Use 64-bit LUNs tree-wide (hannes)
- Drop left-over TARGET_MAX_LUNS_PER_TRANSPORT limit (hannes)"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (76 commits)
target: Bump core version to v5.0
target: remove target_core_configfs.h
target: remove unused TARGET_CORE_CONFIG_ROOT define
target: consolidate version defines
target: implement WRITE_SAME with UNMAP bit using ->execute_unmap
target: simplify UNMAP handling
target: replace se_cmd->execute_rw with a protocol_data field
target/user: Fix inconsistent kmap_atomic/kunmap_atomic
target: Send UA when changing LUN inventory
target: Send UA upon LUN RESET tmr completion
target: Send UA on ALUA target port group change
target: Convert se_lun->lun_deve_lock to normal spinlock
target: use 'se_dev_entry' when allocating UAs
target: Remove 'ua_nacl' pointer from se_ua structure
target_core_alua: Correct UA handling when switching states
xen-scsiback: Fix compile warning for 64-bit LUN
target: Remove TARGET_MAX_LUNS_PER_TRANSPORT
target: use 64-bit LUNs
target: Drop duplicate + unused se_dev_check_wce
target: Drop unnecessary core_tpg_register TFO parameter
...
Merge third patchbomb from Andrew Morton:
- the rest of MM
- scripts/gdb updates
- ipc/ updates
- lib/ updates
- MAINTAINERS updates
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (67 commits)
genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
genalloc: rename dev_get_gen_pool() to gen_pool_get()
x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
MAINTAINERS: add zpool
MAINTAINERS: BCACHE: Kent Overstreet has changed email address
MAINTAINERS: move Jens Osterkamp to CREDITS
MAINTAINERS: remove unused nbd.h pattern
MAINTAINERS: update brcm gpio filename pattern
MAINTAINERS: update brcm dts pattern
MAINTAINERS: update sound soc intel patterns
MAINTAINERS: remove website for paride
MAINTAINERS: update Emulex ocrdma email addresses
bcache: use kvfree() in various places
libcxgbi: use kvfree() in cxgbi_free_big_mem()
target: use kvfree() in session alloc and free
IB/ehca: use kvfree() in ipz_queue_{cd}tor()
drm/nouveau/gem: use kvfree() in u_free()
drm: use kvfree() in drm_free_large()
cxgb4: use kvfree() in t4_free_mem()
cxgb3: use kvfree() in cxgb_free_mem()
...
Main excitement here is Peter Zijlstra's lockless rbtree optimization to
speed module address lookup. He found some abusers of the module lock
doing that too.
A little bit of parameter work here too; including Dan Streetman's breaking
up the big param mutex so writing a parameter can load another module (yeah,
really). Unfortunately that broke the usual suspects, !CONFIG_MODULES and
!CONFIG_SYSFS, so those fixes were appended too.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVkgKHAAoJENkgDmzRrbjxQpwQAJVmBN6jF3SnwbQXv9vRixjH
58V33sb1G1RW+kXxQ3/e8jLX/4VaN479CufruXQp+IJWXsN/CH0lbC3k8m7u50d7
b1Zeqd/Yrh79rkc11b0X1698uGCSMlzz+V54Z0QOTEEX+nSu2ZZvccFS4UaHkn3z
rqDo00lb7rxQz8U25qro2OZrG6D3ub2q20TkWUB8EO4AOHkPn8KWP2r429Axrr0K
wlDWDTTt8/IsvPbuPf3T15RAhq1avkMXWn9nDXDjyWbpLfTn8NFnWmtesgY7Jl4t
GjbXC5WYekX3w2ZDB9KaT/DAMQ1a7RbMXNSz4RX4VbzDl+yYeSLmIh2G9fZb1PbB
PsIxrOgy4BquOWsJPm+zeFPSC3q9Cfu219L4AmxSjiZxC3dlosg5rIB892Mjoyv4
qxmg6oiqtc4Jxv+Gl9lRFVOqyHZrTC5IJ+xgfv1EyP6kKMUKLlDZtxZAuQxpUyxR
HZLq220RYnYSvkWauikq4M8fqFM8bdt6hLJnv7bVqllseROk9stCvjSiE3A9szH5
OgtOfYV5GhOeb8pCZqJKlGDw+RoJ21jtNCgOr6DgkNKV9CX/kL/Puwv8gnA0B0eh
dxCeB7f/gcLl7Cg3Z3gVVcGlgak6JWrLf5ITAJhBZ8Lv+AtL2DKmwEWS/iIMRmek
tLdh/a9GiCitqS0bT7GE
=tWPQ
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Main excitement here is Peter Zijlstra's lockless rbtree optimization
to speed module address lookup. He found some abusers of the module
lock doing that too.
A little bit of parameter work here too; including Dan Streetman's
breaking up the big param mutex so writing a parameter can load
another module (yeah, really). Unfortunately that broke the usual
suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
appended too"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
modules: only use mod->param_lock if CONFIG_MODULES
param: fix module param locks when !CONFIG_SYSFS.
rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
module: add per-module param_lock
module: make perm const
params: suppress unused variable error, warn once just in case code changes.
modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
kernel/module.c: avoid ifdefs for sig_enforce declaration
kernel/workqueue.c: remove ifdefs over wq_power_efficient
kernel/params.c: export param_ops_bool_enable_only
kernel/params.c: generalize bool_enable_only
kernel/module.c: use generic module param operaters for sig_enforce
kernel/params: constify struct kernel_param_ops uses
sysfs: tightened sysfs permission checks
module: Rework module_addr_{min,max}
module: Use __module_address() for module_address_lookup()
module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
module: Optimize __module_address() using a latched RB-tree
rbtree: Implement generic latch_tree
seqlock: Introduce raw_read_seqcount_latch()
...
Use kvfree() instead of open-coding it.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Cc: Christoph Raisch <raisch@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
This is the usual grab bag of driver updates (lpfc, hpsa,
megaraid_sas, cxgbi, be2iscsi) plus an assortment of minor updates.
There are also one new driver: the Cisco snic; the advansys driver has
been rewritten to get rid of the warning about converting it to the
DMA API, the tape statistics patch got in and finally, there's a
resuffle of SCSI header files to separate more cleanly initiator from
target mode (and better share the common definitions).
Signed-off-by: James Bottomley <JBottomley@Odin.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJViKWdAAoJEDeqqVYsXL0MAr8IAMmlA6HBVjMJJFCEOY9corHj
e70MNQa7LUgf+JCdOtzGcvHXTiFFd4IHZAwXUJAnsC4IU2QWEfi1bjUTErlqBIGk
LoZlXXpEHnFpmWot3OluOzzcGcxede8rVgPiKWVVdojIngBC2+LL/i2vPCJ84ri9
WCVlk6KBvWZXuU6JuOKAb2FO9HOX7Q61wuKAMast2Qc6RNc2ksgc7VbstsITqzZ9
FVEsjmQ5lqUj+xdxBpiUOdUpc22IJ4VcpBgQ2HrThvg6vf4aq937RJ/g4vi/g0SU
Utk0a3bUw1H/WnYAfJVFx83nVEsS/954Z7/ERDg1sjlfLYwQtQnpov0XIbPIbZU=
=k9IT
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is the usual grab bag of driver updates (lpfc, hpsa,
megaraid_sas, cxgbi, be2iscsi) plus an assortment of minor updates.
There is also one new driver: the Cisco snic. The advansys driver has
been rewritten to get rid of the warning about converting it to the
DMA API, the tape statistics patch got in and finally, there's a
resuffle of SCSI header files to separate more cleanly initiator from
target mode (and better share the common definitions)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (156 commits)
snic: driver for Cisco SCSI HBA
qla2xxx: Fix indentation
qla2xxx: Comment out unreachable code
fusion: remove dead MTRR code
advansys: fix compilation errors and warnings when CONFIG_PCI is not set
mptsas: fix depth param in scsi_track_queue_full
megaraid: fix irq setup process regression
lpfc: Update version to 10.7.0.0 for upstream patch set.
lpfc: Fix to drop PLOGIs from fabric node till LOGO processing completes
lpfc: Fix scsi task management error message.
lpfc: Fix cq_id masking problem.
lpfc: Fix scsi prep dma buf error.
lpfc: Add support for using block multi-queue
lpfc: Devices are not discovered during takeaway/giveback testing
lpfc: Fix vport deletion failure.
lpfc: Check for active portpeerbeacon.
lpfc: Update driver version for upstream patch set 10.6.0.1.
lpfc: Change buffer pool empty message to miscellaneous category
lpfc: Fix incorrect log message reported for empty FCF record.
lpfc: Fix rport leak.
...
- A large cleanup of how device capabilities are checked for various
features
- Additional cleanups in the MAD processing
- Update to the srp driver
- Creation and use of centralized log message helpers
- Add const to a number of args to calls and clean up call chain
- Add support for extended cq create verb
- Add support for timestamps on cq completion
- Add support for processing OPA MAD packets
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVeyzqAAoJELgmozMOVy/di3wP/jml4F9crvQn7UBJjGm/rgcI
wzZ2GZTqxQE8dn+W6gQsdKOzy0Ibxx5UYGp9ruInuxAcVh9t1PcylanasaiGMEtY
mrGRFjipJ9jYa+yDQDTQi8EFMClZuMSvtRLKjzYITudCXQck37V+F5YlP6VphjX7
JeiM4a+4rD0ukk5PKGvUw51sP1eawKtEdUvnqcOEI2tJgQmzJBP4mXrhVtS/0wSc
Pi8TRN5QKi3Drom/tK9QQ/ncoYngi4BKLfszCeU373HJq6qXqsxBYvs3jX6MPzfv
Aooj272JxBgCYxkmEfECezDpmi3PbWDJjXj/xCLjfhjISDtHHHVLGVMODZpwUEsL
2wBgwlzdajVopSbSLvsjQNtQw25s7sDWpu+TFKbS0u+W2d0ZOyipM1Xeje+OtDHQ
clhwvDhgSfeI/bJ1YdtNLbvINrwsfZD213zD+WH21A/9weAVr3hEfTuSaNFiTiRn
5yywP36TM0wH90KhiWoLrztcHvoE5p7kGuqzv04MRjrMMNHEJK2/IhWvT97Ewngu
vWrZl7QRzXYcGspCOp2aJW9Wr2rhGRrv28TF+thpNrIJOB2JM4q4koCKZCcI0s2D
E6pY2YQSzvrA/ZSfcWIg4yhugcycIJkOf7ur2N/U43cwGXtaCzPWVnKMApmdnVOO
ZEMwD3OZ1OGcCHLhRL8Y
=yISf
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
- a large cleanup of how device capabilities are checked for various
features
- additional cleanups in the MAD processing
- update to the srp driver
- creation and use of centralized log message helpers
- add const to a number of args to calls and clean up call chain
- add support for extended cq create verb
- add support for timestamps on cq completion
- add support for processing OPA MAD packets
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (92 commits)
IB/mad: Add final OPA MAD processing
IB/mad: Add partial Intel OPA MAD support
IB/mad: Add partial Intel OPA MAD support
IB/core: Add OPA MAD core capability flag
IB/mad: Add support for additional MAD info to/from drivers
IB/mad: Convert allocations from kmem_cache to kzalloc
IB/core: Add ability for drivers to report an alternate MAD size.
IB/mad: Support alternate Base Versions when creating MADs
IB/mad: Create a generic helper for DR forwarding checks
IB/mad: Create a generic helper for DR SMP Recv processing
IB/mad: Create a generic helper for DR SMP Send processing
IB/mad: Split IB SMI handling from MAD Recv handler
IB/mad cleanup: Generalize processing of MAD data
IB/mad cleanup: Clean up function params -- find_mad_agent
IB/mlx4: Add support for CQ time-stamping
IB/mlx4: Add mmap call to map the hardware clock
IB/core: Pass hardware specific data in query_device
IB/core: Add timestamp_mask and hca_core_clock to query_device
IB/core: Extend ib_uverbs_create_cq
IB/core: Add CQ creation time-stamping flag
...
Pull x86 core updates from Ingo Molnar:
"There were so many changes in the x86/asm, x86/apic and x86/mm topics
in this cycle that the topical separation of -tip broke down somewhat -
so the result is a more traditional architecture pull request,
collected into the 'x86/core' topic.
The topics were still maintained separately as far as possible, so
bisectability and conceptual separation should still be pretty good -
but there were a handful of merge points to avoid excessive
dependencies (and conflicts) that would have been poorly tested in the
end.
The next cycle will hopefully be much more quiet (or at least will
have fewer dependencies).
The main changes in this cycle were:
* x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas
Gleixner)
- This is the second and most intrusive part of changes to the x86
interrupt handling - full conversion to hierarchical interrupt
domains:
[IOAPIC domain] -----
|
[MSI domain] --------[Remapping domain] ----- [ Vector domain ]
| (optional) |
[HPET MSI domain] ----- |
|
[DMAR domain] -----------------------------
|
[Legacy domain] -----------------------------
This now reflects the actual hardware and allowed us to distangle
the domain specific code from the underlying parent domain, which
can be optional in the case of interrupt remapping. It's a clear
separation of functionality and removes quite some duct tape
constructs which plugged the remap code between ioapic/msi/hpet
and the vector management.
- Intel IOMMU IRQ remapping enhancements, to allow direct interrupt
injection into guests (Feng Wu)
* x86/asm changes:
- Tons of cleanups and small speedups, micro-optimizations. This
is in preparation to move a good chunk of the low level entry
code from assembly to C code (Denys Vlasenko, Andy Lutomirski,
Brian Gerst)
- Moved all system entry related code to a new home under
arch/x86/entry/ (Ingo Molnar)
- Removal of the fragile and ugly CFI dwarf debuginfo annotations.
Conversion to C will reintroduce many of them - but meanwhile
they are only getting in the way, and the upstream kernel does
not rely on them (Ingo Molnar)
- NOP handling refinements. (Borislav Petkov)
* x86/mm changes:
- Big PAT and MTRR rework: making the code more robust and
preparing to phase out exposing direct MTRR interfaces to drivers -
in favor of using PAT driven interfaces (Toshi Kani, Luis R
Rodriguez, Borislav Petkov)
- New ioremap_wt()/set_memory_wt() interfaces to support
Write-Through cached memory mappings. This is especially
important for good performance on NVDIMM hardware (Toshi Kani)
* x86/ras changes:
- Add support for deferred errors on AMD (Aravind Gopalakrishnan)
This is an important RAS feature which adds hardware support for
poisoned data. That means roughly that the hardware marks data
which it has detected as corrupted but wasn't able to correct, as
poisoned data and raises an APIC interrupt to signal that in the
form of a deferred error. It is the OS's responsibility then to
take proper recovery action and thus prolonge system lifetime as
far as possible.
- Add support for Intel "Local MCE"s: upcoming CPUs will support
CPU-local MCE interrupts, as opposed to the traditional system-
wide broadcasted MCE interrupts (Ashok Raj)
- Misc cleanups (Borislav Petkov)
* x86/platform changes:
- Intel Atom SoC updates
... and lots of other cleanups, fixlets and other changes - see the
shortlog and the Git log for details"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits)
x86/hpet: Use proper hpet device number for MSI allocation
x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled
x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled
x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail
genirq: Prevent crash in irq_move_irq()
genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain
iommu, x86: Properly handle posted interrupts for IOMMU hotplug
iommu, x86: Provide irq_remapping_cap() interface
iommu, x86: Setup Posted-Interrupts capability for Intel iommu
iommu, x86: Add cap_pi_support() to detect VT-d PI capability
iommu, x86: Avoid migrating VT-d posted interrupts
iommu, x86: Save the mode (posted or remapped) of an IRTE
iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
iommu: dmar: Provide helper to copy shared irte fields
iommu: dmar: Extend struct irte for VT-d Posted-Interrupts
iommu: Add new member capability to struct irq_remap_ops
x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code
x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation
x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry()
...
Pull scsi target fixes from Nicholas Bellinger:
"Apologies for the late pull request.
Here are the outstanding target-pending fixes for v4.1 code.
The series contains three patches from Sagi + Co that address a few
iser-target issues that have been uncovered during recent testing at
Mellanox.
Patch #1 has a v3.16+ stable tag, and #2-3 have v3.10+ stable tags"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
iser-target: Fix possible use-after-free
iser-target: release stale iser connections
iser-target: Fix variable-length response error completion
We are burrying direct access to MTRR code support on
x86 in order to take advantage of PAT. In the future, we
also want to make the default behaviour of ioremap_nocache()
to use strong UC, use of mtrr_add() on those systems
would make write-combining void.
In order to help both enable us to later make strong
UC default and in order to phase out direct MTRR access
code port the driver over to arch_phys_wc_add() and
annotate that the device driver requires systems to
boot with PAT disabled, with the 'nopat' kernel parameter.
This is a workable compromise given that the ipath device
driver powers the old HTX bus cards that only work in
AMD systems, while the newer IB/qib device driver
powers all PCI-e cards. The ipath device driver is
obsolete, hardware is hard to find and because of this
its a reasonable compromise to require users of ipath
to boot with 'nopat'.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: infinipath@intel.com
Cc: jbeulich@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-rdma@vger.kernel.org
Cc: mchehab@osg.samsung.com
Cc: toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1434053994-2196-4-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1434356898-25135-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch drops unnecessary target_core_fabric_ops parameter usage
for core_tpg_register() during fabric driver TFO->fabric_make_tpg()
se_portal_group creation callback execution.
Instead, use the existing se_wwn->wwn_tf->tf_ops pointer to ensure
fabric driver is really using the same TFO provided at module_init
time.
Also go ahead and drop the forward TFO declarations tree-wide, and
handling the special case for iscsi-target discovery TPG.
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This is an infrastructure step for querying VF and PF counters.
This code was in the IB driver, move it to the mlx4 core driver
so it will be accessible for more use cases.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As IB VFs are not capable to read the port counters through MADs,
move there to read their own QP counters to gather statistics.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an infrastructure step to attach all the QPs opened from the
IB driver to a counter in order to collect VF stats from the PF using
those counters.
If the port's type is Ethernet, the counter policy demands two counters
per port (one for RoCE and one for Ethernet). The port default counter
(allocated in mlx4_core) is used for the Ethernet netdev QPs and we
allocate another counter for RoCE.
If the port's traffic is Infiniband, the counter policy demands
one counter per port, so it can use the port's default counter.
Also, Add 'allocated' flag for each counter in order to clean it at
unload.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reserve the last valid counter index for "sink" counter, when a
new counter cannot be allocated, the driver will use this counter.
In order to avoid allocating this counter on any other flow, fix the
indices bitmap allocation range, and reserve the sink counter index.
Add macro for the sink counter index and replace all appearences of the
index with the macro.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For devices which support OPA MADs
1) Use previously defined SMP support functions.
2) Pass correct base version to ib_create_send_mad when processing OPA MADs.
3) Process out_mad_key_index returned by agents for a response. This is
necessary because OPA SMP packets must carry a valid pkey.
4) Carry the correct segment size (OPA vs IBTA) of RMPP messages within
ib_mad_recv_wc.
5) Handle variable length OPA MADs by:
* Adjusting the 'fake' WC for locally routed SMP's to represent the
proper incoming byte_len
* out_mad_size is used from the local HCA agents
1) when sending agent responses on the wire
2) when passing responses through the local_completions
function
NOTE: wc.byte_len includes the GRH length and therefore is different
from the in_mad_size specified to the local HCA agents.
out_mad_size should _not_ include the GRH length as it is added
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add OPA SMP processing functionality.
Define the new OPA SMP format, create support functions for this format using
the previously defined helper functions as appropriate.
These functions are defined in this patch and used in the final OPA MAD support
patch.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch is the first of 3 which adds processing of OPA MADs
1) Add Intel Omni-Path Architecture defines
2) Increase max management version to accommodate OPA
3) update ib_create_send_mad
If the device supports OPA MADs and the MAD being sent is the OPA base
version alter the MAD size and sg lengths as appropriate
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support alternate sized MADs (and variable sized MADs on OPA
devices) add in/out MAD size parameters to the process_mad core call.
In addition, add an out_mad_pkey_index to communicate the pkey index the driver
wishes the MAD stack to use when sending OPA MAD responses.
The out MAD size and the out MAD PKey index are required by the MAD
stack to generate responses on OPA devices.
Furthermore, the in and out MAD parameters are made generic by specifying them
as ib_mad_hdr rather than ib_mad.
Drivers are modified as needed and are protected by BUG_ON flags if the MAD
sizes passed to them is incorrect.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch implements allocating alternate receive MAD buffers within the MAD
stack. Support for OPA to send/recv variable sized MADs is implemented later.
1) Convert MAD allocations from kmem_cache to kzalloc
kzalloc is more flexible to support devices with different sized MADs
and research and testing showed that the current use of kmem_cache does
not provide performance benefits over kzalloc.
2) Change struct ib_mad_private to use a flex array for the mad data
3) Allocate ib_mad_private based on the size specified by devices in
rdma_max_mad_size.
4) Carry the allocated size in ib_mad_private to be used when processing
ib_mad_private objects.
5) Alter DMA mappings based on the mad_size of ib_mad_private.
6) Replace the use of sizeof and static defines as appropriate
7) Add appropriate casts for the MAD data when calling processing
functions.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add max MAD size to the device immutable data set and have all drivers that
support MADs report the current IB MAD size (IB_MGMT_MAD_SIZE) to the core.
Verify MAD size data in both the MAD core and when reading the immutable data.
OPA drivers will report alternate MAD sizes in subsequent patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In preparation to support the new OPA MAD Base version, add a base version
parameter to ib_create_send_mad and set it to IB_MGMT_BASE_VERSION for current
users.
Definition of the new base version and it's processing will occur in later
patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection.
Add a helper function which is generic to processing the DR forwarding checks which
can be used by both IB and OPA SMP code.
Use this function in the current IB function smi_check_forward_dr_smp.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection.
Add a helper function which is generic to processing DR SMP Recv messages which
can be used by both IB and OPA SMP code.
Use this function in the current IB function smi_handle_dr_smp_recv.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection.
Add a helper function which is generic to processing DR SMP Send messages which
can be used by both IB and OPA SMP code.
Use this function in the current IB function smi_handle_dr_smp_send.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Make a helper function to process Directed Route SMPs to be called by the IB
MAD Recv Handler, ib_mad_recv_done_handler.
This cleans up the MAD receive handler code a bit and allows for us to better
share the SMP processing code between IB and OPA SMPs.
IB and OPA SMPs share the same processing algorithm but have different header
formats and permissive LID detection. Therefore this and subsequent patches
split the common processing code from the IB specific code in anticipation of
sharing those algorithms with the OPA code.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_find_send_mad only needs access to the MAD header not the full IB MAD.
Change the local variable to ib_mad_hdr and change the corresponding cast.
This allows for clean usage of this function with both IB and OPA MADs because
OPA MADs carry the same header as IB MADs.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
find_mad_agent only needs read only access to the MAD header. Update the
ib_mad pointer to be const ib_mad_hdr. Adjust call tree.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This includes:
* support allocation of CQ with the TIMESTAMP_COMPLETION creation flag.
* add timestamp_mask and hca_core_clock to query_device, reporting the
number of supported timestamp bits (mask) and the hca_core_clock frequency.
* return hca core clock's offset in query_device vendor's data,
this is needed in order to read the HCA's core clock.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to read the HCA's cycle counter efficiently in
user space, we need to map the HCA's register.
This is done through mmap call.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vendors should be able to pass vendor specific data to/from
user-space via query_device uverb. In order to do this,
we need to pass the vendors' specific udata.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to expose timestamp we need to expose two new attributes in
query_device to be used for CQ completion time-stamping:
timestamp_mask - how many bits are valid in the timestamp, where timestamp
values could be 64bits the most.
hca_core_clock - timestamp is given in HW cycles, the frequency in KHZ units
of the HCA, necessary in order to convert cycles to seconds.
This is added both to ib_query_device and its respective uverbs counterpart.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_uverbs_ex_create_cq follows the extension verbs
mechanism. New features (for example, CQ creation flags
field which is added in a downstream patch) could used
via user-space libraries without breaking the ABI.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, ib_create_cq uses cqe and comp_vecotr instead
of the extendible ib_cq_init_attr struct.
Earlier patches already changed the vendors to work with
ib_cq_init_attr. This patch changes the consumers too.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a new ib_cq_init_attr structure which contains the
previous cqe (minimum number of CQ entries) and comp_vector
(completion vector) in addition to a new flags field.
All vendors' create_cq callbacks are changed in order
to work with the new API.
This commit does not change any functionality.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com> to patch #2
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously we configured HW MTU to be netdev->mtu, actually we
need to configure netdev->mtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN).
Also, query MTU can not fail, hence make the relevant helper a
void functionm, add mlx5e_set_dev_port_mtu, helper function to
handle MTU setting.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle this configuration:
Queues Per Page * SGE BAR2 Queue Register Area Size > Page Size
Use cxgb4_bar2_sge_qregs() to obtain the proper location within the
bar2 region for a given qid.
Rework the DB and GTS write functions to make use of this bar2 info.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A reorganisation of the PD allocation and deallocation in commit
9ba1377daa ("RDMA/ocrdma: Move PD resource management to driver.")
introduced a double free on pd, as detected by static analysis by
smatch:
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c:682 ocrdma_alloc_pd()
error: double free of 'pd'^
The original call to ocrdma_mbx_dealloc_pd() (which does not kfree
pd) was replaced with a call to _ocrdma_dealloc_pd() (which does
kfree pd). The kfree following this call causes the double free,
so just remove it to fix the problem.
Fixes: 9ba1377daa ("RDMA/ocrdma: Move PD resource management to driver.")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This code causes a static checker warning:
drivers/infiniband/hw/usnic/usnic_uiom.c:476 usnic_uiom_alloc_pd()
warn: passing zero to 'PTR_ERR'
This code isn't buggy, but iommu_domain_alloc() doesn't return an error
pointer so we can simplify the error handling and silence the static
checker warning.
The static checker warning is to catch place which do:
if (!ptr)
return ERR_PTR(ptr);
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Goodell <dgoodell@cisco.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use kernel.h macro definition.
Thanks to Julia Lawall for Coccinelle scripting support.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Registering an event handler is done for a device. This device may have
one RoCE port (no SA cap) and one InfiniBand port (has SA cap).
Therefore, warning from the event handler about a specific port that
doesn't have SA cap is correct but pollutes the kernel log without a
need.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
iser connection termination process happens in 2 stages:
- isert_wait_conn:
- resumes rdma disconnect
- wait for session commands
- wait for flush completions (post a marked wr to signal we are done)
- wait for logout completion
- queue work for connection cleanup (depends on disconnected/timewait
events)
- isert_free_conn
- last reference put on the connection
In case we are terminating during IOs, we might be posting send/recv
requests after we posted the last work request which might lead
to a use-after-free condition in isert_handle_wc.
After we posted the last wr in isert_wait_conn we are guaranteed that
no successful completions will follow (meaning no new work request posts
may happen) but other flush errors might still come. So before we
put the last reference on the connection, we repeat the process of
posting a marked work request (isert_wait4flush) in order to make sure all
pending completions were flushed.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jenny Falkovich <jennyf@mellanox.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
When receiving a new iser connect request we serialize
the pending requests by adding the newly created iser connection
to the np accept list and let the login thread process the connect
request one by one (np_accept_wait).
In case we received a disconnect request before the iser_conn
has begun processing (still linked in np_accept_list) we should
detach it from the list and clean it up and not have the login
thread process a stale connection. We do it only when the connection
state is not already terminating (initiator driven disconnect) as
this might lead us to access np_accept_mutex after the np was released
in live shutdown scenarios.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jenny Falkovich <jennyf@mellanox.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Since commit "2426bd456a6 target: Report correct response ..."
we might get a command with data_size that does not fit to
the number of allocated data sg elements. Given that we rely on
cmd t_data_nents which might be different than the data_size,
we sometimes receive local length error completion. The correct
approach would be to take the command data_size into account when
constructing the ib sg_list.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jenny Falkovich <jennyf@mellanox.com>
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Ethernet functionality is only available when working in ISSI > 0 mode.
Previously, the IB driver wasn't ready to work on that mode, and hence
building both the IB driver and the Ethernet functionality in the core
driver were disallowed by Kconfigs.
Now, once we have all the pre-steps in place, we can remove this limitation.
The last steps in the IB driver for getting that setup to work are:
create dummy SRQ for the driver's use (until now we could use XRC_SRQ
as SRQ and XRC_SRQ, after moving to ISSI > 0, we separate XRC SRQs from
basic SRQs) and adapt the create QP function to be compatible with ISSI > 0.
Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we still don't have RoCE support in mlx5, avoid
creating IB driver instance over Ethernet ports.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ISSI > 0 mode, most of the MAD_IFC command features are deprecated, and can't
be used. Therefore, when in that mode, we replace all of them with other commands
that provide the required functionality.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When working in ISSI > 0 mode, the model exposed by the device for
XRCs and SRQs is different. XRCs use XRC SRQs and plain SRQs are based
on RPM (Receive Memory Pool).
Add helper functions to create, modify, query, and arm XRC SRQs and RMPs.
Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ethtool support to get adapter specific hardware statistics
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only include SCSI initiator header files in target code that needs
these header files, namely the SCSI pass-through code and the tcm_loop
driver. Change SCSI_SENSE_BUFFERSIZE into TRANSPORT_SENSE_BUFFER in
target code because the former is intended for initiator code and the
latter for target code. With this patch the only initiator include
directives in target code that remain are as follows:
$ git grep -nHE 'include .scsi/(scsi.h|scsi_host.h|scsi_device.h|scsi_cmnd.h)' drivers/target drivers/infiniband/ulp/{isert,srpt} drivers/usb/gadget/legacy/tcm_*.[ch] drivers/{vhost,xen} include/{target,trace/events/target.h}
drivers/target/loopback/tcm_loop.c:29:#include <scsi/scsi.h>
drivers/target/loopback/tcm_loop.c:31:#include <scsi/scsi_host.h>
drivers/target/loopback/tcm_loop.c:32:#include <scsi/scsi_device.h>
drivers/target/loopback/tcm_loop.c:33:#include <scsi/scsi_cmnd.h>
drivers/target/target_core_pscsi.c:39:#include <scsi/scsi_device.h>
drivers/target/target_core_pscsi.c:40:#include <scsi/scsi_host.h>
drivers/xen/xen-scsiback.c:52:#include <scsi/scsi_host.h> /* SG_ALL */
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
In order to support constant callers of agent_send_response we add const
specifiers to the its pointer arguments.
Adjust the call tree accordingly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The process_mad device function declares some parameters as "in". Make those
parameters const and adjust the call tree under process_mad in the various
drivers accordingly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The unwinding clean up code are err_create_flow starts at the current
index i. That means we shouldn't increment i until we're really sure
we won't have to destroy the current flow; otherwise we might
increment the index, fail inside an is_bonded block, and end up
accessing off the end of the reg_id[] array.
This was detected by Coverity (CID 1271229).
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If ocrdma_get_pd_num() fails, then we need to free the pd struct we allocated.
This was detected by Coverity (CID 1271245).
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Avoid that sparse complains about ipoib_neigh_hash_init(). This
patch does not change any functionality. See also patch "IPoIB:
Fix memory leak in the neigh table deletion flow" (commit ID
66172c0993).
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
RDMA/nes: Enable the use of the tos field in the nes driver
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rdma-cma/iw_cm: Export tos field to iwarp providers
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI target fixes from Nicholas Bellinger:
"These are mostly minor fixes, with the exception of the following that
address fall-out from recent v4.1-rc1 changes:
- regression fix related to the big fabric API registration changes
and configfs_depend_item() usage, that required cherry-picking one
of HCH's patches from for-next to address the issue for v4.1 code.
- remaining TCM-USER -v2 related changes to enforce full CDB
passthrough from Andy + Ilias.
Also included is a target_core_pscsi driver fix from Andy that
addresses a long standing issue with a Scsi_Host reference being
leaked on PSCSI device shutdown"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
iser-target: Fix error path in isert_create_pi_ctx()
target: Use a PASSTHROUGH flag instead of transport_types
target: Move passthrough CDB parsing into a common function
target/user: Only support full command pass-through
target/user: Update example code for new ABI requirements
target/pscsi: Don't leak scsi_host if hba is VIRTUAL_HOST
target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem
target: Drop signal_pending checks after interruptible lock acquire
target: Add missing parentheses
target: Fix bidi command handling
target/user: Disallow full passthrough (pass_level=0)
ISCSI: fix minor memory leak
Previously, mlx4_en allocated EQs and used them exclusively.
This affected RoCE performance, as applications which are
events sensitive were limited to use only the legacy EQs.
Change that by introducing an EQ pool. This pool is managed
by mlx4_core. EQs are assigned to ports (when there are limited
number of EQs, multiple ports could be assigned to the same EQs).
An exception to this rule is the ASYNC EQ which handles various events.
Legacy EQs are completely removed as all EQs could be shared.
When a consumer (mlx4_ib/mlx4_en) requests an EQ, it asks for
EQ serving on a specific port. The core driver calculates which
EQ should be assigned to that request.
Because IRQs are shared between IB and Ethernet modules, their
names only include the PCI device BDF address.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In SRIOV, when simple (i.e - Ethernet L2 only) flow steering rules are
created, always create them at MLX4_DOMAIN_NIC priority (instead of
the real priority the function created them at). This is done in order
to let multiple functions add broadcast/multicast rules without
affecting other functions, which is necessary for DPDK in SRIOV.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Detected these variables by building with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Simplify target core and target drivers by storing the task tag
a.k.a. command identifier inside struct se_cmd.
For several transports (e.g. SRP) tags are 64 bits wide.
Hence add support for 64-bit tags.
(Fix core_tmr_abort_task conversion spec warnings - nab)
(Fix up usb-gadget to use 16-bit tags - HCH + bart)
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: <qla2xxx-upstream@qlogic.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Now that struct se_portal_group contains a protocol identifier field we can
take all the code to format an parse protocol identifiers in CDBs into common
code instead of leaving this to low-level drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Now that we store the protocol identifier in the tpg structure we don't
need this method.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Remove the unneeded fabric_ptr argument, and change the type argument
to pass in a SPC protocol identifier.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
By always allocating and adding, respectively removing and freeing
the se_node_acl structure in core code we can remove tons of repeated
code in the init_nodeacl and drop_nodeacl routines. Additionally
this now respects the get_default_queue_depth method in this code
path as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
All fabric drivers except for iSCSI always return 1, so implement
that as default behavior.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
The first argument of these two functions is always identical
to se_cmd->se_sess. Hence remove the first argument.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: <qla2xxx-upstream@qlogic.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
We don't assign pi_ctx to desc->pi_ctx until we're certain to succeed
in the function. That means the cleanup path should use the local
pi_ctx variable, not desc->pi_ctx.
This was detected by Coverity (CID 1260062).
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This is the Ethernet part of the driver for the Mellanox ConnectX(R)-4
Single/Dual-Port Adapter supporting 100Gb/s with VPI. The driver
extends the existing mlx5 driver with Ethernet functionality.
This patch contains the driver entry points but does not include
transmit and receive (see the previous patch in the series) routines.
It also adds the option MLX5_CORE_EN to Kconfig to enable/disable the
Ethernet functionality. Currently, Kconfig is programmed to make
Ethernet and Infiniband functionality mutally exclusive.
Also changed MLX5_INFINIBAND to be depandant on MLX5_CORE instead of
selecting it, since MLX5_CORE could be selected without MLX5_INFINIBAND
being selected.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Query all supported types of dev caps on driver load.
- Store the Cap data outbox per cap type into driver private data.
- Introduce new Macros to access/dump stored caps (using the auto
generated data types).
- Obsolete SW representation of dev caps (no need for SW copy for each
cap).
- Modify IB driver to use new macros for checking caps.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As David Daney pointed in mlx4_core driver [1], mlx5_core is also
misusing the DMA-API.
This patch is removing the code that vmap() memory allocated by
dma_alloc_coherent().
After this patch, users of this drivers might fail allocating resources
on memory fragmeneted systems. This will be fixed later on.
[1] - https://patchwork.ozlabs.org/patch/458531/
CC: David Daney <david.daney@cavium.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most code already uses consts for the struct kernel_param_ops,
sweep the kernel for the last offending stragglers. Other than
include/linux/moduleparam.h and kernel/params.c all other changes
were generated with the following Coccinelle SmPL patch. Merge
conflicts between trees can be handled with Coccinelle.
In the future git could get Coccinelle merge support to deal with
patch --> fail --> grammar --> Coccinelle --> new patch conflicts
automatically for us on patches where the grammar is available and
the patch is of high confidence. Consider this a feature request.
Test compiled on x86_64 against:
* allnoconfig
* allmodconfig
* allyesconfig
@ const_found @
identifier ops;
@@
const struct kernel_param_ops ops = {
};
@ const_not_found depends on !const_found @
identifier ops;
@@
-struct kernel_param_ops ops = {
+const struct kernel_param_ops ops = {
};
Generated-by: Coccinelle SmPL
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Junio C Hamano <gitster@pobox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: cocci@systeme.lip6.fr
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
As part of enabling single ported VFs over IB ports we need to handle
some of the flows for generting EQ events for VFs which don't come
into play under Eth ports.
This mainly includes port management events derived from changes of the
phyiscal port (lid change, client re-register, down/up, etc), VF pkey table
changes and VF guid changes initiated by the IB driver.
(1) make sure that events are generated only for VFs sitting on
the relevant physical port (under the ALL_SLAVES flow).
(2) before generating the event, convert from physical (one or two)
to VF port (always equals one).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiplexling a MAD sent from VF, we should convert the port used
by the guest to send the packet to the actual physical port which will be
used to transmit the packet, before building the relevant address-handle (AH).
This is needed under VPI for single ported VFs, since the code that builds
the AH (mlx4_ib_query_ah()) makes decisions based on the input port. If we
use the port number provided by the guest, it might have different protocol
vs. the one this packat has to go from, and hence the result could be wrong.
So far, the conversion was done after the AH was built and it worked for
single ported Eth VFs which were not enabled under VPI. When adding support
for single ported IB VFs and VPI, we hit that.
Fixes: 449fc48866 ('net/mlx4: Adapt code for N-Port VF')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for using UD and AF_IB is currently broken. The
IB_CM_SIDR_REQ_RECEIVED message is not handled properly in
cma_save_net_info() and we end up falling into code that will try and
process the request as ipv4/ipv6, which will end up failing.
The resolution is to add a check for the SIDR_REQ and call
cma_save_ib_info() with a NULL path record. Change cma_save_ib_info()
to copy the src sib info from the listen_id when the path record is NULL.
Reported-by: Hari Shankar <Hari.Shankar@netapp.com>
Signed-off-by: Matt Finlay <matt@mellanox.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
After discussion upstream, it was agreed to transition the usage of iboe
in the kernel to roce. This keeps our terminology consistent with what
was finalized in the IBTA Annex 16 and IBTA Annex 17 publications.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Problem reported by: Ted Kim <ted.h.kim@oracle.com>:
We have a case where a Linux system and a non-Linux system are
trying to interoperate. The Linux host is the active side and
starts the connection establishment, but later decides to not go
through with the connection setup and does rdma_destroy_id().
The rdma_destroy_id() eventually works its way down to cm_destroy_id()
in core/cm.c, where a REJ is sent. The non-Linux system
has some trouble recognizing the REJ because of:
A. CM states which can't receive the REJ
B. Some issues about REJ formatting (missing comm ID)
ISSUE A: That part of the spec says, a Consumer Reject REJ can be
sent for a connection abort, but it goes further
and says: can send a REJ message with a "Consumer Reject"
Reason code if they are in a CM state (i.e. REP
Rcvd, MRA(REP) Sent, REQ Rcvd, MRA Sent) that allows
a REJ to be sent (lines 35-38).
Of the states listed there in that sentence, it would
seem to limit the active side to using the Consumer Reject
(for the abort case) in just the REP-Rcvd and MRA-REP-Sent
states. That is basically only after the active side
sees a REP (or alternatively goes down the state transitions
to timeout in which case a Timeout REJ is sent).
As a fix, in cm-destroy-id() move the IB-CM-MRA-REQ-RCVD case
to the same as REQ-SENT. Essentially, make a REJ sent after
getting an MRA on active side a timeout rather than Consumer-
Reject, which is arguably more correct with the CM state
diagrams previous to getting a REP.
Signed-off-by: Ted Kim <ted.h.kim@oracle.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
As of commit 5eb620c81c "IB/core: Add helpers for uncached GID and P_Key
searches"; pkey_tbl_len and gid_tbl_len are immutable data which are stored in
the ib_device.
The per port core capability flags to be added later are also immutable data to
be stored in the ib_device object.
In preparation for this create a structure for per port immutable data and
place the pkey and gid table lengths within this structure.
"get_port_immutable" is added as a mandatory device function to allow the
drivers to fill in this data.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The addition of the rdma_cap_ib_mad is technically broken in ib_umad_remove_one
because the loop "i" value is not a port value.
This bug resulted in the ib_umad failing to properly remove its resources when
the core capability functions were converted to bit fields.
NOTE: e17371d73908 did not result in broken behavior on its own. It was only
an issue when the implementation of rdma_cap_ib_mad was changed.
Pass the port value to rdma_cap_ib_mad.
Fixes: e17371d73908 ("IB/Verbs: Use management helper rdma_cap_ib_mad()")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use the new common rdma_[start|end]_port functions instead of using
local variables and figuring it out on the fly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The following functions only need read access to the data passed to them.
ib_mad_kernel_rmpp_agent
is_rmpp_data_mad
rcv_has_same_gid
ib_find_send_mad
Clarify with const specifiers
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rcv_has_same_class only needs access to the MAD header
specify WR and Receive WC as const
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
ib_response_mad only needs read access to the MAD header, not write access
to the entire mad struct, so replace struct ib_mad with const struct
ib_mad_hdr
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
validate_mad only needs read access to the MAD header, not write access
to the entire mad struct, so replace struct ib_mad with const struct
ib_mad_hdr
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some of us keep revisiting the code to decode enumerations that
appear in out logs. Let's borrow the nice logging helpers that
exists in xprtrdma and rds for CMA events, IB events and WC statuses.
Reviewd-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The SCSI standard defines 64-bit values for LUNs. Large arrays
employing large or hierarchical LUN numbers become more and more
common. So update the SRP initiator to use 64-bit LUN numbers.
See also Hannes Reinecke, commit 9cb78c16f5 ("scsi: use 64-bit LUNs"),
June 2014.
The largest LUN number that has been tested is 0xd2003fff00000000.
Checked the following structure sizes with gdb:
* sizeof(struct srp_cmd) = 48
* sizeof(struct srp_tsk_mgmt) = 48
* sizeof(struct srp_aer_req) = 36
The ibmvscsi changes have been compile tested only (on a PPC system).
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove the !ch->target tests from the reconnect code. These
tests are not needed: upon entry of srp_rport_reconnect()
it is guaranteed that all ch->target pointers are non-NULL.
None of the functions srp_new_cm_id(), srp_finish_req(),
srp_create_ch_ib() nor srp_connect_ch() modifies this pointer.
srp_free_ch_ib() is never called concurrently with
srp_rport_reconnect().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The function srp_free_req_data() does not use ch->target.
Hence remove the ch->target != NULL check.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move the module version and release date into separate fields.
This makes the modinfo output easier to read.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A long time ago the data type int64_t was declared as long long
on x86 systems and as long on PPC systems. Today that data type
is declared as long long on all Linux architectures. This means
that the casts from uint64_t into unsigned long long are
superfluous. Remove these superfluous casts.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Although it is possible to let SRP I/O continue if a reconnect
results in a reduction of the number of channels, the current
code does not handle this scenario correctly. Instead of making
the reconnect code more complex, consider this as a reconnection
failure.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: <stable@vger.kernel.org> #v3.19
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reception of a DREQ message only causes the state of a single
channel to change. Hence move the 'connected' member variable
from the target to the channel data structure. This patch
avoids that following false positive warning can be reported
by srp_destroy_qp():
WARNING: at drivers/infiniband/ulp/srp/ib_srp.c:617 srp_destroy_qp+0xa6/0x120 [ib_srp]()
Call Trace:
[<ffffffff8106e10f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8106e16a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa0440226>] srp_destroy_qp+0xa6/0x120 [ib_srp]
[<ffffffffa0440322>] srp_free_ch_ib+0x82/0x1e0 [ib_srp]
[<ffffffffa044408b>] srp_create_target+0x7ab/0x998 [ib_srp]
[<ffffffff81346f60>] dev_attr_store+0x20/0x30
[<ffffffff811dd90f>] sysfs_write_file+0xef/0x170
[<ffffffff8116d248>] vfs_write+0xc8/0x190
[<ffffffff8116d411>] sys_write+0x51/0x90
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: <stable@vger.kernel.org> #v3.19
Signed-off-by: Doug Ledford <dledford@redhat.com>
Avoid that receiving a DREQ while RDMA channels are being
established causes target->qp_in_error to be reset.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: <stable@vger.kernel.org> #v3.19
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix a scsi_get_host() / scsi_host_put() imbalance in the error
path of srp_create_target(). See also patch "IB/srp: Avoid that
I/O hangs due to a cable pull during LUN scanning" (commit ID
34aa654ecb).
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Cc: <stable@vger.kernel.org> #v3.19
Signed-off-by: Doug Ledford <dledford@redhat.com>
Return values of 0 do not make sense for functions which return enum
smi_action
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
is_rmpp_data_mad is more descriptive for this function.
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously start_port and end_port were defined in 2 places, cache.c and
device.c and this prevented their use in other modules.
Make these common functions, change the name to reflect the rdma
name space, and update existing users.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_eth_ah() to help us check if the port of an
IB device support Ethernet Address Handler.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_af_ib() to help us check if the port of an
IB device support Native Infiniband Address.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_mcast() to help us check if the port of an
IB device support Infiniband Multicast.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_sa() to help us check if the port of an
IB device support Infiniband Subnet Administration.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_iw_cm() to help us check if the port of an
IB device support IWARP Communication Manager.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_cm() to help us check if the port of an
IB device support Infiniband Communication Manager.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_smi() to help us check if the port of an
IB device support Infiniband Subnet Management Interface.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Introduce helper rdma_cap_ib_mad() to help us check if the port of an
IB device support Infiniband Management Datagrams.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform rest part in IB-core cma.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reform cma_acquire_dev() with management helpers, introduce
cma_validate_port() to make the code more clean.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform mcast related part in IB-core cma.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform route related part in IB-core cma.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform cm related part in IB-core cma/ucm.
Few checks focus on the device cm type rather than the port capability,
directly pass port 1 works currently, but can't support mixing cm type
device in future.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core verbs
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-ulp ipoib.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core multicast.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core sa_query.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core cm.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use raw management helpers to reform IB-core mad/agent/user_mad.
Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
HW currently restricts the IB MTU range between 512 and 4096.
Fail connection for MTUs lesser than 512.
Signed-off-by: Naga Irrinki <naga.irrinki@avagotech.com>
Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rdma_addr_find_dmac_by_grh fails to resolve dmac for link local address.
Use rdma_get_ll_mac to resolve the link local address.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@avagotech.com>
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@avagotech.com>
Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If DPP PDs are not supported by the FW, allocate only normal PDs.
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@avagotech.com>
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@avagotech.com>
Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If the adapter ports are in PFC mode and VLAN is not configured,
use vlan tag 0 for RoCE traffic. Also, log an advisory message
in system logs.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@avagotech.com>
Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Don't move QP to error state, if QP is in reset state during QP
destroy operation.
Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Changing the destroy sequence of mailbox queue and event queues.
FW expects mailbox queue to be destroyed before desroying the EQs.
Signed-off-by: Selvin Xavier <selvin.xavier@avagotech.com>
Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
These KERN_<LEVEL> uses are unnecessary with pr_<level> and cause
bad logging output so remove them.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Commit d4988623cc ("IB/qib: use arch_phys_wc_add()")
adjusted mtrr inititialization to use the new interface.
Unfortunately, the new interface returns a signed
value and the patch tested the unsigned wc_cookie.
Fix the issue by changing the type of wc_cookie to int. For
the success case the ret left at zero to avoid
a warning from the caller. For failure wc_cookie
is used as the ret.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The string iwpm_ulib_name is recorded in a nlmsg as a netlink attribute.
Without this fix parsing of the nlmsg by the userspace port mapper service fails
because of unknown attribute length, causing the port mapper service not to
register the client, which has sent the nlmsg.
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Cc: <stable@vger.kernel.org> #v3.16
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For listening endpoints bound to the wildcard address, we need to pass
the wildcard address mapping to iwpm_get_remote_info() instead of the
mapped address of the new child connection.
Without this fix, and with iwarp port mapping enabled, each iw_cxgb4
connection that is spawned from a listening endpoint bound to the wildcard
address, will generate an annoying dmesg entry about failing to find
the remote address mapping info, and the connection state displayed in
debugfs under /sys/kernel/debug/iw_cxgb4/<pci-slot-no>/eps will not have
the peer's address/port mapping info. The connection still works though.
Fixes: 5b6b8fe ("RDMA/cxgb4: Report the actual address of the remote connecting peer")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Using an element of a struct as the address for the memcpy of the whole
struct may introduce a buffer overflow and does not help readability either
simply pass the real thing as first argument to memcpy.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
See also patch "IPoIB/cm: Add connected mode support for devices
without SRQs" (commit ID 68e995a295). Detected by smatch.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove these log messages in favor of per-endpoint counters as well as
device-global counters that can be inspected via debugfs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Addresses the following kernel logs seen during boot of sparc systems:
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Kernel unaligned access at TPC[103bce50] cm_find_listen+0x34/0xf8 [ib_cm]
Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This driver already makes use of ioremap_wc() on PIO buffers,
so convert it to use arch_phys_wc_add().
The qib driver uses a mmap() special case for when PAT is
not used, this behaviour used to be determined with a
module parameter but since we have been asked to just
remove that module parameter this checks for the WC cookie,
if not set we can assume PAT was used. If its set we do
what we used to do for the mmap for when MTRR was enabled.
The removal of the module parameter is OK given that Andy
notes that even if users of module parameter are still around
it will not prevent loading of the module on recent kernels.
Cc: Doug Ledford <dledford@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: konrad.wilk@oracle.com
Cc: ville.syrjala@linux.intel.com
Cc: david.vrabel@citrix.com
Cc: jbeulich@suse.com
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: infinipath@intel.com
Cc: linux-rdma@vger.kernel.org
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xensource.com
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is no good reason not to, we eventually delete it as well.
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Mike Marciniszyn <infinipath@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While unmapping an ODP writable page, the dirty bit of the page is set. In
order to do so, the head of the compound page is found.
Currently, the compound head is found even on non-writable pages, where it is
never used, leading to unnecessary cpu barrier that impacts performance.
This patch moves the search for the compound head to be done only when needed.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Acked-by: Shachar Raindel <raindel@mellanox.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, while mapping or unmapping pages for ODP, the umem mutex is locked
and unlocked once for each page. Such lock/unlock operation take few tens to
hundreds of nsecs. This makes a significant impact when mapping or unmapping few
MBs of memory.
To avoid this, the mutex should be locked only once per operation, and not per
page.
Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Acked-by: Shachar Raindel <raindel@mellanox.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Get the actual (non-mapped) ip/tcp address of the connecting peer from
the port mapper
Also setup the passive side endpoint to correctly display the actual
and mapped addresses for the new connection.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Get the actual (non-mapped) ip/tcp address of the connecting peer from
the port mapper and report the address info to the user space application
at the time of connection establishment
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add functionality to enable the port mapper on the passive side to provide to its
clients the actual (non-mapped) ip/tcp address information of the connecting peer
1) Adding remote_info_cb() to process the address info of the connecting peer
The address info is provided by the user space port mapper service when
the connection is initiated by the peer
2) Adding a hash list to store the remote address info
3) Adding functionality to add/remove the remote address info
After the info has been provided to the port mapper client,
it is removed from the hash list
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently the iw_cxgb4 implementation requires the qp and cq qid densities
to match as well as the qp and cq id ranges. So fail a device open if
the device configuration doesn't meet the requirements.
The reason for these restictions has to do with the fact that IQ qid X
has a UGTS register in the same bar2 page as EQ qid X. Thus both qids
need to be allocated to the same user process for security reasons.
The logic that does this (the qpid allocator in iw_cxgb4/resource.c)
handles this but requires the above restrictions.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For T5, we must not use the kdb/kgts registers, in order avoid db drops
under extreme loads.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
- get_dma_mr() was using ~0UL which is should be ~0ULL. This causes the
DMA MR to get setup incorrectly in hardware.
- wr_log_show() needed a 64b divide function div64_u64() instead of
doing
division directly.
- fixed warnings about recasting a pointer to a u64
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When accepting a new IPv4 connect to an IPv6 socket, the CMA tries to
canonize the address family to IPv4, but does not properly process
the listening sockaddr to get the listening port, and does not properly
set the address family of the canonized sockaddr.
Fixes: e51060f08a ("IB: IP address based RDMA connection manager")
Cc: <stable@vger.kernel.org>
Reported-By: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
Pull SCSI target updates from Nicholas Bellinger:
"Lots of activity in target land the last months.
The highlights include:
- Convert fabric drivers tree-wide to target_register_template() (hch
+ bart)
- iser-target hardening fixes + v1.0 improvements (sagi)
- Convert iscsi_thread_set usage to kthread.h + kill
iscsi_target_tq.c (sagi + nab)
- Add support for T10-PI WRITE_STRIP + READ_INSERT operation (mkp +
sagi + nab)
- DIF fixes for CONFIG_DEBUG_SG=y + UNMAP file emulation (akinobu +
sagi + mkp)
- Extended TCMU ABI v2 for future BIDI + DIF support (andy + ilias)
- Fix COMPARE_AND_WRITE handling for NO_ALLLOC drivers (hch + nab)
Thanks to everyone who contributed this round with new features,
bug-reports, fixes, cleanups and improvements.
Looking forward, it's currently shaping up to be a busy v4.2 as well"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (69 commits)
target: Put TCMU under a new config option
target: Version 2 of TCMU ABI
target: fix tcm_mod_builder.py
target/file: Fix UNMAP with DIF protection support
target/file: Fix SG table for prot_buf initialization
target/file: Fix BUG() when CONFIG_DEBUG_SG=y and DIF protection enabled
target: Make core_tmr_abort_task() skip TMFs
target/sbc: Update sbc_dif_generate pr_debug output
target/sbc: Make internal DIF emulation honor ->prot_checks
target/sbc: Return INVALID_CDB_FIELD if DIF + sess_prot_type disabled
target: Ensure sess_prot_type is saved across session restart
target/rd: Don't pass incomplete scatterlist entries to sbc_dif_verify_*
target: Remove the unused flag SCF_ACK_KREF
target: Fix two sparse warnings
target: Fix COMPARE_AND_WRITE with SG_TO_MEM_NOALLOC handling
target: simplify the target template registration API
target: simplify target_xcopy_init_pt_lun
target: remove the unused SCF_CMD_XCOPY_PASSTHROUGH flag
target/rd: reduce code duplication in rd_execute_rw()
tcm_loop: fixup tpgt string to integer conversion
...
Pull networking fixes from David Miller:
1) Fix verifier memory corruption and other bugs in BPF layer, from
Alexei Starovoitov.
2) Add a conservative fix for doing BPF properly in the BPF classifier
of the packet scheduler on ingress. Also from Alexei.
3) The SKB scrubber should not clear out the packet MARK and security
label, from Herbert Xu.
4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.
5) Pause handling is not correct in the stmmac driver because it
doesn't take into consideration the RX and TX fifo sizes. From
Vince Bridgers.
6) Failure path missing unlock in FOU driver, from Wang Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
net: dsa: use DEVICE_ATTR_RW to declare temp1_max
netns: remove BUG_ONs from net_generic()
IB/ipoib: Fix ndo_get_iflink
sfc: Fix memcpy() with const destination compiler warning.
altera tse: Fix network-delays and -retransmissions after high throughput.
net: remove unused 'dev' argument from netif_needs_gso()
act_mirred: Fix bogus header when redirecting from VLAN
inet_diag: fix access to tcp cc information
tcp: tcp_get_info() should fetch socket fields once
net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
skbuff: Do not scrub skb mark within the same name space
Revert "net: Reset secmark when scrubbing packet"
bpf: fix two bugs in verification logic when accessing 'ctx' pointer
bpf: fix bpf helpers to use skb->mac_header relative offsets
stmmac: Configure Flow Control to work correctly based on rxfifo size
stmmac: Enable unicast pause frame detect in GMAC Register 6
stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
stmmac: Add defines and documentation for enabling flow control
stmmac: Add properties for transmit and receive fifo sizes
stmmac: fix oops on rmmod after assigning ip addr
...
Currently, iflink of the parent interface was always accessed, even
when interface didn't have a parent and hence we crashed there.
Handle the interface types properly: for a child interface, return
the ifindex of the parent, for parent interface, return its ifindex.
For child devices, make sure to set the parent pointer prior to
invoking register_netdevice(), this allows the new ndo to be called
by the stack immediately after the child device is registered.
Fixes: 5aa7add8f1 ('infiniband/ipoib: implement ndo_get_iflink')
Reported-by: Honggang Li <honli@redhat.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Honggang Li <honli@redhat.com>
Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>+
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
set_filter_wr is requesting __GFP_NOFAIL allocation although it can return
ENOMEM without any problems obviously (t4_l2t_set_switching does that
already). So the non-failing requirement is too strong without any
obvious reason. Drop __GFP_NOFAIL and reorganize the code to have the
failure paths easier.
The same applies to _c4iw_write_mem_dma_aligned which uses __GFP_NOFAIL
and then checks the return value and returns -ENOMEM on failure. This
doesn't make any sense what so ever. Either the allocation cannot fail or
it can.
del_filter_wr seems to be safe as well because the filter entry is not
marked as pending and the return value is propagated up the stack up to
c4iw_destroy_listen.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some rare cases, IO operations may be not aligned to page
boundaries. This prevents iser from performing fast memory
registration. In order to overcome that iser uses a bounce
buffer to carry the transaction. We basically allocate a buffer
in the size of the transaction and perform a copy.
The buffer allocation using kmalloc is too restrictive since it
requires higher order (atomic) allocations for large transactions
(which may result in memory exhaustion fairly fast for some workloads).
We rewrite the bounce buffer code path to allocate scattered pages
and perform a copy between the transaction sg and the bounce sg.
Reported-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In singleton scatterlists, DMA memory registration code
is taken both for Fastreg and FMR code paths. Move it to
a function.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of passing ib_sge as output variable, we pass the mem_reg
pointer to have the routines fill the rkey as well. This reduces
code duplication and extra assignments. This is a preparation step
to unify some registration logics together. Also, pass iser_fast_reg_mr
the fastreg descriptor directly.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need to keep lkey, va, len variables, we can keep
them as struct ib_sge. This will help when we change the
memory registration logic.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Memory regions are resources that are saved
in the device caches. Increase the probability for
a cache hit by adding the MRU descriptor to pool
head.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Make iser_[create|destroy]_fastreg_desc shorter, more
readable and easily extendable.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of open-coding connection fastreg pool get/put,
we introduce iser_reg_desc[get|put] helpers.
We aren't setting these static as this will be a per-device
routine later on. Also, cleanup iser_unreg_rdma_mem_fastreg
a bit.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need for these two separate. Keep it in a single routine
like in the fastreg case. This will also make iser_reg_page_vec
closer to iser_fast_reg_mr arguments. This is a preparation
step for registration flow refactor.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This struct members other than struct iser_mem_reg are unused,
so remove it altogether.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Buffer length was assigned twice, and no reason to set va to
io_addr and then add the offset, just set va to io_addr + offset.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As memory registration/de-registration methods, lets
move them to their natural location. While we're at it,
make iser_reg_page_vec routine static.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need to pass that, we can take it from the task.
In a later stage, this function will be invoked
according to a device capability.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need to keep two iser_data_buf structures just in case we use
mem copy. We can avoid that just by adding a pointer to the original
sg. So keep only two iser_data_buf per command (data and protection)
and pass the relevant data_buf to bounce buffer routine.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This code was added before we had protection data length
calculation (in iser_send_command), so we needed to calc
the sg data length from the sg itself. This is not needed
anymore.
This patch does not change any functionality.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Adir Lev <adirl@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This length miss-calculation may cause a silent data corruption
in the DIX case and cause the device to reference unmapped area.
Fixes: d77e65350f ('libiscsi, iser: Adjust data_length to include protection information')
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fast registration and local invalidate work requests can
also fail. We should call error completion handler for them.
Reported-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In case the user unloaded ib_iser while ep_connect is in
progress, we need to destroy the endpoint although ep_disconnect
wasn't invoked (we detect this by the iser conn state != DOWN).
However, if we got an REJECTED/UNREACHABLE CM event we move the
connection state to DOWN which will prevent us from destroying
the endpoint in the module unload stage. Fix this by setting the
connection state to TERMINATING in iser_conn_error so we can still
destroy the endpoint at unload stage.
Reported-by: Ariel Nahum <arieln@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The driver already defined the pr_format, it just hadn't
been converted to use pr_info, pr_warn, and pr_err instead
of the equivalent printks. Convert so that messages from
the driver are now properly tagged with their driver name
and can be more easily debugged.
In addition, a number of these printk's were not newline
terminated, so fix that at the same time.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This change slightly reduces the time needed to log in.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: David Dillow <dave@thedillows.org>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since ib_dma_map_single can fail use ib_dma_mapping_error to check
for errors.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Hello,
When an application using XRCs abruptly terminates, the mmaped pages
of the CQ buffers are leaked.
This comes from the fact that when resources are released in
ib_uverbs_cleanup_ucontext(), we fail to release the CQs because their
refcount is not 0.
When creating an XRC SRQ, we increment the associated CQ refcount.
This refcount is only decremented when the SRQ is released.
Therefore we need to release the SRQs prior to the CQs to make sure
that all references to the CQs are gone before trying to release these.
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The current code decreases from the mss size (which is the gso_size
from the kernel skb) the size of the packet headers.
It shouldn't do that because the mss that comes from the stack
(e.g IPoIB) includes only the tcp payload without the headers.
The result is indication to the HW that each packet that the HW sends
is smaller than what it could be, and too many packets will be sent
for big messages.
An easy way to demonstrate one more aspect of the problem is by
configuring the ipoib mtu to be less than 2*hlen (2*56) and then
run app sending big TCP messages. This will tell the HW to send packets
with giant (negative value which under unsigned arithmetics becomes
a huge positive one) length and the QP moves to SQE state.
Fixes: b832be1e40 ('IB/mlx4: Add IPoIB LSO support')
Reported-by: Matthew Finlay <matt@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
After Doug Ledford's changes there is no need in that bit, it's
semantic becomes subset of the IPOIB_FLAG_OPER_UP bit.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Whenever there is no path->ah to the destination, keep only defined
number of skb's. Otherwise there are cases that the driver can keep
infinite list of skb's.
For example, when one device want to send unicast arp to the destination,
and from some reason the SM doesn't respond, the driver currently keeps
all the skb's. If that unicast arp traffic stopped, all these skb's
are kept by the path object till the interface is down.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As the result of a completion error the QP can moved to SQE state by
the hardware. Since it's not the Error state, there are no flushes
and hence the driver doesn't know about that.
The fix creates a task that after completion with error which is not a
flush tracks the QP state and if it is in SQE state moves it back to RTS.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update the cached broadcast record in the priv object after every new
join of this broadcast domain group.
These values are needed for the port configuration (MTU size) and to
all the new multicast (non-broadcast) join requests initial parameters.
For example, SM starts with 2K MTU for all the fabric, and after that it
restarts (or handover to new SM) with new port configuration of 4K MTU.
Without using the new values, the driver will keep its old configuration
of 2K and will not apply the new configuration of 4K.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The current code in the RX flow uses two sg entries for each incoming
packet, the first one was for the IB headers and the second for the rest
of the data, that causes two dma map/unmap and two allocations, and few
more actions that were done at the data path.
Use only one linear skb on each incoming packet, for the data (IB
headers and payload), that reduces the packet processing in the
data-path (only one skb, no frags, the first frag was not used anyway,
less memory allocations) and the dma handling (only one dma map/unmap
over each incoming packet instead of two map/unmap per each incoming packet).
After commit 73d3fe6d1c ("gro: fix aggregation for skb using frag_list") from
Eric Dumazet, we will get full aggregation for large packets.
When running bandwidth tests before and after the (over the card's numa node),
using "netperf -H 1.1.1.3 -T -t TCP_STREAM", the results before are ~12Gbs before
and after ~16Gbs on my setup (Mellanox's ConnectX3).
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We needed the mcast_mutex when we had to prevent the join completion
callback from having the value it stored in mcast->mc overwritten
by a delayed return from ib_sa_join_multicast. By storing the return
of ib_sa_join_multicast in an intermediate variable, we prevent a
delayed return from ib_sa_join_multicast overwriting the valid
contents of mcast->mc, and we no longer need a mutex to force the
join callback to run after the return of ib_sa_join_multicast. This
allows us to do away with the mutex entirely and protect our critical
sections with a just a spinlock instead. This is highly desirable
as there were some places where we couldn't use a mutex because the
code was not allowed to sleep, and so we were currently using a mix
of mutex and spinlock to protect what we needed to protect. Now we
only have a spin lock and the locking complexity is greatly reduced.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Allow the ipoib layer to attempt to join all outstanding multicast
groups at once. The ib_sa layer will serialize multiple attempts to
join the same group, but will process attempts to join different groups
in parallel. Take advantage of that.
In order to make this happen, change the mcast_join_thread to loop
through all needed joins, sending a join request for each one that we
still need to join. There are a few special cases we handle though:
1) Don't attempt to join anything but the broadcast group until the join
of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling.
If we completed successfully, we are done. The join task now needs kicked
either by mcast_send or mcast_restart_task or mcast_start_thread, but
should not need started anytime else except when scheduling a backoff
attempt to rejoin.
3) No longer use separate join/completion routines for regular and
sendonly joins, pass them all through the same routine and just do the
right thing based on the SENDONLY join flag.
4) Only try to join a SENDONLY join twice, then drop the packets and
quit trying. We leave the mcast group in the list so that if we get a
new packet, all that we have to do is queue up the packet and restart
the join task and it will automatically try to join twice and then
either send or flush the queue again.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used. We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it. And when we did
complete it, sometimes we continued to touch the mcast entry after
the completion, opening us up to possible use after free issues.
This made it less than totally effective, and certainly made its use
confusing. And in the flush function we would use the presence of this
flag to signal that we should wait on the completion struct, but we never
cleared this flag, ever.
In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.
1) Remove the MCAST_JOIN_STARTED flag entirely
2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test mcast->mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear
the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also
take the mutex in our join callback. This forces
ib_sa_multicast_join to return and set mcast->mc before we process
the callback. This way, our callback can safely clear mcast->mc
if there is an error on the join and we will do the right thing as
a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no
longer call mcast_sendonly_join directly from mcast_send and
instead must add sendonly join processing to the mcast_join_task
8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
we have a running task. We know when we need to reschedule our
join task thread and don't need a flag to tell us.
9) Add a helper for rescheduling the join task thread
A number of different races are resolved with these changes. These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.
One race looks something like this:
Thread 1 Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
alloc member
call callback
ifconfig ib0 down
wait_for_completion
callback call completes
wait_for_completion in
mcast_dev_flush completes
mcast->mc is PTR_ERR_OR_NULL
so we skip ib_sa_leave_multicast
return from callback
return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast
We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c
Another like this:
Thread 1 Thread 2 Thread 3
ib_sa_multicast_join
ifconfig ib0 down
priv->broadcast = NULL
join_complete
wait_for_completion
mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
complete
return -EAGAIN (making mcast->mc invalid)
call ib_sa_multicast_leave
on invalid mcast->mc, hang
forever
By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback. This allows us to clear mcast->mc if there is an
error and the join is going to fail. We do this before we complete
the mcast. In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Various places in the IPoIB code had a deadlock related to flushing
the ipoib workqueue. Now that we have per device workqueues and a
specific flush workqueue, there is no longer a deadlock issue with
flushing the device specific workqueues and we can do so unilaterally.
Signed-off-by: Doug Ledford <dledford@redhat.com>
During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up. It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue means
that we would also have to prevent all races between different devices
(for instance, ipoib_mcast_restart_task on interface ib0 can race with
ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
the rtnl_lock).
There are several possible solutions to this problem:
Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail. This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.
Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail. Again, this can
drop work on the floor.
Or the method this patch attempts to use, which is when we bring an
interface up, create a workqueue specifically for that interface, so
that when we take it back down, we are flushing only those tasks
associated with our interface. In addition, keep the global
workqueue, but now limit it to only flush tasks. In this way, the
flush tasks can always flush the device specific work queues without
having deadlock issues.
Signed-off-by: Doug Ledford <dledford@redhat.com>
We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface. Unfortunately, that's not
the case. In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task,
the flush_workqueue waits on this thread too. But this thread is
deadlocked on the rtnl lock. The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_OPER_UP has been cleared, in which case we don't need to do
anything anyway and we just return.
While investigating which flag should be used, FLAG_ADMIN_UP or
FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more
appropriate flag to use. However, there was a mix of these two flags in
use in the existing code. So while we check for that flag here as part
of this race fix, also cleanup the two places that had used the less
appropriate flag for their tests.
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and
needs to keep it held. It also needs to call flush_workqueue() to flush
out any outstanding work. In the past, we've had to try and make sure
that we didn't flush out any outstanding join completions because they
also wanted to grab rtnl_lock() and that would deadlock. It turns out
that the only thing in the join completion handler that needs this lock
can be safely moved to our carrier_on_task, thereby reducing the
potential for the join completion code and the flush code to deadlock
against each other.
Signed-off-by: Doug Ledford <dledford@redhat.com>
In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly
still on it.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.
Because neighbors and mcast entries can each have a reference on any
given ah, we must make sure to free all of those first before our ah
will actually have a 0 refcount and be able to be reaped.
This factoring is needed in preparation for having per-device work
queues. The original per-device workqueue code resulted in the following
error message:
<ibdev>: ib_dealloc_pd failed
That error was tracked down to this issue. With the changes to which
workqueues were flushed when, there were no flushes of the per device
workqueue after the last ah's were freed, resulting in an attempt to
dealloc the pd with outstanding resources still allocated. This code
puts the explicit flushes in the needed places to avoid that problem.
Signed-off-by: Doug Ledford <dledford@redhat.com>
In a call to ib_umem_get(), if address is 0x0 and size is
already page aligned, check added in commit 8494057ab5
("IB/uverbs: Prevent integer overflow in ib_umem_get address
arithmetic") will refuse to register a memory region that
could otherwise be valid (provided vm.mmap_min_addr sysctl
and mmap_low_allowed SELinux knobs allow userspace to map
something at address 0x0).
This patch allows back such registration: ib_umem_get()
should probably don't care of the base address provided it
can be pinned with get_user_pages().
There's two possible overflows, in (addr + size) and in
PAGE_ALIGN(addr + size), this patch keep ensuring none
of them happen while allowing to pin memory at address
0x0. Anyway, the case of size equal 0 is no more (partially)
handled as 0-length memory region are disallowed by an
earlier check.
Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com
Cc: <stable@vger.kernel.org> # 8494057ab5 ("IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic")
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Jack Morgenstein <jackm@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If ib_umem_get() is called with a size equal to 0 and an
non-page aligned address, one page will be pinned and a
0-sized umem will be returned to the caller.
This should not be allowed: it's not expected for a memory
region to have a size equal to 0.
This patch adds a check to explicitly refuse to register
a 0-sized region.
Link: http://mid.gmane.org/cover.1428929103.git.ydroneaud@opteya.com
Cc: <stable@vger.kernel.org>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Jack Morgenstein <jackm@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change the default mode to be HOST assigned instead of SM assigned. This is
the expected operational mode, because it doesn't depend on SM availability.
As PF generates random GUIDs as the initial admin values, this gives
out of the box experience.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Request GIDs from the SM on demand, i.e., when a VF actually needs them,
and release them when the GIDs are no longer in use.
In cloud environments, this is useful for GID migrations, in which a
GID is assigned to a VF on the destination HCA, while the VF on the
source HCA is shutdown (but the GID was not administratively released).
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change the init flow to ask GUIDs only for active VFs. This is done for
both SM & HOST modes so that there is no need any more to maintain the
ownership record type.
In case SM mode is used, the initial value will be 0, ask the SM to assign,
for the HOST mode the initial value will be the HOST generated GUID.
This will enable out of the box experience for both probed and attached VFs.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Set the admin alias GUID per the administrator's request via the sysfs
mechanism into the core layer.
The "get" request returns the current value. However, if the administrator
requests the SM to assign a new value by requesting 0, the SM assigned
GUID is returned.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If the SM rejects an alias GUID request the PF driver keeps trying to acquire
the specified GUID indefinitely, utilizing an exponential backoff scheme.
Retrying is managed per GUID entry. Each entry that wasn't applied holds its
next retry information. Retry requests to the SM consist of records of 8
consecutive GUIDS. Each record that contains GUIDs requiring retries holds its
next time-to-run based on the retry information of all its GUID entries. The
record having the lowest retry time will run first when that retry time
arrives.
Since the method (SET or DELETE) as sent to the SM applies to all the GUIDs in
the record, we must handle SET requests and DELETE requests in separate SM
messages (one for SETs and the other for DELETEs).
To avoid race conditions where a GUID entry request (set or delete) was
modified after the SM request was sent, we save the method and the requested
indices as part of the callback's context -- thus, only the requested indexes
are evaluated when the response is received.
When an GUID entry is approved we turn off its retry-required bit, this
prevents redundant SM retries from occurring on that record.
The port down event should be sent only when previously it was up. Likewise,
the port up event should be sent only if previously the port was down.
Synchronization was added around the flows that change entries and record state
to prevent race conditions.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of calling target_fabric_configfs_init() +
target_fabric_configfs_register() / target_fabric_configfs_deregister()
target_fabric_configfs_free() from every target driver, rewrite the API
so that we have simple register/unregister functions that operate on
a const operations vector.
This patch also fixes a memory leak in several target drivers. Several
target drivers namely called target_fabric_configfs_deregister()
without calling target_fabric_configfs_free().
A large part of this patch is based on earlier changes from
Bart Van Assche <bart.vanassche@sandisk.com>.
(v2: Add a new TF_CIT_SETUP_DRV macro so that the core configfs code
can declare attributes as either core only or for drivers)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Things Not To Do When Writing A Driver, part 1001st:
have writev() and write() on the same file doing completely
different things. As in, "interpret very different sets of
commands".
We _can_ handle that, but it's a bloody bad idea.
Don't do that in new drivers. Ever.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These variables are always accessed via struct isert_conn so
no need to have a "conn_" prefix for them.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
iser target can handle as many connect request as
the fabric sends to it. This backlog should not set as
a back-pressure mechanism (which is not very useful).
isert does need a back-pressure mechanism, but it should
be added in isert by monitoring the number of pending
established connections (will be added in a later stage).
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In iser_connect_release there is no chance that
the iser device is set to NULL, if this happens
we have a BUG. So use BUG_ON.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Not sure what it was used for, but there is
no real need for it now as I see it. Go ahead
and get rid of it.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Move login buffer alloc/free code to dedicated
routines and introduce isert_conn_init which
initializes the connection lists and locks.
Simplifies and cleans up the code a little bit.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
isert_device_find_by_ib_dev and isert_device_try_release
can have a better, more common name like isert_device_[get|put].
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
No need to keep a local ib_dev as a device pointer.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Move the code for completion context handling to dedicated
routines. This simplifies the code and removes code duplication.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Simplify iser QP creation by splitting some unrelated
logic bulks to routines.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This is to favor the HCA cache hit rate using less MRs
and PDs. This commit partially reverts commit:
"eb6ab13 IB/isert: separate connection protection domains and dma MRs"
At the time I thought this would be needed.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Before we reach to connection established we may get an
error event. In this case the core won't teardown this
connection (never established it), so we take care of freeing
it ourselves.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This hang was a result of a missing command put when
a DIF error occurred during a rdma read (and we sent
an CHECK_CONDITION error without passing it to the
backend).
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: <stable@vger.kernel.org> # v3.14+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/cmd.c
net/core/fib_rules.c
net/ipv4/fib_frontend.c
The fib_rules.c and fib_frontend.c conflicts were locking adjustments
in 'net' overlapping addition and removal of code in 'net-next'.
The mlx4 conflict was a bug fix in 'net' happening in the same
place a constant was being replaced with a more suitable macro.
Signed-off-by: David S. Miller <davem@davemloft.net>
Preparation for ethernet driver.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass consumer index as a parameter to arm CQ
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preparation for ethernet driver.
These functions will be used in drivers other than mlx5_ib.
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do it in one place instead of every where the function is invoked
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass 0 in the sqd_event parameter.
Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The calls to SET_PORT used hard-code numbers, when supplying command's
opcode modifiers, fix that to use well defined constants.
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't use dev->iflink anymore.
CC: Roland Dreier <roland@kernel.org>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Properly verify that the resulting page aligned end address is larger
than both the start address and the length of the memory area requested.
Both the start and length arguments for ib_umem_get are controlled by
the user. A misbehaving user can provide values which will cause an
integer overflow when calculating the page aligned end address.
This overflow can cause also miscalculation of the number of pages
mapped, and additional logic issues.
Addresses: CVE-2014-8159
Cc: <stable@vger.kernel.org>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For RoCE ports, we set the u32 PMA values based on u64 HCA counters. In case of
overflow, according to the IB spec, we have to saturate a counter to its
max value, do that.
Fixes: c37791349c ('IB/mlx4: Support PMA counters for IBoE')
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Processing an event is done in a different context from the one when
the event was dispatched. This requires a check that the slave
net device is still valid when the event is being processed. The check is done
under the iboe lock which ensure correctness.
Fixes: a575009030 ('IB/mlx4: Add port aggregation support')
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more vfs updates from Al Viro:
"Assorted stuff from this cycle. The big ones here are multilayer
overlayfs from Miklos and beginning of sorting ->d_inode accesses out
from David"
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
procfs: fix race between symlink removals and traversals
debugfs: leave freeing a symlink body until inode eviction
Documentation/filesystems/Locking: ->get_sb() is long gone
trylock_super(): replacement for grab_super_passive()
fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
SELinux: Use d_is_positive() rather than testing dentry->d_inode
Smack: Use d_is_positive() rather than testing dentry->d_inode
TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
VFS: Split DCACHE_FILE_TYPE into regular and special types
VFS: Add a fallthrough flag for marking virtual dentries
VFS: Add a whiteout dentry type
VFS: Introduce inode-getting helpers for layered/unioned fs environments
Infiniband: Fix potential NULL d_inode dereference
posix_acl: fix reference leaks in posix_acl_create
autofs4: Wrong format for printing dentry
...
Pull SCSI target updates from Nicholas Bellinger:
"The highlights this round include:
- Update vhost-scsi to support F_ANY_LAYOUT using mm/iov_iter.c
logic, and signal VERSION_1 support (MST + Viro + nab)
- Fix iscsi/iser-target to remove problematic active_ts_set usage
(Gavin Guo)
- Update iscsi/iser-target to support multi-sequence sendtargets
(Sagi)
- Fix original PR_APTPL_BUF_LEN 8k size limitation (Martin Svec)
- Add missing WRITE_SAME end-of-device sanity check (Bart)
- Check for LBA + sectors wrap-around in sbc_parse_cdb() (nab)
- Other various minor SPC/SBC compliance fixes based upon Ronnie
Sahlberg test suite (nab)"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (32 commits)
target: Set LBPWS10 bit in Logical Block Provisioning EVPD
target: Fail UNMAP when emulate_tpu=0
target: Fail WRITE_SAME w/ UNMAP=1 when emulate_tpws=0
target: Add sanity checks for DPO/FUA bit usage
target: Perform PROTECT sanity checks for WRITE_SAME
target: Fail I/O with PROTECT bit when protection is unsupported
target: Check for LBA + sectors wrap-around in sbc_parse_cdb
target: Add missing WRITE_SAME end-of-device sanity check
iscsi-target: Avoid IN_LOGOUT failure case for iser-target
target: Fix PR_APTPL_BUF_LEN buffer size limitation
iscsi-target: Drop problematic active_ts_list usage
iscsi/iser-target: Support multi-sequence sendtargets text response
iser-target: Remove duplicate function names
vhost/scsi: potential memory corruption
vhost/scsi: Global tcm_vhost -> vhost_scsi rename
vhost/scsi: Drop left-over scsi_tcq.h include
vhost/scsi: Set VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature bits
vhost/scsi: Add ANY_LAYOUT support in vhost_scsi_handle_vq
vhost/scsi: Add ANY_LAYOUT iov -> sgl mapping prerequisites
vhost/scsi: Change vhost_scsi_map_to_sgl to accept iov ptr + len
...
- Re-enable on-demand paging changes with stable ABI
- Fairly large set of ocrdma HW driver fixes
- Some qib HW driver fixes
- Other miscellaneous changes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJU52nHAAoJEENa44ZhAt0hkycP/0vwYNl0JJadasSUrLm2AWje
iAePU+K7uiyxSuIU+eLnyyDV/iQ2sCjXfhQNE6FnFhH3JjwBYbhS2Q8WcjwfRWYX
iFhORQltb3spSeLdT7N1QfkMF4MtAw3ENPE3QKIgNaIQSso+J256BHrSiqr9Akwe
hwIVr0TLDO59ggPs3c083uUhC+5AMViMVgRR+N9+/59WHz2vG2WsTunpg1sYOAgC
KpUhfZW4za5xYJ0c8BOnPiSAfJasD1UdDg3oX0RPD4j/diiWAO6EOR+jkOLtODpd
8uhD8HM7ZvCKE1io5QnVdjTR/n51NaVGHk+yfWJRSvV1sw1AALtU4NVniI+E5mFs
Fwe+zUzbQ3j08TtrU0VjaeteBh2oGC0pJlkJS9+HXeDmH30/LQCMCiA5mBSSEPDg
0cPEAJgGAZqOIyyZe8jSslW/iN0cE6FDDb8+/1AZ80IfbdMxXh9FVwi7cdXn8+OQ
nXmtnSa7yzKWS9VVXDrvSzI0Y2oDv8DSDpCWGCm7bUcDXd/T5LpPR3RdSorGkYAr
O2zmuJZlCdkuKgW91O7cztNj2hTlDnJ+e+5P05KwPOb86Muum6tphLJd5b1h970X
Actx/6EX/ycSQ3ukiKW7Ksn2bYD3RLZvdbPYQM5xUjlGGWPF6nvmbZ8MqnyaLFfE
WKvx39DMGq3ktofiMkbf
=JLsF
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA updates from Roland Dreier:
- Re-enable on-demand paging changes with stable ABI
- Fairly large set of ocrdma HW driver fixes
- Some qib HW driver fixes
- Other miscellaneous changes
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (43 commits)
IB/qib: Add blank line after declaration
IB/qib: Fix checkpatch warnings
IB/mlx5: Enable the ODP capability query verb
IB/core: Add on demand paging caps to ib_uverbs_ex_query_device
IB/core: Add support for extended query device caps
RDMA/cxgb4: Don't hang threads forever waiting on WR replies
RDMA/ocrdma: Fix off by one in ocrdma_query_gid()
RDMA/ocrdma: Use unsigned for bit index
RDMA/ocrdma: Help gcc generate better code for ocrdma_srq_toggle_bit
RDMA/ocrdma: Update the ocrdma module version string
RDMA/ocrdma: set vlan present bit for user AH
RDMA/ocrdma: remove reference of ocrdma_dev out of ocrdma_qp structure
RDMA/ocrdma: Add support for interrupt moderation
RDMA/ocrdma: Honor return value of ocrdma_resolve_dmac
RDMA/ocrdma: Allow expansion of the SQ CQEs via buddy CQ expansion of the QP
RDMA/ocrdma: Discontinue support of RDMA-READ-WITH-INVALIDATE
RDMA/ocrdma: Host crash on destroying device resources
RDMA/ocrdma: Report correct state in ibv_query_qp
RDMA/ocrdma: Debugfs enhancments for ocrdma driver
RDMA/ocrdma: Report correct count of interrupt vectors while registering ocrdma device
...
Upstream checkpatch now requires this.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Code that does this:
if (!(d_unhashed(tmp) && tmp->d_inode)) {
...
simple_unlink(parent->d_inode, tmp);
}
is broken because:
!(d_unhashed(tmp) && tmp->d_inode)
is equivalent to:
!d_unhashed(tmp) || !tmp->d_inode
so it is possible to get into simple_unlink() with tmp->d_inode == NULL.
simple_unlink(), however, assumes tmp->d_inode cannot be NULL.
I think that what was meant is this:
!d_unhashed(tmp) && tmp->d_inode
and that the logical-not operator or the final close-bracket was misplaced.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bryan O'Sullivan <bos@pathscale.com>
cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Re-enable the on-demand paging capability query through the
extended query device verb.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add on-demand paging capabilities reporting to the extended query device verb.
Yann Droneaud writes:
Note: as offsetof() is used to retrieve the size of the lower chunk
of the response, beware that it only works if the upper chunk
is right after, without any implicit padding. And, as the size of
the latter chunk is added to the base size, implicit padding at the
end of the structure is not taken in account. Both point must be
taken in account when extending the uverbs functionalities.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add extensible query device capabilities verb to allow adding new features.
ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to copy
capability fields to be used by both ib_uverbs_query_device and
ib_uverbs_ex_query_device.
Following the discussion about this patch [1], the code now validates
the command's comp_mask is zero, returning -EINVAL for unknown values,
in order to allow extending the verb in the future.
The verb also checks the user-space provided response buffer size and
only fills in capabilities that will fit in the buffer. In attempt to
follow the spirit of presentation [2] by Tzahi Oved that was presented
during OpenFabrics Alliance International Developer Workshop 2013, the
comp_mask bits will only describe which fields are valid. Furthermore,
fields that can simply be cleared when they are not supported, do not
require a comp_mask bit at all. The verb returns a response_length
field containing the actual number of bytes written by the kernel, so
that a newer version running on an older kernel can tell which fields
were actually returned.
[1] [PATCH v1 0/5] IB/core: extended query device caps cleanup for v3.19
http://thread.gmane.org/gmane.linux.kernel.api/7889/
[2] https://www.openfabrics.org/images/docs/2013_Dev_Workshop/Tues_0423/2013_Workshop_Tues_0830_Tzahi_Oved-verbs_extensions_ofa_2013-tzahio.pdf
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In c4iw_wait_for_reply(), if a FW6_MSG WR reply is not received after
C4IW_WR_TO seconds, fail the WR operation and mark the device as fatally
dead. Further, if the device is marked fatally dead, then fail the WR
wait immediately.
Also change the timeout to 60 seconds.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The ->sgid_tbl[] array has OCRDMA_MAX_SGID number of elements so this
test is off by one. ->sgid_tbl is allocated in ocrdma_alloc_resources().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In the expressions idx/32 and idx%32, both idx and 32 have signed
type, and unfortunately the C standard prescribes rounding to 0, so
unless gcc can prove that idx is non-negative, these cannot be
implemented as simple shift respectively mask operations. Help gcc by
changing the type of idx to unsigned - this cuts another few
instructions from the generated code.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
gcc emits a surprising amount of code in order to flip a bit. One
would think that a single instruction is enough.
$ scripts/bloat-o-meter /tmp/ocrdma_verbs.o drivers/infiniband/hw/ocrdma/ocrdma_verbs.o
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-142 (-142)
function old new delta
ocrdma_post_srq_recv 498 460 -38
ocrdma_poll_cq 2010 1962 -48
ocrdma_discard_cqes 495 439 -56
All three calls of ocrdma_srq_toggle_bit happen within spinlocks, so
saving a few useless instructions might be worthwhile.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For the AH that describs a VLAN interface details, vlan present bit
needs to be set during posting a WQE. This patch adds the code to
allow it happening.
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@emulex.com>
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add support for interrupt moderation for ocrdma device. Thresholds
for high interrupt rates are static values derived based on experimental
results.
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@emulex.com>
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Check for return value for ocrdma_resolve_dmac while setting AV params.
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@emulex.com>
Signed-off-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the SQ and RQ of the QP in error state uses separate CQs, traverse
the list of QPs using each CQs and invoke the buddy CQ handler for
both SQ and RQ.
Signed-off-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. Cleanup sequence in ocrdma_remove(). The device should be
unregistered from IB stack before any device specific cleanup.
2. Always return success in the resource destroy path. In case destroy
command returns error, IB stack will trigger cleanup again while
closing the uverbs device causing kernel panic BUG_ON().
Signed-off-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@emulex.com>
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move PD allocation and deallocation from firmware to driver. At
driver load time all the PDs will be requested from firmware and their
management will be handled by driver to reduce mailbox commands
overhead at runtime.
Signed-off-by: Mitesh Ahuja <mitesh.ahuja@emulex.com>
Signed-off-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Selvin Xavier <selvin.xavier@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When the last on-demand paging MR is released the notifier count is
left non-zero so that concurrent page faults will have to abort. If a
new MR is then registered, the counter is reset. However, the decision
is made to put the new MR in the list waiting for the notifier count
to reach zero, before the counter is reset. An invalidation or another
MR registration can release the MR to handle page faults, but without
such an event the MR can wait forever.
The patch fixes this issue by adding a check whether the MR is the
first on-demand paging MR when deciding whether it is ready to handle
page faults. If it is the first MR, we know that there are no mmu
notifiers running in parallel to the registration.
Fixes: 882214e2b1 ("IB/core: Implement support for MMU notifiers regarding on demand paging regions")
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When we create an MR using reg_create, the mlx5_ib_dev pointer is not
updated on the new MR. This results in a kernel panics for ODP MRs
while handling page faults, when the mlx5_ib_update_mtt function uses
the invalid device pointer.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If a GUID is not found, the 64-bit GUID printed in the message log
warning should converted to host-endian order for printing.
Found by Doug Ledford and Hal Rosenstock. Fix suggested by Hal.
Signed-off-by: Hal Rosenstock <hal@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. Before the entries alignment, we need to check that the entries
doesn't exceed the device's max cqe.
2. After the alignment, we need to make sure that the aligned number
doesn't exceed the max cqes+1. The additional cqe is used to denote
that the resizing operation has completed.
3. If the users asks to resize the CQ with entries less than the
oustanding cqes we should fail instead of returning 0.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In case handle_eth_ud_smac_index fails, we need to free the allocated resources.
Fixes: 2f5bb47368 ("mlx4: Add ref counting to port MAC table for RoCE")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The deadlock occurs in __uverbs_modify_qp: we take a lock (idr_read_qp)
and in case of failure in ib_resolve_eth_l2_attrs we don't release
it (put_qp_read). Fix that.
Fixes: ed4c54e5b4 ("IB/core: Resolve Ethernet L2 addresses when modifying QP")
Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Pull debugfs patches from Al Viro:
"debugfs patches, mostly to make it possible for something like tracefs
to be transparently automounted on given directory in debugfs.
New primitive in there is debugfs_create_automount(name, parent, func,
arg), which creates a directory and makes its ->d_automount() return
func(arg). Another missing primitive was debugfs_create_file_size() -
open-coded in quite a few places. Dave's patch adds it and converts
the open-code instances to calling it"
* 'debugfs_automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
debugfs: Provide a file creation function that also takes an initial size
new primitive: debugfs_create_automount()
debugfs: split end_creating() into success and failure cases
debugfs: take mode-dependent parts of debugfs_get_inode() into callers
fold debugfs_mknod() into callers
fold debugfs_create() into caller
fold debugfs_mkdir() into caller
debugfs_mknod(): get rid useless arguments
fold debugfs_link() into caller
debugfs: kill __create_file()
debugfs: split the beginning and the end of __create_file() off
debugfs_{mkdir,create,link}(): get rid of redundant argument
When marshaling a user path to the kernel struct ib_sa_path, we need
to zero smac and dmac and set the vlan id to the "no vlan" value.
This is to ensure that Ethernet attributes are not used with
InfiniBand QPs.
Fixes: dd5f03beb4 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Signed-off-by: Ilya Nelkenbaum <ilyan@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In some cases, we might reach the iser connection termination without
ep_disconnect being invoked (for example if user-space daemon doesn't
exists. In this case, we need to free the iscsi endpoint when we
remove the iser connection.
Signed-off-by: Ariel Nahum <arieln@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When teardown process starts during live IO, we need to keep the
memory regions pool (frmr/fmr) until all in-flight tasks are properly
released, since each task may return a memory region to the pool. In
order to do this, we pass a destroy flag to iser_free_ib_conn_res to
indicate we can destroy the device and the memory regions
pool. iser_conn_release will pass it as true and also DEVICE_REMOVAL
event (we need to let the device to properly remove).
Also, Since we conditionally call iser_free_rx_descriptors,
remove the extra check on iser_conn->rx_descs.
Fixes: 5426b1711f ("IB/iser: Collapse cleanup and disconnect handlers")
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The uses of "rcu_assign_pointer()" are NULLing out the pointers.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
[Derived from http://marc.info/?l=linux-rdma&m=140836519219236&w=2]
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
According to RCU_INIT_POINTER()'s block comment 3.a, it can be used if
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
"3. The referenced data structure has already been exposed to readers either
at compile time or via rcu_assign_pointer() -and-
a. You have not made -any- reader-visible changes to this structure since
then".
These cases fulfill the conditions above because between the
rcu_dereference_protected() call and the rcu_assign_pointer() call
there is no update of that value. Therefore, this patch makes the
replacement.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(...,
(
rtnl_dereference(...)
|
rcu_dereference_protected(...)
) )
[consolidated from http://marc.info/?l=linux-rdma&m=140836578119485&w=2 and
http://marc.info/?l=linux-rdma&m=140906361403047&w=2]
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>