In the event flow, we currently pass only a port number in the
void *data argument. Rather than pass a pointer to the event handlers,
we should use an "unsigned long" parameter, and pass the port number
value directly.
In the future, if necessary for some events, we can use the unsigned long
parameter to pass a pointer.
Based on a patch by Eli Cohen <eli@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There were many places where parameters which should be u8/u16 were
integer type.
Additionally, in 2 places, a check for a non-null pointer was added
before dereferencing the pointer (this is actually a bug fix).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for a new mlx5 device which is VPI (i.e., ports can be
either IB or ETH), move the pci device functionality from mlx5_ib
to mlx5_core.
This involves the following changes:
1. Move mlx5_core_dev struct out of mlx5_ib_dev. mlx5_core_dev
is now an independent structure maintained by mlx5_core.
mlx5_ib_dev now has a pointer to that struct.
This requires changing a lot of places where the core_dev
struct was accessed via mlx5_ib_dev (now, this needs to
be a pointer dereference).
2. All PCI initializations are now done in mlx5_core. Thus,
it is now mlx5_core which does pci_register_device (and not
mlx5_ib, as was previously).
3. mlx5_ib now registers itself with mlx5_core as an "interface"
driver. This is very similar to the mechanism employed for
the mlx4 (ConnectX) driver. Once the HCA is initialized
(by mlx5_core), it invokes the interface drivers to do
their initializations.
4. There is a new event handler which the core registers:
mlx5_core_event(). This event handler invokes the
event handlers registered by the interfaces.
Based on a patch by Eli Cohen <eli@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/infiniband/hw/cxgb4/device.c
The cxgb4 conflict was simply overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the size advertised by FW
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Advertise the actual max limits for things like qp depths, number of
qps, cqs, etc.
Clean up the queue allocation for qps and cqs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed error introduced in commit id 7730b4c (" cxgb4/iw_cxgb4: work request
logging feature") while compiling on 32 bit architecture reported by kbuild.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This define is used by cxgb4i and iw_cxgb4, moving to avoid code duplication
Signed-off-by: Anish Bhatt <anish@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit f360d88a2e, we advertise blocking multicast loopback to both
kernel and userspace consumers, but don't allow kernel consumers (e.g IPoIB)
to use it with their UD QPs. Fix that.
Fixes: f360d88a2e ("IB/mlx5: Add block multicast loopback support")
Reported-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This commit enhances the iwarp driver to optionally keep a log of rdma
work request timining data for kernel mode QPs. If iw_cxgb4 module option
c4iw_wr_log is set to non-zero, each work request is tracked and timing
data maintained in a rolling log that is 4096 entries deep by default.
Module option c4iw_wr_log_size_order allows specifing a log2 size to use
instead of the default order of 12 (4096 entries). Both module options
are read-only and must be passed in at module load time to set them. IE:
modprobe iw_cxgb4 c4iw_wr_log=1 c4iw_wr_log_size_order=10
The timing data is viewable via the iw_cxgb4 debugfs file "wr_log".
Writing anything to this file will clear all the timing data.
Data tracked includes:
- The host time when the work request was posted, just before ringing
the doorbell. The host time when the completion was polled by the
application. This is also the time the log entry is created. The delta
of these two times is the amount of time took processing the work request.
- The qid of the EQ used to post the work request.
- The work request opcode.
- The cqe wr_id field. For sq completions requests this is the swsqe
index. For recv completions this is the MSN of the ingress SEND.
This value can be used to match log entries from this log with firmware
flowc event entries.
- The sge timestamp value just before ringing the doorbell when
posting, the sge timestamp value just after polling the completion,
and CQE.timestamp field from the completion itself. With these three
timestamps we can track the latency from post to poll, and the amount
of time the completion resided in the CQ before being reaped by the
application. With debug firmware, the sge timestamp is also logged by
firmware in its flowc history so that we can compute the latency from
posting the work request until the firmware sees it.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With ingress WRITE or READ RESPONSE errors, HW provides the offending
stag from the packet. This patch adds logic to log the parsed TPTE
in this case. cxgb4 now exports a function to read a TPTE entry
from adapter memory.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Advertise a larger max read queue depth for qps, and gather the resource limits
from fw and use them to avoid exhaustinq all the resources.
Design:
cxgb4:
Obtain the max_ordird_qp and max_ird_adapter device params from FW
at init time and pass them up to the ULDs when they attach. If these
parameters are not available, due to older firmware, then hard-code
the values based on the known values for older firmware.
iw_cxgb4:
Fix the c4iw_query_device() to report these correct values based on
adapter parameters. ibv_query_device() will always return:
max_qp_rd_atom = max_qp_init_rd_atom = min(module_max, max_ordird_qp)
max_res_rd_atom = max_ird_adapter
Bump up the per qp max module option to 32, allowing it to be increased
by the user up to the device max of max_ordird_qp. 32 seems to be
sufficient to maximize throughput for streaming read benchmarks.
Fail connection setup if the negotiated IRD exhausts the available
adapter ird resources. So the driver will track the amount of ird
resource in use and not send an RI_WR/INIT to FW that would reduce the
available ird resources below zero.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updates iw_cxgb4 to determine the Ingress Padding Boundary from
cxgb4_lld_info, and take subsequent actions.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to only register with the iwpm core once. Currently it is
being done for every adapter, which causes a failure for each adapter
but the first, making multiple adapters unusable.
Fixes: 9eccfe109b ("RDMA/cxgb4: Add support for iWARP Port Mapper user space service")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The status page is mapped to user processes and allows sharing the
device state between the kernel and user processes. This state isn't
getting initialized and thus intermittently causes problems. Namely,
the user process can mistakenly think the user doorbell writes are
disabled which causes SQ work requests to never get fetched by HW.
Fixes: 05eb23893c ("cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes").
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@vger.kernel.org> # v3.15
Signed-off-by: Roland Dreier <roland@purestorage.com>
Based on origninal work by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Based on origninal work by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Change logic which determines our Physical Function at PCI Probe time.
Now we read the PL_WHOAMI register and get the Physical Function.
Pass Physical Function to Upper Layer Drivers in lld_info structure in the
new field "pf" added to lld_info. This is useful for the cases where the
PF, say PF4, is attached to a Virtual Machine via some form of "PCI
Pass Through" technology and the PCI Function shows up as PF0 in the VM.
Based on original work by Casey Leedom <leedom@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI target updates from Nicholas Bellinger:
"The highlights this round include:
- Add support for T10 PI pass-through between vhost-scsi +
virtio-scsi (MST + Paolo + MKP + nab)
- Add support for T10 PI in qla2xxx target mode (Quinn + MKP + hch +
nab, merged through scsi.git)
- Add support for percpu-ida pre-allocation in qla2xxx target code
(Quinn + nab)
- A number of iser-target fixes related to hardening the network
portal shutdown path (Sagi + Slava)
- Fix response length residual handling for a number of control CDBs
(Roland + Christophe V.)
- Various iscsi RFC conformance fixes in the CHAP authentication path
(Tejas and Calsoft folks + nab)
- Return TASK_SET_FULL status for tcm_fc(FCoE) DataIn + Response
failures (Vasu + Jun + nab)
- Fix long-standing ABORT_TASK + session reset hang (nab)
- Convert iser-initiator + iser-target to include T10 bytes into EDTL
(Sagi + Or + MKP + Mike Christie)
- Fix NULL pointer dereference regression related to XCOPY introduced
in v3.15 + CC'ed to v3.12.y (nab)"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (34 commits)
target: Fix NULL pointer dereference for XCOPY in target_put_sess_cmd
vhost-scsi: Include prot_bytes into expected data transfer length
TARGET/sbc,loopback: Adjust command data length in case pi exists on the wire
libiscsi, iser: Adjust data_length to include protection information
scsi_cmnd: Introduce scsi_transfer_length helper
target: Report correct response length for some commands
target/sbc: Check that the LBA and number of blocks are correct in VERIFY
target/sbc: Remove sbc_check_valid_sectors()
Target/iscsi: Fix sendtargets response pdu for iser transport
Target/iser: Fix a wrong dereference in case discovery session is over iser
iscsi-target: Fix ABORT_TASK + connection reset iscsi_queue_req memory leak
target: Use complete_all for se_cmd->t_transport_stop_comp
target: Set CMD_T_ACTIVE bit for Task Management Requests
target: cleanup some boolean tests
target/spc: Simplify INQUIRY EVPD=0x80
tcm_fc: Generate TASK_SET_FULL status for response failures
tcm_fc: Generate TASK_SET_FULL status for DataIN failures
iscsi-target: Reject mutual authentication with reflected CHAP_C
iscsi-target: Remove no-op from iscsit_tpg_del_portal_group
iscsi-target: Fix CHAP_A parameter list handling
...
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
In case protection information exists over the wire
iscsi header data length is required to include it.
Use protection information aware scsi helpers to set
the correct transfer length.
In order to avoid breakage, remove iser transfer length
checks for each task as they are not always true and
somewhat redundant anyway.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Acked-by: Mike Christie <michaelc@cs.wisc.edu>
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In case the transport is iser we should not include the
iscsi target info in the sendtargets text response pdu.
This causes sendtargets response to include the target
info twice.
Modify iscsit_build_sendtargets_response to filter
transport types that don't match.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reported-by: Slava Shwartsman <valyushash@gmail.com>
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In case the discovery session is carried over iser, we can't
access the assumed network portal since the default portal is
used. In this case we don't really need to allocate the fastreg
pool, just prepare to the text pdu that will follow.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reported-by: Alex Tabachnik <alext@mellanox.com>
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Fixed a bug that shows up with recv window sizes that exceed the size of
the RCV_BUFSIZ field in opt0 (>= 1024K). If the recv window exceeds
this, then we specify the max possible in opt0, add add the rest in via
a RX_DATA_ACK credits.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Select the appropriate hw mtu index and initial sequence number to optimize
hw memory performance.
Add new cxgb4_best_aligned_mtu() which allows callers to provide enough
information to be used to [possibly] select an MTU which will result in the
TCP Data Segment Size (AKA Maximum Segment Size) to be an aligned value.
If an RTR message exhange is required, then align the ISS to 8B - 1 + 4, so
that after the SYN the send seqno will align on a 4B boundary. The RTR
message exchange will leave the send seqno aligned on an 8B boundary.
If an RTR is not required, then align the ISS to 8B - 1. The goal is
to have the send seqno be 8B aligned when we send the first FPDU.
Based on original work by Casey Leedom <leeedom@chelsio.com> and
Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently indirect interrupts for RDMA CQs funnel through the LLD's RDMA
RXQs, which also handle direct interrupts for offload CPLs during RDMA
connection setup/teardown. The intended T4 usage model, however, is to
have indirect interrupts flow through dedicated IQs. IE not to mix
indirect interrupts with CPL messages in an IQ. This patch adds the
concept of RDMA concentrator IQs, or CIQs, setup and maintained by the
LLD and exported to iw_cxgb4 for use when creating CQs. RDMA CPLs will
flow through the LLD's RDMA RXQs, and CQ interrupts flow through the
CIQs.
Design:
cxgb4 creates and exports an array of CIQs for the RDMA ULD. These IQs
are sized according to the max available CQs available at adapter init.
In addition, these IQs don't need FL buffers since they only service
indirect interrupts. One CIQ is setup per RX channel similar to the
RDMA RXQs.
iw_cxgb4 will utilize these CIQs based on the vector value passed into
create_cq(). The num_comp_vectors advertised by iw_cxgb4 will be the
number of CIQs configured, and thus the vector value will be the index
into the array of CIQs.
Based on original work by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add iWARP port mapper to avoid conflicts between RDMA and normal
stack TCP connections.
- Fixes for i386 / x86-64 structure padding differences (ABI
compatibility for 32-on-64) from Yann Droneaud.
- A pile of SRP initiator fixes from Bart Van Assche.
- Fixes for a writeback / memory allocation deadlock with NFS over
IPoIB connected mode from Jiri Kosina.
- The usual fixes and cleanups to mlx4, mlx5, cxgb4 and other
low-level drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJTlzyEAAoJEENa44ZhAt0h9yoP/1UeXlejOpCJyiNdtJZ+ilcU
cb0PEzsjzqACyDqcoQ0EpQM3/3emccVIC3uUXK12mzlTIXOFYTeRLays/TbxZDLt
FK5D/NrMmmJmciPt1ZRgUX82kFFRGScEfpkXYs7jxtRaNT7CW5KwSNQr6aFXskUz
1gpdK1ARCN5rWcGl2HJx5o9C4c/Fa/Vov8lOsAkUZXD1SuPNT/fFN0u1pRzU68g0
k3oj81XnZq5ejOBQKXEHImcmjXwaJ2yjmzxhSsKebqDWDdXuS/F9e4taKneHTZmr
AdwJaLLJPWmAGi/vYYhkuLKpzIDpzMCqwr39lEabmjWvznYOlnjfVUXwUTE2nwNC
DIXuHOLFrSvF2cNxh8ZeEYKS8AV+PjAOahPC5whkWkY256Q67uB7cy9ilWAK+7xS
QcQ5Inr6iXvxIGYA4hNwUo8aK0NuKFwhkVVFEbkPaurbQZPqiKwyVE3w2FOws/Qp
0kLLCVvpRQYjKzkxyof2tb1AcNuVNKXHrYk6RaBDJ9mjxHbhvY4OSt4CBxAAXBu6
zoedUydN1Nz1UgAB1jDsBdyE2QQnXockA1+JJKNq6gM5Dz0DUdAylzQ2NqY9tnYz
RTzihEPYIiQUkV3B8ErbqsuO6z7M830AXO5AR6bLZn1zgJ0cbMLBaKLA8LRufJI/
qxNVwL32Uv1PjKZ+yX1x
=Wcdc
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull main InfiniBand/RDMA updates from Roland Dreier:
- add iWARP port mapper to avoid conflicts between RDMA and normal
stack TCP connections.
- fixes for i386 / x86-64 structure padding differences (ABI
compatibility for 32-on-64) from Yann Droneaud.
- a pile of SRP initiator fixes from Bart Van Assche.
- fixes for a writeback / memory allocation deadlock with NFS over
IPoIB connected mode from Jiri Kosina.
- the usual fixes and cleanups to mlx4, mlx5, cxgb4 and other low-level
drivers.
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (61 commits)
RDMA/cxgb4: Add support for iWARP Port Mapper user space service
RDMA/nes: Add support for iWARP Port Mapper user space service
RDMA/core: Add support for iWARP Port Mapper user space service
IB/mlx4: Fix gfp passing in create_qp_common()
IB/umad: Fix use-after-free on close
IB/core: Fix kobject leak on device register error flow
RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp
mlx4_core: Fix GFP flags parameters to be gfp_t
IB/core: Fix port kobject deletion during error flow
IB/core: Remove unneeded kobject_get/put calls
IB/core: Fix sparse warnings about redeclared functions
IB/mad: Fix sparse warning about gfp_t use
IB/mlx4: Implement IB_QP_CREATE_USE_GFP_NOIO
IB: Add a QP creation flag to use GFP_NOIO allocations
IB: Return error for unsupported QP creation flags
IB: Allow build of hw/ and ulp/ subdirectories independently
mlx4_core: Move handling of MLX4_QP_ST_MLX to proper switch statement
RDMA/cxgb4: Add missing padding at end of struct c4iw_create_cq_resp
IB/srp: Avoid problems if a header uses pr_fmt
IB/umad: Fix error handling
...
Based on original work by Vipul Pandya.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
[ Fix htons -> ntohs to make sparse happy. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds iWARP Port Mapper (IWPM) Version 2 support. The iWARP
Port Mapper implementation is based on the port mapper specification
section in the Sockets Direct Protocol paper -
http://www.rdmaconsortium.org/home/draft-pinkerton-iwarp-sdp-v1.0.pdf
Existing iWARP RDMA providers use the same IP address as the native
TCP/IP stack when creating RDMA connections. They need a mechanism to
claim the TCP ports used for RDMA connections to prevent TCP port
collisions when other host applications use TCP ports. The iWARP Port
Mapper provides a standard mechanism to accomplish this. Without this
service it is possible for RDMA application to bind/listen on the same
port which is already being used by native TCP host application. If
that happens the incoming TCP connection data can be passed to the
RDMA stack with error.
The iWARP Port Mapper solution doesn't contain any changes to the
existing network stack in the kernel space. All the changes are
contained with the infiniband tree and also in user space.
The iWARP Port Mapper service is implemented as a user space daemon
process. Source for the IWPM service is located at
http://git.openfabrics.org/git?p=~tnikolova/libiwpm-1.0.0/.git;a=summary
The iWARP driver (port mapper client) sends to the IWPM service the
local IP address and TCP port it has received from the RDMA
application, when starting a connection. The IWPM service performs a
socket bind from user space to get an available TCP port, called a
mapped port, and communicates it back to the client. In that sense,
the IWPM service is used to map the TCP port, which the RDMA
application uses to any port available from the host TCP port
space. The mapped ports are used in iWARP RDMA connections to avoid
collisions with native TCP stack which is aware that these ports are
taken. When an RDMA connection using a mapped port is terminated, the
client notifies the IWPM service, which then releases the TCP port.
The message exchange between the IWPM service and the iWARP drivers
(between user space and kernel space) is implemented using netlink
sockets.
1) Netlink interface functions are added: ibnl_unicast() and
ibnl_mulitcast() for sending netlink messages to user space
2) The signature of the existing ibnl_put_msg() is changed to be more
generic
3) Two netlink clients are added: RDMA_NL_NES, RDMA_NL_C4IW
corresponding to the two iWarp drivers - nes and cxgb4 which use
the IWPM service
4) Enums are added to enumerate the attributes in the netlink
messages, which are exchanged between the user space IWPM service
and the iWARP drivers
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: PJ Waskiewicz <pj.waskiewicz@solidfire.com>
[ Fold in range checking fixes and nlh_next removal as suggested by Dan
Carpenter and Steve Wise. Fix sparse endianness in hash. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
There are two kzalloc() calls which were not converted to use value of
gfp passed to create_qp_common() instead of using hardcoded GFP_KERNEL
in 40f2287bd5 ("IB/mlx4: Implement IB_QP_CREATE_USE_GFP_NOIO"). Fix
this by passing gfp value down properly.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Avoid that closing /dev/infiniband/umad<n> or /dev/infiniband/issm<n>
triggers a use-after-free. __fput() invokes f_op->release() before it
invokes cdev_put(). Make sure that the ib_umad_device structure is
freed by the cdev_put() call instead of f_op->release(). This avoids
that changing the port mode from IB into Ethernet and back to IB
followed by restarting opensmd triggers the following kernel oops:
general protection fault: 0000 [#1] PREEMPT SMP
RIP: 0010:[<ffffffff810cc65c>] [<ffffffff810cc65c>] module_put+0x2c/0x170
Call Trace:
[<ffffffff81190f20>] cdev_put+0x20/0x30
[<ffffffff8118e2ce>] __fput+0x1ae/0x1f0
[<ffffffff8118e35e>] ____fput+0xe/0x10
[<ffffffff810723bc>] task_work_run+0xac/0xe0
[<ffffffff81002a9f>] do_notify_resume+0x9f/0xc0
[<ffffffff814b8398>] int_signal+0x12/0x17
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=75051
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: <stable@vger.kernel.org> # 3.x: 8ec0a0e6b58: IB/umad: Fix error handling
Signed-off-by: Roland Dreier <roland@purestorage.com>
The ports kobject isn't being released during error flow in device
registration. This patch refactors the ports kobject cleanup into a
single function called from both the error flow in device registration
and from the unregistration function.
A couple of attributes aren't being deleted (iw_stats_group, and
ib_class_attributes). While this may be handled implicitly by the
destruction of their kobjects, it seems better to handle all the
attributes the same way.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
[ Make free_port_list_attributes() static. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
The i386 ABI disagrees with most other ABIs regarding alignment of
data types larger than 4 bytes: on most ABIs a padding must be added
at end of the structures, while it is not required on i386.
So for most ABI struct c4iw_alloc_ucontext_resp gets implicitly padded
to be aligned on a 8 bytes multiple, while for i386, such padding is
not added.
The tool pahole can be used to find such implicit padding:
$ pahole --anon_include \
--nested_anon_include \
--recursive \
--class_name c4iw_alloc_ucontext_resp \
drivers/infiniband/hw/cxgb4/iw_cxgb4.o
Then, structure layout can be compared between i386 and x86_64:
+++ obj-i386/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 11:43:05.547432195 +0100
--- obj-x86_64/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 10:55:10.990133017 +0100
@@ -2,9 +2,8 @@ struct c4iw_alloc_ucontext_resp {
__u64 status_page_key; /* 0 8 */
__u32 status_page_size; /* 8 4 */
- /* size: 12, cachelines: 1, members: 2 */
- /* last cacheline: 12 bytes */
+ /* size: 16, cachelines: 1, members: 2 */
+ /* padding: 4 */
+ /* last cacheline: 16 bytes */
};
This ABI disagreement will make an x86_64 kernel try to write past the
buffer provided by an i386 binary.
When boundary check will be implemented, the x86_64 kernel will refuse
to write past the i386 userspace provided buffer and the uverbs will
fail.
If the structure is on a page boundary and the next page is not
mapped, ib_copy_to_udata() will fail and the uverb will fail.
Additionally, as reported by Dan Carpenter, without the implicit
padding being properly cleared, an information leak would take place
in most architectures.
This patch adds an explicit padding to struct c4iw_alloc_ucontext_resp,
and, like 92b0ca7cb1 ("IB/mlx5: Fix stack info leak in
mlx5_ib_alloc_ucontext()"), makes function c4iw_alloc_ucontext()
not writting this padding field to userspace. This way, x86_64 kernel
will be able to write struct c4iw_alloc_ucontext_resp as expected by
unpatched and patched i386 libcxgb4.
Link: http://marc.info/?i=cover.1399309513.git.ydroneaud@opteya.com
Link: http://marc.info/?i=1395848977.3297.15.camel@localhost.localdomain
Link: http://marc.info/?i=20140328082428.GH25192@mwanda
Cc: <stable@vger.kernel.org>
Fixes: 05eb23893c ("cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes")
Reported-by: Yann Droneaud <ydroneaud@opteya.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When encountering an error during the add_port function, adding a port
to sysfs, the port kobject is freed without being deleted from sysfs.
Instead of freeing it directly, the patch uses kobject_put to release
the kobject and delete it.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The ib_core module will call kobject_get on the parent object of each
kobject it creates. This is redundant since kobject_add does that
anyway.
As a side effect, this patch should fix leaking the ports kobject and
the device kobject during unregister flow, since the previous code
didn't seem to take into account the kobject_get calls on behalf of
the child kobjects.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix a few functions that are declared with __attribute_const__ in the
ib_verbs.h header file but defined without it in verbs.c. This gets rid
of the following sparse warnings:
drivers/infiniband/core/verbs.c:51:5: error: symbol 'ib_rate_to_mult' redeclared with different type (originally declared at include/rdma/ib_verbs.h:469) - different modifiers
drivers/infiniband/core/verbs.c:68:14: error: symbol 'mult_to_ib_rate' redeclared with different type (originally declared at include/rdma/ib_verbs.h:607) - different modifiers
drivers/infiniband/core/verbs.c:85:5: error: symbol 'ib_rate_to_mbps' redeclared with different type (originally declared at include/rdma/ib_verbs.h:476) - different modifiers
drivers/infiniband/core/verbs.c:111:1: error: symbol 'rdma_node_get_transport' redeclared with different type (originally declared at include/rdma/ib_verbs.h:84) - different modifiers
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch addresses a bug where an early exception for SCSI WRITE
with ImmediateData=Yes was missing the target_put_sess_cmd() call
to drop the extra se_cmd->cmd_kref reference obtained during the
normal iscsit_setup_scsi_cmd() codepath execution.
This bug was manifesting itself during session shutdown within
isert_cq_rx_comp_err() where target_wait_for_sess_cmds() would
end up waiting indefinately for the last se_cmd->cmd_kref put to
occur for the failed SCSI WRITE + ImmediateData descriptors.
This fix follows what traditional iscsi-target code already does
for the same failure case within iscsit_get_immediate_data().
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Cc: Sagi Grimberg <sagig@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Properly convert gfp_t & result to bool to fix:
drivers/infiniband/core/sa_query.c:621:33: warning: incorrect type in initializer (different base types)
drivers/infiniband/core/sa_query.c:621:33: expected bool [unsigned] [usertype] preload
drivers/infiniband/core/sa_query.c:621:33: got restricted gfp_t
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify the various routines used to allocate memory resources which
serve QPs in mlx4 to get an input GFP directive. Have the Ethernet
driver to use GFP_KERNEL in it's QP allocations as done prior to this
commit, and the IB driver to use GFP_NOIO when the IB verbs
IB_QP_CREATE_USE_GFP_NOIO QP creation flag is provided.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This addresses a problem where NFS client writes over IPoIB connected
mode may deadlock on memory allocation/writeback.
The problem is not directly memory reclamation. There is an indirect
dependency between network filesystems writing back pages and
ipoib_cm_tx_init() due to how a kworker is used. Page reclaim cannot
make forward progress until ipoib_cm_tx_init() succeeds and it is
stuck in page reclaim itself waiting for network transmission.
Ordinarily this situation may be avoided by having the caller use
GFP_NOFS but ipoib_cm_tx_init() does not have that information.
To address this, take a general approach and add a new QP creation
flag that tells the low-level hardware driver to use GFP_NOIO for the
memory allocations related to the new QP.
Use the new flag in the ipoib connected mode path, and if the driver
doesn't support it, re-issue the QP creation without the flag.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix the usnic and thw qib drivers to err when QP creation flags that
they don't understand are provided.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
It is not possible to build only the drivers/infiniband/hw/ (or ulp/)
subdirectory with command such as:
$ make ARCH=x86_64 O=./obj-x86_64/ drivers/infiniband/hw/
This fails with following error messages:
make[2]: Nothing to be done for `all'.
make[2]: Nothing to be done for `relocs'.
CHK include/config/kernel.release
Using /home/ydroneaud/src/linux as source for kernel
GEN /home/ydroneaud/src/linux/obj-x86_64/Makefile
CHK include/generated/uapi/linux/version.h
CHK include/generated/utsrelease.h
CALL /home/ydroneaud/src/linux/scripts/checksyscalls.sh
/home/ydroneaud/src/linux/scripts/Makefile.build:44: /home/ydroneaud/src/linux/drivers/infiniband/hw/Makefile: No such file or directory
make[2]: *** No rule to make target `/home/ydroneaud/src/linux/drivers/infiniband/hw/Makefile'. Stop.
make[1]: *** [drivers/infiniband/hw/] Error 2
make: *** [sub-make] Error 2
This patch creates a Makefile in hw/ and ulp/ and moves each
corresponding parts of drivers/infiniband/Makefile in the new
Makefiles.
It should not break build except if some hw/ drivers or ulp/ were
allowed previously to be built while CONFIG_INFINIBAND is set to 'n',
but according to drivers/infiniband/Kconfig, it's not possible. So it
should be safe to apply.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>