The current work request posting code writes the LSO segment before
writing any data segments. This leaves a window where the LSO segment
overwrites the stamping in one cacheline that the HCA prefetches
before the rest of the cacheline is filled with the correct data
segments. When the HCA processes this work request, a local
protection error may result.
Fix this by saving the LSO header size field off and writing it only
after all data segments are written. This fix is a cleaned-up version
of a patch from Jack Morgenstein <jackm@dev.mellanox.co.il>.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1383>.
Reported-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
To allow allocating an aligned range of consecutive QP numbers, add an
interface to reserve an aligned range of QP numbers and have the QP
allocation function always take a QP number.
This will be used for RSS support in the mlx4_en Ethernet driver and
also potentially by IPoIB RSS support.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set RLKEY bit in the HW context for kernel QPs so that kernel QPs can
use the reserved L_Key for memory reference.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Byte swap the addresses in the page list for fast register work requests
to big endian to match what the HCA expectx. Also, the addresses must
have the "present" bit set so that the HCA knows it can access them.
Otherwise the HCA will fault the first time it accesses the memory
region.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code limits the max message size to 2K for UD QPs, while MTU
might be as big as 4K. This patch sets the maximum message size to
4K, which is needed for UD to work correctly on fabrics with a 4K MTU.
Signed-off-by: Alex Naslednikov <xalex@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Update existing Mellanox copyright lines to 2008, and add such lines
to files where they are missing.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for the following operations to mlx4 when device firmware
supports them:
- Send with invalidate and local invalidate send queue work requests;
- Allocate/free fast register MRs;
- Allocate/free fast register MR page lists;
- Fast register MR send queue work requests;
- Local DMA L_Key.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code uses kmalloc() and then just does a bitwise OR operation on
qp->flags in create_qp_common(), which means that qp->flags may
potentially have some unintended bits set. This patch uses kzalloc()
and avoids further explicit clearing of structure members, which also
shrinks the code:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-65 (-65)
function old new delta
create_qp_common 2024 1959 -65
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for handling the IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK
flag by using the per-multicast group loopback blocking feature of
mlx4 hardware.
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 65adfa91 ("IB/mlx4: Fix RESET to RESET and RESET to ERROR
transitions") added some extra code to handle a QP state transition
from RESET to ERROR. However, the latest 1.2.1 version of the IB spec
has clarified that this transition is actually not allowed, so we can
remove this extra code again.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX returns the max message size it supports through the
QUERY_DEV_CAP firmware command. When modifying a QP to RTR, the max
message size for the QP must be specified. This value must not exceed
the value declared through QUERY_DEV_CAP. The current code ignores
the max allowed size and unconditionally sets the value to 2^31. This
patch sets all QPs to the max value allowed as returned from firmware.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The idea is that for QPs with fixed size work requests (eg selective
signaling QPs), before stamping the WQE, we read the value of the DS
field, which gives the effective size of the descriptor as used in the
previous post. Then we stamp only that area, since the rest of the
descriptor is already stamped.
When initializing the send queue buffer, make sure the DS field is
initialized to the max descriptor size so that the subsequent stamping
will be done on the entire descriptor area.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When creating a kernel QP where the consumer asked for a send queue
with lots of scatter/gater entries, set_kernel_sq_size() incorrectly
returned an error if the send queue stride is larger than the
hardware's maximum send work request descriptor size. This is not a
problem; the only issue is to make sure that the actual descriptors
used do not overflow the maximum descriptor size, so check this instead.
Clamp the returned max_send_sge value to be no bigger than what
query_device returns for the max_sge to avoid confusing hapless users,
even if the hardware is capable of handling a few more s/g entries.
This bug caused NFS/RDMA mounts to fail when the server adapter used
the mlx4 driver.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send':
drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function
This is the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In addition to mlx4_ib, there will be ethernet and FC consumers of
mlx4_core, so move the code for managing kernel doorbells into the
core module to avoid having to duplicate this multiple times.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the QP was moved to another state (such as SQE) by the hardware,
then after this change the user won't have to set the IBV_QP_CUR_STATE
mask in order to execute modify QP in order to recover from this state.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The values chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rather than have build_mlx_header() return a negative value on failure
and the length of the segments it builds on success, add a pointer
parameter to return the length and return 0 on success. This matches
the calling convention used for build_lso_seg() and generates slightly
smaller code -- eg, on 64-bit x86:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-22 (-22)
function old new delta
mlx4_ib_post_send 2023 2001 -22
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a create_flags member to struct ib_qp_init_attr that will allow a
kernel verbs consumer to create a pass special flags when creating a QP.
Add a flag value for telling low-level drivers that a QP will be used
for IPoIB UD LSO. The create_flags member will also be useful for XRC
and ehca low-latency QP support.
Since no create_flags handling is implemented yet, add code to all
low-level drivers to return -EINVAL if create_flags is non-zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX devices support checksum generation and verification of TCP
and UDP packets for UD IPoIB messages. This patch checks if the HCA
supports this and sets the IB_DEVICE_UD_IP_CSUM capability flag if it
does. It implements support for handling the IB_SEND_IP_CSUM send
flag and setting the csum_ok field in receive work completions.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Ali Ayub <ali@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX HCA supports shrinking WQEs, so that a single work request
can be made of multiple units of wqe_shift. This way, WRs can differ
in size, and do not have to be a power of 2 in size, saving memory and
speeding up send WR posting. Unfortunately, if we do this then the
wqe_index field in CQEs can't be used to look up the WR ID anymore, so
our implementation does this only if selective signaling is off.
Further, on 32-bit platforms, we can't use vmap() to make the QP
buffer virtually contigious. Thus we have to use constant-sized WRs to
make sure a WR is always fully within a single page-sized chunk.
Finally, we use WRs with the NOP opcode to avoid wrapping around the
queue buffer in the middle of posting a WR, and we set the
NoErrorCompletion bit to avoid getting completions with error for NOP
WRs. However, NEC is only supported starting with firmware 2.2.232,
so we use constant-sized WRs for older firmware. And, since MLX QPs
only support SEND, we use constant-sized WRs in this case.
When stamping during NOP posting, do stamping following setting of the
NOP WQE valid bit.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We use struct mlx4_buf for kernel QP, CQ and SRQ buffers, and the code
to look up an entry is duplicated in get_cqe_from_buf() and the QP and
SRQ versions of get_wqe(). Factor this out into mlx4_buf_offset().
This will also make it easier to switch over to using vmap() for buffers.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Because of a typo, mlx4_ib_post_send() takes the same lock rq.lock as
mlx4_ib_post_recv(). Correct the code so the intended sq.lock is
taken when posting a send.
Noticed by Yossi Leybovitch and pointed out by Jack Morgenstein from
Mellanox.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add sanity checks to send queue sizes passed in from userspace. The
minimum sq stride value below is taken from the MT25408 PRM (section
11.10, Table 306, log_sq_stride definition).
Without this check, userspace can submit arbitrarily large/small
values for the number of WQEs and the stride, which can crash the
kernel.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use a __set_data_seg() helper in mlx4_ib_post_recv() too; in addition
to making the code easier to read, this also allows gcc to generate
better code -- on x86_64:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-8 (-8)
function old new delta
mlx4_ib_post_recv 359 351 -8
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is an addendum to commit 0e6e7416 ("IB/mlx4: Handle new FW
requirement for send request prefetching"). We also need to handle
prefetch marking properly for S/G segments, or else the HCA may end up
processing S/G segments that are not fully written and end up sending
the wrong data. This can actually cause data corruption in practice,
especially on systems with relatively slow CPUs (where the HCA is more
likely to prefetch while the CPU is in the middle of writing a work
request into memory).
We write S/G segments in reverse order into the WQE, in order to
guarantee that the first dword of all cachelines containing S/G
segments is written last (overwriting the headroom invalidation
pattern). The entire cacheline will thus contain valid data when the
invalidation pattern is overwritten.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The error handling code at err_wrid in create_qp_common() does not
handle a userspace QP attached to an SRQ correctly, since it ends up
in the else clause of the if statement. This means it tries to
kfree() the uninitialized qp->sq.wrid and qp->rq.wrid pointers. Fix
this so we only free the wrid arrays for kernel QPs.
Pointed out by Michael S. Tsirkin <mst@dev.mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Temporarily allocated struct mlx4_qp_context *context is leaked by
several error paths. The patch takes advantage of the return value
'err' being preinitialized to -EINVAL.
Spotted by Coverity (CID 1768).
Signed-off-by: Florin Malita <fmalita@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor code to set remote address, atomic and datagram segments out of
mlx4_ib_post_send() into small helper functions. This doesn't change
the generated code in any significant way, and makes the source easier
on the eyes.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor code to set data segment entries out of mlx4_ib_post_send()
into set_data_seg(). This cleans up the code and lets the compiler do
a better job -- on x86_64:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-16 (-16)
function old new delta
mlx4_ib_post_send 1598 1582 -16
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Return the receive queue sizes for both userspace QPs and kernel Qps
(not just kernel QPs) from mlx4_ib_query_qp(). Also zero the send
queue sizes for userspace QPs to avoid a possible information leak,
and set the max_inline_data for kernel QPs to 0 since inline sends are
not supported for kernel QPs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When clearing the ib_ah_attr parameter in to_ib_ah_attr(), use sizeof
*ib_ah_attr instead of sizeof *path. This is the same bug as was
fixed for mthca in 99d4f22e ("IB/mthca: Use correct structure size in
call to memset()"), but the code was cut and pasted into mlx4 before the
fix was merged.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a QP is in the INIT state, the sched_queue field hasn't been given
to the firmware yet, so the firmware cannot return the value when the QP
is queried. To handle this, use the port number that is saved in the
driver's QP data structure.
Found by Dotan Barak and Yaron Gepstein of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Correct the mask used to get the flow label, since the field is 20 bits,
not 24 bits.
Found by Dotan Barak and Yaron Gepstein of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Inline data segments in send WQEs are not allowed to cross a 64 byte
boundary. We use inline data segments to hold the UD headers for MLX
QPs (QP0 and QP1). A send with GRH on QP1 will have a UD header that
is too big to fit in a single inline data segment without crossing a
64 byte boundary, so split the header into two inline data segments.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Upcoming firmware introduces command interface revision 3, which
changes the way port capabilities are queried and set. Update the
driver to handle both the new and old command interfaces by adding a
new MLX4_FLAG_OLD_PORT_CMDS that it is set after querying the firmware
interface revision and then using the correct interface based on the
setting of the flag.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The calculation of max_inline_data in set_kernel_sq_size() is bogus,
since it doesn't take into account the fact that inline segments may
not cross a 64-byte boundary, and hence multiple inline segments will
probably need to be used to post large inline sends.
We don't support inline sends for kernel QPs anyway, so there's no
point in doing this calculation anyway, since the field is just zeroed
out a little later. So just delete the bogus calculation.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
New ConnectX firmware introduces FW command interface revision 2,
which requires that for each QP, a chunk of send queue entries (the
"headroom") is kept marked as invalid, so that the HCA doesn't get
confused if it prefetches entries that haven't been posted yet. Add
code to the driver to do this, and also update the user ABI so that
userspace can request that the prefetcher be turned off for userspace
QPs (we just leave the prefetcher on for all kernel QPs).
Unfortunately, marking send queue entries this way is confuses older
firmware, so we change the driver to allow only FW command interface
revisions 2. This means that users will have to update their firmware
to work with the new driver, but the firmware is changing quickly and
the old firmware has lots of other bugs anyway, so this shouldn't be too
big a deal.
Based on a patch from Jack Morgenstein <jackm@dev.mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Doing max(1, foo) where foo is u32 generates a warning, because 1 is a
signed constant. Fix this by using 1U instead.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
QPs attached to an SRQ must never have their own RQ, and QPs not
attached to SRQs must have an RQ with at least 1 entry. Enforce all
of this in set_rq_size().
Based on a patch by Eli Cohen <eli@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The code in __mlx4_ib_modify_qp() overwrites context->params1 after
the RNR retry parameter is ORed in, which results in the RNR retry
parameter always being set to 0. Fix this by moving where we OR in
the value to later in the function, after the initial assignment of
context->params1.
Found by the Mellanox firmware group.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We need to initialize the owner bit of send queue WQEs to hardware
ownership whenever the QP is modified from reset to init, not just
when the QP is first allocated. This avoids having the hardware
process stale WQEs when the QP is moved to reset but not destroyed and
then modified to init again.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a QP is attached to a shared receive queue (SRQ), then it doesn't
have a receive queue (RQ). So don't allocate an RQ doorbell (or map a
doorbell from userspace for userspace QPs) for that QP.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pass the number of WQEs for the send queue and their size from userspace
to the kernel to avoid having to keep the QP size calculations in sync
between the kernel driver and libmlx4. This fixes a bug seen with the
current mlx4_ib driver and current libmlx4 caused by a difference in the
calculated sizes for SQ WQEs. Also, this gives more flexibility for
userspace to experiment with using multiple WQE BBs for a single SQ WQE.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>