The recv lock was defined so the iscsi layer could block
the recv path from processing IO during recovery. It
turns out iser just set a lock to that pointer which was pointless.
We now disconnect the transport connection before doing recovery
so we do not need the recv lock. For iscsi_tcp we still stop
the recv path incase older tools are being used.
This patch also has iscsi_itt_to_ctask user grab the session lock
and has the caller access the task with the lock or get a ref
to it in case the target is broken and sends a tmf success response
then sends data or a response for the command that was supposed to
be affected bty the tmf.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This adds two new attrs used for creating initiator ports and
binding sessions to hardware.
The session level initiatorname:
Since bnx2i does a scsi_host per host device, we need to add the
iface initiator port settings on the session, so we can create
multiple initiator ports (each with different inames) per device/scsi_host.
The current iname reflects that qla4xxx can have one iname per hba, and we are
allocating a host per session for software. The iname on the host will
remain so we can export and set the hba level qla4xxx setting.
The ifacename attr:
To bind a session to a some peice of hardware in userspace we maintain
some mappings, but during boot or iscsid restart (iscsid contains the user
space part of the driver) we need to be able to figure out which of those
host mappings abstractions maps to certain sessions. This patch adds
a ifacename attr, which userspace can set to id the host side of the
endpoint across pivot_roots and iscsid restarts.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This hooks iser into the iscsi endpoint code. Previously it handled the
lookup and allocation. This has been made generic so bnx2i and iser can
share it. It also allows us to pass iser the leading conn's ep, so we
know the ib_deivce being used and can set it as the scsi_host's parent.
And that allows scsi-ml to set the dma_mask based on those values.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Currently we duplicate the list of sessions, because we were using the
test for if a session was on the host list to indicate if the session
was bound or unbound. We can instead use the target_id and fix up
the class so that drivers like bnx2i do not have to manage the target id
space.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This handles the iscsi_cmd_task rename and renames
the iser cmd task to iser task.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Convert ib_iser to support merged tasks.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Currently to get a ctask from the session cmd array, you have to
know to use the itt modifier. To make this easier on LLDs and
so in the future we can easilly kill the session array and use
the host shared map instead, this patch adds a nice wrapper
to strip the itt into a session->cmds index and return a ctask.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
After the stop_conn callback has returned the LLD should not
touch the scsi cmds. iscsi_tcp and libiscsi use the
conn->recv_lock and suspend_rx field to halt recv path
processing, but iser does not have any protection.
This patch modifies iser so that userspace can just
call the ep_disconnect callback, which will halt
all recv IO, before calling the stop_conn callback so
we do not have to worry about the conn->recv_lock and
suspend rx field. iser just needs to stop the send side
from accessing the ib conn.
Fixup to handle when the ep poll fails and ep disconnect
is called from Erez.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This removes the session and conn data_size fields from the iscsi_transport.
Just pass in the value like with host allocation. This patch also makes
it so the LLD iscsi_conn data is allocated with the iscsi_cls_conn.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This finishes the host/session unbinding, by adding some helpers
to add and remove hosts and the session they manage.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
bnx2i allocates a host per netdevice but will use libiscsi,
so this unbinds the session from the host in that code.
This will also be useful for the iser parent device dma settings
fixes.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
max_cmd_len and max_conn are not really used. max_cmd_len is
always 16 and can be set by the LLD. max_conn is always one
since we do not support MCS.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
iscsi offload (bnx2i and qla4xx) allocate a scsi host per hba,
so the session creation path needs a shost/host_no argument.
Software iscsi/iser will follow the same behabior as before
where it allcoates a host per session, but in the future iser
will probably look more like bnx2i where the host's parent is
the hardware (rnic for iser and for bnx2i it is the nic), because
it does not use a socket layer like how iscsi_tcp does.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mad: Fix kernel crash when .process_mad() returns SUCCESS|CONSUMED
IPoIB: Test for NULL broadcast object in ipiob_mcast_join_finish()
MAINTAINERS: Add cxgb3 and iw_cxgb3 NIC and iWARP driver entries
IB/mlx4: Fix creation of kernel QP with max number of send s/g entries
IB/mthca: Fix max_sge value returned by query_device
RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send()
IB/mlx4: Fix uninitialized-var warning in mlx4_ib_post_send()
IB/ipath: Fix UC receive completion opcode for RDMA WRITE with immediate
IB/ipath: Fix printk format for ipath_sdma_status
If a low-level driver returns IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_CONSUMED,
handle_outgoing_dr_smp() doesn't clean up properly. The fix is to
kfree the local data and break, rather than falling through. This was
observed with the ipath driver, but could happen with any driver.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1027>.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We saw a kernel oops in our regression testing when a multicast "join
finish" occurred just after the interface was -- this is
<https://bugs.openfabrics.org/show_bug.cgi?id=1040>. The test
randomly causes the HCA physical port to go down then up.
The cause of this is that ipoib_mcast_join_finish() processing happen
just after ipoib_mcast_dev_flush() was invoked (in which case the
broadcast pointer is NULL). This patch tests for and handles the case
where priv->broadcast is NULL.
Cc: <stable@kernel.org>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When creating a kernel QP where the consumer asked for a send queue
with lots of scatter/gater entries, set_kernel_sq_size() incorrectly
returned an error if the send queue stride is larger than the
hardware's maximum send work request descriptor size. This is not a
problem; the only issue is to make sure that the actual descriptors
used do not overflow the maximum descriptor size, so check this instead.
Clamp the returned max_send_sge value to be no bigger than what
query_device returns for the max_sge to avoid confusing hapless users,
even if the hardware is capable of handling a few more s/g entries.
This bug caused NFS/RDMA mounts to fail when the server adapter used
the mlx4 driver.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There is a race from when a device is created with device_create() and
then the drvdata is set with a call to dev_set_drvdata() in which a
sysfs file could be open, yet the drvdata will be NULL, causing all
sorts of bad things to happen.
This patch fixes the problem by using the new function,
device_create_drvdata().
Cc: Kay Sievers <kay.sievers@vrfy.org>
Reviewed-by: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The mthca driver returns the maximum number of scatter/gather entries
returned by the firmware as the max_sge value when device properties
are queried. However, the firmware also reports a limit on the
maximum descriptor size allowed, and because mthca takes into account
the worst case send request overhead when checking whether to allow a
QP to be created, the largest number of scatter/gather entries that
can be used with mthca may be limited by the maximum descriptor size
rather than just by the actual s/g entry limit.
This means that applications cannot actually create QPs with
max_send_sge equal to the limit returned by ib_query_device(). Fix
this by checking if the maximum descriptor size imposes a lower limit
and if so returning that lower limit.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'iwch_post_send':
drivers/infiniband/hw/cxgb3/iwch_qp.c:232: warning: 't3_wr_flit_cnt' may be used uninitialized in this function
This is what akpm describes as "the dopey
gcc-doesn't-know-that-foo(&var)-writes-to-var problem."
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send':
drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function
This is the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When I fixed the RC receive completion opcode in 2bfc8e9e ("IB/ipath:
Return the correct opcode for RDMA WRITE with immediate"), I forgot to
fix UC, which had the same problem for RDMA write with immediate
returning the wrong opcode.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit f018c7e1 ("IB/ipath: Change ipath_devdata.ipath_sdma_status to be
unsigned long") changed ipath_sdma_status to be unsigned long, but left
a few debug messages that printed it out with a %016llx format, which
generates the warnings
drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int'
drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int'
Fix this by changing the format used to print out the value to %08lx
(8 hex digits are now sufficient, because the highest bit used is 31).
Warnings reported by Randy Dunlap <randy.dunlap@oracle.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxio_flush_sq() was failing to wrap around the software send queue
causing garbage completion entries on a flush operation.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Andrew Morton <akpm@linux-foundation.org> pointed out that bitops
should take an unsigned long * arg. However, the ipath driver was
doing bitops on struct ipath_devdata.ipath_sdma_status, which is u64.
Change this member to unsigned long to avoid tons of warnings when x86
fixes the bitops to take unsigned long * instead of void *.
Also, change the IPATH_SDMA_RUNNING and IPATH_SDMA_SHUTDOWN bit
numbers to 30 and 31 (instead of 62 and 63) so that we're not setting
another booby trap for someone who tries to make ipath work on a
32-bit architecture.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The official reason is "with the presence of pid namespaces in the
kernel using pid_t-s inside one is no longer safe."
But the reason I fix this right now is the following:
About a month ago (when 2.6.25 was not yet released) there still was a
one last caller of a to-be-deprecated-soon function find_pid() - the
kill_proc() function, which in turn was only used by nfs callback
code.
During the last merge window, this last caller was finally eliminated
by some NFS patch(es) and I was about to finally kill this kill_proc()
and find_pid(), but found, that I was late and the kill_proc is now
called from the ipath driver since commit 58411d1c ("IB/ipath: Head of
Line blocking vs forward progress of user apps").
So here's a patch that fixes this code to use struct pid * and (!)
the kill_pid routine.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If an out of sequence RDMA read response middle or last packet is
received, we should only resend the RDMA read request on the first
out of sequence packet and drop subsequent out of sequence packets
otherwise, we get "too many retries".
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The send DMA hardware queue voided a number of prior assumptions about
when a send is complete which led to completions being generated out of
order. There were also a number of locking issues when switching the QP
to the error or reset states, and we implement the IB_QPS_SQD state.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When errors are detected in RC, the QP should transition to the
IB_QPS_ERR state, not the IB_QPS_SQE state. Also, when the error is on
the responder side, the receive work completion error was incorrect
(remote vs. local).
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix some bugs with the max_aggr module parameter added with LRO support:
- The module parameter value ignored and not actually used to set
lro_mgr.max_aggr.
- MODULE_PARM_DESC had a typo "_mro_" instead of "_lro_" so it didn't
end up describing the actual module parameter.
- The nes_lro_max_aggr variable was declared as unsigned, but the
module_param line said "int" instead of "uint" for the type.
- The default value for the parameter was stuck in the permissions
field of module_param, which led to nonsensical permissions for the
file under /sys/module/iw_nes/param.
- The parameter was used in only one file but defined in another, which
led to the variable being global for no good reason. Move everything
related to the parameter to the file nes_hw.c where it is actually
used.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is necessary because, in a multicore environment, a race between
uverbs async handler and destroy QP could occur.
Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
What's fixed:
in ipath_cancel_sends()
We need to unconditionally set ABORTING. So, swap the tests
so the set_bit() isn't shadowed by the &&.
If we've disarmed the piobufs, then we need to unconditionally
set DISARMED. So, move it out from the overly protective if
at the bottom.
in sdma_abort_task()
Abort_task was written knowing that the SDMA engine would always
be reset (and restarted) on error. A recent change broke that
fundamental assumption by taking the restart portion and making
it conditional on a link status change. But, SDMA can go boom
without a link status change in some conditions.
Signed-off-by: John Gregor <john.gregor@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Now that we always use PIO for vl15 on 7220, we could get stuck forever
if we happened to run out of PIO buffers from the verbs code, because
the setup code wouldn't run; the interrupt was also ignored if SDMA was
supported. We also have to reduce the pio update threshold if we have
fewer kernel buffers than the existing threshold.
Clean up the initialization a bit to get ordering safer and more
sensible, and use the existing ipath_chg_kernavail call to do init,
rather than doing it separately.
Drop unnecessary clearing of pio buffer on pio parity error.
Drop incorrect updating of pioavailshadow when exitting freeze mode
(software state may not match chip state if buffer has been allocated
and not yet written).
If we couldn't get a kernel buffer for a while, make sure we are
in sync with hardware, mainly to handle the exitting freeze case.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The loop in ipath_kreceive() that processes packets increments the
loop-index 'i' once too often, because the exit condition does not
depend on it, and is checked after the increment. By adding a check for
!last to the iterator in the for loop, we correct that in a way that is
not so likely to be re-broken by changes in the loop body.
Signed-off-by: Michael Albaugh <micheal.albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch fixes a bug in the RC responder which generates a completion
entry with the wrong opcode when an RDMA WRITE with immediate is received.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The semantics of cancel_sends changed, but the code using it was missed.
Don't leave sends and pioavail updates disabled, and add a comment as to
why the force update is needed.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a send work request has immediate errors and is not put on the
send queue, we shouldn't update any of the QP state.
The increment of the SSN wasn't obeying this.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We warn about prototype chips, but the function that checks for
support is also called as a result of a get_portinfo request, which
can clutter the logs.
Restrict warning to only appear during initialization.
Signed-off-by: Michael Albaugh <michael.albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently, iw_cxgb3 is severely limited on the amount of userspace
memory that can be registered in in a single memory region, which
causes big problems for applications that expect to be able to
register 100s of MB.
The problem is that the driver uses a single kmalloc()ed buffer to
hold the physical buffer list (PBL) for the entire memory region
during registration, which means that 8 bytes of contiguous memory are
required for each page of memory being registered. For example, a 64
MB registration will require 128 KB of contiguous memory with 4 KB
pages, and it unlikely that such an allocation will succeed on a busy
system.
This is purely a driver problem: the temporary page list buffer is not
needed by the hardware, so we can fix this by writing the PBL to the
hardware in page-sized chunks rather than all at once. We do this by
splitting the memory registration operation up into several steps:
- Allocate PBL space in adapter memory for the full registration
- Copy PBL to adapter memory in chunks
- Allocate STag and enable memory region
This also allows several other cleanups to the __cxio_tpt_op()
interface and related parts of the driver.
This change leaves the reregister memory region and memory window
operations broken, but they already didn't work due to other
longstanding bugs, so fixing them will be left to a later patch.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB
chunks. This limits the largest single allocation that can be done to
the same size, which means that with 4 KB pages, each of which takes 8
bytes of PBL memory, the largest memory region that can be allocated
is 1 GB (256K PBL entries * 4 KB/entry).
Remove this limit by adding all the PBL memory in a single gen_pool
chunk, if possible. Add code that falls back to smaller chunks if
gen_pool_add() fails, which can happen if there is not sufficient
contiguous lowmem for the internal gen_pool bitmap.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Also remove duplicate assignment of local_ca_ack_delay and change
min_t check for local_ca_ack_delay to u8 instead of int.
Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Testing on large clusters shows its way too short at 10 secs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove bad BUG_ON() that can trigger in correct operation from
close_con_rpl(). It is possible to get a close_rpl message on a dead
connection. The sequence is:
- host refs ep for close exchange
- host posts close_req
- hw posts PEER_ABORT from incoming RST
- host marks ep DEAD
- host posts ABORT_RPL and releases ep resources
- hw posts CLOSE_RPL
- host derefs ep and ep freed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Flush the QP only after the HW disables the connection. Currently
we flush the QP when transitioning to CLOSING. This exposes a race
condition where the HW can complete a RECV WR, for instance, -and-
the SW can flush that same WR.
- Only call CQ event handlers on flush IFF we actually flushed something.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit f56bcd80 ("IPoIB: Use separate CQ for UD send completions")
introduced a bug where the transmit queue could get stopped and never
woken up. The problem is that send completions are only polled at the
end of the xmit function, so if the send queue fills up and the xmit
path stops the queue, then there is no way for send completions to
ever get polled, and so the transmit queue stays stopped forever.
Fix this by arming the send CQ just before posting the last send
request that fills the send queue. Then, when the completion event
handler is called, drain the send CQ. Since it is possible that not
enough send completions are in the CQ, verify that the the net queue
has been woken up after draining the send CQ, and if not arm a timer
and drain again at the timer function.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When I merged bbf8eed1 ("IB/mlx4: Add support for resizing CQs") I
changed things around so that mlx4_ib_alloc_cq_buf() and
mlx4_ib_free_cq_buf() were used everywhere they could be. However, I
screwed up the number of entries passed into mlx4_ib_alloc_cq_buf()
in a couple places -- the function bumps the number of entries
internally, so the caller shouldn't add 1 as well.
Passing a too-big value for the number of entries to mlx4_ib_free_cq_buf()
can cause the cleanup to go off the end of an array and corrupt
allocator state in interesting ways.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Various cleanups:
- Change // to /* .. */
- Place whitespace around binary operators.
- Trim down a few long lines.
- Some minor alignment formatting for better readability.
- Remove some silly tabs.
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch enables the iw_nes module for NetEffect RNICs to support
additional PHYs including SFP+ (referred to as ARGUS in the code).
Signed-off-by: Eric Schneider <eric.schneider@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When creating a child interface, copy the MTU information from the
parent. Otherwise when the child's multicast join completes, the MTU
will not be updated since the code does
dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
and priv->admin_mtu will be set to 0.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>