Commit Graph

54 Commits

Author SHA1 Message Date
Hariprasad S 1fc8190dd6 RDMA/cxgb4: Don't hang threads forever waiting on WR replies
In c4iw_wait_for_reply(), if a FW6_MSG WR reply is not received after
C4IW_WR_TO seconds, fail the WR operation and mark the device as fatally
dead.  Further, if the device is marked fatally dead, then fail the WR
wait immediately.

Also change the timeout to 60 seconds.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-02-18 08:33:15 -08:00
David S. Miller 8fd90bb889 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/infiniband/hw/cxgb4/device.c

The cxgb4 conflict was simply overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-22 00:44:59 -07:00
Hariprasad Shenai 91244bbd6b iw_cxgb4: Don't limit TPTE count to 32KB
Use the size advertised by FW

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-21 20:23:59 -07:00
Hariprasad Shenai 7730b4c7e3 cxgb4/iw_cxgb4: work request logging feature
This commit enhances the iwarp driver to optionally keep a log of rdma
work request timining data for kernel mode QPs.  If iw_cxgb4 module option
c4iw_wr_log is set to non-zero, each work request is tracked and timing
data maintained in a rolling log that is 4096 entries deep by default.
Module option c4iw_wr_log_size_order allows specifing a log2 size to use
instead of the default order of 12 (4096 entries). Both module options
are read-only and must be passed in at module load time to set them. IE:

modprobe iw_cxgb4 c4iw_wr_log=1 c4iw_wr_log_size_order=10

The timing data is viewable via the iw_cxgb4 debugfs file "wr_log".
Writing anything to this file will clear all the timing data.
Data tracked includes:

- The host time when the work request was posted, just before ringing
the doorbell.  The host time when the completion was polled by the
application.  This is also the time the log entry is created.  The delta
of these two times is the amount of time took processing the work request.

- The qid of the EQ used to post the work request.

- The work request opcode.

- The cqe wr_id field.  For sq completions requests this is the swsqe
index.  For recv completions this is the MSN of the ingress SEND.
This value can be used to match log entries from this log with firmware
flowc event entries.

- The sge timestamp value just before ringing the doorbell when
posting,  the sge timestamp value just after polling the completion,
and CQE.timestamp field from the completion itself.  With these three
timestamps we can track the latency from post to poll, and the amount
of time the completion resided in the CQ before being reaped by the
application.  With debug firmware, the sge timestamp is also logged by
firmware in its flowc history so that we can compute the latency from
posting the work request until the firmware sees it.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:25:16 -07:00
Hariprasad Shenai 4c2c576322 cxgb4/iw_cxgb4: use firmware ord/ird resource limits
Advertise a larger max read queue depth for qps, and gather the resource limits
from fw and use them to avoid exhaustinq all the resources.

Design:

cxgb4:

Obtain the max_ordird_qp and max_ird_adapter device params from FW
at init time and pass them up to the ULDs when they attach.  If these
parameters are not available, due to older firmware, then hard-code
the values based on the known values for older firmware.
iw_cxgb4:

Fix the c4iw_query_device() to report these correct values based on
adapter parameters.  ibv_query_device() will always return:

max_qp_rd_atom = max_qp_init_rd_atom = min(module_max, max_ordird_qp)
max_res_rd_atom = max_ird_adapter

Bump up the per qp max module option to 32, allowing it to be increased
by the user up to the device max of max_ordird_qp.  32 seems to be
sufficient to maximize throughput for streaming read benchmarks.

Fail connection setup if the negotiated IRD exhausts the available
adapter ird resources.  So the driver will track the amount of ird
resource in use and not send an RI_WR/INIT to FW that would reduce the
available ird resources below zero.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:25:16 -07:00
Hariprasad Shenai 04e10e2164 iw_cxgb4: Detect Ing. Padding Boundary at run-time
Updates iw_cxgb4 to determine the Ingress Padding Boundary from
cxgb4_lld_info, and take subsequent actions.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:25:16 -07:00
Steve Wise 46c1376db1 RDMA/cxgb4: Call iwpm_init() only once
We need to only register with the iwpm core once.  Currently it is
being done for every adapter, which causes a failure for each adapter
but the first, making multiple adapters unusable.

Fixes: 9eccfe109b ("RDMA/cxgb4: Add support for iWARP Port Mapper user space service")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-07-13 10:00:54 -07:00
Linus Torvalds f9da455b93 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

 4) BPF now has a "random" opcode, from Chema Gonzalez.

 5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

 6) Support TCP fastopen over ipv6, from Daniel Lee.

 7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers.  From Ezequiel Garcia.

 8) Support software TSO in fec driver too, from Nimrod Andy.

 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

13) Support busy polling in SCTP, from Neal Horman.

14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

15) Bridge promisc mode handling improvements from Vlad Yasevich.

16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
  rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
  tcp: fixing TLP's FIN recovery
  net: fec: Add software TSO support
  net: fec: Add Scatter/gather support
  net: fec: Increase buffer descriptor entry number
  net: fec: Factorize feature setting
  net: fec: Enable IP header hardware checksum
  net: fec: Factorize the .xmit transmit function
  bridge: fix compile error when compiling without IPv6 support
  bridge: fix smatch warning / potential null pointer dereference
  via-rhine: fix full-duplex with autoneg disable
  bnx2x: Enlarge the dorq threshold for VFs
  bnx2x: Check for UNDI in uncommon branch
  bnx2x: Fix 1G-baseT link
  bnx2x: Fix link for KR with swapped polarity lane
  sctp: Fix sk_ack_backlog wrap-around problem
  net/core: Add VF link state control policy
  net/fsl: xgmac_mdio is dependent on OF_MDIO
  net/fsl: Make xgmac_mdio read error message useful
  net_sched: drr: warn when qdisc is not work conserving
  ...
2014-06-12 14:27:40 -07:00
Hariprasad Shenai b408ff282d iw_cxgb4: don't truncate the recv window size
Fixed a bug that shows up with recv window sizes that exceed the size of
the RCV_BUFSIZ field in opt0 (>= 1024K).  If the recv window exceeds
this, then we specify the max possible in opt0, add add the rest in via
a RX_DATA_ACK credits.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-10 22:49:54 -07:00
Steve Wise 9eccfe109b RDMA/cxgb4: Add support for iWARP Port Mapper user space service
Based on original work by Vipul Pandya.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>

[ Fix htons -> ntohs to make sparse happy.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-06-10 10:12:06 -07:00
Steve Wise cc18b939e1 RDMA/cxgb4: Fix endpoint mutex deadlocks
In cases where the cm calls c4iw_modify_rc_qp() with the endpoint
mutex held, they must be called with internal == 1.  rx_data() and
process_mpa_reply() are not doing this.  This causes a deadlock
because c4iw_modify_rc_qp() might call c4iw_ep_disconnect() in some
!internal cases, and c4iw_ep_disconnect() acquires the endpoint mutex.
The design was intended to only do the disconnect for !internal calls.

Change rx_data(), FPDU_MODE case, to call c4iw_modify_rc_qp() with
internal == 1, and then disconnect only after releasing the mutex.

Change process_mpa_reply() to call c4iw_modify_rc_qp(TERMINATE) with
internal == 1 and set a new attr flag telling it to send a TERMINATE
message.  Previously this was implied by !internal.

Change process_mpa_reply() to return whether the caller should
disconnect after releasing the endpoint mutex.  Now rx_data() will do
the disconnect in the cases where process_mpa_reply() wants to
disconnect after the TERMINATE is sent.

Change c4iw_modify_rc_qp() RTS->TERM to only disconnect if !internal,
and to send a TERMINATE message if attrs->send_term is 1.

Change abort_connection() to not aquire the ep mutex for setting the
state, and make all calls to abort_connection() do so with the mutex
held.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-04-28 17:29:41 -07:00
Steve Wise fa658a98a2 RDMA/cxgb4: Use the BAR2/WC path for kernel QPs and T5 devices
Signed-off-by: Steve Wise <swise@opengridcomputing.com>

[ Fix cast from u64* to integer.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-04-11 11:36:01 -07:00
Linus Torvalds 877f075aac Main batch of InfiniBand/RDMA changes for 3.15:
- The biggest change is core API extensions and mlx5 low-level driver
    support for handling DIF/DIX-style protection information, and the
    addition of PI support to the iSER initiator.  Target support will be
    arriving shortly through the SCSI target tree.
 
  - A nice simplification to the "umem" memory pinning library now that
    we have chained sg lists.  Kudos to Yishai Hadas for realizing our
    code didn't have to be so crazy.
 
  - Another nice simplification to the sg wrappers used by qib, ipath and
    ehca to handle their mapping of memory to adapter.
 
  - The usual batch of fixes to bugs found by static checkers etc. from
    intrepid people like Dan Carpenter and Yann Droneaud.
 
  - A large batch of cxgb4, ocrdma, qib driver updates.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJTPYBnAAoJEENa44ZhAt0hGI4P/29eotGwpkANUQE6FQvxCUL2
 CXJtSg52lmYvGJrPK4IhihpbtQmHJz3iXEzlOOWidTw1dJgObR6vFaRymh7+vDLs
 CdzybMcXdasarqTuYeJbFzhkimpwtWWrMy/8Ik/Jj/5glGQ6cUSpdYZzVtFhYNqf
 hCGE8iLi+tuekJJj1htut5D6apXM7udcdc2yLJNOdsSj/VUXt1oqG1x9xAi9R8Tq
 7o8eFSStdlja0EBQ6Hli2zauCSnQkaUtr8h6EAFbcCtvBK8HqsHSc2gfq2ViFUiN
 ztt167oWoQnVkR0qCPL5nVt+CRQHHROprVXvbpcTI3aW61gNIl6OrUUOXefzHXac
 TNi+fdMpiEB/JQ4Z04Jzd1dGCSjYeTqPj4rO4meFjBmxRDdTgZHu7FWwejT1nYJ5
 d2abVdCOT+QWlIlM7m/pjdWJII5OYM+4/jtTayGepEaR4fTUzKtPZPBLNUBDBKE+
 4f92PC8LiuPkwJgb6XT96onPz1bDCOnPSEdwoKUFKPeGUcwgVOM/Wx5NU4Yf7rfg
 RxQwZ7mJXbjCYFlmGGo/0QDy6UEGkIFYlJSzooP+wlK1JvZ5h2M+9QKX2FtwzR+R
 I2kBxcTXWsM/h88R7MkNqbNIllmhssrJwmAE46OneZbfoBOB+JZjb4nLRTu0jEcS
 zn6f16GmJ37BKn2/qYY/
 =Ww6H
 -----END PGP SIGNATURE-----

Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull infiniband updates from Roland Dreier:
 "Main batch of InfiniBand/RDMA changes for 3.15:

   - The biggest change is core API extensions and mlx5 low-level driver
     support for handling DIF/DIX-style protection information, and the
     addition of PI support to the iSER initiator.  Target support will
     be arriving shortly through the SCSI target tree.

   - A nice simplification to the "umem" memory pinning library now that
     we have chained sg lists.  Kudos to Yishai Hadas for realizing our
     code didn't have to be so crazy.

   - Another nice simplification to the sg wrappers used by qib, ipath
     and ehca to handle their mapping of memory to adapter.

   - The usual batch of fixes to bugs found by static checkers etc.
     from intrepid people like Dan Carpenter and Yann Droneaud.

   - A large batch of cxgb4, ocrdma, qib driver updates"

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (102 commits)
  RDMA/ocrdma: Unregister inet notifier when unloading ocrdma
  RDMA/ocrdma: Fix warnings about pointer <-> integer casts
  RDMA/ocrdma: Code clean-up
  RDMA/ocrdma: Display FW version
  RDMA/ocrdma: Query controller information
  RDMA/ocrdma: Support non-embedded mailbox commands
  RDMA/ocrdma: Handle CQ overrun error
  RDMA/ocrdma: Display proper value for max_mw
  RDMA/ocrdma: Use non-zero tag in SRQ posting
  RDMA/ocrdma: Memory leak fix in ocrdma_dereg_mr()
  RDMA/ocrdma: Increment abi version count
  RDMA/ocrdma: Update version string
  be2net: Add abi version between be2net and ocrdma
  RDMA/ocrdma: ABI versioning between ocrdma and be2net
  RDMA/ocrdma: Allow DPP QP creation
  RDMA/ocrdma: Read ASIC_ID register to select asic_gen
  RDMA/ocrdma: SQ and RQ doorbell offset clean up
  RDMA/ocrdma: EQ full catastrophe avoidance
  RDMA/cxgb4: Disable DSGL use by default
  RDMA/cxgb4: rx_data() needs to hold the ep mutex
  ...
2014-04-03 16:57:19 -07:00
Steve Wise eda6d1d1b7 RDMA/cxgb4: Save the correct map length for fast_reg_page_lists
We cannot save the mapped length using the rdma max_page_list_len field
of the ib_fast_reg_page_list struct because the core code uses it.  This
results in an incorrect unmap of the page list in c4iw_free_fastreg_pbl().

I found this with dma mapping debugging enabled in the kernel.  The
fix is to save the length in the c4iw_fr_page_list struct.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-03-20 10:01:30 -07:00
Steve Wise ba32de9d8d RDMA/cxgb4: Mind the sq_sig_all/sq_sig_type QP attributes
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-03-20 10:01:30 -07:00
Steve Wise 05eb23893c cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes
The current logic suffers from a slow response time to disable user DB
usage, and also fails to avoid DB FIFO drops under heavy load. This commit
fixes these deficiencies and makes the avoidance logic more optimal.
This is done by more efficiently notifying the ULDs of potential DB
problems, and implements a smoother flow control algorithm in iw_cxgb4,
which is the ULD that puts the most load on the DB fifo.

Design:

cxgb4:

Direct ULD callback from the DB FULL/DROP interrupt handler.  This allows
the ULD to stop doing user DB writes as quickly as possible.

While user DB usage is disabled, the LLD will accumulate DB write events
for its queues.  Then once DB usage is reenabled, a single DB write is
done for each queue with its accumulated write count.  This reduces the
load put on the DB fifo when reenabling.

iw_cxgb4:

Instead of marking each qp to indicate DB writes are disabled, we create
a device-global status page that each user process maps.  This allows
iw_cxgb4 to only set this single bit to disable all DB writes for all
user QPs vs traversing the idr of all the active QPs.  If the libcxgb4
doesn't support this, then we fall back to the old approach of marking
each QP.  Thus we allow the new driver to work with an older libcxgb4.

When the LLD upcalls iw_cxgb4 indicating DB FULL, we disable all DB writes
via the status page and transition the DB state to STOPPED.  As user
processes see that DB writes are disabled, they call into iw_cxgb4
to submit their DB write events.  Since the DB state is in STOPPED,
the QP trying to write gets enqueued on a new DB "flow control" list.
As subsequent DB writes are submitted for this flow controlled QP, the
amount of writes are accumulated for each QP on the flow control list.
So all the user QPs that are actively ringing the DB get put on this
list and the number of writes they request are accumulated.

When the LLD upcalls iw_cxgb4 indicating DB EMPTY, which is in a workq
context, we change the DB state to FLOW_CONTROL, and begin resuming all
the QPs that are on the flow control list.  This logic runs on until
the flow control list is empty or we exit FLOW_CONTROL mode (due to
a DB DROP upcall, for example).  QPs are removed from this list, and
their accumulated DB write counts written to the DB FIFO.  Sets of QPs,
called chunks in the code, are removed at one time. The chunk size is 64.
So 64 QPs are resumed at a time, and before the next chunk is resumed, the
logic waits (blocks) for the DB FIFO to drain.  This prevents resuming to
quickly and overflowing the FIFO.  Once the flow control list is empty,
the db state transitions back to NORMAL and user QPs are again allowed
to write directly to the user DB register.

The algorithm is designed such that if the DB write load is high enough,
then all the DB writes get submitted by the kernel using this flow
controlled approach to avoid DB drops.  As the load lightens though, we
resume to normal DB writes directly by user applications.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14 22:44:11 -04:00
Steve Wise 1cf24dcef4 RDMA/cxgb4: Fix QP flush logic
This patch makes following fixes in QP flush logic:

- correctly flushes unsignaled WRs followed by a signaled WR
- supports for flushing a CQ bound to multiple QPs
- resets cidx_flush if a active queue starts getting HW CQEs again
- marks WQ in error when we leave RTS. This was only being done for
  user queues, but we need it for kernel queues too so that
  post_send/post_recv will start returning the appropriate error
  synchronously
- eats unsignaled read resp CQEs. HW always inserts CQEs so we must
  silently discard them if the read work request was unsignaled.
- handles QP flushes with pending SW CQEs. The flush and out of order
  completion logic has a bug where if out of order completions are
  flushed but not yet polled by the consumer and the qp is then
  flushed then we end up inserting duplicate completions.
- c4iw_flush_sq() should only flush wrs that have not already been
  flushed.  Since we already track where in the SQ we've flushed via
  sq.cidx_flush, just start at that point and flush any remaining.
  This bug only caused a problem in the presence of unsignaled work
  requests.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>

[ Fixed sparse warning due to htonl/ntohl confusion.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:45 -07:00
Vipul Pandya 830662f6f0 RDMA/cxgb4: Add support for active and passive open connection with IPv6 address
Add new cpl messages, cpl_act_open_req6 and cpl_t5_act_open_req6, for
initiating active open connections.

Use LLD api cxgb4_create_server and cxgb4_create_server6 for
initiating passive open connections. Similarly use cxgb4_remove_server
to remove the passive open connections in place of listen_stop.

Add support for iWARP over VLAN device and enable IPv6 support on VLAN device.

Make use of import_ep in c4iw_reconnect.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>

[ Fix build when IPv6 is disabled and make sure iw_cxgb4 is not built-in
  when ipv6 is a module.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:06 -07:00
Vipul Pandya 3b174d942c RDMA/cxgb4: Bump tcam_full stat and WR reply timeout
Always bump the tcam_full stat. Also, bump wr reply timeout to 30 seconds.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-14 11:35:59 -04:00
Vipul Pandya 42b6a94990 RDMA/cxgb4: Use DSGLs for fastreg and adapter memory writes for T5.
It enables direct DMA by HW to memory region PBL arrays and fast register PBL
arrays from host memory, vs the T4 way of passing these arrays in the WR itself.
The result is lower latency for memory registration, and larger PBL array
support for fast register operations.

This patch also updates ULP_TX_MEM_WRITE command fields for T5. Ordering bit of
ULP_TX_MEM_WRITE is at bit position 22 in T5 and at 23 in T4.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-14 11:35:59 -04:00
Vipul Pandya 80ccdd6051 RDMA/cxgb4: Add module_params to enable DB FC & Coalescing on T5
Both DB Flow-Control and DB Coalescing are disabled by default on T5

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-14 11:35:58 -04:00
Vipul Pandya f079af7a11 RDMA/cxgb4: Add Support for Chelsio T5 adapter
Adds support for Chelsio T5 adapter.
Enables T5's Write Combining feature.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-14 11:35:58 -04:00
Tejun Heo e8d4dd606b IB/cxgb4: convert to idr_alloc()
Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:16 -08:00
Roland Dreier ef4e359d9b Merge branches 'core', 'cxgb4', 'ipoib', 'iser', 'misc', 'mlx4', 'qib' and 'srp' into for-next 2013-02-26 09:17:56 -08:00
Shani Michaeli 7083e42ee2 IB/core: Add "type 2" memory windows support
This patch enhances the IB core support for Memory Windows (MWs).

MWs allow an application to have better/flexible control over remote
access to memory.

Two types of MWs are supported, with the second type having two flavors:

    Type 1  - associated with PD only
    Type 2A - associated with QPN only
    Type 2B - associated with PD and QPN

Applications can allocate a MW once, and then repeatedly bind the MW
to different ranges in MRs that are associated to the same PD. Type 1
windows are bound through a verb, while type 2 windows are bound by
posting a work request.

The 32-bit memory key is composed of a 24-bit index and an 8-bit
key. The key is changed with each bind, thus allowing more control
over the peer's use of the memory key.

The changes introduced are the following:

* add memory window type enum and a corresponding parameter to ib_alloc_mw.
* type 2 memory window bind work request support.
* create a struct that contains the common part of the bind verb struct
  ibv_mw_bind and the bind work request into a single struct.
* add the ib_inc_rkey helper function to advance the tag part of an rkey.

Consumer interface details:

* new device capability flags IB_DEVICE_MEM_WINDOW_TYPE_2A and
  IB_DEVICE_MEM_WINDOW_TYPE_2B are added to indicate device support
  for these features.

  Devices can set either IB_DEVICE_MEM_WINDOW_TYPE_2A or
  IB_DEVICE_MEM_WINDOW_TYPE_2B if it supports type 2A or type 2B
  memory windows. It can set neither to indicate it doesn't support
  type 2 windows at all.

* modify existing provides and consumers code to the new param of
  ib_alloc_mw and the ib_mw_bind_info structure

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-02-21 11:51:45 -08:00
Vipul Pandya 1ec779cc29 RDMA/cxgb4: Fix endpoint timeout race condition
The endpoint timeout logic had a race that could cause an endpoint
object to be freed while it was still on the timedout list.  This
can happen if the timer is stopped after it had fired, but before
the timedout thread processed the endpoint timeout.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-02-14 15:51:57 -08:00
Vipul Pandya 325abead6c RDMA/cxgb4: Keep QP referenced until TID released
The driver is currently releasing the last ref on the QP too early.
This can cause bus errors due to HW still fetching WRs from the HW
queue.  The fix is to keep a qp ref until we release the HW TID.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-02-14 15:51:56 -08:00
Vipul Pandya 793dad94e7 RDMA/cxgb4: Fix bug for active and passive LE hash collision path
Retries active opens for INUSE errors.

Logs any active ofld_connect_wr error replies.

Sends ofld_connect_wr on same ctrlq. It needs to go  on the same control txq as
regular CPL active/passive messages.

Retries on active open replies with EADDRINUSE.

Uses active open fw wr only if active filter region is set.

Adds stat for ofld_connect_wr failures.

This patch also adds debugfs file to show endpoints.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-12-19 23:03:12 -08:00
Vipul Pandya 5be78ee924 RDMA/cxgb4: Fix LE hash collision bug for active open connection
It enables establishing active open connection using fw_ofld_connection work
request when cpl_act_open_rpl says TCAM full error which may be because
of LE hash collision. Current support is only for IPv4 active open connections.

Sets ntuple bits in active open requests. For T4 firmware greater than 1.4.10.0
ntuple bits are required to be set.

Adds nocong and enable_ecn module parameter options.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>

[ Move all FW return values to t4fw_api.h.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-12-19 23:02:43 -08:00
Vipul Pandya 67bbc05512 RDMA/cxgb4: Add query_qp support
This allows querying the QP state before flushing.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-18 13:22:37 -07:00
Vipul Pandya ec3eead217 RDMA/cxgb4: Remove kfifo usage
Using kfifos for ID management was limiting the number of QPs and
preventing NP384 MPI jobs.  So replace it with a simple bitmap
allocator.

Remove IDs from the IDR tables before deallocating them.  This bug was
causing the BUG_ON() in insert_handle() to fire because the ID was
getting reused before being removed from the IDR table.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-18 13:22:36 -07:00
Vipul Pandya 422eea0a8c RDMA/cxgb4: DB Drop Recovery for RDMA and LLD queues
Add module option db_fc_threshold which is the count of active QPs
that trigger automatic db flow control mode.  Automatically transition
to/from flow control mode when the active qp count crosses
db_fc_theshold.

Add more db debugfs stats

On DB DROP event from the LLD, recover all the iwarp queues.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-18 13:22:33 -07:00
Vipul Pandya 4984037bef RDMA/cxgb4: Disable interrupts in c4iw_ev_dispatch()
Use GFP_ATOMIC in _insert_handle() if ints are disabled.

Don't panic if we get an abort with no endpoint found.  Just log a
warning.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-18 13:22:32 -07:00
Vipul Pandya 2c97478106 RDMA/cxgb4: Add DB Overflow Avoidance
Get FULL/EMPTY/DROP events from LLD.  On FULL event, disable normal
user mode DB rings.

Add modify_qp semantics to allow user processes to call into the
kernel to ring doobells without overflowing.

Add DB Full/Empty/Drop stats.

Mark queues when created indicating the doorbell state.

If we're in the middle of db overflow avoidance, then newly created
queues should start out in this mode.

Bump the C4IW_UVERBS_ABI_VERSION to 2 so the user mode library can
know if the driver supports the kernel mode db ringing.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-18 13:22:31 -07:00
Vipul Pandya 8d81ef34b2 RDMA/cxgb4: Add debugfs RDMA memory stats
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-05-18 13:22:29 -07:00
Roland Dreier 504255f8d0 Merge branches 'amso1100', 'cma', 'cxgb3', 'cxgb4', 'fdr', 'ipath', 'ipoib', 'misc', 'mlx4', 'misc', 'nes', 'qib' and 'xrc' into for-next 2011-11-01 09:37:08 -07:00
Kumar Sanghvi 581bbe2cd0 RDMA/cxgb4: Serialize calls to CQ's comp_handler
Commit 01e7da6ba5 ("RDMA/cxgb4: Make sure flush CQ entries are
collected on connection close") introduced a potential problem where a
CQ's comp_handler can get called simultaneously from different places
in the iw_cxgb4 driver.  This does not comply with
Documentation/infiniband/core_locking.txt, which states that at a
given point of time, there should be only one callback per CQ should
be active.

This problem was reported by Parav Pandit <Parav.Pandit@Emulex.Com>.
Based on discussion between Parav Pandit and Steve Wise, this patch
fixes the above problem by serializing the calls to a CQ's
comp_handler using a spin_lock.

Reported-by: Parav Pandit <Parav.Pandit@Emulex.Com>
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-31 11:34:53 -07:00
Kumar Sanghvi d2fe99e86b RDMA/cxgb4: Add support for MPAv2 Enhanced RDMA Negotiation
This patch adds support for Enhanced RDMA Connection Establishment
(draft-ietf-storm-mpa-peer-connect-06), aka MPAv2.  Details of draft
can be obtained from:
<http://www.ietf.org/id/draft-ietf-storm-mpa-peer-connect-06.txt>

The patch updates the following functions for initiator perspective:
 - send_mpa_request
 - process_mpa_reply
 - post_terminate for TERM error codes
 - destroy_qp for TERM related change
 - adds layer/etype/ecode to c4iw_qp_attrs for sending with TERM
 - peer_abort for retrying connection attempt with MPA_v1 message
 - added c4iw_reconnect function

The patch updates the following functions for responder perspective:
 - process_mpa_request
 - send_mpa_reply
 - c4iw_accept_cr
 - passes ird/ord to upper layers

Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-10-06 09:39:24 -07:00
Steve Wise c337374bf2 RDMA/cxgb4: Use completion objects for event blocking
There exists a race condition when using wait_queue_head_t objects
that are declared on the stack.  This was being done in a few places
where we are sending work requests to the FW and awaiting replies, but
we don't have an endpoint structure with an embedded c4iw_wr_wait
struct.  So the code was allocating it locally on the stack.  Bad
design.  The race is:

  1) thread on cpuX declares the wait_queue_head_t on the stack, then
     posts a firmware WR with that wait object ptr as the cookie to be
     returned in the WR reply.  This thread will proceed to block in
     wait_event_timeout() but before it does:

  2) An interrupt runs on cpuY with the WR reply.  fw6_msg() handles
     this and calls c4iw_wake_up().  c4iw_wake_up() sets the condition
     variable in the c4iw_wr_wait object to TRUE and will call
     wake_up(), but before it calls wake_up():

  3) The thread on cpuX calls c4iw_wait_for_reply(), which calls
     wait_event_timeout().  The wait_event_timeout() macro checks the
     condition variable and returns immediately since it is TRUE.  So
     this thread never blocks/sleeps. The function then returns
     effectively deallocating the c4iw_wr_wait object that was on the
     stack.

  4) So at this point cpuY has a pointer to the c4iw_wr_wait object
     that is no longer valid.  Further its pointing to a stack frame
     that might now be in use by some other context/thread.  So cpuY
     continues execution and calls wake_up() on a ptr to a wait object
     that as been effectively deallocated.

This race, when it hits, can cause a crash in wake_up(), which I've
seen under heavy stress. It can also corrupt the referenced stack
which can cause any number of failures.

The fix:

Use struct completion, which supports on-stack declarations.
Completions use a spinlock around setting the condition to true and
the wake up so that steps 2 and 4 above are atomic and step 3 can
never happen in-between.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
2011-05-24 09:47:38 -07:00
Steve Wise 2f25e9a540 RDMA/cxgb4: EEH errors can hang the driver
A few more EEH fixes:

c4iw_wait_for_reply(): detect fatal EEH condition on timeout and
return an error.

The iw_cxgb4 driver was only calling ib_deregister_device() on an EEH
event followed by a ib_register_device() when the device was
reinitialized.  However, the RDMA core doesn't allow multiple
iterations of register/deregister by the provider. See
drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs() where
the kobject ref is held until the device is deallocated in
ib_deallocate_device().  Calling deregister adds this kobj reference,
and then a subsequent register call will generate a WARN_ON() from the
kobject subsystem because the kobject is being initialized but is
already initialized with the ref held.

So the provider must deregister and dealloc when resetting for an EEH
event, then alloc/register to re-initialize.  To do this, we cannot
use the device ptr as our ULD handle since it will change with each
reallocation.  This commit adds a ULD context struct which is used as
the ULD handle, and then contains the device pointer and other state
needed.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09 22:06:23 -07:00
Steve Wise d9594d990a RDMA/cxgb4: Reset wait condition atomically
The driver was never really waiting for RDMA_WR/FINI completions
because the condition variable used to determine if the completion
happened was never reset, and this condition variable is reused for
both connection setup and teardown.  This causes various driver
crashes under heavy loads due to releasing resources too early.

The fix is to use atomic bits to correctly reset the condition
immediately after the completion is detected.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09 22:06:22 -07:00
Steve Wise 30c95c2d49 RDMA/cxgb4: Don't change QP state outside EP lock
Concurrent ingress CLOSE and ULP ABORT operations causes a crash due
to a race condition where the close path releases the EP lock and then
tries to move the QP state to CLOSED.  This must be done inside the EP
lock to avoid the race.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-05-09 22:06:22 -07:00
Steve Wise 2942813739 RDMA/cxgb4: Remove db_drop_task
Unloading iw_cxgb4 can crash due to the unload code trying to use
db_drop_task, which is uninitialized.  So remove this dead code.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2011-03-14 12:09:10 -07:00
Linus Torvalds 008d23e485 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
 - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
 - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13 10:05:56 -08:00
Stephen Hemminger c943109163 RDMA/cxgb3,cxgb4: Remove dead code
This removes unused code found by running 'make namespacecheck';
compile tested only.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2011-01-10 17:41:43 -08:00
Jesper Juhl e987fa357a infiniband: Only include mutex.h once in drivers/infiniband/hw/cxgb4/iw_cxgb4.h
Only include the header linux/mutex.h once inside
drivers/infiniband/hw/cxgb4/iw_cxgb4.h

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-15 14:35:19 +01:00
Steve Wise 2f5b48c3ad RDMA/cxgb4: Use a mutex for QP and EP state transitions
Move the connection setup/teardown paths to the workq thread removing
spin lock/irq disable requirements for these paths.  This allows calls
down to the LLD for EP and QP state transition actions to be atomic
with respect to processing CPL messages coming up from the HW.
Namely, calls to rdma_init() and rdma_fini() can now be called with
the mutex held avoiding many race conditions with the abort path.

The QP spinlock is still used but only to manipulate the qp state.  This
allows the fastpaths, poll, post_send, and pos_recv, to run in the
irq context.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-09-28 10:53:48 -07:00
Steve Wise c6d7b26791 RDMA/cxgb4: Support on-chip SQs
T4 support on-chip SQs to reduce latency.  This patch adds support for
this in iw_cxgb4:

 - Manage ocqp memory like other adapter mem resources.
 - Allocate user mode SQs from ocqp mem if available.
 - Map ocqp mem to user process using write combining.
 - Map PCIE_MA_SYNC reg to user process.

Bump uverbs ABI.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-09-28 10:46:35 -07:00
Steve Wise aadc4df308 RDMA/cxgb4: Centralize the wait logic
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-09-28 10:46:34 -07:00
Steve Wise d4f1a5c6ef RDMA/cxgb4: Use correct control txq
There is only one control txq per tx channel.  So use the port number
as the queue index when sending.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-08-02 21:06:12 -07:00