The poll and flush code needs to handle all send opcodes: SEND,
SEND_WITH_SE, SEND_WITH_INV, and SEND_WITH_SE_INV.
Ignore TERM indications if the connection already gone.
Ignore HW receive completions if the RQ is empty.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The variable 'offset' in iwch_sgl2pbl_map() needs to be a u64.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Freeze activity when notified that the underlying chip
is getting reset on a EEH event or fatal error.
Signed-off-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The base versions handle constant folding just fine, use them
directly. The replacements are OK in the include/ files as they are
not exported to userspace so we don't need the __ prefixed versions.
This patch does not affect code generation at all.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the iw_cxgb3 module's cxgb3_client "add" func gets called by the
cxgb3 module, the iwarp driver ends up calling the ethtool ops
get_drvinfo function in cxgb3 to get the fw version and other info.
Currently the iwarp driver grabs the rtnl lock around this down call
to serialize. As of 2.6.27 or so, things changed such that the rtnl
lock is held around the call to the netdev driver open function. Also
the cxgb3_client "add" function doesn't get called if the device is
down.
So, if you load cxgb3, then load iw_cxgb3, then ifconfig up the
device, the iw_cxgb3 add func gets called with the rtnl_lock held. If
you load cxgb3, ifconfig up the device, then load iw_cxgb3, the add
func gets called without the rtnl_lock held. The former causes the
deadlock, the latter does not.
In addition, there are iw_cxgb3 sysfs handlers that also can call down
into cxgb3 to gather the fw and hw versions. These can be called
concurrently on different processors and at any time. Thus we need to
push this serialization down in the cxgb3 driver get_drvinfo func.
The fix is to remove rtnl lock usage, and use a per-device lock in cxgb3.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The array wqe->read.reserved has only two entries, but
iwch_post_zb_read() sets [0], [1], and [2], which is one too many.
This is harmless since it runs into the next field, rem_stag, which is
initialized correctly immediately after, but we might as well get
things right, especially since it makes the code smaller.
This was spotted by the Coverity checker (CID 2475).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
The error path in iwch_connect() can fail to drop the cmid reference,
which will cause the process to hang when destroying the cmid.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When running ibv_devinfo, the active_mtu returned is garbage. This is
due to the field not being populated in the query_port function in the
driver. The patch below populates the active_mtu field with a MTU of
2k. It also zeros the struct, so that any new additions to it will
return 0.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Running 'ifconfig up' on the cxgb3 interface with iw_cxgb3 loaded
causes a deadlock. The rtnl lock is already held in this path. The
function fw_supports_fastreg() was introduced in 2.6.27 to
conditionally set the IB_DEVICE_MEM_MGT_EXTENSIONS bit iff the
firmware was at 7.0 or greater, and this function also acquires the
rtnl lock and which thus causes a deadlock. Further, if iw_cxgb3 is
loaded _after_ the nic interface is brought up, then the deadlock does
not occur and therefore fw_supports_fastreg() does need to grab the
rtnl lock in that path.
It turns out this code is all useless anyway. The low level driver
will NOT allow the open if the firmware isn't 7.0, so iw_cxgb3 can
always set the MEM_MGT_EXTENSIONS bit. Simplify...
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- MWs don't have local read/write permissions.
- Set the MW_BIND enabled bit if a MR has MW_BIND access.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Set the stag0 and fastreg capability bits only for kernel qps.
- QP_PRIV flag is no longer used, so don't set it.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Handling the zero STag in receive work request requires some extra
logic in the driver:
- Only set the QP_PRIV bit for kernel mode QPs.
- Add a zero STag build function for recv wrs. The uP needs a PBL
allocated and passed down in the recv WR so it can construct a HW
PBL for the zero STag S/G entries. Note: we need to place a few
restrictions on zero STag usage because of this:
1) all SGEs in a recv WR must either be zero STag or not. No mixing.
2) an individual SGE length cannot exceed 128MB for a zero-stag SGE.
This should be OK since it's not really practical to allocate
such a large chunk of pinned contiguous DMA mapped memory.
- Add an optimized non-zero-STag recv wr format for kernel users.
This is needed to optimize both zero and non-zero STag cracking in
the recv path for kernel users.
- Remove the iwch_ prefix from the static build functions.
- Bump required FW version.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
- Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
STag support and IB HCAs to indicate reserved L_Key support.
- Add a u32 local_dma_lkey member to struct ib_device. Drivers fill
this in with the appropriate local DMA L_Key (if they support it).
- Fix up the drivers using this flag.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxgb3 does not currently report the page size capabilities, and
incorrectly reports them internally.
This version changes the bit-shifting to a static value (per Steve's
request).
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Add a new rdma ctl command called RDMA_GET_MIB to the cxgb3 low
level driver to obtain the protocol mib from the rnic hardware.
- Add new iw_cxgb3 provider method to get the MIB from the low level
driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The members struct iwch_rnic_attributes.vendor_id and .vendor_part_id
are write-only, so we might as well get rid of them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
- set fw_ver
- set hw_ver
- set max_qp_wr to something reasonable
- set max_cqe to something reasonable
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit if fw supports it.
- set max_fast_reg_page_list_len device attribute.
- add iwch_alloc_fast_reg_mr function.
- add iwch_alloc_fastreg_pbl
- add iwch_free_fastreg_pbl
- adjust the WQ depth for kernel mode work queues to account for
fastreg possibly taking 2 WR slots.
- add fastreg_mr work request support.
- add local_inv work request support.
- add send_with_inv and send_with_se_inv work request support.
- removed useless duplicate enums/defines for TPT/MW/MR stuff.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The change to iwch_provider.c in commit f4e91eb4 ("IB: convert struct
class_device to struct device") undid the fix done in commit 7f049f2f
("RDMA/cxgb3: Hold rtnl_lock() around ethtool get_drvinfo call"). It
removed the calls to rtnl_lock() that serialized the iw_cxgb3 ethtool
ops calls into the cxgb3 driver. This locking is needed to avoid
messing up the internal state of the cxgb3 driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'iwch_post_send':
drivers/infiniband/hw/cxgb3/iwch_qp.c:232: warning: 't3_wr_flit_cnt' may be used uninitialized in this function
This is what akpm describes as "the dopey
gcc-doesn't-know-that-foo(&var)-writes-to-var problem."
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
cxio_flush_sq() was failing to wrap around the software send queue
causing garbage completion entries on a flush operation.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently, iw_cxgb3 is severely limited on the amount of userspace
memory that can be registered in in a single memory region, which
causes big problems for applications that expect to be able to
register 100s of MB.
The problem is that the driver uses a single kmalloc()ed buffer to
hold the physical buffer list (PBL) for the entire memory region
during registration, which means that 8 bytes of contiguous memory are
required for each page of memory being registered. For example, a 64
MB registration will require 128 KB of contiguous memory with 4 KB
pages, and it unlikely that such an allocation will succeed on a busy
system.
This is purely a driver problem: the temporary page list buffer is not
needed by the hardware, so we can fix this by writing the PBL to the
hardware in page-sized chunks rather than all at once. We do this by
splitting the memory registration operation up into several steps:
- Allocate PBL space in adapter memory for the full registration
- Copy PBL to adapter memory in chunks
- Allocate STag and enable memory region
This also allows several other cleanups to the __cxio_tpt_op()
interface and related parts of the driver.
This change leaves the reregister memory region and memory window
operations broken, but they already didn't work due to other
longstanding bugs, so fixing them will be left to a later patch.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB
chunks. This limits the largest single allocation that can be done to
the same size, which means that with 4 KB pages, each of which takes 8
bytes of PBL memory, the largest memory region that can be allocated
is 1 GB (256K PBL entries * 4 KB/entry).
Remove this limit by adding all the PBL memory in a single gen_pool
chunk, if possible. Add code that falls back to smaller chunks if
gen_pool_add() fails, which can happen if there is not sufficient
contiguous lowmem for the internal gen_pool bitmap.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Testing on large clusters shows its way too short at 10 secs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove bad BUG_ON() that can trigger in correct operation from
close_con_rpl(). It is possible to get a close_rpl message on a dead
connection. The sequence is:
- host refs ep for close exchange
- host posts close_req
- hw posts PEER_ABORT from incoming RST
- host marks ep DEAD
- host posts ABORT_RPL and releases ep resources
- hw posts CLOSE_RPL
- host derefs ep and ep freed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Flush the QP only after the HW disables the connection. Currently
we flush the QP when transitioning to CLOSING. This exposes a race
condition where the HW can complete a RECV WR, for instance, -and-
the SW can flush that same WR.
- Only call CQ event handlers on flush IFF we actually flushed something.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Open MPI, Intel MPI and other applications don't respect the iWARP
requirement that the client (active) side of the connection send the
first RDMA message. This class of application connection setup is
called peer-to-peer. Typically once the connection is setup, _both_
sides want to send data.
This patch enables supporting peer-to-peer over the chelsio RNIC by
enforcing this iWARP requirement in the driver itself as part of RDMA
connection setup.
Connection setup is extended, when the peer2peer module option is 1,
such that the MPA initiator will send a 0B Read (the RTR) just after
connection setup. The MPA responder will suspend SQ processing until
the RTR message is received and reply-to.
In the longer term, this will be handled in a standardized way by
enhancing the MPA negotiation so peers can indicate whether they
want/need the RTR and what type of RTR (0B read, 0B write, or 0B send)
should be sent. This will be done by standardizing a few bits of the
private data in order to negotiate all this. However this patch
enables peer-to-peer applications now and allows most of the required
firmware and driver changes to be done and tested now.
Design:
- Add a module option, peer2peer, to enable this mode.
- New firmware support for peer-to-peer mode:
- a new bit in the rdma_init WR to tell it to do peer-2-peer
and what form of RTR message to send or expect.
- process _all_ preposted recvs before moving the connection
into rdma mode.
- passive side: defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the RTR is received.
- active side: expect and process the 0B read WR on offload TX
queue. Defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the 0B read WR is processed from the offload TX queue.
- If peer2peer is set, driver posts 0B read request on offload TX
queue just after posting the rdma_init WR to the offload TX queue.
- Add CQ poll logic to ignore unsolicitied read responses.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxgb3 only supports 4GB memory regions. The lustre RDMA code uses
this attribute and currently has to code around our bad setting.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Open MPI and other stress testing exposed a few bad bugs in handling
aborts in the middle of a normal close. Fix these by:
- serializing abort reply and peer abort processing with disconnect
processing
- warning (and ignoring) if ep timer is stopped when it wasn't running
- cleaning up disconnect path to correctly deal with aborting and
dead endpoints
- in iwch_modify_qp(), taking a ref on the ep before releasing the qp
lock if iwch_ep_disconnect() will be called. The ref is dropped
after calling disconnect.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts the main ib_device to use struct device instead of struct
class_device as class_device is going away.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The values chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
__FUNCTION__ is gcc-specific, use __func__ instead.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix sparse warnings about pointer signedness by using a signed int when
calling idr_get_new_above().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Because of a typo in iwch_accept_cr(), the cxgb3 connection handling
code programs the hardware IRD (incoming RDMA read queue depth) with
the value that is passed in for the ORD (outgoing RDMA read queue
depth). In particular this means that if an application passes in IRD
> 0 and ORD = 0 (which is a completely sane and valid thing to do for
an app that expects only incoming RDMA read requests), then the
hardware will end up programmed with IRD = 0 and the app will fail in
a mysterious way.
Fix this by using "ep->ird" instead of "ep->ord" in the intended place.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cxbg3 driver is unnecessarily decreasing the number of CQ entries by
one when creating a CQ. This will cause the CQ not to have as many
entries as requested by the user if the user requests a power of 2 size.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set cap.max_inline_data to the actual max inline data that the adapter
support, so that userspace apps see the right value returned.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A single entry (addr 0x10001000, size 0x2000) will get converted to
page address 0x10000000 with a page size of 0x4000. The code as it
stands doesn't address the single buffer case, but in fact it allows
the subsequent single-buffer special case to be eliminated entirely.
Because the mask now includes the (page adjusted) starting and ending
addresses, the general case works for the single buffer case as well.
Signed-off-by: Bryan Rosenburg <rosnbrg@us.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The cxgb3 HW and driver don't support loopback RDMA connections. So
fail any connection attempt where the destination address is local.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25: (1470 commits)
[IPV6] ADDRLABEL: Fix double free on label deletion.
[PPP]: Sparse warning fixes.
[IPV4] fib_trie: remove unneeded NULL check
[IPV4] fib_trie: More whitespace cleanup.
[NET_SCHED]: Use nla_policy for attribute validation in ematches
[NET_SCHED]: Use nla_policy for attribute validation in actions
[NET_SCHED]: Use nla_policy for attribute validation in classifiers
[NET_SCHED]: Use nla_policy for attribute validation in packet schedulers
[NET_SCHED]: sch_api: introduce constant for rate table size
[NET_SCHED]: Use typeful attribute parsing helpers
[NET_SCHED]: Use typeful attribute construction helpers
[NET_SCHED]: Use NLA_PUT_STRING for string dumping
[NET_SCHED]: Use nla_nest_start/nla_nest_end
[NET_SCHED]: Propagate nla_parse return value
[NET_SCHED]: act_api: use PTR_ERR in tcf_action_init/tcf_action_get
[NET_SCHED]: act_api: use nlmsg_parse
[NET_SCHED]: act_api: fix netlink API conversion bug
[NET_SCHED]: sch_netem: use nla_parse_nested_compat
[NET_SCHED]: sch_atm: fix format string warning
[NETNS]: Add namespace for ICMP replying code.
...
Needed to propagate it down to the __ip_route_output_key.
Signed_off_by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes TOPDIR from infiniband Makefile and delete
one include statement pointing to a non-existing directory
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <mshefty@ichips.intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Correctly work around T3A issues by checking "hwtype != T3A" instead of
"hwtype == T3B". This will be needed for new hardware types.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The existing logic incorrectly maps this buffer list:
0: addr 0x10001000, size 0x1000
1: addr 0x10002000, size 0x1000
To this bogus page list:
0: 0x10000000
1: 0x10002000
The shift calculation must also take into account the address of the
first entry masked by the page_mask as well as the last address+size
rounded up to the next page size.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- for kernel mode cqs, call event notification handler when flushing.
- flush QP when moving from RTS -> CLOSING.
- fix logic to identify a kernel mode qp.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The 5.0 firmware now supports translating sgls in recv work requests,
so remove the host driver logic currently doing the translation.
Note: this change requires 5.0 firmware.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently the call into cxgb3 to get the driver info is not serialized.
The iw_cxgb3 module needs to hold the rtnl_lock around the ethtool ops
call like dev_ioctl() does.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The device attribute max_qp_init_rd_atom is not getting set in cxgb3's
query_device method. Version 1.0.4 of librdmacm now validates the
user's requested initiator and responder resources against the max
supported by the device. Since iw_cxgb3 wasn't setting this attribute
(and it defaulted to 0), all rdma_connect()s fail if there are
initiator resources requested by the app. Fix this by setting the
correct value in iwch_query_device().
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (87 commits)
mlx4_core: Fix section mismatches
IPoIB: Allow setting policy to ignore multicast groups
IB/mthca: Mark error paths as unlikely() in post_srq_recv functions
IB/ipath: Minor fix to ordering of freeing and zeroing of tid pages.
IB/ipath: Remove redundant link state checks
IB/ipath: Fix IB_EVENT_PORT_ERR event
IB/ipath: Better handling of unexpected GPIO interrupts
IB/ipath: Maintain active time on all chips
IB/ipath: Fix QHT7040 serial number check
IB/ipath: Indicate a couple of chip bugs to userspace
IB/ipath: iba6110 rev4 no longer needs recv header overrun workaround
IB/ipath: Use counters in ipath_poll and cleanup interrupts in ipath_close
IB/ipath: Remove duplicate copy of LMC
IB/ipath: Add ability to set the LMC via the sysfs debugging interface
IB/ipath: Optimize completion queue entry insertion and polling
IB/ipath: Implement IB_EVENT_QP_LAST_WQE_REACHED
IB/ipath: Generate flush CQE when QP is in error state
IB/ipath: Remove redundant code
IB/ipath: Future proof eeprom checksum code (contents reading)
IB/ipath: UC RDMA WRITE with IMMEDIATE doesn't send the immediate
...
This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl
were modified to take a network namespace argument, and
deal with it.
vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.
So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.
For now the ifindex generator is left global.
Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.
At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow changing parameter values without having to reload the module.
This is safe because these parameters are only looked at when a new
connection is established.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxgb3 used netdev_priv() and dev->priv for different purposes.
In 2.6.23, netdev_priv() == dev->priv, cxgb3 needs a fix.
This patch is a partial backport of Dave Miller's changes in the
net-2.6.24 git branch.
Without this fix, cxgb3 crashes on 2.6.23.
Signed-off-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Transform some calls to kmalloc/memset to a single kzalloc (or kcalloc).
Here is a short excerpt of the semantic patch performing
this transformation:
@@
type T2;
expression x;
identifier f,fld;
expression E;
expression E1,E2;
expression e1,e2,e3,y;
statement S;
@@
x =
- kmalloc
+ kzalloc
(E1,E2)
... when != \(x->fld=E;\|y=f(...,x,...);\|f(...,x,...);\|x=E;\|while(...) S\|for(e1;e2;e3) S\)
- memset((T2)x,0,E1);
@@
expression E1,E2,E3;
@@
- kzalloc(E1 * E2,E3)
+ kcalloc(E1,E2,E3)
[akpm@linux-foundation.org: get kcalloc args the right way around]
Signed-off-by: Yoann Padioleau <padator@wanadoo.fr>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <bryan.wu@analog.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Dave Airlie <airlied@linux.ie>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Acked-by: Pierre Ossman <drzeus-list@drzeus.cx>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Greg KH <greg@kroah.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change Kconfig objects from "menu, config" into "menuconfig" so
that the user can disable the whole feature without having to
enter the menu first.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
[ Also remove cast from void * return of kmalloc() as suggested by
Jesper Juhl <jesper.juhl@gmail.com>. ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This bug results in an abort request being sent down _after_ the tid
has been released. If the tid happens to have been reused, then the
subsequent generation of the tid gets incorrectly aborted.
The thread running iwch_accecpt_cr() must not abort a connection if an
error is returned after being awakened. If any errors did occur while
iwch_accept_cr() is blocked, then the connection has already been
aborted on the thread processing the error.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The LLD does this for us in cxgb3_remove_tid().
Also fixed active open failure cases where we also shouldn't be
releasing the TID.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Negative advice messages should _not_ count toward the 2 abort
requests needed to indicate an abort request.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't set the gen bits nor length bits in the terminate WR. This is
done by the LLD driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Due to a HW issue, our current scheme to transition the connection from
streaming to rdma mode is broken on the passive side. The firmware
and driver now support a new transition scheme for the passive side:
- driver posts rdma_init_wr (now including the initial receive seqno)
- driver posts last streaming message via TX_DATA message (MPA start
response)
- uP atomically sends the last streaming message and transitions the
tcb to rdma mode.
- driver waits for wr_ack indicating the last streaming message was ACKed.
NOTE: This change also bumps the required firmware version to 4.3.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export ib_umem_get()/ib_umem_release() and put low-level drivers in
control of when to call ib_umem_get() to pin and DMA map userspace,
rather than always calling it in ib_uverbs_reg_mr() before calling the
low-level driver's reg_user_mr method.
Also move these functions to be in the ib_core module instead of
ib_uverbs, so that driver modules using them do not depend on
ib_uverbs.
This has a number of advantages:
- It is better design from the standpoint of making generic code a
library that can be used or overridden by device-specific code as
the details of specific devices dictate.
- Drivers that do not need to pin userspace memory regions do not
need to take the performance hit of calling ib_mem_get(). For
example, although I have not tried to implement it in this patch,
the ipath driver should be able to avoid pinning memory and just
use copy_{to,from}_user() to access userspace memory regions.
- Buffers that need special mapping treatment can be identified by
the low-level driver. For example, it may be possible to solve
some Altix-specific memory ordering issues with mthca CQs in
userspace by mapping CQ buffers with extra flags.
- Drivers that need to pin and DMA map userspace memory for things
other than memory regions can use ib_umem_get() directly, instead
of hacks using extra parameters to their reg_phys_mr method. For
example, the mlx4 driver that is pending being merged needs to pin
and DMA map QP and CQ buffers, but it does not need to create a
memory key for these buffers. So the cleanest solution is for mlx4
to call ib_umem_get() in the create_qp and create_cq methods.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The semantics defined by the InfiniBand specification say that
completion events are only generated when a completions is added to a
completion queue (CQ) after completion notification is requested. In
other words, this means that the following race is possible:
while (CQ is not empty)
ib_poll_cq(CQ);
// new completion is added after while loop is exited
ib_req_notify_cq(CQ);
// no event is generated for the existing completion
To close this race, the IB spec recommends doing another poll of the
CQ after requesting notification.
However, it is not always possible to arrange code this way (for
example, we have found that NAPI for IPoIB cannot poll after
requesting notification). Also, some hardware (eg Mellanox HCAs)
actually will generate an event for completions added before the call
to ib_req_notify_cq() -- which is allowed by the spec, since there's
no way for any upper-layer consumer to know exactly when a completion
was really added -- so the extra poll of the CQ is just a waste.
Motivated by this, we add a new flag "IB_CQ_REPORT_MISSED_EVENTS" for
ib_req_notify_cq() so that it can return a hint about whether the a
completion may have been added before the request for notification.
The return value of ib_req_notify_cq() is extended so:
< 0 means an error occurred while requesting notification
== 0 means notification was requested successfully, and if
IB_CQ_REPORT_MISSED_EVENTS was passed in, then no
events were missed and it is safe to wait for another
event.
> 0 is only returned if IB_CQ_REPORT_MISSED_EVENTS was
passed in. It means that the consumer must poll the
CQ again to make sure it is empty to avoid the race
described above.
We add a flag to enable this behavior rather than turning it on
unconditionally, because checking for missed events may incur
significant overhead for some low-level drivers, and consumers that
don't care about the results of this test shouldn't be forced to pay
for the test.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a num_comp_vectors member to struct ib_device and extend
ib_create_cq() to pass in a comp_vector parameter -- this parallels
the userspace libibverbs API. Update all hardware drivers to set
num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector
value. Pass the value of num_comp_vectors to userspace rather than
hard-coding a value of 1.
We want multiple CQ event vector support (via MSI-X or similar for
adapters that can generate multiple interrupts), but it's not clear
how many vectors we want, or how we want to deal with policy issues
such as how to decide which vector to use or how to set up interrupt
affinity. This patch is useful for experimenting, since no core
changes will be necessary when updating a driver to support multiple
vectors, and we know that we want to make at least these changes
anyway.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The HW now posts 2 ABORT_RPL and/or PEER_ABORT_REQ messages. We need
to handle them by silenty dropping the 1st but mark that we're ready
for the final message. This plugs some close races between the uP and
HW. Also update the minimum required firmware version.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix TERMINATE layer, type, and ecode values based on
conformance testing.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband: (49 commits)
IB: Set class_dev->dev in core for nice device symlink
IB/ehca: Implement modify_port
IB/umad: Clarify documentation of transaction ID
IPoIB/cm: spin_lock_irqsave() -> spin_lock_irq() replacements
IB/mad: Change SMI to use enums rather than magic return codes
IB/umad: Implement GRH handling for sent/received MADs
IB/ipoib: Use ib_init_ah_from_path to initialize ah_attr
IB/sa: Set src_path_bits correctly in ib_init_ah_from_path()
IB/ucm: Simplify ib_ucm_event()
RDMA/ucma: Simplify ucma_get_event()
IB/mthca: Simplify CQ cleaning in mthca_free_qp()
IB/mthca: Fix mthca_write_mtt() on HCAs with hidden memory
IB/mthca: Update HCA firmware revisions
IB/ipath: Fix WC format drift between user and kernel space
IB/ipath: Check that a UD work request's address handle is valid
IB/ipath: Remove duplicate stuff from ipath_verbs.h
IB/ipath: Check reserved memory keys
IB/ipath: Fix unit selection when all CPU affinity bits set
IB/ipath: Don't allow QPs 0 and 1 to be opened multiple times
IB/ipath: Disable IB link earlier in shutdown sequence
...
To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Now to convert the last one, skb->data, that will allow many simplifications
and removal of some of the offset helpers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.
This one touches just the most simple cases:
skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()
The next ones will handle the slightly more "complex" cases.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All RDMA drivers except ehca set class_dev->dev to their dma_device
value (ehca leaves this unset). dma_device is the only value that
makes any sense, so move this assignment to core/sysfs.c. This reduce
the duplicated code in the rest of the drivers and gives ehca a nice
/sys/class/infiniband/ehcaX/device symlink.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As of commit 6cdbd77e ("cxgb3 - missing CPL hanler and register
setting."), the cxgb3 ethernet NIC driver no longer handles SET_TCB
replies, so we need to do it in the iWARP driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This was spotted by the Coverity checker (CID 1554).
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix memory region permission problems:
- remove useless and redundant iwch_mem_perms enum.
- create ib_to_tpt_access_rights() for mapping ib access rights
to T3 TPT permissions.
- create ib_to_mwbind_access_rights() for mapping ib access rights
to T3 MWBIND WR permissions.
- fix up the mem reg code to utilize the new functions.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Only print one AE error for a given connection in the kernel log.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Stop the endpoint timer when the MPA exchange is aborted by the peer.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change iwch_destroy_qp() to always move the QP to ERROR and let
iwch_modify_qp() decide what to do.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fixes for "normal close" failures:
- Start normal close timer when moving to CLOSING state.
- Handle ABORTING state in close_con_rpl().
- Stop timer correctly on abort during a normal close.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxgb3 uses dma_alloc_coherent() et al. thus needs linux/dma-mapping.h
include in order to build reliably.
Noticed on sparc64.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the consumer rejects the connection we end up under-referencing the
endpoint structure. The fix is to call iwch_ep_disconnect() instead
of the low level disconnect functions so that the endpoint close timer
is started correctly.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Stop the ep timer in ec_status() if the status indicates a
bad close.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- don't mark static functions in C files as inline - gcc should know
best whether inlining makes sense
- never compile the unused cxio_dbg.c
- make the following needlessly global functions static:
- cxio_hal.c: cxio_hal_clear_qp_ctx()
- iwch_provider.c: iwch_get_qp()
- remove the following unused global functions:
- cxio_hal.c: cxio_allocate_stag()
- cxio_resource.: cxio_hal_get_rhdl()
- cxio_resource.: cxio_hal_put_rhdl()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove the Open Grid Computing copyright. It shouldn't be there.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For T3B devices, mark user QP in error once we transition
to TERMINATE.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an RDMA/iWARP driver for the Chelsio T3 1GbE and 10GbE adapters.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>