Pull kmap_atomic cleanup from Cong Wang.
It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().
Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.
* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
drbd: remove the second argument of k[un]map_atomic()
zcache: remove the second argument of k[un]map_atomic()
gma500: remove the second argument of k[un]map_atomic()
dm: remove the second argument of k[un]map_atomic()
tomoyo: remove the second argument of k[un]map_atomic()
sunrpc: remove the second argument of k[un]map_atomic()
rds: remove the second argument of k[un]map_atomic()
net: remove the second argument of k[un]map_atomic()
mm: remove the second argument of k[un]map_atomic()
lib: remove the second argument of k[un]map_atomic()
power: remove the second argument of k[un]map_atomic()
kdb: remove the second argument of k[un]map_atomic()
udf: remove the second argument of k[un]map_atomic()
ubifs: remove the second argument of k[un]map_atomic()
squashfs: remove the second argument of k[un]map_atomic()
reiserfs: remove the second argument of k[un]map_atomic()
ocfs2: remove the second argument of k[un]map_atomic()
ntfs: remove the second argument of k[un]map_atomic()
...
Pull trivial tree from Jiri Kosina:
"It's indeed trivial -- mostly documentation updates and a bunch of
typo fixes from Masanari.
There are also several linux/version.h include removals from Jesper."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
kcore: fix spelling in read_kcore() comment
constify struct pci_dev * in obvious cases
Revert "char: Fix typo in viotape.c"
init: fix wording error in mm_init comment
usb: gadget: Kconfig: fix typo for 'different'
Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
writeback: fix typo in the writeback_control comment
Documentation: Fix multiple typo in Documentation
tpm_tis: fix tis_lock with respect to RCU
Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
Doc: Update numastat.txt
qla4xxx: Add missing spaces to error messages
compiler.h: Fix typo
security: struct security_operations kerneldoc fix
Documentation: broken URL in libata.tmpl
Documentation: broken URL in filesystems.tmpl
mtd: simplify return logic in do_map_probe()
mm: fix comment typo of truncate_inode_pages_range
power: bq27x00: Fix typos in comment
...
This patch addresses a bug in srpt_handle_cmd() failure handling where
send_ioctx->kref is being leaked with the local extra reference after init,
causing the expected kref_put() in srpt_handle_send_comp() to not be the final
call to invoke srpt_put_send_ioctx_kref() -> transport_generic_free_cmd() and
perform se_cmd descriptor memory release.
It also fixes a SCF_SCSI_RESERVATION_CONFLICT handling bug where this code
is incorrectly falling through to transport_handle_cdb_direct() after
invoking srpt_queue_status() to send SAM_STAT_RESERVATION_CONFLICT status.
Note this patch is for >= v3.3 mainline code, and current lio-core.git
code has already been converted to target_submit_cmd() + se_cmd->cmd_kref usage,
and internal ioctx->kref usage has been removed. I'm including this patch
now into target-pending/for-next with a CC' for v3.3 stable.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
To issue a port query, use the QUERY_(Ethernet)_PORT command instead
of the MAD_IFC command, since MAD_IFC attempts to query the firmware
IB SMA, which is irrelevant for IBoE ports.
This allows us to handle both 10Gb/s and 40Gb/s rates (e.g in sysfs),
using QDR speed (10Gb/s) and width of 1X or 4X.
Signed-off-by: Dotan Barak <dotanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If an erroneous CQE is polled in the first iteration (i.e. npolled ==
0), we don't update the consumer index and hence the hardware could
get a wrong notion of how many CQEs software polled. Fix this by
unconditionally updating the doorbell record. We could change the
check to be something like
if (npolled || err != -EAGAIN)
...
but it does not seem worth the effort since a posted write to memory
should not cost too much.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch drops the following unused legacy API callers from target_core_fabric.h:
*) TFO->fall_back_to_erl0()
*) TFO->stop_session()
*) TFO->sess_logged_in()
*) TFO->is_state_remove()
This patch also removes the stub usage in loopback, tcm_fc, iscsi_target,
and ib_srpt fabric modules.
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Use a bit in wc_flags rather then a whole integer to hold the
"checksum OK" flag. By itself, this change doesn't reduce the size of
struct ib_wc on 64bit machines -- it stays on 56 bytes because of
padding. However, it will allow to add more fields in the future
without enlarging the struct. Also, it will let us have a unified
approach with future libibverbs checksum offload reporting, because a
bit flag doesn't break the library ABI.
This patch was suggested during conversation with Liran Liss
<liranl@mellanox.com>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When destroying a listening cmid, the iwcm first marks the state of
the cmid as DESTROYING, then releases the lock and calls into the
iWARP provider to destroy the endpoint. Since the cmid is not locked,
its possible for the iWARP provider to pass a connection request event
to the iwcm, which will be silently dropped by the iwcm. This causes
the iWARP provider to never free up the resources from this connection
because the assumption is the iwcm will accept or reject this connection.
The solution is to reject these connection requests.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Since flush_qp() is always called with irqs disabled, all the locking
inside flush_qp() and __flush_qp() doesn't need irq save/restore.
Further, passing the flag variable from iwch_modify_qp() is just wrong
and causes a WARN_ON() in local_bh_enable().
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
While doing the work for commit a6f7feae6d ("IB/mlx4: pass SMP
vendor-specific attribute MADs to firmware") we realized that the
firmware would respond on all sorts of vendor-specific MADs.
Therefore commit 97285b7817 ("mlx4_core: Add extended port
capabilities support") adds redundant code into the driver, since
there's no real reaon to maintain the extended capabilities of the
port, as they can be queried on demand (e.g the FDR10 capability).
This patch reverts commit 97285b7817 and removes the check for
extended caps from the mlx4_ib driver port query flow.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When we destroy a cm_id, we must purge associated events from the
event queue. If the cm_id is for a listen request, we also purge
corresponding pending connect requests. This requires destroying
the cm_id's associated with the connect requests by calling
rdma_destroy_id(). rdma_destroy_id() blocks until all outstanding
callbacks have completed.
The issue is that we hold file->mut while purging events from the
event queue. We also acquire file->mut in our event handler. Calling
rdma_destroy_id() while holding file->mut can lead to a deadlock,
since the event handler callback cannot acquire file->mut, which
prevents rdma_destroy_id() from completing.
Fix this by moving events to purge from the event queue to a temporary
list. We can then release file->mut and call rdma_destroy_id()
outside of holding any locks.
Bug report by Or Gerlitz <ogerlitz@mellanox.com>:
[ INFO: possible circular locking dependency detected ]
3.3.0-rc5-00008-g79f1e43-dirty #34 Tainted: G I
tgtd/9018 is trying to acquire lock:
(&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0359a41>] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
but task is already holding lock:
(&file->mut){+.+.+.}, at: [<ffffffffa02470fe>] ucma_free_ctx+0xb6/0x196 [rdma_ucm]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&file->mut){+.+.+.}:
[<ffffffff810682f3>] lock_acquire+0xf0/0x116
[<ffffffff8135f179>] mutex_lock_nested+0x64/0x2e6
[<ffffffffa0247636>] ucma_event_handler+0x148/0x1dc [rdma_ucm]
[<ffffffffa035a79a>] cma_ib_handler+0x1a7/0x1f7 [rdma_cm]
[<ffffffffa0333e88>] cm_process_work+0x32/0x119 [ib_cm]
[<ffffffffa03362ab>] cm_work_handler+0xfb8/0xfe5 [ib_cm]
[<ffffffff810423e2>] process_one_work+0x2bd/0x4a6
[<ffffffff810429e2>] worker_thread+0x1d6/0x350
[<ffffffff810462a6>] kthread+0x84/0x8c
[<ffffffff81369624>] kernel_thread_helper+0x4/0x10
-> #0 (&id_priv->handler_mutex){+.+.+.}:
[<ffffffff81067b86>] __lock_acquire+0x10d5/0x1752
[<ffffffff810682f3>] lock_acquire+0xf0/0x116
[<ffffffff8135f179>] mutex_lock_nested+0x64/0x2e6
[<ffffffffa0359a41>] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
[<ffffffffa024715f>] ucma_free_ctx+0x117/0x196 [rdma_ucm]
[<ffffffffa0247255>] ucma_close+0x77/0xb4 [rdma_ucm]
[<ffffffff810df6ef>] fput+0x117/0x1cf
[<ffffffff810dc76e>] filp_close+0x6d/0x78
[<ffffffff8102b667>] put_files_struct+0xbd/0x17d
[<ffffffff8102b76d>] exit_files+0x46/0x4e
[<ffffffff8102d057>] do_exit+0x299/0x75d
[<ffffffff8102d599>] do_group_exit+0x7e/0xa9
[<ffffffff8103ae4b>] get_signal_to_deliver+0x536/0x555
[<ffffffff81001717>] do_signal+0x39/0x634
[<ffffffff81001d39>] do_notify_resume+0x27/0x69
[<ffffffff81361c03>] retint_signal+0x46/0x83
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&file->mut);
lock(&id_priv->handler_mutex);
lock(&file->mut);
lock(&id_priv->handler_mutex);
*** DEADLOCK ***
1 lock held by tgtd/9018:
#0: (&file->mut){+.+.+.}, at: [<ffffffffa02470fe>] ucma_free_ctx+0xb6/0x196 [rdma_ucm]
stack backtrace:
Pid: 9018, comm: tgtd Tainted: G I 3.3.0-rc5-00008-g79f1e43-dirty #34
Call Trace:
[<ffffffff81029e9c>] ? console_unlock+0x18e/0x207
[<ffffffff81066433>] print_circular_bug+0x28e/0x29f
[<ffffffff81067b86>] __lock_acquire+0x10d5/0x1752
[<ffffffff810682f3>] lock_acquire+0xf0/0x116
[<ffffffffa0359a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
[<ffffffff8135f179>] mutex_lock_nested+0x64/0x2e6
[<ffffffffa0359a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
[<ffffffff8106546d>] ? trace_hardirqs_on_caller+0x11e/0x155
[<ffffffff810654b1>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa0359a41>] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
[<ffffffffa024715f>] ucma_free_ctx+0x117/0x196 [rdma_ucm]
[<ffffffffa0247255>] ucma_close+0x77/0xb4 [rdma_ucm]
[<ffffffff810df6ef>] fput+0x117/0x1cf
[<ffffffff810dc76e>] filp_close+0x6d/0x78
[<ffffffff8102b667>] put_files_struct+0xbd/0x17d
[<ffffffff8102b5cc>] ? put_files_struct+0x22/0x17d
[<ffffffff8102b76d>] exit_files+0x46/0x4e
[<ffffffff8102d057>] do_exit+0x299/0x75d
[<ffffffff8102d599>] do_group_exit+0x7e/0xa9
[<ffffffff8103ae4b>] get_signal_to_deliver+0x536/0x555
[<ffffffff810654b1>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff81001717>] do_signal+0x39/0x634
[<ffffffff8135e037>] ? printk+0x3c/0x45
[<ffffffff8106546d>] ? trace_hardirqs_on_caller+0x11e/0x155
[<ffffffff810654b1>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff81361803>] ? _raw_spin_unlock_irq+0x2b/0x40
[<ffffffff81039011>] ? set_current_blocked+0x44/0x49
[<ffffffff81361bce>] ? retint_signal+0x11/0x83
[<ffffffff81001d39>] do_notify_resume+0x27/0x69
[<ffffffff8118a1fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81361c03>] retint_signal+0x46/0x83
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
I'm getting compile failures building this driver, which I narrowed
down to the ilog2 call in ehca_get_max_hwpage_size...
ERROR: ".____ilog2_NaN" [drivers/infiniband/hw/ehca/ib_ehca.ko]
undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
The use of shca->hca_cap_mr_pgsize is confusing the compiler, and
resulting in the __builtin_constant_p in ilog2 going insane.
I tried making it take the u32 pgsize as an argument and the expansion
of shca->_pgsize in the caller, but that failed as well.
With this patch in place, the driver compiles on my GCC 4.6.2 here.
Suggested-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Kyle McMartin <kmcmarti@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The kernel IB stack uses one enumeration for IB speed, which wasn't
explicitly specified in the verbs header file. Add that enum, and use
it all over the code.
The IB speed/width notation is also used by iWARP and IBoE HW drivers,
which use the convention of rate = speed * width to advertise their
port link rate.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
An iser target may send iscsi NO-OP PDUs as soon as it marks the iSER
iSCSI session as fully operative. This means that there is window
where there are no posted receive buffers on the initiator side, so
it's possible for the iSER RC connection to break because of RNR NAK /
retry errors. To fix this, rely on the flags bits in the login
request to have FFP (0x3) in the lower nibble as a marker for the
final login request, and post an initial chunk of receive buffers
before sending that login request instead of after getting the login
response.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We allocate the login dma buffers in iser_verbs.c as part of
alloc_ib_conn_resources(), however we are freeing them in
iser_initiator.c as part of iser_free_rx_descriptors(). This is
needlessly confusing. We have an alloc_rx_descriptors() and it
doesn't alloc something that the free_rx_descriptors() frees, and we
have an alloc_ib_conn_resources() that allocs something not freed by
free_ib_conn_resources(). Clean that up.
Signed-off-by: Doug Ledford <dledford@redhat.com>
[ Fix build error in iser_free_ib_conn_res(). - Or ]
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove sysfs attributes before removing a target instead of testing
the target state in every sysfs attribute callback method. Note: it is
safe to invoke a sysfs attribute removal method like
device_remove_file() twice on the same attribute.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use pr_fmt() and pr_xxx() instead of more verbose printk() equivalents.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 71eeba16 ("IB: Add new InfiniBand link speeds") introduced a bug
where eg the rate for IB 4X SDR links iss displayed as "8.5 Gb/sec"
instead of "10 Gb/sec" as it used to be. Fix that.
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Davem considers that the argument list of this interface is getting
out of control. This patch tries to address this issue following
his proposal:
struct netlink_dump_control c = { .dump = dump, .done = done, ... };
netlink_dump_start(..., &c);
Suggested by David S. Miller.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ConnectX devices have a limit on the number of mappings that can be
done on an FMR before having to call sync_tpt. The current
mlx4_ib driver reports the limit correctly in max_map_per_fmr in
.query_device(), but mlx4_core doesn't check it when actually
allocating FMRs.
Add a max_fmr_maps field to struct mlx4_caps and enforce this maximum
value on FMR allocations.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the opcode of a work request exceeds the range of valid opcodes,
return the pointer to the offending work request.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set up a response with appropriate error status and send it for MADs
that are not supported by a specific class/version.
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Swapna Thete <swapna.thete@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Since create_comp_task() creates percpu kthread, it makes sense to use
kthread_create_on_node() to get proper NUMA affinity for kthread
stack.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The current get portinfo handling filters the LID being sent,
changing zero to 0xffff.
This causes OpenSM to log excessive warning messages.
Reviewed-by: Edward Mascarenhas <edward.mascarenhas@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Call irq_set_affinity_hint() to give userspace programs such as
irqbalance the information to be able to distribute qib interrupts
appropriately.
The logic allocates all non-receive interrupts to the first CPU local
to the HCA. Receive interrupts are allocated round robin starting
with the second CPU local to the HCA with potential wrap back to the
second CPU.
This patch also adds a refinement to the name registered for MSI-X
interrupts so that user level scripts can determine the device
associated with the IRQs when there are multiple HCAs with a
potentially different set of local CPUs.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Don't worry about p2p_type if peer2peer itself is not requested in the
first place.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Change the test for if a cmd is a tmr request to checking if
SCF_SCSI_TMR_CDB (a new flag) is set in cmd->se_cmd_flags.
Also remove se_tmr_req_cache usage in favor of kzalloc usage,
and make core_tmr_alloc_req() return int + setup se_cmd->se_tmr_req
directly and fix up various fabric module usages
Cc: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Replace various atomic_ts used as flags in struct se_cmd with a single
transport_state bitmap that requires t_state_lock to be held for modifications.
In the target core that assumption generally is true, but some recently added
code in the SRP target had to grow new lock calls. I can't say I like the way
how it messes with the command state directly, but let's leave that for later.
(Re-add missing ib_srpt.c changes that nab dropped..)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quoth David:
1) GRO MAC header comparisons were ethernet specific, breaking other
link types. This required a multi-faceted fix to cure the originally
noted case (Infiniband), because IPoIB was lying about it's actual
hard header length. Thanks to Eric Dumazet, Roland Dreier, and
others.
2) Fix build failure when INET_UDP_DIAG is built in and ipv6 is modular.
From Anisse Astier.
3) Off by ones and other bug fixes in netprio_cgroup from Neil Horman.
4) ipv4 TCP reset generation needs to respect any network interface
binding from the socket, otherwise route lookups might give a
different result than all the other segments received. From Shawn
Lu.
5) Fix unintended regression in ipv4 proxy ARP responses, from Thomas
Graf.
6) Fix SKB under-allocation bug in sh_eth, from Yoshihiro Shimoda.
7) Revert skge PCI mapping changes that are causing crashes for some
folks, from Stephen Hemminger.
8) IPV4 route lookups fill in the wildcarded fields of the given flow
lookup key passed in, which is fine most of the time as this is
exactly what the caller's want. However there are a few cases that
want to retain the original flow key values afterwards, so handle
those cases properly. Fix from Julian Anastasov.
9) IGB/IXGBE VF lookup bug fixes from Greg Rose.
10) Properly null terminate filename passed to ethtool flash device
method, from Ben Hutchings.
11) S3 resume fix in via-velocity from David Lv.
12) Fix double SKB free during xmit failure in CAIF, from Dmitry
Tarnyagin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
net: Don't proxy arp respond if iif == rt->dst.dev if private VLAN is disabled
ipv4: Fix wrong order of ip_rt_get_source() and update iph->daddr.
netprio_cgroup: fix wrong memory access when NETPRIO_CGROUP=m
netprio_cgroup: don't allocate prio table when a device is registered
netprio_cgroup: fix an off-by-one bug
bna: fix error handling of bnad_get_flash_partition_by_offset()
isdn: type bug in isdn_net_header()
net: Make qdisc_skb_cb upper size bound explicit.
ixgbe: ethtool: stats user buffer overrun
ixgbe: dcb: up2tc mapping lost on disable/enable CEE DCB state
ixgbe: do not update real num queues when netdev is going away
ixgbe: Fix broken dependency on MAX_SKB_FRAGS being related to page size
ixgbe: Fix case of Tx Hang in PF with 32 VFs
ixgbe: fix vf lookup
igb: fix vf lookup
e1000: add dropped DMA receive enable back in for WoL
gro: more generic L2 header check
IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses
zd1211rw: firmware needs duration_id set to zero for non-pspoll frames
net: enable TC35815 for MIPS again
...
Commit a0417fa3a1 ("net: Make qdisc_skb_cb upper size bound
explicit.") made it possible for a netdev driver to use skb->cb
between its header_ops.create method and its .ndo_start_xmit
method. Use this in ipoib_hard_header() to stash away the LL address
(GID + QPN), instead of the "ipoib_pseudoheader" hack. This allows
IPoIB to stop lying about its hard_header_len, which will let us fix
the L2 check for GRO.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a0417fa3a1 ("net: Make qdisc_skb_cb upper size bound
explicit.") made it possible for a netdev driver to use skb->cb
between its header_ops.create method and its .ndo_start_xmit
method. Use this in ipoib_hard_header() to stash away the LL address
(GID + QPN), instead of the "ipoib_pseudoheader" hack. This allows
IPoIB to stop lying about its hard_header_len, which will let us fix
the L2 check for GRO.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
transport_init_session() and target_fabric_configfs_init() don't
return NULL pointers, they only return ERR_PTRs or valid pointers.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update copyright information in the source files.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In the current code, vendor-specific MADs (e.g with the FDR-10
attribute) are silently dropped by the driver, resulting in timeouts
at the sending side and inability to query/configure the relevant
feature. However, the ConnectX firmware is able to handle such MADs.
For unsupported attributes, the firmware returns a GET_RESPONSE MAD
containing an error status.
For example, for a FDR-10 node with LID 11:
# ibstat mlx4_0 1
CA: 'mlx4_0'
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 11
LMC: 0
SM lid: 24
Capability mask: 0x02514868
Port GUID: 0x0002c903002e65d1
Link layer: InfiniBand
Extended Port Query (EPI) vendor mad timeouts before the patch:
# smpquery MEPI 11 -d
ibwarn: [4196] smp_query_via: attr 0xff90 mod 0x0 route Lid 11
ibwarn: [4196] _do_madrpc: retry 1 (timeout 1000 ms)
ibwarn: [4196] _do_madrpc: retry 2 (timeout 1000 ms)
ibwarn: [4196] _do_madrpc: timeout after 3 retries, 3000 ms
ibwarn: [4196] mad_rpc: _do_madrpc failed; dport (Lid 11)
smpquery: iberror: [pid 4196] main: failed: operation EPI: ext port info query failed
EPI query works OK with the patch:
# smpquery MEPI 11 -d
ibwarn: [6548] smp_query_via: attr 0xff90 mod 0x0 route Lid 11
ibwarn: [6548] mad_rpc: data offs 64 sz 64
mad data
0000 0000 0000 0001 0000 0001 0000 0001
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
# Ext Port info: Lid 11 port 0
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x01
LinkSpeedEnabled:................0x01
LinkSpeedActive:.................0x01
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Ira Weiny <weiny2@llnl.gov>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix fast memory registration opcode in local invalidate completion.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Donald Wood <Donald.E.Wood@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Zero high order word of fast memory registration (FMR) length field.
FMR length field is 32 bits, so high word should always be zero.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Donald Wood <Donald.E.Wood@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
After reporting a new connection request to user space, the rdma_ucm
will discard subsequent events until the user has associated a user
space idenfier with the kernel cm_id. This is needed to avoid
reporting a reject/disconnect event to the user for a request that
they may not have processed.
The user space identifier is set once the user tries to accept the
connection request. However, the following race exists in ucma_accept():
ctx->uid = cmd.uid;
<events may be reported now>
ret = rdma_accept(ctx->cm_id, ...);
Once ctx->uid has been set, new events may be reported to the user.
While the above mentioned race is avoided, there is an issue that the
user _may_ receive a reject/disconnect event if rdma_accept() fails,
depending on when the event is processed. To simplify the use of
rdma_accept(), discard all events unless rdma_accept() succeeds.
This problem was discovered based on questions from Roland Dreier
<roland@purestorage.com>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 8d4548f2b ("IB/qib: Default some module parameters optimally")
introduced an issue with older root complexes. They cannot handle the
pcie_caps of 0x51 (MaxReadReq 4096, MaxPayload=256).
A typical diagnostic in this situation reported by syslog contains
the text:
[PCIe Poisoned TLP][Send DMA memory read]
Restore the module paramter default to zero with will avoid any
changes in the root complex.
Reviewed-by: Mark Debbage <mark.debbage@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
alloc_dummy_hdrq() is called with locks held and thus should not use
GFP_KERNEL.
The semantic patch that makes this report is available in
scripts/coccinelle/locks/call_kern.cocci.
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Make sure all exit paths from this function unlock everything.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set a reject flag, when sending MPA reject message to inform the peer
that the application has rejected the connection.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
"dentry" is a valid pointer. "*dentry" was intended.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We have just been investigating kernel panics related to
cq->ibcq.event_handler() completion calls. The problem is that
ib_destroy_qp() fails with -EBUSY.
Further investigation revealed qp->usecnt is not initialized. This
counter was introduced in linux-3.2 by commit 0e0ec7e063
("RDMA/core: Export ib_open_qp() to share XRC TGT QPs") but it only
gets initialized for IB_QPT_XRC_TGT, but it is checked in
ib_destroy_qp() for any QP type.
Fix this by initializing qp->usecnt for every QP we create.
Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Signed-off-by: Sven Breuner <sven.breuner@itwm.fraunhofer.de>
[ Initialize qp->usecnt in uverbs too. - Sean ]
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Now we must provide the IP destination address, and a reference has
to be dropped when we're done with the entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we must provide the IP destination address, and a reference has
to be dropped when we're done with the entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we must provide the IP destination address, and a reference has
to be dropped when we're done with the entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.
It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For historical reasons, we allow module_param(bool) to take an int (or
an unsigned int). That's going away.
A few drivers really want an int: they set it to -1 and a parameter
will set it to 0 or 1. This sucks: reading them from sysfs will give
'Y' for both -1 and 1, but if we change it to an int, then the users
might be broken (if they did "param" instead of "param=1").
Use a new 'bint' parser for them.
(ntfs has a different problem: it needs an int for debug_msgs because
it's also exposed via sysctl.)
Cc: Steve Glendinning <steve.glendinning@smsc.com>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Guenter Roeck <guenter.roeck@ericsson.com>
Cc: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Cc: Christoph Raisch <raisch@de.ibm.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux390@de.ibm.com
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: lm-sensors@lm-sensors.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: alsa-devel@alsa-project.org
Acked-by: Takashi Iwai <tiwai@suse.de> (For the sound part)
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com> (For the hwmon driver)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
Clean up sparse warnings in the rdma core layer.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix endianness bugs reported by sparse in the RDMA core stack. Note
that these are real bugs, but don't affect any existing code to the
best of my knowledge. The mlid issue would only affect kernel users
of rdma_join_multicast which have the rdma_cm attach/detach its QP.
There are no current in tree users that do this. (rdma_join_multicast
may be used called by user space applications, which does not have
this issue.) And the pkey setting is simply returned as
informational.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix for reset which happens right after sending a terminate message.
Terminate timer is not deleted when the connection is closed.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Change the PHY clock divisor to make the MDIO clock 2.5MHz, instead of
3.5MHz (which is out of spec).
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add a missing 16-bit reserved field between ap_status and info fields.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For IBoE, SLs 0-7 are mapped to Ethernet 802.1Q user priority bits
(pbits) which are part of the VLAN tag, SLs 8-15 are reserved.
Under Ethernet, the ConnectX firmware treats (decode/encode) the four
bit SL field in various constructs such as QPC / UD WQE / CQE as PPP0
and not as 0PPP. This correlates well to the fact that within the
vlan tag the pbits are located in bits 15-13 and not 12-14.
The current code wasn't consistent around that area - the
encoding was correct for the IBoE QPC.path.schedule_queue field,
but was wrong for IBoE CQEs and when MLX header was built.
These inconsistencies resulted in wrong SL <--> wire 802.1Q pbits
mapping, which is fixed by using SL <--> PPP0 all around the place.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Minimize the need for users to have to set module parameters to get
good performance.
The following two parameters are changed:
- rcvhdrcnt to twice the rcvegrcnt
- pcie_caps=0x51
The rcvhdrcnt at twice the egrcount allows the preemptive NAK code
during reception to function in 100% of the cases rather than a sender
jiffies-based timeout.
The pcie_caps default of 0x51 will set the proposed MaxPayload and
MaxReceiveReqest to 256 and 4096 respectively. The capabilities on
the root complex will be used to limit those values.
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The current code locks the QP s_lock, followed by the pending_lock, I
guess to to protect against the allocate failing.
This patch only locks the pending_lock, assuming that the empty case
is an exeception, in which case the pending_lock is dropped, and the
original code is executed. This will save a lock of s_lock in the
normal case.
The observation is that the sdma descriptors will deplete at twice the
rate of txreq's, so this should be rare.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Prevent a receive data corruption by ensuring that the write to update
the rcvhdrheadn register to generate an interrupt is at the very end
of the receive processing.
Signed-off-by: Ramkrishna Vepa <ram.vepa@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The qib driver makes use of the the 64-bit jiffies API.
Code inspection reveals that that version of the API is not really
required. This patch converts to use the "normal" jiffies.
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
More style issues revealed with checkpatch.pl -f.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Userspace verbs multicast attach/detach operations on a QP are done
while holding the rwsem of the QP for reading. That's not sufficient
since a reader lock allows more than one reader to acquire the
lock. However, multicast attach/detach does list manipulation that
can corrupt the list if multiple threads run in parallel.
Fix this by acquiring the rwsem as a writer to serialize attach/detach
operations. Add idr_write_qp() and put_qp_write() to encapsulate
this.
This fixes oops seen when running applications that perform multicast
joins/leaves.
Reported by: Mike Dubman <miked@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 53ab1c6498 ("IB/qib: Correct nfreectxts for multiple HCAs")
reversed the increments and decrements of dd->nfreectxts. Fix it.
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
private_data_len is defined as a u8. If the user specifies a large
private_data size (> 220 bytes), we will calculate a total length that
exceeds 255, resulting in private_data_len wrapping back to 0. This
can lead to overwriting random kernel memory. Avoid this by verifying
that the resulting size fits into a u8.
Reported-by: B. Thery <benjamin.thery@bull.net>
Addresses: <http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2335>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds the kernel module ib_srpt SCSI RDMA Protocol (SRP) target
implementation conforming to the SRP r16a specification for the mainline
drivers/target infrastructure.
This driver was originally developed by Vu Pham and has been optimized by
Bart Van Assche and merged into upstream LIO based on his srpt-lio-4.1
branch here:
https://github.com/bvanassche/srpt-lio/commits/srpt-lio-4.1/
This updated patch also contains the following two changes from
lio-core-2.6.git/master. One is to fix a bug with 1 >= task->task_sg[]
chained mappings in ib_srpt, and the other to convert the configfs control
plane to reference IB Port GUID and struct srpt_port directly following
mainline v4.x target_core_fabric_configfs.c convertion for ib_srpt
to work with rtslib/rtsadmin v2 code.
These seperate patches can be found here:
ib_srpt: Fix bug with chainged SGLs in srpt_map_sg_to_ib_sge
http://www.risingtidesystems.com/git/?p=lio-core-2.6.git;a=commitdiff;h=ea485147563b6555a97dbf811825fbb586519252
ib_srpt: Convert se_wwn endpoint reference to struct srpt_port->port_wwn
http://www.risingtidesystems.com/git/?p=lio-core-2.6.git;a=commitdiff;h=4e544a210acb227df1bb4ca5086e65bdf4e648ea
This also includes the following recent v1 -> v2 review changes:
ib_srpt: Fix potential out-of-bounds array access
ib_srpt: Avoid failed multipart RDMA transfers
ib_srpt: Fix srpt_alloc_fabric_acl failure case return value
ib_srpt: Update comments to reference $driver/$port layout
ib_srpt: Fix sport->port_guid formatting code
ib_srpt: Remove legacy use_port_guid_in_session_name module parameter
ib_srpt: Convert srp_max_rdma_size into per port configfs attribute
ib_srpt: Convert srp_max_rsp_size into per port configfs attribute
ib_srpt: Convert srpt_sq_size into per port configfs attribute
and v2 -> v3 review changes:
ib_srpt: Fix possible race with srp_sq_size in srpt_create_ch_ib
ib_srpt: Fix possible race with srp_max_rsp_size in srpt_release_channel_work
ib_srpt: Fix up MAX_SRPT_RDMA_SIZE define
ib_srpt: Make srpt_map_sg_to_ib_sge() failure case return -EAGAIN
ib_srpt: Convert port_guid to use subnet_prefix + interface_id formatting
ib_srpt: Make srpt_check_stop_free return kref_put status
ib_srpt: Make compilation with BUG=n proceed`
ib_srpt: Use new target_core_fabric.h include
ib_srpt: Check hex2bin() return code to silence build warning
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Vu Pham <vu@mellanox.com>
Cc: David Dillow <dillowda@ornl.gov>
Signed-off-by: Nicholas A. Bellinger <nab@risingtidesystems.com>
For SRIOV, some Hypervisor commands can be executed directly (native = 1).
Others should go through the command wrapper flow (for tracking resource
usage, for example, or for changing some HCA configurations that slaves
need to be notified of).
This patch sets the groundwork for this capability -- adding the correct
value of "native" in each case.
Note that if SRIOV is not activated, this parameter has no effect.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Port mask now has additional state.
Port can be set as "none". In this case neither the mlx4_en or mlx4_ib
drivers take ownership of the port.
In multifunction mode there is an option to set the vfs as single ported devices.
(in single function mode, both physical ports belong to same function)
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cfcde11c3d ("IB/mlx4: Use flow counters on IBoE ports") added
code that sets elements of counters[] to -1 if no counter is allocated,
but then goes ahead and passes every entry to mlx4_counter_free() on
shutdown. This is a bad idea, especially if MLX4_DEV_CAP_FLAG_COUNTERS
isn't set so there isn't even an underlying bitmap to free from.
Tested-by: Sean Hefty <sean.hefty@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reduce the number of dst_get_neighbour_noref() calls within a single
call chain. Primarily by passing the neighbour pointer down to the
helper functions.
Handle dst_get_neighbour_noref() returning NULL in ipoib_start_xmit()
by incrementing the dropped counter and freeing the packet. We don't
want it to fall through into the ARP/RARP/multicast handling, since
that should only happen when skb_dst() is NULL.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
Three pieces of code do the same thing, create a l2t entry and then
import this information into the c4iw_ep object.
Create a helper function and call it from these 3 locations instead.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
This way we consolidate the RCU locking down into the place where it
actually matters, and also we can make the code handle
dst_get_neighbour_noref() returning NULL properly.
Signed-off-by: David S. Miller <davem@davemloft.net>
IPV4 should do exactly what the IPV6 code does here, which is
use the neighbour obtained via the dst entry.
And now that the two code paths do the same thing, use a common
helper function to perform the operation.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Roland Dreier <roland@purestorage.com>
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
netdev->neigh_priv_len records the private area length.
This will trigger for neigh_table objects which set tbl->entry_size
to zero, and the first instances of this will be forthcoming.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f2c31e32b3 ("net: fix NULL dereferences in check_peer_redir()")
forgot to take care of infiniband uses of dst neighbours.
Many thanks to Marc Aurele who provided a nice bug report and feedback.
Reported-by: Marc Aurele La France <tsi@ualberta.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This following can occur with ipoib when processing a multicast reponse:
BUG: soft lockup - CPU#0 stuck for 67s! [ib_mad1:982]
Modules linked in: ...
CPU 0:
Modules linked in: ...
Pid: 982, comm: ib_mad1 Not tainted 2.6.32-131.0.15.el6.x86_64 #1 ProLiant DL160 G5
RIP: 0010:[<ffffffff814ddb27>] [<ffffffff814ddb27>] _spin_unlock_irqrestore+0x17/0x20
RSP: 0018:ffff8802119ed860 EFLAGS: 00000246
0000000000000004 RBX: ffff8802119ed860 RCX: 000000000000a299
RDX: ffff88021086c700 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffffffff8100bc8e R08: ffff880210ac229c R09: 0000000000000000
R10: ffff88021278aab8 R11: 0000000000000000 R12: ffff8802119ed860
R13: ffffffff8100be6e R14: 0000000000000001 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006d4840 CR3: 0000000209aa5000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffffa032c247>] ? ipoib_mcast_send+0x157/0x480 [ib_ipoib]
[<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffffa03283d4>] ? ipoib_path_lookup+0x124/0x2d0 [ib_ipoib]
[<ffffffffa03286fc>] ? ipoib_start_xmit+0x17c/0x430 [ib_ipoib]
[<ffffffff8141e758>] ? dev_hard_start_xmit+0x2c8/0x3f0
[<ffffffff81439d0a>] ? sch_direct_xmit+0x15a/0x1c0
[<ffffffff81423098>] ? dev_queue_xmit+0x388/0x4d0
[<ffffffffa032d6b7>] ? ipoib_mcast_join_finish+0x2c7/0x510 [ib_ipoib]
[<ffffffffa032dab8>] ? ipoib_mcast_sendonly_join_complete+0x1b8/0x1f0 [ib_ipoib]
[<ffffffffa02a0946>] ? mcast_work_handler+0x1a6/0x710 [ib_sa]
[<ffffffffa015f01e>] ? ib_send_mad+0xfe/0x3c0 [ib_mad]
[<ffffffffa00f6c93>] ? ib_get_cached_lmc+0xa3/0xb0 [ib_core]
[<ffffffffa02a0f9b>] ? join_handler+0xeb/0x200 [ib_sa]
[<ffffffffa029e4fc>] ? ib_sa_mcmember_rec_callback+0x5c/0xa0 [ib_sa]
[<ffffffffa029e79c>] ? recv_handler+0x3c/0x70 [ib_sa]
[<ffffffffa01603a4>] ? ib_mad_completion_handler+0x844/0x9d0 [ib_mad]
[<ffffffffa015fb60>] ? ib_mad_completion_handler+0x0/0x9d0 [ib_mad]
[<ffffffff81088830>] ? worker_thread+0x170/0x2a0
[<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
[<ffffffff810886c0>] ? worker_thread+0x0/0x2a0
[<ffffffff8108ddf6>] ? kthread+0x96/0xa0
[<ffffffff8100c1ca>] ? child_rip+0xa/0x20
Coinciding with stack trace is the following message:
ib0: ib_address_create failed
The code below in ipoib_mcast_join_finish() will note the above
failure in the address handle but otherwise continue:
ah = ipoib_create_ah(dev, priv->pd, &av);
if (!ah) {
ipoib_warn(priv, "ib_address_create failed\n");
} else {
The while loop at the bottom of ipoib_mcast_join_finish() will attempt
to send queued multicast packets in mcast->pkt_queue and eventually
end up in ipoib_mcast_send():
if (!mcast->ah) {
if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
skb_queue_tail(&mcast->pkt_queue, skb);
else {
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
}
My read is that the code will requeue the packet and return to the
ipoib_mcast_join_finish() while loop and the stage is set for the
"hung" task diagnostic as the while loop never sees a non-NULL ah, and
will do nothing to resolve.
There are GFP_ATOMIC allocates in the provider routines, so this is
possible and should be dealt with.
The test that induced the failure is associated with a host SM on the
same server during a shutdown.
This patch causes ipoib_mcast_join_finish() to exit with an error
which will flush the queued mcast packets. Nothing is done to unwind
the QP attached state so that subsequent sends from above will retry
the join.
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Reviewed-by: Gary Leshner <gary.leshner@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Don't over-schedule QSFP work on driver initialization. It could end
up being run simultaneously on two different CPUs resulting in bad
EEPROM reads. In combination with setting the physical IB link state
prior to the IBC being brought out of reset, this can cause the link
state machine to start training early with wrong settings.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix logic so that we don't retry with MPAv1 once we have done that
already. Otherwise, we end up retrying with MPAv1 even when its not
needed on getting peer aborts - and this could lead to kernel panic.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix another place in the code where logic dealing with the t4_cqe was
using the wrong QID. This fixes the counting logic so that it tests
against the SQ QID instead of the RQ QID when counting RCQES.
Signed-off by: Jonathan Lallinger <jonathan@ogc.us>
Signed-off by: Steve Wise <swise@ogc.us>
Signed-off-by: Roland Dreier <roland@purestorage.com>
C assignment can handle struct in6_addr copying.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was mistakenly introduced by dde05cbdf8 ("IB/qib: Hold links
until tuning data is available").
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
The following panic can occur when flushing a QP:
RIP: 0010:[<ffffffffa0168e8b>] [<ffffffffa0168e8b>] qib_send_complete+0x3b/0x190 [ib_qib]
RSP: 0018:ffff8803cdc6fc90 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff8803d84ba000 RCX: 0000000000000000
RDX: 0000000000000005 RSI: ffffc90015a53430 RDI: ffff8803d84ba000
RBP: ffff8803cdc6fce0 R08: ffff8803cdc6fc90 R09: 0000000000000001
R10: 00000000ffffffff R11: 0000000000000000 R12: ffff8803d84ba0c0
R13: ffff8803d84ba5cc R14: 0000000000000800 R15: 0000000000000246
FS: 0000000000000000(0000) GS:ffff880036600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000034 CR3: 00000003e44f9000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qib/0 (pid: 1350, threadinfo ffff8803cdc6e000, task ffff88042728a100)
Stack:
53544c5553455201 0000000100000005 0000000000000000 ffff8803d84ba000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000001 ffff8803cdc6fd30 ffffffffa0165d7a
Call Trace:
[<ffffffffa0165d7a>] qib_make_rc_req+0x36a/0xe80 [ib_qib]
[<ffffffffa0165a10>] ? qib_make_rc_req+0x0/0xe80 [ib_qib]
[<ffffffffa01698b3>] qib_do_send+0xf3/0xb60 [ib_qib]
[<ffffffff814db757>] ? thread_return+0x4e/0x777
[<ffffffffa01697c0>] ? qib_do_send+0x0/0xb60 [ib_qib]
[<ffffffff81088bf0>] worker_thread+0x170/0x2a0
[<ffffffff8108e530>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81088a80>] ? worker_thread+0x0/0x2a0
[<ffffffff8108e1c6>] kthread+0x96/0xa0
[<ffffffff8100c1ca>] child_rip+0xa/0x20
[<ffffffff8108e130>] ? kthread+0x0/0xa0
[<ffffffff8100c1c0>] ? child_rip+0x0/0x20
RIP [<ffffffffa0168e8b>] qib_send_complete+0x3b/0x190 [ib_qib]
The RC error state flush logic in qib_make_rc_req() could return all
of the acked wqes and potentially have emptied the queue. It would
then unconditionally try return a flush completion via
qib_send_complete() for an invalid wqe, or worse a valid one that is
not queued. The panic results when the completion code tries to
maintain an MR reference count for a NULL MR.
This fix modifies logic to only send one completion per
qib_make_rc_req() call and changing the completion status from
IB_WC_SUCCESS to IB_WC_WR_FLUSH_ERR as the completions progress.
The outer loop will call as many times as necessary to flush the queue.
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The current driver never does DMA unmapping on these buffers. Fix that
by adding DMA unmapping to the task cleanup callback, and DMA mapping to
the task init function (drop the headers_initialized micro-optimization).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The driver counted on the transactional nature of iSCSI login/text
flows and used the same buffer for both the request and the response.
We also went further and did DMA mapping only once, with
DMA_FROM_DEVICE, which violates the DMA mapping API. Fix that by
using different buffers, one for requests and one for responses, and
use the correct DMA mapping direction for each.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The num_free field of mthca_buddy has a type of array of unsigned int
while it was allocated as an array of pointers. On 64-bit platforms
this allocates twice more than required. Fix this by allocating the
correct size for the type.
This is the same bug just fixed in mlx4 by Eli Cohen <eli@mellanox.co.il>.
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (62 commits)
mlx4_core: Deprecate log_num_vlan module param
IB/mlx4: Don't set VLAN in IBoE WQEs' control segment
IB/mlx4: Enable 4K mtu for IBoE
RDMA/cxgb4: Mark QP in error before disabling the queue in firmware
RDMA/cxgb4: Serialize calls to CQ's comp_handler
RDMA/cxgb3: Serialize calls to CQ's comp_handler
IB/qib: Fix issue with link states and QSFP cables
IB/mlx4: Configure extended active speeds
mlx4_core: Add extended port capabilities support
IB/qib: Hold links until tuning data is available
IB/qib: Clean up checkpatch issue
IB/qib: Remove s_lock around header validation
IB/qib: Precompute timeout jiffies to optimize latency
IB/qib: Use RCU for qpn lookup
IB/qib: Eliminate divide/mod in converting idx to egr buf pointer
IB/qib: Decode path MTU optimization
IB/qib: Optimize RC/UC code by IB operation
IPoIB: Use the right function to do DMA unmap pages
RDMA/cxgb4: Use correct QID in insert_recv_cqe()
RDMA/cxgb4: Make sure flush CQ entries are collected on connection close
...
Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".
The difference between mlocking and pinning is:
A. mlocked pages are marked with PG_mlocked and are exempt from
swapping. Page migration may move them around though.
They are kept on a special LRU list.
B. Pinned pages cannot be moved because something needs to
directly access physical memory. They may not be on any
LRU list.
I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:
Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.
This patch introduces a separate counter for pinned pages and
accounts them seperately.
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These files were getting the moduleparam infrastructure from the
implicit presence of module.h being everywhere, but that is going
away soon.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
These were getting it implicitly via device.h --> module.h but
we are going to stop that when we clean up the headers.
Fix these in advance so the tree remains biscect-clean.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
They had been getting it implicitly via device.h but we can't
rely on that for the future, due to a pending cleanup so fix
it now.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
They get it via module.h (via device.h) but we want to clean that up.
When we do, we'll get things like:
CC [M] drivers/infiniband/core/sysfs.o
sysfs.c:361: error: 'S_IRUGO' undeclared here (not in a function)
sysfs.c:654: error: 'S_IWUSR' undeclared here (not in a function)
so add in the stat header it is using explicitly in advance.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
There's no need to set the vlan-related fields in an IBoE send WQE
control segment:
- the vlan to be used by a UD QP is set in the datagram segment.
- for GSI (CM) QP, all the headers down to 8021q and MAC are built by
the software anyway.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The IBoE port MTU is derived from the corresponding Ethernet netdevice
MTU, which can support jumbo frames of 9K, and hence surely supports
the max IB mtu of 4K.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
QPs need to be moved to error before telling the firwmare to shutdown
the queue. Otherwise, the application can submit WRs that will never
get fetched by the hardware and never flushed by the driver.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swsie@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 01e7da6ba5 ("RDMA/cxgb4: Make sure flush CQ entries are
collected on connection close") introduced a potential problem where a
CQ's comp_handler can get called simultaneously from different places
in the iw_cxgb4 driver. This does not comply with
Documentation/infiniband/core_locking.txt, which states that at a
given point of time, there should be only one callback per CQ should
be active.
This problem was reported by Parav Pandit <Parav.Pandit@Emulex.Com>.
Based on discussion between Parav Pandit and Steve Wise, this patch
fixes the above problem by serializing the calls to a CQ's
comp_handler using a spin_lock.
Reported-by: Parav Pandit <Parav.Pandit@Emulex.Com>
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
iw_cxgb3 has a potential problem where a CQ's comp_handler can get
called simultaneously from different places in iw_cxgb3 driver. This
does not comply with Documentation/infiniband/core_locking.txt, which
states that at a given point of time, there should be only one
callback per CQ should be active.
Such problem was reported by Parav Pandit <Parav.Pandit@Emulex.Com>
for iw_cxgb4 driver. Based on discussion between Parav Pandit and
Steve Wise, this patch fixes the above problem by serializing the
calls to a CQ's comp_handler using a spin_lock.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix an issue where the link would come up after replugging a cable
even if it has been DISABLED manually.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set the extended active speeds based on the hardware configuration.
Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
[ Move FDR-10 handling into ib_link_query_port(). - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Hold the link state machine until the tuning data is read from the
QSFP EEPROM so correct tuning settings are applied before the state
machine attempts to bring the link up. Link is also held on cable
unplug in case a different cable is used.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This was probably present from initial submission.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Review of qib_ruc_check_hdr() shows that the s_lock is not required in
the normal case. The r_lock is held in all cases, and protects the qp
fields that are read.
The s_lock will be needed to around the call to qib_migrate_qp() to
insure that the send engine sees a consistent set of fields.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
A new field is added to qib_qp called timeout_jiffies. It is
initialized upon create and modify.
The field is now used instead of a computation based on qp->timeout.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The heavy weight spinlock in qib_lookup_qpn() is replaced with RCU.
The hash list itself is now accessed via jhash functions instead of mod.
The changes should benefit multiple receive contexts in different
processors by not contending for the lock just to read the hash
structures.
The patch also adds a lookaside_qp (pointer) and a lookaside_qpn in
the context. The interrupt handler will test the current packet's qpn
against lookaside_qpn if the lookaside_qp pointer is non-NULL. The
pointer is NULL'ed when the interrupt handler exits.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The context init now saves a shift from rcvegrbufs_perchunk
rcvegrbufs_perchunk_shift using ilog2. A BUG_ON() protects the
power of 2 assumption.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Store both the encoded and decoded MTU in the QP structure as a minor
optimization for UC/RC receive routines.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The memset for zeroing work completions had been unconditional.
This patch removes the memset and moves the zeroing into the work
completion with a more explicit field by field set. With this patch,
non-ONLY/non-LAST packets will avoid the overhead since they will not
generate a completion.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
To ease skb->truesize sanitization, its better to be able to localize
all references to skb frags size.
Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pages that were mapped using ib_dma_map_page() should be unmapped
using ib_dma_unmap_page().
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When creating flushed receive CQEs, set the QPID field in the t4_cqe
to the SQ QID and not the RQ QID. Otherwise the poll code will not
find the correct QP context.
Signed-off by: Jonathan Lallinger <jonathan@ogc.us>
Signed-off by: Steve Wise <swise@ogc.us>
Signed-off-by: Roland Dreier <roland@purestorage.com>
At the time when a peer closes the connection, iw_cxgb4 will not send
a cq event if ibqp.uobject exists. In that case, its possible for a
user application to get blocked in ibv_get_cq_event().
To resolve this, call the cq's comp_handler to unblock any read from
ibv_get_cq_event(). This will trigger userspace to poll the cq and
collect flush status completions for any pending work requests.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow processes that share the same XRC domain to open an existing
shareable QP. This permits those processes to receive events on the
shared QP and transfer ownership, so that any process may modify the
QP. The latter allows the creating process to exit, while a remaining
process can still transition it for path migration purposes.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC TGT QPs are shared resources among multiple processes. Since the
creating process may exit, allow other processes which share the same
XRC domain to open an existing QP. This allows us to transfer
ownership of an XRC TGT QP to another process.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support the creation of XRC INI and TGT QPs. To handle the case where
a CQ or PD is not provided, we allocate them internally with the xrcd.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow the user to create XRC SRQs. This patch is based on a patch
from Jack Morgenstrein <jackm@dev.mellanox.co.il>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support creating and destroying XRC domains. Any sharing of the XRCD
is managed above the low-level driver.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Because an XRC TGT QP can end up being shared among multiple
processes, don't have the ib_cm automatically send a DREQ when the
userspace process that owns the ib_cm_id exits. Disconnect can be
initiated by the user directly; otherwise, the owner of the XRC INI QP
controls the connection.
Note that as a result of the process exiting, the ib_cm will stop
tracking the XRC connection on the target side. For the purposes of
disconnecting, this isn't a big deal. The ib_cm will respond to the
DREQ appropriately. For other messages, mainly LAP, the CM will
reject the request, since there's no one available to route the
request to.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow users to connect XRC QPs through the rdma_cm.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow the user to indicate the QP type separately from the port space
when allocating an rdma_cm_id. With RDMA_PS_IB, there is no longer a
1:1 relationship between the QP type and port space, so we need to
switch on the QP type to select between UD and connected QPs.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add RDMA_PS_IB. XRC QP types will use the IB port space when operating
over the RDMA CM. For the 'IP protocol' field value, we select 0x3F,
which is listed as being for 'any local network'.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The XRC annex was updated to have XRC behave more like RD. Specifically,
the XRC TGT QPN moves from the local QPN to local EECN field. Lookup of
SRQN is done using the REQ/REP protocol.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update the REQ and REP messages to support XRC connection setup
according to the XRC Annex. Several existing fields must be set to 0 or
1 when connecting XRC QPs, and a reserved field is changed to an
extended transport type.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow user space to operate on XRC TGT QPs the same way as other types
of QPs, with one notable exception: since XRC TGT QPs may be shared
among multiple processes, the XRC TGT QP is allowed to exist beyond the
lifetime of the creating process.
The process that creates the QP is allowed to destroy it, but if the
process exits without destroying the QP, then the QP will be left bound
to the lifetime of the XRCD.
TGT QPs are not associated with CQs or a PD.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC INI QPs are similar to send only RC QPs. Allow user space to create
INI QPs. Note that INI QPs do not require receive CQs.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We require additional information to create XRC SRQs than we can
exchange using the existing create SRQ ABI. Provide an enhanced create
ABI for extended SRQ types.
Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il>
and Roland Dreier <roland@purestorage.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow user space to create XRC domains. Because XRCDs are expected to
be shared among multiple processes, we use inodes to identify an XRCD.
Based on patches by Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC TGT QPs are intended to be shared among multiple users and
processes. Allow the destruction of an XRC TGT QP to be done explicitly
through ib_destroy_qp() or when the XRCD is destroyed.
To support destroying an XRC TGT QP, we need to track TGT QPs with the
XRCD. When the XRCD is destroyed, all tracked XRC TGT QPs are also
cleaned up.
To avoid stale reference issues, if a user is holding a reference on a
TGT QP, we increment a reference count on the QP. The user releases the
reference by calling ib_release_qp. This releases any access to the QP
from a user above verbs, but allows the QP to continue to exist until
destroyed by the XRCD.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).
XRC communication is between an initiator (INI) QP and a target (TGT)
QP. Target QPs are associated with SRQs through an XRCD. An XRC TGT QP
behaves like a receive-only RD QP. XRC INI QPs behave similarly to RC
QPs, except that work requests posted to an XRC INI QP must specify the
remote SRQ that is the target of the work request.
We define two new QP types for XRC, to distinguish between INI and TGT
QPs, and update the core layer to support XRC QPs.
This patch is derived from work by Jack Morgenstein
<jackm@dev.mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).
XRC defines SRQs that are specifically used by XRC connections. Expand
the SRQ code to support XRC SRQs. An XRC SRQ is currently restricted to
only XRC use according to the IB XRC Annex.
Portions of this patch were derived from work by
Jack Morgenstein <jackm@dev.mellanox.co.il>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, there is only a single ("basic") type of SRQ, but with XRC
support we will add a second. Prepare for this by defining an SRQ type
and setting all current users to IB_SRQT_BASIC.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
XRC ("eXtended reliable connected") is an IB transport that provides
better scalability by allowing senders to specify which shared receive
queue (SRQ) should be used to receive a message, which essentially
allows one transport context (QP connection) to serve multiple
destinations (as long as they share an adapter, of course).
A few new concepts are introduced to support this. This patch adds:
- A new device capability flag, IB_DEVICE_XRC, which low-level
drivers set to indicate that a device supports XRC.
- A new object type, XRC domains (struct ib_xrcd), and new verbs
ib_alloc_xrcd()/ib_dealloc_xrcd(). XRCDs are used to limit which
XRC SRQs an incoming message can target.
This patch is derived from work by Jack Morgenstein <jackm@dev.mellanox.co.il>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use new function ib_rate_to_mbps() to handle printing rate in debugfs,
so that we handle extended rates.
Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Introduce support for the following extended speeds:
FDR-10: a Mellanox proprietary link speed which is 10.3125 Gbps with
64b/66b encoding rather than 8b/10b encoding.
FDR: IBA extended speed 14.0625 Gbps.
EDR: IBA extended speed 25.78125 Gbps.
Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix build errors:
drivers/infiniband/hw/ipath/ipath_init_chip.c:54:1: error: 'S_IRUGO' undeclared here (not in a function)
drivers/infiniband/hw/ipath/ipath_init_chip.c:54:1: error: bit-field '<anonymous>' width not an integer constant
drivers/infiniband/hw/ipath/ipath_init_chip.c:67:1: error: 'S_IWUSR' undeclared here (not in a function)
drivers/infiniband/hw/ipath/ipath_init_chip.c:67:1: error: bit-field '<anonymous>' width not an integer constant
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support for Packed and Unaligned (PAU) FPDUs is needed for
interoperability between NES and non-NES nodes. When the NES hardware
detects a PAU frame, it will pass it to the driver to process the
frame. NES driver creates a new frame for each FPDU and forwards it
to the hardware to be sent to its associated qp.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Print the IP address of the remote host when a critical asynchronous event is
received.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fixes a crash that occurs during close when error async event is received.
Terminate message is not sent to the remote node if already processing close.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When I converted some drivers from pci_map_page to skb_frag_dma_map I
neglected to convert PCI_DMA_xDEVICE into DMA_x_DEVICE and
pci_dma_mapping_error into dma_mapping_error.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for Enhanced RDMA Connection Establishment
(draft-ietf-storm-mpa-peer-connect-06), aka MPAv2. Details of draft
can be obtained from:
<http://www.ietf.org/id/draft-ietf-storm-mpa-peer-connect-06.txt>
For backwards compatibility, the MPAv2 enabled driver reverts to MPAv1
if the remote node doesn't support MPAv2.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds support for Enhanced RDMA Connection Establishment
(draft-ietf-storm-mpa-peer-connect-06), aka MPAv2. Details of draft
can be obtained from:
<http://www.ietf.org/id/draft-ietf-storm-mpa-peer-connect-06.txt>
The patch updates the following functions for initiator perspective:
- send_mpa_request
- process_mpa_reply
- post_terminate for TERM error codes
- destroy_qp for TERM related change
- adds layer/etype/ecode to c4iw_qp_attrs for sending with TERM
- peer_abort for retrying connection attempt with MPA_v1 message
- added c4iw_reconnect function
The patch updates the following functions for responder perspective:
- process_mpa_request
- send_mpa_reply
- c4iw_accept_cr
- passes ird/ord to upper layers
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
As part of MPAv2 Enhanced RDMA Negotiation, pass max supported ird/ord
values upwards for the time being in iw_cxgb3 and amso1100.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update struct iw_cm_event to support propagating the ird/ord values
upwards to the application.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code that was recently introduced to report the number
of free contexts is flawed for multiple HCAs:
/* Return the number of free user ports (contexts) available. */
return scnprintf(buf, PAGE_SIZE, "%u\n", dd->cfgctxts -
dd->first_user_ctxt - (u32)qib_stats.sps_ctxts);
The qib_stats is global to the module, not per HCA, so the code is broken
for multiple HCAs.
This patch adds a qib_devdata field, freectxts, that reflects the free
contexts for this HCA.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add calls to ib_umem_release(), as in the other error-handling code in
nes_reg_user_mr().
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <roland@purestorage.com>
cmd is unsigned, no need to check for < 0. Found by code inspection.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If a received MAD contains an invalid or reserved mgmt class, we will
attempt to access method_table outside of its range. Add a check to
ensure that mgmt class falls within the handled range.
Found by code inspection.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Check that conn_param is not null before dereferencing it when
processing rdma_accept(). This is necessary to prevent a possible
system crash, which can be caused by user space.
Problem found by code inspection.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Since commit e58aa3d2d0 ("genirq: Run irq handlers with interrupts
disabled"), we run all interrupt handlers with interrupts disabled and
we even check and yell when an interrupt handler returns with
interrupts enabled -- cf commit b738a50a20 ("genirq: Warn when
handler enables interrupts").
So now this flag is a no-op and can be removed.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The iw_cxgb4 module crashes at init time if the T4 card does not
support RDMA. So clean up the init logic to correctly deal with
non-RDMA cards.
- If any RDMA resources are not available, then fail the initialization
logging an info message.
- Clean up properly on initialization failures.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The RDMA CM uses the local qp_type to determine how to process an
incoming request. This can result in an incoming REQ being treated as
a SIDR REQ and vice versa. Fix this by switching off the event type
instead, and for good measure verify that the listener supports the
incoming connection request.
This problem showed up when a user space application mismatched the QP
types between a client and server app.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This oops was reported recently:
d:mon> e
cpu 0xd: Vector: 300 (Data Access) at [c0000000fd4c7120]
pc: d00000000076f194: .t3_l2t_get+0x44/0x524 [cxgb3]
lr: d000000000b02108: .init_act_open+0x150/0x3d4 [cxgb3i]
sp: c0000000fd4c73a0
msr: 8000000000009032
dar: 0
dsisr: 40000000
current = 0xc0000000fd640d40
paca = 0xc00000000054ff80
pid = 5085, comm = iscsid
d:mon> t
[c0000000fd4c7450] d000000000b02108 .init_act_open+0x150/0x3d4 [cxgb3i]
[c0000000fd4c7500] d000000000e45378 .cxgbi_ep_connect+0x784/0x8e8 [libcxgbi]
[c0000000fd4c7650] d000000000db33f0 .iscsi_if_rx+0x71c/0xb18
[scsi_transport_iscsi2]
[c0000000fd4c7740] c000000000370c9c .netlink_data_ready+0x40/0xa4
[c0000000fd4c77c0] c00000000036f010 .netlink_sendskb+0x4c/0x9c
[c0000000fd4c7850] c000000000370c18 .netlink_sendmsg+0x358/0x39c
[c0000000fd4c7950] c00000000033be24 .sock_sendmsg+0x114/0x1b8
[c0000000fd4c7b50] c00000000033d208 .sys_sendmsg+0x218/0x2ac
[c0000000fd4c7d70] c00000000033f55c .sys_socketcall+0x228/0x27c
[c0000000fd4c7e30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000080da560cfc
The root cause was an EEH error, which sent us down the offload_close path in
the cxgb3 driver, which in turn sets cdev->l2opt to NULL, without regard for
upper layer driver (like the cxgbi drivers) which might have execution contexts
in the middle of its use. The result is the oops above, when t3_l2t_get attempts
to dereference L2DATA(cdev)->nentries in arp_hash right after the EEH error handler sets it to NULL.
The fix is to prevent the setting of the NULL pointer until after there are no
further users of it. The t3cdev->l2opt pointer is now converted to be an rcu
pointer and the L2DATA macro is now called under the protection of the
rcu_read_lock(). When the EEH error path:
t3_adapter_error->offload_close->cxgb3_offload_deactivate
Is exectured, setting of that l2opt pointer to NULL, is now gated on an rcu
quiescence point, preventing, allowing L2DATA callers to safely check for a NULL
pointer without concern that the underlying data will be freeded before the
pointer is dereferenced.
This has been tested by the reporter and shown to fix the reproted oops
[nhorman: fix up unitinialised variable reported by Dan Carpenter]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Karen Xie <kxie@chelsio.com>
Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
The iscsi class currently does not support writable sysfs
attrs for LLD sysfs settings. This patch converts the
iscsi class and driver's host attrs to use the attribute
container sysfs group and the sysfs group's is_visible callout
to be able to support readable or writable sysfs attrs.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
The iscsi class currently does not support writable sysfs
attrs for LLD sysfs settings. This patch converts the
iscsi class and driver's session attrs to use the attribute
container sysfs group and the sysfs group's is_visible callout
to be able to support readable or writable sysfs attrs.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
The iscsi class currently does not support writable sysfs
attrs for LLD sysfs settings. This patch converts the
iscsi class and drivers to use the attribute container
sysfs group and the sysfs group's is_visible callout
to be able to support readable or writable sysfs attrs.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tom Tucker <tom@opengridcomputing.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC3270 mandates that iSCSI PDUs are padded to the closest integer
number of four byte words. Fix the iser code to support that on both
the TX/RX flows.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code that prepares the SG associated with SCSI command for FMR was
buggy for systems with DMA addresses that don't fit in unsigned long,
e.g under the 32-bit based XenServer dom0 sizeof(dma_addr_t) is 8.
Fix that by casting to unsigned long long a masking constant used by
the code. This resolves a crash in iser_sg_to_page_vec on this system.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Moves the drivers for the Chelsio chipsets into
drivers/net/ethernet/chelsio/ and the necessary Kconfig and Makefile
changes.
CC: Divy Le Ray <divy@chelsio.com>
CC: Dimitris Michailidis <dm@chelsio.com>
CC: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
target: Convert to DIV_ROUND_UP_SECTOR_T usage for sectors / dev_max_sectors
kernel.h: Add DIV_ROUND_UP_ULL and DIV_ROUND_UP_SECTOR_T macro usage
iscsi-target: Add iSCSI fabric support for target v4.1
iscsi: Add Serial Number Arithmetic LT and GT into iscsi_proto.h
iscsi: Use struct scsi_lun in iscsi structs instead of u8[8]
iscsi: Resolve iscsi_proto.h naming conflicts with drivers/target/iscsi
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch renames the following iscsi_proto.h structures to avoid
namespace issues with drivers/target/iscsi/iscsi_target_core.h:
*) struct iscsi_cmd -> struct iscsi_scsi_req
*) struct iscsi_cmd_rsp -> struct iscsi_scsi_rsp
*) struct iscsi_login -> struct iscsi_login_req
This patch includes useful ISCSI_FLAG_LOGIN_[CURRENT,NEXT]_STAGE*,
and ISCSI_FLAG_SNACK_TYPE_* definitions used by iscsi_target_mod, and
fixes the incorrect definition of struct iscsi_snack to following
RFC-3720 Section 10.16. SNACK Request.
Also, this patch updates libiscsi, iSER, be2iscsi, and bn2xi to
use the updated structure definitions in a handful of locations.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Nicholas A. Bellinger <nab@linux-iscsi.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (26 commits)
IB/qib: Defer HCA error events to tasklet
mlx4_core: Bump the driver version to 1.0
RDMA/cxgb4: Use printk_ratelimited() instead of printk_ratelimit()
IB/mlx4: Support PMA counters for IBoE
IB/mlx4: Use flow counters on IBoE ports
IB/pma: Add include file for IBA performance counters definitions
mlx4_core: Add network flow counters
mlx4_core: Fix location of counter index in QP context struct
mlx4_core: Read extended capabilities into the flags field
mlx4_core: Extend capability flags to 64 bits
IB/mlx4: Generate GID change events in IBoE code
IB/core: Add GID change event
RDMA/cma: Don't allow IPoIB port space for IBoE
RDMA: Allow for NULL .modify_device() and .modify_port() methods
IB/qib: Update active link width
IB/qib: Fix potential deadlock with link down interrupt
IB/qib: Add sysfs interface to read free contexts
IB/mthca: Remove unnecessary read of PCI_CAP_ID_EXP
IB/qib: Remove double define
IB/qib: Remove unnecessary read of PCI_CAP_ID_EXP
...
With ib_qib options:
options ib_qib krcvqs=1 pcie_caps=0x51 rcvhdrcnt=4096 singleport=1 ibmtu=4
a run of ib_write_bw -a yields the following:
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
1048576 5000 2910.64 229.80
------------------------------------------------------------------
The top cpu use in a profile is:
CPU: Intel Architectural Perfmon, speed 2400.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask
of 0x00 (No unit mask) count 1002300
Counted LLC_MISSES events (Last level cache demand requests from this core that
missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples % samples % app name symbol name
15237 29.2642 964 17.1195 ib_qib.ko qib_7322intr
12320 23.6618 1040 18.4692 ib_qib.ko handle_7322_errors
4106 7.8860 0 0 vmlinux vsnprintf
Analysis of the stats, profile, the code, and the annotated profile indicate:
- All of the overflow interrupts (one per packet overflow) are
serviced on CPU0 with no mitigation on the frequency.
- All of the receive interrupts are being serviced by CPU0. (That is
the way truescale.cmds statically allocates the kctx IRQs to CPU)
- The code is spending all of its time servicing QIB_I_C_ERROR
RcvEgrFullErr interrupts on CPU0, starving the packet receive
processing.
- The decode_err routine is very inefficient, using a printf variant
to format a "%s" and continues to loop when the errs mask has been
cleared.
- Both qib_7322intr and handle_7322_errors read pci registers, which
is very inefficient.
The fix does the following:
- Adds a tasklet to service QIB_I_C_ERROR
- Replaces the very inefficient scnprintf() with a memcpy(). A field
is added to qib_hwerror_msgs to save the sizeof("string") at
compile time so that a strlen is not needed during err_decode().
- The most frequent errors (Overflows) are serviced first to exit the
loop as early as possible.
- The loop now exits as soon as the errs mask is clear rather than
fruitlessly looping through the msp array.
With this fix the performance changes to:
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
1048576 5000 2990.64 2941.35
------------------------------------------------------------------
During testing of the error handling overflow patch, it was determined
that some CPU's were slower when servicing both overflow and receive
interrupts on CPU0 with different MSI interrupt vectors.
This patch adds an option (krcvq01_no_msi) to not use a dedicated MSI
interrupt for kctx's < 2 and to service them on the default interrupt.
For some CPUs, the cost of the interrupt enter/exit is more costly
than then the additional PCI read in the default handler.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- unify vlan and nonvlan rx path
- kill nesvnic->vlan_grp and nes_netdev_vlan_rx_register
- allow to turn on/off rx/tx vlan accel via ethtool (set_features)
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since printk_ratelimit() shouldn't be used anymore (see comment in
include/linux/printk.h), replace it with printk_ratelimited().
Signed-off-by: Manuel Zerpies <manuel.f.zerpies@ww.stud.uni-erlangen.de>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use the per port counter attached to all QPs created on that port to
implement port level packets/bytes performance counters a la IB.
Derived from a patch by Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allocate flow counter per Ethernet/IBoE port, and attach this counter
to all the QPs created on that port. Based on patch by Eli Cohen
<eli@mellanox.co.il>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move the various definitions and mad structures needed for software
implementation of IBA PM agent from the ipath and qib drivers into a
single include file, which in turn could be used by more consumers.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
IBoE doesn't use LIDs. Use the GID change event to update the IB core
cache for addition/deletion of GIDs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add IB GID change event type. This is needed for IBoE when the HW
driver updates the GID (e.g when new VLANs are added/deleted) table
and the change should be reflected to the IB core cache.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch fixes a kernel crash in cma_set_qkey().
When the link layer is Ethernet, it is wrong to use IPoIB port space
since no IPoIB interface is available. Specifically, setting the
Q_Key when port space is RDMA_PS_IPOIB requires MGID calculation and
an SA query, which doesn't make sense over Ethernet.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
These methods don't make sense for iWARP devices, so rather than
forcing them to implement stubs, just return -ENOSYS in the core if
the hardware driver doesn't set .modify_device and/or .modify_port.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update the active link width on QLE7220 chips when link goes down if
chip width does not match shadowed width.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There is a possibility of a deadlock due to the way locks are
acquired and released in qib_set_uevent_bits(). The function
qib_set_uevent_bits() is called in process context and it uses
spin_lock() and spin_unlock(). This same lock is acquired/released
in interrupt context which can lead to a deadlock when running on
the same cpu.
The fix is to replace spin_lock() and spin_unlock() with
spin_lock_irqsave() and spin_unlock_irqrestore() respectively in
qib_set_uevent_bits().
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Indicate the number of free user contexts via the sysfs file
/sys/class/infiniband/qib0/nfreectxts as required for PSM.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The PCIE capability offset is saved during PCI bus walking. It will
remove an unnecessary search in the PCI configuration space if this
value is referenced instead of reacquiring it. Also, pci_is_pcie is a
better way of determining if the device is PCIE or not (as it uses the
same saved PCIE capability offset).
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The PCIE capability offset is saved during PCI bus walking. It will
remove an unnecessary search in the PCI configuration space if this
value is referenced instead of reacquiring it.
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Adapt to new api. We plan to remove old one later. Almost all
changes are trivial, but there is one real fix: the following code is
unsafe:
int ncpus = num_online_cpus()
for (i = 0; i < ncpus; i++) {
..
}
because 1) we don't guarantee last bit of online cpus is equal to
num_online_cpus(). some arch assign sparse cpu number. 2) cpu
hotplugging may change cpu_online_mask at same time. we need to pin
it by get_online_cpus().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Adapt to use new APIs. We plan to remove old one later and plan to
change current->cpus_allowed implementation.
No functional change.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Avoid assigning an IS_ERR value to the cm_id pointer. This fixes a
few anomalies in the error flow due to confusion about checking for
NULL vs IS_ERR, and eliminates the need to test for the IS_ERR value
every time we wish to determine if the cma_id object has a cm device
associated with it.
Also, eliminate the now-unnecessary procedure cma_has_cm_dev (we can
check directly for the existence of the device pointer -- for a
non-NULL check, makes no difference if it is the iwarp or the ib
pointer).
Finally, make a few code changes here to improve coding consistency.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
SCSI scanning of a channel🆔lun triplet in Linux works as follows
(function scsi_scan_target() in drivers/scsi/scsi_scan.c):
- If lun == SCAN_WILD_CARD, send a REPORT LUNS command to the target
and process the result.
- If lun != SCAN_WILD_CARD, send an INQUIRY command to the LUN
corresponding to the specified channel🆔lun triplet to verify
whether the LUN exists.
So a SCSI driver must either take the channel and target id values in
account in its quecommand() function or it should declare that it only
supports one channel and one target id.
Currently the ib_srp driver does neither. As a result scanning the
SCSI bus via e.g. rescan-scsi-bus.sh causes many duplicate SCSI
devices to be created. For each 0:0:L device, several duplicates are
created with the same LUN number and with (C:I) != (0:0). Fix this by
declaring that the ib_srp driver only supports one channel and one
target id.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: <stable@kernel.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commits 71c29bd5c2 ("IB/uverbs: Add devnode method to set path/mode")
and c3af0980ce ("IB: Add devnode methods to cm_class and umad_class")
added devnode methods that set the mode.
However, these methods don't check for a NULL mode, and so we get a
crash when unloading modules because devtmpfs_delete_node() calls
device_get_devnode() with mode == NULL.
Add the missing checks.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
[ Also fix cm.c. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to timing, it is possible for the LOS and DFE to remain on. This
is due to the link progressing to LinkUP prior to the driver getting
the first Status Changed interrupt. By expanding the conditions under
which LOS is turned off and DFE timeout is being set, timing is no
longer an issue.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- fix a race where the driver could end up sending a close_con_req
after an abort_rpl. In c4iw_ep_disconnect(), send abort or close
request with the ep mutex held.
- fix a hang where driver fails to wake up when a connection is reset
during a normal close. Wake up any waiters in the interrupt path,
and correctly cleanup after rdma_fini() failures.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove left-over code from T3 that limited MR sizes to 32b.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Memory allocated for user CQs gets rounded up to the next page
boundary. And after rounding, we recalculate the resulting IQ depth
and we need to make sure we don't exceed the HW limits.
This bug can result a much smaller CQ allocated than was expected if
the HW size field is exceeded, resulting in CQ overflow failures.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.
Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* remove interrupt.g inclusion from netdevice.h -- not needed
* fixup fallout, add interrupt.h and hardirq.h back where needed.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA/cma: Save PID of ID's owner
RDMA/cma: Add support for netlink statistics export
RDMA/cma: Pass QP type into rdma_create_id()
RDMA: Update exported headers list
RDMA/cma: Export enum cma_state in <rdma/rdma_cm.h>
RDMA/nes: Add a check for strict_strtoul()
RDMA/cxgb3: Don't post zero-byte read if endpoint is going away
RDMA/cxgb4: Use completion objects for event blocking
IB/srp: Fix integer -> pointer cast warnings
IB: Add devnode methods to cm_class and umad_class
IB/mad: Return EPROTONOSUPPORT when an RDMA device lacks the QP required
IB/uverbs: Add devnode method to set path/mode
RDMA/ucma: Add .nodename/.mode to tell userspace where to create device node
RDMA: Add netlink infrastructure
RDMA: Add error handling to ib_core_init()
Save the PID associated with an RDMA CM ID for reporting via netlink.
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add callbacks and data types for statistics export of all current
devices/ids. The schema for RDMA CM is a series of netlink messages.
Each one contains an rdma_cm_stat struct. Additionally, two netlink
attributes are created for the addresses for each message (if
applicable).
Their types used are:
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR (The source address for this ID)
RDMA_NL_RDMA_CM_ATTR_DST_ADDR (The destination address for this ID)
sockaddr_* structs are encapsulated within these attributes.
In other words, every transaction contains a series of messages like:
-------message 1-------
struct rdma_cm_id_stats {
__u32 qp_num;
__u32 bound_dev_if;
__u32 port_space;
__s32 pid;
__u8 cm_state;
__u8 node_type;
__u8 port_num;
__u8 reserved;
}
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute - contains the source address
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute - contains the destination address
-------end 1-------
-------message 2-------
struct rdma_cm_id_stats
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute
-------end 2-------
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The RDMA CM currently infers the QP type from the port space selected
by the user. In the future (eg with RDMA_PS_IB or XRC), there may not
be a 1-1 correspondence between port space and QP type. For netlink
export of RDMA CM state, we want to export the QP type to userspace,
so it is cleaner to explicitly associate a QP type to an ID.
Modify rdma_create_id() to allow the user to specify the QP type, and
use it to make our selections of datagram versus connected mode.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move cma.c's internal definition of enum cma_state to enum rdma_cm_state
in an exported header so that it can be exported via RDMA netlink.
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
It should check if strict_strtoul() succeeds before using
'wqm_quanta_value'.
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
[ Convert to kstrtoul() directly while we're here. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
tx_ack() wasn't checking the endpoint state and consequently would
attempt to post the p2p 0B read on an endpoint/QP that is closing or
aborting. This causes a NULL pointer dereference crash.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There exists a race condition when using wait_queue_head_t objects
that are declared on the stack. This was being done in a few places
where we are sending work requests to the FW and awaiting replies, but
we don't have an endpoint structure with an embedded c4iw_wr_wait
struct. So the code was allocating it locally on the stack. Bad
design. The race is:
1) thread on cpuX declares the wait_queue_head_t on the stack, then
posts a firmware WR with that wait object ptr as the cookie to be
returned in the WR reply. This thread will proceed to block in
wait_event_timeout() but before it does:
2) An interrupt runs on cpuY with the WR reply. fw6_msg() handles
this and calls c4iw_wake_up(). c4iw_wake_up() sets the condition
variable in the c4iw_wr_wait object to TRUE and will call
wake_up(), but before it calls wake_up():
3) The thread on cpuX calls c4iw_wait_for_reply(), which calls
wait_event_timeout(). The wait_event_timeout() macro checks the
condition variable and returns immediately since it is TRUE. So
this thread never blocks/sleeps. The function then returns
effectively deallocating the c4iw_wr_wait object that was on the
stack.
4) So at this point cpuY has a pointer to the c4iw_wr_wait object
that is no longer valid. Further its pointing to a stack frame
that might now be in use by some other context/thread. So cpuY
continues execution and calls wake_up() on a ptr to a wait object
that as been effectively deallocated.
This race, when it hits, can cause a crash in wake_up(), which I've
seen under heavy stress. It can also corrupt the referenced stack
which can cause any number of failures.
The fix:
Use struct completion, which supports on-stack declarations.
Completions use a spinlock around setting the condition to true and
the wake up so that steps 2 and 4 above are atomic and step 3 can
never happen in-between.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Fix
drivers/infiniband/ulp/srp/ib_srp.c: In function 'srp_handle_recv':
drivers/infiniband/ulp/srp/ib_srp.c:1150: warning: cast to pointer from integer of different size
drivers/infiniband/ulp/srp/ib_srp.c: In function 'srp_send_completion':
drivers/infiniband/ulp/srp/ib_srp.c🔢 warning: cast to pointer from integer of different size
by adding an intermediate cast to uintptr_t.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-by: David Dillow <dillowda@ornl.gov>
We want the ucmX, umadX and issmX device nodes to show up under
/dev/infiniband, and additionally ucmX should have mode 0666. Add
appropriate devnode methods to their class structs for this.
Signed-off-by: Roland Dreier <roland@purestorage.com>
We had a script which was looping through the devices returned from
ibstat and attempted to register a SMI agent on an ethernet device.
This caused a kernel panic for IBoE devices that don't have QP0.
Fix this by checking if the QP exists before using it.
Signed-off-by: Ira Weiny <weiny2@llnl.gov>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We want udev to create a device node under /dev/infiniband with
permission 0666 for uverbsX devices, so add a devnode method to set the
appropriate info.
Signed-off-by: Roland Dreier <roland@purestorage.com>
We want udev to create a device node under /dev/infiniband with
permission 0666 for rdma_cm, so add that info to our struct miscdevice.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
After discovering that wide use of prefetch on modern CPUs
could be a net loss instead of a win, net drivers which were
relying on the implicit inclusion of prefetch.h via the list
headers showed up in the resulting cleanup fallout. Give
them an explicit include via the following $0.02 script.
=========================================
#!/bin/bash
MANUAL=""
for i in `git grep -l 'prefetch(.*)' .` ; do
grep -q '<linux/prefetch.h>' $i
if [ $? = 0 ] ; then
continue
fi
( echo '?^#include <linux/?a'
echo '#include <linux/prefetch.h>'
echo .
echo w
echo q
) | ed -s $i > /dev/null 2>&1
if [ $? != 0 ]; then
echo $i needs manual fixup
MANUAL="$i $MANUAL"
fi
done
echo ------------------- 8\<----------------------
echo vi $MANUAL
=========================================
Signed-off-by: Paul <paul.gortmaker@windriver.com>
[ Fixed up some incorrect #include placements, and added some
non-network drivers and the fib_trie.c case - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
macvlan: fix panic if lowerdev in a bond
tg3: Add braces around 5906 workaround.
tg3: Fix NETIF_F_LOOPBACK error
macvlan: remove one synchronize_rcu() call
networking: NET_CLS_ROUTE4 depends on INET
irda: Fix error propagation in ircomm_lmp_connect_response()
irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
be2net: Kill set but unused variable 'req' in lancer_fw_download()
irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
isdn: capi: Use pr_debug() instead of ifdefs.
tg3: Update version to 3.119
tg3: Apply rx_discards fix to 5719/5720
...
Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
Add basic RDMA netlink infrastructure that allows for registration of
RDMA clients for which data is to be exported and supplies message
construction callbacks.
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
[ Reorganize a few things, add CONFIG_NET dependency. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Manual merge of arch/powerpc/kernel/smp.c and add missing scheduler_ipi()
call to arch/powerpc/platforms/cell/interrupt.c
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The driver reads PCI revision ID from the PCI configuration register
while it's already stored by PCI subsystem in the revision field of
struct pci_dev.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The IW_CM_EVENT_STATUS_xxx values were used in only a couple of places;
cma.c uses -Exxx values instead, and so do the amso1100, cxgb3 and cxgb4
drivers -- only nes was using the enum values (with the mild consequence
that all nes connection failures were treated as generic errors rather
than reported as timeouts or rejections).
We can fix this confusion by getting rid of enum iw_cm_event_status and
using a plain int for struct iw_cm_event.status, and converting nes to
use -Exxx as the other iWARP drivers do.
This also gets rid of the warning
drivers/infiniband/core/cma.c: In function 'cma_iw_handler':
drivers/infiniband/core/cma.c:1333:3: warning: case value '4294967185' not in enumerated type 'enum iw_cm_event_status'
drivers/infiniband/core/cma.c:1336:3: warning: case value '4294967186' not in enumerated type 'enum iw_cm_event_status'
drivers/infiniband/core/cma.c:1332:3: warning: case value '4294967192' not in enumerated type 'enum iw_cm_event_status'
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Faisal Latif <faisal.latif@intel.com>
Commit 44c10138fd ("PCI: Change all drivers to use pci_device->revision")
already converted this driver to using the revision field of struct
pci_dev but commit bb9171448d ("IB/ipath: Misc changes to prepare
for IB7220 introduction") later reverted that change for some strange
reason. Restore the change.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The time limit test now correctly checks against current jiffies to
avoid the hang.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
A few more EEH fixes:
c4iw_wait_for_reply(): detect fatal EEH condition on timeout and
return an error.
The iw_cxgb4 driver was only calling ib_deregister_device() on an EEH
event followed by a ib_register_device() when the device was
reinitialized. However, the RDMA core doesn't allow multiple
iterations of register/deregister by the provider. See
drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs() where
the kobject ref is held until the device is deallocated in
ib_deallocate_device(). Calling deregister adds this kobj reference,
and then a subsequent register call will generate a WARN_ON() from the
kobject subsystem because the kobject is being initialized but is
already initialized with the ref held.
So the provider must deregister and dealloc when resetting for an EEH
event, then alloc/register to re-initialize. To do this, we cannot
use the device ptr as our ULD handle since it will change with each
reallocation. This commit adds a ULD context struct which is used as
the ULD handle, and then contains the device pointer and other state
needed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The driver was never really waiting for RDMA_WR/FINI completions
because the condition variable used to determine if the completion
happened was never reset, and this condition variable is reused for
both connection setup and teardown. This causes various driver
crashes under heavy loads due to releasing resources too early.
The fix is to use atomic bits to correctly reset the condition
immediately after the completion is detected.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Parens are missing: '|' has a higher presedence than '?'.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
c4iw_uld_add() must return ERR_PTR() values instead of NULL on failure.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Concurrent ingress CLOSE and ULP ABORT operations causes a crash due
to a race condition where the close path releases the EP lock and then
tries to move the QP state to CLOSED. This must be done inside the EP
lock to avoid the race.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Lustre requires that clients bind to a privileged port number before
connecting to a remote server. On larger clusters (typically more
than about 1000 nodes), the number of privileged ports is exhausted,
resulting in lustre being unusable.
To handle this, we add support for reusable addresses to the rdma_cm.
This mimics the behavior of the socket option SO_REUSEADDR. A user
may set an rdma_cm_id to reuse an address before calling
rdma_bind_addr() (explicitly or implicitly). If set, other
rdma_cm_id's may be bound to the same address, provided that they all
have reuse enabled, and there are no active listens.
If rdma_listen() is called on an rdma_cm_id that has reuse enabled, it
will only succeed if there are no other id's bound to that same
address. The reuse option is exported to user space. The behavior of
the kernel reuse implementation was verified against that given by
sockets.
This patch is derived from a path by Ira Weiny <weiny2@llnl.gov>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
cma_use_port() assumes that the sockaddr is an IPv4 address. Since
IPv6 addressing is supported (and also to support other address
families) make the code more generic in its address handling.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This updates the network drivers so that they don't access the
ethtool_cmd::speed field directly, but use ethtool_cmd_speed()
instead.
For most of the drivers, these changes are purely cosmetic and don't
fix any problem, such as for those 1GbE/10GbE drivers that indirectly
call their own ethtool get_settings()/mii_ethtool_gset(). The changes
are meant to enforce code consistency and provide robustness with
future larger throughputs, at the expense of a few CPU cycles for each
ethtool operation.
All drivers compiled with make allyesconfig ion x86_64 have been
updated.
Tested: make allyesconfig on x86_64 + e1000e/bnx2x work
Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit fe3cc0d99d ("powerpc: Add
pgprot_writecombine") in benh's tree exposes the pgprot_writecombine()
API to drivers on powerpc. cxgb4 has an open-coded version of the same,
so use the common API now that it's available.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Anton Blanchard <anton@samba.org>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (56 commits)
route: Take the right src and dst addresses in ip_route_newports
ipv4: Fix nexthop caching wrt. scoping.
ipv4: Invalidate nexthop cache nh_saddr more correctly.
net: fix pch_gbe section mismatch warning
ipv4: fix fib metrics
mlx4_en: Removing HW info from ethtool -i report.
net_sched: fix THROTTLED/RUNNING race
drivers/net/a2065.c: Convert release_resource to release_region/release_mem_region
drivers/net/ariadne.c: Convert release_resource to release_region/release_mem_region
bonding: fix rx_handler locking
myri10ge: fix rmmod crash
mlx4_en: updated driver version to 1.5.4.1
mlx4_en: Using blue flame support
mlx4_core: reserve UARs for userspace consumers
mlx4_core: maintain available field in bitmap allocator
mlx4: Add blue flame support for kernel consumers
mlx4_en: Enabling new steering
mlx4: Add support for promiscuous mode in the new steering model.
mlx4: generalization of multicast steering.
mlx4_en: Reporting HW revision in ethtool -i
...
Commit 1765a57533 ("net: make dev->master general") introduced a
test of an uninitialized netdev. Fix the code so the intended netdev
is tested.
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB: Increase DMA max_segment_size on Mellanox hardware
IB/mad: Improve an error message so error code is included
RDMA/nes: Don't print success message at level KERN_ERR
RDMA/addr: Fix return of uninitialized ret value
IB/srp: try to use larger FMR sizes to cover our mappings
IB/srp: add support for indirect tables that don't fit in SRP_CMD
IB/srp: rework mapping engine to use multiple FMR entries
IB/srp: allow sg_tablesize to be set for each target
IB/srp: move IB CM setup completion into its own function
IB/srp: always avoid non-zero offsets into an FMR
The same packet steering mechanism would be used both for IB and Ethernet,
Both multicasts and unicasts.
This commit prepares the general infrastructure for this.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
HW revision is derived from device ID and rev id.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, each device is assumed to be able only handle 64 KB chunks
during DMA. By giving the segment size a larger value, the block layer
will coalesce more S/G entries together for SRP, allowing larger
requests with the same sg_tablesize setting. The block layer is the
only direct user of it, though a few IOMMU drivers reference it as
well for their *_map_sg coalescing code. pci-gart_64 on x86, and a
smattering on on sparc, powerpc, and ia64.
Since other IB protocols could potentially see larger segments with
this, let's check those:
- iSER is fine, because you limit your maximum request size to 512
KB, so we'll never overrun the page vector in struct iser_page_vec
(128 entries currently). It is independent of the DMA segment size,
and handles multi-page segments already.
- IPoIB is fine, as it maps each page individually, and doesn't use
ib_dma_map_sg().
- RDS appears to do the right thing and has no dependencies on DMA
segment size, but I don't claim to have done a complete audit.
- NFSoRDMA and 9p are OK -- they do not use ib_dma_map_sg(), so they
doesn't care about the coalescing.
- Lustre's ko2iblnd does not care about coalescing -- it properly
walks the returned sg list.
This patch ups the value on Mellanox hardware to 1 GB, which matches
reported firmware limits on mlx4.
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There's no reason to print "NetEffect RNIC driver successfully loaded"
at level KERN_ERR (where it will uglify the console on a quiet boot).
Change it to KERN_INFO.
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (55 commits)
KVM: unbreak userspace that does not sets tss address
KVM: MMU: cleanup pte write path
KVM: MMU: introduce a common function to get no-dirty-logged slot
KVM: fix rcu usage in init_rmode_* functions
KVM: fix kvmclock regression due to missing clock update
KVM: emulator: Fix permission checking in io permission bitmap
KVM: emulator: Fix io permission checking for 64bit guest
KVM: SVM: Load %gs earlier if CONFIG_X86_32_LAZY_GS=n
KVM: x86: Remove useless regs_page pointer from kvm_lapic
KVM: improve comment on rcu use in irqfd_deassign
KVM: MMU: remove unused macros
KVM: MMU: cleanup page alloc and free
KVM: MMU: do not record gfn in kvm_mmu_pte_write
KVM: MMU: move mmu pages calculated out of mmu lock
KVM: MMU: set spte accessed bit properly
KVM: MMU: fix kvm_mmu_slot_remove_write_access dropping intermediate W bits
KVM: Start lock documentation
KVM: better readability of efer_reserved_bits
KVM: Clear async page fault hash after switching to real mode
KVM: VMX: Initialize vm86 TSS only once.
...
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (170 commits)
[SCSI] scsi_dh_rdac: Add MD36xxf into device list
[SCSI] scsi_debug: add consecutive medium errors
[SCSI] libsas: fix ata list corruption issue
[SCSI] hpsa: export resettable host attribute
[SCSI] hpsa: move device attributes to avoid forward declarations
[SCSI] scsi_debug: Logical Block Provisioning (SBC3r26)
[SCSI] sd: Logical Block Provisioning update
[SCSI] Include protection operation in SCSI command trace
[SCSI] hpsa: fix incorrect PCI IDs and add two new ones (2nd try)
[SCSI] target: Fix volume size misreporting for volumes > 2TB
[SCSI] bnx2fc: Broadcom FCoE offload driver
[SCSI] fcoe: fix broken fcoe interface reset
[SCSI] fcoe: precedence bug in fcoe_filter_frames()
[SCSI] libfcoe: Remove stale fcoe-netdev entries
[SCSI] libfcoe: Move FCOE_MTU definition from fcoe.h to libfcoe.h
[SCSI] libfc: introduce __fc_fill_fc_hdr that accepts fc_hdr as an argument
[SCSI] fcoe, libfc: initialize EM anchors list and then update npiv EMs
[SCSI] Revert "[SCSI] libfc: fix exchange being deleted when the abort itself is timed out"
[SCSI] libfc: Fixing a memory leak when destroying an interface
[SCSI] megaraid_sas: Version and Changelog update
...
Fix up trivial conflicts due to whitespace differences in
drivers/scsi/libsas/{sas_ata.c,sas_scsi_host.c}
Commit b23dd4fe42 ("ipv4: Make output route lookup return rtable
directly") resulted in leaving ret uninitialized, where it may later
be returned.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In most cases, get_user_pages and get_user_pages_fast should be used
to pin user pages in memory. But sometimes, some special flags except
FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in
following patch, KVM needs FOLL_HWPOISON. To support these users,
__get_user_pages is exported directly.
There are some symbol name conflicts in infiniband driver, fixed them too.
Signed-off-by: Huang Ying <ying.huang@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Michel Lespinasse <walken@google.com>
CC: Roland Dreier <roland@kernel.org>
CC: Ralph Campbell <infinipath@qlogic.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits)
bonding: enable netpoll without checking link status
xfrm: Refcount destination entry on xfrm_lookup
net: introduce rx_handler results and logic around that
bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag
bonding: wrap slave state work
net: get rid of multiple bond-related netdevice->priv_flags
bonding: register slave pointer for rx_handler
be2net: Bump up the version number
be2net: Copyright notice change. Update to Emulex instead of ServerEngines
e1000e: fix kconfig for crc32 dependency
netfilter ebtables: fix xt_AUDIT to work with ebtables
xen network backend driver
bonding: Improve syslog message at device creation time
bonding: Call netif_carrier_off after register_netdevice
bonding: Incorrect TX queue offset
net_sched: fix ip_tos2prio
xfrm: fix __xfrm_route_forward()
be2net: Fix UDP packet detected status in RX compl
Phonet: fix aligned-mode pipe socket buffer header reserve
netxen: support for GbE port settings
...
Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c
with the staging updates.
Now that we can get larger SG lists, we can take advantage of HCAs that
allow us to use larger FMR sizes. In many cases, we can use up to 512
entries, so start there and work our way down.
Signed-off-by: David Dillow <dillowda@ornl.gov>
This allows us to guarantee the ability to submit up to 8 MB requests
based on the current value of SCSI_MAX_SG_CHAIN_SEGMENTS. While FMR will
usually condense the requests into 8 SG entries, it is imperative that
the target support external tables in case the FMR mapping fails or is
not supported.
We add a safety valve to allow targets without the needed support to
reap the benefits of the large tables, but fail in a manner that lets
the user know that the data didn't make it to the device. The user must
add "allow_ext_sg=1" to the target parameters to indicate that the
target has the needed support.
If indirect_sg_entries is not specified in the modules options, then
the sg_tablesize for the target will default to cmd_sg_entries unless
overridden by the target options.
Signed-off-by: David Dillow <dillowda@ornl.gov>
Instead of forcing all of the S/G entries to fit in one FMR, and falling
back to indirect descriptors if that fails, allow the use of as many
FMRs as needed to map the request. This lays the groundwork for allowing
indirect descriptor tables that are larger than can fit in the command
IU, but should marginally improve performance now by reducing the number
of indirect descriptors needed.
We increase the minimum page size for the FMR pool to 4K, as larger
pages help increase the coverage of each FMR, and it is rare that the
kernel would send down a request with scattered 512 byte fragments.
This patch also move some of the target initialization code afte the
parsing of options, to keep it together with the new code that needs to
allocate memory based on the options given.
Signed-off-by: David Dillow <dillowda@ornl.gov>
Different configurations of target software allow differing max sizes of
the command IU. Allowing this to be changed per-target allows all
targets on an initiator to get an optimal setting.
We deprecate srp_sg_tablesize and replace it with cmd_sg_entries in
preparation for allowing more indirect descriptors than can fit in the
IU.
Signed-off-by: David Dillow <dillowda@ornl.gov>
It is unclear exactly how this code works around Mellanox SRP targets,
or if the problem is on the target side or in the HCA itself. In an
abundance of caution, we should always enable the workaround.
Signed-off-by: David Dillow <dillowda@ornl.gov>
rdma_destroy_id currently uses the global rdma cm 'lock' to test if an
rdma_cm_id has been bound to a device. This prevents an active
address resolution callback handler from assigning a device to the
rdma_cm_id after rdma_destroy_id checks for one.
Instead, we can replace the use of the global lock around the check to
the rdma_cm_id device pointer by setting the id state to destroying,
then flushing all active callbacks. The latter is accomplished by
acquiring and releasing the handler_mutex. Any active handler will
complete first, and any newly scheduled handlers will find the
rdma_cm_id in an invalid state.
In addition to optimizing the current locking scheme, the use of the
rdma_cm_id mutex is a more intuitive synchronization mechanism than
that of the global lock. These changes are based on feedback from
Doug Ledford <dledford@redhat.com> while he was trying to debug a
crash in the rdma cm destroy path.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This problem was reported by Moni Shoua <monis@mellanox.com> and Amir
Vadai <amirv@mellanox.com>:
When destroying a cm_id from a context of a work queue and if
the lap_state of this cm_id is IB_CM_LAP_SENT, we need to
release the reference of this id that was taken upon the send
of the LAP message. Otherwise, if the expected APR message
gets lost, it is only after a long time that the reference
will be released, while during that the work handler thread is
not available to process other things.
It turns out that we need to cancel any pending LAP messages whenever
we transition out of the IB_CM_ESTABLISH state. This occurs when
disconnecting - either sending or receiving a DREQ. It can also
happen in a corner case where we receive a REJ message after sending
an RTU, followed by a LAP. Add checks and cancel any outstanding LAP
messages in these three cases.
Canceling the LAP when sending a DREQ fixes the destroy problem
reported by Moni. When a cm_id is destroyed in the IB_CM_ESTABLISHED
state, it sends a DREQ to the remote side to notify the peer that the
connection is going away.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When processing a SIDR REQ, the ib_cm allocates a new cm_id. The
refcount of the cm_id is initialized to 1. However, cm_process_work
will decrement the refcount after invoking all callbacks. The result
is that the cm_id will end up with refcount set to 0 by the end of the
sidr req handler.
If a user tries to destroy the cm_id, the destruction will proceed,
under the incorrect assumption that no other threads are referencing
the cm_id. This can lead to a crash when the cm callback thread tries
to access the cm_id.
This problem was noticed as part of a larger investigation with kernel
crashes in the rdma_cm when running on a real time OS.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Doug Ledford and Red Hat reported a crash when running the rdma_cm on
a real-time OS. The crash has the following call trace:
cm_process_work
cma_req_handler
cma_disable_callback
rdma_create_id
kzalloc
init_completion
cma_get_net_info
cma_save_net_info
cma_any_addr
cma_zero_addr
rdma_translate_ip
rdma_copy_addr
cma_acquire_dev
rdma_addr_get_sgid
ib_find_cached_gid
cma_attach_to_dev
ucma_event_handler
kzalloc
ib_copy_ah_attr_to_user
cma_comp
[ preempted ]
cma_write
copy_from_user
ucma_destroy_id
copy_from_user
_ucma_find_context
ucma_put_ctx
ucma_free_ctx
rdma_destroy_id
cma_exch
cma_cancel_operation
rdma_node_get_transport
rt_mutex_slowunlock
bad_area_nosemaphore
oops_enter
They were able to reproduce the crash multiple times with the
following details:
Crash seems to always happen on the:
mutex_unlock(&conn_id->handler_mutex);
as conn_id looks to have been freed during this code path.
An examination of the code shows that a race exists in the request
handlers. When a new connection request is received, the rdma_cm
allocates a new connection identifier. This identifier has a single
reference count on it. If a user calls rdma_destroy_id() from another
thread after receiving a callback, rdma_destroy_id will proceed to
destroy the id and free the associated memory. However, the request
handlers may still be in the process of running. When control returns
to the request handlers, they can attempt to access the newly created
identifiers.
Fix this by holding a reference on the newly created rdma_cm_id until
the request handler is through accessing it.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set the M_Key field in SubnGet and SugnGetResp MADs based on correctly
interpreting the protection level specified in the M_KeyProtBits field.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For active and far-EQ cables use an LE2 value of 0 for improved SI.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Show whether the SQ is in onchip memory or not.
- Dump both SQ and RQ QIDs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This at least kicks the user mode applications that are watching for
device events.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set the ULP mode for initial RDMA connection setup to the proper DDP
mode. This avoids wasting some HW resources while in streaming mode.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This avoids the CIDX_INC overflow issue with T4A2 when running
kernel RDMA applications.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Unloading iw_cxgb4 can crash due to the unload code trying to use
db_drop_task, which is uninitialized. So remove this dead code.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.
This is the first step to move in that direction.
Signed-off-by: David S. Miller <davem@davemloft.net>
The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.
Signed-off-by: David S. Miller <davem@davemloft.net>
This pactch has iser export the address and port
of the endpoint.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Fix a bug which causes the driver to return incorrect MADs as a
response to Set(PortInfo) which sets the link width to 0xFF or link
speed to 0xF.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There is a double completion associated with error handling for RC QPs.
The sequence is:
- The do_rc_ack() routine fields an RNR nack and there are 0
rnr_retries configured on the QP.
- qib_error_qp() stops the pending timer
- qib_rc_send_complete() is called from sdma_complete()
- qib_rc_send_complete() starts the timer because the msb of the psn
just completed says an ack is needed.
- a bunch of flushes occur as ipoib posts WQEs to an error'ed QP
- rc_timeout() calls qib_restart_rc()
- qib_restart_rc() calls qib_send_complete() with a
IB_WC_RETRY_EXC_ERR on a wqe that has already been completed in the
past
The fix avoids starting the timer since another packet will never
arrive.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
dev->master is now tightly connected to bonding driver. This patch makes
this pointer more general and ready to be used by others.
- netdev_set_master() - bond specifics moved to new function
netdev_set_bond_master()
- introduced netif_is_bond_slave() to check if device is a bonding slave
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following panic BUG_ON occurs during qib testing:
Kernel BUG at include/linux/timer.h:82
RIP [<ffffffff881f7109>] :ib_qib:start_timer+0x73/0x89
RSP <ffffffff80425bd0>
<0>Kernel panic - not syncing: Fatal exception
<0>Dumping qib trace buffer from panic
qib_set_lid INFO: IB0:1 got a lid: 0xf8
Done dumping qib trace buffer
BUG: warning at kernel/panic.c:137/panic() (Tainted: G
The flaw is due to a missing state test when processing responses that
results in an add_timer() call when the same timer is already queued.
This code was executing in parallel with a QP destroy on another CPU
that had changed the state to reset, but the missing test caused to
response handling code to run on into the panic.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
nes_port_ibevent() should not be called when the nes RDMA device is not
registered with the RDMA core. Add missing checks of of_device_registered flag.
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA: Update missed conversion of flush_scheduled_work()
RDMA/ucma: Copy iWARP route information on queries
RDMA/amso1100: Fix compile warnings
RDMA/cxgb4: Set the correct device physical function for iWARP connections
RDMA/cxgb4: Limit MAXBURST EQ context field to 256B
IB/qib: Hold link for TX SERDES settings
mlx4_core: Add ConnectX-3 device IDs
Commit f06267104d ("RDMA: Update workqueue usage") introduced ib_wq
and removed the use of flush_scheduled_work(); however, during the merge
process one chunk was lost in ib_sa_remove_one(). Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For iWARP rdma_cm ids, the "route" information is the L2 src and
next hop addresses.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix compile warnings on 32-bit by using "0" instead of "(u64) NULL" to
assign to "c2_vq_req->reply_msg".
Signed-off-by: Ralf Thielow <ralf.thielow@googlemail.com>
[ Change from "(unsigned long) NULL" to plain old "0" as suggested by
Bart Van Assche <bvanassche@acm.org>. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
The PF passed to FW was 0, causing PCI failures in an SR-IOV environment.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
MAXBURST cannot exceed 256B for on-chip queues. With a 512B MAXBURST,
we can lock up the chip.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Hold the IB link at DISABLED until we get the correct TX settings
on mezz boards.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
is used to configure any non-standard kernel with a much larger scope than
only small devices.
This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
references to the option throughout the kernel. A new CONFIG_EMBEDDED
option is added that automatically selects CONFIG_EXPERT when enabled and
can be used in the future to isolate options that should only be
considered for embedded systems (RISC architectures, SLOB, etc).
Calling the option "EXPERT" more accurately represents its intention: only
expert users who understand the impact of the configuration changes they
are making should enable it.
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Acked-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg KH <gregkh@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Robin Holt <holt@sgi.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
RDMA: Update workqueue usage
RDMA/nes: Fix incorrect SFP+ link status detection on driver init
RDMA/nes: Fix SFP+ link down detection issue with switch port disable
RDMA/nes: Generate IB_EVENT_PORT_ERR/PORT_ACTIVE events
RDMA/nes: Fix bonding on iw_nes
IB/srp: Test only once whether iu allocation succeeded
IB/mlx4: Handle protocol field in multicast table
RDMA: Use vzalloc() to replace vmalloc()+memset(0)
mlx4_{core, ib, en}: Fix driver when sizeof (phys_addr_t) > sizeof (long)
IB/mthca: Fix driver when sizeof (phys_addr_t) > sizeof (long)
* ib_wq is added, which is used as the common workqueue for infiniband
instead of the system workqueue. All system workqueue usages
including flush_scheduled_work() callers are converted to use and
flush ib_wq.
* cancel_delayed_work() + flush_scheduled_work() converted to
cancel_delayed_work_sync().
* qib_wq is removed and ib_wq is used instead.
This is to prepare for deprecation of flush_scheduled_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During iw_nes initialization the link status for SFP+ PHY is always
detected as "up" regardless of real state (cable either connected or
disconnected). Add SFP+ PHY specific link status detection to the
iw_nes initialization procedure. Use link status recheck for
netdev_open to detect delayed state updates.
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In case of SFP+ PHY, link status check at interrupt processing can
give false results. For proper link status change detection a delayed
recheck is needed to give nes registers time to settle. Add a
periodic link status recheck scheduled at interrupt to detect
potential delayed registers state changes.
Addresses: http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2117
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Depending on link state change, IB_EVENT_PORT_ERR or
IB_EVENT_PORT_ACTIVE should be generated when handling MAC interrupts.
Plugging in a cable happens to result in series of interrupts changing
driver's link state a number of times before finally staying at link
up (e.g. link up, link down, link up, link down, ..., link up). To
prevent sending series of redundant IB_EVENT_PORT_ACTIVE and
IB_EVENT_PORT_ERR events, we use a timer to debounce them in
nes_port_ibevent().
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Enable configuring bonds on nes devices by adding missing support for
master net_device to the driver.
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Merge the two tests in srp_queuecommand() of whether information unit
allocation succeeded into one. An intended side effect of this change
is that we fix the warning:
drivers/infiniband/ulp/srp/ib_srp.c: In function 'srp_queuecommand':
drivers/infiniband/ulp/srp/ib_srp.c:1116: warning: 'req' may be used uninitialized in this function
(seen with CONFIG_CC_OPTIMIZE_FOR_SIZE=y at least with gcc 4.4.4)
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The newest device firmware stores IB vs. Ethernet protocol in two bits
in members_count field of multicast group table (0: Infiniband, 1:
Ethernet). When changing the QP members count for a multicast group,
it important not to reset this information. When calling multicast
attach first time, the protocol type should be specified. In this
patch we always set it IB, but in the future we will handle Ethernet
too. When looking for a QP, the protocol type shoud be checked too.
Signed-off-by: Aleksey Senin <alekseys@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some systems have PCI addresses that don't fit in unsigned long (eg some
32-bit PowerPC 440 systems have 36-bit bus addresses). Fix up mlx4 drivers
by using phys_addr_t where appropriate, so we don't truncate any PCI
resource addresses before ioremapping them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some systems have PCI addresses that don't fit in unsigned long (eg some
32-bit PowerPC 440 systems have 36-bit bus addresses). Fix up the driver
by using phys_addr_t where appropriate, so we don't truncate any PCI
resource addresses before ioremapping them.
Signed-off-by: John L. Burr <jlburr@cadence.com>
[ Update to apply to current driver source. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (42 commits)
IB/qib: Fix refcount leak in lkey/rkey validation
IB/qib: Improve SERDES tunning on QMH boards
IB/qib: Unnecessary delayed completions on RC connection
IB/qib: Issue pre-emptive NAKs on eager buffer overflow
IB/qib: RDMA lkey/rkey validation is inefficient for large MRs
IB/qib: Change QPN increment
IB/qib: Add fix missing from earlier patch
IB/qib: Change receive queue/QPN selection
IB/qib: Fix interrupt mitigation
IB/qib: Avoid duplicate writes to the rcv head register
IB/qib: Add a few new SERDES tunings
IB/qib: Reset packet list after freeing
IB/qib: New SERDES init routine and improvements to SI quality
IB/qib: Clear WAIT_SEND flags when setting QP to error state
IB/qib: Fix context allocation with multiple HCAs
IB/qib: Fix multi-Florida HCA host panic on reboot
IB/qib: Handle transitions from ACTIVE_DEFERRED to ACTIVE better
IB/qib: UD send with immediate receive completion has wrong size
IB/qib: Set port physical state even if other fields are invalid
IB/qib: Generate completion callback on errors
...
The mr optimization introduced a reference count leak on an exception
test. The lock/refcount manipulation is moved down and the problematic
exception test now calls bail to insure that the lock is released.
Additional fixes as suggested by Ralph Campbell <ralph.campbell@qlogic.org>:
- reduce lock scope of dma regions
- use explicit values on returns vs. automatic ret value
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Improve the QMH SERDES tunning on initial driver load by having the
driver go through a link state change.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently on receipt of a response message (ACKs, RDMA Response,
Atomic Responses etc.) if the SDMA completion counter is not advanced
the driver delays the completion of the WQE. In most cases this is
overly pessimistic as the response (ACK) to a previously transmitted
send implies that the send is complete. Ensure that SDMA queue is
progressed appropriately before determining if a send has delayed
completions.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Under congestion resulting in eager buffer overflow attempt to send
pre-emptive NAKs if header queue entries with TID errors are generated
and a valid header is present. This prevents long timeouts and flow
restarts if a trailing set of packets are dropped due to eager
overflows. Pre-emptive NAKs are currently only supported for RDMA
writes.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current code loops during rkey/lkey validiation to isolate the MR
for the RDMA, which is expensive when the current operation is inside
a very large memory region.
This fix optimizes rkey/lkey validation routines for user memory
regions and fast memory regions. The MR entry can be isolated by
shifts/mods instead of looping. The existing loop is preserved for
phys memory regions for now.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Changing from +1 to +2 allows for better QP distribution across
receive contexts.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The upstream code was missing part of a receive/error race fix from
the internal tree. Add the missing part, which makes future merges
possible.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The basic idea is that on SusieQ, the difficult part of mapping QPN to
context is handled by the mapping registers so the generic QPN
allocation doesn't need to worry about chip specifics. For Monty and
Linda, there is no mapping table so the qpt->mask (same as
dd->qpn_mask), is used to see if the QPN to context falls within
[zero..dd->n_krcv_queues).
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For SusieQ we need to write to the interrupt timer register before
updating the header queue head with interrupt count. This is to
ensure that the timer is enabled properly and a receive available
interrupt is delivered. Otherwise this interrupt can be lost if the
receiver header/eager queues are full before the timer is enabled.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Avoid duplicate writes to the head register as this can lead to lost
interrupts if the context goes full before the second write is done.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add new SERDES tuning to aid manufacturing.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Reset the list pointers after freeing the SDMA packet list. This is
done to any potential double-free cases.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Implement new SERDES initialization routine and improvements to signal
integrity -- disable LE1 adaptation, disable LOS after link-up, set
better SERDES parameters.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If these flags are set when the QP is transitioned to the error state,
it will wait until the flags are cleared, which may never happen if
the error transition is due to a link going down.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The driver was incorrectly choosing HCAs on which to allocate new user
contexts based on overall count of usable ports regardless whether the
usable port was on the currently selected HCA.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add check when setting configured contexts that the value does not
exceed the number of contexts allocated for the card. If the value
exceeds the already allocated count, set it to what is already
allocated.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the link transitions from ACTIVE_DEFERRED to ACTIVE, the driver
only sees the ACTIVE state. With this change, it will check whether
the state was already ACTIVE and if so, it will not generated IB
events and will not clear symbol error counts.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The code to generate receive completion entries for UD send with
immediate contains the wrong payload length. This is because when the
code to compute the payload size was moved, the value of hdrsize
didn't get moved too. The fix is to update tlen directly.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The IBTA vol. 1 release 1.2.1 spec. says:
C14-24.2.1: If PortInfo:Portstate=Down, then a SubnSet(PortInfo) shall
make any changes it specifies to PortInfo:PortPhysicalState; any other
result is vendor-dependent.
The patch changes the error handling so that the reply says there are
invalid fields but still attempts to set fields that are in range
including PortInfo:PortPhysicalState.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
According to IBTA vol. 1, C11-30.1.1, a notification callback is
invoked if the CQ is armed for the next solicited completion event or
an error completion. The error case wasn't being generated correctly.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support to recognize another board variation named QME7362.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The receive header queue sizes need to modified for performance
tuning. Three module parameters are added to support this.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_create_send_mad() can return ERR_PTR(-ENOMEM) here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_create_send_mad() can return ERR_PTR(-ENOMEM) here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib_free_cq_buf() should not be called under spin_lock_irq() since
it calls dma_free_coherent(), which needs irqs enabled. Fix this by
deferring the free to outside the locked region.
This was found due to the
WARN_ON(irqs_disabled());
in swiotlb_free_coherent().
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As a first step in moving from LRO to GRO, revert commit af40da894e
("IPoIB: add LRO support"). Also eliminate the ethtool set_flags
callback which isn't needed anymore. Finally, we need to include
<linux/sched.h> directly to get the declaration of restart_syscall()
(which used to be included implicitly through <linux/inet_lro.h>).
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Re-initializing the wait object in rdma_init()/rdma_fini() causes a
timing window which can lead to a deadlock during close. Once this
deadlock hits, all RDMA activity over the T4 device will be stuck.
There's no need to re-init the wait object, so remove it.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This removes unused code found by running 'make namespacecheck';
compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Put the variables accessed together in the hot-path into common
cachelines, and separate them by RW vs RO to avoid false dirtying.
We keep a local copy of the lkey and rkey in the target to avoid
traversing pointers (and associated cache lines) to find them.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Dillow <dillowda@ornl.gov>
We don't need protection against the SCSI stack, so use our own lock to
allow parallel progress on separate CPUs.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[ broken out and small cleanups by David Dillow ]
Signed-off-by: David Dillow <dillowda@ornl.gov>
We only need the lock to cover list and credit manipulations, so push
those into srp_remove_req() and update the call chains.
We reorder the request removal and command completion in
srp_process_rsp() to avoid the SCSI mid-layer sending another command
before we've released our request and added any credits returned by the
target. This prevents us from returning HOST_BUSY unneccesarily.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[ broken out, small cleanups, and modified to avoid potential extraneous
HOST_BUSY returns by David Dillow ]
Signed-off-by: David Dillow <dillowda@ornl.gov>
We only need locks to protect our lists and number of credits available.
By pre-consuming the credit for the request, we can reduce our lock
coverage to just those areas. If we don't actually send the request,
we'll need to put the credit back into the pool.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[ broken out and small cleanups by David Dillow ]
Signed-off-by: David Dillow <dillowda@ornl.gov>
We use req->scmnd != NULL to indicate an active request, so there's no
need to keep a separate list for them. We can afford the array iteration
during error handling, and dropping it gives us one less item that needs
lock protection.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[ broken out and small cleanups by David Dillow ]
Signed-off-by: David Dillow <dillowda@ornl.gov>
dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Only one CPU at a time will own an RX IU, so using the address of the IU
as the work request cookie allows us to avoid taking a lock. We can
similarly prepare the TX path for lockless posting by moving the free TX
IUs to a list. This also removes the requirement that the queue sizes be
a power of 2.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[ broken out, small cleanups, and modified to avoid needing an extra field
in the IU by David Dillow]
Signed-off-by: David Dillow <dillowda@ornl.gov>
We can only have one task management comment outstanding, so move the
completion and status to the target port. This allows us to handle
resets of a LUN without a corresponding request having been sent.
Meanwhile, we don't need to play games with host_scribble, just use it
as the pointer it is.
This fixes a crash when we issue a bus reset using sg_reset.
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=13893
Reported-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Dillow <dillowda@ornl.gov>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
In ib_uverbs_poll_cq() code there is a potential integer overflow if
userspace passes in a large cmd.ne. The calls to kmalloc() would
allocate smaller buffers than intended, leading to memory corruption.
There iss also an information leak if resp wasn't all used.
Unprivileged userspace may call this function, although only if an
RDMA device that uses this function is present.
Fix this by copying CQ entries one at a time, which avoids the
allocation entirely, and also by moving this copying into a function
that makes sure to initialize all memory copied to userspace.
Special thanks to Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
for his help and advice.
Cc: <stable@kernel.org>
Signed-off-by: Dan Carpenter <error27@gmail.com>
[ Monkey around with things a bit to avoid bad code generation by gcc
when designated initializers are used. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB: Fix information leak in marshalling code
IB/pack: Remove some unused code added by the IBoE patches
IB/mlx4: Fix IBoE link state
IB/mlx4: Fix IBoE reported link rate
mlx4_core: Workaround firmware bug in query dev cap
IB/mlx4: Fix memory ordering of VLAN insertion control bits
MAINTAINERS: Update NetEffect entry
ib_ucm_init_qp_attr() and ucma_init_qp_attr() pass struct ib_uverbs_qp_attr
with reserved, qp_state, {ah_attr,alt_ah_attr}{reserved,->grh.reserved}
fields uninitialized to copy_to_user(). This leads to leaking of
contents of kernel stack memory to userspace.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use netif_running() and netif_carrier_ok() to report link state,
exactly as is done to report Ethernet link state in sysfs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The link rate is the product of the link speed in the link width. For
Etherent ports the rate is 10G, so we use 1 for the width and 4 for
speed to get the correct rate.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We must fully update the control segment before marking it as valid,
so that hardware doesn't start executing it before we're ready.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Move VLAN control bit setting to before wmb(). - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
dev_base_lock is the legacy way to lock the device list, and is planned
to disappear. (writers hold RTNL, readers hold RCU lock)
Convert rdma_translate_ip() and update_ipv6_gids() to RCU locking.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the mid-layer's ->queuecommand() invocation from being locked
with the host lock to being unlocked to facilitate speeding up the
critical path for drivers who don't need this lock taken anyway.
The patch below presents a simple SCSI host lock push-down as an
equivalent transformation. No locking or other behavior should change
with this patch. All existing bugs and locking orders are preserved.
Additionally, add one parameter to queuecommand,
struct Scsi_Host *
and remove one parameter from queuecommand,
void (*done)(struct scsi_cmnd *)
Scsi_Host* is a convenient pointer that most host drivers need anyway,
and 'done' is redundant to struct scsi_cmnd->scsi_done.
Minimal code disturbance was attempted with this change. Most drivers
needed only two one-line modifications for their host lock push-down.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Acked-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only include the header linux/mutex.h once inside
drivers/infiniband/hw/cxgb4/iw_cxgb4.h
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
It seems idev field in struct rtable has no special purpose, but adding
extra atomic ops.
We hold refcounts on the device itself (using percpu data, so pretty
cheap in current kernel).
infiniband case is solved using dst.dev instead of idev->dev
Removal of this field means routing without route cache is now using
shared data, percpu data, and only potential contention is a pair of
atomic ops on struct neighbour per forwarded packet.
About 5% speedup on routing test.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
split invalidate_inodes()
fs: skip I_FREEING inodes in writeback_sb_inodes
fs: fold invalidate_list into invalidate_inodes
fs: do not drop inode_lock in dispose_list
fs: inode split IO and LRU lists
fs: switch bdev inode bdi's correctly
fs: fix buffer invalidation in invalidate_list
fsnotify: use dget_parent
smbfs: use dget_parent
exportfs: use dget_parent
fs: use RCU read side protection in d_validate
fs: clean up dentry lru modification
fs: split __shrink_dcache_sb
fs: improve DCACHE_REFERENCED usage
fs: use percpu counter for nr_dentry and nr_dentry_unused
fs: simplify __d_free
fs: take dcache_lock inside __d_path
fs: do not assign default i_ino in new_inode
fs: introduce a per-cpu last_ino allocator
new helper: ihold()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (63 commits)
IB/qib: clean up properly if pci_set_consistent_dma_mask() fails
IB/qib: Allow driver to load if PCIe AER fails
IB/qib: Fix uninitialized pointer if CONFIG_PCI_MSI not set
IB/qib: Fix extra log level in qib_early_err()
RDMA/cxgb4: Remove unnecessary KERN_<level> use
RDMA/cxgb3: Remove unnecessary KERN_<level> use
IB/core: Add link layer type information to sysfs
IB/mlx4: Add VLAN support for IBoE
IB/core: Add VLAN support for IBoE
IB/mlx4: Add support for IBoE
mlx4_en: Change multicast promiscuous mode to support IBoE
mlx4_core: Update data structures and constants for IBoE
mlx4_core: Allow protocol drivers to find corresponding interfaces
IB/uverbs: Return link layer type to userspace for query port operation
IB/srp: Sync buffer before posting send
IB/srp: Use list_first_entry()
IB/srp: Reduce number of BUSY conditions
IB/srp: Eliminate two forward declarations
IB/mlx4: Signal node desc changes to SM by using FW to generate trap 144
IB: Replace EXTRA_CFLAGS with ccflags-y
...
Use the new {max,min}3 macros to save some cycles and bytes on the stack.
This patch substitutes trivial nested macros with their counterpart.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Joe Perches <joe@perches.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up properly if pci_set_consistent_dma_mask() fails.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some PCIe root complex chip sets don't support advanced error reporting.
Allow the driver to load OK if pci_enable_pcie_error_reporting() fails.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If CONFIG_PCI_MSI is not set, and a QLE7140 is present, the pointer
"dd" is uninitialized.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Noticed this odd looking thing in dmesg:
ib_qib 0000:02:00.0: <3>ib_qib: Unable to enable pcie error reporting: -5
which is due to a bad use of dev_info.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since an IB transport port may use either IB or Ethernet as its link layer,
add the file /sys/class/infiniband/<device>/ports/<port_num>/link_layer to
show the link layer for the port.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows IBoE traffic to be encapsulated in 802.1Q tagged
VLAN frames. The VLAN tag is encoded in the GID and derived from it
by a simple computation.
The netdev notifier callback is modified to catch VLAN device
addition/removal and the port's GID table is updated to reflect the
change, so that for each netdevice there is an entry in the GID table.
When the port's GID table is exhausted, GID entries will not be added.
Only children of the main interfaces can add to the GID table; if a
VLAN interface is added on another VLAN interface (e.g. "vconfig add
eth2.6 8"), then that interfaces will not add an entry to the GID
table.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add 802.1q VLAN support to IBoE. The VLAN tag is encoded within the
GID derived from a link local address in the following way:
GID[11] GID[12] contain the VLAN ID when the GID contains a VLAN.
The 3 bits user priority field of the packets are identical to the 3
bits of the SL.
In case of rdma_cm apps, the TOS field is used to generate the SL
field by doing a shift right of 5 bits effectively taking to 3 MS bits
of the TOS field.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IBoE to mlx4_ib. The bulk of the code is handling the
new address vector fields; mlx4 needs the MAC address of a remote node
to include it in a WQE (for datagrams) or in the QP context (for
connected QPs). Address resolution is done by assuming all unicast
GIDs are either link-local IPv6 addresses.
Multicast group attach/detach needs to update the NIC's multicast
filters; but since attaching a QP to a multicast group can be done
before the QP is bound to a port, for IBoE we need to keep track of
all multicast groups that a QP is attached too before it transitions
from INIT to RTR (since it does not have a port in the INIT state).
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Many things cleaned up and otherwise monkeyed with; hope I didn't
introduce too many bugs. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
srp_send_tsk_mgmt() was missing the proper DMA sync calls before posting
the buffer to the device.
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use the list_first_entry() macro in ib_srp instead of open-coding the equivalent,
which makes the source code slightly more descriptive. The list_first_entry()
macro itself was introduced in kernel 2.6.22.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As proposed by the SRP (draft) standard, ib_srp reserves one ring
element for SRP_TSK_MGMT requests. This patch makes sure that the SCSI
mid-layer never tries to queue more than (SRP request limit) - 1 SCSI
commands to ib_srp. This improves performance for targets whose request
limit is less than or equal to SRP_NORMAL_REQ_SQ_SIZE by reducing the
number of BUSY responses reported by ib_srp to the SCSI mid-layer.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The Node Description cannot be changed via MADs (it is read-only).
Until now, it was changed in the driver via sysfs, and the new Node
Description was simply inserted by the driver into MAD responses
(replacing the description returned by FW).
System startup scripts use the sysfs interface to change the node
description at driver startup to show the hostname, etc. However, this
has a race condition: the SM could discover the original FW node
description rather than the system-specific description if it queried the
port before the startup scripts finish running.
For mlx4, we fix this with a new FW command (SET_NODE) that allows
passing the new node description to FW. When this command is invoked,
FW sends a trap 144 to the SM. When it gets this trap, the SM can
query the node to obtain the new node description -- thus eliminating
the effects of the race.
This patch simply calls SET_NODE command when a new node description
is entered via sysfs (thus causing trap 144 to be issued by the FW).
We ignore all failures of the SET_NODE command (including those caused
by using a device FW that predates the SET_NODE command), since in
that case things work just as before.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For iWARP connections, the connect request is carried in a TCP payload
on an already established TCP connection. So if the ucma's backlog is
full, the connection request is transmitted and acked at the TCP level
by the time the connect request gets dropped in the ucma. The end
result is the connection gets rejected by the iWARP provider.
Further, a 32 node 256NP OpenMPI job will generate > 128 connect
requests on some ranks.
This patch increases the default max backlog to 1024, and adds a
sysctl variable so the backlog can be adjusted at run time.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use the net device's dev_id field to encode the port number of the pci
device. This can be used to to associate a net device with the pci
device's port. The encoding is: dev_id = port - 1.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
This patch adds support for SRP_CRED_REQ to avoid a lockup by targets
that use that mechanism to return credits to the initiator. This
prevents a lockup observed in the field where we would never add the
credits from the SRP_CRED_REQ to our current count, and would therefore
never send another command to the target.
Minimal support for SRP_AER_REQ is also added, as these messages can
also be used to convey additional credits to the initiator.
Based upon extensive debugging and code by Bart Van Assche and a bug
report by Chris Worley.
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The transmit ring in ib_srp (srp_target.tx_ring) is currently only used
for allocating requests sent by the initiator to the target. This patch
prepares using that ring for allocation of both requests and responses.
Also, this patch differentiates the uses of SRP_SQ_SIZE, increases the
size of the IB send completion queue by one element and reserves one
transmit ring slot for SRP_TSK_MGMT requests.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
See table 35 in IBA - the header order for RDMA_WRITE_ONLY_WITH_IMMEDIATE
and SEND_LAST_WITH_IMMEDIATE is different: the RDMA_WRITE_ONLY has
a RETH header before the immediate data, so we need a different code path
to extract the immediate data.
I tested this with a userspace app that does RDMA_WRITE with immediate
on a QLE7140.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The flushing of work requests for user QPs is implemented entirely in
the user mode library. The only kernel interaction is to mark the
user QP object indicating it is in error when the QP exits RTS. When
the user QP operations are called by the application (eg: post_send,
post_recv), the QP in error bit is checked and if set, the library
flushes the QP. If, however, the application is not doing IO, but
rather just polling the CQ, it will never get flushed work requests.
This breaks some classes of applications.
This patch adds logic to mark user CQs in error when a QP that is bound
to the CQ is marked in error. The library poll code can then notice
the CQ is in error and flush all the in error QPs bound to that CQ.
Design:
- add 1 extra CQE entry to the CQ memory that will be used to indicate
in error status.
- return the desired CQ memory size that should be mapped by the library
- bump the ABI since the create_cq uverbs response changes.
- detect older libraries and reduce the mmap size accordingly.
(The ABI bump doesn't break old libraries, since they didn't check
the ABI field anyway)
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove the local service t4_pktgl_to_skb() and use cxgb4_pktgl_to_skb()
exported by cxgb4.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
The patch below updates broken web addresses in the kernel
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Dimitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Ben Pfaff <blp@cs.stanford.edu>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Reviewed-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Add support for packing IBoE packet headers.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Clean up and fix ib_ud_header_init() a bit. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IBoE device binding and IP --> GID resolution. Path
resolving and multicast joining are implemented within cma.c by
filling in the responses and running callbacks in the CMA work queue.
IP --> GID resolution always yields IPv6 link local addresses; remote
GIDs are derived from the destination MAC address of the remote port.
Multicast GIDs are always mapped to multicast MACs as is done in IPv6.
(IPv4 multicast is enabled by translating IPv4 multicast addresses to
IPv6 multicast as described in
<http://www.mail-archive.com/ipng@sunroof.eng.sun.com/msg02134.html>.)
Some helper functions are added to ib_addr.h.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since IBoE is using Ethernet as its link layer, there is no central
management entity so there is need for QP0. QP1 is still needed since
it handles communications between CM agents. This patch will skip QP0
and create only QP1 for IBoE ports.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
IPoIB is IP-over-Infiniband link layer. In the case of IBoE, the link
layer is Ethernet and IP can work directly over Ethernet, so disable
IPoIB for non-IB_LINK_LAYER_INFINIBAND ports.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.
There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet)
We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.
On x86, dev_hold(dev) code :
before
lock incl 0x280(%ebx)
after:
movl 0x260(%ebx),%eax
incl fs:(%eax)
Stress bench :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)
Before:
real 1m1.662s
user 0m14.373s
sys 12m55.960s
After:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A process can get stuck in an uninterruptible wait in the
kernel while destroying a cm_id when iw_cm_connect() fails:
For example, When creation of a PD fails but the user continues with
an attempt to connect to the server without checking the return value,
in iw_cm_connect() a NULL qp is found so the call fails. However the
IWCM_F_CONNECT_WAIT bit is not cleared. destroy_cm_id() then waits
forever for IWCM_F_CONNECT_WAIT to be cleared.
The same problem exists on the passive side with the accept call.
Fix this by clearing the bit and waking up any waiters in the
appropriate spots.
Signed-off-by: Animesh Trivedi <atr@zurich.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We can replace our equivalent open-coded version.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the limit on the size of max fast registration WRs that can be
posted to match hardware capabilities.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This lets the bonding driver to detect when an interface goes down.
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
With commit cd6860eb ("RDMA/nes: Fix hangs on ifdown") we no longer
remove nes interfaces on ifdown. On nes_query_port(), add an
additional check of the netdev queue and report IB_PORT_DOWN if the
queue is not running.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
the eHCA driver registers a MR for all of kernel memory, but makes the
assumption that valid memory exists at KERNELBASE. This assumption
may not be true in the case of a relocatable kernel, so use KERNELBASE
+ PHYSICAL_START to get the true beginning of usable kernel memory.
cc: Joachim Fenkes <fenkes@de.ibm.com>
cc: Christoph Raisch <raisch@de.ibm.com>
cc: Hoan-Ham Hguyen <hnguyen@de.ibm.com>
Signed-off-by: Sonny Rao <sonnyrao@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Get rid of init_MUTEX[_LOCKED]() and use sema_init() instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Just a small cleanup. The "passive_state" variable isn't used any
more after commit dae58728dc ("RDMA/nes: Fix double CLOSE event
indication crash")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
IGMP processing is broken because the IPOIB does not set the
skb->pkt_type the right way for multicast traffic. All incoming
packets are set to PACKET_HOST which means that igmp_recv() will
ignore the IGMP broadcasts/multicasts.
This in turn means that the IGMP timers are firing and are sending
information about multicast subscriptions unnecessarily. In a large
private network this can cause traffic spikes.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Remove dsgl support - doesn't work in T4.
- Wrap the immediate PBL as needed when building it in the wr.
- Adjust max pbl depth allowed based on ulptx alignment requirements.
- Bump the slots per SQ to 5 to allow up to 128MB fast registers.
- Advertise fastreg support by default.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the connection setup/teardown paths to the workq thread removing
spin lock/irq disable requirements for these paths. This allows calls
down to the LLD for EP and QP state transition actions to be atomic
with respect to processing CPL messages coming up from the HW.
Namely, calls to rdma_init() and rdma_fini() can now be called with
the mutex held avoiding many race conditions with the abort path.
The QP spinlock is still used but only to manipulate the qp state. This
allows the fastpaths, poll, post_send, and pos_recv, to run in the
irq context.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
T4 support on-chip SQs to reduce latency. This patch adds support for
this in iw_cxgb4:
- Manage ocqp memory like other adapter mem resources.
- Allocate user mode SQs from ocqp mem if available.
- Map ocqp mem to user process using write combining.
- Map PCIE_MA_SYNC reg to user process.
Bump uverbs ABI.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add "stags" debugfs file. This is useful for examining the TPTE and
PBL entries in adapter memory. It allows scripts to dump just the
active entries.
Also clean up the "qps" file handlers and shared common code.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This helps debug cases where HW resources are depleted.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
T4 FW sends up CPL_RDMA_TERMINATE to indicate a peer TERM. This
triggers the QP moving to TERMINATE state.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
T4 incorrectly inserts TERM CQEs into the CQ. Silently ignore them.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The cxgb4_*_send() functions return NET_XMIT_ values, which are
positive integers or negative errno values. So don't treat positive
return values as an error.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The HW design requires zeroing any pad in SGLs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In c4iw_modify_qp() error path, only use qhp->ep if ep is not already set.
Otherwise qhp->ep can be NULL and we crash.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix:
drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_alloc_fast_reg_page_list':
drivers/infiniband/hw/nes/nes_verbs.c:477: warning: cast to pointer from integer of different size
drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_post_send':
drivers/infiniband/hw/nes/nes_verbs.c:3486: warning: cast to pointer from integer of different size
drivers/infiniband/hw/nes/nes_verbs.c:3486: warning: cast to pointer from integer of different size
by printing u64 quantities by casting to unsigned long and long and
using %llx, rather than casting to void* and using %p.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows ports to have different link layers:
IB_LINK_LAYER_INFINIBAND or IB_LINK_LAYER_ETHERNET. This is required
for adding IBoE (InfiniBand-over-Ethernet, aka RoCE) support. For
devices that do not provide an implementation for querying the link
layer property of a port, we return a default value based on the
transport: RMA_TRANSPORT_IB nodes will return IB_LINK_LAYER_INFINIBAND
and RDMA_TRANSPORT_IWARP nodes will return IB_LINK_LAYER_ETHERNET.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix:
drivers/infiniband/hw/cxgb4/qp.c: In function ‘create_qp’:
drivers/infiniband/hw/cxgb4/qp.c:147: warning: cast from pointer to integer of different size
drivers/infiniband/hw/cxgb4/qp.c: In function ‘rdma_fini’:
drivers/infiniband/hw/cxgb4/qp.c:988: warning: cast from pointer to integer of different size
drivers/infiniband/hw/cxgb4/qp.c: In function ‘rdma_init’:
drivers/infiniband/hw/cxgb4/qp.c:1063: warning: cast from pointer to integer of different size
drivers/infiniband/hw/cxgb4/mem.c: In function ‘write_adapter_mem’:
drivers/infiniband/hw/cxgb4/mem.c:74: warning: cast from pointer to integer of different size
drivers/infiniband/hw/cxgb4/cq.c: In function ‘destroy_cq’:
drivers/infiniband/hw/cxgb4/cq.c:58: warning: cast from pointer to integer of different size
drivers/infiniband/hw/cxgb4/cq.c: In function ‘create_cq’:
drivers/infiniband/hw/cxgb4/cq.c:135: warning: cast from pointer to integer of different size
drivers/infiniband/hw/cxgb4/cm.c: In function ‘fw6_msg’:
drivers/infiniband/hw/cxgb4/cm.c:2326: warning: cast to pointer from integer of different size
by casting pointers to unsigned long instead of u64.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The HW by default has RX coalescing on. For iWARP connections, this
causes a 100ms delay in connection establishement due to the ingress
MPA Start message being stalled in HW. So explicitly turn RX
coalescing off when setting up iWARP connections.
This was causing very bad performance for NP64 gather operations using
Open MPI, due to the way it sets up connections on larger jobs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Changing state to CLOSING when FIN is received causes A0 cards to
hang. Fix this by checking for A0 cards in FIN handling.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the driver receives an AE for FIN received, it closes the
connection without changing the state of the connection in the
hardware to closing. By changing the state to closing, hardware will
do a normal close sequence.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During a stress testing in a large cluster, multiple close event are
detected and BUG() is hit in the iWARP core. The cause is that the
active node gave up while waiting for an MPA response from the peer
and tried to close the connection by sending RST. The passive node
driver receives the RST but is waiting for MPA response from the user.
When the MPA accept is received, the driver offloads the connection
and sends a CLOSE event. The driver gets an AE indicating RESET
received and also sends a CLOSE event, hitting a BUG().
Fix this by correcting RESET handling and sending CLOSE events.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Setting TX pause param writes to the wrong register location causing
the adapter to hang. Correct the define used to write the reigster.
Addresses: https://bugs.openfabrics.org/show_bug.cgi?id=2116
Reported-by: Shiri Franchi <shirif@voltaire.com>
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The max depth supported by T3 is 64K entries. This fixes a bug
introduced in commit 9918b28d ("RDMA/cxgb3: Increase the max CQ
depth") that causes stalls and possibly crashes in large MPI clusters.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (42 commits)
IB/qib: Add missing <linux/slab.h> include
IB/ehca: Drop unnecessary NULL test
RDMA/nes: Fix confusing if statement indentation
IB/ehca: Init irq tasklet before irq can happen
RDMA/nes: Fix misindented code
RDMA/nes: Fix showing wqm_quanta
RDMA/nes: Get rid of "set but not used" variables
RDMA/nes: Read firmware version from correct place
IB/srp: Export req_lim via sysfs
IB/srp: Make receive buffer handling more robust
IB/srp: Use print_hex_dump()
IB: Rename RAW_ETY to RAW_ETHERTYPE
RDMA/nes: Fix two sparse warnings
RDMA/cxgb3: Make needlessly global iwch_l2t_send() static
IB/iser: Make needlessly global iser_alloc_rx_descriptors() static
RDMA/cxgb4: Add timeouts when waiting for FW responses
IB/qib: Fix race between qib_error_qp() and receive packet processing
IB/qib: Limit the number of packets processed per interrupt
IB/qib: Allow writes to the diag_counters to be able to clear them
IB/qib: Set cfgctxts to number of CPUs by default
...
of_device is just an alias for platform_device, so remove it entirely. Also
replace to_of_device() with to_platform_device() and update comment blocks.
This patch was initially generated from the following semantic patch, and then
edited by hand to pick up the bits that coccinelle didn't catch.
@@
@@
-struct of_device
+struct platform_device
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Reviewed-by: David S. Miller <davem@davemloft.net>
Fix build failure on sparc64 which is missing the include of
<linux/slab.h> via <asm/pci.h> that x86, powerpc, ia64, etc. have.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
list_for_each_entry binds its first argument to a non-null value, and thus
any null test on the value of that argument is superfluous.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
iterator I;
expression x;
statement S,S1,S2;
@@
I(x,...) { <...
- if (x == NULL && ...) S
...> }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix confusing indentation that makes a statement look as if it's part of
an if statement when in fact it isn't.
Reported-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Initialize tasklet before interrupts are requested to prevent
scheduling of an uninitialized tasklet.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export req_lim via sysfs for debugging.
Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com>
Acked-by: David Dillow <dave@thedillows.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits)
phy/marvell: add 88ec048 support
igb: Program MDICNFG register prior to PHY init
e1000e: correct MAC-PHY interconnect register offset for 82579
hso: Add new product ID
can: Add driver for esd CAN-USB/2 device
l2tp: fix export of header file for userspace
can-raw: Fix skb_orphan_try handling
Revert "net: remove zap_completion_queue"
net: cleanup inclusion
phy/marvell: add 88e1121 interface mode support
u32: negative offset fix
net: Fix a typo from "dev" to "ndev"
igb: Use irq_synchronize per vector when using MSI-X
ixgbevf: fix null pointer dereference due to filter being set for VLAN 0
e1000e: Fix irq_synchronize in MSI-X case
e1000e: register pm_qos request on hardware activation
ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice
net: Add getsockopt support for TCP thin-streams
cxgb4: update driver version
cxgb4: add new PCI IDs
...
Manually fix up conflicts in:
- drivers/net/e1000e/netdev.c: due to pm_qos registration
infrastructure changes
- drivers/net/phy/marvell.c: conflict between adding 88ec048 support
and cleaning up the IDs
- drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req
conflict (registration change vs marking it static)
The current strategy in ib_srp for posting receive buffers is:
* Post one buffer after channel establishment.
* Post one buffer before sending an SRP_CMD or SRP_TSK_MGMT to the target.
As a result, only the first non-SRP_RSP information unit from the
target will be processed. If that first information unit is an
SRP_T_LOGOUT, it will be processed. On the other hand, if the
initiator receives an SRP_CRED_REQ or SRP_AER_REQ before it receives a
SRP_T_LOGOUT, the SRP_T_LOGOUT won't be processed.
We can fix this inconsistency by changing the strategy for posting
receive buffers to:
* Post all receive buffers after channel establishment.
* After a receive buffer has been consumed and processed, post it again.
A side effect is that the ib_post_recv() call is moved out of the SCSI
command processing path. Since __srp_post_recv() is not called
directly any more, get rid of it and move the code directly into
srp_post_recv(). Also, move srp_post_recv() up in the file to avoid a
forward declaration.
Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com>
Acked-by: David Dillow <dave@thedillows.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace an open-coded dump of the receive buffer with a call to
print_hex_dump().
Signed-off-by: Bart Van Assche <bart.vanassche@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change abbreviated IB_QPT_RAW_ETY to IB_QPT_RAW_ETHERTYPE to make
the special QP type easier to understand.
cf http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg04530.html
Signed-off-by: Aleksey Senin <alekseys@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Simple changes to fix warnings:
CHECK drivers/infiniband/hw/nes/nes_verbs.c
nes_verbs.c:1944:45: warning: Using plain integer as NULL pointer
nes_verbs.c:1944:48: warning: Using plain integer as NULL pointer
CHECK drivers/infiniband/hw/nes/nes_cm.c
nes_cm.c:2645:43: warning: mixing different enum types
nes_cm.c:2645:43: int enum iw_cm_event_type versus
nes_cm.c:2645:43: int enum iw_cm_event_status
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Acked-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't hang a host thread if the FW stops responding.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When transitioning a QP to the error state, in progress RWQEs need to
be marked complete. This also involves releasing the reference count
to the memory regions referenced in the SGEs. The locking in the
receive packet processing wasn't sufficient to prevent qib_error_qp()
from modifying the r_sge state at the same time, thus leading to
kernel panics.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't processes too many packets without allowing other IRQ functions
a chance to run. Otherwise, there is a chance of getting a "soft
lockup" messages and poor application response times.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Up to now, we have set the number of available user contexts based on
the number of hardware contexts which is set according to the number
of available CPUs. This was fine since most CPUs had a power of two
number of cores and the chip supported 4, 8, or 16 user contexts. Now
that some systems have 12 cores, the default isn't optimal and should
be set to 12 even though 16 hardware contexts need to be enabled.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Endpoint timer manipulation needs to be done inside the lock. Otherwise
we can get into a situation where a timer is stopped before it is started,
which hits the WARN_ON() in stop_ep_timer().
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There is only one control txq per tx channel. So use the port number
as the queue index when sending.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There exists a race condition where the app disconnects, which
initiates an orderly close (via rdma_fini()), concurrently with an
ingress abort condition, which initiates an abortive close operation.
Since rdma_fini() must be called without IRQs disabled, the fini can
be called after the QP has been transitioned to ERROR. This is ok,
but we need to protect against qp->ep getting NULLed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
NULL pointer dereferences in ib_cm_init_qp_attr() were seen by some
users. From a crash dump, I determined that we died in
cm_init_qp_rts_attr() (it's inlined, so it doesn't show up in the
traceback) on the line labeled below:
static int cm_init_qp_rts_attr(struct cm_id_private *cm_id_priv,
struct ib_qp_attr *qp_attr,
int *qp_attr_mask)
{
........
if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
.....
} else {
*qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; <-die
The problem is that the rdma_cm can call ib_send_cm_mra() after a
connection has been established. The ib_cm incorrectly assumes that
the MRA is in response to a LAP (load alternate path) message, even
though no LAP message has been received. The ib_cm needs to check the
lap_state before sending an MRA if the cm_id state is established.
Reported-by: Arthur Kepner <akepner@sgi.com>
Reported-by: Josh England <jjengla@gmail.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When ib_unregister_device() is called from netdev stop during ifdown,
it sometimes hangs. Changes made to indicate port_err to ib_dispatch_event()
during netdev stop and port_active during netdev open. The
ib_unregister_device() is only called during remove of the module.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Read and print eeprom version and save it off for later use.
Also delete a tab.
Signed-off-by: Chien Tung <chien.tin.tung@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch converts pci_table entries, where .subvendor=PCI_ANY_ID and
.subdevice=PCI_ANY_ID, .class=0 and .class_mask=0, to use the
PCI_VDEVICE macro, and thus improves readability.
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When ioremap() fails with a NULL pointer, catch the error and pass it
to the caller of create_qp() or create_cq() instead of trying to
dereference the NULL pointer later on.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We used to allow only full specification, or using all contexts within
an HCA before moving to the next HCA. We now allow an additional
method -- round-robining through HCAs -- and make that the default.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Turn off IB latency mode. This improves link quality for slower
process chips.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When the default llseek action gets changed to no_llseek, all file
systems relying on the current behaviour need to set explicit .llseek
operations.
In case of qib_fs, we want the files to be seekable, so
generic_file_llseek fits best.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
T4 EQ entries are in multiples of 64 bytes. Currently the RDMA SQ and
RQ use fixed sized entries composed of 4 EQ entries for the SQ and 2
EQ entries for the RQ. For optimial latency with small IO, we need to
change this so the HW only needs to DMA the EQ entries actually used
by a given work request.
Implementation:
- add wq_pidx counter to track where we are in the EQ. cidx/pidx are
used for the sw sq/rq tracking and flow control.
- the variable part of work requests is the SGL. Add new functions to
build the SGL and/or immediate data directly in the EQ memory
wrapping when needed.
- adjust the min burst size for the EQ contexts to 64B.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Q_FREECNT() returns the number of spaces free. This should never be a
negative amount. Also the num_wrs is an unsigned int so it can never
be less than zero.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The alloc_skb() in various allocations are failable, so remove
__GFP_NOFAIL from their masks.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The failure path in ipath_init_one() does not match the cleanup code
in ipath_remove_one() and appears to leave interrupts enabled in some
cases. Change it to match.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix reading hcall locking capability bit from device capabilities.
Signed-off-by: Alexander Schmidt <alexs@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rather than use a variable size array allocation on the stack,
define a constant for the maximum array size possible.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The rest of the code seems to assume that ep->com.cm_id can't be NULL,
so remove an unneeded test.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We don't need to assign rpl here, we do that later on.
Signed-off-by: Dan Carpenter <error27@gmail.com>
[ Indeed this assignment makes no sense, since skb is set to NULL a
couple of lines before. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change code like
x = expr(++x)
that assigns to x twice without a sequence point in between to the
intended (and well-defined)
x = expr(x + 1)
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Extract the microcode for the QLogic QLE7220 series IB HCA and use the
kernel microcode request facility to load the microcode. This
supports Debian Linux's requirements to separate microcode which
doesn't have open source code available from the device driver.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IPoIB: Fix world-writable child interface control sysfs attributes
IB/qib: Clean up properly if qib_init() fails
IB/qib: Completion queue callback needs to be single threaded
IB/qib: Update 7322 serdes tables
IB/qib: Clear 6120 hardware error register
IB/qib: Clear eager buffer memory for each new process
IB/qib: Mask hardware error during link reset
IB/qib: Don't mark VL15 bufs as WC to avoid a rare 7322 chip problem
RDMA/cxgb4: Derive smac_idx from port viid
RDMA/cxgb4: Avoid false GTS CIDX_INC overflows
RDMA/cxgb4: Don't call abort_connection() for active connect failures
RDMA/cxgb4: Use the DMA state API instead of the pci equivalents
Sumeet Lahorani <sumeet.lahorani@oracle.com> reported that the IPoIB
child entries are world-writable; however we don't want ordinary users
to be able to create and destroy child interfaces, so fix them to be
writable only by root.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If qib_init() fails, the driver fails to free memory, unregister
device files, and unregister with the PCIe framework. The driver will
unload without error but a subsequent driver load will cause the
system to panic. This was found by changing the 7220 code to load the
serdes microcode separately and not installing the microcode file.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Workqueues aren't exactly equivalent to tasklets since the callback
function may be called from multiple CPUs before the callback returns.
This causes completion notification callbacks to have MT bugs since
they weren't expecting this behavior. The fix is to use a single
threaded work queue.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The hardware error register needs to be cleared or another interrupt
will be generated, thus causing an infinite loop. This is a
regression introduced when removing debug output.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The eager buffers are not being cleared before being mmapped into a
new user address space. This is a potential security risk and should
be fixed. Note that the eager header queue is already being cleared.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The HCA checks for certain hardware errors which can be falsely
triggered when the IB link is reset. The fix is to mask them rather
than report them.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't set write combining via PAT on the VL15 buffers to avoid a rare
problem with unaligned writes from interrupt-flushed store buffers.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The T4 IQ hw design assumes CIDX_INC credits will be returned on a
regular basis and always before the CIDX counter crosses over the PIDX
counter. For RDMA CQs, however, returning CIDX_INC credits is only
needed and desired when and if the CQ is armed for notification. This
can lead to a GTS write returning credits that causes the HW to reject
the credit update because it causes CIDX to pass PIDX. Once this
happens, the CIDX/PIDX counters get out of whack and an application
can miss a notification and get stuck blocked awaiting a notification.
To avoid this, we allocate the HW IQ 2x times the requested size.
This seems to avoid the false overflow failures. If we see more
issues with this, then we'll have to add code in the poll path to
return credits periodically like when the amount reaches 1/2 the queue
depth). I would like to avoid this as it adds a PCI write transaction
for applications that never arm the CQ (like most MPIs).
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>