Commit Graph

3296 Commits

Author SHA1 Message Date
shefty 63f05be2c0 RDMA/cm: Change return value from find_gid_port()
Problem reported by Dan Carpenter <dan.carpenter@oracle.com>:

The patch 3c86aa70bf67: "RDMA/cm: Add RDMA CM support for IBoE
devices" from Oct 13, 2010, leads to the following warning:
net/sunrpc/xprtrdma/svc_rdma_transport.c:722 svc_rdma_create()
	 error: passing non neg 1 to ERR_PTR

This bug would result in a NULL dereference.  svc_rdma_create() is
supposed to return ERR_PTRs or valid pointers, but instead it returns
ERR_PTRs, valid pointers and 1.

The call tree is:

svc_rdma_create()
   => rdma_bind_addr()
      => cma_acquire_dev()
         => find_gid_port()

rdma_bind_addr() should return a valid errno.  Fix this by having
find_gid_port() also return a valid errno.  If we can't find the
specified GID on a given port, return -EADDRNOTAVAIL, rather than
-EAGAIN, to better indicate the error.  We also drop using the
special return value of '1' and instead pass through the error
returned by the underlying verbs call.  On such errors, rather
than aborting the search,  we simply continue to check the next
device/port.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-29 12:16:29 -08:00
Jack Morgenstein ceb7decb36 IB/mlx4: Fix spinlock order to avoid lockdep warnings
lockdep warns about taking a hard-irq-unsafe lock (sriov->id_map_lock)
inside a hard-irq-safe lock (sriov->going_down_lock).

Since id_map_lock is never taken in the interrupt context, we can
simply reverse the order of taking the two spinlocks, thus avoiding
the warning and the depencency.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-29 12:14:45 -08:00
Nicholas Bellinger 3e4f574857 ib_srpt: Convert TMR path to target_submit_tmr
This patch converts the TMR path in srpt_handle_tsk_mgmt() to use
target_submit_tmr() with TARGET_SCF_ACK_KREF flag usage.

v2: Drop ununused res in target_submit_tmr (Fengguang.Wu)

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Roland Dreier <roland@kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2012-11-28 11:20:28 -08:00
Nicholas Bellinger 9474b04313 ib_srpt: Convert I/O path to target_submit_cmd + drop legacy ioctx->kref
This patch converts the main srpt_handle_cmd() I/O path to use modern
target_submit_cmd() with TARGET_SCF_ACK_KREF flag usage.  This includes
dropping the original internal ioctx->kref + srpt_put_send_ioctx() usage
in favor of target_put_sess_cmd() w/ se_cmd_t->cmd_kref within ib_srpt
response callbacks.

It also updates srpt_abort_cmd() to call target_put_sess_cmd() for
completion of aborted commands, and adds target_wait_for_sess_cmds() into
srpt_release_channel_work() to allow outstanding I/O to complete during
session shutdown.

Also, go ahead and update srpt_handle_tsk_mgmt() to make the remaining
transport_init_se_cmd() to setup the ioctx->cmd with se_tmr_req.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Roland Dreier <roland@kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2012-11-28 01:10:59 -08:00
Roland Dreier d0951d2133 Merge branches 'cxgb4', 'misc', 'mlx4', 'nes' and 'uapi' into for-next 2012-11-26 11:08:41 -08:00
Julia Lawall 5107c2a3d1 RDMA/cxgb3: use WARN
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-26 11:08:16 -08:00
Julia Lawall 76f267b7ad RDMA/cxgb4: use WARN
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-26 11:07:18 -08:00
Or Gerlitz 08ff32352d mlx4: 64-byte CQE/EQE support
ConnectX-3 devices can use either 64- or 32-byte completion queue
entries (CQEs) and event queue entries (EQEs).  Using 64-byte
EQEs/CQEs performs better because each entry is aligned to a complete
cacheline.  This patch queries the HCA's capabilities, and if it
supports 64-byte CQEs and EQES the driver will configure the HW to
work in 64-byte mode.

The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ.

Since this mode is global, userspace (libmlx4) must be updated to work
with the configured CQE size, and guests using SR-IOV virtual
functions need to know both EQE and CQE size.

In case one of the 64-byte CQE/EQE capabilities is activated, the
patch makes sure that older guest drivers that use the QUERY_DEV_FUNC
command (e.g as done in mlx4_core of Linux 3.3..3.6) will notice that
they need an update to be able to work with the PPF. This is done by
changing the returned pf_context_behaviour not to be zero any more. In
case none of these capabilities is activated that value remains zero
and older guest drivers can run OK.

The SRIOV related flow is as follows

1. the PPF does the detection of the new capabilities using
   QUERY_DEV_CAP command.

2. the PPF activates the new capabilities using INIT_HCA.

3. the VF detects if the PPF activated the capabilities using
   QUERY_HCA, and if this is the case activates them for itself too.

Note that the VF detects that it must be aware to the new PF behaviour
using QUERY_FUNC_CAP.  Steps 1 and 2 apply also for native mode.

User space notification is done through a new field introduced in
struct mlx4_ib_ucontext which holds device capabilities for which user
space must take action. This changes the binary interface so the ABI
towards libmlx4 exposed through uverbs is bumped from 3 to 4 but only
when **needed** i.e. only when the driver does use 64-byte CQEs or
future device capabilities which must be in sync by user space. This
practice allows to work with unmodified libmlx4 on older devices (e.g
A0, B0) which don't support 64-byte CQEs.

In order to keep existing systems functional when they update to a
newer kernel that contains these changes in VF and userspace ABI, a
module parameter enable_64b_cqe_eqe must be set to enable 64-byte
mode; the default is currently false.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-26 10:19:17 -08:00
Benjamin Herrenschmidt 2a859ab07b Merge branch 'merge' into next
Merge my own merge branch to get various fixes from there
and upstream, especially the hvc console tty refcouting fixes
which which testing is quite a bit harder...
2012-11-26 09:23:57 +11:00
Alan Cox c9795bd708 RDMA/amsol1100: Fix missing break
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-22 00:52:55 -08:00
Alan Cox 5390f86796 IB/ipath: Remove unreachable code
Signed-off-by: Alan Cox <alan@linux.intel.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-22 00:50:51 -08:00
Julia Lawall 079abea6a3 RDMA/nes: Use WARN()
Use WARN() rather than printk() followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>

[ Remove extra KERN_ERR from WARN() format.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-22 00:49:15 -08:00
Tatyana Nikolova cecdcd5f24 RDMA/nes: Fix for incorrect multicast address in the perfect filter table
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-22 00:49:14 -08:00
Tatyana Nikolova fc8d7547b1 RDMA/nes: Fix for sending fpdus in order to hardware
Locking fix to prevent race conditions. Fpdus (per qp) need to be
forwarded to hardware in the order of their sequence numbers.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Donald Wood <Donald.E.Wood@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-22 00:49:14 -08:00
Tatyana Nikolova bff3976bef RDMA/nes: Fix for unlinking skbs from empty list
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-22 00:49:13 -08:00
Tatyana Nikolova 3127e4ea54 RDMA/nes: Fix incorrect address of IP header
Fix for incorrect ip header address when forwarding fpdus to hardware.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-22 00:49:13 -08:00
Ian Munsie cca55d9ddf powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function
I am going to use this in the next patch, better to have this code in
one place rather than three.

Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-11-15 15:08:07 +11:00
Christoph Hellwig de103c93af target: pass sense_reason as a return value
Pass the sense reason as an explicit return value from the I/O submission
path instead of storing it in struct se_cmd and using negative return
values.  This cleans up a lot of the code pathes, and with the sparse
annotations for the new sense_reason_t type allows for much better
error checking.

(nab: Convert spc_emulate_modesense + spc_emulate_modeselect to use
      sense_reason_t with Roland's MODE SELECT changes)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2012-11-06 20:55:46 -08:00
Roland Dreier 1e3474d1de Merge branches 'cxgb4' and 'mlx4' into for-next 2012-10-23 09:03:49 -07:00
Thadeu Lima de Souza Cascardo 32c631f9f2 RDMA/cxgb4: Don't free chunk that we have failed to allocate
In the error path of registering memory when there's a failure to
allocate a chunk from the memory pool, we try to free the same chunk
we just failed to allocate, which will BUG().

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-22 11:05:00 -07:00
Eli Cohen bef83ed92c IB/mlx4: Synchronize cleanup of MCGs in MCG paravirtualization
A client re-register event invokes cleanup of all MCGs.  This is
required to protect against misbehaved guests leading to corruption of
join/leave database.  However, since cleaning up the MCGs is a heavy
operation, it is pushed to a work queue for further processing.
Client re-register is also propagated to ULPs (e.g IPoIB).

However, since the cleanup is performed in a workqueue, the ULP could
leave and re-join groups before the cleanup occurs.  In this case,
when the cleanup takes place, it prunes the (newly-joined) MCGs and
the ULP is left without actual MCGs while believing it joined them.

Fix this by setting the flushing flag before invoking the cleanup task
and clearing it after flushing is complete.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-18 10:29:02 -07:00
Jack Morgenstein 2c75d2ccb6 IB/mlx4: Fix QP1 P_Key processing in the Primary Physical Function (PPF)
In the MAD paravirtualization code, one of the checks performed when
forwarding QP1 (GSI) packets from wire to slave was a P_Key check: the
P_Key received in the MAD must be present in the guest's paravirtualized
P_Key table, and at least one of the (packet P_Key, guest P_Key) must
be a full-membership P_Key.

However, if everyone involved has only limited membership in the
default P_Key, then packets sent by full-member remote hosts arrive at
the PPF but are not passed on to the VFs with the current P_Key1 check.

Fix this as follows:

1. Don't care if P_Key received over wire is full or not. If it
   successfully passed HW checks on the real QP1, then simply pass it
   to guest regardless of whether the guest has full or limited
   membership in its P_Key table.

2. If the guest (including paravirtualized master) has both full and
   limited P_Key forms in its table, preferentially pass the
   paravirtualized P_Key index of the full P_Key form in the tunnel
   header.

3. In the multicast join flow (mlx4/mcg.c), use the index for the
   default P_Key (wherever it is located) in replies generated from
   within the mcg module (previously, P_Key index 0 was used in all
   cases).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-18 10:29:02 -07:00
Doug Ledford 8a095030f7 IB/mlx4: Fix build error on platforms where UL is not 64 bits
Line 110 uses UL as a compiler cast for the 0x constant, but it's not
large enough to hold a 64-bit value on a 32-bit arch.

Signed-off-by: Doug Ledford <dledford@redhat.com>

[ Use "-1" instead of "FFFFFFFFFFFFFFFFULL".  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-18 10:29:01 -07:00
Linus Torvalds a188e7e93a Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull scsi target updates from Nicholas Bellinger:
 "Things have been calm for the most part with no new fabric drivers in
  flight for v3.7 (we're up to eight now !), so this update is primarily
  focused on addressing a few long-standing items within target-core and
  iscsi-target fabric code.

  The highlights include:

   - target: Simplify fabric sense data length handling (roland)
   - qla2xxx: Fix endianness of task management response code (roland)
   - target: fix truncation of mode data, support zero allocation length
     (paolo)
   - target: Properly support zero-length commands in normal processing
     path (paolo)
   - iscsi-target: Correctly set 0xffffffff field within ISCSI_OP_REJECT
     PDU (ronnie + nab)
   - iscsi-target: Add explicit set of cache_dynamic_acls=1 for TPG
     demo-mode (ronnie + nab)
   - target/file: Re-enable optional fd_buffered_io=1 operation (nab +
     hch)
   - iscsi-target: Add MaxXmitDataSegmenthLength forr target ->
     initiator MDRSL declaration (nab)
   - target: Add target_submit_cmd_map_sgls for SGL fabric memory
     passthrough (nab + hch)
   - tcm_loop: Convert I/O path to use target_submit_cmd_map_sgls (hch +
     nab)
   - tcm_vhost: Convert I/O path to use target_submit_cmd_map_sgls (nab
     + hch)

  The last series for adding a new target_submit_cmd_map_sgls() fabric
  caller (as requested by hch) that accepts pre-allocated SGL memory
  (using existing logic), along with converting tcm_loop + tcm_vhost has
  only been in -next for the last days, but has gotten enough review
  +testing and is clear enough a mechanical change that I think it's
  reasonable to merge for -rc1 code.

  Thanks again to everyone who contributed this round! Extra special
  thanks to Roland (PureStorage) for tracking down the qla2xxx target
  TMR response code endian issue, and to Paolo (Redhat) for resolving
  the long standing zero-length CDB issues within target-core between
  virtual and pSCSI backends."

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (44 commits)
  iscsi-target: Bump defaults for nopin_timeout + nopin_response_timeout values
  iscsit: proper endianess conversions
  iscsit: use the itt_t abstract type
  iscsit: add missing endianess conversion in iscsit_check_inaddr_any
  iscsit: remove incorrect unlock in iscsit_build_sendtargets_resp
  iscsit: mark various functions static
  target/iscsi: precedence bug in iscsit_set_dataout_sequence_values()
  target/usb-gadget: strlen() doesn't count the terminator
  target/usb-gadget: remove duplicate initialization
  tcm_vhost: Convert I/O path to use target_submit_cmd_map_sgls
  target: Add control CDB READ payload zero work-around
  tcm_loop: Convert I/O path to use target_submit_cmd_map_sgls
  target: Add target_submit_cmd_map_sgls for SGL fabric memory passthrough
  iscsi-target: Add explicit set of cache_dynamic_acls=1 for TPG demo-mode
  iscsi-target: Change iscsi_target_seq_pdu_list.c to honor MaxXmitDataSegmentLength
  iscsi-target: Add MaxXmitDataSegmentLength connection recovery check
  iscsi-target: Convert incoming PDU payload checks to MaxXmitDataSegmentLength
  iscsi-target: Enable MaxXmitDataSegmentLength operation in login path
  iscsi-target: Add base MaxXmitDataSegmentLength code
  target/file: Re-enable optional fd_buffered_io=1 operation
  ...
2012-10-10 19:52:19 +09:00
David S. Miller 8dd9117cc7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
Pulled mainline in order to get the UAPI infrastructure already
merged before I pull in David Howells's UAPI trees for networking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-09 13:14:32 -04:00
Konstantin Khlebnikov 314e51b985 mm: kill vma flag VM_RESERVED and mm->reserved_vm counter
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:

 | effect                 | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump      | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

This patch removes reserved_vm counter from mm_struct.  Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.

Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.

[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:19 +09:00
Linus Torvalds ca4da6948b Second batch of changes for the 3.7 merge window:
- Late-breaking fix for IPoIB on mlx4 SR-IOV VFs.
  - Fix for IPoIB build breakage with CONFIG_INFINIBAND_IPOIB_CM=n
    (new netlink config changes are to blame).
  - Make sure retry count values are in range in RDMA CM.
  - A few nes hardware driver fixes and cleanups.
  - Have iSER initiator use >1 interrupt vectors if available.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABCAAGBQJQbkIkAAoJEENa44ZhAt0hrLEP/3jNpwR+6/rbVxcPkmVVITsn
 yUz9S3V7loJqQqlHt7ktphgcoUpar+xs2C8DbCyVupyee4K3zd7Lw+wOVO2OJ16Y
 +gE8PZB6mSYOVaI0JW26Qqe2gcQ5DLZ4ic6E8gEiuyjNo3ig3MZl17ZffTI+5lXi
 igR+Qnsy2bm18JI6aumSeKYa17f7EmOK63jaIp0jc95vrtWke3jRRs6+bmmDLV7q
 4ZxCP7ckBTRFVkisXlWZ95MmUbg/lhWosZGdJwjRYs+LW3tzgY18OQZgbw6FJKpJ
 0OouAO51UMKcizfg7ikwUIw6/apaCFXXv/pH3rIeq18qiJeKcDsIXseY7sFRboHR
 fIPpfhHENokcxtlOReaw7hnlyMuwrLiYbF6TjPN9t8dWTNbwe+4W0K9G60fu1wmk
 qjaUTBCCuDF4KtKqCxP01cZzXyjJrworw5p+Nc1wxds+JeyKRCpVRrfYep8vOgft
 z6KJjQGDF16MvXmhF1dD9VBpmaMCOA4NDhfa0IcgHPTOLc/7Nbe5NTD95InlzM3Q
 /C0CPj+4J4kzEMnTBHVJvv1NeD3G1RQD80bce6ViTbgOUg9pmeGHkLWe420h7xT4
 u/LBnwM1dcqY7BDOnn3ycYoJgr5OCQZOpLcNWjTXv4t51mBOYtM6AJ6IBIvkOmSz
 QjEyiDEEvT2K7c/rhYOz
 =9Jv4
 -----END PGP SIGNATURE-----

Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull infiniband changes from Roland Dreier:
  "Second batch of changes for the 3.7 merge window:
   - Late-breaking fix for IPoIB on mlx4 SR-IOV VFs.
   - Fix for IPoIB build breakage with CONFIG_INFINIBAND_IPOIB_CM=n (new
     netlink config changes are to blame).
   - Make sure retry count values are in range in RDMA CM.
   - A few nes hardware driver fixes and cleanups.
   - Have iSER initiator use >1 interrupt vectors if available."

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
  RDMA/cma: Check that retry count values are in range
  IB/iser: Add more RX CQs to scale out processing of SCSI responses
  RDMA/nes: Bump the version number of nes driver
  RDMA/nes: Remove unused module parameter "send_first"
  RDMA/nes: Remove unnecessary if-else statement
  RDMA/nes: Add missing break to switch.
  mlx4_core: Adjust flow steering attach wrapper so that IB works on SR-IOV VFs
  IPoIB: Fix build with CONFIG_INFINIBAND_IPOIB_CM=n
2012-10-07 17:19:49 +09:00
Gao feng 809d5fc9bf infiniband: pass rdma_cm module to netlink_dump_start
set netlink_dump_control.module to avoid panic.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-07 00:30:56 -04:00
Linus Torvalds 5f3d2f2e1a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
 "Some highlights in addition to the usual batch of fixes:

   - 64TB address space support for 64-bit processes by Aneesh Kumar

   - Gavin Shan did a major cleanup & re-organization of our EEH support
     code (IBM fancy PCI error handling & recovery infrastructure) which
     paves the way for supporting different platform backends, along
     with some rework of the PCIe code for the PowerNV platform in order
     to remove home made resource allocations and instead use the
     generic code (which is possible after some small improvements to it
     done by Gavin).

   - Uprobes support by Ananth N Mavinakayanahalli

   - A pile of embedded updates from Freescale folks, including new SoC
     and board supports, more KVM stuff including preparing for 64-bit
     BookE KVM support, ePAPR 1.1 updates, etc..."

Fixup trivial conflicts in drivers/scsi/ipr.c

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (146 commits)
  powerpc/iommu: Fix multiple issues with IOMMU pools code
  powerpc: Fix VMX fix for memcpy case
  driver/mtd:IFC NAND:Initialise internal SRAM before any write
  powerpc/fsl-pci: use 'Header Type' to identify PCIE mode
  powerpc/eeh: Don't release eeh_mutex in eeh_phb_pe_get
  powerpc: Remove tlb batching hack for nighthawk
  powerpc: Set paca->data_offset = 0 for boot cpu
  powerpc/perf: Sample only if SIAR-Valid bit is set in P7+
  powerpc/fsl-pci: fix warning when CONFIG_SWIOTLB is disabled
  powerpc/mpc85xx: Update interrupt handling for IFC controller
  powerpc/85xx: Enable USB support in p1023rds_defconfig
  powerpc/smp: Do not disable IPI interrupts during suspend
  powerpc/eeh: Fix crash on converting OF node to edev
  powerpc/eeh: Lock module while handling EEH event
  powerpc/kprobe: Don't emulate store when kprobe stwu r1
  powerpc/kprobe: Complete kprobe and migrate exception frame
  powerpc/kprobe: Introduce a new thread flag
  powerpc: Remove unused __get_user64() and __put_user64()
  powerpc/eeh: Global mutex to protect PE tree
  powerpc/eeh: Remove EEH PE for normal PCI hotplug
  ...
2012-10-06 03:16:12 +09:00
Fengguang Wu 125c4c706b idr: rename MAX_LEVEL to MAX_IDR_LEVEL
To avoid name conflicts:

  drivers/video/riva/fbdev.c:281:9: sparse: preprocessor token MAX_LEVEL redefined

While at it, also make the other names more consistent and add
parentheses.

[akpm@linux-foundation.org: repair fallout]
[sfr@canb.auug.org.au: IB/mlx4: fix for MAX_ID_MASK to MAX_IDR_MASK name change]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: walter harms <wharms@bfs.de>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:04:56 +09:00
Roland Dreier 56040f07b2 Merge branches 'cma', 'ipoib', 'iser', 'mlx4' and 'nes' into for-next 2012-10-04 19:12:22 -07:00
Sean Hefty 4ede178a5e RDMA/cma: Check that retry count values are in range
The retry_count and rnr_retry_count connection parameters are both
3-bit values.  Check that the values are in range and reduce if
they're not.

This fixes a problem reported by Doug Ledford <dledford@redhat.com>
that resulted in the userspace rping test (part of the librdmacm
samples) failing to run over Intel IB HCAs.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

[ Use min_t() to avoid warnings about type mismatch.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-04 19:11:54 -07:00
Alex Tabachnik 5a33a66949 IB/iser: Add more RX CQs to scale out processing of SCSI responses
RX/TX CQs will now be selected from a per HCA pool.  For the RX flow
this has the effect of using different interrupt vectors when using
low level drivers (such as mlx4) that map the "vector" param provided
by the ULP on CQ creation to a dedicated IRQ/MSI-X vector.  This
allows the RX flow processing of IO responses to be distributed across
multiple CPUs.

QPs (--> iSER sessions) are assigned to CQs in round robin order using
the CQ with the minimum number of sessions attached to it.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Alex Tabachnik <alext@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-03 21:26:49 -07:00
Tatyana Nikolova cf9fd75cbc RDMA/nes: Bump the version number of nes driver
Bumpthe version of the nes driver to reflect recent fixes.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-03 14:27:59 -07:00
Tatyana Nikolova a378e3a3fb RDMA/nes: Remove unused module parameter "send_first"
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-03 14:27:59 -07:00
Tatyana Nikolova a67b078cf6 RDMA/nes: Remove unnecessary if-else statement
Remove unnecessary if-else statement -- we do the same thing either way.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-03 14:27:58 -07:00
Tatyana Nikolova d5fb476a10 RDMA/nes: Add missing break to switch.
Static code analyzer cppcheck points out a missing break.

Reported-by: David Binderman <dcb314@hotmail.com>
Addresses: <https://bugzilla.kernel.org/show_bug.cgi?id=47671>
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-03 14:27:58 -07:00
Roland Dreier 71d9c5f9e6 IPoIB: Fix build with CONFIG_INFINIBAND_IPOIB_CM=n
With the new netlink support in commit 862096a8bb ("IB/ipoib: Add more
rtnl_link_ops callbacks") we need ipoib_set_mode() to be available even
if connected mode isn't built.  Move the function from ipoib_cm.c to
ipoib_main.c (and make a few CM-related macros available unconditonally).

This fixes the build error

    drivers/built-in.o: In function 'ipoib_changelink':
    ipoib_netlink.c:(.text+0x6a5fc9): undefined reference to 'ipoib_set_mode'
    ipoib_netlink.c:(.text+0x6a5fe3): undefined reference to 'ipoib_set_mode'

when CONFIG_INFINIBAND_IPOIB_CM isn't set.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-02 21:33:41 -07:00
Linus Torvalds aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Linus Torvalds 7a9a2970b5 First batch of InfiniBand/RDMA changes for the 3.7 merge window:
- mlx4 IB support for SR-IOV
  - A couple of SRP initiator fixes
  - Batch of nes hardware driver fixes
  - Fix for long-standing use-after-free crash in IPoIB
  - Other miscellaneous fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABCAAGBQJQav4jAAoJEENa44ZhAt0hmL0QAJTuMdSOzYFd/NB38owJCNM2
 kz/N1GlBm3z98fIlGo8u+lzgV2qxqZSAzsJsouMeK38KiAX3CL8HKe44A1QvTM6v
 dXTNL4JFX24/YF+nlmMY8Av518I9Mkte3BZCnpYkBjVFBWe0ePwoRC/btfBXPDIV
 0snq4OtjoBAn00dOOyuZ5PoyY9xf0z4UB0Gple9sM4mzEb8wVWdNDDPOiuPJc6fA
 L+gk6HLkZDg54+QswafdKYwpeTq45wIKLmCdS3oUNmppMLVhZY8rECOwzSa+KiTr
 /Yo19n+zl+IBlvjQHhmUqGHvdD17PaGlr+TckAsQqmVfXUH5qqpEnkF8FoEK59c5
 YA3lVU8Sj4BPhJ7qX54CuN3767mZizakkxCr9iPRzABFTgzWVgcSgCrE8jjx4i0h
 Pam+L5bmANFStgmGR8PmXiNgcrCUcEqYHsOWDDAnHa5ekb2nyv1JL1c18hlY9hC3
 Xb1YTMZFwvofGza89hBu7oHrMbLOUc5kW2lBpvUn2nlyf3i0F8ISlVbVbNjFA54p
 60/jHa2VOQ2CcJUJKnJOk4ajOOEfHnPtMn2q96XJ69Dp8+eSYEO/G+0i1OlChq4h
 ClnG0Yp+NkT1o8WXMd7guDR+RsXt+DXIij5TiUWRIqnIlopIsMTRhNH28tMu4jQL
 fgN5n987wru91ewdX4gW
 =PAcy
 -----END PGP SIGNATURE-----

Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull infiniband updates from Roland Dreier:
 "First batch of InfiniBand/RDMA changes for the 3.7 merge window:
   - mlx4 IB support for SR-IOV
   - A couple of SRP initiator fixes
   - Batch of nes hardware driver fixes
   - Fix for long-standing use-after-free crash in IPoIB
   - Other miscellaneous fixes"

This merge also removes a new use of __cancel_delayed_work(), and
replaces it with the regular cancel_delayed_work() that is now irq-safe
thanks to the workqueue updates.

That said, I suspect the sequence in question should probably use
"mod_delayed_work()".  I just did the minimal "don't use deprecated
functions" fixup, though.

* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (45 commits)
  IB/qib: Fix local access validation for user MRs
  mlx4_core: Disable SENSE_PORT for multifunction devices
  mlx4_core: Clean up enabling of SENSE_PORT for older (ConnectX-1/-2) HCAs
  mlx4_core: Stash PCI ID driver_data in mlx4_priv structure
  IB/srp: Avoid having aborted requests hang
  IB/srp: Fix use-after-free in srp_reset_req()
  IB/qib: Add a qib driver version
  RDMA/nes: Fix compilation error when nes_debug is enabled
  RDMA/nes: Print hardware resource type
  RDMA/nes: Fix for crash when TX checksum offload is off
  RDMA/nes: Cosmetic changes
  RDMA/nes: Fix for incorrect MSS when TSO is on
  RDMA/nes: Fix incorrect resolving of the loopback MAC address
  mlx4_core: Fix crash on uninitialized priv->cmd.slave_sem
  mlx4_core: Trivial cleanups to driver log messages
  mlx4_core: Trivial readability fix: "0X30" -> "0x30"
  IB/mlx4: Create paravirt contexts for VFs when master IB driver initializes
  mlx4: Modify proxy/tunnel QP mechanism so that guests do no calculations
  mlx4: Paravirtualize Node Guids for slaves
  mlx4: Activate SR-IOV mode for IB
  ...
2012-10-02 17:20:40 -07:00
Linus Torvalds aecdc33e11 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:

 1) GRE now works over ipv6, from Dmitry Kozlov.

 2) Make SCTP more network namespace aware, from Eric Biederman.

 3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

 4) Make openvswitch network namespace aware, from Pravin B Shelar.

 5) IPV6 NAT implementation, from Patrick McHardy.

 6) Server side support for TCP Fast Open, from Jerry Chu and others.

 7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

 8) Increate the loopback default MTU to 64K, from Eric Dumazet.

 9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic.  This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail.  Benefits are
    a) less segments for driver to process b) less calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
  hyperv: Add buffer for extended info after the RNDIS response message.
  hyperv: Report actual status in receive completion packet
  hyperv: Remove extra allocated space for recv_pkt_list elements
  hyperv: Fix page buffer handling in rndis_filter_send_request()
  hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
  hyperv: Fix the max_xfer_size in RNDIS initialization
  vxlan: put UDP socket in correct namespace
  vxlan: Depend on CONFIG_INET
  sfc: Fix the reported priorities of different filter types
  sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
  sfc: Fix loopback self-test with separate_tx_channels=1
  sfc: Fix MCDI structure field lookup
  sfc: Add parentheses around use of bitfield macro arguments
  sfc: Fix null function pointer in efx_sriov_channel_type
  vxlan: virtual extensible lan
  igmp: export symbol ip_mc_leave_group
  netlink: add attributes to fdb interface
  tg3: unconditionally select HWMON support when tg3 is enabled.
  Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
  gre: fix sparse warning
  ...
2012-10-02 13:38:27 -07:00
Linus Torvalds 437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Linus Torvalds 033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
Roland Dreier d172f5a4ab Merge branches 'cma', 'cxgb4', 'ipoib', 'mlx4', 'mlx4-sriov', 'nes', 'qib' and 'srp' into for-linus 2012-10-02 07:43:59 -07:00
Or Gerlitz 862096a8bb IB/ipoib: Add more rtnl_link_ops callbacks
Add the rtnl_link_ops changelink and fill_info callbacks, through
which the admin can now set/get the driver mode, etc policies.
Maintain the proprietary sysfs entries only for legacy childs.

For child devices, set dev->iflink to point to the parent
device ifindex, such that user space tools can now correctly
show the uplink relation as done for vlan, macvlan, etc
devices. Pointed out by Patrick McHardy <kaber@trash.net>

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 17:12:22 -04:00
Linus Torvalds fdb2f9c2eb PCI changes for the 3.7 merge window:
Host bridge hotplug
     - Protect acpi_pci_drivers and acpi_pci_roots (Taku Izumi)
     - Clear host bridge resource info to avoid issue when releasing (Yinghai Lu)
     - Notify acpi_pci_drivers when hot-plugging host bridges (Jiang Liu)
     - Use standard list ops for acpi_pci_drivers (Jiang Liu)
 
   Device hotplug
     - Use pci_get_domain_bus_and_slot() to close hotplug races (Jiang Liu)
     - Remove fakephp driver (Bjorn Helgaas)
     - Fix VGA ref count in hotplug remove path (Yinghai Lu)
     - Allow acpiphp to handle PCIe ports without native hotplug (Jiang Liu)
     - Implement resume regardless of pciehp_force param (Oliver Neukum)
     - Make pci_fixup_irqs() work after init (Thierry Reding)
 
   Miscellaneous
     - Add pci_pcie_type(dev) and remove pci_dev.pcie_type (Yijing Wang)
     - Factor out PCI Express Capability accessors (Jiang Liu)
     - Add pcibios_window_alignment() so powerpc EEH can use generic resource assignment (Gavin Shan)
     - Make pci_error_handlers const (Stephen Hemminger)
     - Cleanup drivers/pci/remove.c (Bjorn Helgaas)
     - Improve Vendor-Specific Extended Capability support (Bjorn Helgaas)
     - Use standard list ops for bus->devices (Bjorn Helgaas)
     - Avoid kmalloc in pci_get_subsys() and pci_get_class() (Feng Tang)
     - Reassign invalid bus number ranges (Intel DP43BF workaround) (Yinghai Lu)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQac4hAAoJEPGMOI97Hn6zjZYP/iaqU9kjmgTsBbSyzB4oApv/
 RRxo3I+ad9GF6XlMQfVAtyx1pgCD1gdGAtoDgGSCTqgdYD3CO10AxKU+yleAk1wo
 dbMxLifJNTrT3G1mZ/NL16yEGhCwvhfwzRtB1VoZmCT4lSApO/7cJkXl2DzHfA/i
 pmltOOiQCN8kbUcJbVPtUyTVPi2zl/8bsyCyTkS7YG0VXeGRM+ZUvPWZJ7MnWYYB
 5qoCdrw5ENCCiDQ9yw5SAfgL23b+0p6OI/x3Lkex0QQOWwSqGSiaHt4b7eitrC5b
 2eAJg32f/AzZke1YbKLMfdsL0VJP3GAswhDVHlgmo63rZkOZChm+97dgZ35Mcv5v
 kEXkWyBb1xJ3t8rZir6Qer9Iv2wOB+MkZ5qtU/Vf+l0wLQLYTrRVsKngrEDREONk
 dXbokp6iVSPeA1sTSdH9MmHlTUIj82ZLSGcxcjTsN8NWZjxx6g3rNx1uay+5MYOW
 4ET9zNu5snrAqN6N4Tb81gvtG8qYfxzdvVfrA9AaGKI6xxB7jkqgFJRp55JiEcFc
 x4cmWkhvdlhVsG2TQwFxYNfswOqD+7NCs6V4kSVZX6ezpDrH7I5VvcnnhstF7C8l
 KZul0EV7OW+kDK23pNe24lVP2xtOv6G8eK/3PmeKIXWl1V83nqre/oLufRzTfs+Z
 SxkILwY/MFpuCFteKE1t
 =haBu
 -----END PGP SIGNATURE-----

Merge tag 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Host bridge hotplug
    - Protect acpi_pci_drivers and acpi_pci_roots (Taku Izumi)
    - Clear host bridge resource info to avoid issue when releasing
      (Yinghai Lu)
    - Notify acpi_pci_drivers when hot-plugging host bridges (Jiang Liu)
    - Use standard list ops for acpi_pci_drivers (Jiang Liu)

  Device hotplug
    - Use pci_get_domain_bus_and_slot() to close hotplug races (Jiang
      Liu)
    - Remove fakephp driver (Bjorn Helgaas)
    - Fix VGA ref count in hotplug remove path (Yinghai Lu)
    - Allow acpiphp to handle PCIe ports without native hotplug (Jiang
      Liu)
    - Implement resume regardless of pciehp_force param (Oliver Neukum)
    - Make pci_fixup_irqs() work after init (Thierry Reding)

  Miscellaneous
    - Add pci_pcie_type(dev) and remove pci_dev.pcie_type (Yijing Wang)
    - Factor out PCI Express Capability accessors (Jiang Liu)
    - Add pcibios_window_alignment() so powerpc EEH can use generic
      resource assignment (Gavin Shan)
    - Make pci_error_handlers const (Stephen Hemminger)
    - Cleanup drivers/pci/remove.c (Bjorn Helgaas)
    - Improve Vendor-Specific Extended Capability support (Bjorn
      Helgaas)
    - Use standard list ops for bus->devices (Bjorn Helgaas)
    - Avoid kmalloc in pci_get_subsys() and pci_get_class() (Feng Tang)
    - Reassign invalid bus number ranges (Intel DP43BF workaround)
      (Yinghai Lu)"

* tag 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (102 commits)
  PCI: acpiphp: Handle PCIe ports without native hotplug capability
  PCI/ACPI: Use acpi_driver_data() rather than searching acpi_pci_roots
  PCI/ACPI: Protect acpi_pci_roots list with mutex
  PCI/ACPI: Use acpi_pci_root info rather than looking it up again
  PCI/ACPI: Pass acpi_pci_root to acpi_pci_drivers' add/remove interface
  PCI/ACPI: Protect acpi_pci_drivers list with mutex
  PCI/ACPI: Notify acpi_pci_drivers when hot-plugging PCI root bridges
  PCI/ACPI: Use normal list for struct acpi_pci_driver
  PCI/ACPI: Use DEVICE_ACPI_HANDLE rather than searching acpi_pci_roots
  PCI: Fix default vga ref_count
  ia64/PCI: Clear host bridge aperture struct resource
  x86/PCI: Clear host bridge aperture struct resource
  PCI: Stop all children first, before removing all children
  Revert "PCI: Use hotplug-safe pci_get_domain_bus_and_slot()"
  PCI: Provide a default pcibios_update_irq()
  PCI: Discard __init annotations for pci_fixup_irqs() and related functions
  PCI: Use correct type when freeing bus resource list
  PCI: Check P2P bridge for invalid secondary/subordinate range
  PCI: Convert "new_id"/"remove_id" into generic pci_bus driver attributes
  xen-pcifront: Use hotplug-safe pci_get_domain_bus_and_slot()
  ...
2012-10-01 12:05:36 -07:00
Mike Marciniszyn c00aaa1a02 IB/qib: Fix local access validation for user MRs
Commit 8aac4cc3a9 ("IB/qib: RCU locking for MR validation") introduced
a bug that broke user post sends.  The proper validation of the MR
was lost in the patch.

This patch corrects that validation.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-10-01 09:59:26 -07:00
Bart Van Assche d853667091 IB/srp: Avoid having aborted requests hang
We need to call scsi_done() for commands after we abort them.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:36:48 -07:00
Bart Van Assche 9b796d06d5 IB/srp: Fix use-after-free in srp_reset_req()
srp_free_req() uses the scsi_cmnd structure contents to unmap
buffers, so we must invoke srp_free_req() before we release
ownership of that structure.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:36:47 -07:00
Dean Luick e20d583818 IB/qib: Add a qib driver version
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:36:29 -07:00
Tatyana Nikolova bca1935ccd RDMA/nes: Fix compilation error when nes_debug is enabled
Removing old variables caused a compile error from nes_debug().

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:34:55 -07:00
Tatyana Nikolova 818216442b RDMA/nes: Print hardware resource type
Hardware resource types are added and when a resource isn't available,
its type is printed.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:34:55 -07:00
Tatyana Nikolova fc4ba7291b RDMA/nes: Fix for crash when TX checksum offload is off
When TX checksum offload is disabled for an iWarp connection,
skb->ip_summed needs to be set to CHECKSUM_NONE.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:34:54 -07:00
Tatyana Nikolova 48a9956362 RDMA/nes: Cosmetic changes
- Remove unnecessary statement "if (1)"
 - Refactor a statement (wqe_misc |= NES_NIC_SQ_WQE_COMPLETION) out of
   if/else statement, because it is independant of the flow.
 - Define netdev->features in one line for clarity.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:34:53 -07:00
Tatyana Nikolova 6ad1be814b RDMA/nes: Fix for incorrect MSS when TSO is on
In TSO handling code, skb_shared_info() is used to get the MSS
instead of the bool function skb_is_gso() (which always returns 1).

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:34:52 -07:00
Tatyana Nikolova ef3d0c4a5e RDMA/nes: Fix incorrect resolving of the loopback MAC address
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:34:52 -07:00
Jack Morgenstein 3806d08cf6 IB/mlx4: Create paravirt contexts for VFs when master IB driver initializes
When we have VFs and PFs on same host, the VFs are activated within
the mlx4_core module before the mlx4_ib kernel module is loaded.

When the mlx4_ib module initializes the PF (master), it now creates
MAD paravirtualization contexts for any VFs that already active.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:44 -07:00
Jack Morgenstein 47605df953 mlx4: Modify proxy/tunnel QP mechanism so that guests do no calculations
Previously, the structure of a guest's proxy QPs followed the
structure of the PPF special qps (qp0 port 1, qp0 port 2, qp1 port 1,
qp1 port 2, ...).  The guest then did offset calculations on the
sqp_base qp number that the PPF passed to it in QUERY_FUNC_CAP().

This is now changed so that the guest does no offset calculations
regarding proxy or tunnel QPs to use.  This change frees the PPF from
needing to adhere to a specific order in allocating proxy and tunnel
QPs.

Now QUERY_FUNC_CAP provides each port individually with its proxy
qp0, proxy qp1, tunnel qp0, and tunnel qp1 QP numbers, and these are
used directly where required (with no offset calculations).

To accomplish this change, several fields were added to the phys_caps
structure for use by the PPF and by non-SR-IOV mode:

    base_sqpn -- in non-sriov mode, this was formerly sqp_start.
    base_proxy_sqpn -- the first physical proxy qp number -- used by PPF
    base_tunnel_sqpn -- the first physical tunnel qp number -- used by PPF.

The current code in the PPF still adheres to the previous layout of
sqps, proxy-sqps and tunnel-sqps.  However, the PPF can change this
layout without affecting VF or (paravirtualized) PF code.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:43 -07:00
Jack Morgenstein afa8fd1db9 mlx4: Paravirtualize Node Guids for slaves
This is necessary in order to support > 1 VF/PF in a VM for software
that uses the node guid as a discriminator, such as librdmacm.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:43 -07:00
Jack Morgenstein 026149cbaa mlx4: Activate SR-IOV mode for IB
Remove the error returns for IB ports from mlx4_ib_add,
mlx4_INIT_PORT_wrapper, and mlx4_CLOSE_PORT_wrapper.

Currently, SRIOV is supported only for devices for which the
link layer is IB on all ports; RoCE support will be added later.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:42 -07:00
Jack Morgenstein 992e8e6e87 IB/mlx4: Miscellaneous adjustments for SR-IOV IB support
1. Allow only master to change node description.
2. Prevent AH leakage in send mads.
3. Take device part number from PCI structure, so that guests see the
   VF part number (and not the PF part number).
4. Place the device revision ID into caps structure at startup.
5. SET_PORT in update_gids_task needs to go through wrapper on master.
6. In mlx4_ib_event(), PORT_MGMT_EVENT needs be handled in a work
   queue on the master, since it propagates events to slaves using
   GEN_EQE.
7. Do not support FMR on slaves.
8. Add spinlock to slave_event(), since it is called both in interrupt
   context and in process context (due to 6 above, and also if
   smp_snoop is used).  This fix was found and implemented by Saeed
   Mahameed <saeedm@mellanox.com>

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:41 -07:00
Jack Morgenstein c1e7e46612 IB/mlx4: Add iov directory in sysfs under the ib device
This directory is added only for the master -- slaves do not have it.

The sysfs iov directory is used to manage and examine the port P_Key
and guid paravirtualization.

Under iov/ports, the administrator may examine the gid and P_Key tables
as they are present in the device (and as are seen in the "network
view" presented to the SM).

Under the iov/<pci slot number> directories, the admin may map the
index numbers in the physical tables (as under iov/ports) to the
paravirtualized index numbers that guests see.

For example, if the administrator, for port 1 on guest 2 maps physical
pkey index 10 to virtual index 1, then that guest, whenever it uses
its pkey index 1, will actually be using the real pkey index 10.

Based on patch from Erez Shitrit <erezsh@mellanox.com>

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:39 -07:00
Jack Morgenstein 2a4fae148c IB/mlx4: Propagate P_Key and guid change port management events to slaves
P_Key change and guid change events are not of interest to all slaves,
but only to those slaves which "see" the table slots whose contents
have change.

For example, if the guid at port 1, index 5 has changed in the PPF, we
wish to propagate the gid-change event only to the function which has
that guid index mapped to its port/guid table (in this case it is
slave #5). Other functions should not get the event, since the event
does not affect them.

Similarly with P_Keys -- P_Key change events are forwarded only to
slaves which have that P_Key index mapped to their virtual P_Key table.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:38 -07:00
Jack Morgenstein a0c64a17ab mlx4: Add alias_guid mechanism
For IB ports, we paravirtualize the GUID at index 0 on slaves.  The
GUID at index 0 seen by a slave is the actual GUID occupying the GUID
table at the slave-id index.

The driver, by default, requests at startup time that subnet manager
populate its entire guid table with GUIDs. These guids are then mapped
(paravirtualized) to the slaves, and appear for each slave as its GUID
at index 0.

Until each slave has such a guid, its port status is DOWN.

The guid table is cached to support special QP paravirtualization, and
event propagation to slaves on guid change (we test to see if the guid
really changed before propagating an event to the slave).

To support this caching, add capability to __mlx4_ib_query_gid() to
obtain the network view (i.e., physical view) gid at index X, not just
the host (paravirtualized) view.

Based on a patch from Erez Shitrit <erezsh@mellanox.com>

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:37 -07:00
Amir Vadai 3cf69cc8db IB/mlx4: Add CM paravirtualization
In CM para-virtualization:

1. Incoming requests are steered to the correct vHCA according to the
   embedded GID.
2. Communication IDs on outgoing requests are replaced by a globally
   unique ID, generated by the PPF, since there is no synchronization
   of ID generation between guests (and so these IDs are not
   guaranteed to be globally unique).  The guest's comm ID is stored,
   and is returned to the response MAD when it arrives.

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:36 -07:00
Oren Duer b9c5d6a643 IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV
MCG paravirtualization support includes:
- Creating multicast groups by VFs, and keeping accounting of them
- Leaving multicast groups by VFs
- Updating SM only with real changes in the overall picture of MCGs status
- Creation of MGID=0 groups (let SM choose MGID)

Note that the MCG module maintains its own internal MCG object
reference counts.  The reason for this is that the IB core is used to
track only the multicast groups joins generated by the PF it runs
over.  The PF IB core layer is unaware of slaves, so it cannot be used
to keep track of MCG joins they generate.

Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:35 -07:00
Jack Morgenstein 0a9a01884d mlx4: MAD_IFC paravirtualization
The MAD_IFC firmware command fulfills two functions.

First, it is used in the QP0/QP1 MAD-handling flow to obtain
information from the FW (for answering queries), and for setting
variables in the HCA (MAD SET packets).

For this, MAD_IFC should provide the FW (physical) view of the data.
This is the view that OpenSM needs.  We call this the "network view".

In the second case, MAD_IFC is used by various verbs to obtain data
regarding the local HCA (e.g., ib_query_device()).  We call this the
"host view".

This data needs to be paravirtualized.

MAD_IFC therefore needs a wrapper function, and also needs another
flag indicating whether it should provide the network view (when it is
called by ib_process_mad in special-qp packet handling), or the host
view (when it is called while implementing a verb).

There are currently 2 flag parameters in mlx4_MAD_IFC already:
ignore_bkey and ignore_mkey.  These two parameters are replaced by a
single "mad_ifc_flags" parameter, with different bits set for each
flag.  A third flag is added: "network-view/host-view".

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:34 -07:00
Jack Morgenstein 37bfc7c1e8 IB/mlx4: SR-IOV multiplex and demultiplex MADs
Special QPs are paravirtualized.

vHCAs are not given direct access to QP0/1. Rather, these QPs are
operated by a special context hosted by the PF, which mediates access
to/from vHCAs.  This is done by opening a "tunnel" per vHCA port per
QP0/1. A tunnel comprises a pair of UD QPs: a "Tunnel QP" in the
PF-context and a "Proxy QP" in the vHCA.  All vHCA MAD traffic must
pass through the corresponding tunnel.  vHCA QPs cannot be assigned to
VL15 and are denied of the well-known QKey.

Outgoing messages are "de-multiplexed" (i.e., directed to the wire via
the real special QP).

Incoming messages are "multiplexed" (i.e. steered by the PPF to the
correct VF or to the PF)

QP0 access is restricted to the PF vHCA. VF vHCAs also have (virtual)
QP0s, but they never receive any SMPs and all SMPs sent are discarded.
QP1 traffic is allowed for all vHCAs, but special care is required to
bridge the gap between the host and network views.

Specifically:
- Transaction IDs are mapped to guarantee uniqueness among vHCAs
- CM para-virtualization
  o   Incoming requests are steered to the correct vHCA according to the embedded GID
  o   Local communication IDs are mapped to ensure uniqueness among vHCAs
  (see the patch that adds CM paravirtualization.)
- Multicast para-virtualization
  o   The PF context aggregates membership state from all vHCAs
  o   The SA is contacted only when the aggregate membership changes
  o   If the aggregate does not change, the PF context will provide the
      requesting vHCA with the proper response.
  (see the patch that adds multicast group paravirtualization)

Incoming MADs are steered according to:
- the DGID If a GRH is present
- the mapped transaction ID for response MADs
- the embedded GID in CM requests
- the remote communication ID in other CM messages

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:34 -07:00
Jack Morgenstein 54679e1482 mlx4: Implement QP paravirtualization and maintain phys_pkey_cache for smp_snoop
This requires:

1. Replacing the paravirtualized P_Key index (inserted by the guest)
   with the real P_Key index.

2. For UD QPs, placing the guest's true source GID index in the
   address path structure mgid field, and setting the ud_force_mgid
   bit so that the mgid is taken from the QP context and not from the
   WQE when posting sends.

3. For UC and RC QPs, placing the guest's true source GID index in the
   address path structure mgid field.

4. For tunnel and proxy QPs, setting the Q_Key value reserved for that
   proxy/tunnel pair.

Since not all the above adjustments occur in all the QP transitions,
the QP transitions require separate wrapper functions.

Secondly, initialize the P_Key virtualization table to its default
values: Master virtualized table is 1-1 with the real P_Key table,
guest virtualized table has P_Key index 0 mapped to the real P_Key
index 0, and all the other P_Key indices mapped to the reserved
(invalid) P_Key at index 127.

Finally, add logic in smp_snoop for maintaining the phys_P_Key_cache.
and generating events on the master only if a P_Key actually changed.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:33 -07:00
Jack Morgenstein fc06573dfa IB/mlx4: Initialize SR-IOV IB support for slaves in master context
Allocate SR-IOV paravirtualization resources and MAD demuxing contexts
on the master.

This has two parts.  The first part is to initialize the structures to
contain the contexts.  This is done at master startup time in
mlx4_ib_init_sriov().

The second part is to actually create the tunneling resources required
on the master to support a slave.  This is performed the master
detects that a slave has started up (MLX4_DEV_EVENT_SLAVE_INIT event
generated when a slave initializes its comm channel).

For the master, there is no such startup event, so it creates its own
tunneling resources when it starts up.  In addition, the master also
creates the real special QPs.  The ib_core layer on the master causes
creation of proxy special QPs, since the master is also
paravirtualized at the ib_core layer.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:32 -07:00
Jack Morgenstein 1ffeb2eb8b IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support
1. Introduce the basic SR-IOV parvirtualization context objects for
   multiplexing and demultiplexing MADs.
2. Introduce support for the new proxy and tunnel QP types.

This patch introduces the objects required by the master for managing
QP paravirtualization for guests.

struct mlx4_ib_sriov is created by the master only.
It is a container for the following:

1. All the info required by the PPF to multiplex and de-multiplex MADs
   (including those from the PF). (struct mlx4_ib_demux_ctx demux)
2. All the info required to manage alias GUIDs (i.e., the GUID at
   index 0 that each guest perceives.  In fact, this is not the GUID
   which is actually at index 0, but is, in fact, the GUID which is at
   index[<VF number>] in the physical table.
3. structures which are used to manage CM paravirtualization
4. structures for managing the real special QPs when running in SR-IOV
   mode.  The real SQPs are controlled by the PPF in this case.  All
   SQPs created and controlled by the ib core layer are proxy SQP.

struct mlx4_ib_demux_ctx contains the information per port needed
to manage paravirtualization:

1. All multicast paravirt info
2. All tunnel-qp paravirt info for the port.
3. GUID-table and GUID-prefix for the port
4. work queues.

struct mlx4_ib_demux_pv_ctx contains all the info for managing the
paravirtualized QPs for one slave/port.

struct mlx4_ib_demux_pv_qp contains the info need to run an individual
QP (either tunnel qp or real SQP).

Note:  We made use of the 2 most significant bits in enum
mlx4_ib_qp_flags (based on enum ib_qp_create_flags in ib_verbs.h).
We need these bits in the low-level driver for internal purposes.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:30 -07:00
Jack Morgenstein 73aaa7418f IB/core: Add ib_find_exact_cached_pkey()
When P_Key tables potentially contain both full and partial membership
copies for the same P_Key, we need a function to find the index for an
exact (16-bit) P_Key.

This is necessary when the master forwards QP1 MADs sent by guests.
If the guest has sent the MAD with a limited membership P_Key, we need
to to forward the MAD using the same limited membership P_Key.  Since
the master may have both the limited and the full member P_Keys in its
table, we must make sure to retrieve the limited membership P_Key in
this case.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:30 -07:00
Jack Morgenstein ff7166c447 IB/core: Handle table with full and partial membership for the same P_Key
Extend the cached and non-cached P_Key table lookups to handle limited
and full membership of the same P_Key to co-exist in the P_Key table.

This is necessary for SR-IOV, to allow for some guests would to have
the full membership P_Key in their virtual P_Key table, while other
guests on the same physical HCA would have the limited one.
To support this, we need both the limited and full membership P_Keys
to be present in the master's (hypervisor physical port) P_Key table.

The algorithm for handling P_Key tables which contain both the limited
and the full membership versions of the same P_Key works as follows:

When scanning the P_Key table for a 15-bit P_Key:

A. If there is a full member version of that P_Key anywhere in the
    table, return its index (even if a limited-member version of the
    P_Key exists earlier in the table).

B. If the full member version is not in the table, but the
   limited-member version is in the table, return the index of the
   limited P_Key.

Signed-off-by: Liran Liss <liranl@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:29 -07:00
Dotan Barak 46db567deb IB/mlx4: Fill in sq_sig_type in query QP
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:33:08 -07:00
Patrick McHardy bea1e22df4 IPoIB: Fix use-after-free of multicast object
Fix a crash in ipoib_mcast_join_task().  (with help from Or Gerlitz)

Commit c8c2afe360 ("IPoIB: Use rtnl lock/unlock when changing device
flags") added a call to rtnl_lock() in ipoib_mcast_join_task(), which
is run from the ipoib_workqueue, and hence the workqueue can't be
flushed from the context of ipoib_stop().

In the current code, ipoib_stop() (which doesn't flush the workqueue)
calls ipoib_mcast_dev_flush(), which goes and deletes all the
multicast entries.  This takes place without any synchronization with
a possible running instance of ipoib_mcast_join_task() for the same
ipoib device, leading to a crash due to NULL pointer dereference.

Fix this by making sure that the workqueue is flushed before
ipoib_mcast_dev_flush() is called.  To make that possible, we move the
RTNL-lock wrapped code to ipoib_mcast_join_finish().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:32:33 -07:00
Emil Goode c079c28714 RDMA/cxgb4: Fix error handling in create_qp()
The variable ret is assigned return values in a couple of places, but
its value is never returned.  This patch makes use of the ret variable
so that the caller get correct error codes returned.

The following changes are also introduced:

- The alloc_oc_sq function can return -ENOSYS or -ENOMEM so we want to
  get the return value from it.

- Change the label names to improve readability.

Signed-off-by: Emil Goode <emilgoode@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:32:14 -07:00
Dotan Barak 2a22fb8c69 RDMA/cma: Use consistent component mask for IPoIB port space multicast joins
CMA multicast joins for the IPoIB port space need to use the same
component mask used by the ipoib driver.  Otherwise, it's possible for
the CMA to create a group to which a join made by ipoib will fail, or
vise-versa.  Some of the component mask fields set by ipoib weren't
set by the CMA, fix that.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:31:47 -07:00
Dotan Barak d8c9166961 IB/core: Remove unused variables in ucm/ucma
Remove unused wait objects from ucm/ucma events flow.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-30 20:31:46 -07:00
David S. Miller 6a06e5e1bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/team/team.c
	drivers/net/usb/qmi_wwan.c
	net/batman-adv/bat_iv_ogm.c
	net/ipv4/fib_frontend.c
	net/ipv4/route.c
	net/l2tp/l2tp_netlink.c

The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.

qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

With help from Antonio Quartulli.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28 14:40:49 -04:00
Al Viro 2903ff019b switch simple cases of fget_light to fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 22:20:08 -04:00
Al Viro 88b428d6e1 switch infinibarf users of fget() to fget_light()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:10 -04:00
Paul E. McKenney 593d1006cd Merge remote-tracking branch 'tip/core/rcu' into next.2012.09.25b
Resolved conflict in kernel/sched/core.c using Peter Zijlstra's
approach from https://lkml.org/lkml/2012/9/5/585.
2012-09-25 10:03:56 -07:00
Paul E. McKenney 5217192b85 Merge remote-tracking branch 'tip/smp/hotplug' into next.2012.09.25b
The conflicts between kernel/rcutree.h and kernel/rcutree_plugin.h
were due to adjacent insertions and deletions, which were resolved
by simply accepting the changes on both branches.
2012-09-25 10:01:45 -07:00
Eric W. Biederman d03ca5820d userns: Convert ipathfs to use GLOBAL_ROOT_UID and GLOBAL_ROOT_GID
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-21 03:13:19 -07:00
Or Gerlitz 9baa0b0364 IB/ipoib: Add rtnl_link_ops support
Add rtnl_link_ops to IPoIB, with the first usage being child device
create/delete through them. Childs devices are now either legacy ones,
created/deleted through the ipoib sysfs entries, or RTNL ones.

Adding support for RTNL childs involved refactoring of ipoib_vlan_add
which is now used by both the sysfs and the link_ops code.

Also, added ndo_uninit entry to support calling unregister_netdevice_queue
from the rtnl dellink entry. This required removal of calls to
ipoib_dev_cleanup from the driver in flows which use unregister_netdevice,
since the networking core will invoke ipoib_uninit which does exactly that.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-20 16:49:17 -04:00
Roland Dreier 9c58b7ddd7 target: Simplify fabric sense data length handling
Every fabric driver has to supply a se_tfo->set_fabric_sense_len()
method, just so iSCSI can return an offset of 2.  However, every fabric
driver is already allocating a sense buffer and passing it into the
target core, either via transport_init_se_cmd() or target_submit_cmd().

So instead of having iSCSI pass the start of its sense buffer into the
core and then later tell the core to skip the first 2 bytes, it seems
easier for iSCSI just to do the offset of 2 when it passes the sense
buffer into the core.  Then we can drop the se_tfo->set_fabric_sense_len()
everywhere, and just add a couple of lines of code to iSCSI to set the
sense data length to the beginning of the buffer right before it sends
it over the network.

(nab: Remove .set_fabric_sense_len usage from tcm_qla2xxx_npiv_ops +
      change transport_get_sense_buffer to follow v3.6-rc6 code w/o
      ->set_fabric_sense_len usage)

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2012-09-17 17:12:58 -07:00
Roland Dreier 2ed772b7b9 target: Remove unused target_core_fabric_ops.get_fabric_sense_len method
There are no callers of se_tfo->get_fabric_sense_len(), so we should
stop having every fabric driver implement it.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2012-09-17 16:15:47 -07:00
Benjamin Herrenschmidt eda485f06d Merge remote-tracking branch 'pci/pci/gavin-window-alignment' into next
Merge Gavin patches from the PCI tree as subsequent powerpc
patches are going to depend on them

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-17 16:07:43 +10:00
Roland Dreier 2116fe4e8b Merge branches 'cxgb4', 'ipoib', 'mlx4', 'ocrdma' and 'qib' into for-next 2012-09-14 10:42:52 -07:00
Mike Marciniszyn 4c3550057b IB/qib: Fix failure of compliance test C14-024#06_LocalPortNum
Commit 3236b2d469 ("IB/qib: MADs with misset M_Keys should return
failure") introduced a return code assignment that unfortunately
introduced an unconditional exit for the routine due to missing braces.

This patch adds the braces to correct the original patch.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-14 10:42:32 -07:00
Parav Pandit ae3bca90e9 RDMA/ocrdma: Fix CQE expansion of unsignaled WQE
Fix CQE expansion of unsignaled WQE -- don't expand the CQE when the
WQE index of the completed CQE matches with last pending WQE (tail) in
the queue.

Signed-off-by: Parav Pandit <parav.pandit@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-14 10:40:58 -07:00
Bjorn Helgaas 9a5d5bd848 Merge commit 'v3.6-rc5' into pci/gavin-window-alignment
* commit 'v3.6-rc5': (1098 commits)
  Linux 3.6-rc5
  HID: tpkbd: work even if the new Lenovo Keyboard driver is not configured
  Remove user-triggerable BUG from mpol_to_str
  xen/pciback: Fix proper FLR steps.
  uml: fix compile error in deliver_alarm()
  dj: memory scribble in logi_dj
  Fix order of arguments to compat_put_time[spec|val]
  xen: Use correct masking in xen_swiotlb_alloc_coherent.
  xen: fix logical error in tlb flushing
  xen/p2m: Fix one-off error in checking the P2M tree directory.
  powerpc: Don't use __put_user() in patch_instruction
  powerpc: Make sure IPI handlers see data written by IPI senders
  powerpc: Restore correct DSCR in context switch
  powerpc: Fix DSCR inheritance in copy_thread()
  powerpc: Keep thread.dscr and thread.dscr_inherit in sync
  powerpc: Update DSCR on all CPUs when writing sysfs dscr_default
  powerpc/powernv: Always go into nap mode when CPU is offline
  powerpc: Give hypervisor decrementer interrupts their own handler
  powerpc/vphn: Fix arch_update_cpu_topology() return value
  ARM: gemini: fix the gemini build
  ...

Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
	drivers/rapidio/devices/tsi721.c
2012-09-13 15:54:57 -06:00
Bjorn Helgaas 78890b5989 Merge commit 'v3.6-rc5' into next
* commit 'v3.6-rc5': (1098 commits)
  Linux 3.6-rc5
  HID: tpkbd: work even if the new Lenovo Keyboard driver is not configured
  Remove user-triggerable BUG from mpol_to_str
  xen/pciback: Fix proper FLR steps.
  uml: fix compile error in deliver_alarm()
  dj: memory scribble in logi_dj
  Fix order of arguments to compat_put_time[spec|val]
  xen: Use correct masking in xen_swiotlb_alloc_coherent.
  xen: fix logical error in tlb flushing
  xen/p2m: Fix one-off error in checking the P2M tree directory.
  powerpc: Don't use __put_user() in patch_instruction
  powerpc: Make sure IPI handlers see data written by IPI senders
  powerpc: Restore correct DSCR in context switch
  powerpc: Fix DSCR inheritance in copy_thread()
  powerpc: Keep thread.dscr and thread.dscr_inherit in sync
  powerpc: Update DSCR on all CPUs when writing sysfs dscr_default
  powerpc/powernv: Always go into nap mode when CPU is offline
  powerpc: Give hypervisor decrementer interrupts their own handler
  powerpc/vphn: Fix arch_update_cpu_topology() return value
  ARM: gemini: fix the gemini build
  ...

Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
	drivers/rapidio/devices/tsi721.c
2012-09-13 08:41:01 -06:00
Bjorn Helgaas 1959ec5f82 Merge branch 'pci/stephen-const' into next
* pci/stephen-const:
  make drivers with pci error handlers const
  scsi: make pci error handlers const
  netdev: make pci_error_handlers const
  PCI: Make pci_error_handlers const
2012-09-12 13:54:10 -06:00
Shlomo Pongratz b5120a6e11 IPoIB: Fix AB-BA deadlock when deleting neighbours
Lockdep points out a circular locking dependency betwwen the ipoib
device priv spinlock (priv->lock) and the neighbour table rwlock
(ntbl->rwlock).

In the normal path, ie neigbour garbage collection task, the neigh
table rwlock is taken first and then if the neighbour needs to be
deleted, priv->lock is taken.

However in some error paths, such as in ipoib_cm_handle_tx_wc(),
priv->lock is taken first and then ipoib_neigh_free routine is called
which in turn takes the neighbour table ntbl->rwlock.

The solution is to get rid the neigh table rwlock completely and use
only priv->lock.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-12 09:21:45 -07:00
Shlomo Pongratz 66172c0993 IPoIB: Fix memory leak in the neigh table deletion flow
If the neighbours hash table is empty when unloading the module, then
ipoib_flush_neighs(), the cleanup routine, isn't called and the
memory used for the hash table itself leaked.

To fix this, ipoib_flush_neighs() is allways called, and another
completion object is added to signal when the table is freed.

Once invoked, ipoib_flush_neighs() flushes all the neighbours (if
there are any), calls the the hash table RCU free routine, which now
signals completion of the deletion process, and waits for the last
neighbour to be freed.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-12 09:05:03 -07:00
Pablo Neira Ayuso 9f00d9776b netlink: hide struct module parameter in netlink_kernel_create
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).

Suggested by David S. Miller.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-08 18:46:30 -04:00
Wei Yongjun 92dd6c3d4d RDMA/cxgb4: Move dereference below NULL test
spatch with a semantic match is used to found this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-09-07 16:19:03 -07:00
Stephen Hemminger 1d3520357d make drivers with pci error handlers const
Covers the rest of the uses of pci error handler.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-09-07 16:35:00 -06:00
Benjamin Herrenschmidt fff34b3412 Merge branch 'merge' into next
Brings in various bug fixes from 3.6-rcX
2012-09-07 09:48:59 +10:00
Vipul Pandya e5619c120d RDMA/cxgb4: Update RDMA/cxgb4 due to macro definition removal in cxgb4 driver
cxgb4 driver has duplicate definitions of registers which will be removed. This
patch updates the RDMA/cxgb4 driver accordingly.

Signed-off-by: Santosh Rastapur <santosh@chelsio.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Reviewed-by: Sivakumar Subramani <sivasu@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05 17:41:40 -04:00
Michael Ellerman b4b8d1e48e IB/ehca: Remove uses of virt_to_abs() and abs_to_virt()
abs_to_virt() simply calls __va() and we'd like to get rid of it,
so replace all abs_to_virt() uses with __va().

__va() returns void *, so when assigning to a pointer there's no need
to cast.

Similarly virt_to_abs() just calls __pa().

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-05 15:18:45 +10:00
Michael Ellerman aa97c32c1e IB/ehca: Don't use phys_to_abs(), it's a nop
Since commit f5339277 phys_to_abs() is 100% a nop, so just remove it.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-05 15:18:43 +10:00
Jiang Liu 0921caf326 IB/qib: Use PCI Express Capability accessors
Use PCI Express Capability access functions to simplify qib driver.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
2012-08-23 10:11:15 -06:00
Jiang Liu 3c55569b08 IB/mthca: Use PCI Express Capability accessors
Use PCI Express Capability access functions to simplify mthca driver.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Roland Dreier <roland@purestorage.com>
2012-08-23 10:11:15 -06:00
Tejun Heo 136b5721d7 workqueue: deprecate __cancel_delayed_work()
Now that cancel_delayed_work() can be safely called from IRQ handlers,
there's no reason to use __cancel_delayed_work().  Use
cancel_delayed_work() instead of __cancel_delayed_work() and mark the
latter deprecated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Roland Dreier <roland@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo e7c2f96744 workqueue: use mod_delayed_work() instead of __cancel + queue
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().

Most conversions are straight-forward except for the following.

* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
  elaborate dancing around its delayed_work.  Collapse it such that
  linkwatch_work is queued for immediate execution if LW_URGENT and
  existing timer is kept otherwise.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Paul E. McKenney bff4a39479 infiniband: ehca: Fix compiler warnings
Fix comp_task() to return void to match smp_hotplug_thread's thread_fn
member, and adjust indirection on the ->cpu_comp_threads per-CPU
member of the ehca_comp_pool structure.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
2012-08-20 08:11:04 -07:00
Paul E. McKenney c0e12e51b0 infiniband: ehca: Fix while->do-while conversion typo
This commit just adds a needed semicolon.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
2012-08-20 08:00:23 -07:00
Roland Dreier c0369b296e Merge branches 'cma', 'ipoib', 'misc', 'mlx4', 'ocrdma', 'qib' and 'srp' into for-next 2012-08-16 09:38:39 -07:00
Kleber Sacilotto de Souza a0675a386a IB/mlx4: Check iboe netdev pointer before dereferencing it
Unlike other parts of the mlx4_ib code, the function build_mlx_header()
doesn't check if the iboe netdev of the given port is valid before
dereferencing it, which can cause a crash if the ethernet interface
has already been taken down.

Fix this by checking for a valid netdev pointer before using it to get
the port MAC address.

Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-16 09:38:19 -07:00
Bart Van Assche 220329916c IB/srp: Fix a race condition
Avoid a crash caused by the scmnd->scsi_done(scmnd) call in
srp_process_rsp() being invoked with scsi_done == NULL.  This can
happen if a reply is received during or after a command abort.

Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
Reference: http://marc.info/?l=linux-rdma&m=134314367801595
Cc: <stable@vger.kernel.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-15 12:00:48 -07:00
Julia Lawall 51fa3ca37e IB/qib: Fix error return code in qib_init_7322_variables()
Convert a 0 error return code to a negative one, as returned elsewhere
in the function.

A simplified version of the semantic match that finds this problem is
as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret;
expression e,e1,e2,e3,e4,x;
@@

(
if (\(ret != 0\|ret < 0\) || ...) { ... return ...; }
|
ret = 0
)
... when != ret = e1
*x = \(kmalloc\|kzalloc\|kcalloc\|devm_kzalloc\|ioremap\|ioremap_nocache\|devm_ioremap\|devm_ioremap_nocache\)(...);
... when != x = e2
    when != ret = e3
*if (x == NULL || ...)
{
  ... when != ret = e4
*  return ret;
}
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-15 11:58:21 -07:00
Masanari Iida 142ad5db2b IB: Fix typos in infiniband drivers
Correct spelling typos in comments in drivers/infiniband.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-15 11:56:19 -07:00
Shlomo Pongratz 6c723a68c6 IB/ipoib: Fix RCU pointer dereference of wrong object
Commit b63b70d877 ("IPoIB: Use a private hash table for path lookup
in xmit path") introduced a bug where in ipoib_neigh_free() (which is
called from a few errors flows in the driver), rcu_dereference() is
invoked with the wrong pointer object, which results in a crash.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-14 15:21:44 -07:00
Shlomo Pongratz fa16ebed31 IB/ipoib: Add missing locking when CM object is deleted
Commit b63b70d877 ("IPoIB: Use a private hash table for path lookup
in xmit path") introduced a bug where in ipoib_cm_destroy_tx() a CM
object is moved between lists without any supported locking.  Under a
stress test, this eventually leads to list corruption and a crash.

Previously when this routine was called, callers were taking the
device priv lock.  Currently this function is called from the RCU
callback associated with neighbour deletion.  Fix the race by taking
the same lock we used to before.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-14 15:21:44 -07:00
Tejun Heo 41f63c5359 workqueue: use mod_delayed_work() instead of cancel + queue
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().

Most conversions are straight-forward.  Ones worth mentioning are,

* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
  use mod_delayed_work() and cancel loop in
  edac_mc_reset_delay_period() is dropped.

* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
  watchdog is active or not.  @fan_watchdog_active and related code
  dropped.

* drivers/power/charger-manager.c: Seemingly a lot of
  delayed_work_pending() abuse going on here.
  [delayed_]work_pending() are unsynchronized and racy when used like
  this.  I converted one instance in fullbatt_handler().  Please
  conver the rest so that it invokes workqueue APIs for the intended
  target state rather than trying to game work item pending state
  transitions.  e.g. if timer should be modified - call
  mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
  simplified.  Note that round_jiffies() calls in this function are
  meaningless.  round_jiffies() work on absolute jiffies not delta
  delay used by delayed_work.

v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work().  They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock.  __cancel_delayed_work() users are
    dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Dreier <roland@kernel.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
2012-08-13 16:27:37 -07:00
Tatyana Nikolova 418edaaba9 RDMA/ucma.c: Fix for events with wrong context on iWARP
It is possible for asynchronous RDMA_CM_EVENT_ESTABLISHED events to be
generated with ctx->uid == 0, because ucma_set_event_context() copies
ctx->uid to the event structure outside of ctx->file->mut.  This leads
to a crash in the userspace library, since it gets a bogus event.

Fix this by taking the mutex a bit earlier in ucma_event_handler.

Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Sean Hefty <Sean.Hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-13 13:08:35 -07:00
Thomas Gleixner 81942621bd infiniband: Ehca: Use hotplug thread infrastructure
Get rid of the hotplug notifiers and use the generic hotplug thread
infrastructure.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.775527032@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:08 +02:00
Roland Dreier d549f55f2e RDMA/ocrdma: Don't call vlan_dev_real_dev() for non-VLAN netdevs
If CONFIG_VLAN_8021Q is not set, then vlan_dev_real_dev() just goes BUG(),
so we shouldn't call it unless we're actually dealing with a VLAN netdev.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-10 16:52:13 -07:00
Jack Morgenstein df7fba6647 IB/mlx4: Fix possible deadlock on sm_lock spinlock
The sm_lock spinlock is taken in the process context by
mlx4_ib_modify_device, and in the interrupt context by update_sm_ah,
so we need to take that spinlock with irqsave, and release it with
irqrestore.

Lockdeps reports this as follows:

    [ INFO: inconsistent lock state ]
    3.5.0+ #20 Not tainted
    inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
    swapper/0/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
    (&(&ibdev->sm_lock)->rlock){?.+...}, at: [<ffffffffa028af1d>] update_sm_ah+0xad/0x100 [mlx4_ib]
    {HARDIRQ-ON-W} state was registered at:
      [<ffffffff810b84a0>] mark_irqflags+0x120/0x190
      [<ffffffff810b9ce7>] __lock_acquire+0x307/0x4c0
      [<ffffffff810b9f51>] lock_acquire+0xb1/0x150
      [<ffffffff815523b1>] _raw_spin_lock+0x41/0x50
      [<ffffffffa028d563>] mlx4_ib_modify_device+0x63/0x240 [mlx4_ib]
      [<ffffffffa026d1fc>] ib_modify_device+0x1c/0x20 [ib_core]
      [<ffffffffa026c353>] set_node_desc+0x83/0xc0 [ib_core]
      [<ffffffff8136a150>] dev_attr_store+0x20/0x30
      [<ffffffff81201fd6>] sysfs_write_file+0xe6/0x170
      [<ffffffff8118da38>] vfs_write+0xc8/0x190
      [<ffffffff8118dc01>] sys_write+0x51/0x90
      [<ffffffff8155b869>] system_call_fastpath+0x16/0x1b

    ...
    *** DEADLOCK ***

    1 lock held by swapper/0/0:

    stack backtrace:
    Pid: 0, comm: swapper/0 Not tainted 3.5.0+ #20
    Call Trace:
    <IRQ>  [<ffffffff810b7bea>] print_usage_bug+0x18a/0x190
    [<ffffffff810b7370>] ? print_irq_inversion_bug+0x210/0x210
    [<ffffffff810b7fb2>] mark_lock_irq+0xf2/0x280
    [<ffffffff810b8290>] mark_lock+0x150/0x240
    [<ffffffff810b84ef>] mark_irqflags+0x16f/0x190
    [<ffffffff810b9ce7>] __lock_acquire+0x307/0x4c0
    [<ffffffffa028af1d>] ? update_sm_ah+0xad/0x100 [mlx4_ib]
    [<ffffffff810b9f51>] lock_acquire+0xb1/0x150
    [<ffffffffa028af1d>] ? update_sm_ah+0xad/0x100 [mlx4_ib]
    [<ffffffff815523b1>] _raw_spin_lock+0x41/0x50
    [<ffffffffa028af1d>] ? update_sm_ah+0xad/0x100 [mlx4_ib]
    [<ffffffffa026b2fa>] ? ib_create_ah+0x1a/0x40 [ib_core]
    [<ffffffffa028af1d>] update_sm_ah+0xad/0x100 [mlx4_ib]
    [<ffffffff810c27c3>] ? is_module_address+0x23/0x30
    [<ffffffffa028b05b>] handle_port_mgmt_change_event+0xeb/0x150 [mlx4_ib]
    [<ffffffffa028c177>] mlx4_ib_event+0x117/0x160 [mlx4_ib]
    [<ffffffff81552501>] ? _raw_spin_lock_irqsave+0x61/0x70
    [<ffffffffa022718c>] mlx4_dispatch_event+0x6c/0x90 [mlx4_core]
    [<ffffffffa0221b40>] mlx4_eq_int+0x500/0x950 [mlx4_core]

Reported by: Or Gerlitz <ogerlitz@mellanox.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-08-10 13:02:24 -07:00
Roland Dreier 1da9b6b43e Merge branches 'cma', 'ipoib', 'ocrdma' and 'qib' into for-next 2012-07-30 07:47:27 -07:00
Shlomo Pongratz b63b70d877 IPoIB: Use a private hash table for path lookup in xmit path
Dave Miller <davem@davemloft.net> provided a detailed description of
why the way IPoIB is using neighbours for its own ipoib_neigh struct
is buggy:

    Any time an ipoib_neigh is changed, a sequence like the following is made:

    			spin_lock_irqsave(&priv->lock, flags);
    			/*
    			 * It's safe to call ipoib_put_ah() inside
    			 * priv->lock here, because we know that
    			 * path->ah will always hold one more reference,
    			 * so ipoib_put_ah() will never do more than
    			 * decrement the ref count.
    			 */
    			if (neigh->ah)
    				ipoib_put_ah(neigh->ah);
    			list_del(&neigh->list);
    			ipoib_neigh_free(dev, neigh);
    			spin_unlock_irqrestore(&priv->lock, flags);
    			ipoib_path_lookup(skb, n, dev);

    This doesn't work, because you're leaving a stale pointer to the freed up
    ipoib_neigh in the special neigh->ha pointer cookie.  Yes, it even fails
    with all the locking done to protect _changes_ to *ipoib_neigh(n), and
    with the code in ipoib_neigh_free() that NULLs out the pointer.

    The core issue is that read side calls to *to_ipoib_neigh(n) are not
    being synchronized at all, they are performed without any locking.  So
    whether we hold the lock or not when making changes to *ipoib_neigh(n)
    you still can have threads see references to freed up ipoib_neigh
    objects.

    	cpu 1			cpu 2
    	n = *ipoib_neigh()
    				*ipoib_neigh() = NULL
    				kfree(n)
    	n->foo == OOPS

    [..]

    Perhaps the ipoib code can have a private path database it manages
    entirely itself, which holds all the necessary information and is
    looked up by some generic key which is available easily at transmit
    time and does not involve generic neighbour entries.

See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and
<http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b>
for the full discussion.

This patch aims to solve the race conditions found in the IPoIB driver.

The patch removes the connection between the core networking neighbour
structure and the ipoib_neigh structure.  In addition to avoiding the
race described above, it allows us to handle SKBs carrying IP packets
that don't have any associated neighbour.

We add an ipoib_neigh hash table with N buckets where the key is the
destination hardware address.  The ipoib_neigh is fetched from the
hash table and instead of the stashed location in the neighbour
structure. The hash table uses both RCU and reference counting to
guarantee that no ipoib_neigh instance is ever deleted while in use.

Fetching the ipoib_neigh structure instance from the hash also makes
the special code in ipoib_start_xmit that handles remote and local
bonding failover redundant.

Aged ipoib_neigh instances are deleted by a garbage collection task
that runs every M seconds and deletes every ipoib_neigh instance that
was idle for at least 2*M seconds. The deletion is safe since the
ipoib_neigh instances are protected using RCU and reference count
mechanisms.

The number of buckets (N) and frequency of running the GC thread (M),
are taken from the exported arb_tbl.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-30 07:46:50 -07:00
Mike Marciniszyn 5d7fe4efbf IB/qib: Fix size of cc_supported_table_entries
Commit 36a8f01cd2 ("IB/qib: Add congestion control agent
implementation") tries to store the value 1984 in a u8, which leads to
truncation.  Fix this by making the member big enough.

This bug was detected by a smatch warning.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ramkrishna Vepa <ramkrishna.vepa@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-29 20:26:10 -07:00
Roland Dreier 0764c76ecb RDMA/ucma: Convert open-coded equivalent to memdup_user()
Suggested by scripts/coccinelle/api/memdup_user.cocci.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-27 13:27:45 -07:00
Roland Dreier 9e8fa040cb RDMA/ocrdma: Fix check of GSI CQs
It looks like one check was accidentally duplicated, and the other 3
checks were left out.  This was detected by scripts/coccinelle/tests/doubletest.cocci:

    drivers/infiniband/hw/ocrdma/ocrdma_verbs.c:895:6-54: duplicated argument to && or ||

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-27 13:14:44 -07:00
Fengguang Wu 4e28904528 RDMA/cma: Use PTR_RET rather than if (IS_ERR(...)) + PTR_ERR
Suggested by scripts/coccinelle/api/ptr_ret.cocci.

Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-27 13:05:18 -07:00
Linus Torvalds 5dedb9f3bd InfiniBand/RDMA changes for the 3.6 merge window:
- Updates to the qib low-level driver
  - First chunk of changes for SR-IOV support for mlx4 IB
  - RDMA CM support for IPv6-only binding
  - Other misc cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABCAAGBQJQDXhUAAoJEENa44ZhAt0hZxwQAJydS9f9me31AGa45SAq8rA8
 6e3LgnQ6jS38he7LZZdSsT7g25jROCKOcTj6VUkNSVAVSkK8zcp7wngjDxw50IK6
 FF9hNeljMkWBflOhxzB34DRGCW4b4J+Yt1o7v1RtawiG/Mhri/6imKL0Aqjt4GwX
 a1MPZn+xI2osLujfdHJtATPWWB9jCaXdFe4DJUNPdqJhS6TN7s8OP3XMiqoJtdaV
 ptHeSSbjdUR1mg/h2LU2FVmWHXNSBxn7MEsrBBRkQVyiEkXieFBLwPTp4DqmAUJf
 xugm6Hf4sbiQ+QuU0baJODt56wveuYVQ4HKUzE0urUFXyU4TUB9blrehWZnKsRte
 wo/w4nvVlnXgGhHH0Igq76RDX8aCwc/6uQJ/29oChWjrei3HE0LjmIlPAu0vAhyw
 ViLe02/r2gKXQv1NxIqhPmGsJTZizg2mUk2eEJHHPQb/NGL7iNo6b+141AfURoqQ
 goGmlGdzffCRpmo6FXFZ57RPVRS4gwMunCY/Pmvq5a2t4oZh8899l2+V3N7bCfvH
 +JdavxAjia9U4IlgPsAqVaz8z8TyHOY/lEd75Wnw8q9www3kLx7hASvHsdwrS4LL
 ihzECzsaXOSoIXQzghs+6iyp1pzmzE9ve8OqbIGJStzlrOLyn7Gjo+Ixwm0SLsTO
 I7h9iJ1OMKFZ8bzmHsWb
 =f09j
 -----END PGP SIGNATURE-----

Merge tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband

Pull InfiniBand/RDMA changes from Roland Dreier:
 - Updates to the qib low-level driver
 - First chunk of changes for SR-IOV support for mlx4 IB
 - RDMA CM support for IPv6-only binding
 - Other misc cleanups and fixes

Fix up some add-add conflicts in include/linux/mlx4/device.h and
drivers/net/ethernet/mellanox/mlx4/main.c

* tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (30 commits)
  IB/qib: checkpatch fixes
  IB/qib: Add congestion control agent implementation
  IB/qib: Reduce sdma_lock contention
  IB/qib: Fix an incorrect log message
  IB/qib: Fix QP RCU sparse warnings
  mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct and paravirtualize them
  mlx4_core: Allow guests to have IB ports
  mlx4_core: Implement mechanism for reserved Q_Keys
  net/mlx4_core: Free ICM table in case of error
  IB/cm: Destroy idr as part of the module init error flow
  mlx4_core: Remove double function declarations
  IB/mlx4: Fill the masked_atomic_cap attribute in query device
  IB/mthca: Fill in sq_sig_type in query QP
  IB/mthca: Warning about event for non-existent QPs should show event type
  IB/qib: Fix sparse RCU warnings in qib_keys.c
  net/mlx4_core: Initialize IB port capabilities for all slaves
  mlx4: Use port management change event instead of smp_snoop
  IB/qib: RCU locking for MR validation
  IB/qib: Avoid returning EBUSY from MR deregister
  IB/qib: Fix UC MR refs for immediate operations
  ...
2012-07-24 13:56:26 -07:00
Linus Torvalds 3c4cfadef6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David S Miller:

 1) Remove the ipv4 routing cache.  Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache.  Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making.  Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought the be simple to fix at this
    point.  Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

 2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

 3) TCP SYN/ACK performance tweaks from Eric Dumazet.

 4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

 5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

 6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

 7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

 8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

 9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer.  Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data.

14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket.  The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too.  From Eric
    Dumazet.

15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
  genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
  r8169: revert "add byte queue limit support".
  ipv4: Change rt->rt_iif encoding.
  net: Make skb->skb_iif always track skb->dev
  ipv4: Prepare for change of rt->rt_iif encoding.
  ipv4: Remove all RTCF_DIRECTSRC handliing.
  ipv4: Really ignore ICMP address requests/replies.
  decnet: Don't set RTCF_DIRECTSRC.
  net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
  ipv4: Remove redundant assignment
  rds: set correct msg_namelen
  openvswitch: potential NULL deref in sample()
  tcp: dont drop MTU reduction indications
  bnx2x: Add new 57840 device IDs
  tcp: avoid oops in tcp_metrics and reset tcpm_stamp
  niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
  niu: Fix to check for dma mapping errors.
  net: Fix references to out-of-scope variables in put_cmsg_compat()
  net: ethernet: davinci_emac: add pm_runtime support
  net: ethernet: davinci_emac: Remove unnecessary #include
  ...
2012-07-24 10:01:50 -07:00
Roland Dreier 089117e1ad Merge branches 'cma', 'cxgb4', 'misc', 'mlx4-sriov', 'mlx-cleanups', 'ocrdma' and 'qib' into for-linus 2012-07-22 23:26:17 -07:00
Linus Torvalds cb47c1831f Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull target updates from Nicholas Bellinger:
 "There have been lots of work in a number of areas this past round.
  The highlights include:

   - Break out target_core_cdb.c emulation into SPC/SBC ops (hch)
   - Add a parse_cdb method to target backend drivers (hch)
   - Move sync_cache + write_same + unmap into spc_ops (hch)
   - Use target_execute_cmd for WRITEs in iscsi_target + srpt (hch)
   - Offload WRITE I/O backend submission in tcm_qla2xxx + tcm_fc (hch +
     nab)
   - Refactor core_update_device_list_for_node() into enable/disable
     funcs (agrover)
   - Replace the TCM processing thread with a TMR work queue (hch)
   - Fix regression in transport_add_device_to_core_hba from TMR
     conversion (DanC)
   - Remove racy, now-redundant check of sess_tearing_down with qla2xxx
     (roland)
   - Add range checking, fix reading of data len + possible underflow in
     UNMAP (roland)
   - Allow for target_submit_cmd() returning errors + convert fabrics
     (roland + nab)
   - Drop bogus struct file usage for iSCSI/SCTP (viro)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (54 commits)
  iscsi-target: Drop bogus struct file usage for iSCSI/SCTP
  target: NULL dereference on error path
  target: Allow for target_submit_cmd() returning errors
  target: Check number of unmap descriptors against our limit
  target: Fix possible integer underflow in UNMAP emulation
  target: Fix reading of data length fields for UNMAP commands
  target: Add range checking to UNMAP emulation
  target: Add generation of LOGICAL BLOCK ADDRESS OUT OF RANGE
  target: Make unnecessarily global se_dev_align_max_sectors() static
  target: Remove se_session.sess_wait_list
  qla2xxx: Remove racy, now-redundant check of sess_tearing_down
  target: Check sess_tearing_down in target_get_sess_cmd()
  sbp-target: Consolidate duplicated error path code in sbp_handle_command()
  target: Un-export target_get_sess_cmd()
  qla2xxx: Get rid of redundant qla_tgt_sess.tearing_down
  target: Make core_disable_device_list_for_node use pre-refactoring lock ordering
  target: refactor core_update_device_list_for_node()
  target: Eliminate else using boolean logic
  target: Misc retval cleanups
  target: Remove hba param from core_dev_add_lun
  ...
2012-07-22 13:31:57 -07:00
Mike Marciniszyn 7fac33014f IB/qib: checkpatch fixes
Elminate some simple_strto* usage.

checkpatch also noted pr_ conversations, which have been done as
recommended.  The pr_fmt() define is used to shorten line length.

Other multi-line string warnings are also elmininated.

Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-19 11:20:04 -07:00
Mike Marciniszyn 36a8f01cd2 IB/qib: Add congestion control agent implementation
Add a congestion control agent in the driver that handles gets and
sets from the congestion control manager in the fabric for the
Performance Scale Messaging (PSM) library.

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-19 11:20:04 -07:00
Mike Marciniszyn 551ace124d IB/qib: Reduce sdma_lock contention
Profiling has shown that sdma_lock is proving a bottleneck for
performance. The situations include:
 - RDMA reads when krcvqs > 1
 - post sends from multiple threads

For RDMA read the current global qib_wq mechanism runs on all CPUs
and contends for the sdma_lock when multiple RMDA read requests are
fielded on differenct CPUs. For post sends, the direct call to
qib_do_send() from multiple threads causes the contention.

Since the sdma mechanism is per port, this fix converts the existing
workqueue to a per port single thread workqueue to reduce the lock
contention in the RDMA read case, and for any other case where the QP
is scheduled via the workqueue mechanism from more than 1 CPU.

For the post send case, This patch modifies the post send code to test
for a non empty sdma engine.  If the sdma is not idle the (now single
thread) workqueue will be used to trigger the send engine instead of
the direct call to qib_do_send().

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-19 11:19:58 -07:00
Betty Dall f3331f88a4 IB/qib: Fix an incorrect log message
There is a cut-and-paste typo in the function qib_pci_slot_reset()
where it prints that the "link_reset" function is called rather than
the "slot_reset" function.  This makes the message misleading.

Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-19 11:19:08 -07:00
Amir Vadai d9236c3f10 {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 08:34:37 -07:00
Mike Marciniszyn 1fb9fed6d4 IB/qib: Fix QP RCU sparse warnings
Commit af061a644a ("IB/qib: Use RCU for qpn lookup") introduced sparse
warnings.

This patch corrects those issues.

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-17 10:18:37 -07:00
David S. Miller 6700c2709c net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}()
This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more.  In the
future the routes will be without destination address, source address,
etc. keying.  One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information.  This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here.  Using that flow key will do a fib_lookup()
and create/update the persistent entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-17 03:29:28 -07:00
Christoph Hellwig e672a47fd9 srpt: use target_execute_cmd for WRITEs in srpt_handle_rdma_comp
srpt_handle_rdma_comp is called from kthread context and thus can execute
target_execute_cmd directly.  srpt_abort_cmd sets the CMD_T_LUN_STOP
flag directly, and thus the abuse of transport_generic_handle_data can be
replaced with an opencoded variant of that code path.  I'm still not happy
about a fabric driver poking into target core internals like this, but
let's defer the bigger architecture changes for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2012-07-16 17:35:18 -07:00
Jack Morgenstein 6634961c14 mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct and paravirtualize them
To allow easy paravirtualization of P_Key and GID table sizes, keep
paravirtualized sizes in mlx4_dev->caps, but save the actual physical
sizes from FW in struct: mlx4_dev->phys_cap.

In addition, in SR-IOV mode, do the following:

1. Reduce reported P_Key table size by 1.
   This is done to reserve the highest P_Key index for internal use,
   for declaring an invalid P_Key in P_Key paravirtualization.
   We require a P_Key index which always contain an invalid P_Key
   value for this purpose (i.e., one which cannot be modified by
   the subnet manager).  The way to do this is to reduce the
   P_Key table size reported to the subnet manager by 1, so that
   it will not attempt to access the P_Key at index #127.

2. Paravirtualize the GID table size to 1. Thus, each guest sees
   only a single GID (at its paravirtualized index 0).

In addition, since we are paravirtualizing the GID table size to 1, we
add paravirtualization of the master GID event here (i.e., we do not
do ib_dispatch_event() for the GUID change event on the master, since
its (only) GUID never changes).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-11 11:52:23 -07:00
Dotan Barak 87d4abda83 IB/cm: Destroy idr as part of the module init error flow
Clean the idr as part of the error flow since it is a resource too.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-11 09:22:58 -07:00
Dotan Barak 47e956b2a6 IB/mlx4: Fill the masked_atomic_cap attribute in query device
When the user queries for device capabilities, fill in the
masked_atomic_cap attribute with the real support level of atomic
capabilities instead of using a hard coded value.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-11 09:22:57 -07:00
Dotan Barak 16551d4501 IB/mthca: Fill in sq_sig_type in query QP
The query QP code was didn't fill that attribute, do that.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-11 09:22:57 -07:00
Dotan Barak 9bbeb6663e IB/mthca: Warning about event for non-existent QPs should show event type
Events received for non-existent QPs should generate a warning that includes
the event type that was received.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-11 09:22:57 -07:00
David S. Miller 04c9f416e3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/batman-adv/bridge_loop_avoidance.c
	net/batman-adv/bridge_loop_avoidance.h
	net/batman-adv/soft-interface.c
	net/mac80211/mlme.c

With merge help from Antonio Quartulli (batman-adv) and
Stephen Rothwell (drivers/net/usb/qmi_wwan.c).

The net/mac80211/mlme.c conflict seemed easy enough, accounting for a
conversion to some new tracing macros.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 23:56:33 -07:00
Eric Dumazet b28ba72665 IPoIB: fix skb truesize underestimatiom
Or Gerlitz reported triggering of WARN_ON_ONCE(delta < len); in
skb_try_coalesce()
This warning tracks drivers that incorrectly set skb->truesize

IPoIB indeed allocates a full page to store a fragment, but only
accounts in skb->truesize the used part of the page (frame length)

This patch fixes skb truesize underestimation, and
also fixes a performance issue, because RX skbs have not enough tailroom
to allow IP and TCP stacks to pull their header in skb linear part
without an expensive call to pskb_expand_head()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Erez Shitrit <erezsh@mellanox.com>
Cc: Shlomo Pongartz <shlomop@mellanox.com>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10 23:33:12 -07:00
Mike Marciniszyn 7e23017704 IB/qib: Fix sparse RCU warnings in qib_keys.c
Commit 8aac4cc3a9 ("IB/qib: RCU locking for MR validation") introduced
new sparse warnings in qib_keys.c.

Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-10 10:01:56 -07:00
Jack Morgenstein 00f5ce99dc mlx4: Use port management change event instead of smp_snoop
The port management change event can replace smp_snoop.  If the
capability bit for this event is set in dev-caps, the event is used
(by the driver setting the PORT_MNG_CHG_EVENT bit in the async event
mask in the MAP_EQ fw command).  In this case, when the driver passes
incoming SMP PORT_INFO SET mads to the FW, the FW generates port
management change events to signal any changes to the driver.

If the FW generates these events, smp_snoop shouldn't be invoked in
ib_process_mad(), or duplicate events will occur (once from the
FW-generated event, and once from smp_snoop).

In the case where the FW does not generate port management change
events smp_snoop needs to be invoked to create these events.  The flow
in smp_snoop has been modified to make use of the same procedures as
in the fw-generated-event event case to generate the port management
events (LID change, Client-rereg, Pkey change, and/or GID change).

Port management change event handling required changing the
mlx4_ib_event and mlx4_dispatch_event prototypes; the "param" argument
(last argument) had to be changed to unsigned long in order to
accomodate passing the EQE pointer.

We also needed to move the definition of struct mlx4_eqe from
net/mlx4.h to file device.h -- to make it available to the IB driver,
to handle port management change events.

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-10 09:47:10 -07:00
Mike Marciniszyn 8aac4cc3a9 IB/qib: RCU locking for MR validation
Profiling indicates that MR validation locking is expensive.  The MR
table is largely read-only and is a suitable candidate for RCU locking.

The patch uses RCU locking during validation to eliminate one
lock/unlock during that validation.

Reviewed-by: Mike Heinz <michael.william.heinz@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-08 18:05:19 -07:00
Mike Marciniszyn 6a82649f21 IB/qib: Avoid returning EBUSY from MR deregister
A timing issue can occur where qib_mr_dereg can return -EBUSY if the
MR use count is not zero.

This can occur if the MR is de-registered while RDMA read response
packets are being progressed from the SDMA ring.  The suspicion is
that the peer sent an RDMA read request, which has already been copied
across to the peer.  The peer sees the completion of his request and
then communicates to the responder that the MR is not needed any
longer.  The responder tries to de-register the MR, catching some
responses remaining in the SDMA ring holding the MR use count.

The code now uses a get/put paradigm to track MR use counts and
coordinates with the MR de-registration process using a completion
when the count has reached zero.  A timeout on the delay is in place
to catch other EBUSY issues.

The reference count protocol is as follows:
- The return to the user counts as 1
- A reference from the lk_table or the qib_ibdev counts as 1.
- Transient I/O operations increase/decrease as necessary

A lot of code duplication has been folded into the new routines
init_qib_mregion() and deinit_qib_mregion().  Additionally, explicit
initialization of fields to zero is now handled by kzalloc().

Also, duplicated code 'while.*num_sge' that decrements reference
counts have been consolidated in qib_put_ss().

Reviewed-by: Ramkrishna Vepa <ramkrishna.vepa@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-07-08 18:05:19 -07:00