I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert to the much saner new idr interface.
v2: Mike triggered WARN_ON() in idr_preload() because send_mad(),
which may be used from non-process context, was calling
idr_preload() unconditionally. Preload iff @gfp_mask has
__GFP_WAIT.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reported-by: "Marciniszyn, Mike" <mike.marciniszyn@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Problem reported by Dan Carpenter <dan.carpenter@oracle.com>:
The patch 3c86aa70bf67: "RDMA/cm: Add RDMA CM support for IBoE
devices" from Oct 13, 2010, leads to the following warning:
net/sunrpc/xprtrdma/svc_rdma_transport.c:722 svc_rdma_create()
error: passing non neg 1 to ERR_PTR
This bug would result in a NULL dereference. svc_rdma_create() is
supposed to return ERR_PTRs or valid pointers, but instead it returns
ERR_PTRs, valid pointers and 1.
The call tree is:
svc_rdma_create()
=> rdma_bind_addr()
=> cma_acquire_dev()
=> find_gid_port()
rdma_bind_addr() should return a valid errno. Fix this by having
find_gid_port() also return a valid errno. If we can't find the
specified GID on a given port, return -EADDRNOTAVAIL, rather than
-EAGAIN, to better indicate the error. We also drop using the
special return value of '1' and instead pass through the error
returned by the underlying verbs call. On such errors, rather
than aborting the search, we simply continue to check the next
device/port.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Pulled mainline in order to get the UAPI infrastructure already
merged before I pull in David Howells's UAPI trees for networking.
Signed-off-by: David S. Miller <davem@davemloft.net>
set netlink_dump_control.module to avoid panic.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The retry_count and rnr_retry_count connection parameters are both
3-bit values. Check that the values are in range and reduce if
they're not.
This fixes a problem reported by Doug Ledford <dledford@redhat.com>
that resulted in the userspace rping test (part of the librdmacm
samples) failing to run over Intel IB HCAs.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
[ Use min_t() to avoid warnings about type mismatch. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
CMA multicast joins for the IPoIB port space need to use the same
component mask used by the ipoib driver. Otherwise, it's possible for
the CMA to create a group to which a join made by ipoib will fail, or
vise-versa. Some of the component mask fields set by ipoib weren't
set by the CMA, fix that.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Provide an option for the user to specify that listens should only
accept connections where the incoming address family matches that of
the locally bound address. This is used to support the equivalent of
IPV6_V6ONLY socket option, which allows an app to only accept
connection requests directed to IPv6 addresses.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The rdma_cm maps IPv4 and IPv6 addresses to the same service ID. This
prevents apps from listening only for IPv4 or IPv6 addresses. It also
results in an app binding to an IPv4 address receiving connection
requests for an IPv6 address.
Change this to match socket behavior: restrict listens on IPv4
addresses to only IPv4 addresses, and if a listen is on an IPv6
address, allow it to receive either IPv4 or IPv6 addresses, based on
its address family binding.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The RDMA CM uses a single port space for all associated (tcp, udp,
etc.) port bindings, regardless of the address family that the user
binds to. The result is that if a user binds to AF_INET, but does not
specify an IP address, the bind will occur for AF_INET6. This causes
an attempt to bind to the same port using AF_INET6 to fail, and
connection requests to AF_INET6 will match with the AF_INET listener.
Align the behavior with sockets and restrict the bind to AF_INET only.
If a user binds to AF_INET6, we bind the port to AF_INET6 and
AF_INET depending on the value of bindv6only.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Change || check to the intended && when checking the QP type in a
received connection request against the listening endpoint.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Add ocrdma hardware driver for Emulex IB-over-Ethernet adapters
- Add generic and mlx4 support for "raw" QPs: allow suitably privileged
applications to send and receive arbitrary packets directly to/from
the hardware
- Add "doorbell drop" handling to the cxgb4 driver
- A fairly large batch of qib hardware driver changes
- A few fixes for lockdep-detected issues
- A few other miscellaneous fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCAAGBQJPumbZAAoJEENa44ZhAt0h8LgP/0fXe7Szm3n6P6UvMAVqkagM
4PpreH3mpWUFpzqeQE1JPDtgx700R6aPipbHqgIN+k61RWMpLjICGcNx7iwxn1I+
zqdquGygWgjceLz+BLVlk+iBmJt3vZ3fPRAXc7fdP+jhIarWkNIOy1pXWTUuRvED
jL8jIaxhCcgAVzm/zNyt6IPxkaHvCz7K9wqmpyU0dsO9OyPdGvWA9+CkGXwmOCPq
mxSVhWnfGsMkPBsL7EgTC5KP/ox2PKq6rFgysmVVS+rKCpP0L8BEVQyGX3Gf8KA8
yV+KdTi9ofDnFrv6R7Wz0v7HRUih8GRssakzBu7Y7HLfK1M/QwMG0GUAibXGZObc
vUXuQ3uRJ/cIzMPXqKeGYwpb5t+TmxyjhWu44OjCUQkNau91+9BSbA69S88KXc49
aTJiCZlhPuGf4uGMWJJuPLcE2xO2QCZj+8ckL2STYrIip6GWlCH02kJaQmRkuWH2
UhMOeJDBC4nvh4EQT/WwHpGzyhkavE2ayfo5YemxBJXo+P5Mdbf7WIDRQDLUEeQH
F8sPoccH4hDiAorN/SkTsm14jVTP7oWW1M40Ont59Nhbgm88MsVkvjoneHnfBvbD
HjK92soCWnYTAoREfj0G4xUxZgMdOZcezWrX0rx5LJ8Ju9y4zAi3cKGr7lg6hs4X
syKfN0VjiDRtJ+pxayi3
=yWfr
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA changes from Roland Dreier:
- Add ocrdma hardware driver for Emulex IB-over-Ethernet adapters
- Add generic and mlx4 support for "raw" QPs: allow suitably privileged
applications to send and receive arbitrary packets directly to/from
the hardware
- Add "doorbell drop" handling to the cxgb4 driver
- A fairly large batch of qib hardware driver changes
- A few fixes for lockdep-detected issues
- A few other miscellaneous fixes and cleanups
Fix up trivial conflict in drivers/net/ethernet/emulex/benet/be.h.
* tag 'rdma-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (53 commits)
RDMA/cxgb4: Include vmalloc.h for vmalloc and vfree
IB/mlx4: Fix mlx4_ib_add() error flow
IB/core: Fix IB_SA_COMP_MASK macro
IB/iser: Fix error flow in iser ep connection establishment
IB/mlx4: Increase the number of vectors (EQs) available for ULPs
RDMA/cxgb4: Add query_qp support
RDMA/cxgb4: Remove kfifo usage
RDMA/cxgb4: Use vmalloc() for debugfs QP dump
RDMA/cxgb4: DB Drop Recovery for RDMA and LLD queues
RDMA/cxgb4: Disable interrupts in c4iw_ev_dispatch()
RDMA/cxgb4: Add DB Overflow Avoidance
RDMA/cxgb4: Add debugfs RDMA memory stats
cxgb4: DB Drop Recovery for RDMA and LLD queues
cxgb4: Common platform specific changes for DB Drop Recovery
cxgb4: Detect DB FULL events and notify RDMA ULD
RDMA/cxgb4: Drop peer_abort when no endpoint found
RDMA/cxgb4: Always wake up waiters in c4iw_peer_abort_intr()
mlx4_core: Change bitmap allocator to work in round-robin fashion
RDMA/nes: Don't call event handler if pointer is NULL
RDMA/nes: Fix for the ORD value of the connecting peer
...
The following lockdep problem was reported by Or Gerlitz <ogerlitz@mellanox.com>:
[ INFO: possible recursive locking detected ]
3.3.0-32035-g1b2649e-dirty #4 Not tainted
---------------------------------------------
kworker/5:1/418 is trying to acquire lock:
(&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0138a41>] rdma_destroy_i d+0x33/0x1f0 [rdma_cm]
but task is already holding lock:
(&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0135130>] cma_disable_ca llback+0x24/0x45 [rdma_cm]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&id_priv->handler_mutex);
lock(&id_priv->handler_mutex);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by kworker/5:1/418:
#0: (ib_cm){.+.+.+}, at: [<ffffffff81042ac1>] process_one_work+0x210/0x4a 6
#1: ((&(&work->work)->work)){+.+.+.}, at: [<ffffffff81042ac1>] process_on e_work+0x210/0x4a6
#2: (&id_priv->handler_mutex){+.+.+.}, at: [<ffffffffa0135130>] cma_disab le_callback+0x24/0x45 [rdma_cm]
stack backtrace:
Pid: 418, comm: kworker/5:1 Not tainted 3.3.0-32035-g1b2649e-dirty #4
Call Trace:
[<ffffffff8102b0fb>] ? console_unlock+0x1f4/0x204
[<ffffffff81068771>] __lock_acquire+0x16b5/0x174e
[<ffffffff8106461f>] ? save_trace+0x3f/0xb3
[<ffffffff810688fa>] lock_acquire+0xf0/0x116
[<ffffffffa0138a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
[<ffffffff81364351>] mutex_lock_nested+0x64/0x2ce
[<ffffffffa0138a41>] ? rdma_destroy_id+0x33/0x1f0 [rdma_cm]
[<ffffffff81065a78>] ? trace_hardirqs_on_caller+0x11e/0x155
[<ffffffff81065abc>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa0138a41>] rdma_destroy_id+0x33/0x1f0 [rdma_cm]
[<ffffffffa0139c02>] cma_req_handler+0x418/0x644 [rdma_cm]
[<ffffffffa012ee88>] cm_process_work+0x32/0x119 [ib_cm]
[<ffffffffa0130299>] cm_req_handler+0x928/0x982 [ib_cm]
[<ffffffffa01302f3>] ? cm_req_handler+0x982/0x982 [ib_cm]
[<ffffffffa0130326>] cm_work_handler+0x33/0xfe5 [ib_cm]
[<ffffffff81065a78>] ? trace_hardirqs_on_caller+0x11e/0x155
[<ffffffffa01302f3>] ? cm_req_handler+0x982/0x982 [ib_cm]
[<ffffffff81042b6e>] process_one_work+0x2bd/0x4a6
[<ffffffff81042ac1>] ? process_one_work+0x210/0x4a6
[<ffffffff813669f3>] ? _raw_spin_unlock_irq+0x2b/0x40
[<ffffffff8104316e>] worker_thread+0x1d6/0x350
[<ffffffff81042f98>] ? rescuer_thread+0x241/0x241
[<ffffffff81046a32>] kthread+0x84/0x8c
[<ffffffff8136e854>] kernel_thread_helper+0x4/0x10
[<ffffffff81366d59>] ? retint_restore_args+0xe/0xe
[<ffffffff810469ae>] ? __init_kthread_worker+0x56/0x56
[<ffffffff8136e850>] ? gs_change+0xb/0xb
The actual locking is fine, since we're dealing with different locks,
but from the same lock class. cma_disable_callback() acquires the
listening id mutex, whereas rdma_destroy_id() acquires the mutex for
the new connection id. To fix this, delay the call to
rdma_destroy_id() until we've released the listening id mutex.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Both tagged traffic and untagged traffic use tc tool mapping.
Treat RDMA TOS same as IP TOS when mapping to SL
Signed-off-by: Amir Vadai <amirv@mellanox.com>
CC: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix endianness bugs reported by sparse in the RDMA core stack. Note
that these are real bugs, but don't affect any existing code to the
best of my knowledge. The mlid issue would only affect kernel users
of rdma_join_multicast which have the rdma_cm attach/detach its QP.
There are no current in tree users that do this. (rdma_join_multicast
may be used called by user space applications, which does not have
this issue.) And the pkey setting is simply returned as
informational.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
private_data_len is defined as a u8. If the user specifies a large
private_data size (> 220 bytes), we will calculate a total length that
exceeds 255, resulting in private_data_len wrapping back to 0. This
can lead to overwriting random kernel memory. Avoid this by verifying
that the resulting size fits into a u8.
Reported-by: B. Thery <benjamin.thery@bull.net>
Addresses: <http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2335>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
C assignment can handle struct in6_addr copying.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
They had been getting it implicitly via device.h but we can't
rely on that for the future, due to a pending cleanup so fix
it now.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Allow users to connect XRC QPs through the rdma_cm.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add RDMA_PS_IB. XRC QP types will use the IB port space when operating
over the RDMA CM. For the 'IP protocol' field value, we select 0x3F,
which is listed as being for 'any local network'.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update struct iw_cm_event to support propagating the ird/ord values
upwards to the application.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Check that conn_param is not null before dereferencing it when
processing rdma_accept(). This is necessary to prevent a possible
system crash, which can be caused by user space.
Problem found by code inspection.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The RDMA CM uses the local qp_type to determine how to process an
incoming request. This can result in an incoming REQ being treated as
a SIDR REQ and vice versa. Fix this by switching off the event type
instead, and for good measure verify that the listener supports the
incoming connection request.
This problem showed up when a user space application mismatched the QP
types between a client and server app.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch fixes a kernel crash in cma_set_qkey().
When the link layer is Ethernet, it is wrong to use IPoIB port space
since no IPoIB interface is available. Specifically, setting the
Q_Key when port space is RDMA_PS_IPOIB requires MGID calculation and
an SA query, which doesn't make sense over Ethernet.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Avoid assigning an IS_ERR value to the cm_id pointer. This fixes a
few anomalies in the error flow due to confusion about checking for
NULL vs IS_ERR, and eliminates the need to test for the IS_ERR value
every time we wish to determine if the cma_id object has a cm device
associated with it.
Also, eliminate the now-unnecessary procedure cma_has_cm_dev (we can
check directly for the existence of the device pointer -- for a
non-NULL check, makes no difference if it is the iwarp or the ib
pointer).
Finally, make a few code changes here to improve coding consistency.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Save the PID associated with an RDMA CM ID for reporting via netlink.
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add callbacks and data types for statistics export of all current
devices/ids. The schema for RDMA CM is a series of netlink messages.
Each one contains an rdma_cm_stat struct. Additionally, two netlink
attributes are created for the addresses for each message (if
applicable).
Their types used are:
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR (The source address for this ID)
RDMA_NL_RDMA_CM_ATTR_DST_ADDR (The destination address for this ID)
sockaddr_* structs are encapsulated within these attributes.
In other words, every transaction contains a series of messages like:
-------message 1-------
struct rdma_cm_id_stats {
__u32 qp_num;
__u32 bound_dev_if;
__u32 port_space;
__s32 pid;
__u8 cm_state;
__u8 node_type;
__u8 port_num;
__u8 reserved;
}
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute - contains the source address
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute - contains the destination address
-------end 1-------
-------message 2-------
struct rdma_cm_id_stats
RDMA_NL_RDMA_CM_ATTR_SRC_ADDR attribute
RDMA_NL_RDMA_CM_ATTR_DST_ADDR attribute
-------end 2-------
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The RDMA CM currently infers the QP type from the port space selected
by the user. In the future (eg with RDMA_PS_IB or XRC), there may not
be a 1-1 correspondence between port space and QP type. For netlink
export of RDMA CM state, we want to export the QP type to userspace,
so it is cleaner to explicitly associate a QP type to an ID.
Modify rdma_create_id() to allow the user to specify the QP type, and
use it to make our selections of datagram versus connected mode.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move cma.c's internal definition of enum cma_state to enum rdma_cm_state
in an exported header so that it can be exported via RDMA netlink.
Signed-off-by: Nir Muchtar <nirm@voltaire.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Lustre requires that clients bind to a privileged port number before
connecting to a remote server. On larger clusters (typically more
than about 1000 nodes), the number of privileged ports is exhausted,
resulting in lustre being unusable.
To handle this, we add support for reusable addresses to the rdma_cm.
This mimics the behavior of the socket option SO_REUSEADDR. A user
may set an rdma_cm_id to reuse an address before calling
rdma_bind_addr() (explicitly or implicitly). If set, other
rdma_cm_id's may be bound to the same address, provided that they all
have reuse enabled, and there are no active listens.
If rdma_listen() is called on an rdma_cm_id that has reuse enabled, it
will only succeed if there are no other id's bound to that same
address. The reuse option is exported to user space. The behavior of
the kernel reuse implementation was verified against that given by
sockets.
This patch is derived from a path by Ira Weiny <weiny2@llnl.gov>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
cma_use_port() assumes that the sockaddr is an IPv4 address. Since
IPv6 addressing is supported (and also to support other address
families) make the code more generic in its address handling.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
rdma_destroy_id currently uses the global rdma cm 'lock' to test if an
rdma_cm_id has been bound to a device. This prevents an active
address resolution callback handler from assigning a device to the
rdma_cm_id after rdma_destroy_id checks for one.
Instead, we can replace the use of the global lock around the check to
the rdma_cm_id device pointer by setting the id state to destroying,
then flushing all active callbacks. The latter is accomplished by
acquiring and releasing the handler_mutex. Any active handler will
complete first, and any newly scheduled handlers will find the
rdma_cm_id in an invalid state.
In addition to optimizing the current locking scheme, the use of the
rdma_cm_id mutex is a more intuitive synchronization mechanism than
that of the global lock. These changes are based on feedback from
Doug Ledford <dledford@redhat.com> while he was trying to debug a
crash in the rdma cm destroy path.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Doug Ledford and Red Hat reported a crash when running the rdma_cm on
a real-time OS. The crash has the following call trace:
cm_process_work
cma_req_handler
cma_disable_callback
rdma_create_id
kzalloc
init_completion
cma_get_net_info
cma_save_net_info
cma_any_addr
cma_zero_addr
rdma_translate_ip
rdma_copy_addr
cma_acquire_dev
rdma_addr_get_sgid
ib_find_cached_gid
cma_attach_to_dev
ucma_event_handler
kzalloc
ib_copy_ah_attr_to_user
cma_comp
[ preempted ]
cma_write
copy_from_user
ucma_destroy_id
copy_from_user
_ucma_find_context
ucma_put_ctx
ucma_free_ctx
rdma_destroy_id
cma_exch
cma_cancel_operation
rdma_node_get_transport
rt_mutex_slowunlock
bad_area_nosemaphore
oops_enter
They were able to reproduce the crash multiple times with the
following details:
Crash seems to always happen on the:
mutex_unlock(&conn_id->handler_mutex);
as conn_id looks to have been freed during this code path.
An examination of the code shows that a race exists in the request
handlers. When a new connection request is received, the rdma_cm
allocates a new connection identifier. This identifier has a single
reference count on it. If a user calls rdma_destroy_id() from another
thread after receiving a callback, rdma_destroy_id will proceed to
destroy the id and free the associated memory. However, the request
handlers may still be in the process of running. When control returns
to the request handlers, they can attempt to access the newly created
identifiers.
Fix this by holding a reference on the newly created rdma_cm_id until
the request handler is through accessing it.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add 802.1q VLAN support to IBoE. The VLAN tag is encoded within the
GID derived from a link local address in the following way:
GID[11] GID[12] contain the VLAN ID when the GID contains a VLAN.
The 3 bits user priority field of the packets are identical to the 3
bits of the SL.
In case of rdma_cm apps, the TOS field is used to generate the SL
field by doing a shift right of 5 bits effectively taking to 3 MS bits
of the TOS field.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IBoE device binding and IP --> GID resolution. Path
resolving and multicast joining are implemented within cma.c by
filling in the responses and running callbacks in the CMA work queue.
IP --> GID resolution always yields IPv6 link local addresses; remote
GIDs are derived from the destination MAC address of the remote port.
Multicast GIDs are always mapped to multicast MACs as is done in IPv6.
(IPv4 multicast is enabled by translating IPv4 multicast addresses to
IPv6 multicast as described in
<http://www.mail-archive.com/ipng@sunroof.eng.sun.com/msg02134.html>.)
Some helper functions are added to ib_addr.h.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use kmemdup when some other buffer is immediately copied into the
allocated region.
A simplified version of the semantic patch that makes this change is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
statement S;
@@
- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Randomize local port allocation in the way sctp_get_port_local() does.
Update rover at the end of loop since we're likely to pick a valid port
on the first try.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Check correct variable for allocation failure
RDMA/nes: Correct cap.max_inline_data assignment in nes_query_qp()
RDMA/cm: Set num_paths when manually assigning path records
IB/cm: Fix device_create() return value check
When manually assigning the path records to use for a connection, save
the number of paths that were set. Otherwise, checks against num_path
will show 0, even though path record data is available.
This was discovered by manually setting the path records from user
space, then querying the kernel to see if the correct path records
were assigned, only to discover that the kernel returned 0 path
records to the query.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Revert the following change from commit 6f8372b6 ("RDMA/cm: fix
loopback address support")
The defined behavior of rdma_bind_addr is to associate an RDMA
device with an rdma_cm_id, as long as the user specified a non-
zero address. (ie they weren't just trying to reserve a port)
Currently, if the loopback address is passed to rdma_bind_addr,
no device is associated with the rdma_cm_id. Fix this.
It turns out that important apps such as Open MPI depend on
rdma_bind_addr() NOT associating any RDMA device when binding to a
loopback address. Open MPI is being updated to deal with this, but at
least until a new Open MPI release is available, maintain the previous
behavior: allow rdma_bind_addr() to succeed, but do not bind to a
device.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Correct misspelled "CONFIG_IPv6" that was introduced in commit
d14714df ("IB/addr: Fix IPv6 routing lookup"). The config variable
should be all uppercase.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
[ This was my fault when I munged the original patch. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>