linux_old1

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	ad804a0b2a	Merge branch 'akpm' (patches from Andrew) Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...	2015-11-07 14:32:45 -08:00
Linus Torvalds	ab9f2faf8f	Initial 4.4 merge window submission - "Checksum offload support in user space" enablement - Misc cxgb4 fixes, add T6 support - Misc usnic fixes - 32 bit build warning fixes - Misc ocrdma fixes - Multicast loopback prevention extension - Extend the GID cache to store and return attributes of GIDs - Misc iSER updates - iSER clustering update - Network NameSpace support for rdma CM - Work Request cleanup series - New Memory Registration API -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJWPO5UAAoJELgmozMOVy/dSCQP/iX2ImMZOS3VkOYKhLR3dSv8 4vTEiYIoAT1JEXiPpiabuuACwotcZcMRk9kZ0dcWmBoFusTzKJmoDOkgAYd95XqY EsAyjqtzUGNNMjH5u5W+kdbaFdH9Ktq7IJvspRlJuvzC47Srax+qBxX01jrAkDgh 4PoA3hEa2KkvkDjY2Mhvk9EWd/uflO9Ky6o0D8jUQkWtEvKBRyDjQLk30oW6wHX9 pTWqww3dD0EXTrR+PDA88v2saKH1kZFU1Nt2eU8Bw+zlJM8hcX6U7PfRX0g3HT/J o+7ejTdLPWFDH35gJOU+KE519f1JbwfRjPJCqbOC9IttBB7iHSbhcpQLpWv4JV1x agdBeDA3TGQj3dHb2SkYMlWXCBp7q8UCbVGvvirTFzGSGU73sc6hhP+vCKvPQIlE Ah5tUqD7Y3mOBjvuDeIzKMLXILd5d3cH+m7Laytrf5e7fJPmBRZyOkcMh0QVElyl mKo+PFjghgeTFb405J7SDDw/vThVyN9HyIt7AGEzObaajzOOk9R1hkQr46XVy9TK yi58fl85yQ2n6TWV6NRnvkQoMy/N2HAEuXk/7HtO0PabV5w3Lo0zvXB9SnVrrVEm 58FWRBYCWorVSdSacuDnPm0iz45WSRIb9G9sBlhEC93eXRq2rSBoy4RvyLeliHFH hllyhNNolI6FJ64j07Xm =bBIY -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "This is my initial round of 4.4 merge window patches. There are a few other things I wish to get in for 4.4 that aren't in this pull, as this represents what has gone through merge/build/run testing and not what is the last few items for which testing is not yet complete. - "Checksum offload support in user space" enablement - Misc cxgb4 fixes, add T6 support - Misc usnic fixes - 32 bit build warning fixes - Misc ocrdma fixes - Multicast loopback prevention extension - Extend the GID cache to store and return attributes of GIDs - Misc iSER updates - iSER clustering update - Network NameSpace support for rdma CM - Work Request cleanup series - New Memory Registration API" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (76 commits) IB/core, cma: Make __attribute_const__ declarations sparse-friendly IB/core: Remove old fast registration API IB/ipath: Remove fast registration from the code IB/hfi1: Remove fast registration from the code RDMA/nes: Remove old FRWR API IB/qib: Remove old FRWR API iw_cxgb4: Remove old FRWR API RDMA/cxgb3: Remove old FRWR API RDMA/ocrdma: Remove old FRWR API IB/mlx4: Remove old FRWR API support IB/mlx5: Remove old FRWR API support IB/srp: Dont allocate a page vector when using fast_reg IB/srp: Remove srp_finish_mapping IB/srp: Convert to new registration API IB/srp: Split srp_map_sg RDS/IW: Convert to new memory registration API svcrdma: Port to new memory registration API xprtrdma: Port to new memory registration API iser-target: Port to new memory registration API IB/iser: Port to new fast registration API ...	2015-11-07 13:33:07 -08:00
Mel Gorman	d0164adc89	mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
santosh.shilimkar@oracle.com	7b5654349e	RDS: convert bind hash table to re-sizable hashtable To further improve the RDS connection scalabilty on massive systems where number of sockets grows into tens of thousands of sockets, there is a need of larger bind hashtable. Pre-allocated 8K or 16K table is not very flexible in terms of memory utilisation. The rhashtable infrastructure gives us the flexibility to grow the hashtbable based on use and also comes up with inbuilt efficient bucket(chain) handling. Reviewed-by: David Miller <davem@davemloft.net> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-11-02 15:36:23 -05:00
Saurabh Sengar	d3ffaefa1b	net: rds: changing the return type from int to void as result of function rds_iw_flush_mr_pool is nowhere checked, changing its return type from int to void. also removing the unused variable rc as there is nothing to return Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-11-02 15:35:19 -05:00
David S. Miller	b75ec3af27	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2015-11-01 00:15:30 -04:00
Sagi Grimberg	9ddc87374a	RDS/IW: Convert to new memory registration API Get rid of fast_reg page list and its construction. Instead, just pass the RDS sg list to ib_map_mr_sg and post the new ib_reg_wr. This is done both for server IW RDMA_READ registration and the client remote key registration. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-10-28 22:27:18 -04:00
Doug Ledford	63e8790d39	Merge branch 'wr-cleanup' into k.o/for-4.4	2015-10-28 22:23:34 -04:00
Guy Shapiro	fa20105e09	IB/cma: Add support for network namespaces Add support for network namespaces in the ib_cma module. This is accomplished by: 1. Adding network namespace parameter for rdma_create_id. This parameter is used to populate the network namespace field in rdma_id_private. rdma_create_id keeps a reference on the network namespace. 2. Using the network namespace from the rdma_id instead of init_net inside of ib_cma, when listening on an ID and when looking for an ID for an incoming request. 3. Decrementing the reference count for the appropriate network namespace when calling rdma_destroy_id. In order to preserve the current behavior init_net is passed when calling from other modules. Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-10-28 12:32:48 -04:00
Sowmini Varadhan	8ce675ff39	RDS-TCP: Recover correctly from pskb_pull()/pksb_trim() failure in rds_tcp_data_recv Either of pskb_pull() or pskb_trim() may fail under low memory conditions. If rds_tcp_data_recv() ignores such failures, the application will receive corrupted data because the skb has not been correctly carved to the RDS datagram size. Avoid this by handling pskb_pull/pskb_trim failure in the same manner as the skb_clone failure: bail out of rds_tcp_data_recv(), and retry via the deferred call to rds_send_worker() that gets set up on ENOMEM from rds_tcp_read_sock() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-27 19:46:34 -07:00
santosh.shilimkar@oracle.com	7b4b000951	RDS: fix rds-ping deadlock over TCP transport Sowmini found hang with rds-ping while testing RDS over TCP. Its a corner case and doesn't happen always. The issue is not reproducible with IB transport. Its clear from below dump why we see it with RDS TCP. [<ffffffff8153b7e5>] do_tcp_setsockopt+0xb5/0x740 [<ffffffff8153bec4>] tcp_setsockopt+0x24/0x30 [<ffffffff814d57d4>] sock_common_setsockopt+0x14/0x20 [<ffffffffa096071d>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp] [<ffffffffa093b5f7>] rds_send_xmit+0xd7/0x740 [rds] [<ffffffffa093bda2>] rds_send_pong+0x142/0x180 [rds] [<ffffffffa0939d34>] rds_recv_incoming+0x274/0x330 [rds] [<ffffffff810815ae>] ? ttwu_queue+0x11e/0x130 [<ffffffff814dcacd>] ? skb_copy_bits+0x6d/0x2c0 [<ffffffffa0960350>] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp] [<ffffffff8153d836>] tcp_read_sock+0x96/0x1c0 [<ffffffffa0960060>] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp] [<ffffffff814d6a90>] ? sock_def_write_space+0xa0/0xa0 [<ffffffffa09604d1>] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp] [<ffffffff81545249>] tcp_data_queue+0x379/0x5b0 [<ffffffffa0960cdb>] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp] [<ffffffff81547fd2>] tcp_rcv_established+0x2e2/0x6e0 [<ffffffff81552602>] tcp_v4_do_rcv+0x122/0x220 [<ffffffff81553627>] tcp_v4_rcv+0x867/0x880 [<ffffffff8152e0b3>] ip_local_deliver_finish+0xa3/0x220 This happens because rds_send_xmit() chain wants to take sock_lock which is already taken by tcp_v4_rcv() on its way to rds_tcp_data_ready(). Commit `db6526dcb5` ("RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL") which was trying to opportunistically finish the send request in same thread context. But because of above recursive lock hang with RDS TCP, the send work from rds_send_pong() needs to deferred to worker to avoid lock up. Given RDS ping is more of connectivity test than performance critical path, its should be ok even for transport like IB. Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-18 22:45:55 -07:00
Sowmini Varadhan	241b271952	RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one() Consider the following "duelling syn" sequence between two peers A and B: A B SYN1 --> <-- SYN2 SYN2ACK --> Note that the SYN/ACK has already been sent out by TCP before rds_tcp_accept_one() gets invoked as part of callbacks. If the inet_addr(A) is numerically less than inet_addr(B), the arbitration scheme in rds_tcp_accept_one() will prefer the TCP connection triggered by SYN1, and will send a CLOSE for the SYN2 (just after the SYN2ACK was sent). Since B also follows the same arbitration scheme, it will send the SYN-ACK for SYN1 that will set up a healthy ESTABLISHED connection on both sides. B will also get a CLOSE for SYN2, which should result in the cleanup of the TCP state machine for SYN2, but it should not trigger any stale RDS-TCP callbacks (such as ->writespace, ->state_change etc), that would disrupt the progress of the SYN2 based RDS-TCP connection. Thus the arbitration scheme in rds_tcp_accept_one() should restore rds_tcp callbacks for the winner before setting them up for the new accept socket, and also make sure that conn->c_outgoing is set to 0 so that we do not trigger any reconnect attempts on the passive side of the tcp socket in the future, in conformance with commit `c82ac7e69e` ("net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-13 04:22:41 -07:00
Sowmini Varadhan	486798001b	RDS: Invoke ->laddr_check() in rds_bind() for explicitly bound transports. The IP address passed to rds_bind() should be vetted by the transport's ->laddr_check() for a previously bound transport. This needs to be done to avoid cases where, for example, the application has asked for an IB transport, but the IP address passed to bind is only usable on ethernet interfaces. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-13 04:22:40 -07:00
David S. Miller	91d2f14bc3	Merge branch 'net/rds/4.3-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux Santosh Shilimkar says: ==================== RDS: connection scalability and performance improvements [v4] Re-sending the same patches from v3 again since my repost of patch 05/14 from v3 was whitespace damaged. [v3] Updated patch "[PATCH v2 05/14] RDS: defer the over_batch work to send worker" as per David Miller's comment [4] to avoid the magic value usage. Patch now makes use of already available but unused send_batch_count module parameter. Rest of the patches are same as earlier version v2 [3] [v2]: Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from earlier version [1]. I plan to address the hash table scalability using re-sizable hash tables as suggested by David Laight and David Miller [2] This series addresses RDS connection bottlenecks on massive workloads and improve the RDMA performance almost by 3X. RDS TCP also gets a small gain of about 12%. RDS is being used in massive systems with high scalability where several hundred thousand end points and tens of thousands of local processes are operating in tens of thousand sockets. Being RC(reliable connection), socket bind and release happens very often and any inefficiencies in bind hash look ups hurts the overall system performance. RDS bin hash-table uses global spin-lock which is the biggest bottleneck. To make matter worst, it uses rcu inside global lock for hash buckets. This is being addressed by simply using per bucket rw lock which makes the locking simple and very efficient. The hash table size is still an issue and I plan to address it by using re-sizable hash tables as suggested on the list. For RDS RDMA improvement, the completion handling is revamped so that we can do batch completions. Both send and receive completion handlers are split logically to achieve the same. RDS 8K messages being one of the key usecase, mr pool is adapted to have the 8K mrs along with default 1M mrs. And while doing this, few fixes and couple of bottlenecks seen with rds_sendmsg() are addressed. Series applies against 4.3-rc1 as well net-next. Its tested on Oracle hardware with IB fabric for both bcopy as well as RDMA mode. RDS TCP is tested with iXGB NIC. Like last time, iWARP transport is untested with these changes. The patchset is also available at below git repo: git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v3 As a side note, the IB HCA driver I used for testing misses at least 3 important patches in upstream to see the full blown IB performance and am hoping to get that in mainline with help of them. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-08 04:38:37 -07:00
Christoph Hellwig	e622f2f4ad	IB: split struct ib_send_wr This patch split up struct ib_send_wr so that all non-trivial verbs use their own structure which embedds struct ib_send_wr. This dramaticly shrinks the size of a WR for most common operations: sizeof(struct ib_send_wr) (old): 96 sizeof(struct ib_send_wr): 48 sizeof(struct ib_rdma_wr): 64 sizeof(struct ib_atomic_wr): 96 sizeof(struct ib_ud_wr): 88 sizeof(struct ib_fast_reg_wr): 88 sizeof(struct ib_bind_mw_wr): 96 sizeof(struct ib_sig_handover_wr): 80 And with Sagi's pending MR rework the fast registration WR will also be down to a reasonable size: sizeof(struct ib_fastreg_wr): 64 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt] Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc] Tested-by: Haggai Eran <haggaie@mellanox.com> Tested-by: Sagi Grimberg <sagig@mellanox.com> Tested-by: Steve Wise <swise@opengridcomputing.com>	2015-10-08 11:09:10 +01:00
Santosh Shilimkar	0676651323	RDS: IB: split mr pool to improve 8K messages performance 8K message sizes are pretty important usecase for RDS current workloads so we make provison to have 8K mrs available from the pool. Based on number of SG's in the RDS message, we pick a pool to use. Also to make sure that we don't under utlise mrs when say 8k messages are dominating which could lead to 8k pull being exhausted, we fall-back to 1m pool till 8k pool recovers for use. This helps to at least push ~55 kB/s bidirectional data which is a nice improvement. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:02 -07:00
Santosh Shilimkar	41a4e96462	RDS: IB: use max_mr from HCA caps than max_fmr All HCA drivers seems to popullate max_mr caps and few of them do both max_mr and max_fmr. Hence update RDS code to make use of max_mr. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:02 -07:00
Santosh Shilimkar	67161e250a	RDS: IB: mark rds_ib_fmr_wq static Fix below warning by marking rds_ib_fmr_wq static net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. Should it be static? Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:02 -07:00
Santosh Shilimkar	26139dc1db	RDS: IB: use already available pool handle from ibmr rds_ib_mr already keeps the pool handle which it associates with. Lets use that instead of round about way of fetching it from rds_ib_device. No functional change. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:02 -07:00
Santosh Shilimkar	2e1d6b813a	RDS: IB: fix the rds_ib_fmr_wq kick call RDS IB mr pool has its own workqueue 'rds_ib_fmr_wq', so we need to use queue_delayed_work() to kick the work. This was hurting the performance since pool maintenance was less often triggered from other path. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:01 -07:00
Santosh Shilimkar	9441c973e1	RDS: IB: handle rds_ibdev release case instead of crashing the kernel Just in case we are still handling the QP receive completion while the rds_ibdev is released, drop the connection instead of crashing the kernel. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:01 -07:00
Santosh Shilimkar	0c28c04500	RDS: IB: split send completion handling and do batch ack Similar to what we did with receive CQ completion handling, we split the transmit completion handler so that it lets us implement batched work completion handling. We re-use the cq_poll routine and makes use of RDS_IB_SEND_OP to identify the send vs receive completion event handler invocation. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:01 -07:00
Santosh Shilimkar	f4f943c958	RDS: IB: ack more receive completions to improve performance For better performance, we split the receive completion IRQ handler. That lets us acknowledge several WCE events in one call. We also limit the WC to max 32 to avoid latency. Acknowledging several completions in one call instead of several calls each time will provide better performance since less mutual exclusion locks are being performed. In next patch, send completion is also split which re-uses the poll_cq() and hence the code is moved to ib_cm.c Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:01 -07:00
Santosh Shilimkar	db6526dcb5	RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL In Transport indepedent rds_sendmsg(), we shouldn't make decisions based on RDS_LL_SEND_FULL which is used to manage the ring for RDMA based transports. We can safely issue rds_send_xmit() and the using its return value take decision on deferred work. This will also fix the scenario where at times we are seeing connections stuck with the LL_SEND_FULL bit getting set and never cleared. We kick krdsd after any time we see -ENOMEM or -EAGAIN from the ring allocation code. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:19:01 -07:00
Santosh Shilimkar	4bebdd7a4d	RDS: defer the over_batch work to send worker Current process gives up if its send work over the batch limit. The work queue will get kicked to finish off any other requests. This fixes remainder condition from commit `443be0e5af` ("RDS: make sure not to loop forever inside rds_send_xmit"). The restart condition is only for the case where we reached to over_batch code for some other reason so just retrying again before giving up. While at it, make sure we use already available 'send_batch_count' parameter instead of magic value. The batch count threshold value of 1024 came via commit `443be0e5af` ("RDS: make sure not to loop forever inside rds_send_xmit"). The idea is to process as big a batch as we can but at the same time we don't hold other waiting processes for send. Hence back-off after the send_batch_count limit (1024) to avoid soft-lock ups. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-10-05 11:18:45 -07:00
Sowmini Varadhan	76b29ef120	RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit For the same reasons as commit `2f53384424` ("tcp: allow splice() to build full TSO packets") and commit `35f9c09fe9` ("tcp: tcp_sendpages() should call tcp_push() once"), rds_tcp_xmit may have multiple pages to send, so use the MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to tcp_sendpage() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-05 03:34:53 -07:00
Sowmini Varadhan	1edd6a14d2	RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K) clobbers efficient use of TSO because it inflates the size_goal that is computed in tcp_sendmsg/tcp_sendpage and skews packet latency, and the default values for these parameters actually results in significantly better performance. In request-response tests using rds-stress with a packet size of 100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16) between a single pair of IP addresses achieves a throughput of 6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under equivalent conditions on these platforms. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-05 03:34:53 -07:00
Sowmini Varadhan	3b20fc3897	RDS: Use a single TCP socket for both send and receive. Commit `f711a6ae06` ("net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection.") modified rds-tcp so that an incoming SYN would ignore an existing "client" TCP connection which had the local port set to the transient port. The motivation for ignoring the existing "client" connection in `f711a6ae` was to avoid race conditions and an endless duel of reconnect attempts triggered by a restart/abort of one of the nodes in the TCP connection. However, having separate sockets for active and passive sides is avoidable, and the simpler model of a single TCP socket for both send and receives of all RDS connections associated with that tcp socket makes for easier observability. We avoid the race conditions from `f711a6ae` by attempting reconnects in rds_conn_shutdown if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP. The c_outgoing bit is initialized in __rds_conn_create(). A side-effect of re-using the client rds_connection for an incoming SYN is the potential of encountering duelling SYNs, i.e., we have an outgoing RDS_CONN_CONNECTING socket when we get the incoming SYN. The logic to arbitrate this criss-crossing SYN exchange in rds_tcp_accept_one() has been modified to emulate the BGP state machine: the smaller IP address should back off from the connection attempt. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-05 03:34:51 -07:00
Santosh Shilimkar	9b9acde7e8	RDS: Use per-bucket rw lock for bind hash-table One global lock protecting hash-tables with 1024 buckets isn't efficient and it shows up in a massive systems with truck loads of RDS sockets serving multiple databases. The perf data clearly highlights the contention on the rw lock in these massive workloads. When the contention gets worse, the code gets into a state where it decides to back off on the lock. So while it has disabled interrupts, it sits and backs off on this lock get. This causes the system to become sluggish and eventually all sorts of bad things happen. The simple fix is to move the lock into the hash bucket and use per-bucket lock to improve the scalability. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-09-30 12:43:25 -04:00
Santosh Shilimkar	2812695988	RDS: fix rds_sock reference bug while doing bind One need to take rds socket reference while using it and release it once done with it. rds_add_bind() code path does not do that so lets fix it. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-09-30 12:43:25 -04:00
Santosh Shilimkar	8b0a6b461e	RDS: make socket bind/release locking scheme simple and more efficient RDS bind and release locking scheme is very inefficient. It uses RCU for maintaining the bind hash-table which is great but it also needs to hold spinlock for [add/remove]_bound(). So overall usecase, the hash-table concurrent speedup doesn't pay off. In fact blocking nature of synchronize_rcu() makes the RDS socket shutdown too slow which hurts RDS performance since connection shutdown and re-connect happens quite often to maintain the RC part of the protocol. So we make the locking scheme simpler and more efficient by replacing spin_locks with reader/writer locks and getting rid off rcu for bind hash-table. In subsequent patch, we also covert the global lock with per-bucket lock to reduce the global lock contention. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-09-30 12:43:24 -04:00
Santosh Shilimkar	59fe460674	RDS: use kfree_rcu in rds_ib_remove_ipaddr synchronize_rcu() slowing down un-necessarily the socket shutdown path. It is used just kfree() the ip addresses in rds_ib_remove_ipaddr() which is perfect usecase for kfree_rcu(); So lets use that to gain some speedup. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>	2015-09-30 12:43:24 -04:00
Linus Torvalds	65c61bc5db	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Fix out-of-bounds array access in netfilter ipset, from Jozsef Kadlecsik. 2) Use correct free operation on netfilter conntrack templates, from Daniel Borkmann. 3) Fix route leak in SCTP, from Marcelo Ricardo Leitner. 4) Fix sizeof(pointer) in mac80211, from Thierry Reding. 5) Fix cache pointer comparison in ip6mr leading to missed unlock of mrt_lock. From Richard Laing. 6) rds_conn_lookup() needs to consider network namespace in key comparison, from Sowmini Varadhan. 7) Fix deadlock in TIPC code wrt broadcast link wakeups, from Kolmakov Dmitriy. 8) Fix fd leaks in bpf syscall, from Daniel Borkmann. 9) Fix error recovery when installing ipv6 multipath routes, we would delete the old route before we would know if we could fully commit to the new set of nexthops. Fix from Roopa Prabhu. 10) Fix run-time suspend problems in r8152, from Hayes Wang. 11) In fec, don't program the MAC address into the chip when the clocks are gated off. From Fugang Duan. 12) Fix poll behavior for netlink sockets when using rx ring mmap, from Daniel Borkmann. 13) Don't allocate memory with GFP_KERNEL from get_stats64 in r8169 driver, from Corinna Vinschen. 14) In TCP Cubic congestion control, handle idle periods better where we are application limited, in order to keep cwnd from growing out of control. From Eric Dumzet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits) tcp_cubic: better follow cubic curve after idle period tcp: generate CA_EVENT_TX_START on data frames xen-netfront: respect user provided max_queues xen-netback: respect user provided max_queues r8169: Fix sleeping function called during get_stats64, v2 ether: add IEEE 1722 ethertype - TSN netlink, mmap: fix edge-case leakages in nf queue zero-copy netlink, mmap: don't walk rx ring on poll if receive queue non-empty cxgb4: changes for new firmware 1.14.4.0 net: fec: add netif status check before set mac address r8152: fix the runtime suspend issues r8152: split DRIVER_VERSION ipv6: fix ifnullfree.cocci warnings add microchip LAN88xx phy driver stmmac: fix check for phydev being open net: qlcnic: delete redundant memsets net: mv643xx_eth: use kzalloc net: jme: use kzalloc() instead of kmalloc+memset net: cavium: liquidio: use kzalloc in setup_glist() net: ipv6: use common fib_default_rule_pref ...	2015-09-10 13:53:15 -07:00
Sasha Levin	74e98eb085	RDS: verify the underlying transport exists before creating a connection There was no verification that an underlying transport exists when creating a connection, this would cause dereferencing a NULL ptr. It might happen on sockets that weren't properly bound before attempting to send a message, which will cause a NULL ptr deref: [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN [135546.051270] Modules linked in: [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527 [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000 [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194) [135546.055666] RSP: 0018:ffff8800bc70fab0 EFLAGS: 00010202 [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000 [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038 [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000 [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000 [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000 [135546.061668] FS: 00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000 [135546.062836] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0 [135546.064723] Stack: [135546.065048] ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008 [135546.066247] 0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342 [135546.067438] 1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00 [135546.068629] Call Trace: [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134) [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298) [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278) [135546.071981] rds_sendmsg (net/rds/send.c:1058) [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38) [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298) [135546.074577] ? rds_send_drop_to (net/rds/send.c:976) [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795) [135546.076349] ? __might_fault (mm/memory.c:3795) [135546.077179] ? rds_send_drop_to (net/rds/send.c:976) [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620) [135546.078856] SYSC_sendto (net/socket.c:1657) [135546.079596] ? SYSC_connect (net/socket.c:1628) [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926) [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674) [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16) [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16) [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749) [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1 Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-09-09 12:38:30 -07:00
Linus Torvalds	26d2177e97	Changes for 4.3 - Create drivers/staging/rdma - Move amso1100 driver to staging/rdma and schedule for deletion - Move ipath driver to staging/rdma and schedule for deletion - Add hfi1 driver to staging/rdma and set TODO for move to regular tree - Initial support for namespaces to be used on RDMA devices - Add RoCE GID table handling to the RDMA core caching code - Infrastructure to support handling of devices with differing read and write scatter gather capabilities - Various iSER updates - Kill off unsafe usage of global mr registrations - Update SRP driver - Misc. mlx4 driver updates - Support for the mr_alloc verb - Support for a netlink interface between kernel and user space cache daemon to speed path record queries and route resolution - Ininitial support for safe hot removal of verbs devices -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJV7v8wAAoJELgmozMOVy/d2dcP/3PXnGFPgFGJODKE6VCZtTvj nooNXRKXjxv470UT5DiAX7SNcBxzzS7Zl/Lj+831H9iNXUyzuH31KtBOAZ3W03vZ yXwCB2caOStSldTRSUUvPe2aIFPnyNmSpC4i6XcJLJMCFijKmxin5pAo8qE44BQU yjhT+wC9P6LL5wZXsn/nFIMLjOFfu0WBFHNp3gs5j59paxlx5VeIAZk16aQZH135 m7YCyicwrS8iyWQl2bEXRMon2vlCHlX2RHmOJ4f/P5I0quNcGF2+d8Yxa+K1VyC5 zcb3OBezz+wZtvh16yhsDfSPqHWirljwID2VzOgRSzTJWvQjju8VkwHtkq6bYoBW egIxGCHcGWsD0R5iBXLYr/tB+BmjbDObSm0AsR4+JvSShkeVA1IpeoO+19162ixE n6CQnk2jCee8KXeIN4PoIKsjRSbIECM0JliWPLoIpuTuEhhpajftlSLgL5hf1dzp HrSy6fXmmoRj7wlTa7DnYIC3X+ffwckB8/t1zMAm2sKnIFUTjtQXF7upNiiyWk4L /T1QEzJ2bLQckQ9yY4v528SvBQwA4Dy1amIQB7SU8+2S//bYdUvhysWPkdKC4oOT WlqS5PFDCI31MvNbbM3rUbMAD8eBAR8ACw9ZpGI/Rffm5FEX5W3LoxA8gfEBRuqt 30ZYFuW8evTL+YQcaV65 =EHLg -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull inifiniband/rdma updates from Doug Ledford: "This is a fairly sizeable set of changes. I've put them through a decent amount of testing prior to sending the pull request due to that. There are still a few fixups that I know are coming, but I wanted to go ahead and get the big, sizable chunk into your hands sooner rather than waiting for those last few fixups. Of note is the fact that this creates what is intended to be a temporary area in the drivers/staging tree specifically for some cleanups and additions that are coming for the RDMA stack. We deprecated two drivers (ipath and amso1100) and are waiting to hear back if we can deprecate another one (ehca). We also put Intel's new hfi1 driver into this area because it needs to be refactored and a transfer library created out of the factored out code, and then it and the qib driver and the soft-roce driver should all be modified to use that library. I expect drivers/staging/rdma to be around for three or four kernel releases and then to go away as all of the work is completed and final deletions of deprecated drivers are done. Summary of changes for 4.3: - Create drivers/staging/rdma - Move amso1100 driver to staging/rdma and schedule for deletion - Move ipath driver to staging/rdma and schedule for deletion - Add hfi1 driver to staging/rdma and set TODO for move to regular tree - Initial support for namespaces to be used on RDMA devices - Add RoCE GID table handling to the RDMA core caching code - Infrastructure to support handling of devices with differing read and write scatter gather capabilities - Various iSER updates - Kill off unsafe usage of global mr registrations - Update SRP driver - Misc mlx4 driver updates - Support for the mr_alloc verb - Support for a netlink interface between kernel and user space cache daemon to speed path record queries and route resolution - Ininitial support for safe hot removal of verbs devices" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits) IB/ipoib: Suppress warning for send only join failures IB/ipoib: Clean up send-only multicast joins IB/srp: Fix possible protection fault IB/core: Move SM class defines from ib_mad.h to ib_smi.h IB/core: Remove unnecessary defines from ib_mad.h IB/hfi1: Add PSM2 user space header to header_install IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY mlx5: Fix incorrect wc pkey_index assignment for GSI messages IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow IB/uverbs: reject invalid or unknown opcodes IB/cxgb4: Fix if statement in pick_local_ip6adddrs IB/sa: Fix rdma netlink message flags IB/ucma: HW Device hot-removal support IB/mlx4_ib: Disassociate support IB/uverbs: Enable device removal when there are active user space applications IB/uverbs: Explicitly pass ib_dev to uverbs commands IB/uverbs: Fix race between ib_uverbs_open and remove_one IB/uverbs: Fix reference counting usage of event files IB/core: Make ib_dealloc_pd return void IB/srp: Create an insecure all physical rkey only if needed ...	2015-09-09 08:33:31 -07:00
Sowmini Varadhan	8f384c0177	RDS: rds_conn_lookup() should factor in the struct net for a match Only return a conn if the rds_conn_net(conn) matches the struct net passed to rds_conn_lookup(). Fixes: `467fa15356` ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-09-05 22:04:58 -07:00
Jason Gunthorpe	7dd78647a2	IB/core: Make ib_dealloc_pd return void The majority of callers never check the return value, and even if they did, they can't do anything about a failure. All possible failure cases represent a bug in the caller, so just WARN_ON inside the function instead. This fixes a few random errors: net/rd/iw.c infinite loops while it fails. (racing with EBUSY?) This also lays the ground work to get rid of error return from the drivers. Most drivers do not error, the few that do are broken since it cannot be handled. Since uverbs can legitimately make use of EBUSY, open code the check. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-08-30 18:12:39 -04:00
Jason Gunthorpe	e5580242aa	rds/ib: Remove ib_get_dma_mr calls The pd now has a local_dma_lkey member which completely replaces ib_get_dma_mr, use it instead. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-08-30 18:12:36 -04:00
Sagi Grimberg	fc27995942	RDS: Convert to ib_alloc_mr Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-08-30 18:08:46 -04:00
Haggai Eran	7c1eb45a22	IB/core: lock client data with lists_rwsem An ib_client callback that is called with the lists_rwsem locked only for read is protected from changes to the IB client lists, but not from ib_unregister_device() freeing its client data. This is because ib_unregister_device() will remove the device from the device list with lists_rwsem locked for write, but perform the rest of the cleanup, including the call to remove() without that lock. Mark client data that is undergoing de-registration with a new going_down flag in the client data context. Lock the client data list with lists_rwsem for write in addition to using the spinlock, so that functions calling the callback would be able to lock only lists_rwsem for read and let callbacks sleep. Since ib_unregister_client() now marks the client data context, no need for remove() to search the context again, so pass the client data directly to remove() callbacks. Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-08-30 15:48:21 -04:00
santosh.shilimkar@oracle.com	2724121419	RDS: remove superfluous from rds_ib_alloc_fmr() Memory allocated for 'ibmr' uses kzalloc_node() which already initialises the memory to zero. There is no need to do memset() 0 on that memory. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com	ef5217a6e2	RDS: flush the FMR pool less often FMR flush is an expensive and time consuming operation. Reduce the frequency of FMR pool flush by 50% so that more FMR work gets accumulated for more efficient flushing. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com	ad1d7dc0d7	RDS: push FMR pool flush work to its own worker RDS FMR flush operation and also it races with connect/reconect which happes a lot with RDS. FMR flush being on common rds_wq aggrevates the problem. Lets push RDS FMR pool flush work to its own worker. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 16:28:11 -07:00
Wengang Wang	6116c2030f	RDS: fix fmr pool dirty_count In rds_ib_flush_mr_pool(), dirty_count accounts the clean ones which is wrong. This can lead to a negative dirty count value. Lets fix it. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 16:28:11 -07:00
santosh.shilimkar@oracle.com	3f6b314303	RDS: Fix rds MR reference count in rds_rdma_unuse() rds_rdma_unuse() drops the mr reference count which it hasn't taken. Correct way of removing mr is to remove mr from the tree and then rdma_destroy_mr() it first, then rds_mr_put() to decrement its reference count. Whichever thread holds last reference will free the mr via rds_mr_put() This bug was triggering weird null pointer crashes. One if the trace for it is captured below. BUG: unable to handle kernel NULL pointer dereference at 0000000000000104 IP: [<ffffffffa0899471>] rds_ib_free_mr+0x31/0x130 [rds_rdma] PGD 4366fa067 PUD 4366f9067 PMD 0 Oops: 0000 [#1] SMP [...] task: ffff88046da6a000 ti: ffff88046da6c000 task.ti: ffff88046da6c000 RIP: 0010:[<ffffffffa0899471>] [<ffffffffa0899471>] rds_ib_free_mr+0x31/0x130 [rds_rdma] RSP: 0018:ffff88046fa43bd8 EFLAGS: 00010286 RAX: 0000000071d38b80 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880079e7ff40 RBP: ffff88046fa43bf8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff88046fa43ca8 R11: ffff88046a802ed8 R12: ffff880079e7fa40 R13: 0000000000000000 R14: ffff880079e7ff40 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88046fa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000104 CR3: 00000004366fb000 CR4: 00000000000006e0 Stack: ffff880079e7fa40 ffff880671d38f08 ffff880079e7ff40 0000000000000296 ffff88046fa43c28 ffffffffa087a38b ffff880079e7fa40 ffff880671d38f10 0000000000000000 0000000000000292 ffff88046fa43c48 ffffffffa087a3b6 Call Trace: <IRQ> [<ffffffffa087a38b>] rds_destroy_mr+0x8b/0xa0 [rds] [<ffffffffa087a3b6>] __rds_put_mr_final+0x16/0x30 [rds] [<ffffffffa087a492>] rds_rdma_unuse+0xc2/0x120 [rds] [<ffffffffa08766d3>] rds_recv_incoming_exthdrs+0x83/0xa0 [rds] [<ffffffffa0876782>] rds_recv_incoming+0x92/0x200 [rds] [<ffffffffa0895269>] rds_ib_process_recv+0x259/0x320 [rds_rdma] [<ffffffffa08962a8>] rds_ib_recv_tasklet_fn+0x1a8/0x490 [rds_rdma] [<ffffffff810dcd78>] ? __remove_hrtimer+0x58/0x90 [<ffffffff810799e1>] tasklet_action+0xb1/0xc0 [<ffffffff81079b52>] __do_softirq+0xe2/0x290 [<ffffffff81079df6>] irq_exit+0xa6/0xb0 [<ffffffff81613915>] do_IRQ+0x65/0xf0 [<ffffffff816118ab>] common_interrupt+0x6b/0x6b Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 16:28:10 -07:00
santosh.shilimkar@oracle.com	ba54d3ced9	RDS: fix the dangling reference to rds_ib_incoming_slab On rds_ib_frag_slab allocation failure, ensure rds_ib_incoming_slab is not pointing to the detsroyed memory. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 16:28:10 -07:00
David S. Miller	b01d04aa51	rds: Fix improper gfp_t usage. >> net/rds/ib_recv.c:382:28: sparse: incorrect type in initializer (different base types) net/rds/ib_recv.c:382:28: expected int [signed] can_wait net/rds/ib_recv.c:382:28: got restricted gfp_t net/rds/ib_recv.c:828:23: sparse: cast to restricted __le64 Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 15:54:25 -07:00
santosh.shilimkar@oracle.com	ae05368afa	RDS: check for valid cm_id before initiating connection Connection could have been dropped while the route is being resolved so check for valid cm_id before initiating the connection. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:31 -07:00
Mukesh Kacker	06e8941e22	RDS: return EMSGSIZE for oversize requests before processing/queueing rds_send_queue_rm() allows for the "current datagram" being queued to exceed SO_SNDBUF thresholds by checking bytes queued without counting in length of current datagram. (Since sk_sndbuf is set to twice requested SO_SNDBUF value as a kernel heuristic this is usually fine!) If this "current datagram" squeezing past the threshold is itself many times the size of the sk_sndbuf threshold itself then even twice the SO_SNDBUF does not save us and it gets queued but cannot be transmitted. Threads block and deadlock and device becomes unusable. The check for this datagram not exceeding SNDBUF thresholds (EMSGSIZE) is not done on this datagram as that check is only done if queueing attempt fails. (Datagrams that follow this datagram fail queueing attempts, go through the check and eventually trip EMSGSIZE error but zero length datagrams silently fail!) This fix moves the check for datagrams exceeding SNDBUF limits before any processing or queueing is attempted and returns EMSGSIZE early in the rds_sndmsg() code. This change also ensures that all datagrams get checked for exceeding SNDBUF/sk_sndbuf size limits and the large datagrams that exceed those limits do not get to rds_send_queue_rm() code for processing. Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:31 -07:00
santosh.shilimkar@oracle.com	dfcec251d2	RDS: make sure rds_send_drop_to properly takes the m_rs_lock rds_send_drop_to() is used during socket tear down to find all the messages on the socket and flush them . It can race with the acking code unless it takes the m_rs_lock on each and every message. This plugs a hole where we didn't take m_rs_lock on any message that didn't have the RDS_MSG_ON_CONN set. Taking m_rs_lock avoids double frees and other memory corruptions as the ack code trusts the message m_rs pointer on a socket that had actually been freed. We must take m_rs_lock to access m_rs. Because of lock nesting and rs access, we also need to acquire rs_lock. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:31 -07:00
Santosh Shilimkar	1c3be624f4	RDS: Don't destroy the rdma id until after we're done using it During connection resets, we are destroying the rdma id too soon. We can't destroy it when it is still in use. So lets move rdma_destroy_id() after we clear the rings. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:31 -07:00
santosh.shilimkar@oracle.com	5c240fa2ab	RDS: Fix assertion level from fatal to warning Fix the asserion level since its not fatal and can be hit in normal execution paths. There is no need to take the system down. We keep the WARN_ON() to detect the condition if we get here with bad pages. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:31 -07:00
santosh.shilimkar@oracle.com	3049147ca7	RDS: Make sure we do a signaled send for large-send WR(Work Requests )always generate a WC(Work Completion) with signaled send. Default RDS ib code is setup for un-signaled completion. Since RDS connction is persistent, we can end up sending the data even after large-send when the remote end is not active(for any reason). By doing a signaled send at least once per large-send, we can at least detect the problem in work completion handler there by avoiding sending more data to inactive remote. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com	4f73113c63	RDS: Mark message mapped before transmit rds_send_xmit() marks the rds message map flag after xmit_[rdma/atomic]() which is clearly wrong. We need to maintain the ownership between transport and rds. Also take care of error path. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com	0df5f9a68a	RDS: add a sock_destruct callback debug aid This helps to detect the accidental processes/apps trying to destroy the RDS socket which they are sharing with other processes/apps. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com	0c48424021	RDS: check for congestion updates during rds_send_xmit Ensure we don't keep sending the data if the link is congested. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com	73ce4317bf	RDS: make sure we post recv buffers If we get an ENOMEM during rds_ib_recv_refill, we might never come back and refill again later. Patch makes sure to kick krdsd into helping out. To achieve this we add RDS_RECV_REFILL flag and update in the refill path based on that so that at least some therad will keep posting receive buffers. Since krdsd and softirq both might race for refill, we decide to schedule on work queue based on ring_low instead of ring_empty. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:30 -07:00
santosh.shilimkar@oracle.com	e1f475a738	RDS: don't update ip address tables if the address hasn't changed If the ip address tables hasn't changed, there is no need to remove them only to be added back again. Lets fix it. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:29 -07:00
santosh.shilimkar@oracle.com	1bc7b863f2	RDS: destroy the ib state earlier during shutdown Destroy ib state early during shutdown. Otherwise we can get callbacks after the QP isn't really able to handle them. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:29 -07:00
santosh.shilimkar@oracle.com	43962dd7ee	RDS: always free recv frag as we free its ring entry We were still seeing rare occurrences of the WARN_ON(recv->r_frag) which indicates that the recv refill path was finding allocated frags in ring entries that were marked free. These were usually followed by OOM crashes. They only seem to be occurring in the presence of completion errors and connection resets. This patch ensures that we free the frag as we mark the ring entry free. This should stop the refill path from finding allocated frags in ring entries that were marked free. Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:29 -07:00
santosh.shilimkar@oracle.com	1d2e3f396c	RDS: restore return value in rds_cmsg_rdma_args() In rds_cmsg_rdma_args() 'ret' is used by rds_pin_pages() which returns number of pinned pages on success. And the same value is returned to the caller of rds_cmsg_rdma_args() on success which is not intended. Commit `f4a3fc03c1` ("RDS: Clean up error handling in rds_cmsg_rdma_args") removed the 'ret = 0' line which broke RDS RDMA mode. Fix it by restoring the return value on rds_pin_pages() success keeping the clean-up in place. Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-25 13:35:29 -07:00
David S. Miller	182ad468e7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/cavium/Kconfig The cavium conflict was overlapping dependency changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-13 16:23:11 -07:00
Sowmini Varadhan	467fa15356	RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns. Register pernet subsys init/stop functions that will set up and tear down per-net RDS-TCP listen endpoints. Unregister pernet subusys functions on 'modprobe -r' to clean up these end points. Enable keepalive on both accept and connect socket endpoints. The keepalive timer expiration will ensure that client socket endpoints will be removed as appropriate from the netns when an interface is removed from a namespace. Register a device notifier callback that will clean up all sockets (and thus avoid the need to wait for keepalive timeout) when the loopback device is unregistered from the netns indicating that the netns is getting deleted. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-07 11:29:58 -07:00
Sowmini Varadhan	d5a8ac28a7	RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net Open the sockets calling sock_create_kern() with the correct struct net pointer, and use that struct net pointer when verifying the address passed to rds_bind(). Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-07 11:29:57 -07:00
Dan Carpenter	468b732b6f	rds: fix an integer overflow test in rds_info_getsockopt() "len" is a signed integer. We check that len is not negative, so it goes from zero to INT_MAX. PAGE_SIZE is unsigned long so the comparison is type promoted to unsigned long. ULONG_MAX - 4095 is a higher than INT_MAX so the condition can never be true. I don't know if this is harmful but it seems safe to limit "len" to INT_MAX - 4095. Fixes: `a8c879a7ee` ('RDS: Info and stats') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-03 15:20:16 -07:00
Linus Torvalds	9090fdb9c9	Changes for 4.2-rc - Mainly fix-ups for the various 4.2 items -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJVpWPTAAoJELgmozMOVy/dRjkQALGpY+m0Q+dfS4geU+JvvH0/ tTNJevWA1lu5xIgzk8aW+RgNOt+IQV2K6XnKOdKWcdE2wg9Yg9/8Y7atxeNSLoB6 mdUfstmTHoDFRXea1wog45QVVv9ABqSXIOmFd/PQSf2pEHvwgmVaxO2KW5n0n71H 0G9nw28jx+Igd1IjX3UezbB4Rm5jeG6exQEM5KKQ/l/cD2Pt/DIr9OmTmkMhj4Ft 4ppC+JsDHZMT2mQljEB1UL6Va1BGgULn4JwZviXRGgTEFiS+rtUkWXy9CVeNIV+A bo6aHEBlO8zbj9xg0POkHGBP2XCkDpsc5EP6/zon30MLpOTpnnTp2zq/bhGpyr5w ytH1x6V8Od1Or9BrHygy1yLGaqaVthIuZMReLq0Bu26px8mTECdu8bUjwxEvbvep u51vQ75h99td5XtXBHULcd/I8mUb3JqbEHqig0ChLOcnOMh72KLql8qfhLvzrYSv gFMvnqUZWq4WYET8cKcWypj5+cIME4objOim6T/TG+zTCmgbPnhwBsgvujLhfhDy JhUjUfS+31S99I4smL7ceKQpfA/7awLuZf/C20zdLZO2wXnx7VmbNlhQA3mvssyt ZOXeLkLGcXDIT3egNj5VWQR/KlVtvUS9FNVslxqo4ItyZZwJtbHTwRAgATp3XRwD Po7E7bWoUQqpanhgiX4Q =N/5G -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma fixes from Doug Ledford: "Mainly fix-ups for the various 4.2 items" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (24 commits) IB/core: Destroy ocrdma_dev_id IDR on module exit IB/core: Destroy multcast_idr on module exit IB/mlx4: Optimize do_slave_init IB/mlx4: Fix memory leak in do_slave_init IB/mlx4: Optimize freeing of items on error unwind IB/mlx4: Fix use of flow-counters for process_mad IB/ipath: Convert use of __constant_<foo> to <foo> IB/ipoib: Set MTU to max allowed by mode when mode changes IB/ipoib: Scatter-Gather support in connected mode IB/ucm: Fix bitmap wrap when devnum > IB_UCM_MAX_DEVICES IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush IB/ucma: Fix lockdep warning in ucma_lock_files rds: rds_ib_device.refcount overflow RDMA/nes: Fix for incorrect recording of the MAC address RDMA/nes: Fix for resolving the neigh RDMA/core: Fixes for port mapper client registration IB/IPoIB: Fix bad error flow in ipoib_add_port() IB/mlx4: Do not attemp to report HCA clock offset on VFs IB/cm: Do not queue work to a device that's going away IB/srp: Avoid using uninitialized variable ...	2015-07-15 17:03:03 -07:00
Wengang Wang	4fabb59449	rds: rds_ib_device.refcount overflow Fixes: `3e0249f9c0` ("RDS/IB: add refcount tracking to struct rds_ib_device") There lacks a dropping on rds_ib_device.refcount in case rds_ib_alloc_fmr failed(mr pool running out). this lead to the refcount overflow. A complain in line 117(see following) is seen. From vmcore: s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448. That is the evidence the mr pool is used up. so rds_ib_alloc_fmr is very likely to return ERR_PTR(-EAGAIN). 115 void rds_ib_dev_put(struct rds_ib_device *rds_ibdev) 116 { 117 BUG_ON(atomic_read(&rds_ibdev->refcount) <= 0); 118 if (atomic_dec_and_test(&rds_ibdev->refcount)) 119 queue_work(rds_wq, &rds_ibdev->free_work); 120 } fix is to drop refcount when rds_ib_alloc_fmr failed. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-07-14 13:20:11 -04:00
Markus Elfring	f6e1c91666	net-RDS: Delete an unnecessary check before the function call "module_put" The module_put() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-07-03 09:27:42 -07:00
Linus Torvalds	e0456717e4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) Add TX fast path in mac80211, from Johannes Berg. 2) Add TSO/GRO support to ibmveth, from Thomas Falcon 3) Move away from cached routes in ipv6, just like ipv4, from Martin KaFai Lau. 4) Lots of new rhashtable tests, from Thomas Graf. 5) Run ingress qdisc lockless, from Alexei Starovoitov. 6) Allow servers to fetch TCP packet headers for SYN packets of new connections, for fingerprinting. From Eric Dumazet. 7) Add mode parameter to pktgen, for testing receive. From Alexei Starovoitov. 8) Cache access optimizations via simplifications of build_skb(), from Alexander Duyck. 9) Move page frag allocator under mm/, also from Alexander. 10) Add xmit_more support to hv_netvsc, from KY Srinivasan. 11) Add a counter guard in case we try to perform endless reclassify loops in the packet scheduler. 12) Extern flow dissector to be programmable and use it in new "Flower" classifier. From Jiri Pirko. 13) AF_PACKET fanout rollover fixes, performance improvements, and new statistics. From Willem de Bruijn. 14) Add netdev driver for GENEVE tunnels, from John W Linville. 15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso. 16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet. 17) Add an ECN retry fallback for the initial TCP handshake, from Daniel Borkmann. 18) Add tail call support to BPF, from Alexei Starovoitov. 19) Add several pktgen helper scripts, from Jesper Dangaard Brouer. 20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa. 21) Favor even port numbers for allocation to connect() requests, and odd port numbers for bind(0), in an effort to help avoid ip_local_port_range exhaustion. From Eric Dumazet. 22) Add Cavium ThunderX driver, from Sunil Goutham. 23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata, from Alexei Starovoitov. 24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai. 25) Double TCP Small Queues default to 256K to accomodate situations like the XEN driver and wireless aggregation. From Wei Liu. 26) Add more entropy inputs to flow dissector, from Tom Herbert. 27) Add CDG congestion control algorithm to TCP, from Kenneth Klette Jonassen. 28) Convert ipset over to RCU locking, from Jozsef Kadlecsik. 29) Track and act upon link status of ipv4 route nexthops, from Andy Gospodarek. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits) bridge: vlan: flush the dynamically learned entries on port vlan delete bridge: multicast: add a comment to br_port_state_selection about blocking state net: inet_diag: export IPV6_V6ONLY sockopt stmmac: troubleshoot unexpected bits in des0 & des1 net: ipv4 sysctl option to ignore routes when nexthop link is down net: track link-status of ipv4 nexthops net: switchdev: ignore unsupported bridge flags net: Cavium: Fix MAC address setting in shutdown state drivers: net: xgene: fix for ACPI support without ACPI ip: report the original address of ICMP messages net/mlx5e: Prefetch skb data on RX net/mlx5e: Pop cq outside mlx5e_get_cqe net/mlx5e: Remove mlx5e_cq.sqrq back-pointer net/mlx5e: Remove extra spaces net/mlx5e: Avoid TX CQE generation if more xmit packets expected net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq() net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device ...	2015-06-24 16:49:49 -07:00
Fabian Frederick	d2a9ec6472	net: rds: use for_each_sg() for scatterlist parsing This patch also renames sg to sglist and aligns function parameters. See Documentation/DMA-API.txt - Part Id for scatterlist details Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-21 09:32:08 -07:00
Matan Barak	8e37210b38	IB/core: Change ib_create_cq to use struct ib_cq_init_attr Currently, ib_create_cq uses cqe and comp_vecotr instead of the extendible ib_cq_init_attr struct. Earlier patches already changed the vendors to work with ib_cq_init_attr. This patch changes the consumers too. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-06-12 14:49:10 -04:00
Doug Ledford	b806ef3bbe	Merge branch 'for-4.2-misc' into k.o/for-4.2	2015-06-02 09:33:22 -04:00
Wengang Wang	d655a9fbc8	rds: re-entry of rds_ib_xmit/rds_iw_xmit The BUG_ON at line 452/453 is triggered in function rds_send_xmit. 441 while (ret) { 442 tmp = min_t(int, ret, sg->length - 443 conn->c_xmit_data_off); 444 conn->c_xmit_data_off += tmp; 445 ret -= tmp; 446 if (conn->c_xmit_data_off == sg->length) { 447 conn->c_xmit_data_off = 0; 448 sg++; 449 conn->c_xmit_sg++; 450 if (ret != 0 && conn->c_xmit_sg == rm->data.op_nents) 451 printk(KERN_ERR "conn %p rm %p sg %p ret %d\n", conn, rm, sg, ret); 452 BUG_ON(ret != 0 && 453 conn->c_xmit_sg == rm->data.op_nents); 454 } 455 } it is complaining the total sent length is bigger that we want to send. rds_ib_xmit() is wrong for the second entry for the same rds_message returning wrong value. the sg and off passed by rds_send_xmit to rds_ib_xmit is based on scatterlist.offset/length, but the rds_ib_xmit action is based on scatterlist.dma_address/dma_length. in case dma_length is larger than length there is problem. for the 2nd and later entries of rds_ib_xmit for same rds_message, at least one of the following two is wrong: 1) the scatterlist to start with, the choosen one can far beyond the correct one. 2) the offset to start with within the scatterlist. fix: add op_dmasg and op_dmaoff to rm_data_op structure indicating the scatterlist and offset within the it to start with for rds_ib_xmit respectively. op_dmasg and op_dmaoff are initialized to zero when doing dma mapping for the first see of the message and are changed when filling send slots. the same applies to rds_iw_xmit too. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-06-02 09:22:31 -04:00
Sowmini Varadhan	8ba38460f3	net/rds Add getsockopt support for SO_RDS_TRANSPORT The currently attached transport for a PF_RDS socket may be obtained from user space by invoking getsockopt(2) using the SO_RDS_TRANSPORT option at the SOL_RDS level. The integer optval returned will be one of the RDS_TRANS_* constants defined in linux/rds.h. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:47:23 -07:00
Sowmini Varadhan	d97dac54bf	net/rds: Add setsockopt support for SO_RDS_TRANSPORT An application may deterministically attach the underlying transport for a PF_RDS socket by invoking setsockopt(2) with the SO_RDS_TRANSPORT option at the SOL_RDS level. The integer argument to setsockopt must be one of the RDS_TRANS_* transport types, e.g., RDS_TRANS_TCP. The option must be specified before invoking bind(2) on the socket, and may only be used once on the socket. An attempt to set the option on a bound socket, or to invoke the option after a successful SO_RDS_TRANSPORT attachment, will return EOPNOTSUPP. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:47:23 -07:00
Sowmini Varadhan	a28c257c9e	net/rds: Declare SO_RDS_TRANSPORT and RDS_TRANS_* constants in uapi/linux/rds.h User space applications that desire to explicitly select the underlying transport for a PF_RDS socket may do so by using the SO_RDS_TRANSPORT socket option at the SOL_RDS level before bind(). The integer argument provided to the socket option would be one of the RDS_TRANS_* values, e.g., RDS_TRANS_TCP. This commit exports the constant values need by such applications via <linux/rds.h> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-31 21:47:23 -07:00
Sagi Grimberg	3c88f3dcff	RDS: Switch to generic logging helpers Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>	2015-05-18 13:44:23 -04:00
David S. Miller	b04096ff33	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Four minor merge conflicts: 1) qca_spi.c renamed the local variable used for the SPI device from spi_device to spi, meanwhile the spi_set_drvdata() call got moved further up in the probe function. 2) Two changes were both adding new members to codel params structure, and thus we had overlapping changes to the initializer function. 3) 'net' was making a fix to sk_release_kernel() which is completely removed in 'net-next'. 4) In net_namespace.c, the rtnl_net_fill() call for GET operations had the command value fixed, meanwhile 'net-next' adjusted the argument signature a bit. This also matches example merge resolutions posted by Stephen Rothwell over the past two days. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 14:31:43 -04:00
Eric W. Biederman	11aa9c28b4	net: Pass kern from net_proto_family.create to sk_alloc In preparation for changing how struct net is refcounted on kernel sockets pass the knowledge that we are creating a kernel socket from sock_create_kern through to sk_alloc. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 10:50:17 -04:00
Sowmini Varadhan	c82ac7e69e	net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket. When the peer of an RDS-TCP connection restarts, a reconnect attempt should only be made from the active side of the TCP connection, i.e. the side that has a transient TCP port number. Do not add the passive side of the TCP connection to the c_hash_node and thus avoid triggering rds_queue_reconnect() for passive rds connections. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-09 16:03:28 -04:00
Sowmini Varadhan	f711a6ae06	net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection. When running RDS over TCP, the active (client) side connects to the listening ("passive") side at the RDS_TCP_PORT. After the connection is established, if the client side reboots (potentially without even sending a FIN) the server still has a TCP socket in the esablished state. If the server-side now gets a new SYN comes from the client with a different client port, TCP will create a new socket-pair, but the RDS layer will incorrectly pull up the old rds_connection (which is still associated with the stale t_sock and RDS socket state). This patch corrects this behavior by having rds_tcp_accept_one() always create a new connection for an incoming TCP SYN. The rds and tcp state associated with the old socket-pair is cleaned up via the rds_tcp_state_change() callback which would typically be invoked in most cases when the client-TCP sends a FIN on TCP restart, triggering a transition to CLOSE_WAIT state. In the rarer event of client death without a FIN, TCP_KEEPALIVE probes on the socket will detect the stale socket, and the TCP transition to CLOSE state will trigger the RDS state cleanup. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-09 16:03:27 -04:00
David Ahern	e2783717a7	net/rds: Fix new sparse warning `c0adf54a10` introduced new sparse warnings: CHECK /home/dahern/kernels/linux.git/net/rds/ib_cm.c net/rds/ib_cm.c:191:34: warning: incorrect type in initializer (different base types) net/rds/ib_cm.c:191:34: expected unsigned long long [unsigned] [usertype] dp_ack_seq net/rds/ib_cm.c:191:34: got restricted __be64 <noident> net/rds/ib_cm.c:194:51: warning: cast to restricted __be64 The temporary variable for sequence number should have been declared as __be64 rather than u64. Make it so. Signed-off-by: David Ahern <david.ahern@oracle.com> Cc: shamir rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-04 15:23:06 -04:00
shamir rabinovitch	c0adf54a10	net/rds: fix unaligned memory access rdma_conn_param private data is copied using memcpy after headers such as cma_hdr (see cma_resolve_ib_udp as example). so the start of the private data is aligned to the end of the structure that come before. if this structure end with u32 the meaning is that the start of the private data will be 4 bytes aligned. structures that use u8/u16/u32/u64 are naturally aligned but in case the structure start is not 8 bytes aligned, all u64 members of this structure will not be aligned. to solve this issue we must use special macros that allow unaligned access to those unaligned members. Addresses the following kernel log seen when attempting to use RDMA: Kernel unaligned access at TPC[10507a88] rds_ib_cm_connect_complete+0x1bc/0x1e0 [rds_rdma] Acked-by: Chien Yen <chien.yen@oracle.com> Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com> [Minor tweaks for top of tree by:] Signed-off-by: David Ahern <david.ahern@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-03 23:39:01 -04:00
David S. Miller	87ffabb1f0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net The dwmac-socfpga.c conflict was a case of a bug fix overlapping changes in net-next to handle an error pointer differently. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-14 15:44:14 -04:00
Sowmini Varadhan	443be0e5af	RDS: make sure not to loop forever inside rds_send_xmit If a determined set of concurrent senders keep the send queue full, we can loop forever inside rds_send_xmit. This fix has two parts. First we are dropping out of the while(1) loop after we've processed a large batch of messages. Second we add a generation number that gets bumped each time the xmit bit lock is acquired. If someone else has jumped in and made progress in the queue, we skip our goto restart. Original patch by Chris Mason. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-08 15:17:32 -04:00
Sowmini Varadhan	1789b2c077	RDS: only use passive connections when addresses match Passive connections were added for the case where one loopback IB connection between identical addresses needs another connection to store the second QP. Unfortunately, they were also created in the case where the addesses differ and we already have both QPs. This lead to a message reordering bug. - two different IB interfaces and addresses on a machine: A B - traffic is sent from A to B - connection from A-B is created, connect request sent - listening accepts connect request, B-A is created - traffic flows, next_rx is incremented - unacked messages exist on the retrans list - connection A-B is shut down, new connect request sent - listen sees existing loopback B-A, creates new passive B-A - retrans messages are sent and delivered because of 0 next_rx The problem is that the second connection request saw the previously existing parent connection. Instead of using it, and using the existing next_rx_seq state for the traffic between those IPs, it mistakenly thought that it had to create a passive connection. We fix this by only using passive connections in the special case where laddr and faddr match. In this case we'll only ever have one parent sending connection requests and one passive connection created as the listening path sees the existing parent connection which initiated the request. Original patch by Zach Brown Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-08 15:17:23 -04:00
David S. Miller	0fa74a4be4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/emulex/benet/be_main.c net/core/sysctl_net_core.c net/ipv4/inet_diag.c The be_main.c conflict resolution was really tricky. The conflict hunks generated by GIT were very unhelpful, to say the least. It split functions in half and moved them around, when the real actual conflict only existed solely inside of one function, that being be_map_pci_bars(). So instead, to resolve this, I checked out be_main.c from the top of net-next, then I applied the be_main.c changes from 'net' since the last time I merged. And this worked beautifully. The inet_diag.c and sysctl_net_core.c conflicts were simple overlapping changes, and were easily to resolve. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-20 18:51:09 -04:00
Arnd Bergmann	f862e07cf9	rds: avoid potential stack overflow The rds_iw_update_cm_id function stores a large 'struct rds_sock' object on the stack in order to pass a pair of addresses. This happens to just fit withint the 1024 byte stack size warning limit on x86, but just exceed that limit on ARM, which gives us this warning: net/rds/iw_rdma.c:200:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=] As the use of this large variable is basically bogus, we can rearrange the code to not do that. Instead of passing an rds socket into rds_iw_get_device, we now just pass the two addresses that we have available in rds_iw_update_cm_id, and we change rds_iw_get_mr accordingly, to create two address structures on the stack there. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 00:28:01 -04:00
Ying Xue	1b78414047	net: Remove iocb argument from sendmsg and recvmsg After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 13:06:31 -05:00
Sowmini Varadhan	80ad0d4a7a	rds: rds_cong_queue_updates needs to defer the congestion update transmission When the RDS transport is TCP, we cannot inline the call to rds_send_xmit from rds_cong_queue_update because (a) we are already holding the sock_lock in the recv path, and will deadlock when tcp_setsockopt/tcp_sendmsg try to get the sock lock (b) cong_queue_update does an irqsave on the rds_cong_lock, and this will trigger warnings (for a good reason) from functions called out of sock_lock. This patch reverts the change introduced by `2fa57129d` ("RDS: Bypass workqueue when queueing cong updates"). The patch has been verified for both RDS/TCP as well as RDS/RDMA to ensure that there are not regressions for either transport: - for verification of RDS/TCP a client-server unit-test was used, with the server blocked in gdb and thus unable to drain its rcvbuf, eventually triggering a RDS congestion update. - for RDS/RDMA, the standard IB regression tests were used Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-11 14:35:44 -08:00
Sowmini Varadhan	d0a47d3272	rds: Make rds_message_copy_from_user() return 0 on success. Commit `083735f4b0` ("rds: switch rds_message_copy_from_user() to iov_iter") breaks rds_message_copy_from_user() semantics on success, and causes it to return nbytes copied, when it should return 0. This commit fixes that bug. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-07 22:41:56 -08:00
Rasmus Villemoes	11ac11999b	net: rds: Remove repeated function names from debug output The macro rdsdebug is defined as pr_debug("%s(): " fmt, __func__ , ##args) Hence it doesn't make sense to include the name of the calling function explicitly in the format string passed to rdsdebug. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-07 22:41:56 -08:00
Sasha Levin	db27ebb111	net: rds: use correct size for max unacked packets and bytes Max unacked packets/bytes is an int while sizeof(long) was used in the sysctl table. This means that when they were getting read we'd also leak kernel memory to userspace along with the timeout values. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-04 16:07:27 -08:00
Geert Uytterhoeven	6ff4a8ad4b	rds: Fix min() warning in rds_message_inc_copy_to_user() net/rds/message.c: In function ‘rds_message_inc_copy_to_user’: net/rds/message.c:328: warning: comparison of distinct pointer types lacks a cast Use min_t(unsigned long, ...) like is done in rds_message_copy_from_user(). Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-15 11:49:09 -05:00
Gu Zheng	f95b414edb	net: introduce helper macro for_each_cmsghdr Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating cmsghdr from msghdr, just cleanup. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-10 22:41:55 -05:00
Al Viro	c0371da604	put iov_iter into msghdr Note that the code _using_ ->msg_iter at that point will be very unhappy with anything other than unshifted iovec-backed iov_iter. We still need to convert users to proper primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:29:03 -05:00
Al Viro	083735f4b0	rds: switch rds_message_copy_from_user() to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-24 05:16:43 -05:00
Al Viro	c310e72c89	rds: switch ->inc_copy_to_user() to passing iov_iter instances get considerably simpler from that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-24 05:16:43 -05:00
Linus Torvalds	2e923b0251	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Include fixes for netrom and dsa (Fabian Frederick and Florian Fainelli) 2) Fix FIXED_PHY support in stmmac, from Giuseppe CAVALLARO. 3) Several SKB use after free fixes (vxlan, openvswitch, vxlan, ip_tunnel, fou), from Li ROngQing. 4) fec driver PTP support fixes from Luwei Zhou and Nimrod Andy. 5) Use after free in virtio_net, from Michael S Tsirkin. 6) Fix flow mask handling for megaflows in openvswitch, from Pravin B Shelar. 7) ISDN gigaset and capi bug fixes from Tilman Schmidt. 8) Fix route leak in ip_send_unicast_reply(), from Vasily Averin. 9) Fix two eBPF JIT bugs on x86, from Alexei Starovoitov. 10) TCP_SKB_CB() reorganization caused a few regressions, fixed by Cong Wang and Eric Dumazet. 11) Don't overwrite end of SKB when parsing malformed sctp ASCONF chunks, from Daniel Borkmann. 12) Don't call sock_kfree_s() with NULL pointers, this function also has the side effect of adjusting the socket memory usage. From Cong Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits) bna: fix skb->truesize underestimation net: dsa: add includes for ethtool and phy_fixed definitions openvswitch: Set flow-key members. netrom: use linux/uaccess.h dsa: Fix conversion from host device to mii bus tipc: fix bug in bundled buffer reception ipv6: introduce tcp_v6_iif() sfc: add support for skb->xmit_more r8152: return -EBUSY for runtime suspend ipv4: fix a potential use after free in fou.c ipv4: fix a potential use after free in ip_tunnel_core.c hyperv: Add handling of IP header with option field in netvsc_set_hash() openvswitch: Create right mask with disabled megaflows vxlan: fix a free after use openvswitch: fix a use after free ipv4: dst_entry leak in ip_send_unicast_reply() ipv4: clean up cookie_v4_check() ipv4: share tcp_v4_save_options() with cookie_v4_check() ipv4: call __ip_options_echo() in cookie_v4_check() atm: simplify lanai.c by using module_pci_driver ...	2014-10-18 09:31:37 -07:00
Linus Torvalds	0429fbc0bd	Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...	2014-10-15 07:48:18 +02:00
Cong Wang	dee49f203a	rds: avoid calling sock_kfree_s() on allocation failure It is okay to free a NULL pointer but not okay to mischarge the socket optmem accounting. Compile test only. Reported-by: rucsoftsec@gmail.com Cc: Chien Yen <chien.yen@oracle.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-14 17:00:19 -04:00
Linus Torvalds	35a9ad8af0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entires. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net Adding support is almost trivial, so export more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF now can load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, From Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...	2014-10-08 21:40:54 -04:00
Herton R. Krzesinski	593cbb3ec6	net/rds: fix possible double free on sock tear down I got a report of a double free happening at RDS slab cache. One suspicion was that may be somewhere we were doing a sock_hold/sock_put on an already freed sock. Thus after providing a kernel with the following change: static inline void sock_hold(struct sock *sk) { - atomic_inc(&sk->sk_refcnt); + if (!atomic_inc_not_zero(&sk->sk_refcnt)) + WARN(1, "Trying to hold sock already gone: %p (family: %hd)\n", + sk, sk->sk_family); } The warning successfuly triggered: Trying to hold sock already gone: ffff81f6dda61280 (family: 21) WARNING: at include/net/sock.h:350 sock_hold() Call Trace: <IRQ> [<ffffffff8adac135>] :rds:rds_send_remove_from_sock+0xf0/0x21b [<ffffffff8adad35c>] :rds:rds_send_drop_acked+0xbf/0xcf [<ffffffff8addf546>] :rds_rdma:rds_ib_recv_tasklet_fn+0x256/0x2dc [<ffffffff8009899a>] tasklet_action+0x8f/0x12b [<ffffffff800125a2>] __do_softirq+0x89/0x133 [<ffffffff8005f30c>] call_softirq+0x1c/0x28 [<ffffffff8006e644>] do_softirq+0x2c/0x7d [<ffffffff8006e4d4>] do_IRQ+0xee/0xf7 [<ffffffff8005e625>] ret_from_intr+0x0/0xa <EOI> Looking at the call chain above, the only way I think this would be possible is if somewhere we already released the same socket->sock which is assigned to the rds_message at rds_send_remove_from_sock. Which seems only possible to happen after the tear down done on rds_release. rds_release properly calls rds_send_drop_to to drop the socket from any rds_message, and some proper synchronization is in place to avoid race with rds_send_drop_acked/rds_send_remove_from_sock. However, I still see a very narrow window where it may be possible we touch a sock already released: when rds_release races with rds_send_drop_acked, we check RDS_MSG_ON_CONN to avoid cleanup on the same rds_message, but in this specific case we don't clear rm->m_rs. In this case, it seems we could then go on at rds_send_drop_to and after it returns, the sock is freed by last sock_put on rds_release, with concurrently we being at rds_send_remove_from_sock; then at some point in the loop at rds_send_remove_from_sock we process an rds_message which didn't have rm->m_rs unset for a freed sock, and a possible sock_hold on an sock already gone at rds_release happens. This hopefully address the described condition above and avoids a double free on "second last" sock_put. In addition, I removed the comment about socket destruction on top of rds_send_drop_acked: we call rds_send_drop_to in rds_release and we should have things properly serialized there, thus I can't see the comment being accurate there. Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-03 12:52:00 -07:00
Herton R. Krzesinski	eb74cc97b8	net/rds: do proper house keeping if connection fails in rds_tcp_conn_connect I see two problems if we consider the sock->ops->connect attempt to fail in rds_tcp_conn_connect. The first issue is that for example we don't remove the previously added rds_tcp_connection item to rds_tcp_tc_list at rds_tcp_set_callbacks, which means that on a next reconnect attempt for the same rds_connection, when rds_tcp_conn_connect is called we can again call rds_tcp_set_callbacks, resulting in duplicated items on rds_tcp_tc_list, leading to list corruption: to avoid this just make sure we call properly rds_tcp_restore_callbacks before we exit. The second issue is that we should also release the sock properly, by setting sock = NULL only if we are returning without error. Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-03 12:51:59 -07:00
Herton R. Krzesinski	310886dd5f	net/rds: call rds_conn_drop instead of open code it at rds_connect_complete Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-03 12:51:59 -07:00
Jesper Dangaard Brouer	d7cdb96808	treewide: fix synchronize_rcu() in comments Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-08-28 15:01:24 +02:00
Christoph Lameter	903ceff7ca	net: Replace get_cpu_var through this_cpu_ptr Replace uses of get_cpu_var for address calculation through this_cpu_ptr. Cc: netdev@vger.kernel.org Cc: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2014-08-26 13:45:47 -04:00
Linus Torvalds	f9da455b93	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...	2014-06-12 14:27:40 -07:00
Himangi Saraogi	01728371dc	rds/tcp_listen: Replace comma with semicolon This patch replaces a comma between expression statements by a semicolon. A simplified version of the semantic patch that performs this transformation is as follows: // <smpl> @r@ expression e1,e2,e; type T; identifier i; @@ e1 -, +; e2; // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-30 17:48:58 -07:00
Himangi Saraogi	cc2afe9fe2	RDS/RDMA: Replace comma with semicolon This patch replaces a comma between expression statements by a semicolon. A simplified version of the semantic patch that performs this transformation is as follows: // <smpl> @r@ expression e1,e2,e; type T; identifier i; @@ e1 -, +; e2; // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-30 17:48:58 -07:00
Manuel Schölling	71fd762f2e	net: rds: Use time_after() for time comparison To be future-proof and for better readability the time comparisons are modified to use time_after() instead of raw math. Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-18 21:24:52 -04:00
wangweidong	be7faf7168	rds: remove the unneed NULL checking unregister_net_sysctl_table will check the ctl_table_header, so remove the unneed checking Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-09 15:59:45 -04:00
Peter Zijlstra	4e857c58ef	arch: Mass conversion of smp_mb__() Mostly scripted conversion of the smp_mb__ barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-04-18 14:20:48 +02:00
David S. Miller	676d23690f	net: Fix use after free by removing length arg from sk_data_ready callbacks. Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-04-11 16:15:36 -04:00
Sasha Levin	bf39b4247b	rds: prevent dereference of a NULL device in rds_iw_laddr_check Binding might result in a NULL device which is later dereferenced without checking. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-31 16:25:52 -04:00
Steffen Hurrle	342dfc306f	net: add build-time checks for msg->msg_name size This is a follow-up patch to `f3d3342602` ("net: rework recvmsg handler msg_name and msg_namelen logic"). DECLARE_SOCKADDR validates that the structure we use for writing the name information to is not larger than the buffer which is reserved for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR consistently in sendmsg code paths. Signed-off-by: Steffen Hurrle <steffen@hurrle.net> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-18 23:04:16 -08:00
David S. Miller	4180442058	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c net/ipv4/tcp_metrics.c Overlapping changes between the "don't create two tcp metrics objects with the same key" race fix in net and the addition of the destination address in the lookup key in net-next. Minor overlapping changes in bnx2x driver. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-18 00:55:41 -08:00
Gerald Schaefer	c196403b79	net: rds: fix per-cpu helper usage commit `ae4b46e9d` "net: rds: use this_cpu_* per-cpu helper" broke per-cpu handling for rds. chpfirst is the result of __this_cpu_read(), so it is an absolute pointer and not __percpu. Therefore, __this_cpu_write() should not operate on chpfirst, but rather on cache->percpu->first, just like __this_cpu_read() did before. Cc: <stable@vger.kernel.org> # 3.8+ Signed-off-byd Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-17 17:52:22 -08:00
Aruna-Hewapathirane	63862b5bef	net: replace macros net_random and net_srandom with direct calls to prandom This patch removes the net_random and net_srandom macros and replaces them with direct calls to the prandom ones. As new commits only seem to use prandom_u32 there is no use to keep them around. This change makes it easier to grep for users of prandom_u32. Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-14 15:15:25 -08:00
Sasha Levin	c2349758ac	rds: prevent dereference of a NULL device Binding might result in a NULL device, which is dereferenced causing this BUG: [ 1317.260548] BUG: unable to handle kernel NULL pointer dereference at 000000000000097 4 [ 1317.261847] IP: [<ffffffff84225f52>] rds_ib_laddr_check+0x82/0x110 [ 1317.263315] PGD 418bcb067 PUD 3ceb21067 PMD 0 [ 1317.263502] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 1317.264179] Dumping ftrace buffer: [ 1317.264774] (ftrace buffer empty) [ 1317.265220] Modules linked in: [ 1317.265824] CPU: 4 PID: 836 Comm: trinity-child46 Tainted: G W 3.13.0-rc4- next-20131218-sasha-00013-g2cebb9b-dirty #4159 [ 1317.267415] task: ffff8803ddf33000 ti: ffff8803cd31a000 task.ti: ffff8803cd31a000 [ 1317.268399] RIP: 0010:[<ffffffff84225f52>] [<ffffffff84225f52>] rds_ib_laddr_check+ 0x82/0x110 [ 1317.269670] RSP: 0000:ffff8803cd31bdf8 EFLAGS: 00010246 [ 1317.270230] RAX: 0000000000000000 RBX: ffff88020b0dd388 RCX: 0000000000000000 [ 1317.270230] RDX: ffffffff8439822e RSI: 00000000000c000a RDI: 0000000000000286 [ 1317.270230] RBP: ffff8803cd31be38 R08: 0000000000000000 R09: 0000000000000000 [ 1317.270230] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [ 1317.270230] R13: 0000000054086700 R14: 0000000000a25de0 R15: 0000000000000031 [ 1317.270230] FS: 00007ff40251d700(0000) GS:ffff88022e200000(0000) knlGS:000000000000 0000 [ 1317.270230] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1317.270230] CR2: 0000000000000974 CR3: 00000003cd478000 CR4: 00000000000006e0 [ 1317.270230] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1317.270230] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602 [ 1317.270230] Stack: [ 1317.270230] 0000000054086700 5408670000a25de0 5408670000000002 0000000000000000 [ 1317.270230] ffffffff84223542 00000000ea54c767 0000000000000000 ffffffff86d26160 [ 1317.270230] ffff8803cd31be68 ffffffff84223556 ffff8803cd31beb8 ffff8800c6765280 [ 1317.270230] Call Trace: [ 1317.270230] [<ffffffff84223542>] ? rds_trans_get_preferred+0x42/0xa0 [ 1317.270230] [<ffffffff84223556>] rds_trans_get_preferred+0x56/0xa0 [ 1317.270230] [<ffffffff8421c9c3>] rds_bind+0x73/0xf0 [ 1317.270230] [<ffffffff83e4ce62>] SYSC_bind+0x92/0xf0 [ 1317.270230] [<ffffffff812493f8>] ? context_tracking_user_exit+0xb8/0x1d0 [ 1317.270230] [<ffffffff8119313d>] ? trace_hardirqs_on+0xd/0x10 [ 1317.270230] [<ffffffff8107a852>] ? syscall_trace_enter+0x32/0x290 [ 1317.270230] [<ffffffff83e4cece>] SyS_bind+0xe/0x10 [ 1317.270230] [<ffffffff843a6ad0>] tracesys+0xdd/0xe2 [ 1317.270230] Code: 00 8b 45 cc 48 8d 75 d0 48 c7 45 d8 00 00 00 00 66 c7 45 d0 02 00 89 45 d4 48 89 df e8 78 49 76 ff 41 89 c4 85 c0 75 0c 48 8b 03 <80> b8 74 09 00 00 01 7 4 06 41 bc 9d ff ff ff f6 05 2a b6 c2 02 [ 1317.270230] RIP [<ffffffff84225f52>] rds_ib_laddr_check+0x82/0x110 [ 1317.270230] RSP <ffff8803cd31bdf8> [ 1317.270230] CR2: 0000000000000974 Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-27 12:33:58 -05:00
Venkat Venkatsubra	18fc25c94e	rds: prevent BUG_ON triggered on congestion update to loopback After congestion update on a local connection, when rds_ib_xmit returns less bytes than that are there in the message, rds_send_xmit calls back rds_ib_xmit with an offset that causes BUG_ON(off & RDS_FRAG_SIZE) to trigger. For a 4Kb PAGE_SIZE rds_ib_xmit returns min(8240,4096)=4096 when actually the message contains 8240 bytes. rds_send_xmit thinks there is more to send and calls rds_ib_xmit again with a data offset "off" of 4096-48(rds header) =4048 bytes thus hitting the BUG_ON(off & RDS_FRAG_SIZE) [RDS_FRAG_SIZE=4k]. The commit `6094628bfd` "rds: prevent BUG_ON triggering on congestion map updates" introduced this regression. That change was addressing the triggering of a different BUG_ON in rds_send_xmit() on PowerPC architecture with 64Kbytes PAGE_SIZE: BUG_ON(ret != 0 && conn->c_xmit_sg == rm->data.op_nents); This was the sequence it was going through: (rds_ib_xmit) /* Do not send cong updates to IB loopback */ if (conn->c_loopback && rm->m_inc.i_hdr.h_flags & RDS_FLAG_CONG_BITMAP) { rds_cong_map_updated(conn->c_fcong, ~(u64) 0); return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES; } rds_ib_xmit returns 8240 rds_send_xmit: c_xmit_data_off = 0 + 8240 - 48 (rds header accounted only the first time) = 8192 c_xmit_data_off < 65536 (sg->length), so calls rds_ib_xmit again rds_ib_xmit returns 8240 rds_send_xmit: c_xmit_data_off = 8192 + 8240 = 16432, calls rds_ib_xmit again and so on (c_xmit_data_off 24672,32912,41152,49392,57632) rds_ib_xmit returns 8240 On this iteration this sequence causes the BUG_ON in rds_send_xmit: while (ret) { tmp = min_t(int, ret, sg->length - conn->c_xmit_data_off); [tmp = 65536 - 57632 = 7904] conn->c_xmit_data_off += tmp; [c_xmit_data_off = 57632 + 7904 = 65536] ret -= tmp; [ret = 8240 - 7904 = 336] if (conn->c_xmit_data_off == sg->length) { conn->c_xmit_data_off = 0; sg++; conn->c_xmit_sg++; BUG_ON(ret != 0 && conn->c_xmit_sg == rm->data.op_nents); [c_xmit_sg = 1, rm->data.op_nents = 1] What the current fix does: Since the congestion update over loopback is not actually transmitted as a message, all that rds_ib_xmit needs to do is let the caller think the full message has been transmitted and not return partial bytes. It will return 8240 (RDS_CONG_MAP_BYTES+48) when PAGE_SIZE is 4Kb. And 64Kb+48 when page size is 64Kb. Reported-by: Josh Hunt <joshhunt00@gmail.com> Tested-by: Honggang Li <honli@redhat.com> Acked-by: Bang Nguyen <bang.nguyen@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-03 11:54:18 -05:00
Hannes Frederic Sowa	f3d3342602	net: rework recvmsg handler msg_name and msg_namelen logic This patch now always passes msg->msg_namelen as 0. recvmsg handlers must set msg_namelen to the proper size <= sizeof(struct sockaddr_storage) to return msg_name to the user. This prevents numerous uninitialized memory leaks we had in the recvmsg handlers and makes it harder for new code to accidentally leak uninitialized memory. Optimize for the case recvfrom is called with NULL as address. We don't need to copy the address at all, so set it to NULL before invoking the recvmsg handler. We can do so, because all the recvmsg handlers must cope with the case a plain read() is called on them. read() also sets msg_name to NULL. Also document these changes in include/linux/net.h as suggested by David Miller. Changes since RFC: Set msg->msg_name = NULL if user specified a NULL in msg_name but had a non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't affect sendto as it would bail out earlier while trying to copy-in the address. It also more naturally reflects the logic by the callers of verify_iovec. With this change in place I could remove " if (!uaddr \|\| msg_sys->msg_namelen == 0) msg->msg_name = NULL ". This change does not alter the user visible error logic as we ignore msg_namelen as long as msg_name is NULL. Also remove two unnecessary curly brackets in ___sys_recvmsg and change comments to netdev style. Cc: David Miller <davem@davemloft.net> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-11-20 21:52:30 -05:00
Hannes Frederic Sowa	1bbdceef1e	inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once Initialize the ehash and ipv6_hash_secrets with net_get_random_once. Each compilation unit gets its own secret now: ipv4/inet_hashtables.o ipv4/udp.o ipv6/inet6_hashtables.o ipv6/udp.o rds/connection.o The functions still get inlined into the hashing functions. In the fast path we have at most two (needed in ipv6) if (unlikely(...)). Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-19 19:45:35 -04:00
Hannes Frederic Sowa	65cd8033ff	ipv4: split inet_ehashfn to hash functions per compilation unit This duplicates a bit of code but let's us easily introduce separate secret keys later. The separate compilation units are ipv4/inet_hashtabbles.o, ipv4/udp.o and rds/connection.o. Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-19 19:45:34 -04:00
Joe Perches	c1b1203d65	net: misc: Remove extern from function prototypes There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-19 19:12:11 -04:00
Joe Perches	fe2c6338fd	net: Convert uses of typedef ctl_table to struct ctl_table Reduce the uses of this unnecessary typedef. Done via perl script: $ git grep --name-only -w ctl_table net \| \ xargs perl -p -i -e '\ sub trim { my ($local) = @_; $local =~ s/(^\s+\|\s+$)//g; return $local; } \ s/\b(?<!struct\s)ctl_table\b(\s\\s*\|\s+\w+)/"struct ctl_table " . trim($1)/ge' Reflow the modified lines that now exceed 80 columns. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-13 02:36:09 -07:00
Chen Gang	2e85d67690	net/rds: zero last byte for strncpy for NUL terminated string, need be always sure '\0' in the end. additional info: strncpy will pads with zeroes to the end of the given buffer. should initialise every bit of memory that is going to be copied to userland Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-08 00:35:44 -05:00
Linus Torvalds	9da060d0ed	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "A moderately sized pile of fixes, some specifically for merge window introduced regressions although others are for longer standing items and have been queued up for -stable. I'm kind of tired of all the RDS protocol bugs over the years, to be honest, it's way out of proportion to the number of people who actually use it. 1) Fix missing range initialization in netfilter IPSET, from Jozsef Kadlecsik. 2) ieee80211_local->tim_lock needs to use BH disabling, from Johannes Berg. 3) Fix DMA syncing in SFC driver, from Ben Hutchings. 4) Fix regression in BOND device MAC address setting, from Jiri Pirko. 5) Missing usb_free_urb in ISDN Hisax driver, from Marina Makienko. 6) Fix UDP checksumming in bnx2x driver for 57710 and 57711 chips, fix from Dmitry Kravkov. 7) Missing cfgspace_lock initialization in BCMA driver. 8) Validate parameter size for SCTP assoc stats getsockopt(), from Guenter Roeck. 9) Fix SCTP association hangs, from Lee A Roberts. 10) Fix jumbo frame handling in r8169, from Francois Romieu. 11) Fix phy_device memory leak, from Petr Malat. 12) Omit trailing FCS from frames received in BGMAC driver, from Hauke Mehrtens. 13) Missing socket refcount release in L2TP, from Guillaume Nault. 14) sctp_endpoint_init should respect passed in gfp_t, rather than use GFP_KERNEL unconditionally. From Dan Carpenter. 15) Add AISX AX88179 USB driver, from Freddy Xin. 16) Remove MAINTAINERS entries for drivers deleted during the merge window, from Cesar Eduardo Barros. 17) RDS protocol can try to allocate huge amounts of memory, check that the user's request length makes sense, from Cong Wang. 18) SCTP should use the provided KMALLOC_MAX_SIZE instead of it's own, bogus, definition. From Cong Wang. 19) Fix deadlocks in FEC driver by moving TX reclaim into NAPI poll, from Frank Li. Also, fix a build error introduced in the merge window. 20) Fix bogus purging of default routes in ipv6, from Lorenzo Colitti. 21) Don't double count RTT measurements when we leave the TCP receive fast path, from Neal Cardwell." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits) tcp: fix double-counted receiver RTT when leaving receiver fast path CAIF: fix sparse warning for caif_usb rds: simplify a warning message net: fec: fix build error in no MXC platform net: ipv6: Don't purge default router if accept_ra=2 net: fec: put tx to napi poll function to fix dead lock sctp: use KMALLOC_MAX_SIZE instead of its own MAX_KMALLOC_SIZE rds: limit the size allocated by rds_message_alloc() MAINTAINERS: remove eexpress MAINTAINERS: remove drivers/net/wan/cycx* MAINTAINERS: remove 3c505 caif_dev: fix sparse warnings for caif_flow_cb ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver sctp: use the passed in gfp flags instead GFP_KERNEL ipv[4\|6]: correct dropwatch false positive in local_deliver_finish l2tp: Restore socket refcount when sendmsg succeeds net/phy: micrel: Disable asymmetric pause for KSZ9021 bgmac: omit the fcs phy: Fix phy_device_free memory leak bnx2x: Fix KR2 work-around condition ...	2013-03-05 18:42:29 -08:00
Cong Wang	7dac1b514a	rds: simplify a warning message Cc: David S. Miller <davem@davemloft.net> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-04 14:12:07 -05:00
Cong Wang	ece6b0a2b2	rds: limit the size allocated by rds_message_alloc() Dave Jones reported the following bug: "When fed mangled socket data, rds will trust what userspace gives it, and tries to allocate enormous amounts of memory larger than what kmalloc can satisfy." WARNING: at mm/page_alloc.c:2393 __alloc_pages_nodemask+0xa0d/0xbe0() Hardware name: GA-MA78GM-S2H Modules linked in: vmw_vsock_vmci_transport vmw_vmci vsock fuse bnep dlci bridge 8021q garp stp mrp binfmt_misc l2tp_ppp l2tp_core rfcomm s Pid: 24652, comm: trinity-child2 Not tainted 3.8.0+ #65 Call Trace: [<ffffffff81044155>] warn_slowpath_common+0x75/0xa0 [<ffffffff8104419a>] warn_slowpath_null+0x1a/0x20 [<ffffffff811444ad>] __alloc_pages_nodemask+0xa0d/0xbe0 [<ffffffff8100a196>] ? native_sched_clock+0x26/0x90 [<ffffffff810b2128>] ? trace_hardirqs_off_caller+0x28/0xc0 [<ffffffff810b21cd>] ? trace_hardirqs_off+0xd/0x10 [<ffffffff811861f8>] alloc_pages_current+0xb8/0x180 [<ffffffff8113eaaa>] __get_free_pages+0x2a/0x80 [<ffffffff811934fe>] kmalloc_order_trace+0x3e/0x1a0 [<ffffffff81193955>] __kmalloc+0x2f5/0x3a0 [<ffffffff8104df0c>] ? local_bh_enable_ip+0x7c/0xf0 [<ffffffffa0401ab3>] rds_message_alloc+0x23/0xb0 [rds] [<ffffffffa04043a1>] rds_sendmsg+0x2b1/0x990 [rds] [<ffffffff810b21cd>] ? trace_hardirqs_off+0xd/0x10 [<ffffffff81564620>] sock_sendmsg+0xb0/0xe0 [<ffffffff810b2052>] ? get_lock_stats+0x22/0x70 [<ffffffff810b24be>] ? put_lock_stats.isra.23+0xe/0x40 [<ffffffff81567f30>] sys_sendto+0x130/0x180 [<ffffffff810b872d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff816c547b>] ? _raw_spin_unlock_irq+0x3b/0x60 [<ffffffff816cd767>] ? sysret_check+0x1b/0x56 [<ffffffff810b8695>] ? trace_hardirqs_on_caller+0x115/0x1a0 [<ffffffff81341d8e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff816cd742>] system_call_fastpath+0x16/0x1b ---[ end trace eed6ae990d018c8b ]--- Reported-by: Dave Jones <davej@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-04 14:12:06 -05:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Kees Cook	e34430eeca	net/rds: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Acked-by: David S. Miller <davem@davemloft.net>	2013-01-11 11:40:02 -08:00
Marciniszyn, Mike	a49675988c	IB/rds: suppress incompatible protocol when version is known Add an else to only print the incompatible protocol message when version hasn't been established. Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-26 15:17:37 -08:00
Marciniszyn, Mike	f2e9bd7032	IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len `0b088e00` ("RDS: Use page_remainder_alloc() for recv bufs") added uses of sg_dma_len() and sg_dma_address(). This makes RDS DOA with the qib driver. IB ulps should use ib_sg_dma_len() and ib_sg_dma_address respectively since some HCAs overload ib_sg_dma* operations. Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-26 15:17:37 -08:00
Shan Wei	ae4b46e9d7	net: rds: use this_cpu_* per-cpu helper Signed-off-by: Shan Wei <davidshan@tencent.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-19 18:59:44 -05:00
jeff.liu	5175a5e76b	RDS: fix rds-ping spinlock recursion This is the revised patch for fixing rds-ping spinlock recursion according to Venkat's suggestions. RDS ping/pong over TCP feature has been broken for years(2.6.39 to 3.6.0) since we have to set TCP cork and call kernel_sendmsg() between ping/pong which both need to lock "struct sock *sk". However, this lock has already been hold before rds_tcp_data_ready() callback is triggerred. As a result, we always facing spinlock resursion which would resulting in system panic. Given that RDS ping is only used to test the connectivity and not for serious performance measurements, we can queue the pong transmit to rds_wq as a delayed response. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> CC: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> CC: David S. Miller <davem@davemloft.net> CC: James Morris <james.l.morris@oracle.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-09 13:57:23 -04:00
Ying Xue	bfdc587c5a	rds: Don't disable BH on BH context Since we have already in BH context when _write_space(), _data_ready() as well as *_state_change() are called, it's unnecessary to disable BH. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-22 22:52:04 -07:00
Weiping Pan	06b6a1cf6e	rds: set correct msg_namelen Jay Fenlason (fenlason@redhat.com) found a bug, that recvfrom() on an RDS socket can return the contents of random kernel memory to userspace if it was called with a address length larger than sizeof(struct sockaddr_in). rds_recvmsg() also fails to set the addr_len paramater properly before returning, but that's just a bug. There are also a number of cases wher recvfrom() can return an entirely bogus address. Anything in rds_recvmsg() that returns a non-negative value but does not go through the "sin = (struct sockaddr_in )msg->msg_name;" code path at the end of the while(1) loop will return up to 128 bytes of kernel memory to userspace. And I write two test programs to reproduce this bug, you will see that in rds_server, fromAddr will be overwritten and the following sock_fd will be destroyed. Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is better to make the kernel copy the real length of address to user space in such case. How to run the test programs ? I test them on 32bit x86 system, 3.5.0-rc7. 1 compile gcc -o rds_client rds_client.c gcc -o rds_server rds_server.c 2 run ./rds_server on one console 3 run ./rds_client on another console 4 you will see something like: server is waiting to receive data... old socket fd=3 server received data from client:data from client msg.msg_namelen=32 new socket fd=-1067277685 sendmsg() : Bad file descriptor /************** rds_client.c *****************/ int main(void) { int sock_fd; struct sockaddr_in serverAddr; struct sockaddr_in toAddr; char recvBuffer[128] = "data from client"; struct msghdr msg; struct iovec iov; sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0); if (sock_fd < 0) { perror("create socket error\n"); exit(1); } memset(&serverAddr, 0, sizeof(serverAddr)); serverAddr.sin_family = AF_INET; serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1"); serverAddr.sin_port = htons(4001); if (bind(sock_fd, (struct sockaddr)&serverAddr, sizeof(serverAddr)) < 0) { perror("bind() error\n"); close(sock_fd); exit(1); } memset(&toAddr, 0, sizeof(toAddr)); toAddr.sin_family = AF_INET; toAddr.sin_addr.s_addr = inet_addr("127.0.0.1"); toAddr.sin_port = htons(4000); msg.msg_name = &toAddr; msg.msg_namelen = sizeof(toAddr); msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_iov->iov_base = recvBuffer; msg.msg_iov->iov_len = strlen(recvBuffer) + 1; msg.msg_control = 0; msg.msg_controllen = 0; msg.msg_flags = 0; if (sendmsg(sock_fd, &msg, 0) == -1) { perror("sendto() error\n"); close(sock_fd); exit(1); } printf("client send data:%s\n", recvBuffer); memset(recvBuffer, '\0', 128); msg.msg_name = &toAddr; msg.msg_namelen = sizeof(toAddr); msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_iov->iov_base = recvBuffer; msg.msg_iov->iov_len = 128; msg.msg_control = 0; msg.msg_controllen = 0; msg.msg_flags = 0; if (recvmsg(sock_fd, &msg, 0) == -1) { perror("recvmsg() error\n"); close(sock_fd); exit(1); } printf("receive data from server:%s\n", recvBuffer); close(sock_fd); return 0; } /*************** rds_server.c *****************/ int main(void) { struct sockaddr_in fromAddr; int sock_fd; struct sockaddr_in serverAddr; unsigned int addrLen; char recvBuffer[128]; struct msghdr msg; struct iovec iov; sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0); if(sock_fd < 0) { perror("create socket error\n"); exit(0); } memset(&serverAddr, 0, sizeof(serverAddr)); serverAddr.sin_family = AF_INET; serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1"); serverAddr.sin_port = htons(4000); if (bind(sock_fd, (struct sockaddr)&serverAddr, sizeof(serverAddr)) < 0) { perror("bind error\n"); close(sock_fd); exit(1); } printf("server is waiting to receive data...\n"); msg.msg_name = &fromAddr; /* * I add 16 to sizeof(fromAddr), ie 32, * and pay attention to the definition of fromAddr, * recvmsg() will overwrite sock_fd, * since kernel will copy 32 bytes to userspace. * * If you just use sizeof(fromAddr), it works fine. * / msg.msg_namelen = sizeof(fromAddr) + 16; / msg.msg_namelen = sizeof(fromAddr); */ msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_iov->iov_base = recvBuffer; msg.msg_iov->iov_len = 128; msg.msg_control = 0; msg.msg_controllen = 0; msg.msg_flags = 0; while (1) { printf("old socket fd=%d\n", sock_fd); if (recvmsg(sock_fd, &msg, 0) == -1) { perror("recvmsg() error\n"); close(sock_fd); exit(1); } printf("server received data from client:%s\n", recvBuffer); printf("msg.msg_namelen=%d\n", msg.msg_namelen); printf("new socket fd=%d\n", sock_fd); strcat(recvBuffer, "--data from server"); if (sendmsg(sock_fd, &msg, 0) == -1) { perror("sendmsg()\n"); close(sock_fd); exit(1); } } close(sock_fd); return 0; } Signed-off-by: Weiping Pan <wpan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 01:01:44 -07:00
Ben Hutchings	2c53040f01	net: Fix (nearly-)kernel-doc comments for various functions Fix incorrect start markers, wrapped summary lines, missing section breaks, incorrect separators, and some name mismatches. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 23:13:45 -07:00
Thadeu Lima de Souza Cascardo	a0c6ffbcfe	rds_rdma: don't assume infiniband device is PCI RDS code assumes that the struct ib_device dma_device member, which is a pointer, points to a struct device embedded in a struct pci_dev. This is not the case for ehca, for example, which is a OF driver, and makes dma_device point to a struct device embedded in a struct platform_device. This will make the system crash when rds_rdma is loaded in a system with ehca, since it will try to access the bus member of a non-existent struct pci_dev. The only reason rds_rdma uses the struct pci_dev is to get the NUMA node the device is attached to. Using dev_to_node for that is much better, since it won't assume which bus the infiniband is attached to. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Cc: dledford@redhat.com Cc: Jes.Sorensen@redhat.com Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-29 17:30:07 -04:00
Pavel Emelyanov	4a17fd5229	sock: Introduce named constants for sk_reuse Name them in a "backward compatible" manner, i.e. reuse or not are still 1 and 0 respectively. The reuse value of 2 means that the socket with it will forcibly reuse everyone else's port. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-21 15:52:25 -04:00
Eric W. Biederman	ec8f23ce0f	net: Convert all sysctl registrations to register_net_sysctl This results in code with less boiler plate that is a bit easier to read. Additionally stops us from using compatibility code in the sysctl core, hastening the day when the compatibility code can be removed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-20 21:22:30 -04:00
Eric W. Biederman	5dd3df105b	net: Move all of the network sysctls without a namespace into init_net. This makes it clearer which sysctls are relative to your current network namespace. This makes it a little less error prone by not exposing sysctls for the initial network namespace in other namespaces. This is the same way we handle all of our other network interfaces to userspace and I can't honestly remember why we didn't do this for sysctls right from the start. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-20 21:21:17 -04:00
Dan Carpenter	f0229eaaf3	RDS: use gfp flags from caller in conn_alloc() We should be using the gfp flags the caller specified here, instead of GFP_KERNEL. I think this might be a bugfix, depending on the value of "sock->sk->sk_allocation" when we call rds_conn_create_outgoing() in rds_sendmsg(). Otherwise, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-03-22 19:29:58 -04:00
Linus Torvalds	9f3938346a	Merge branch 'kmap_atomic' of git://github.com/congwang/linux Pull kmap_atomic cleanup from Cong Wang. It's been in -next for a long time, and it gets rid of the (no longer used) second argument to k[un]map_atomic(). Fix up a few trivial conflicts in various drivers, and do an "evil merge" to catch some new uses that have come in since Cong's tree. * 'kmap_atomic' of git://github.com/congwang/linux: (59 commits) feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename] drbd: remove the second argument of k[un]map_atomic() zcache: remove the second argument of k[un]map_atomic() gma500: remove the second argument of k[un]map_atomic() dm: remove the second argument of k[un]map_atomic() tomoyo: remove the second argument of k[un]map_atomic() sunrpc: remove the second argument of k[un]map_atomic() rds: remove the second argument of k[un]map_atomic() net: remove the second argument of k[un]map_atomic() mm: remove the second argument of k[un]map_atomic() lib: remove the second argument of k[un]map_atomic() power: remove the second argument of k[un]map_atomic() kdb: remove the second argument of k[un]map_atomic() udf: remove the second argument of k[un]map_atomic() ubifs: remove the second argument of k[un]map_atomic() squashfs: remove the second argument of k[un]map_atomic() reiserfs: remove the second argument of k[un]map_atomic() ocfs2: remove the second argument of k[un]map_atomic() ntfs: remove the second argument of k[un]map_atomic() ...	2012-03-21 09:40:26 -07:00
Linus Torvalds	69a7aebcf0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "It's indeed trivial -- mostly documentation updates and a bunch of typo fixes from Masanari. There are also several linux/version.h include removals from Jesper." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits) kcore: fix spelling in read_kcore() comment constify struct pci_dev * in obvious cases Revert "char: Fix typo in viotape.c" init: fix wording error in mm_init comment usb: gadget: Kconfig: fix typo for 'different' Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c" writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header writeback: fix typo in the writeback_control comment Documentation: Fix multiple typo in Documentation tpm_tis: fix tis_lock with respect to RCU Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c" Doc: Update numastat.txt qla4xxx: Add missing spaces to error messages compiler.h: Fix typo security: struct security_operations kerneldoc fix Documentation: broken URL in libata.tmpl Documentation: broken URL in filesystems.tmpl mtd: simplify return logic in do_map_probe() mm: fix comment typo of truncate_inode_pages_range power: bq27x00: Fix typos in comment ...	2012-03-20 21:12:50 -07:00
Dave Jones	a6506e1486	Remove printk from rds_sendmsg no socket layer outputs a message for this error and neither should rds. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-03-20 16:12:11 -04:00
Cong Wang	6114eab535	rds: remove the second argument of k[un]map_atomic() Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com>	2012-03-20 21:48:28 +08:00
Masanari Iida	5fd5c44d3f	rds: Fix typo in iw_recv.c and ib_recv.c Correct spelling "inclue" to "include" in net/rds/iw_recv.c and net/rds/ib_recv.c Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-02-09 23:09:54 +01:00
David S. Miller	efc3dbc374	rds: Make rds_sock_lock BH rather than IRQ safe. rds_sock_info() triggers locking warnings because we try to perform a local_bh_enable() (via sock_i_ino()) while hardware interrupts are disabled (via taking rds_sock_lock). There is no reason for rds_sock_lock to be a hardware IRQ disabling lock, none of these access paths run in hardware interrupt context. Therefore making it a BH disabling lock is safe and sufficient to fix this bug. Reported-by: Kumar Sanghvi <kumaras@chelsio.com> Reported-by: Josh Boyer <jwboyer@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-24 17:03:44 -05:00
Roland Dreier	5b7bf42e3d	RDS: Remove some unused iWARP code rds_iw_flush_goal() just returns a count, but it is only called in one place and its return value is ignored there. So delete all the dead code. Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-12 20:05:28 -08:00
Paul Bolle	77c1c7c4bd	rds: drop "select LLIST" Commit `1bc144b625` ("net, rds, Replace xlist in net/rds/xlist.h with llist") added "select LLIST" to the RDS_RDMA Kconfig entry. But there is no Kconfig symbol named LLIST. The select statement for that symbol is a nop. Drop it. lib/llist.o is builtin, so all that's needed to use the llist functionality is to include linux/llist.h, which this commit also did. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 00:10:50 -05:00
Linus Torvalds	32aaeffbd4	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h	2011-11-06 19:44:47 -08:00
Joe Perches	b9075fa968	treewide: use __printf not __attribute__((format(printf,...))) Standardize the style for compiler based printf format verification. Standardized the location of __printf too. Done via script and a little typing. $ grep -rPl --include=.[ch] -w "__attribute__" \| \ grep -vP "^(tools\|scripts\|include/linux/compiler-gcc.h)" \| \ xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s$\s\(\sformat\s\(\sprintf\s,\s(.+)\s,\s(.+)\s$\s\)\s\)/__printf($1, $2)/g ; print; }' [akpm@linux-foundation.org: revert arch bits] Signed-off-by: Joe Perches <joe@perches.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-10-31 17:30:54 -07:00
Paul Gortmaker	bc3b2d7fb9	net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules These files are non modular, but need to export symbols using the macros now living in export.h -- call out the include so that things won't break when we remove the implicit presence of module.h from everywhere. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:30 -04:00
Paul Gortmaker	d9b9384215	net: add moduleparam.h for users of module_param/MODULE_PARM_DESC These files were getting access to these two via the implicit presence of module.h everywhere. They aren't modules, so they don't need the full module.h inclusion though. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:29 -04:00
Paul Gortmaker	3a9a231d97	net: Fix files explicitly needing to include module.h With calls to modular infrastructure, these files really needs the full module.h header. Call it out so some of the cleanups of implicit and unrequired includes elsewhere can be cleaned up. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:28 -04:00
Linus Torvalds	8a9ea3237e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1745 commits) dp83640: free packet queues on remove dp83640: use proper function to free transmit time stamping packets ipv6: Do not use routes from locally generated RAs \|PATCH net-next] tg3: add tx_dropped counter be2net: don't create multiple RX/TX rings in multi channel mode be2net: don't create multiple TXQs in BE2 be2net: refactor VF setup/teardown code into be_vf_setup/clear() be2net: add vlan/rx-mode/flow-control config to be_setup() net_sched: cls_flow: use skb_header_pointer() ipv4: avoid useless call of the function check_peer_pmtu TCP: remove TCP_DEBUG net: Fix driver name for mdio-gpio.c ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT rtnetlink: Add missing manual netlink notification in dev_change_net_namespaces ipv4: fix ipsec forward performance regression jme: fix irq storm after suspend/resume route: fix ICMP redirect validation net: hold sock reference while processing tx timestamps tcp: md5: add more const attributes Add ethtool -g support to virtio_net ... Fix up conflicts in: - drivers/net/Kconfig: The split-up generated a trivial conflict with removal of a stale reference to Documentation/networking/net-modules.txt. Remove it from the new location instead. - fs/sysfs/dir.c: Fairly nasty conflicts with the sysfs rb-tree usage, conflicting with Eric Biederman's changes for tagged directories.	2011-10-25 13:25:22 +02:00
Linus Torvalds	59e5253417	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits) MAINTAINERS: linux-m32r is moderated for non-subscribers linux@lists.openrisc.net is moderated for non-subscribers Drop default from "DM365 codec select" choice parisc: Kconfig: cleanup Kernel page size default Kconfig: remove redundant CONFIG_ prefix on two symbols cris: remove arch/cris/arch-v32/lib/nand_init.S microblaze: add missing CONFIG_ prefixes h8300: drop puzzling Kconfig dependencies MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers tty: drop superfluous dependency in Kconfig ARM: mxc: fix Kconfig typo 'i.MX51' Fix file references in Kconfig files aic7xxx: fix Kconfig references to READMEs Fix file references in drivers/ide/ thinkpad_acpi: Fix printk typo 'bluestooth' bcmring: drop commented out line in Kconfig btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888' doc: raw1394: Trivial typo fix CIFS: Don't free volume_info->UNC until we are entirely done with it. treewide: Correct spelling of successfully in comments ...	2011-10-25 12:11:02 +02:00
David S. Miller	88c5100c28	Merge branch 'master' of github.com:davem330/net Conflicts: net/batman-adv/soft-interface.c	2011-10-07 13:38:43 -04:00
Jonathan Lallinger	85a6488949	RDSRDMA: Fix cleanup of rds_iw_mr_pool In the rds_iw_mr_pool struct the free_pinned field keeps track of memory pinned by free MRs. While this field is incremented properly upon allocation, it is never decremented upon unmapping. This would cause the rds_rdma module to crash the kernel upon unloading, by triggering the BUG_ON in the rds_iw_destroy_mr_pool function. This change keeps track of the MRs that become unpinned, so that free_pinned can be decremented appropriately. Signed-off-by: Jonathan Lallinger <jonathan@ogc.us> Signed-off-by: Steve Wise <swise@ogc.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-29 14:57:19 -04:00
Huang Ying	1bc144b625	net, rds, Replace xlist in net/rds/xlist.h with llist The functionality of xlist and llist is almost same. This patch replace xlist with llist to avoid code duplication. Known issues: don't know how to test this, need special hardware? Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Andy Grover <andy.grover@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 15:36:32 -04:00
Joe Perches	3dbd443983	net: Convert vmalloc/memset to vzalloc Signed-off-by: Joe Perches <joe@perches.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-09-15 13:59:25 +02:00
Amerigo Wang	80f1ff97d0	notifiers: cpu: move cpu notifiers into cpu.h We presently define all kinds of notifiers in notifier.h. This is not necessary at all, since different subsystems use different notifiers, they are almost non-related with each other. This can also save much build time. Suppose I add a new netdevice event, really I don't have to recompile all the source, just network related. Without this patch, all the source will be recompiled. I move the notify events near to their subsystem notifier registers, so that they can be found more easily. This patch: It is not necessary to share the same notifier.h. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-25 20:57:14 -07:00
Greg Dietsche	3e878b8d54	net: rds: fix const array syntax Correct the syntax so that both array and pointer are const. Signed-off-by: Greg Dietsche <Gregory.Dietsche@cuw.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 16:16:19 -07:00
Manuel Zerpies	cb0a605649	net/rds: use prink_ratelimited() instead of printk_ratelimit() Since printk_ratelimit() shouldn't be used anymore (see comment in include/linux/printk.h), replace it with printk_ratelimited() Signed-off-by: Manuel Zerpies <manuel.f.zerpies@ww.stud.uni-erlangen.de> Signed-off-by: David S. Miller <davem@conan.davemloft.net>	2011-06-17 00:03:03 -04:00
Alexey Dobriyan	a6b7a40786	net: remove interrupt.h inclusion from netdevice.h * remove interrupt.g inclusion from netdevice.h -- not needed * fixup fallout, add interrupt.h and hardirq.h back where needed. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-06 22:55:11 -07:00
Sean Hefty	b26f9b9949	RDMA/cma: Pass QP type into rdma_create_id() The RDMA CM currently infers the QP type from the port space selected by the user. In the future (eg with RDMA_PS_IB or XRC), there may not be a 1-1 correspondence between port space and QP type. For netlink export of RDMA CM state, we want to export the QP type to userspace, so it is cleaner to explicitly associate a QP type to an ID. Modify rdma_create_id() to allow the user to specify the QP type, and use it to make our selections of datagram versus connected mode. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>	2011-05-25 13:46:23 -07:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Akinobu Mita	e1dc1c81b9	rds: use little-endian bitops As a preparation for removing ext2 non-atomic bit operations from asm/bitops.h. This converts ext2 non-atomic bit operations to little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Andy Grover <andy.grover@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:16 -07:00
Akinobu Mita	12ce22423a	rds: stop including asm-generic/bitops/le.h directly asm-generic/bitops/le.h is only intended to be included directly from asm-generic/bitops/ext2-non-atomic.h or asm-generic/bitops/minix-le.h which implements generic ext2 or minix bit operations. This stops including asm-generic/bitops/le.h directly and use ext2 non-atomic bit operations instead. It seems odd to use ext2_*_bit() on rds, but it will replaced with __{set,clear,test}_bit_le() after introducing little endian bit operations for all architectures. This indirect step is necessary to maintain bisectability for some architectures which have their own little-endian bit operations. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Andy Grover <andy.grover@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:10 -07:00
Linus Torvalds	7a6362800c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1480 commits) bonding: enable netpoll without checking link status xfrm: Refcount destination entry on xfrm_lookup net: introduce rx_handler results and logic around that bonding: get rid of IFF_SLAVE_INACTIVE netdev->priv_flag bonding: wrap slave state work net: get rid of multiple bond-related netdevice->priv_flags bonding: register slave pointer for rx_handler be2net: Bump up the version number be2net: Copyright notice change. Update to Emulex instead of ServerEngines e1000e: fix kconfig for crc32 dependency netfilter ebtables: fix xt_AUDIT to work with ebtables xen network backend driver bonding: Improve syslog message at device creation time bonding: Call netif_carrier_off after register_netdevice bonding: Incorrect TX queue offset net_sched: fix ip_tos2prio xfrm: fix __xfrm_route_forward() be2net: Fix UDP packet detected status in RX compl Phonet: fix aligned-mode pipe socket buffer header reserve netxen: support for GbE port settings ... Fix up conflicts in drivers/staging/brcm80211/brcmsmac/wl_mac80211.c with the staging updates.	2011-03-16 16:29:25 -07:00
Linus Torvalds	bd2895eead	Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER	2011-03-16 08:20:19 -07:00
David S. Miller	33175d84ee	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x_cmn.c	2011-03-10 14:26:00 -08:00
Neil Horman	6094628bfd	rds: prevent BUG_ON triggering on congestion map updates Recently had this bug halt reported to me: kernel BUG at net/rds/send.c:329! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=1024 NUMA pSeries Modules linked in: rds sunrpc ipv6 dm_mirror dm_region_hash dm_log ibmveth sg ext4 jbd2 mbcache sd_mod crc_t10dif ibmvscsic scsi_transport_srp scsi_tgt dm_mod [last unloaded: scsi_wait_scan] NIP: d000000003ca68f4 LR: d000000003ca67fc CTR: d000000003ca8770 REGS: c000000175cab980 TRAP: 0700 Not tainted (2.6.32-118.el6.ppc64) MSR: 8000000000029032 <EE,ME,CE,IR,DR> CR: 44000022 XER: 00000000 TASK = c00000017586ec90[1896] 'krdsd' THREAD: c000000175ca8000 CPU: 0 GPR00: 0000000000000150 c000000175cabc00 d000000003cb7340 0000000000002030 GPR04: ffffffffffffffff 0000000000000030 0000000000000000 0000000000000030 GPR08: 0000000000000001 0000000000000001 c0000001756b1e30 0000000000010000 GPR12: d000000003caac90 c000000000fa2500 c0000001742b2858 c0000001742b2a00 GPR16: c0000001742b2a08 c0000001742b2820 0000000000000001 0000000000000001 GPR20: 0000000000000040 c0000001742b2814 c000000175cabc70 0800000000000000 GPR24: 0000000000000004 0200000000000000 0000000000000000 c0000001742b2860 GPR28: 0000000000000000 c0000001756b1c80 d000000003cb68e8 c0000001742b27b8 NIP [d000000003ca68f4] .rds_send_xmit+0x4c4/0x8a0 [rds] LR [d000000003ca67fc] .rds_send_xmit+0x3cc/0x8a0 [rds] Call Trace: [c000000175cabc00] [d000000003ca67fc] .rds_send_xmit+0x3cc/0x8a0 [rds] (unreliable) [c000000175cabd30] [d000000003ca7e64] .rds_send_worker+0x54/0x100 [rds] [c000000175cabdb0] [c0000000000b475c] .worker_thread+0x1dc/0x3c0 [c000000175cabed0] [c0000000000baa9c] .kthread+0xbc/0xd0 [c000000175cabf90] [c000000000032114] .kernel_thread+0x54/0x70 Instruction dump: 4bfffd50 60000000 60000000 39080001 935f004c f91f0040 41820024 813d017c 7d094a78 7d290074 7929d182 394a0020 <0b090000> 40e2ff68 4bffffa4 39200000 Kernel panic - not syncing: Fatal exception Call Trace: [c000000175cab560] [c000000000012e04] .show_stack+0x74/0x1c0 (unreliable) [c000000175cab610] [c0000000005a365c] .panic+0x80/0x1b4 [c000000175cab6a0] [c00000000002fbcc] .die+0x21c/0x2a0 [c000000175cab750] [c000000000030000] ._exception+0x110/0x220 [c000000175cab910] [c000000000004b9c] program_check_common+0x11c/0x180 Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-08 11:22:43 -08:00
Tejun Heo	c534a107e8	rds/ib: use system_wq instead of rds_ib_fmr_wq With cmwq, there's no reason to use dedicated rds_ib_fmr_wq - it's not in the memory reclaim path and the maximum number of concurrent work items is bound by the number of devices. Drop it and use system_wq instead. This rds_ib_fmr_init/exit() noops. Both removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Grover <andy.grover@oracle.com>	2011-02-01 11:42:43 +01:00
Shan Wei	441c793a56	net: cleanup unused macros in net directory Clean up some unused macros in net/*. 1. be left for code change. e.g. PGV_FROM_VMALLOC, PGV_FROM_VMALLOC, KMEM_SAFETYZONE. 2. never be used since introduced to kernel. e.g. P9_RDMA_MAX_SGE, UTIL_CTRL_PKT_SIZE. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Acked-by: Sjur Braendeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-19 23:20:04 -08:00
Tracey Dent	094f2faaa2	Net: rds: Makefile: Remove deprecated items Changed Makefile to use <modules>-y instead of <modules>-objs because -objs is deprecated and not mentioned in Documentation/kbuild/makefiles.txt. Also, use the ccflags-$ flag instead of EXTRA_CFLAGS because EXTRA_CFLAGS is deprecated and should now be switched. Last but not least, took out if-conditionals. Signed-off-by: Tracey Dent <tdent48227@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-22 08:16:15 -08:00
Dan Rosenberg	218854af84	rds: Integer overflow in RDS cmsg handling In rds_cmsg_rdma_args(), the user-provided args->nr_local value is restricted to less than UINT_MAX. This seems to need a tighter upper bound, since the calculation of total iov_size can overflow, resulting in a small sock_kmalloc() allocation. This would probably just result in walking off the heap and crashing when calling rds_rdma_pages() with a high count value. If it somehow doesn't crash here, then memory corruption could occur soon after. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-17 12:20:52 -08:00
Pavel Emelyanov	aa58163a76	rds: Fix rds message leak in rds_message_map_pages The sgs allocation error path leaks the allocated message. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-08 12:17:09 -08:00
Pavel Emelyanov	8200a59f24	rds: Remove kfreed tcp conn from list All the rds_tcp_connection objects are stored list, but when being freed it should be removed from there. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-03 18:50:07 -07:00
Pavel Emelyanov	58c490babd	rds: Lost locking in loop connection freeing The conn is removed from list in there and this requires proper lock protection. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-03 18:50:06 -07:00
Andy Grover	d139ff0907	RDS: Let rds_message_alloc_sgs() return NULL Even with the previous fix, we still are reading the iovecs once to determine SGs needed, and then again later on. Preallocating space for sg lists as part of rds_message seemed like a good idea but it might be better to not do this. While working to redo that code, this patch attempts to protect against userspace rewriting the rds_iovec array between the first and second accesses. The consequences of this would be either a too-small or too-large sg list array. Too large is not an issue. This patch changes all callers of message_alloc_sgs to handle running out of preallocated sgs, and fail gracefully. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-30 16:34:18 -07:00
Andy Grover	fc8162e3c0	RDS: Copy rds_iovecs into kernel memory instead of rereading from userspace Change rds_rdma_pages to take a passed-in rds_iovec array instead of doing copy_from_user itself. Change rds_cmsg_rdma_args to copy rds_iovec array once only. This eliminates the possibility of userspace changing it after our sanity checks. Implement stack-based storage for small numbers of iovecs, based on net/socket.c, to save an alloc in the extremely common case. Although this patch reduces iovec copies in cmsg_rdma_args to 1, we still do another one in rds_rdma_extra_size. Getting rid of that one will be trickier, so it'll be a separate patch. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-30 16:34:17 -07:00
Andy Grover	f4a3fc03c1	RDS: Clean up error handling in rds_cmsg_rdma_args We don't need to set ret = 0 at the end -- it's initialized to 0. Also, don't increment s_send_rdma stat if we're exiting with an error. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-30 16:34:17 -07:00
Andy Grover	a09f69c49b	RDS: Return -EINVAL if rds_rdma_pages returns an error rds_cmsg_rdma_args would still return success even if rds_rdma_pages returned an error (or overflowed). Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-30 16:34:16 -07:00
Linus Torvalds	1b1f693d7a	net: fix rds_iovec page count overflow As reported by Thomas Pollet, the rdma page counting can overflow. We get the rdma sizes in 64-bit unsigned entities, but then limit it to UINT_MAX bytes and shift them down to pages (so with a possible "+1" for an unaligned address). So each individual page count fits comfortably in an 'unsigned int' (not even close to overflowing into signed), but as they are added up, they might end up resulting in a signed return value. Which would be wrong. Catch the case of tot_pages turning negative, and return the appropriate error code. Reported-by: Thomas Pollet <thomas.pollet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-30 16:34:16 -07:00
David S. Miller	2198a10b50	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/core/dev.c	2010-10-21 08:43:05 -07:00
stephen hemminger	ff51bf8415	rds: make local functions/variables static The RDS protocol has lots of functions that should be declared static. rds_message_get/add_version_extension is removed since it defined but never used. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 04:26:39 -07:00
Linus Torvalds	799c10559d	De-pessimize rds_page_copy_user Don't try to "optimize" rds_page_copy_user() by using kmap_atomic() and the unsafe atomic user mode accessor functions. It's actually slower than the straightforward code on any reasonable modern CPU. Back when the code was written (although probably not by the time it was actually merged, though), 32-bit x86 may have been the dominant architecture. And there kmap_atomic() can be a lot faster than kmap() (unless you have very good locality, in which case the virtual address caching by kmap() can overcome all the downsides). But these days, x86-64 may not be more populous, but it's getting there (and if you care about performance, it's definitely already there - you'd have upgraded your CPU's already in the last few years). And on x86-64, the non-kmap_atomic() version is faster, simply because the code is simpler and doesn't have the "re-try page fault" case. People with old hardware are not likely to care about RDS anyway, and the optimization for the 32-bit case is simply buggy, since it doesn't verify the user addresses properly. Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-10-15 11:09:28 -07:00
David S. Miller	e40051d134	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/qlcnic/qlcnic_init.c net/ipv4/ip_output.c	2010-09-27 01:03:03 -07:00
Eric Dumazet	f064af1e50	net: fix a lockdep splat We have for each socket : One spinlock (sk_slock.slock) One rwlock (sk_callback_lock) Possible scenarios are : (A) (this is used in net/sunrpc/xprtsock.c) read_lock(&sk->sk_callback_lock) (without blocking BH) <BH> spin_lock(&sk->sk_slock.slock); ... read_lock(&sk->sk_callback_lock); ... (B) write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) (C) spin_lock_bh(&sk->sk_slock) ... write_lock_bh(&sk->sk_callback_lock) stuff write_unlock_bh(&sk->sk_callback_lock) spin_unlock_bh(&sk->sk_slock) This (C) case conflicts with (A) : CPU1 [A] CPU2 [C] read_lock(callback_lock) <BH> spin_lock_bh(slock) <wait to spin_lock(slock)> <wait to write_lock_bh(callback_lock)> We have one problematic (C) use case in inet_csk_listen_stop() : local_bh_disable(); bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock) WARN_ON(sock_owned_by_user(child)); ... sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock) lockdep is not happy with this, as reported by Tetsuo Handa It seems only way to deal with this is to use read_lock_bh(callbacklock) everywhere. Thanks to Jarek for pointing a bug in my first attempt and suggesting this solution. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-24 22:26:10 -07:00
Dan Carpenter	aef3ea33e8	rds: spin_lock_irq() is not nestable This is basically just a cleanup. IRQs were disabled on the previous line so we don't need to do it again here. In the current code IRQs would get turned on one line earlier than intended. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-19 11:59:44 -07:00
Dan Carpenter	f4fa7f3807	rds: double unlock in rds_ib_cm_handle_connect() We unlock after we goto out. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-19 11:59:44 -07:00
Dan Carpenter	9b9d2e00bf	rds: signedness bug In the original code if the copy_from_user() fails in rds_rdma_pages() then the error handling fails and we get a stack trace from kmalloc(). Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-19 11:59:43 -07:00
Andy Grover	20c72bd5f5	RDS: Implement masked atomic operations Add two CMSGs for masked versions of cswp and fadd. args struct modified to use a union for different atomic op type's arguments. Change IB to do masked atomic ops. Atomic op type in rds_message similarly unionized. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:51 -07:00
Zach Brown	59f740a6ae	RDS/IB: print string constants in more places This prints the constant identifier for work completion status and rdma cm event types, like we already do for IB event types. A core string array helper is added that each string type uses. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:50 -07:00
Zach Brown	4518071ac1	RDS: cancel connection work structs as we shut down Nothing was canceling the send and receive work that might have been queued as a conn was being destroyed. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:49 -07:00
Zach Brown	ffcec0e110	RDS: don't call rds_conn_shutdown() from rds_conn_destroy() rds_conn_shutdown() can return before the connection is shut down when it encounters an existing state that it doesn't understand. This lets rds_conn_destroy() then start tearing down the conn from under paths that are still using it. It's more reliable the shutdown work and wait for krdsd to complete the shutdown callback. This stopped some hangs I was seeing where krdsd was trying to shut down a freed conn. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:48 -07:00
Zach Brown	5adb5bc65f	RDS: have sockets get transport module references Right now there's nothing to stop the various paths that use rs->rs_transport from racing with rmmod and executing freed transport code. The simple fix is to have binding to a transport also hold a reference to the transport's module, removing this class of races. We already had an unused t_owner field which was set for the modular transports and which wasn't set for the built-in loop transport. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:47 -07:00
Zach Brown	77510481c0	RDS: remove old rs_transport comment rs_transport is now also used by the rdma paths once the socket is bound. We don't need this stale comment to tell us what cscope can. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:46 -07:00
Zach Brown	fe8ff6b58f	RDS: lock rds_conn_count decrement in rds_conn_destroy() rds_conn_destroy() can race with all other modifications of the rds_conn_count but it was modifying the count without locking. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:45 -07:00
Zach Brown	ea819867b7	RDS/IB: protect the list of IB devices The RDS IB device list wasn't protected by any locking. Traversal in both the get_mr and FMR flushing paths could race with additon and removal. List manipulation is done with RCU primatives and is protected by the write side of a rwsem. The list traversal in the get_mr fast path is protected by a rcu read critical section. The FMR list traversal is more problematic because it can block while traversing the list. We protect this with the read side of the rwsem. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:44 -07:00
Zach Brown	1bde04a63d	RDS/IB: print IB event strings as well as their number It's nice to not have to go digging in the code to see which event occurred. It's easy to throw together a quick array that maps the ib event enums to their strings. I didn't see anything in the stack that does this translation for us, but I also didn't look very hard. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:43 -07:00
Chris Mason	8576f374ac	RDS: flush fmrs before allocating new ones Flushing FMRs is somewhat expensive, and is currently kicked off when the interrupt handler notices that we are getting low. The result of this is that FMR flushing only happens from the interrupt cpus. This spreads the load more effectively by triggering flushes just before we allocate a new FMR. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:16:42 -07:00
Chris Mason	b4e1da3c9a	RDS: properly use sg_init_table This is only needed to keep debugging code from bugging. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:16:41 -07:00
Zach Brown	f046011cd7	RDS/IB: track signaled sends We're seeing bugs today where IB connection shutdown clears the send ring while the tasklet is processing completed sends. Implementation details cause this to dereference a null pointer. Shutdown needs to wait for send completion to stop before tearing down the connection. We can't simply wait for the ring to empty because it may contain unsignaled sends that will never be processed. This patch tracks the number of signaled sends that we've posted and waits for them to complete. It also makes sure that the tasklet has finished executing. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:40 -07:00
Zach Brown	ef87b7ea39	RDS: remove __init and __exit annotation The trivial amount of memory saved isn't worth the cost of dealing with section mismatches. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:39 -07:00
Andy Grover	c20f5b9633	RDS/IB: Use SLAB_HWCACHE_ALIGN flag for kmem_cache_create() We are definitely counting cycles as closely as DaveM, so ensure hwcache alignment for our recv ring control structs. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:38 -07:00
Zach Brown	d455ab6409	RDS/IB: always process recv completions The recv refill path was leaking fragments because the recv event handler had marked a ring element as free without freeing its frag. This was happening because it wasn't processing receives when the conn wasn't marked up or connecting, as can be the case if it races with rmmod. Two observations support always processing receives in the callback. First, buildup should only post receives, thus triggering recv event handler calls, once it has built up all the state to handle them. Teardown should destroy the CQ and drain the ring before tearing down the state needed to process recvs. Both appear to be true today. Second, this test was fundamentally racy. There is nothing to stop rmmod and connection destruction from swooping in the moment after the conn state was sampled but before real receive procesing starts. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:36 -07:00
Zach Brown	80c51be56f	RDS: return to a single-threaded krdsd We were seeing very nasty bugs due to fundamental assumption the current code makes about concurrent work struct processing. The code simpy isn't able to handle concurrent connection shutdown work function execution today, for example, which is very much possible once a multi-threaded krdsd was introduced. The problem compounds as additional work structs are added to the mix. krdsd is no longer perforance critical now that send and receive posting and FMR flushing are done elsewhere, so the safest fix is to move back to the single threaded krdsd that the current code was built around. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:35 -07:00
Zach Brown	515e079dab	RDS/IB: create a work queue for FMR flushing This patch moves the FMR flushing work in to its own mult-threaded work queue. This is to maintain performance in preparation for returning the main krdsd work queue back to a single threaded work queue to avoid deep-rooted concurrency bugs. This is also good because it further separates FMRs, which might be removed some day, from the rest of the code base. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:34 -07:00
Zach Brown	8aeb1ba663	RDS/IB: destroy connections on rmmod IB connections were not being destroyed during rmmod. First, recently IB device removal callback was changed to disconnect connections that used the removing device rather than destroying them. So connections with devices during rmmod were not being destroyed. Second, rds_ib_destroy_nodev_conns() was being called before connections are disassociated with devices. It would almost never find connections in the nodev list. We first get rid of rds_ib_destroy_conns(), which is no longer called, and refactor the existing caller into the main body of the function and get rid of the list and lock wrappers. Then we call rds_ib_destroy_nodev_conns() after ib_unregister_client() has removed the IB device from all the conns and put the conns on the nodev list. The result is that IB connections are destroyed by rmmod. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:33 -07:00
Zach Brown	24fa163a4b	RDS/IB: wait for IB dev freeing work to finish during rmmod The RDS IB client removal callback can queue work to drop the final reference to an IB device. We have to make sure that this function has returned before we complete rmmod or the work threads can try to execute freed code. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:16:32 -07:00
Andy Grover	b6fb0df12d	RDS/IB: Make ib_recv_refill return void Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:31 -07:00
Andy Grover	fbf4d7e3d0	RDS: Remove unused XLIST_PTR_TAIL and xlist_protect() Not used. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:16:06 -07:00
Andy Grover	c9455d9996	RDS: whitespace	2010-09-08 18:15:32 -07:00
Chris Mason	7a0ff5dbdd	RDS: use delayed work for the FMR flushes Using a delayed work queue helps us make sure a healthy number of FMRs have queued up over the limit. It makes for a large improvement in RDMA iops. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:30 -07:00
Chris Mason	eabb732279	rds: more FMRs are faster When we add more FMRs, we flush them less often and so we go faster. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:29 -07:00
Chris Mason	6fa70da608	rds: recycle FMRs through lockless lists FRM allocation and recycling is performance critical and fairly lock intensive. The current code has a per connection lock that all processes bang on and it becomes a major bottleneck on large systems. This changes things to use a number of cmpxchg based lists instead, allowing us to go through the whole FMR lifecycle without locking inside RDS. Zach Brown pointed out that our usage of cmpxchg for xlist removal is racey if someone manages to remove and add back an FMR struct into the list while another CPU can see the FMR's address at the head of the list. The second CPU might assume the list hasn't changed when in fact any number of operations might have happened in between the deletion and reinsertion. This commit maintains a per cpu count of CPUs that are currently in xlist removal, and establishes a grace period to make sure that nobody can see an entry we have just removed from the list. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:28 -07:00
Zach Brown	0f4b1c7e89	rds: fix rds_send_xmit() serialization rds_send_xmit() was changed to hold an interrupt masking spinlock instead of a mutex so that it could be called from the IB receive tasklet path. This broke the TCP transport because its xmit method can block and masks and unmasks interrupts. This patch serializes callers to rds_send_xmit() with a simple bit instead of the current spinlock or previous mutex. This enables rds_send_xmit() to be called from any context and to call functions which block. Getting rid of the c_send_lock exposes the bare c_lock acquisitions which are changed to block interrupts. A waitqueue is added so that rds_conn_shutdown() can wait for callers to leave rds_send_xmit() before tearing down partial send state. This lets us get rid of c_senders. rds_send_xmit() is changed to check the conn state after acquiring the RDS_IN_XMIT bit to resolve races with the shutdown path. Previously both worked with the conn state and then the lock in the same order, allowing them to race and execute the paths concurrently. rds_send_reset() isn't racing with rds_send_xmit() now that rds_conn_shutdown() properly ensures that rds_send_xmit() can't start once the conn state has been changed. We can remove its previous use of the spinlock. Finally, c_send_generation is redundant. Callers can race to test the c_flags bit by simply retrying instead of racing to test the c_send_generation atomic. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:27 -07:00
Zach Brown	501dcccdb7	rds: block ints when acquiring c_lock in rds_conn_message_info() conn->c_lock is acquired in interrupt context. rds_conn_message_info() is called from user context and was acquiring c_lock without blocking interrupts, leading to possible deadlocks. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:26 -07:00
Zach Brown	671202f349	rds: remove unused rds_send_acked_before() rds_send_acked_before() wasn't blocking interrupts when acquiring c_lock from user context but nothing calls it. Rather than fix its use of c_lock we just remove the function. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:25 -07:00
Chris Mason	037f18a307	RDS: use friendly gfp masks for prefill When prefilling the rds frags, we end up doing a lot of allocations. We're not in atomic context here, and so there's no reason to dip into atomic reserves. This changes the prefills to use masks that allow waiting. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:24 -07:00
Chris Mason	3324412587	RDS/IB: Add caching of frags and incs This patch is based heavily on an initial patch by Chris Mason. Instead of freeing slab memory and pages, it keeps them, and funnels them back to be reused. The lock minimization strategy uses xchg and cmpxchg atomic ops for manipulation of pointers to list heads. We anchor the lists with a pointer to a list_head struct instead of a static list_head struct. We just have to carefully use the existing primitives with the difference between a pointer and a static head struct. For example, 'list_empty()' means that our anchor pointer points to a list with a single item instead of meaning that our static head element doesn't point to any list items. Original patch by Chris, with significant mods and fixes by Andy and Zach. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:23 -07:00
Andy Grover	fc24f78085	RDS/IB: Remove ib_recv_unmap_page() All it does is call unmap_sg(), so just call that directly. The comment above unmap_page also may be incorrect, so we shouldn't hold on to it, either. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:15:22 -07:00
Andy Grover	3427e854e1	RDS: Assume recv->r_frag is always NULL in refill_one() refill_one() should never be called on a recv struct that doesn't need a new r_frag allocated. Add a WARN and remove conditional around r_frag alloc code. Also, add a comment to explain why r_ibinc may or may not need refilling. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:15:21 -07:00
Andy Grover	0b088e003c	RDS: Use page_remainder_alloc() for recv bufs Instead of splitting up a page into RDS_FRAG_SIZE chunks ourselves, ask rds_page_remainder_alloc() to do it. While it is possible PAGE_SIZE > FRAG_SIZE, on x86en it isn't, so having duplicate "carve up a page into buffers" code seems excessive. The other modification this spawns is the use of a single struct scatterlist in rds_page_frag instead of a bare page ptr. This causes verbosity to increase in some places, and decrease in others. Finally, I decided to unify the lifetimes and alloc/free of rds_page_frag and its page. This is a nice simplification in itself, but will be extra-nice once we come to adding cmason's recycling patch. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:15:20 -07:00
Zach Brown	fc19de38be	RDS/IB: disconnect when IB devices are removed Currently IB device removal destroys connections which are associated with the device. This prevents connections from being re-established when replacement devices are added. Instead we'll queue shutdown work on the connections as their devices are removed. When we see that devices are added we triger connection attempts on all connections that don't currently have a device. The result is that RDS sockets can resume device-independent work (bcopy, not RDMA) across IB device removal and restoration. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:19 -07:00
Zach Brown	f3c6808d3d	RDS: introduce rds_conn_connect_if_down() A few paths had the same block of code to queue a connection's connect work if it was in the right state. Let's move this in to a helper function. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:18 -07:00
Zach Brown	3e0249f9c0	RDS/IB: add refcount tracking to struct rds_ib_device The RDS IB client .remove callback used to free the rds_ibdev for the given device unconditionally. This could race other users of the struct. This patch adds refcounting so that we only free the rds_ibdev once all of its users are done. Many rds_ibdev users are tied to connections. We give the connection a reference and change these users to reference the device in the connection instead of looking it up in the IB client data. The only user of the IB client data remaining is the first lookup of the device as connections are built up. Incrementing the reference count of a device found in the IB client data could race with final freeing so we use an RCU grace period to make sure that freeing won't happen until those lookups are done. MRs need the rds_ibdev to get at the pool that they're freed in to. They exist outside a connection and many MRs can reference different devices from one socket, so it was natural to have each MR hold a reference. MR refs can be dropped from interrupt handlers and final device teardown can block so we push it off to a work struct. Pool teardown had to be fixed to cancel its pending work instead of deadlocking waiting for all queued work, including itself, to finish. MRs get their reference from the global device list, which gets a reference. It is left unprotected by locks and remains racy. A simple global lock would be a significant bottleneck. More scalable (complicated) locking should be done carefully in a later patch. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:17 -07:00
Zach Brown	89bf9d4158	RDS/IB: get the xmit max_sge from the RDS IB device on the connection rds_ib_xmit_rdma() was calling ib_get_client_data() to get at the rds_ibdevice just to get the max_sge for the transmit. This patch instead has it get it directly off the rds_ibdev which is stored on the connection. The current code won't free the rds_ibdev until all the IB connections that use it are freed. So it's safe to reference the rds_ibdev this way. In the future it also makes it easier to support proper reference counting of the rds_ibdev struct. As an additional bonus, this gets rid of the performance hit of calling in to the IB stack to look up the rds_ibdev. The current implementation in the IB stack acquires an interrupt blocking spinlock to protect the registration of client callback data. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:16 -07:00
Zach Brown	a46ca94e7f	RDS/IB: rds_ib_cm_handle_connect() forgot to unlock c_cm_lock rds_ib_cm_handle_connect() could return without unlocking the c_conn_lock if rds_setup_qp() failed. Rather than adding another imbalanced mutex_unlock() to this error path we only unlock the mutex once as we exit the function, reducing the likelyhood of making this same mistake in the future. We remove the previous mulitple return sites, leaving one unambigious return path. Signed-off-by: Zach Brown <zach.brown@oracle.com>	2010-09-08 18:15:15 -07:00
Chris Mason	1cc2228c59	rds: Fix reference counting on the for xmit_atomic and xmit_rdma This makes sure we have the proper number of references in rds_ib_xmit_atomic and rds_ib_xmit_rdma. We also consistently drop references the same way for all message types as the IOs end. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:13 -07:00
Chris Mason	bcf50ef2ce	rds: use RCU to protect the connection hash The connection hash was almost entirely RCU ready, this just makes the final couple of changes to use RCU instead of spinlocks for everything. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:12 -07:00
Chris Mason	abf454398c	RDS: use locking on the connection hash list rds_conn_destroy really needs locking while it changes the connection hash. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:11 -07:00
Chris Mason	c9e65383a2	rds: Fix RDMA message reference counting The RDS send_xmit code was trying to get fancy with message counting and was dropping the final reference on the RDMA messages too early. This resulted in memory corruption and oopsen. The fix here is to always add a ref as the parts of the message passes through rds_send_xmit, and always drop a ref as the parts of the message go through completion handling. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:10 -07:00
Chris Mason	7e3f2952ee	rds: don't let RDS shutdown a connection while senders are present This is the first in a long line of patches that tries to fix races between RDS connection shutdown and RDS traffic. Here we are maintaining a count of active senders to make sure the connection doesn't go away while they are using it. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:09 -07:00
Chris Mason	38a4e5e613	rds: Use RCU for the bind lookup searches The RDS bind lookups are somewhat expensive in terms of CPU time and locking overhead. This commit changes them into a faster RCU based hash tree instead of the rbtrees they were using before. On large NUMA systems it is a significant improvement. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:15:08 -07:00
Andy Grover	e4c52c98e0	RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node() Allocate send/recv rings in memory that is node-local to the HCA. This significantly helps performance. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:14:06 -07:00
Andy Grover	4a81802b5e	RDS/IB: Remove unused variable in ib_remove_addr() Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:29 -07:00
Chris Mason	764f2dd92f	rds: rcu-ize rds_ib_get_device() rds_ib_get_device is called very often as we turn an ip address into a corresponding device structure. It currently take a global spinlock as it walks different lists to find active devices. This commit changes the lists over to RCU, which isn't very complex because they are not updated very often at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:12:28 -07:00
Chris Mason	c83188dcd7	rds: per-rm flush_wait waitq This removes a global waitqueue used to wait for rds messages and replaces it with a waitqueue inside the rds_message struct. The global waitqueue turns into a global lock and significantly bottlenecks operations on large machines. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:12:27 -07:00
Chris Mason	976673ee1b	rds: switch to rwlock on bind_lock The bind_lock is almost entirely readonly, but it gets hammered during normal operations and is a major bottleneck. This commit changes it to an rwlock, which takes it from 80% of the system time on a big numa machine down to much lower numbers. A better fix would involve RCU, which is done in a later commit Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-09-08 18:12:26 -07:00
Andy Grover	ce47f52f42	RDS: Update comments in rds_send_xmit() Update comments to reflect changes in previous commit. Keeping as separate commits due to different authorship. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:25 -07:00
Chris Mason	9e29db0e36	RDS: Use a generation counter to avoid rds_send_xmit loop rds_send_xmit is required to loop around after it releases the lock because someone else could done a trylock, found someone working on the list and backed off. But, once we drop our lock, it is possible that someone else does come in and make progress on the list. We should detect this and not loop around if another process is actually working on the list. This patch adds a generation counter that is bumped every time we get the lock and do some send work. If the retry notices someone else has bumped the generation counter, it does not need to loop around and continue working. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:24 -07:00
Andy Grover	acfcd4d4ec	RDS: Get pong working again Call send_xmit() directly from pong() Set pongs as op_active Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:23 -07:00
Andy Grover	a40aa9233a	RDS: Do wait_event_interruptible instead of wait_event Can't see a reason not to allow signals to interrupt the wait. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:22 -07:00
Andy Grover	fcc5450c63	RDS: Remove send_quota from send_xmit() The purpose of the send quota was really to give fairness when different connections were all using the same workq thread to send backlogged msgs -- they could only send so many before another connection could make progress. Now that each connection is pushing the backlog from its completion handler, they are all guaranteed to make progress and the quota isn't needed any longer. A thread will have to send all previously queued data, as well as any further msgs placed on the queue while while c_send_lock was held. In a pathological case a single process can get roped into doing this for long periods while other threads get off free. But, since it can only do this until the transport reports full, this is a bounded scenario. Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:21 -07:00
Andy Grover	51e2cba8b5	RDS: Move atomic stats from general to ib-specific area Signed-off-by: Andy Grover <andy.grover@oracle.com>	2010-09-08 18:12:20 -07:00

... 3 4 5 6 7 ...

602 Commits