linux

Commit Graph

Author	SHA1	Message	Date
Marcel Holtmann	f4cdbb3f25	Bluetooth: Handle HCI raw socket transition from unbound to bound In case an unbound HCI raw socket is later on bound, ensure that the monitor notification messages indicate a close and re-open. None of the userspace tools use the socket this, but it is actually possible to use an ioctl on an unbound socket and then later bind it. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	f81f5b2db8	Bluetooth: Send control open and close messages for HCI raw sockets When opening and closing HCI raw sockets their main usage is for legacy userspace. To track interaction with the modern mgmt interface, send open and close monitoring messages for these action. The HCI raw sockets is special since it supports unbound ioctl operation and for that special case delay the notification message until at least one ioctl has been executed. The difference between a bound and unbound socket will be detailed by the fact the HCI index is present or not. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	d0bef1d26f	Bluetooth: Add extra channel checks for control open/close messages The control open and close monitoring events require special channel checks to ensure messages are only send when the right events happen. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	5a6d2cf5f1	Bluetooth: Assign the channel early when binding HCI sockets Assignment of the hci_pi(sk)->channel should be done early when binding the HCI socket. This avoids confusion with the RAW channel that is used for legacy access. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	0ef2c42f8c	Bluetooth: Send control open and close only when cookie is present Only when the cookie has been assigned, then send the open and close monitor messages. Also if the socket is bound to a device, then include the index into the message. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	9e8305b39b	Bluetooth: Use numbers for subsystem version string Instead of keeping a version string around, use version and revision numbers and then stringify them for use as module parameter. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	df1cb87af9	Bluetooth: Introduce helper functions for socket cookie handling Instead of manually allocating cookie information each time, use helper functions for generating and releasing cookies. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	9db5c62951	Bluetooth: Use command status event for Set IO Capability errors In case of failure, the Set IO Capability command is suppose to return command status and not command complete. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	56f787c502	Bluetooth: Fix wrong Get Clock Information return parameters The address information of the Get Clock Information return parameters is copying from a different memory location. It uses &cmd->param while it actually needs to be cmd->param. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	5504c3a310	Bluetooth: Use individual flags for certain management events Instead of hiding everything behind a general managment events flag, introduce indivdual flags that allow fine control over which events are send to a given management channel. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Johan Hedberg	37d3a1fab5	Bluetooth: mgmt: Fix sending redundant event for Advertising Instance When an Advertising Instance is removed, the Advertising Removed event shouldn't be sent to the same socket that issued the Remove Advertising command (it gets a command complete event instead). The mgmt_advertising_removed() function already has a parameter for skipping a specific socket, but there was no code to propagate the right value to this parameter. This patch fixes the issue by making sure the intermediate hci_req_clear_adv_instance() function gets the socket pointer. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	38ceaa00d0	Bluetooth: Add support for sending MGMT commands and events to monitor This adds support for tracing all management commands and events via the monitor interface. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	249fa1699f	Bluetooth: Add support for sending MGMT open and close to monitor This sends new notifications to the monitor support whenever a management channel has been opened or closed. This allows tracing of control channels really easily. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	03c979c471	Bluetooth: Introduce helper to pack mgmt version information The mgmt version information will be also needed for the control changell tracing feature. This provides a helper to pack them. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	70ecce91e3	Bluetooth: Store control socket cookie and comm information To further allow unique identification and tracking of control socket, store cookie and comm information when binding the socket. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	47b0f573f2	Bluetooth: Check SOL_HCI for raw socket options The SOL_HCI level should be enforced when using socket options on the HCI raw socket interface. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
Aristeu Rozanski	bd89bb6daa	mac802154: use rate limited warnings for malformed frames Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Aristeu Rozanski	ca1de81aa2	mac802154: don't warn on unsupported frames Just because we don't support certain types of frames yet doesn't mean we have to flood the message log with warnings about "invalid" frames. Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Alexander Aring	5ddedce3b7	6lowpan: ndisc: no overreact if no short address is available This patch removes handling to remove short address for a neigbour entry if RS/RA/NS/NA doesn't contain a short address. If these messages doesn't has any short address option, the existing short address from ndisc cache will be used. The current behaviour will set that the neigbour doesn't has a short address anymore. Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Alexander Aring	abbcc341ad	mac802154: set phy net namespace for new ifaces This patch sets the net namespace when creating SoftMAC interfaces. This is important if the namespace at phy layer was switched before. Currently we losing interfaces in some namespace and it's not possible to recover that. Signed-off-by: Alexander Aring <aar@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2016-09-19 20:19:34 +02:00
Marcel Holtmann	e64c97b53b	Bluetooth: Add combined LED trigger for controller power Instead of just having a LED trigger for power on a specific controller, this adds the LED trigger "bluetooth-power" that combines the power states of all controllers into a single trigger. This simplifies the trigger selection and also supports multiple controllers per host system via a single LED. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>	2016-09-19 20:19:34 +02:00
David Vrabel	d48f9ce73c	sunrpc: fix write space race causing stalls Write space becoming available may race with putting the task to sleep in xprt_wait_for_buffer_space(). The existing mechanism to avoid the race does not work. This (edited) partial trace illustrates the problem: [1] rpc_task_run_action: task:43546@5 ... action=call_transmit [2] xs_write_space <-xs_tcp_write_space [3] xprt_write_space <-xs_write_space [4] rpc_task_sleep: task:43546@5 ... [5] xs_write_space <-xs_tcp_write_space [1] Task 43546 runs but is out of write space. [2] Space becomes available, xs_write_space() clears the SOCKWQ_ASYNC_NOSPACE bit. [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but this has not yet been queued and the wake up is lost. [4] xs_nospace() is called which calls xprt_wait_for_buffer_space() which queues task 43546. [5] The call to sk->sk_write_space() at the end of xs_nospace() (which is supposed to handle the above race) does not call xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and thus the task is not woken. Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace() so the second call to sk->sk_write_space() calls xprt_write_space(). Suggested-by: Trond Myklebust <trondmy@primarydata.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> cc: stable@vger.kernel.org # 4.4 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:21:36 -04:00
Chuck Lever	496b77a5c5	xprtrdma: Eliminate rpcrdma_receive_worker() Clean up: the extra layer of indirection doesn't add value. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	1519e9697d	xprtrdma: Rename rpcrdma_receive_wc() Clean up: When converting xprtrdma to use the new CQ API, I missed a spot. The naming convention elsewhere is: {svc_rdma,rpcrdma}_wc_{operation} Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	eeb30613e1	xprtrmda: Report address of frmr, not mw Tie frwr debugging messages together by always reporting the address of the frwr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	44829d02d2	xprtrdma: Support larger inline thresholds The Version One default inline threshold is still 1KB. But allow testing with thresholds up to 64KB. This maximum is somewhat arbitrary. There's no fundamental architectural limit I'm aware of, but it's good to keep the size of Receive buffers reasonable. Now that Send can use a s/g list, a Send buffer is only as large as each RPC requires. Receive buffers are always the size of the inline threshold, however. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	655fec6987	xprtrdma: Use gathered Send for large inline messages An RPC Call message that is sent inline but that has a data payload (ie, one or more items in rq_snd_buf's page list) must be "pulled up:" - call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload - call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU. The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server. Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path. This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present. This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	c8b920bb49	xprtrdma: Basic support for Remote Invalidation Have frwr's ro_unmap_sync recognize an invalidated rkey that appears as part of a Receive completion. Local invalidation can be skipped for that rkey. Use an out-of-band signaling mechanism to indicate to the server that the client is prepared to receive RDMA Send With Invalidate. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	87cfb9a0c8	xprtrdma: Client-side support for rpcrdma_connect_private Send an RDMA-CM private message on connect, and look for one during a connection-established event. Both sides can communicate their various implementation limits. Implementations that don't support this sideband protocol ignore it. Once the client knows the server's inline threshold maxima, it can adjust the use of Reply chunks, and eliminate most use of Position Zero Read chunks. Moderately-sized I/O can be done using a pure inline RDMA Send instead of RDMA operations that require memory registration. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	6ea8e71150	xprtrdma: Move recv_wr to struct rpcrdma_rep Clean up: The fields in the recv_wr do not vary. There is no need to initialize them before each ib_post_recv(). This removes a large-ish data structure from the stack. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	90aab60296	xprtrdma: Move send_wr to struct rpcrdma_req Clean up: Most of the fields in each send_wr do not vary. There is no need to initialize them before each ib_post_send(). This removes a large-ish data structure from the stack. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	b157380af1	xprtrdma: Simplify rpcrdma_ep_post_recv() Clean up. Since commit `fc66448549` ("xprtrdma: Split the completion queue"), rpcrdma_ep_post_recv() no longer uses the "ep" argument. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	13650c23f1	xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf Clean up. The "ia" argument is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:38 -04:00
Chuck Lever	54cbd6b0c6	xprtrdma: Delay DMA mapping Send and Receive buffers Currently, each regbuf is allocated and DMA mapped at the same time. This is done during transport creation. When a device driver is unloaded, every DMA-mapped buffer in use by a transport has to be unmapped, and then remapped to the new device if the driver is loaded again. Remapping will have to be done _after_ the connect worker has set up the new device. But there's an ordering problem: call_allocate, which invokes xprt_rdma_allocate which calls rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_ the connect worker can run to set up the new device. Instead, at transport creation, allocate each buffer, but leave it unmapped. Once the RPC carries these buffers into ->send_request, by which time a transport connection should have been established, check to see that the RPC's buffers have been DMA mapped. If not, map them there. When device driver unplug support is added, it will simply unmap all the transport's regbufs, but it doesn't have to deallocate the underlying memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	99ef4db329	xprtrdma: Replace DMA_BIDIRECTIONAL The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt. Fortunately, xprtrdma now knows which direction I/O is going as soon as it allocates each regbuf. The RPC Call and Reply buffers are no longer the same regbuf. They can each be labeled correctly now. The RPC Reply buffer is never part of either a Send or Receive WR, but it can be part of Reply chunk, which is mapped and registered via ->ro_map . So it is not DMA mapped when it is allocated (DMA_NONE), to avoid a double- mapping. Since Receive buffers are no longer DMA_BIDIRECTIONAL and their contents are never modified by the host CPU, DMA-API-HOWTO.txt suggests that a DMA sync before posting each buffer should be unnecessary. (See my_card_interrupt_handler). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	08cf2efd54	xprtrdma: Use smaller buffers for RPC-over-RDMA headers Commit `949317464b` ("xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers") capped the number of chunks that may appear in RPC-over-RDMA headers. The maximum header size can be estimated and fixed to avoid allocating buffer space that is never used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	9c40c49f14	xprtrdma: Initialize separate RPC call and reply buffers RPC-over-RDMA needs to separate its RPC call and reply buffers. o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA Send operation using DMA_TO_DEVICE o If the client expects a large RPC reply, it DMA maps rq_rcv_buf as part of a Reply chunk using DMA_FROM_DEVICE The two mappings are for data movement in opposite directions. DMA-API.txt suggests that if these mappings share a DMA cacheline, bad things can happen. This could occur in the final bytes of rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers happen to share a DMA cacheline. On x86_64 the cacheline size is typically 8 bytes, and RPC call messages are usually much smaller than the send buffer, so this hasn't been a noticeable problem. But the DMA cacheline size can be larger on other platforms. Also, often rq_rcv_buf starts most of the way into a page, thus an additional RDMA segment is needed to map and register the end of that buffer. Try to avoid that scenario to reduce the cost of registering and invalidating Reply chunks. Instead of carrying a single regbuf that covers both rq_snd_buf and rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for rq_snd_buf and one regbuf for rq_rcv_buf. Some incidental changes worth noting: - To clear out some spaghetti, refactor xprt_rdma_allocate. - The value stored in rg_size is the same as the value stored in the iov.length field, so eliminate rg_size Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	5a6d1db455	SUNRPC: Add a transport-specific private field in rpc_rqst Currently there's a hidden and indirect mechanism for finding the rpcrdma_req that goes with an rpc_rqst. It depends on getting from the rq_buffer pointer in struct rpc_rqst to the struct rpcrdma_regbuf that controls that buffer, and then to the struct rpcrdma_req it goes with. This was done back in the day to avoid the need to add a per-rqst pointer or to alter the buf_free API when support for RPC-over-RDMA was introduced. I'm about to change the way regbuf's work to support larger inline thresholds. Now is a good time to replace this indirect mechanism with something that is more straightforward. I guess this should be considered a clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	68778945e4	SUNRPC: Separate buffer pointers for RPC Call and Reply messages For xprtrdma, the RPC Call and Reply buffers are involved in real I/O operations. To start with, the DMA direction of the I/O for a Call is opposite that of a Reply. In the current arrangement, the Reply buffer address is on a four-byte alignment just past the call buffer. Would be friendlier on some platforms if that was at a DMA cache alignment instead. Because the current arrangement allocates a single memory region which contains both buffers, the RPC Reply buffer often contains a page boundary in it when the Call buffer is large enough (which is frequent). It would be a little nicer for setting up DMA operations (and possible registration of the Reply buffer) if the two buffers were separated, well-aligned, and contained as few page boundaries as possible. Now, I could just pad out the single memory region used for the pair of buffers. But frequently that would mean a lot of unused space to ensure the Reply buffer did not have a page boundary. Add a separate pointer to rpc_rqst that points right to the RPC Reply buffer. This makes no difference to xprtsock, but it will help xprtrdma in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	3435c74aed	SUNRPC: Generalize the RPC buffer release API xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Instead of passing just the rq_buffer into the buf_free method, pass the task structure and let buf_free take care of freeing both XDR buffers at once. There's a micro-optimization here. In the common case, both xprt_release and the transport's buf_free method were checking if rq_buffer was NULL. Now the check is done only once per RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	5fe6eaa1f9	SUNRPC: Generalize the RPC buffer allocation API xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Transports that want to allocate separate Call and Reply buffers will ignore the "size" argument anyway. Don't bother passing it. The buf_alloc method can't return two pointers. Instead, make the method's return value an error code, and set the rq_buffer pointer in the method itself. This gives call_allocate an opportunity to terminate an RPC instead of looping forever when a permanent problem occurs. If a request is just bogus, or the transport is in a state where it can't allocate resources for any request, there needs to be a way to kill the RPC right there and not loop. This immediately fixes a rare problem in the backchannel send path, which loops if the server happens to send a CB request whose call+reply size is larger than a page (which it shouldn't do yet). One more issue: looks like xprt_inject_disconnect was incorrectly placed in the failure path in call_allocate. It needs to be in the success path, as it is for other call-sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	b9c5bc03be	SUNRPC: Refactor rpc_xdr_buf_init() Clean up: there is some XDR initialization logic that is common to the forward channel and backchannel. Move it to an XDR header so it can be shared. rpc_rqst::rq_buffer points to a buffer containing big-endian data. Update its annotation as part of the clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Chuck Lever	eb342e9a38	xprtrdma: Eliminate INLINE_THRESHOLD macros Clean up: r_xprt is already available everywhere these macros are invoked, so just dereference that directly. RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Andy Adamson	fda0ab4117	SUNRPC: rpc_clnt_add_xprt setup function for NFS layer Use a setup function to call into the NFS layer to test an rpc_xprt for session trunking so as to not leak the rpc_xprt_switch into the nfs layer. Search for the address in the rpc_xprt_switch first so as not to put an unnecessary EXCHANGE_ID on the wire. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	39e5d2df95	SUNRPC search xprt switch for sockaddr Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	dd69171769	SUNRPC rpc_clnt_xprt_switch_add_xprt Give the NFS layer access to the rpc_xprt_switch_add_xprt function Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	3b58a8a904	SUNRPC rpc_clnt_xprt_switch_put Give the NFS layer access to the xprt_switch_put function Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	7705f6abbb	SUNRPC remove rpc_task_release_client from rpc_task_set_client rpc_task_set_client is only called from rpc_run_task after rpc_new_task and rpc_task_release_client is not needed as the task is new. When called from rpc_new_task, rpc_task_set_client also removed the assigned rpc_xprt which is not desired. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Trond Myklebust	d002526886	SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create() Clean up. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Amitoj Kaur Chawla	2813b626e3	sunrpc: Remove unnecessary variable The variable `err` is not used anywhere and just returns the predefined value `0` at the end of the function. Hence, remove the variable and return 0 explicitly. Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:35 -04:00
Ilan Tayari	b588479358	xfrm: Fix memory leak of aead algorithm name commit `1a6509d991` ("[IPSEC]: Add support for combined mode algorithms") introduced aead. The function attach_aead kmemdup()s the algorithm name during xfrm_state_construct(). However this memory is never freed. Implementation has since been slightly modified in commit `ee5c23176f` ("xfrm: Clone states properly on migration") without resolving this leak. This patch adds a kfree() call for the aead algorithm name. Fixes: `1a6509d991` ("[IPSEC]: Add support for combined mode algorithms") Signed-off-by: Ilan Tayari <ilant@mellanox.com> Acked-by: Rami Rosen <roszenrami@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-09-19 12:08:58 +02:00
David S. Miller	e867e87ae8	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV90ZzvSw1s6N8H32AQIBsg//StI2J/tyGGTvvVXoeWPAGeDZp8907Qbc O6671V3feft0HvenQ/uFfC9PdpRTYHTlQhWidk9/wTx5nvg/5b7HGHSD8ghPnCRj Jyu4XbpXM6vw650qSBpRAg5TLIAkSsrdjaANzFSYw70sX6eTLdKJF3YFyoDRD2Cc T2Y0XIUAcACl5/A/KIHRaKjTtmzhoc5Vih9lOWQRZ/dw2u0A6GdwN0qoFfHqXwvI t1S4MtSkXrO8D9uWqmFhTpRzSAzO57L7bccVJbr+OzdNAm/Bkicjrxyzo6YepDAg VRdYaNtKCR9vPRFb9vZFlZy3vyUHb/22mrgM6/V5GG5qd/NFSnouFZ/WGxLnv1Yt 5X+F0qJXXgxSmaGt0f6cV7d9OO24HU/YWGl7wUCHsMZ4bWDySBUWtk0X4USpCsiD +AUxYJTEUNKZcCsW8XmC+cfnkRLNPycX603/pCH7aiPdQ5qY+ue89ICxu3PrJrf/ S8PuQPIoEWmEdTQ9QweG9FpIuQc2LO9fmzOJdiZXHYbGAKFOfNNz5Z64tSVRvPVE WbxW0HYSNCJmBt15OI3hyONmbgACgmuD+e3EFWzCATz94FwMogLBFNqPcw1YpbwX 48RdcP62gx7FMZ4MdqfWllviZKjTlTVmacCRghwGFBOuZL3ifu1gfhsTbrAxqXCs CSlnXgZV7gI= =/2cJ -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160917-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Tracepoint addition and improvement Here is a set of patches that add some more tracepoints and improve a couple of existing ones. New additions include: (1) Connection refcount tracking. (2) Client connection state machine tracking. (3) Tx and Rx packet lifecycle. (4) ACK reception and transmission. (5) recvmsg processing. Updates include: (1) Print the symbolic packet name in the Rx packet tracepoint. (2) Additional call refcount trace events. (3) Improvements to sk_buff tracking with AF_RXRPC. In addition: (1) Config option to inject packet loss during both transmission and reception. (2) Removal of some printks. This series needs to be applied on top of the previously posted fixes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:52:21 -04:00
David S. Miller	5b0c6fc8ef	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV90ZnPSw1s6N8H32AQKguA//Q7lvTa3n3DFvMQcIPsyJZ6VniUqksTA1 wmQrw4GHXRUgM8UWz7G9Y5aqxUp2q6y6Vm9BeHkQ2bYZjSrOx5Dc/AImWhBn5au+ h+HZYcEs4mFM5AoVT7GK8o/nNODDjNt2qwcH/Nf8+SM7Xf52zYHelrSteLZZ1YWO S1m4pa5YUBa/ICD3+K52lWq9SCG0VMmy41UHDXU6uSakfN62rn9ZYCcNeathlJGS 2D0cG3GzYYHiBZ1CkmdPQgGQhTM3wzI+0OvbNnidFlF78zVDlxX8C9zgWs/VlTCg 02ok4+ftzDojgXH9W+DziEiyCNq14GTDbrdSai5WA+vHVGagr6OoSwDWgPHkXBvW pYQh7jqRoBOKN2fsVkU0t19hPc2CCVGYMh49A6AFv8lgS+nWoDXOlmZ0snh8deQg Z0HO5mx+V+4yJplBlwH6ncvbRB9ywpsvIuLriZXC/aJg6aY4a8nrU35d1+6xUaM7 RBMud0uj+7oU+sC9N7CuM8m8HpBOg6+qAsbsfATSwadMRcMdS4LSoXcBg0WvljIH JmtL924yEnMDw1yPkmuDBcQ9K6DuxeOYZg4A2756tBtGulxuVjntmI1MVAQlsbqH CnNPWpxIDoLRQsHVcYWS5O1F16drGobzFhmj7Hf/6HmGa28x7nQhDafzfFj/3Dos MAdM2pdO2x8= =MjVT -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160917-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fixes & miscellany Here are some more AF_RXRPC fix patches with a couple of miscellaneous changes also. Fixes include: (1) Make RxRPC IPv6 support conditional on IPv6 being available. (2) Move the condition check in rxrpc_locate_data() into the caller and check the error return. (3) Fix the detection of the last received packet in recvmsg. (4) Account calls that need acceptance and clean up any unaccepted ones if the socket gets closed. (5) Fix the cleanup of client connections. (6) Fix the soft-ACK parsing and the retransmission of packets based on those ACKs. (7) Suppress transmission of an ACK when there's no pending ACK to transmit because another thread stole it. And some miscellany: (8) Whitespace removal. (9) Switch-value consistency in rxrpc_send_call_packet(). (10) Fix the basic transmission packet size to allow for spur-of-the-moment jumbo DATA packet production. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:51:21 -04:00
Florian Westphal	48da34b7a7	sched: add and use qdisc_skb_head helpers This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head. Its similar to the skb_buff_head api, but does not use skb->prev pointers. Qdiscs will commonly enqueue at the tail of a list and dequeue at head. While skb_buff_head works fine for this, enqueue/dequeue needs to also adjust the prev pointer of next element. The ->prev pointer is not required for qdiscs so we can just leave it undefined and avoid one cacheline write access for en/dequeue. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	ed760cb8aa	sched: replace __skb_dequeue with __qdisc_dequeue_head After previous patch these functions are identical. Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head. Next patch will then make __qdisc_dequeue_head handle single-linked list instead of strcut sk_buff_head argument. Doesn't change generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	ec32336879	sched: remove qdisc arg from __qdisc_dequeue_head Moves qdisc stat accouting to qdisc_dequeue_head. The only direct caller of the __qdisc_dequeue_head version open-codes this now. This allows us to later use __qdisc_dequeue_head as a replacement of __skb_dequeue() (which operates on sk_buff_head list). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	97d0678f91	sched: don't use skb queue helpers A followup change will replace the sk_buff_head in the qdisc struct with a slightly different list. Use of the sk_buff_head helpers will thus cause compiler warnings. Open-code these accesses in an extra change to ease review. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Florian Westphal	1486587b2f	pie: use qdisc_dequeue_head wrapper Doesn't change generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:47:18 -04:00
Christophe Jaillet	e8bc8f9a67	sctp: Remove some redundant code In commit `311b21774f` ("sctp: simplify sk_receive_queue locking"), a call to 'skb_queue_splice_tail_init()' has been made explicit. Previously it was hidden in 'sctp_skb_list_tail()' Now, the code around it looks redundant. The '_init()' part of 'skb_queue_splice_tail_init()' should already do the same. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:34:01 -04:00
Mahesh Bandewar	e8bffe0cf9	net: Add _nf_(un)register_hooks symbols Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow caller to hold RTNL mutex. Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:25:22 -04:00
Mahesh Bandewar	d409b84768	ipv6: Export p6_route_input_lookup symbol Make ip6_route_input_lookup available outside of ipv6 the module similar to ip_route_input_noref in the IPv4 world. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-19 01:25:22 -04:00
Nogah Frankel	69ae6ad2ff	net: core: Add offload stats to if_stats_msg Add a nested attribute of offload stats to if_stats_msg named IFLA_STATS_LINK_OFFLOAD_XSTATS. Under it, add SW stats, meaning stats only per packets that went via slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT. Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:33:42 -04:00
David S. Miller	c13ed534b8	This time we have various things - all across the board: * MU-MIMO sniffer support in mac80211 * a create_singlethread_workqueue() cleanup * interface dump filtering that was documented but not implemented * support for the new radiotap timestamp field * send delBA in two unexpected conditions (as required by the spec) * connect keys cleanups - allow only WEP with index 0-3 * per-station aggregation limit to work around broken APs * debugfs improvement for the integrated codel algorithm and various other small improvements and cleanups. -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJX2+umAAoJEGt7eEactAAdIMkP/jMmpbxkzD64L7nTkO4APGva r6RmMM1SmgVD/CtVkjlBLuvo5YOTWv/vWvy6KoUESOINAx/e6T3T7bmmCOXzbsOL e5/YYcS1AOqgn5SdhgIj1E5cpdYIhlUGRlNJ0qEjeLLrh4/TLUNbCcuPhOYybUMz fUrdPKgDeWb7x9EHLENhPsVtCXWwKnkDIS4qclPZCWgRj46XM4pNB4OlvCUzGY6k bOqGJfrtjYjgKFDmPFqfYA4JDA56980qqO41+eEKXeMvDKNs+pSiNco130Q+uU3E o7tk9DMnAnCy2GihpV1ZYVkLr6O+7o9xVuenj3NRlhyd1mn2gXxLcO4AkHcrZBkf Ei+2L+KgnWELyqiSOaGTJKlugsgS4DDoNnFEIVjSweQ9DIoBA/Gj/6+4uZeHXJ3M bEjtHnCLi5CuI067uBoevwXFoMi1poWra2KnZKOZzFS5OL3xHv4//x/Wmnn2/5Jz ffEwVyRmTY76sLWfnwXUDClrFWAYQrpNyTryc+k3cpYKzhnseiqt+z43cBuISm00 uh5B9PpPB8RhtUnXrL/SHRyf8YEluaidTsI2lc1LvwXOc0+Zbp73mTCgP+rzLs9p K2qVRiozpIXanW6hKmmaDwjKlcAKKLP0xN2v90MqwQt4YdLIKlXnll1AH2BawzuP OWB3n8D0I6y0PWH+Yo8o =s1MY -----END PGP SIGNATURE----- Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== This time we have various things - all across the board: * MU-MIMO sniffer support in mac80211 * a create_singlethread_workqueue() cleanup * interface dump filtering that was documented but not implemented * support for the new radiotap timestamp field * send delBA in two unexpected conditions (as required by the spec) * connect keys cleanups - allow only WEP with index 0-3 * per-station aggregation limit to work around broken APs * debugfs improvement for the integrated codel algorithm and various other small improvements and cleanups. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:29:08 -04:00
David S. Miller	7ac327318e	Two more fixes: * reject aggregation sessions for TSID/TID 8-16 that we can never use anyway and which could confuse drivers * check return value of skb_linearize() -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJX2+l7AAoJEGt7eEactAAdedMP/0k225qWRR+19lavkZ3/lSqb YiQdYNwKCZf9WR+5FiQKFcTV4XMCfKiWX2eJJWqFI5r+V4mfTJAZDehZpddCvJou KW3dZYe25+DaFApdhQiJXTxKSeJAbRB7EWT4i0P54Ned0YUeFwwOjvxuH+fq2VY0 tScX5JogPuGFVzXhlqUVZTjbHlBkcILGv7QjOXXL9yf4d2x57uXyabOB14l351RF 4hqXm7u+JQedbGjhYHDMKnyuwXQW4mbhELBdyMh7heLPqufWf4Xakrwh8WYSKuBb H5dICeq+/JPKnwOStmwtz1FQVIlU08PMLnWzQbByrkDvQPCIkMHLu6h6oNierqAH z4vyy8w05KzFNG3R5egygopXbncZvSOCcOPkDQxbTYmh2tK97OMUjXbUfPEUXZs2 WbSuECm+RVSNKtLlUip7TrM3iYI7xmCN/qPB4Wuq+z/JiSOrnP+G047iUbhdDpnE fwLmRWWdEGkjnv3TFdzzxLVAJPCQWyuLfxaUBlHlYRNqBY8ufTR3FHUKj4dVwqJK EqnSiDUqbZE6qJw516MEp+CqZGb8hzeMCeAu+docWMa3CmnNlglZtYgNiou45y9Z 9/8dAO93/qAzsaw9A/XprnIVPcu1vsFSicrhjOdZsnYoVJDGoW6rjzz6HpCk9QxB 7FjxZ5LilJ5Ws4ltlwr7 =3dek -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two more fixes: * reject aggregation sessions for TSID/TID 8-16 that we can never use anyway and which could confuse drivers * check return value of skb_linearize() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:26:49 -04:00
Eric Dumazet	695b4ec0f0	pkt_sched: fq: use proper locking in fq_dump_stats() When fq is used on 32bit kernels, we need to lock the qdisc before copying 64bit fields. Otherwise "tc -s qdisc ..." might report bogus values. Fixes: `afe4fd0624` ("pkt_sched: fq: Fair Queue packet scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:15:08 -04:00
Thadeu Lima de Souza Cascardo	db74a3335e	openvswitch: use percpu flow stats Instead of using flow stats per NUMA node, use it per CPU. When using megaflows, the stats lock can be a bottleneck in scalability. On a E5-2690 12-core system, usual throughput went from ~4Mpps to ~15Mpps when forwarding between two 40GbE ports with a single flow configured on the datapath. This has been tested on a system with possible CPUs 0-7,16-23. After module removal, there were no corruption on the slab cache. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Cc: pravin shelar <pshelar@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:14:01 -04:00
Thadeu Lima de Souza Cascardo	40773966cc	openvswitch: fix flow stats accounting when node 0 is not possible On a system with only node 1 as possible, all statistics is going to be accounted on node 0 as it will have a single writer. However, when getting and clearing the statistics, node 0 is not going to be considered, as it's not a possible node. Tested that statistics are not zero on a system with only node 1 possible. Also compile-tested with CONFIG_NUMA off. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:14:01 -04:00
Xin Long	41001faf95	sctp: not return ENOMEM err back in sctp_packet_transmit As David and Marcelo's suggestion, ENOMEM err shouldn't return back to user in transmit path. Instead, sctp's retransmit would take care of the chunks that fail to send because of ENOMEM. This patch is only to do some release job when alloc_skb fails, not to return ENOMEM back any more. Besides, it also cleans up sctp_packet_transmit's err path, and fixes some issues in err path: - It didn't free the head skb in nomem: path. - No need to check nskb in no_route: path. - It should goto err: path if alloc_skb fails for head. - Not all the NOMEMs should free nskb. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:33 -04:00
Xin Long	83dbc3d4a3	sctp: make sctp_outq_flush/tail/uncork return void sctp_outq_flush return value is meaningless now, this patch is to make sctp_outq_flush return void, as well as sctp_outq_fail and sctp_outq_uncork. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:33 -04:00
Xin Long	645194409b	sctp: save transmit error to sk_err in sctp_outq_flush Every time when sctp calls sctp_outq_flush, it sends out the chunks of control queue, retransmit queue and data queue. Even if some trunks are failed to transmit, it still has to flush all the transports, as it's the only chance to clean that transmit_list. So the latest transmit error here should be returned back. This transmit error is an internal error of sctp stack. I checked all the places where it uses the transmit error (the return value of sctp_outq_flush), most of them are actually just save it to sk_err. Except for sctp_assoc/endpoint_bh_rcv, they will drop the chunk if it's failed to send a REPLY, which is actually incorrect, as we can't be sure the error that sctp_outq_flush returns is from sending that REPLY. So it's meaningless for sctp_outq_flush to return error back. This patch is to save transmit error to sk_err in sctp_outq_flush, the new error can update the old value. Eventually, sctp_wait_for_* would check for it. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Xin Long	b61c654f9b	sctp: free msg->chunks when sctp_primitive_SEND return err Last patch "sctp: do not return the transmit err back to sctp_sendmsg" made sctp_primitive_SEND return err only when asoc state is unavailable. In this case, chunks are not enqueued, they have no chance to be freed if we don't take care of them later. This Patch is actually to revert commit `1cd4d5c432` ("sctp: remove the unused sctp_datamsg_free()"), commit `69b5777f2e` ("sctp: hold the chunks only after the chunk is enqueued in outq") and commit `8b570dc9f7` ("sctp: only drop the reference on the datamsg after sending a msg"), to use sctp_datamsg_free to free the chunks of current msg. Fixes: `8b570dc9f7` ("sctp: only drop the reference on the datamsg after sending a msg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Xin Long	66388f2c08	sctp: do not return the transmit err back to sctp_sendmsg Once a chunk is enqueued successfully, sctp queues can take care of it. Even if it is failed to transmit (like because of nomem), it should be put into retransmit queue. If sctp report this error to users, it confuses them, they may resend that msg, but actually in kernel sctp stack is in charge of retransmit it already. Besides, this error probably is not from the failure of transmitting current msg, but transmitting or retransmitting another msg's chunks, as sctp_outq_flush just tries to send out all transports' chunks. This patch is to make sctp_cmd_send_msg return avoid, and not return the transmit err back to sctp_sendmsg Fixes: `8b570dc9f7` ("sctp: only drop the reference on the datamsg after sending a msg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Xin Long	2c89791eeb	sctp: remove the unnecessary state check in sctp_outq_tail Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks the asoc's state through statetable before calling sctp_outq_tail. So there's no need to check the asoc's state again in sctp_outq_tail. Besides, sctp_do_sm is protected by lock_sock, even if sending msg is interrupted by timer events, the event's processes still need to acquire lock_sock first. It means no others CMDs can be enqueue into side effect list before CMD_SEND_MSG to change asoc->state, so it's safe to remove it. This patch is to remove redundant asoc->state check from sctp_outq_tail. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-18 22:02:32 -04:00
Alexei Starovoitov	8d79266bc4	ip6_tunnel: add collect_md mode to IPv6 tunnels Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels to operate in 'collect metadata' mode. Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function for both collect_md and traditional tunnels. bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:13:07 -04:00
Alexei Starovoitov	cfc7381b30	ip_tunnel: add collect_md mode to IPIP tunnel Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to operate in 'collect metadata' mode. bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away. ovs can use it as well in the future (once appropriate ovs-vport abstractions and user apis are added). Note that just like in other tunnels we cannot cache the dst, since tunnel_info metadata can be different for every packet. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:13:07 -04:00
Julia Lawall	eb94737d71	l2tp: constify net_device_ops structures Check for net_device_ops structures that are only stored in the netdev_ops field of a net_device structure. This field is declared const, so net_device_ops structures that have this property can be declared as const also. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r disable optional_qualifier@ identifier i; position p; @@ static struct net_device_ops i@p = { ... }; @ok@ identifier r.i; struct net_device e; position p; @@ e.netdev_ops = &i@p; @bad@ position p != {r.p,ok.p}; identifier r.i; struct net_device_ops e; @@ e@i@p @depends on !bad disable optional_qualifier@ identifier r.i; @@ static +const struct net_device_ops i = { ... }; // </smpl> The result of size on this file before the change is: text data bss dec hex filename 3401 931 44 4376 1118 net/l2tp/l2tp_eth.o and after the change it is: text data bss dec hex filename 3993 347 44 4384 1120 net/l2tp/l2tp_eth.o Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:07:23 -04:00
Alan Cox	5ff904d55d	llc: switch type to bool as the timeout is only tested versus 0 (As asked by Dave in Februrary) Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:05:05 -04:00
Eric Dumazet	3613b3dbd1	tcp: prepare skbs for better sack shifting With large BDP TCP flows and lossy networks, it is very important to keep a low number of skbs in the write queue. RACK and SACK processing can perform a linear scan of it. We should avoid putting any payload in skb->head, so that SACK shifting can be done if needed. With this patch, we allow to pack ~0.5 MB per skb instead of the 64KB initially cooked at tcp_sendmsg() time. This gives a reduction of number of skbs in write queue by eight. tcp_rack_detect_loss() likes this. We still allow payload in skb->head for first skb put in the queue, to not impact RPC workloads. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 10:05:05 -04:00
phil.turnbull@oracle.com	8ab86c00e3	irda: Free skb on irda_accept error path. skb is not freed if newsk is NULL. Rework the error path so free_skb is unconditionally called on function exit. Fixes: `c3ea9fa274` ("[IrDA] af_irda: IRDA_ASSERT cleanups") Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 09:59:31 -04:00
Eric Dumazet	ffb4d6c850	tcp: fix overflow in __tcp_retransmit_skb() If a TCP socket gets a large write queue, an overflow can happen in a test in __tcp_retransmit_skb() preventing all retransmits. The flow then stalls and resets after timeouts. Tested: sysctl -w net.core.wmem_max=1000000000 netperf -H dest -- -s 1000000000 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 09:59:30 -04:00
David Howells	8a681c3605	rxrpc: Add config to inject packet loss Add a configuration option to inject packet loss by discarding approximately every 8th packet received and approximately every 8th DATA packet transmitted. Note that no locking is used, but it shouldn't really matter. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:04 +01:00
David Howells	71f3ca408f	rxrpc: Improve skb tracing Improve sk_buff tracing within AF_RXRPC by the following means: (1) Use an enum to note the event type rather than plain integers and use an array of event names rather than a big multi ?: list. (2) Distinguish Rx from Tx packets and account them separately. This requires the call phase to be tracked so that we know what we might find in rxtx_buffer[]. (3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the event type. (4) A pair of 'rotate' events are added to indicate packets that are about to be rotated out of the Rx and Tx windows. (5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for packet loss injection recording. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:04 +01:00
David Howells	ba39f3a0ed	rxrpc: Remove printks from rxrpc_recvmsg_data() to fix uninit var Remove _enter/_debug/_leave calls from rxrpc_recvmsg_data() of which one uses an uninitialised variable. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:04 +01:00
David Howells	849979051c	rxrpc: Add a tracepoint to follow what recvmsg does Add a tracepoint to follow what recvmsg does within AF_RXRPC. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	58dc63c998	rxrpc: Add a tracepoint to follow packets in the Rx buffer Add a tracepoint to follow the life of packets that get added to a call's receive buffer. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	f3639df2d9	rxrpc: Add a tracepoint to log ACK transmission Add a tracepoint to log information about ACK transmission. Signed-off-by: David Howels <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	ec71eb9ada	rxrpc: Add a tracepoint to log received ACK packets Add a tracepoint to log information from received ACK packets. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	a124fe3ee5	rxrpc: Add a tracepoint to follow the life of a packet in the Tx buffer Add a tracepoint to follow the insertion of a packet into the transmit buffer, its transmission and its rotation out of the buffer. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	363deeab6d	rxrpc: Add connection tracepoint and client conn state tracepoint Add a pair of tracepoints, one to track rxrpc_connection struct ref counting and the other to track the client connection cache state. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	a84a46d730	rxrpc: Add some additional call tracing Add additional call tracepoint points for noting call-connected, call-released and connection-failed events. Also fix one tracepoint that was using an integer instead of the corresponding enum value as the point type. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	a3868bfc8d	rxrpc: Print the packet type name in the Rx packet trace Print a symbolic packet type name for each valid received packet in the trace output, not just a number. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 11:24:03 +01:00
David Howells	182f505624	rxrpc: Fix the basic transmit DATA packet content size at 1412 bytes Fix the basic transmit DATA packet content size at 1412 bytes so that they can be arbitrarily assembled into jumbo packets. In the future, I'm thinking of moving to keeping a jumbo packet header at the beginning of each packet in the Tx queue and creating the packet header on the spot when kernel_sendmsg() is invoked. That way, jumbo packets can be assembled on the spur of the moment for (re-)transmission. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:54:32 +01:00
David Howells	2311e327cd	rxrpc: Be consistent about switch value in rxrpc_send_call_packet() rxrpc_send_call_packet() should use type in both its switch-statements rather than using pkt->whdr.type. This might give the compiler an easier job of uninitialised variable checking. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:54:21 +01:00
David Howells	27d0fc431c	rxrpc: Don't transmit an ACK if there's no reason set Don't transmit an ACK if call->ackr_reason in unset. There's the possibility of a race between recvmsg() sending an ACK and the background processing thread trying to send the same one. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:55 +01:00
David Howells	dfa7d92040	rxrpc: Fix retransmission algorithm Make the retransmission algorithm use for-loops instead of do-loops and move the counter increments into the for-statement increment slots. Though the do-loops are slighly more efficient since there will be at least one pass through the each loop, the counter increments are harder to get right as the continue-statements skip them. Without this, if there are any positive acks within the loop, the do-loop will cycle forever because the counter increment is never done. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	d01dc4c3c1	rxrpc: Fix the parsing of soft-ACKs The soft-ACK parser doesn't increment the pointer into the soft-ACK list, resulting in the first ACK/NACK value being applied to all the relevant packets in the Tx queue. This has the potential to miss retransmissions and cause excessive retransmissions. Fix this by incrementing the pointer. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	78883793f8	rxrpc: Fix unexposed client conn release If the last call on a client connection is release after the connection has had a bunch of calls allocated but before any DATA packets are sent (so that it's not yet marked RXRPC_CONN_EXPOSED), an assertion will happen in rxrpc_disconnect_client_call(). af_rxrpc: Assertion failed - 1(0x1) >= 2(0x2) is false ------------[ cut here ]------------ kernel BUG at ../net/rxrpc/conn_client.c:753! This is because it's expecting the conn to have been exposed and to have 2 or more refs - but this isn't necessarily the case. Simply remove the assertion. This allows the conn to be moved into the inactive state and deleted if it isn't resurrected before the final put is called. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	357f5ef646	rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call() Call rxrpc_release_call() on getting an error in rxrpc_new_client_call() rather than trying to do the cleanup ourselves. This isn't a problem, provided we set RXRPC_CALL_HAS_USERID only if we actually add the call to the calls tree as cleanup code fragments that would otherwise cause problems are conditional. Without this, we miss some of the cleanup. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:21 +01:00
David Howells	66d58af7f4	rxrpc: Fix the putting of client connections In rxrpc_put_one_client_conn(), if a connection has RXRPC_CONN_COUNTED set on it, then it's accounted for in rxrpc_nr_client_conns and may be on various lists - and this is cleaned up correctly. However, if the connection doesn't have RXRPC_CONN_COUNTED set on it, then the put routine returns rather than just skipping the extra bit of cleanup. Fix this by making the extra bit of clean up conditional instead and always killing off the connection. This manifests itself as connections with a zero usage count hanging around in /proc/net/rxrpc_conns because the connection allocated, but discarded, due to a race with another process that set up a parallel connection, which was then shared instead. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:53:20 +01:00
David Howells	0360da6db7	rxrpc: Purge the to_be_accepted queue on socket release Purge the queue of to_be_accepted calls on socket release. Note that purging sock_calls doesn't release the ref owned by to_be_accepted. Probably the sock_calls list is redundant given a purges of the recvmsg_q, the to_be_accepted queue and the calls tree. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:51:54 +01:00
David Howells	e6f3afb3fc	rxrpc: Record calls that need to be accepted Record calls that need to be accepted using sk_acceptq_added() otherwise the backlog counter goes negative because sk_acceptq_removed() is called. This causes the preallocator to malfunction. Calls that are preaccepted by AFS within the kernel aren't affected by this. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:51:54 +01:00
David Howells	816c9fce12	rxrpc: Fix handling of the last packet in rxrpc_recvmsg_data() The code for determining the last packet in rxrpc_recvmsg_data() has been using the RXRPC_CALL_RX_LAST flag to determine if the rx_top pointer points to the last packet or not. This isn't a good idea, however, as the input code may be running simultaneously on another CPU and that sets the flag before updating the top pointer. Fix this by the following means: (1) Restrict the use of RXRPC_CALL_RX_LAST to the input routines only. There's otherwise a synchronisation problem between detecting the flag and checking tx_top. This could probably be dealt with by appropriate application of memory barriers, but there's a simpler way. (2) Set RXRPC_CALL_RX_LAST after setting rx_top. (3) Make rxrpc_rotate_rx_window() consult the flags header field of the DATA packet it's about to discard to see if that was the last packet. Use this as the basis for ending the Rx phase. This shouldn't be a problem because the recvmsg side of things is guaranteed to see the packets in order. (4) Make rxrpc_recvmsg_data() return 1 to indicate the end of the data if: (a) the packet it has just processed is marked as RXRPC_LAST_PACKET (b) the call's Rx phase has been ended. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:51:54 +01:00
David Howells	2e2ea51dec	rxrpc: Check the return value of rxrpc_locate_data() Check the return value of rxrpc_locate_data() in rxrpc_recvmsg_data(). Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:50:49 +01:00
David Howells	4b22457c06	rxrpc: Move the check of rx_pkt_offset from rxrpc_locate_data() to caller Move the check of rx_pkt_offset from rxrpc_locate_data() to the caller, rxrpc_recvmsg_data(), so that it's more clear what's going on there. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:50:48 +01:00
David Howells	fabf920180	rxrpc: Remove some whitespace. Remove a tab that's on a line that should otherwise be blank. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-17 10:50:15 +01:00
David Howells	d19127473a	rxrpc: Make IPv6 support conditional on CONFIG_IPV6 Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it. This is then made conditional on CONFIG_IPV6. Without this, the following can be seen: net/built-in.o: In function `rxrpc_init_peer': >> peer_object.c:(.text+0x18c3c8): undefined reference to `ip6_route_output_flags' Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-17 03:58:45 -04:00
Linus Torvalds	87ee1280ff	Fix a memory corruption bug that I introduced in 4.7. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJX3GKYAAoJECebzXlCjuG+7jEQAIgMFl0feSlYrWqPjHZuUXcu uNdvnxKgWWP+JJp5pEh+uFLUsXqamfMqcEj4B0OgcoZcsYMDVsAEafeSWkdNVQeO Wr+j+IBcC1oECGA5MbZWzHBzw1CMq5/l411pvU+2xeZl8iG83MCl9su8dgf4ctAe hc7asU5fiQ+nazhfBaKpVzMAQO+G76E5J90B/Y+VbjYvtQoITUaT3R/PqBIKoG6j ezlElPHKddD2Yd8JnqPfnQ521I4yQL2AhrtJzQayCAnDlBXJ0B41IchcFRxPJPBX teNybHjHX9XQMGyE13EM4MO43AM9Bfcn7hAUPppag4tnvmS34fkIbBsWwWJjTmhR sjfLoBbP11KgGUjg7vRbNNsQ1zHhDyvss5ChUQAqfkDDH7CpnmOPXUqKKSGIfVvw fCpBMqxh5xwSzX3m2Wm2wVsUYH0rSyTKmcomdrZw0o577L2HsoaqWa2hcD/v7fE2 MtPnYLszJXJA3k1ZLoJ6L4sFphznOwzwrj8oLpan5YGsJnonyHcequZw9Z4YoMYG YPztd/kwctIezKhHyRqmr8lghTJ5vjLK49FfJEjrDAb8vfXq3Gwb7ZkwtZBH9RSw CV2vmJ1ny9PRRCKw0UCYAfUv5ucS2JiRbk31qjoIMYpDWJQghrZHnltRxgKg5UDW ZxTGNLwH/WXRVHJzTaeE =gr4D -----END PGP SIGNATURE----- Merge tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux Pull nfsd bugfix from Bruce Fields: "Fix a memory corruption bug that I introduced in 4.7" * tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux: svcauth_gss: Revert `64c59a3726` ("Remove unnecessary allocation")	2016-09-16 17:00:26 -07:00
Luca Coelho	fbd05e4a6e	cfg80211: add helper to find an IE that matches a byte-array There are a few places where an IE that matches not only the EID, but also other bytes inside the element, needs to be found. To simplify that and reduce the amount of similar code, implement a new helper function to match the EID and an extra array of bytes. Additionally, simplify cfg80211_find_vendor_ie() by using the new match function. Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-16 14:49:52 +02:00
Emmanuel Grumbach	c68df2e7be	mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE In `46fa38e84b` ("mac80211: allow software PS-Poll/U-APSD with AP_LINK_PS"), Johannes allowed to use mac80211's code for handling stations that go to PS or send PS-Poll / uAPSD trigger frames for devices that enable RSS. This means that mac80211 doesn't look at frames anymore but rather relies on a notification that will come from the device when a PS transition occurs or when a PS-Poll / trigger frame is detected by the device. iwlwifi will need this capability but still needs mac80211 to take care of the TIM IE. Today, if a driver sets AP_LINK_PS, mac80211 will not update the TIM IE. Change mac80211 to check existence of the set_tim driver callback rather than using AP_LINK_PS to decide if the driver handles the TIM IE internally or not. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> [reword commit message a bit] Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-16 14:49:23 +02:00
John Crispin	cafdc45c94	net-next: dsa: add Qualcomm tag RX/TX handler Add support for the 2-bytes Qualcomm tag that gigabit switches such as the QCA8337/N might insert when receiving packets, or that we need to insert while targeting specific switch ports. The tag is inserted directly behind the ethernet header. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: John Crispin <john@phrozen.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:31:51 -04:00
Mark Tomlinson	d6f64d725b	net: VRF: Pass original iif to ip_route_input() The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except the global VRF, this replaces skb->dev with the VRF master interface. When calling ip_route_input_noref() from here, the checks for forwarding look at this master device instead of the initial ingress interface. This will allow packets to be routed which normally would be dropped. For example, an interface that is not assigned an IP address should drop packets, but because the checking is against the master device, the packet will be forwarded. The fix here is to still call l3mdev_ip_rcv(), but remember the initial net_device. This is passed to the other functions within ip_rcv_finish, so they still see the original interface. Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:24:07 -04:00
David S. Miller	6777ae8083	Here are two batman-adv bugfix patches: - Fix reference counting for last_bonding_candidate, by Sven Eckelmann - Fix head room reservation for ELP packets, by Linus Luessing -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdBQJX2UCwFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS nqHp+w/+MGG87sjUfObJKrSTmsIDgWHl2VSQwKBivbkvIuBm9gxIhFblaBQEJd5A 88Z5LOGWpnxh2rfFIPurTLxkYAZyARpIxWOLVtmPE3TdfMi4savTsvkOywd0ZHqE kXZH1QHE/Y3CQBf9FaM5cgTiwAXjZ+KWr5kRg3WNmH0oQUepndUBb5AAbjpD2G3f Gt4TsfunplXCmA/5uJMISOjKiub8usUhXVXBHVpxR+ZItdoIolwN197wDzXT8pWK FJe+Flqrvxi3n0kNFknUzCDdt09TFLms4m0AEu+8f4P6t1mR7v+YHNM4IlBx8BX0 6Kwiz009h9+5JFZvOSTCy6tkrodn5cAk9LNNsam5uPTWXeY76gFOzegdHIcIaBxq MLEqnTUgONqOakZs+4NopUz0HUvtJXDGYJcy7SDLYIYEwhaHP7seyuPkjA+xap2p uPZR3dTYESdeNGs/uPTEGVh1q8w6xXqXe9IRzP1KvGAOsg5IXCboLvJHpVMc25aT 4CT9qz5H94sVqdR6d5pBb+0CLoxnhbl+4IjKnHwMBzVM/0LVDJZRmx9qgiNHJ+0f PGQeQEbAfrihORRn2jCEq0YLCHKhjvLw9O7GVj353wmiVkvvuR+MQEv+bcDgu0DE mH9maOpcOahud6S+x+m9of3f3ytwWR4UO2Qk5t1RdIuVRSD7sXM= =mHPX -----END PGP SIGNATURE----- Merge tag 'batadv-net-for-davem-20160914' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are two batman-adv bugfix patches: - Fix reference counting for last_bonding_candidate, by Sven Eckelmann - Fix head room reservation for ELP packets, by Linus Luessing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:17:33 -04:00
Eric Dumazet	76f0dcbb5a	tcp: fix a stale ooo_last_skb after a replace When skb replaces another one in ooo queue, I forgot to also update tp->ooo_last_skb as well, if the replaced skb was the last one in the queue. To fix this, we simply can re-use the code that runs after an insertion, trying to merge skbs at the right of current skb. This not only fixes the bug, but also remove all small skbs that might be a subset of the new one. Example: We receive segments 2001:3001, 4001:5001 Then we receive 2001:8001 : We should replace 2001:3001 with the big skb, but also remove 4001:50001 from the queue to save space. packetdrill test demonstrating the bug 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> +0.100 < . 1:1(0) ack 1 win 1024 +0 accept(3, ..., ...) = 4 +0.01 < . 1001:2001(1000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001> +0.01 < . 1001:3001(2000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001> Fixes: `9f5afeae51` ("tcp: use an RB tree for ooo receive queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 04:09:49 -04:00
David S. Miller	364eac0c8b	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV9h/5PSw1s6N8H32AQLJCQ//RFbu0SNSoJnnZbOTwkxBaGYnGg4KbNVt iR4zumQfFssyYr7WcH1S6kuPzM/dJfjkRYqollyUGCEfnWyDwyfnjM9Na9PQoZ9F k7xnbim8N65njHLdGF6QMhenmoRXSBVCN2E0uPTbBXurFHJ8ZgQQs+DhogalvGUl 2TL/aMdpqRoo1Vg0/APVOKeLGqgHEhrXxelTZB/74IXyYT+rzjfzu+ZfwxUAijsM d+FBSwY+D8RYSV4LXQzMNNFCwNORbG2Rse2nEqd7bVqdVywWsuhbgeESjx1Y3+ge /mofVyxrpoblT9qsScbISbIQEe6cLxRiQgQHEudennRI2/3EbpNSijhNFWVon2Em NAa7r+tfOPtVx5JTL9NyvwtrXPfAgDi7Stpml3Yhhr/CjRHYK9kfKysowMXL5vOz NHD0fUozNLecpGCmdxG+alwf5BJ5q9DRPP7bI7KE/4FfVsYe8bO6pQ9G+myeP/A4 h8DuvK4xSJUEpEa5dpDLA1wSC2XH5PgYdXIr2DFBaFjllIdf1cGNKKIYRF2eIX/I obVD7c72oZV1kIK3URTLJpE+CdA4KgTFuL3YqIxquA2Iedb1t2uOrQcp4WwaWf7V REY9KbBn1F0+yJfO3Fjckerzle+MlAmrAHQpkZUduo5JzdRm3DY++YzswXQT5Fpr 8S3T30nwY0k= =T1wh -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160913-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Support IPv6 Here is a set of patches that add IPv6 support. They need to be applied on top of the just-posted miscellaneous fix patches. They are: (1) Make autobinding of an unconnected socket work when sendmsg() is called to initiate a client call. (2) Don't specify the protocol when creating the client socket, but rather take the default instead. (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were doing the same thing manually. This allows the IPv6 address extraction to be done in fewer places. (4) Add IPv6 support. With this, calls can be made to IPv6 servers from userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the RPC calls need to be upgradeable. ==================== Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 01:57:19 -04:00
David S. Miller	39caa8bf6d	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV9h5A/Sw1s6N8H32AQJOPA//UI0606GZV2zjGqvWYbwquxjhWbbiVfEx CB5BeiQjKs8MxrJeHT/+bh6Z1Y6YorkyrVCc7kI1RQ+yiN0hw49bhFfF9Kr46DBF gYI2VdiKjIFEgC9fTenLkhMDQC7Hhf9O50hzk9QcC4y7w1Lhytah97d9w+Df0ECy a2QLMe2Ad9K5qR08ih3yTH7+G9K1m4/iqIrON2Hd9Opb+oFJgOiixvUVPr9f/6Xd /2YeAPDy/2A1MQ2nNE+oSW4C5uD+mJICqjjSw9YyhYl31lIfwBZ7+DE9hjR1qCXj UzMJLKrutXQQ1U7/Fbbke6UU5yKVm1djQB1qTF8t1hCHp/q88E7T06UUU9oBDqe0 98CjPofEXBcqn9hjrXIvJgxCEISTPHx9ikaq0i5yF/6pSHZ9G8gLUfrqbMwipkfk mXItd6HAHXhX7cS5u76v7I4c9u5olexX5cJ91/ibtOdsupiJTMLwCx4twR6knEcS /6SSqjklFL4f6HjuNlNJ8m2dB98DII+Ym0qo/ZQy4KUm/+0yzrkpGHvt32CR4wng qjtDN+KgxNss1duu4zkHgQe22u3iSRToxwydWTIQYY6tx4e08X1eSIFRL5ddYpEC bjnOtmniAyDP5YF1jRwFDLS3YzT9Uvrf0TVAOvU7/FjPh3KCGa8fn38xIbEsX6eI 1uadG1bf9wg= =vHfH -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160913-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Miscellaneous fixes Here's a set of miscellaneous fix patches. There are a couple of points of note: (1) There is one non-fix patch that adjusts the call ref tracking tracepoint to make kernel API-held refs on calls more obvious. This is a prerequisite for the patch that fixes prealloc refcounting. (2) The final patch alters how jumbo packets that partially exceed the receive window are handled. Previously, space was being left in the Rx buffer for them, but this significantly hurts performance as the Rx window can't be increased to match the OpenAFS Tx window size. Instead, the excess subpackets are discarded and an EXCEEDS_WINDOW ACK is generated for the first. To avoid the problem of someone trying to run the kernel out of space by feeding the kernel a series of overlapping maximal jumbo packets, we stop allowing jumbo packets on a call if we encounter more than three jumbo packets with duplicate or excessive subpackets. ==================== Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 01:52:20 -04:00
David S. Miller	3e454fd5db	A few more fixes: * better mesh path fixing, from Thomas * fix TIM IE recalculation after sending frames to a sleeping station, from Felix * fix sequence number assignment while sending frames to a sleeping station, also from Felix * validate number of probe response CSA counter offsets, fixing a copy/paste bug (from myself) -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJX2FrGAAoJEGt7eEactAAd/ZUP/ix4CokfC/WYxAYoOI6R2X+X UVvkFgG8hsZMzm2PZ4bg14jx9cQdnisgzxJnDgwoYvAnjBR0mbPl50LXoJ+JTlVr JnoIeMEKXqtxP099B4ssSCpW8YEr6Ex8RFweqN5vLuyIhlW63To1pOMvj9+zvUDB 3P7CmX2D7cWYys8ciDEaNhHziJRAdd9AruWf3emS1IhScDw05NlDQlff6FZpG/gW JMVGjKs+v5DdT3SH677/S6KfpwkPEpIoWzgYH32OW7nXxfuYZCoMWSdeoCfgNe+i WgZ154RPBD/YcXsCdNtw36uped7qlrS55Fnlyi1bKMx7vdKLJHABh+FB7qxFpO2F 2gh3G/KBcuGFfOihPXp11m3LlUn+ml1Ri3F8ffwvoO2ZoZGDplrH6oX2+Q9Hlm7U zvhuVzLriq0xHGZhv4EXqMt+70Zk0GvwzBviSjYXEwUJ+h/FkCyrVyUCpfmko9lf 8QMhI+rm0b5BtOvTkGI4ghEbpz80x1UnnUEGiMdzFhFIunSy43EzJ/RbpiAiaOCI pou/moi9+0yMAuJQkMEK2Yg+DdluTsSNuMHB7NKTx8o/wzwgFWFJ6jHlQO72SM43 k0Ku66LyJfovmv03dPaTxDSISaoxsn+1+SlsUPAF0usLyLka9A0WBjOE8edfDI0h hAamoQ0QHd4OlHtgKib6 =xzhM -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2016-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A few more fixes: * better mesh path fixing, from Thomas * fix TIM IE recalculation after sending frames to a sleeping station, from Felix * fix sequence number assignment while sending frames to a sleeping station, also from Felix * validate number of probe response CSA counter offsets, fixing a copy/paste bug (from myself) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-16 01:34:08 -04:00
Lance Richardson	2679d04041	openvswitch: avoid deferred execution of recirc actions The ovs kernel data path currently defers the execution of all recirc actions until stack utilization is at a minimum. This is too limiting for some packet forwarding scenarios due to the small size of the deferred action FIFO (10 entries). For example, broadcast traffic sent out more than 10 ports with recirculation results in packet drops when the deferred action FIFO becomes full, as reported here: http://openvswitch.org/pipermail/dev/2016-March/067672.html Since the current recursion depth is available (it is already tracked by the exec_actions_level pcpu variable), we can use it to determine whether to execute recirculation actions immediately (safe when recursion depth is low) or defer execution until more stack space is available. With this change, the deferred action fifo size becomes a non-issue for currently failing scenarios because it is no longer used when there are three or fewer recursions through ovs_execute_actions(). Suggested-by: Pravin Shelar <pshelar@ovn.org> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 20:35:52 -04:00
Or Gerlitz	a53d850a79	net/sched: cls_flower: Remove an unused field from the filter key structure Commit `c3f8324188` "net: Add full IPv6 addresses to flow_keys" added an unused instance of struct flow_dissector_key_addrs into struct fl_flow_key, remove it. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 20:27:23 -04:00
Or Gerlitz	aa72d70837	net/sched: cls_flower: Support masking for matching on tcp/udp ports Add the definitions for src/dst udp/tcp port masks and use them when setting && dumping the relevant keys. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 20:27:23 -04:00
Jamal Hadi Salim	86da71b573	net_sched: Introduce skbmod action This action is intended to be an upgrade from a usability perspective from pedit (as well as operational debugability). Compare this: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action pedit munge offset -14 u8 set 0x02 \ munge offset -13 u8 set 0x15 \ munge offset -12 u8 set 0x15 \ munge offset -11 u8 set 0x15 \ munge offset -10 u16 set 0x1515 \ pipe to: sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbmod dmac 02:15:15:15:15:15 Also try to do a MAC address swap with pedit or worse try to debug a policy with destination mac, source mac and etherype. Then make few rules out of those and you'll get my point. In the future common use cases on pedit can be migrated to this action (as an example different fields in ip v4/6, transports like tcp/udp/sctp etc). For this first cut, this allows modifying basic ethernet header. The most important ethernet use case at the moment is when redirecting or mirroring packets to a remote machine. The dst mac address needs a re-write so that it doesnt get dropped or confuse an interconnecting (learning) switch or dropped by a target machine (which looks at the dst mac). And at times when flipping back the packet a swap of the MAC addresses is needed. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:33:47 -04:00
Daniel Borkmann	f53d8c7b18	bpf: use skb_at_tc_ingress helper in tcf_bpf We have a small skb_at_tc_ingress() helper for testing for ingress, so make use of it. cls_bpf already uses it and so should act_bpf. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:29:47 -04:00
Daniel Borkmann	04b3f8de4b	bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is actually unnecessary and can be removed altogether. This was added by commit `a166151cbe` ("bpf: fix bpf helpers to use skb->mac_header relative offsets"), which was later on improved by `3431205e03` ("bpf: make programs see skb->data == L2 for ingress and egress"). We're always guaranteed to have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf(). Reason is that since `6d1ccff627` ("net: reset mac header in dev_start_xmit()") we do skb_reset_mac_header() in __dev_queue_xmit() before we could call into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress() always sees a valid mac header as well (things like skb_reset_mac_len() would badly fail otherwise). Thus, drop the unnecessary test in classifier and action case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:29:47 -04:00
Hadar Hen Zion	07c0f09e23	net/sched: act_tunnel_key: Remove rcu_read_lock protection Remove rcu_read_lock protection from tunnel_key_dump and use rtnl_dereference, dump operation is protected by rtnl lock. Also, remove rcu_read_lock from tunnel_key_release and use rcu_dereference_protected. Both operations are running exclusively and a writer couldn't modify t->params while those functions are executed. Fixes: 54d94fd89d90 ('net/sched: Introduce act_tunnel_key') Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-15 19:18:18 -04:00
Johannes Berg	ec53c832ee	cfg80211: remove unnecessary pointer-of For an array, there's no need to use &array, so just use the plain wiphy->addresses[i].addr here to silence smatch. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:20 +02:00
Rajkumar Manoharan	e8a24cd4b8	mac80211: allow driver to handle packet-loss mechanism Based on consecutive msdu failures, mac80211 triggers CQM packet-loss mechanism. Drivers like ath10k that have its own connection monitoring algorithm, offloaded to firmware for triggering station kickout. In case of station kickout, driver will report low ack status by mac80211 API (ieee80211_report_low_ack). This flag will enable the driver to completely rely on firmware events for station kickout and bypass mac80211 packet loss mechanism. Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:20 +02:00
Johannes Berg	c7e9dbcf09	mac80211: remove sta_remove_debugfs driver callback No drivers implement this, relying either on the recursive directory removal to remove their debugfs, or not having any to start with. Remove the dead driver callback. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:19 +02:00
Johannes Berg	8826fef95b	mac80211: remove pointless chanctx NULL check If chanctx is derived as container_of() from a non-NULL pointer, it can't ever be NULL. Since we checked conf before, that's true here, so remove the useless NULL check. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:19 +02:00
Johannes Berg	5140974dca	mac80211: remove unused assignment The next line overwrites this assignment, so remove it; there's no real value in using it for the next assignment either. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:18 +02:00
Johannes Berg	53b18980fd	nl80211: always check nla_put* return values A few instances were found where we didn't check them, add the missing checks even though they'll probably never trigger as the message should be large enough here. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:17 +02:00
Johannes Berg	76e1fb4b55	nl80211: always check nla_nest_start() return value If the message got full during nla_nest_start(), it can return NULL. None of the cases here seem like that can really happen, but check the return value nonetheless. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:17 +02:00
Johannes Berg	58bd7f1158	mac80211: fix scan completed tracing Passing the 'info' pointer where a 'info->aborted' is expected will always lead to tracing to erroneously record that the scan was aborted, fix that by passing the correct info->aborted. The remaining data will be collected in cfg80211, so I haven't duplicated it here. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:16 +02:00
Johannes Berg	93db1d9e6c	mac80211: fix possible out-of-bounds access In the unlikely situation that the supplicant has negotiated admission for the background AC (which it has no reason to as it's not supposed to be requiring admission control to start with, and we'd ignore such a requirement anyway), the loop here may terminate with non_acm_ac == 4, which leads to an array overrun. Check this explicitly just for completeness. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:46:16 +02:00
Johannes Berg	f1c1f17ac5	cfg80211: allow connect keys only with default (TX) key There's no point in allowing connect keys when one of them isn't also configured as the TX key, it would just confuse drivers and probably cause them to pick something for TX. Disallow this confusing and erroneous configuration. As wpa_supplicant will always send NL80211_ATTR_KEYS, even when there are no keys inside, allow that and treat it as though the attribute isn't present at all. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 16:45:41 +02:00
Johannes Berg	85d5313ed7	mac80211: reject TSPEC TIDs (TSIDs) for aggregation Since mac80211 doesn't currently support TSIDs 8-15 which can only be used after QoS TSPEC negotiation (and not even after WMM negotiation), reject attempts to set up aggregation sessions for them, which might confuse drivers. In mac80211 we do correctly handle that, but the TSIDs should never get used anyway, and drivers might not be able to handle it. Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-15 10:08:52 +02:00
Johannes Berg	0b97a484e5	mac80211: check skb_linearize() return value The A-MSDU TX code (within TXQs) didn't always check the return value of skb_linearize() properly, resulting in potentially passing a frag- list SKB down to the driver even when it said it can't handle it. Fix that. Fixes: `6e0456b545` ("mac80211: add A-MSDU tx support") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-14 12:08:33 +02:00
David Howells	75b54cb57c	rxrpc: Add IPv6 support Add IPv6 support to AF_RXRPC. With this, AF_RXRPC sockets can be created: service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6); instead of: service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); The AFS filesystem doesn't support IPv6 at the moment, though, since that requires upgrades to some of the RPC calls. Note that a good portion of this patch is replacing "%pI4:%u" in print statements with "%pISpc" which is able to handle both protocols and print the port. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	1c2bc7b948	rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually There are two places that want to transmit a packet in response to one just received and manually pick the address to reply to out of the sk_buff. Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled automatically. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	aaa31cbc66	rxrpc: Don't specify protocol to when creating transport socket Pass 0 as the protocol argument when creating the transport socket rather than IPPROTO_UDP. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	cd5892c756	rxrpc: Create an address for sendmsg() to bind unbound socket with Create an address for sendmsg() to bind unbound socket with rather than using a completely blank address otherwise the transport socket creation will fail because it will try to use address family 0. We use the address family specified in the protocol argument when the AF_RXRPC socket was created and SOCK_DGRAM as the default. For anything else, bind() must be used. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 23:09:13 +01:00
David Howells	75e4212639	rxrpc: Correctly initialise, limit and transmit call->rx_winsize call->rx_winsize should be initialised to the sysctl setting and the sysctl setting should be limited to the maximum we want to permit. Further, we need to place this in the ACK info instead of the sysctl setting. Furthermore, discard the idea of accepting the subpackets of a jumbo packet that lie beyond the receive window when the first packet of the jumbo is within the window. Just discard the excess subpackets instead. This allows the receive window to be opened up right to the buffer size less one for the dead slot. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:45 +01:00
David Howells	3432a757b1	rxrpc: Fix prealloc refcounting The preallocated call buffer holds a ref on the calls within that buffer. The ref was being released in the wrong place - it worked okay for incoming calls to the AFS cache manager service, but doesn't work right for incoming calls to a userspace service. Instead of releasing an extra ref service calls in rxrpc_release_call(), the ref needs to be released during the acceptance/rejectance process. To this end: (1) The prealloc ref is now normally released during rxrpc_new_incoming_call(). (2) For preallocated kernel API calls, the kernel API's ref needs to be released when the call is discarded on socket close. (3) We shouldn't take a second ref in rxrpc_accept_call(). (4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds the call to the to_be_accepted socket queue. In doing (4) above, we would prefer not to put the call's refcount down to 0 as that entails doing cleanup in softirq context, but it's unlikely as there are several refs held elsewhere, at least one of which must be put by someone in process context calling rxrpc_release_call(). However, it's not a problem if we do have to do that. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:37 +01:00
David Howells	cbd00891de	rxrpc: Adjust the call ref tracepoint to show kernel API refs Adjust the call ref tracepoint to show references held on a call by the kernel API separately as much as possible and add an additional trace to at the allocation point from the preallocation buffer for an incoming call. Note that this doesn't show the allocation of a client call for the kernel separately at the moment. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:30 +01:00
David Howells	01fd074224	rxrpc: Allow tx_winsize to grow in response to an ACK Allow tx_winsize to grow when the ACK info packet shows a larger receive window at the other end rather than only permitting it to shrink. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:38:24 +01:00
David Howells	89a80ed4c0	rxrpc: Use skb->len not skb->data_len skb->len should be used rather than skb->data_len when referring to the amount of data in a packet. This will only cause a malfunction in the following cases: (1) We receive a jumbo packet (validation and splitting both are wrong). (2) We see if there's extra ACK info in an ACK packet (we think it's not there and just ignore it). Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:22 +01:00
David Howells	b25de36053	rxrpc: Add missing unlock in rxrpc_call_accept() Add a missing unlock in rxrpc_call_accept() in the path taken if there's no call to wake up. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:22 +01:00
David Howells	33b603fda8	rxrpc: Requeue call for recvmsg if more data rxrpc_recvmsg() needs to make sure that the call it has just been processing gets requeued for further attention if the buffer has been filled and there's more data to be consumed. The softirq producer only queues the call and wakes the socket if it fills the first slot in the window, so userspace might end up sleeping forever otherwise, despite there being data available. This is not a problem provided the userspace buffer is big enough or it empties the buffer completely before more data comes in. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
David Howells	91c2c7b656	rxrpc: The IDLE ACK packet should use rxrpc_idle_ack_delay The IDLE ACK packet should use the rxrpc_idle_ack_delay setting when the timer is set for it. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
David Howells	bc4abfcf51	rxrpc: Add missing wakeup on Tx window rotation We need to wake up the sender when Tx window rotation due to an incoming ACK makes space in the buffer otherwise the sender is liable to just hang endlessly. This problem isn't noticeable if the Tx phase transfers no more than will fit in a single window or the Tx window rotates fast enough that it doesn't get full. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
David Howells	08a39685a7	rxrpc: Make sure we initialise the peer hash key Peer records created for incoming connections weren't getting their hash key set. This meant that incoming calls wouldn't see more than one DATA packet - which is not a problem for AFS CM calls with small request data blobs. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-13 22:36:21 +01:00
Johannes Berg	89b706fb28	cfg80211: reduce connect key caching struct size After the previous patches, connect keys can only (correctly) be used for storing static WEP keys. Therefore, remove all the data for dealing with key index 4/5 and reduce the size of the key material to the maximum for WEP keys. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:20:54 +02:00
Johannes Berg	e9c8f8d3a4	cfg80211: validate key index better Don't accept it if a key_idx < 0 snuck through, reject WEP keys with key index 4 and 5 (which are used for IGTKs) and don't allow IGTKs with key indices other than 4 and 5. This makes the key data match expectations better. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:20:53 +02:00
Johannes Berg	9381e267b6	cfg80211: wext: only allow WEP keys to be configured before connected When not connected, anything but WEP keys shouldn't be allowed to be configured for later - only static WEP keys make sense at this point. Change wext to reject anything else just like nl80211 does. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:20:52 +02:00
Johannes Berg	386b1f2738	nl80211: only allow WEP keys during connect command This was already documented that way in nl80211.h, but the parsing code still accepted other key types. Change it to really only accept WEP keys as documented. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:20:52 +02:00
Johannes Berg	42ee231cd1	nl80211: fix connect keys range check Only key index 0-3 should be accepted, 4/5 are for IGTKs and cannot be used as connect keys. Fix the range checking to not allow such erroneous configurations. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:20:51 +02:00
Johannes Berg	b6b5555bc8	cfg80211: disallow shared key authentication with key index 4 Key index 4 can only be used for an IGTK, so the range checks for shared key authentication should treat 4 as an error, fix that in the code. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:20:51 +02:00
Johannes Berg	ad5987b47e	nl80211: validate number of probe response CSA counters Due to an apparent copy/paste bug, the number of counters for the beacon configuration were checked twice, instead of checking the number of probe response counters. Fix this to check the number of probe response counters before parsing those. Cc: stable@vger.kernel.org Fixes: `9a774c78e2` ("cfg80211: Support multiple CSA counters") Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 20:19:27 +02:00
Xin Long	715f5552b1	sctp: hold the transport before using it in sctp_hash_cmp Since commit `4f00878126` ("sctp: apply rhashtable api to send/recv path"), sctp uses transport rhashtable with .obj_cmpfn sctp_hash_cmp, in which it compares the members of the transport with the rhashtable args to check if it's the right transport. But sctp uses the transport without holding it in sctp_hash_cmp, it can cause a use-after-free panic. As after it gets transport from hashtable, another CPU may close the sk and free the asoc. In sctp_association_free, it frees all the transports, meanwhile, the assoc's refcnt may be reduced to 0, assoc can be destroyed by sctp_association_destroy. So after that, transport->assoc is actually an unavailable memory address in sctp_hash_cmp. Although sctp_hash_cmp is under rcu_read_lock, it still can not avoid this, as assoc is not freed by RCU. This patch is to hold the transport before checking it's members with sctp_transport_hold, in which it checks the refcnt first, holds it if it's not 0. Fixes: `4f00878126` ("sctp: apply rhashtable api to send/recv path") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-13 11:44:58 -04:00
Wei Yongjun	c20cb81193	tipc: fix possible memory leak in tipc_udp_enable() 'ub' is malloced in tipc_udp_enable() and should be freed before leaving from the error handling cases, otherwise it will cause memory leak. Fixes: `ba5aa84a2d` ("tipc: split UDP nl address parsing") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-13 11:28:32 -04:00
Vivien Didelot	308433155a	net: bridge: add helper to call /sbin/bridge-stp If /sbin/bridge-stp is available on the system, bridge tries to execute it instead of the kernel implementation when starting/stopping STP. If anything goes wrong with /sbin/bridge-stp, bridge silently falls back to kernel STP, making hard to debug userspace STP. This patch adds a br_stp_call_user helper to start/stop userspace STP and debug errors from the program: abnormal exit status is stored in the lower byte and normal exit status is stored in higher byte. Below is a simple example on a kernel with dynamic debug enabled: # ln -s /bin/false /sbin/bridge-stp # brctl stp br0 on br0: failed to start userspace STP (256) # dmesg br0: /sbin/bridge-stp exited with code 1 br0: failed to start userspace STP (256) br0: using kernel STP Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-13 11:21:31 -04:00
David S. Miller	67b9f0b737	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Endianess fix for the new nf_tables netlink trace infrastructure, NFTA_TRACE_POLICY endianess was not correct, patch from Liping Zhang. 2) Fix broken re-route after userspace queueing in nf_tables route chain. This patch is large but it is simple since it is just getting this code in sync with iptable_mangle. Also from Liping. 3) NAT mangling via ctnetlink lies to userspace when nf_nat_setup_info() fails to setup the NAT conntrack extension. This problem has been there since the beginning, but it can now show up after rhashtable conversion. 4) Fix possible NULL pointer dereference due to failures in allocating the synproxy and seqadj conntrack extensions, from Gao feng. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-13 11:17:24 -04:00
Johannes Berg	4854f175c3	mac80211: remove useless open_count check __ieee80211_suspend() checks early on if there's anything to do by checking open_count, so there's no need to check again later in the function. Remove the useless check. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 15:39:29 +02:00
Gao Feng	4440a2ab3b	netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions When memory is exhausted, nfct_seqadj_ext_add may fail to add the synproxy and seqadj extensions. The function nf_ct_seqadj_init doesn't check if get valid seqadj pointer by the nfct_seqadj. Now drop the packet directly when fail to add seqadj extension to avoid dereference NULL pointer in nf_ct_seqadj_init from init_conntrack(). Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-13 10:50:56 +02:00
Laura Garcia Liebana	14e2dee099	netfilter: nft_hash: fix hash overflow validation The overflow validation in the init() function establishes that the maximum value that the hash could reach is less than U32_MAX, which is likely to be true. The fix detects the overflow when the maximum hash value is less than the offset itself. Fixes: `70ca767ea1` ("netfilter: nft_hash: Add hash offset value") Reported-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-13 10:49:23 +02:00
Johannes Berg	11d62caf93	mac80211: simplify TDLS RA lookup smatch pointed out that the second check of "tdls_auth" was pointless since if it was true, we returned from the function already. We can further simplify the code by moving the first check (if it's a TDLS peer at all) into the outer if, to only handle that inside. This simplifies the control flow here. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 08:28:22 +02:00
Toke Høiland-Jørgensen	8d51dbb8c7	mac80211: Re-structure aqm debugfs output and keep CoDel stats per txq Currently the 'aqm' stats in mac80211 only keeps overlimit drop stats, not CoDel stats. This moves the CoDel stats into the txqi structure to keep them per txq in order to show them in debugfs. In addition, the aqm debugfs output is restructured by splitting it up into three files: One global per phy, one per netdev and one per station, in the appropriate directories. The files are all called aqm, and are only created if the driver supports the wake_tx_queue op (rather than emitting an error on open as previously). Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-13 08:20:16 +02:00
David S. Miller	b20b378d49	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/mediatek/mtk_eth_soc.c drivers/net/ethernet/qlogic/qed/qed_dcbx.c drivers/net/phy/Kconfig All conflicts were cases of overlapping commits. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-12 15:52:44 -07:00
Linus Torvalds	2c937eb4dd	NFS client bugfixes for 4.8 Highlights include: Stable patches: - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct state accounting - Fix the CREATE_SESSION slot number Bugfixes: - sunrpc: fix a UDP memory accounting regression - NFS: Fix an error reporting regression in nfs_file_write() - pNFS: Fix further layout stateid issues - RPC/rdma: Revert `3d4cf35bd4` ("xprtrdma: Reply buffer exhaustion...") - RPC/rdma: Fix receive buffer accounting -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJX1wEwAAoJEGcL54qWCgDysPMP/iEgzv6Peky9DVYG35btxZXC QQxZDfvOa3Xxe9cH0JwfyisaDHw2gO5RQqFFCCxA/x0dZsf2s3Nrjt6C9yH8q7qF i8c1OQ8oEBMgM+BsByCQniUubSaAvs2jVVpAs7G+eOYPSqxFKzsHJwDqqRp4aZrW YDohIumsHFoKl1GYCx9jv44wtmQQJjgIJ0Uq8SJvMkSzzRaGgVIeCbfpRgtqVD3g mU8k3XV0C+fnLgtwtlG1dkqbnuNSp1gT72f8joId+SJjtnGgjxqi0eIn48vY5k4N SJ5+4N6Uko87k9uQ2zn1UTR2Jrltn7mtMI7RHJVuiLnbZjAsf0lfOIF3sgItAwhS G0F/EHzMbt3+vs4P9EsGJgTcViVplgJeXw0hQIqXbJN0IwsXG0/UYGuPUFxtMOHQ +ko8BYJaNWcQCVdkFc5rVyt/tM6rKDahLlA3sIn3bCGssL67CYgkfNsBIoOEmjp9 u4XTYwJYD2hXMpskc8W623voQ2/VDbbWB6bphmZH9EeOvlzRB5TW5OvEB0VE805+ WYZal32LNnaUE4rpUtr78rYEvzPqn7tb9+OglP/tYa1QB3A0nwC9f74CDQ6s08oR K00fVXu9yffkBty8Cm0e4HpUcjT+95BMVdJUJU3lhbUbu+eq74L/32OSjuGmdRWf c4S6sHfgCeX6uJPCb2rD =j4kB -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct state accounting - Fix the CREATE_SESSION slot number Bugfixes: - sunrpc: fix a UDP memory accounting regression - NFS: Fix an error reporting regression in nfs_file_write() - pNFS: Fix further layout stateid issues - RPC/rdma: Revert `3d4cf35bd4` ("xprtrdma: Reply buffer exhaustion...") - RPC/rdma: Fix receive buffer accounting" * tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.1: Fix the CREATE_SESSION slot number accounting xprtrdma: Fix receive buffer accounting xprtrdma: Revert `3d4cf35bd4` ("xprtrdma: Reply buffer exhaustion...") pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised NFS: Fix error reporting in nfs_file_write() sunrpc: fix UDP memory accounting	2016-09-12 14:13:45 -07:00
Chuck Lever	bf2c4b6f9b	svcauth_gss: Revert `64c59a3726` ("Remove unnecessary allocation") rsc_lookup steals the passed-in memory to avoid doing an allocation of its own, so we can't just pass in a pointer to memory that someone else is using. If we really want to avoid allocation there then maybe we should preallocate somwhere, or reference count these handles. For now we should revert. On occasion I see this on my server: kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851! kernel: invalid opcode: 0000 [#1] SMP kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15 kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 kernel: Workqueue: events do_cache_clean [sunrpc] kernel: task: ffff8808541d8000 task.stack: ffff880854344000 kernel: RIP: 0010:[<ffffffff811e7075>] [<ffffffff811e7075>] kfree+0x155/0x180 kernel: RSP: 0018:ffff880854347d70 EFLAGS: 00010246 kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600 kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064 kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780 kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500 kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0 kernel: FS: 0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0 kernel: Stack: kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0 kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006 kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f kernel: Call Trace: kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss] kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss] kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc] kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc] kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0 kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0 kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0 kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0 kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40 kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110 kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff kernel: RIP [<ffffffff811e7075>] kfree+0x155/0x180 kernel: RSP <ffff880854347d70> kernel: ---[ end trace 3fdec044969def26 ]--- It seems to be most common after a server reboot where a client has been using a Kerberos mount, and reconnects to continue its workload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-12 16:57:16 -04:00
Pablo Neira Ayuso	ecfcdfec7e	netfilter: nf_nat: handle NF_DROP from nfnetlink_parse_nat_setup() nf_nat_setup_info() returns NF_* verdicts, so convert them to error codes that is what ctnelink expects. This has passed overlook without having any impact since this nf_nat_setup_info() has always returned NF_ACCEPT so far. Since `870190a9ec` ("netfilter: nat: convert nat bysrc hash to rhashtable"), this is problem. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 20:32:57 +02:00
Liping Zhang	2e917d602a	netfilter: nft_numgen: fix race between num generate and store it After we generate a new number, we still use the priv->counter and store it to the dreg. This is not correct, another cpu may already change it to a new number. So we must use the generated number, not the priv->counter itself. Fixes: `91dbc6be0a` ("netfilter: nf_tables: add number generator expression") Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 20:00:23 +02:00
Florian Westphal	8e8118f893	netfilter: conntrack: remove packet hotpath stats These counters sit in hot path and do show up in perf, this is especially true for 'found' and 'searched' which get incremented for every packet processed. Information like searched=212030105 new=623431 found=333613 delete=623327 does not seem too helpful nowadays: - on busy systems found and searched will overflow every few hours (these are 32bit integers), other more busy ones every few days. - for debugging there are better methods, such as iptables' trace target, the conntrack log sysctls. Nowadays we also have perf tool. This removes packet path stat counters except those that are expected to be 0 (or close to 0) on a normal system, e.g. 'insert_failed' (race happened) or 'invalid' (proto tracker rejects). The insert stat is retained for the ctnetlink case. The found stat is retained for the tuple-is-taken check when NAT has to determine if it needs to pick a different source address. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 19:59:39 +02:00
Gao Feng	23d07508d2	netfilter: Add the missed return value check of nft_register_chain_type There are some codes of netfilter module which did not check the return value of nft_register_chain_type. Add the checks now. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 19:54:45 +02:00
Gao Feng	4e6577de71	netfilter: Add the missed return value check of register_netdevice_notifier There are some codes of netfilter module which did not check the return value of register_netdevice_notifier. Add the checks now. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 19:54:43 +02:00
Pablo Neira	cf71c03edf	netfilter: nf_conntrack: simplify __nf_ct_try_assign_helper() return logic Instead of several goto's just to return the result, simply return it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 19:54:34 +02:00
Pablo Neira Ayuso	71212c9b04	netfilter: nf_tables: don't drop IPv6 packets that cannot parse transport This is overly conservative and not flexible at all, so better let them go through and let the filtering policy decide what to do with them. We use skb_header_pointer() all over the place so we would just fail to match when trying to access fields from malformed traffic. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 18:52:32 +02:00
Pablo Neira Ayuso	10151d7b03	netfilter: nf_tables_bridge: use nft_set_pktinfo_ipv{4, 6}_validate Consolidate pktinfo setup and validation by using the new generic functions so we converge to the netdev family codebase. We only need a linear IPv4 and IPv6 header from the reject expression, so move nft_bridge_iphdr_validate() and nft_bridge_ip6hdr_validate() to net/bridge/netfilter/nft_reject_bridge.c. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 18:52:15 +02:00
Pablo Neira Ayuso	ddc8b6027a	netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate() These functions are extracted from the netdev family, they initialize the pktinfo structure and validate that the IPv4 and IPv6 headers are well-formed given that these functions are called from a path where layer 3 sanitization did not happen yet. These functions are placed in include/net/netfilter/nf_tables_ipv{4,6}.h so they can be reused by a follow up patch to use them from the bridge family too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 18:52:09 +02:00
Pablo Neira Ayuso	beac5afa2d	netfilter: nf_tables: ensure proper initialization of nft_pktinfo fields This patch introduces nft_set_pktinfo_unspec() that ensures proper initialization all of pktinfo fields for non-IP traffic. This is used by the bridge, netdev and arp families. This new function relies on nft_set_pktinfo_proto_unspec() to set a new tprot_set field that indicates if transport protocol information is available. Remain fields are zeroed. The meta expression has been also updated to check to tprot_set in first place given that zero is a valid tprot value. Even a handcrafted packet may come with the IPPROTO_RAW (255) protocol number so we can't rely on this value as tprot unset. Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 18:51:57 +02:00
Pablo Neira Ayuso	dbd2be0646	netfilter: nft_dynset: allow to invert match criteria The dynset expression matches if we can fit a new entry into the set. If there is no room for it, then it breaks the rule evaluation. This patch introduces the inversion flag so you can add rules to explicitly drop packets that don't fit into the set. For example: # nft filter input flow table xyz size 4 { ip saddr timeout 120s counter } overflow drop This is useful to provide a replacement for connlimit. For the rule above, every new entry uses the IPv4 address as key in the set, this entry gets a timeout of 120 seconds that gets refresh on every packet seen. If we get new flow and our set already contains 4 entries already, then this packet is dropped. You can already express this in positive logic, assuming default policy to drop: # nft filter input flow table xyz size 4 { ip saddr timeout 10s counter } accept Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-12 18:49:50 +02:00
Laura Garcia Liebana	70ca767ea1	netfilter: nft_hash: Add hash offset value Add support to pass through an offset to the hash value. With this feature, the sysadmin is able to generate a hash with a given offset value. Example: meta mark set jhash ip saddr mod 2 seed 0xabcd offset 100 This option generates marks according to the source address from 100 to 101. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>	2016-09-12 18:37:12 +02:00
Linus Torvalds	da499f8f53	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Mostly small sets of driver fixes scattered all over the place. 1) Mediatek driver fixes from Sean Wang. Forward port not written correctly during TX map, missed handling of EPROBE_DEFER, and mistaken use of put_page() instead of skb_free_frag(). 2) Fix socket double-free in KCM code, from WANG Cong. 3) QED driver fixes from Sudarsana Reddy Kalluru, including a fix for using the dcbx buffers before initializing them. 4) Mellanox Switch driver fixes from Jiri Pirko, including a fix for double fib removals and an error handling fix in mlxsw_sp_module_init(). 5) Fix kernel panic when enabling LLDP in i40e driver, from Dave Ertman. 6) Fix padding of TSO packets in thunderx driver, from Sunil Goutham. 7) TCP's rcv_wup not initialized properly when using fastopen, from Neal Cardwell. 8) Don't use uninitialized flow keys in flow dissector, from Gao Feng. 9) Use after free in l2tp module unload, from Sabrina Dubroca. 10) Fix interrupt registry ordering issues in smsc911x driver, from Jeremy Linton. 11) Fix crashes in bonding having to do with enslaving and rx_handler, from Mahesh Bandewar. 12) AF_UNIX deadlock fixes from Linus. 13) In mlx5 driver, don't read skb->xmit_mode after it might have been freed from the TX reclaim path. From Tariq Toukan. 14) Fix a bug from 2015 in TCP Yeah where the congestion window does not increase, from Artem Germanov. 15) Don't pad frames on receive in NFP driver, from Jakub Kicinski. 16) Fix chunk fragmenting in SCTP wrt. GSO, from Marcelo Ricardo Leitner. 17) Fix deletion of VRF routes, from Mark Tomlinson. 18) Fix device refcount leak when DAD fails in ipv6, from Wei Yongjun" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (101 commits) net/mlx4_en: Fix panic on xmit while port is down net/mlx4_en: Fixes for DCBX net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_state() net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_all() net: ethernet: renesas: sh_eth: add POST registers for rz drivers: net: phy: mdio-xgene: Add hardware dependency dwc_eth_qos: do not register semi-initialized device sctp: identify chunks that need to be fragmented at IP level mlxsw: spectrum: Set port type before setting its address mlxsw: spectrum_router: Fix error path in mlxsw_sp_router_init nfp: don't pad frames on receive nfp: drop support for old firmware ABIs nfp: remove linux/version.h includes tcp: cwnd does not increase in TCP YeAH net/mlx5e: Fix parsing of vlan packets when updating lro header net/mlx5e: Fix global PFC counters replication net/mlx5e: Prevent casting overflow net/mlx5e: Move an_disable_cap bit to a new position net/mlx5e: Fix xmit_more counter race issue tcp: fastopen: avoid negative sk_forward_alloc ...	2016-09-12 07:56:06 -07:00
Pedersen, Thomas	5df20f2141	mac80211: make mpath path fixing more robust A fixed mpath was not quite being treated as such: 1) if a PERR frame was received, a fixed mpath was deactivated. 2) queued path discovery for fixed mpath was potentially being considered, changing mpath state. 3) other mpath flags were potentially being inherited when fixing the mpath. Just assign PATH_FIXED and SN_VALID. This solves several issues when fixing a mesh path in one direction. The reverse direction mpath should probably also be fixed, or root announcements at least be enabled. Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 12:27:14 +02:00
Felix Fietkau	df6ef5d8a8	mac80211: fix sequence number assignment for PS response frames When using intermediate queues, sequence number allocation is deferred until dequeue. This doesn't work for PS response frames, which bypass those queues. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:56:49 +02:00
Felix Fietkau	83843c80dc	mac80211: fix tim recalculation after PS response Handle the case where the mac80211 intermediate queues are empty and the driver has buffered frames Fixes: `ba8c3d6f16` ("mac80211: add an intermediate software queue implementation") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:54:42 +02:00
Johannes Berg	53f249747d	mac80211: send delBA on unexpected BlockAck Request If we don't have a BA session, send delBA, as requested by the IEEE 802.11 spec. Apply the same limit of sending such a delBA only once as in the previous patch. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:46:31 +02:00
Johannes Berg	bfe40fa395	mac80211: send delBA on unexpected BlockAck data frames When we receive data frames with ACK policy BlockAck, send delBA as requested by the 802.11 spec. Since this would be happening for every frame inside an A-MPDU if it's really received outside a session, limit it to a single attempt. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:46:21 +02:00
Johannes Berg	99ee7cae3b	mac80211: add support for radiotap timestamp field Use the existing device timestamp from the RX status information to add support for the new radiotap timestamp field. Currently only 32-bit counters are supported, but we also add the radiotap mactime where applicable. This new field allows more flexibility in where the timestamp is taken etc. The non-timestamp data in the field is taken from a new field in the hw struct. Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:45:45 +02:00
Aviya Erenfeld	42bd20d998	mac80211: add support for MU-MIMO air sniffer add support to MU-MIMO air sniffer according groupID: in monitor mode, use a given MU-MIMO groupID to monitor stations that belongs to that group using MU-MIMO. add support for following a station according to its MAC address using VHT MU-MIMO sniffer: the monitors wait until they get an action MU-MIMO notification frame, then parses it in order to find the groupID that corresponds to the given MAC address and monitors packets destined to that groupID using VHT MU-MIMO. Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:44:52 +02:00
Maxim Altshul	480dd46b9d	mac80211: RX BA support for sta max_rx_aggregation_subframes The ability to change the max_rx_aggregation frames is useful in cases of IOP. There exist some devices (latest mobile phones and some AP's) that tend to not respect a BA sessions maximum size (in Kbps). These devices won't respect the AMPDU size that was negotiated during association (even though they do respect the maximal number of packets). This violation is characterized by a valid number of packets in a single AMPDU. Even so, the total size will exceed the size negotiated during association. Eventually, this will cause some undefined behavior, which in turn causes the hw to drop packets, causing the throughput to plummet. This patch will make the subframe limitation to be held by each station, instead of being held only by hw. Signed-off-by: Maxim Altshul <maxim.altshul@ti.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:36:21 +02:00
Bhaktipriya Shridhar	e481901384	cfg80211: Remove deprecated create_singlethread_workqueue The workqueue "cfg80211_wq" is involved in cleanup, scan and event related works. It queues multiple work items &rdev->event_work, &rdev->dfs_update_channels_wk, &wiphy_to_rdev(request->wiphy)->scan_done_wk, &wiphy_to_rdev(wiphy)->sched_scan_results_wk, which require strict execution ordering. Hence, an ordered dedicated workqueue has been used. Since it's a wireless driver, WQ_MEM_RECLAIM has been set to ensure forward progress under memory pressure. Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:24:48 +02:00
Aviya Erenfeld	d82121845d	mac80211: refactor monitor representation in sdata Insert the u32 monitor flags variable in a new structure that represents a monitor interface. This will allow to add more configuration variables to that structure which will happen in an upcoming change. Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:24:47 +02:00
Denis Kenzior	b7fb44daca	nl80211: Allow GET_INTERFACE dumps to be filtered This patch allows GET_INTERFACE dumps to be filtered based on NL80211_ATTR_WIPHY or NL80211_ATTR_WDEV. The documentation for GET_INTERFACE mentions that this is possible: "Request an interface's configuration; either a dump request on a %NL80211_ATTR_WIPHY or ..." However, this behavior has not been implemented until now. Johannes: rewrite most of the patch: * use nl80211_dump_wiphy_parse() to also allow passing an interface to be able to dump its siblings * fix locking (must hold rtnl around using nl80211_fam.attrbuf) * make init self-contained instead of relying on other cb->args Signed-off-by: Denis Kenzior <denkenz@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-09-12 11:24:46 +02:00
David Ahern	8a966fc016	net: ipv6: Remove l3mdev_get_saddr6 No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:53 -07:00
David Ahern	d66f6c0a8f	net: ipv4: Remove l3mdev_get_saddr No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:53 -07:00
David Ahern	e0d56fdd73	net: l3mdev: remove redundant calls A previous patch added l3mdev flow update making these hooks redundant. Remove them. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:52 -07:00
David Ahern	4c1feac58e	net: vrf: Flip IPv6 output path from FIB lookup hook to out hook Flip the IPv6 output path to use the l3mdev tx out hook. The VRF dst is not returned on the first FIB lookup. Instead, the dst on the skb is switched at the beginning of the IPv6 output processing to send the packet to the VRF driver on xmit. Link scope addresses (linklocal and multicast) need special handling: specifically the oif the flow struct can not be changed because we want the lookup tied to the enslaved interface. ie., the source address and the returned route MUST point to the interface scope passed in. Convert the existing vrf_get_rt6_dst to handle only link scope addresses. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:52 -07:00
David Ahern	ebfc102c56	net: vrf: Flip IPv4 output path from FIB lookup hook to out hook Flip the IPv4 output path to use the l3mdev tx out hook. The VRF dst is not returned on the first FIB lookup. Instead, the dst on the skb is switched at the beginning of the IPv4 output processing to send the packet to the VRF driver on xmit. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:52 -07:00
David Ahern	5f02ce24c2	net: l3mdev: Allow the l3mdev to be a loopback Allow an L3 master device to act as the loopback for that L3 domain. For IPv4 the device can also have the address 127.0.0.1. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:52 -07:00
David Ahern	a8e3e1a9f0	net: l3mdev: Add hook to output path This patch adds the infrastructure to the output path to pass an skb to an l3mdev device if it has a hook registered. This is the Tx parallel to l3mdev_ip{6}_rcv in the receive path and is the basis for removing the existing hook that returns the vrf dst on the fib lookup. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:52 -07:00
David Ahern	9ee0034b8f	net: flow: Add l3mdev flow update Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif in flow struct if its oif or iif points to a device enslaved to an L3 Master device. Only 1 needs to be converted to match the l3mdev FIB rule. This moves the flow adjustment for l3mdev to a single point catching all lookups. It is redundant for existing hooks (those are removed in later patches) but is needed for missed lookups such as PMTU updates. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 23:12:51 -07:00
Eric Dumazet	2594a2a928	tcp: better use ooo_last_skb in tcp_data_queue_ofo() Willem noticed that we could avoid an rbtree lookup if the the attempt to coalesce incoming skb to the last skb failed for some reason. Since most ooo additions are at the tail, this is definitely worth adding a test and fast path. Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yaogong Wang <wygivan@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:43:41 -07:00
Thadeu Lima de Souza Cascardo	ed227099da	openvswitch: use alias for genetlink family names When userspace tries to create datapaths and the module is not loaded, it will simply fail. With this patch, the module will be automatically loaded. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:42:46 -07:00
Javier Martinez Canillas	65b323e2ff	xfrm: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:11 -07:00
Javier Martinez Canillas	aebf5de07a	sctp: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:11 -07:00
Javier Martinez Canillas	0013de38a8	net: sched: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:11 -07:00
Javier Martinez Canillas	9dd79945b0	l2tp: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:11 -07:00
Javier Martinez Canillas	6ca40d4e84	ipv4: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:10 -07:00
Javier Martinez Canillas	181402a5c7	net: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:10 -07:00
Javier Martinez Canillas	9a81c34ace	lec: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:10 -07:00
Javier Martinez Canillas	a73ec314a0	appletalk: use IS_ENABLED() instead of checking for built-in or module The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either built-in or as a module, use that macro instead of open coding the same. Using the macro makes the code more readable by helping abstract away some of the Kconfig built-in and module enable details. Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 21:19:10 -07:00
Amir Vadai	d0f6dd8a91	net/sched: Introduce act_tunnel_key This action could be used before redirecting packets to a shared tunnel device, or when redirecting packets arriving from a such a device. The action will release the metadata created by the tunnel device (decap), or set the metadata with the specified values for encap operation. For example, the following flower filter will forward all ICMP packets destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before redirecting, a metadata for the vxlan tunnel is created using the tunnel_key action and it's arguments: $ tc filter add dev net0 protocol ip parent ffff: \ flower \ ip_proto 1 \ dst_ip 11.11.11.2 \ action tunnel_key set \ src_ip 11.11.0.1 \ dst_ip 11.11.0.2 \ id 11 \ action mirred egress redirect dev vxlan0 Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 20:53:56 -07:00
Amir Vadai	bc3103f1ed	net/sched: cls_flower: Classify packet in ip tunnels Introduce classifying by metadata extracted by the tunnel device. Outer header fields - source/dest ip and tunnel id, are extracted from the metadata when classifying. For example, the following will add a filter on the ingress Qdisc of shared vxlan device named 'vxlan0'. To forward packets with outer src ip 11.11.0.2, dst ip 11.11.0.1 and tunnel id 11. The packets will be forwarded to tap device 'vnet0' (after metadata is released): $ tc filter add dev vxlan0 protocol ip parent ffff: \ flower \ enc_src_ip 11.11.0.2 \ enc_dst_ip 11.11.0.1 \ enc_key_id 11 \ dst_ip 11.11.11.1 \ action tunnel_key release \ action mirred egress redirect dev vnet0 The action tunnel_key, will be introduced in the next patch in this series. Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 20:53:55 -07:00
Amir Vadai	d817f432c2	net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id() Add utility functions to convert a 32 bits key into a 64 bits tunnel and vice versa. These functions will be used instead of cloning code in GRE and VXLAN, and in tc act_iptunnel which will be introduced in a following patch in this patchset. Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jiri Benc <jbenc@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-10 20:53:55 -07:00
Daniel Borkmann	f3694e0012	bpf: add BPF_CALL_x macros for declaring helpers This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros that are used today. Motivation for this is to hide all the register handling and all necessary casts from the user, so that it is done automatically in the background when adding a BPF_CALL_<n>() call. This makes current helpers easier to review, eases to write future helpers, avoids getting the casting mess wrong, and allows for extending all helpers at once (f.e. build time checks, etc). It also helps detecting more easily in code reviews that unused registers are not instrumented in the code by accident, breaking compatibility with existing programs. BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some fundamental differences, for example, for generating the actual helper function that carries all u64 regs, we need to fill unused regs, so that we always end up with 5 u64 regs as an argument. I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and they look all as expected. No sparse issue spotted. We let this also sit for a few days with Fengguang's kbuild test robot, and there were no issues seen. On s390, it barked on the "uses dynamic stack allocation" notice, which is an old one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion to the call wrapper, just telling that the perf raw record/frag sits on stack (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests and they were fine as well. All eBPF helpers are now converted to use these macros, getting rid of a good chunk of all the raw castings. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 19:36:04 -07:00
Daniel Borkmann	374fb54eea	bpf: add own ctx rewriter on ifindex for clsact progs When fetching ifindex, we don't need to test dev for being NULL since we're always guaranteed to have a valid dev for clsact programs. Thus, avoid this test in fast path. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 19:36:04 -07:00
Daniel Borkmann	f035a51536	bpf: add BPF_SIZEOF and BPF_FIELD_SIZEOF macros Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit which otherwise often result in overly long bytes_to_bpf_size(sizeof()) and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF()) check in convert_bpf_extensions(), but we should rather make that generic as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF() users to detect any rewriter size issues at compile time. Note, there are currently none, but we want to assert that it stays this way. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 19:36:04 -07:00
Daniel Borkmann	6088b5823b	bpf: minor cleanups in helpers Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding and in __bpf_skb_max_len(), I missed that we always have skb->dev valid anyway, so we can drop the unneeded test for dev; also few more other misc bits addressed here. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 19:36:03 -07:00
Eric Dumazet	bf8d85d4f9	ip_tunnel: do not clear l4 hashes If skb has a valid l4 hash, there is no point clearing hash and force a further flow dissection when a tunnel encapsulation is added. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 19:33:11 -07:00
David S. Miller	fa5f4aaf6e	RxRPC rewrite -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIVAwUAV9FCuvSw1s6N8H32AQKo5w/8CySGsorFk67/QiQGdBt+URd8cxR2NuvF i3P7Kbo30ycJO7Q4Uc4DvO3kTqWiNMbXWVgGLfA64HDFojjuuXfQdFwf98FZ2WtQ OxQUV5fzSPFwlDktd5nWm5qTCdv7+lIvBCVsEPuX2pSkc7HesiYMsZt2ilOac9Ho Meon2/S1oq3hctZv2DTiaI+Ae8YBMar7GSUfylRGa2TkXCgG8eYcjGyGigLJ2F03 e+/8w6+jtrW5hASCJPI9re+qiYgmnYa7UVpwrVjM1dVOYYZfmU02Jq6HgW9bSd24 MYk6neksMGVpQbVmAbj5/MmxUg98q8UpY9ygt2IWP4UvGNDYBGCiSbfyQoTnoWUP 02k3E6HnFfs8SPbxuNmA4uB2BHL2y87+G8u1g0IUZkT8i3zFwLd01UBwJqB23tYE EIRAad1xWwGaSJGyFgsmry1RJsitSUAG9w/68Ni1IMQxsHsIROTz6TNBki1tMcOh AAsbj4iJ0rJ2Ca/Xbk9kAdPzEr85ZA3Za5BwA9ZDwZjmt2X1RrzuK9gIaKB8hsWS zVjRjpvSOaTyx97rtEVfkT310GMGYC5r9ba+kE4ukGeHWKRVkMk5tkADZw9RFKdf ubXN/zyfv4YABHHUIfQn5UgHHmxl4GpN0CD+cY7hPtmB9J2wvsadckqrzBOFIQL+ dg7jZAb+fjc= =GfEj -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160908' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Rewrite data and ack handling This patch set constitutes the main portion of the AF_RXRPC rewrite. It consists of five fix/helper patches: (1) Fix ASSERTCMP's and ASSERTIFCMP's handling of signed values. (2) Update some protocol definitions slightly. (3) Use of an hlist for RCU purposes. (4) Removal of per-call sk_buff accounting (not really needed when skbs aren't being queued on the main queue). (5) Addition of a tracepoint to log incoming packets in the data_ready callback and to log the end of the data_ready callback. And then there are two patches that form the main part: (6) Preallocation of resources for incoming calls so that in patch (7) the data_ready handler can be made to fully instantiate an incoming call and make it live. This extends through into AFS so that AFS can preallocate its own incoming call resources. The preallocation size is capped at the listen() backlog setting - and that is capped at a sysctl limit which can be set between 4 and 32. The preallocation is (re)charged either by accepting/rejecting pending calls or, in the case of AFS, manually. If insufficient preallocation resources exist, a BUSY packet will be transmitted. The advantage of using this preallocation is that once a call is set up in the data_ready handler, DATA packets can be queued on it immediately rather than the DATA packets being queued for a background work item to do all the allocation and then try and sort out the DATA packets whilst other DATA packets may still be coming in and going either to the background thread or the new call. (7) Rewrite the handling of DATA, ACK and ABORT packets. In the receive phase, DATA packets are now held in per-call circular buffers with deduplication, out of sequence detection and suchlike being done in data_ready. Since there is only one producer and only once consumer, no locks need be used on the receive queue. Received ACK and ABORT packets are now parsed and discarded in data_ready to recycle resources as fast as possible. sk_buffs are no longer pulled, trimmed or cloned, but rather the offset and size of the content is tracked. This particularly affects jumbo DATA packets which need insertion into the receive buffer in multiple places. Annotations are kept to track which bit is which. Packets are no longer queued on the socket receive queue; rather, calls are queued. Dummy packets to convey events therefore no longer need to be invented and metadata packets can be discarded as soon as parsed rather then being pushed onto the socket receive queue to indicate terminal events. The preallocation facility added in (6) is now used to set up incoming calls with very little locking required and no calls to the allocator in data_ready. Decryption and verification is now handled in recvmsg() rather than in a background thread. This allows for the future possibility of decrypting directly into the user buffer. With this patch, the code is a lot simpler and most of the mass of call event and state wangling code in call_event.c is gone. With this, the majority of the AF_RXRPC rewrite is complete. However, there are still things to be done, including: () Limit the number of active service calls to prevent an attacker from filling up a server's memory. () Limit the number of calls on the rebuff-with-BUSY queue. () Transmit delayed/deferred ACKs from recvmsg() if possible, rather than punting to the background thread. Ideally, the background thread shouldn't run at all, but data_ready can't call kernel_sendmsg() and we can't rely on recvmsg() attending to the call in a timely fashion. () Prevent the call at the front of the socket queue from hogging recvmsg()'s attention if there's a sufficiently continuous supply of data. () Distribute ICMP errors by connection rather than by call. Possibly parse the ICMP packet to try and pin down the exact connection and call. () Encrypt/decrypt directly between user buffers and socket buffers where possible. () IPv6. () Service ID upgrade. This is a facility whereby a special flag bit is set in the DATA packet header when making a call that tells the server that it is allowed to change the service ID to an upgraded one and reply with an equivalent call from the upgraded service. This is used, for example, to override certain AFS calls so that IPv6 addresses can be returned. (*) Allow userspace to preallocate call user IDs for incoming calls. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 19:24:21 -07:00
Marcelo Ricardo Leitner	7303a14750	sctp: identify chunks that need to be fragmented at IP level Previously, without GSO, it was easy to identify it: if the chunk didn't fit and there was no data chunk in the packet yet, we could fragment at IP level. So if there was an auth chunk and we were bundling a big data chunk, it would fragment regardless of the size of the auth chunk. This also works for the context of PMTU reductions. But with GSO, we cannot distinguish such PMTU events anymore, as the packet is allowed to exceed PMTU. So we need another check: to ensure that the chunk that we are adding, actually fits the current PMTU. If it doesn't, trigger a flush and let it be fragmented at IP level in the next round. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 19:18:33 -07:00
Colin Ian King	05f1b12f71	net: x25: remove null checks on arrays calling_ae and called_ae dtefacs.calling_ae and called_ae are both 20 element __u8 arrays and cannot be null and hence are redundant checks. Remove these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 18:13:30 -07:00
stephen hemminger	b8b867e132	rtnetlink: remove unused ifla_stats_policy This structure is defined but never used. Flagged with W=1 Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 16:52:43 -07:00
Guillaume Nault	73483c1289	ipv6: report NLM_F_CREATE and NLM_F_EXCL flags in RTM_NEWROUTE events Since commit `37a1d3611c` ("ipv6: include NLM_F_REPLACE in route replace notifications"), RTM_NEWROUTE notifications have their NLM_F_REPLACE flag set if the new route replaced a preexisting one. However, other flags aren't set. This patch reports the missing NLM_F_CREATE and NLM_F_EXCL flag bits. NLM_F_APPEND is not reported, because in ipv6 a NLM_F_CREATE request is interpreted as an append request (contrary to ipv4, "prepend" is not supported, so if NLM_F_EXCL is not set then NLM_F_APPEND is implicit). As a result, the possible flag combination can now be reported (iproute2's terminology into parentheses): * NLM_F_CREATE \| NLM_F_EXCL: route didn't exist, exclusive creation ("add"). * NLM_F_CREATE: route did already exist, new route added after preexisting ones ("append"). * NLM_F_REPLACE: route did already exist, new route replaced the first preexisting one ("change"). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 16:50:23 -07:00
Guillaume Nault	b93e1fa710	ipv4: fix value of ->nlmsg_flags reported in RTM_NEWROUTE events fib_table_insert() inconsistently fills the nlmsg_flags field in its notification messages. Since commit `b8f5583135` ("[RTNETLINK]: Fix sending netlink message when replace route."), the netlink message has its nlmsg_flags set to NLM_F_REPLACE if the route replaced a preexisting one. Then commit `a2bb6d7d6f` ("ipv4: include NLM_F_APPEND flag in append route notifications") started setting nlmsg_flags to NLM_F_APPEND if the route matched a preexisting one but was appended. In other cases (exclusive creation or prepend), nlmsg_flags is 0. This patch sets ->nlmsg_flags in all situations, preserving the semantic of the NLM_F_* bits: * NLM_F_CREATE: a new fib entry has been created for this route. * NLM_F_EXCL: no other fib entry existed for this route. * NLM_F_REPLACE: this route has overwritten a preexisting fib entry. * NLM_F_APPEND: the new fib entry was added after other entries for the same route. As a result, the possible flag combination can now be reported (iproute2's terminology into parentheses): * NLM_F_CREATE \| NLM_F_EXCL: route didn't exist, exclusive creation ("add"). * NLM_F_CREATE \| NLM_F_APPEND: route did already exist, new route added after preexisting ones ("append"). * NLM_F_CREATE: route did already exist, new route added before preexisting ones ("prepend"). * NLM_F_REPLACE: route did already exist, new route replaced the first preexisting one ("change"). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-09 16:50:23 -07:00
Liping Zhang	fe01111d23	netfilter: nft_queue: check the validation of queues_total and queuenum Although the validation of queues_total and queuenum is checked in nft utility, but user can add nft rules via nfnetlink, so it is necessary to check the validation at the nft_queue expr init routine too. Tested by run ./nft-test.py any/queue.t: any/queue.t: 6 unit tests, 0 error, 0 warning Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-09 15:54:48 +02:00
thomas.zeitlhofer+lkml@ze-it.at	1fb81e09d4	vti: use right inner_mode for inbound inter address family policy checks In case of inter address family tunneling (IPv6 over vti4 or IPv4 over vti6), the inbound policy checks in vti_rcv_cb() and vti6_rcv_cb() are using the wrong address family. As a result, all inbound inter address family traffic is dropped. Use the xfrm_ip2inner_mode() helper, as done in xfrm_input() (i.e., also increment LINUX_MIB_XFRMINSTATEMODEERROR in case of error), to select the inner_mode that contains the right address family for the inbound policy checks. Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-09-09 09:02:08 +02:00
Mathias Krause	2f30ea5090	xfrm_user: propagate sec ctx allocation errors When we fail to attach the security context in xfrm_state_construct() we'll return 0 as error value which, in turn, will wrongly claim success to userland when, in fact, we won't be adding / updating the XFRM state. This is a regression introduced by commit `fd21150a0f` ("[XFRM] netlink: Inline attach_encap_tmpl(), attach_sec_ctx(), and attach_one_addr()"). Fix it by propagating the error returned by security_xfrm_state_alloc() in this case. Fixes: `fd21150a0f` ("[XFRM] netlink: Inline attach_encap_tmpl()...") Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2016-09-09 09:02:08 +02:00
Eric Dumazet	e895cdce68	ipv4: accept u8 in IP_TOS ancillary data In commit `f02db315b8` ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data") Francesco added IP_TOS values specified as integer. However, kernel sends to userspace (at recvmsg() time) an IP_TOS value in a single byte, when IP_RECVTOS is set on the socket. It can be very useful to reflect all ancillary options as given by the kernel in a subsequent sendmsg(), instead of aborting the sendmsg() with EINVAL after Francesco patch. So this patch extends IP_TOS ancillary to accept an u8, so that an UDP server can simply reuse same ancillary block without having to mangle it. Jesper can then augment https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c to add TOS reflection ;) Fixes: `f02db315b8` ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Francesco Fusco <ffusco@redhat.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 17:45:57 -07:00
Yaogong Wang	9f5afeae51	tcp: use an RB tree for ooo receive queue Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering to reach the 2 Gbytes limit. Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS. In presence of packet losses (or reorders), TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range. Most packets are appended to the tail of this queue, and when packets can finally be transferred to receive queue, we scan the queue from its head. However, in presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches. This patch converts it to a RB tree, to get bounded latencies. Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added ofo_last_skb cache, polishing and tests. Tested with network dropping between 1 and 10 % packets, with good success (about 30 % increase of throughput in stress tests) Next step would be to also use an RB tree for the write queue at sender side ;) Signed-off-by: Yaogong Wang <wygivan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 17:25:58 -07:00
Artem Germanov	db7196a0d0	tcp: cwnd does not increase in TCP YeAH Commit `76174004a0` (tcp: do not slow start when cwnd equals ssthresh ) introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth ~100KB/s. That is caused by stalled cwnd when cwnd equals ssthresh. This patch fixes it by proper increasing cwnd in this case. Signed-off-by: Artem Germanov <agermanov@anchorfree.com> Acked-by: Dmitry Adamushko <d.adamushko@anchorfree.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 17:16:12 -07:00
Eric Garver	018c1dda5f	openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes Add support for 802.1ad including the ability to push and pop double tagged vlans. Add support for 802.1ad to netlink parsing and flow conversion. Uses double nested encap attributes to represent double tagged vlan. Inner TPID encoded along with ctci in nested attributes. This is based on Thomas F Herbert's original v20 patch. I made some small clean ups and bug fixes. Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com> Signed-off-by: Eric Garver <e@erig.me> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 17:10:28 -07:00
Lorenzo Colitti	d545caca82	net: inet: diag: expose the socket mark to privileged processes. This adds the capability for a process that has CAP_NET_ADMIN on a socket to see the socket mark in socket dumps. Commit `a52e95abf7` ("net: diag: allow socket bytecode filters to match socket marks") recently gave privileged processes the ability to filter socket dumps based on mark. This patch is complementary: it ensures that the mark is also passed to userspace in the socket's netlink attributes. It is useful for tools like ss which display information about sockets. Tested: https://android-review.googlesource.com/270210 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 16:13:09 -07:00
Eric Dumazet	76061f631c	tcp: fastopen: avoid negative sk_forward_alloc When DATA and/or FIN are carried in a SYN/ACK message or SYN message, we append an skb in socket receive queue, but we forget to call sk_forced_mem_schedule(). Effect is that the socket has a negative sk->sk_forward_alloc as long as the message is not read by the application. Josh Hunt fixed a similar issue in commit `d22e153718` ("tcp: fix tcp fin memory accounting") Fixes: `168a8f5805` ("tcp: TCP Fast Open Server - main code path") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Josh Hunt <johunt@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 16:08:10 -07:00
David S. Miller	40e3012e6e	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== ipsec 2016-09-08 1) Fix a crash when xfrm_dump_sa returns an error. From Vegard Nossum. 2) Remove some incorrect WARN() on normal error handling. From Vegard Nossum. 3) Ignore socket policies when rebuilding hash tables, socket policies are not inserted into the hash tables. From Tobias Brunner. 4) Initialize and check tunnel pointers properly before we use it. From Alexey Kodanev. 5) Fix l3mdev oif setting on xfrm dst lookups. From David Ahern. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 13:12:37 -07:00
David S. Miller	575f9c43e7	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next 2016-09-08 1) Constify the xfrm_replay structures. From Julia Lawall 2) Protect xfrm state hash tables with rcu, lookups can be done now without acquiring xfrm_state_lock. From Florian Westphal. 3) Protect xfrm policy hash tables with rcu, lookups can be done now without acquiring xfrm_policy_lock. From Florian Westphal. 4) We don't need to have a garbage collector list per namespace anymore, so use a global one instead. From Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-08 13:09:41 -07:00
David Howells	248f219cb8	rxrpc: Rewrite the data and ack handling code Rewrite the data and ack handling code such that: (1) Parsing of received ACK and ABORT packets and the distribution and the filing of DATA packets happens entirely within the data_ready context called from the UDP socket. This allows us to process and discard ACK and ABORT packets much more quickly (they're no longer stashed on a queue for a background thread to process). (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim(). We instead keep track of the offset and length of the content of each packet in the sk_buff metadata. This means we don't do any allocation in the receive path. (3) Jumbo DATA packet parsing is now done in data_ready context. Rather than cloning the packet once for each subpacket and pulling/trimming it, we file the packet multiple times with an annotation for each indicating which subpacket is there. From that we can directly calculate the offset and length. (4) A call's receive queue can be accessed without taking locks (memory barriers do have to be used, though). (5) Incoming calls are set up from preallocated resources and immediately made live. They can than have packets queued upon them and ACKs generated. If insufficient resources exist, DATA packet #1 is given a BUSY reply and other DATA packets are discarded). (6) sk_buffs no longer take a ref on their parent call. To make this work, the following changes are made: (1) Each call's receive buffer is now a circular buffer of sk_buff pointers (rxtx_buffer) rather than a number of sk_buff_heads spread between the call and the socket. This permits each sk_buff to be in the buffer multiple times. The receive buffer is reused for the transmit buffer. (2) A circular buffer of annotations (rxtx_annotations) is kept parallel to the data buffer. Transmission phase annotations indicate whether a buffered packet has been ACK'd or not and whether it needs retransmission. Receive phase annotations indicate whether a slot holds a whole packet or a jumbo subpacket and, if the latter, which subpacket. They also note whether the packet has been decrypted in place. (3) DATA packet window tracking is much simplified. Each phase has just two numbers representing the window (rx_hard_ack/rx_top and tx_hard_ack/tx_top). The hard_ack number is the sequence number before base of the window, representing the last packet the other side says it has consumed. hard_ack starts from 0 and the first packet is sequence number 1. The top number is the sequence number of the highest-numbered packet residing in the buffer. Packets between hard_ack+1 and top are soft-ACK'd to indicate they've been received, but not yet consumed. Four macros, before(), before_eq(), after() and after_eq() are added to compare sequence numbers within the window. This allows for the top of the window to wrap when the hard-ack sequence number gets close to the limit. Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also to indicate when rx_top and tx_top point at the packets with the LAST_PACKET bit set, indicating the end of the phase. (4) Calls are queued on the socket 'receive queue' rather than packets. This means that we don't need have to invent dummy packets to queue to indicate abnormal/terminal states and we don't have to keep metadata packets (such as ABORTs) around (5) The offset and length of a (sub)packet's content are now passed to the verify_packet security op. This is currently expected to decrypt the packet in place and validate it. However, there's now nowhere to store the revised offset and length of the actual data within the decrypted blob (there may be a header and padding to skip) because an sk_buff may represent multiple packets, so a locate_data security op is added to retrieve these details from the sk_buff content when needed. (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is individually secured and needs to be individually decrypted. The code to do this is broken out into rxrpc_recvmsg_data() and shared with the kernel API. It now iterates over the call's receive buffer rather than walking the socket receive queue. Additional changes: (1) The timers are condensed to a single timer that is set for the soonest of three timeouts (delayed ACK generation, DATA retransmission and call lifespan). (2) Transmission of ACK and ABORT packets is effected immediately from process-context socket ops/kernel API calls that cause them instead of them being punted off to a background work item. The data_ready handler still has to defer to the background, though. (3) A shutdown op is added to the AF_RXRPC socket so that the AFS filesystem can shut down the socket and flush its own work items before closing the socket to deal with any in-progress service calls. Future additional changes that will need to be considered: (1) Make sure that a call doesn't hog the front of the queue by receiving data from the network as fast as userspace is consuming it to the exclusion of other calls. (2) Transmit delayed ACKs from within recvmsg() when we've consumed sufficiently more packets to avoid the background work item needing to run. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-08 11:10:12 +01:00
David Howells	00e907127e	rxrpc: Preallocate peers, conns and calls for incoming service requests Make it possible for the data_ready handler called from the UDP transport socket to completely instantiate an rxrpc_call structure and make it immediately live by preallocating all the memory it might need. The idea is to cut out the background thread usage as much as possible. [Note that the preallocated structs are not actually used in this patch - that will be done in a future patch.] If insufficient resources are available in the preallocation buffers, it will be possible to discard the DATA packet in the data_ready handler or schedule a BUSY packet without the need to schedule an attempt at allocation in a background thread. To this end: (1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a maximum number each of the listen backlog size. The backlog size is limited to a maxmimum of 32. Only this many of each can be in the preallocation buffer. (2) For userspace sockets, the preallocation is charged initially by listen() and will be recharged by accepting or rejecting pending new incoming calls. (3) For kernel services {,re,dis}charging of the preallocation buffers is handled manually. Two notifier callbacks have to be provided before kernel_listen() is invoked: (a) An indication that a new call has been instantiated. This can be used to trigger background recharging. (b) An indication that a call is being discarded. This is used when the socket is being released. A function, rxrpc_kernel_charge_accept() is called by the kernel service to preallocate a single call. It should be passed the user ID to be used for that call and a callback to associate the rxrpc call with the kernel service's side of the ID. (4) Discard the preallocation when the socket is closed. (5) Temporarily bump the refcount on the call allocated in rxrpc_incoming_call() so that rxrpc_release_call() can ditch the preallocation ref on service calls unconditionally. This will no longer be necessary once the preallocation is used. Note that this does not yet control the number of active service calls on a client - that will come in a later patch. A future development would be to provide a setsockopt() call that allows a userspace server to manually charge the preallocation buffer. This would allow user call IDs to be provided in advance and the awkward manual accept stage to be bypassed. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-08 11:10:12 +01:00
David Howells	49e19ec7d3	rxrpc: Add tracepoints to record received packets and end of data_ready Add two tracepoints: (1) Record the RxRPC protocol header of packets retrieved from the UDP socket by the data_ready handler. (2) Record the outcome of the data_ready handler. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-08 11:10:12 +01:00
David Howells	2ab27215ea	rxrpc: Remove skb_count from struct rxrpc_call Remove the sk_buff count from the rxrpc_call struct as it's less useful once we stop queueing sk_buffs. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-08 11:10:12 +01:00
David Howells	de8d6c7401	rxrpc: Convert rxrpc_local::services to an hlist Convert the rxrpc_local::services list to an hlist so that it can be accessed under RCU conditions more readily. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-08 11:10:11 +01:00
David Howells	cf13258fd4	rxrpc: Fix ASSERTCMP and ASSERTIFCMP to handle signed values Fix ASSERTCMP and ASSERTIFCMP to be able to handle signed values by casting both parameters to the type of the first before comparing. Without this, both values are cast to unsigned long, which means that checks for values less than zero don't work. The downside of this is that the state enum values in struct rxrpc_call and struct rxrpc_connection can't be bitfields as __typeof__ can't handle them. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-08 11:10:11 +01:00
subashab@codeaurora.org	0f76d25644	net: xfrm: Change u32 sysctl entries to use proc_douintvec proc_dointvec limits the values to INT_MAX in u32 sysctl entries. proc_douintvec allows to write upto UINT_MAX. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-07 23:17:53 -07:00
Lorenzo Colitti	f95bf34622	net: diag: make udp_diag_destroy work for mapped addresses. udp_diag_destroy does look up the IPv4 UDP hashtable for mapped addresses, but it gets the IPv4 address to look up from the beginning of the IPv6 address instead of the end. Tested: https://android-review.googlesource.com/269874 Fixes: `5d77dca828` ("net: diag: support SOCK_DESTROY for UDP sockets") Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-07 17:31:30 -07:00
Andrey Vagin	733ade23de	netlink: don't forget to release a rhashtable_iter structure This bug was detected by kmemleak: unreferenced object 0xffff8804269cc3c0 (size 64): comm "criu", pid 1042, jiffies 4294907360 (age 13.713s) hex dump (first 32 bytes): a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00 .2.,............ 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................ backtrace: [<ffffffff8184dffa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8124720f>] kmem_cache_alloc_trace+0x10f/0x280 [<ffffffffa02864cc>] __netlink_diag_dump+0x26c/0x290 [netlink_diag] v2: don't remove a reference on a rhashtable_iter structure to release it from netlink_diag_dump_done Cc: Herbert Xu <herbert@gondor.apana.org.au> Fixes: `ad20207432` ("netlink: Use rhashtable walk interface in diag dump") Signed-off-by: Andrei Vagin <avagin@openvz.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-07 17:29:38 -07:00
David Howells	5a42976d4f	rxrpc: Add tracepoint for working out where aborts happen Add a tracepoint for working out where local aborts happen. Each tracepoint call is labelled with a 3-letter code so that they can be distinguished - and the DATA sequence number is added too where available. rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can indicate the circumstances when it aborts a call. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 16:34:40 +01:00
David Howells	e8d6bbb05a	rxrpc: Fix returns of call completion helpers rxrpc_set_call_completion() returns bool, not int, so the ret variable should match this. rxrpc_call_completed() and __rxrpc_call_completed() should return the value of rxrpc_set_call_completion(). Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 16:34:30 +01:00
David Howells	8d94aa381d	rxrpc: Calls shouldn't hold socket refs rxrpc calls shouldn't hold refs on the sock struct. This was done so that the socket wouldn't go away whilst the call was in progress, such that the call could reach the socket's queues. However, we can mark the socket as requiring an RCU release and rely on the RCU read lock. To make this work, we do: (1) rxrpc_release_call() removes the call's call user ID. This is now only called from socket operations and not from the call processor: rxrpc_accept_call() / rxrpc_kernel_accept_call() rxrpc_reject_call() / rxrpc_kernel_reject_call() rxrpc_kernel_end_call() rxrpc_release_calls_on_socket() rxrpc_recvmsg() Though it is also called in the cleanup path of rxrpc_accept_incoming_call() before we assign a user ID. (2) Pass the socket pointer into rxrpc_release_call() rather than getting it from the call so that we can get rid of uninitialised calls. (3) Fix call processor queueing to pass a ref to the work queue and to release that ref at the end of the processor function (or to pass it back to the work queue if we have to requeue). (4) Skip out of the call processor function asap if the call is complete and don't requeue it if the call is complete. (5) Clean up the call immediately that the refcount reaches 0 rather than trying to defer it. Actual deallocation is deferred to RCU, however. (6) Don't hold socket refs for allocated calls. (7) Use the RCU read lock when queueing a message on a socket and treat the call's socket pointer according to RCU rules and check it for NULL. We also need to use the RCU read lock when viewing a call through procfs. (8) Transmit the final ACK/ABORT to a client call in rxrpc_release_call() if this hasn't been done yet so that we can then disconnect the call. Once the call is disconnected, it won't have any access to the connection struct and the UDP socket for the call work processor to be able to send the ACK. Terminal retransmission will be handled by the connection processor. (9) Release all calls immediately on the closing of a socket rather than trying to defer this. Incomplete calls will be aborted. The call refcount model is much simplified. Refs are held on the call by: (1) A socket's user ID tree. (2) A socket's incoming call secureq and acceptq. (3) A kernel service that has a call in progress. (4) A queued call work processor. We have to take care to put any call that we failed to queue. (5) sk_buffs on a socket's receive queue. A future patch will get rid of this. Whilst we're at it, we can do: (1) Get rid of the RXRPC_CALL_EV_RELEASE event. Release is now done entirely from the socket routines and never from the call's processor. (2) Get rid of the RXRPC_CALL_DEAD state. Calls now end in the RXRPC_CALL_COMPLETE state. (3) Get rid of the rxrpc_call::destroyer work item. Calls are now torn down when their refcount reaches 0 and then handed over to RCU for final cleanup. (4) Get rid of the rxrpc_call::deadspan timer. Calls are cleaned up immediately they're finished with and don't hang around. Post-completion retransmission is handled by the connection processor once the call is disconnected. (5) Get rid of the dead call expiry setting as there's no longer a timer to set. (6) rxrpc_destroy_all_calls() can just check that the call list is empty. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 15:33:20 +01:00
David Howells	6543ac5235	rxrpc: Use rxrpc_is_service_call() rather than rxrpc_conn_is_service() Use rxrpc_is_service_call() rather than rxrpc_conn_is_service() if the call is available just in case call->conn is NULL. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 15:30:22 +01:00
David Howells	8b7fac50ab	rxrpc: Pass the connection pointer to rxrpc_post_packet_to_call() Pass the connection pointer to rxrpc_post_packet_to_call() as the call might get disconnected whilst we're looking at it, but the connection pointer determined by rxrpc_data_read() is guaranteed by RCU for the duration of the call. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 15:30:22 +01:00
David Howells	278ac0cdd5	rxrpc: Cache the security index in the rxrpc_call struct Cache the security index in the rxrpc_call struct so that we can get at it even when the call has been disconnected and the connection pointer cleared. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 15:30:22 +01:00
David Howells	f4fdb3525b	rxrpc: Use call->peer rather than call->conn->params.peer Use call->peer rather than call->conn->params.peer to avoid the possibility of call->conn being NULL and, whilst we're at it, check it for NULL before we access it. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 15:30:22 +01:00
David Howells	fff72429c2	rxrpc: Improve the call tracking tracepoint Improve the call tracking tracepoint by showing more differentiation between some of the put and get events, including: (1) Getting and putting refs for the socket call user ID tree. (2) Getting and putting refs for queueing and failing to queue the call processor work item. Note that these aren't necessarily used in this patch, but will be taken advantage of in future patches. An enum is added for the event subtype numbers rather than coding them directly as decimal numbers and a table of 3-letter strings is provided rather than a sequence of ?: operators. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 15:30:22 +01:00
David Howells	e796cb4192	rxrpc: Delete unused rxrpc_kernel_free_skb() Delete rxrpc_kernel_free_skb() as it's unused. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 14:43:43 +01:00
David Howells	71a17de307	rxrpc: Whitespace cleanup Remove some whitespace. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-07 14:43:39 +01:00
Marco Angaroni	1bcabc81ee	netfilter: nf_ct_sip: allow tab character in SIP headers Current parsing methods for SIP headers do not allow the presence of tab characters between header name and header value. As a result Call-ID SIP headers like the following are discarded by IPVS SIP persistence engine: "Call-ID\t: mycallid@abcde" "Call-ID:\tmycallid@abcde" In above examples Call-IDs are represented as strings in C language. Obviously in real message we have byte "09" before/after colon (":"). Proposed fix is in nf_conntrack_sip module. Function sip_skip_whitespace() should skip tabs in addition to spaces, since in SIP grammar whitespace (WSP) corresponds to space or tab. Below is an extract of relevant SIP ABNF syntax. Call-ID = ( "Call-ID" / "i" ) HCOLON callid callid = word [ "@" word ] HCOLON = ( SP / HTAB ) ":" SWS SWS = [LWS] ; sep whitespace LWS = [WSP CRLF] 1WSP ; linear whitespace WSP = SP / HTAB word = 1(alphanum / "-" / "." / "!" / "%" / "*" / "_" / "+" / "`" / "'" / "~" / "(" / ")" / "<" / ">" / ":" / "\" / DQUOTE / "/" / "[" / "]" / "?" / "{" / "}" ) Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 13:53:43 +02:00
Pablo Neira Ayuso	22609b43b1	netfilter: nft_quota: introduce nft_overquota() This is patch renames the existing function to nft_overquota() and make it return a boolean that tells us if we have exceeded our byte quota. Just a cleanup. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 11:02:06 +02:00
Pablo Neira Ayuso	db6d857b81	netfilter: nft_quota: fix overquota logic Use xor to decide to break further rule evaluation or not, since the existing logic doesn't achieve the expected inversion. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 11:00:56 +02:00
Laura Garcia Liebana	0d9932b287	netfilter: nft_numgen: rename until attribute by modulus The _until_ attribute is renamed to _modulus_ as the behaviour is similar to other expresions with number limits (ex. nft_hash). Renaming is possible because there isn't a kernel release yet with these changes. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 10:55:46 +02:00
Gao Feng	ddb075b0cd	netfilter: ftp: Remove the useless code There are some debug code which are commented out in find_pattern by #if 0. Now remove them. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 10:38:00 +02:00
Gao Feng	723eb299de	netfilter: ftp: Remove the useless dlen==0 condition check in find_pattern The caller function "help" has already make sure the datalen could not be zero before invoke find_pattern as a parameter by the following codes if (dataoff >= skb->len) { pr_debug("ftp: dataoff(%u) >= skblen(%u)\n", dataoff, skb->len); return NF_ACCEPT; } datalen = skb->len - dataoff; And the latter codes "ends_in_nl = (fb_ptr[datalen - 1] == '\n');" use datalen directly without checking if it is zero. So it is unneccessary to check it in find_pattern too. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 10:37:59 +02:00
Marco Angaroni	f0608ceaa7	netfilter: nf_ct_sip: correct allowed characters in Call-ID SIP header Current parsing methods for SIP header Call-ID do not check correctly all characters allowed by RFC 3261. In particular "," character is allowed instead of "'" character. As a result Call-ID headers like the following are discarded by IPVS SIP persistence engine. Call-ID: -.!%_+`'~()<>:\"/[]?{} Above example is composed using all non-alphanumeric characters listed in RFC 3261 for Call-ID header syntax. Proposed fix is in nf_conntrack_sip module; function iswordc() checks this range: (c >= '(' && c <= '/') which includes these characters: ()+,-./ They are all allowed except ",". Instead "'" is not included in the list. Below is an extract of relevant SIP ABNF syntax. Call-ID = ( "Call-ID" / "i" ) HCOLON callid callid = word [ "@" word ] HCOLON = ( SP / HTAB ) ":" SWS SWS = [LWS] ; sep whitespace LWS = [WSP CRLF] 1WSP ; linear whitespace WSP = SP / HTAB word = 1(alphanum / "-" / "." / "!" / "%" / "*" / "_" / "+" / "`" / "'" / "~" / "(" / ")" / "<" / ">" / ":" / "\" / DQUOTE / "/" / "[" / "]" / "?" / "{" / "}" ) Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 10:37:58 +02:00
Marco Angaroni	68cb9fe47e	netfilter: nf_ct_sip: correct parsing of continuation lines in SIP headers Current parsing methods for SIP headers do not properly manage continuation lines: in case of Call-ID header the first character of Call-ID header value is truncated. As a result IPVS SIP persistence engine hashes over a call-id that is not exactly the one present in the originale message. Example: "Call-ID: \r\n abcdeABCDE1234" results in extracted call-id equal to "bcdeABCDE1234". In above example Call-ID is represented as a string in C language. Obviously in real message the first bytes after colon (":") are "20 0d 0a 20". Proposed fix is in nf_conntrack_sip module. Since sip_follow_continuation() function walks past the leading spaces or tabs of the continuation line, sip_skip_whitespace() should simply return the ouput of sip_follow_continuation(). Otherwise another iteration of the for loop is done and dptr is incremented by one pointing to the second character of the first word in the header. Below is an extract of relevant SIP ABNF syntax. Call-ID = ( "Call-ID" / "i" ) HCOLON callid callid = word [ "@" word ] HCOLON = ( SP / HTAB ) ":" SWS SWS = [LWS] ; sep whitespace LWS = [WSP CRLF] 1WSP ; linear whitespace WSP = SP / HTAB word = 1(alphanum / "-" / "." / "!" / "%" / "*" / "_" / "+" / "`" / "'" / "~" / "(" / ")" / "<" / ">" / ":" / "\" / DQUOTE / "/" / "[" / "]" / "?" / "{" / "}" ) Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 10:37:57 +02:00
Gao Feng	c579a9e7d5	netfilter: gre: Use consistent GRE and PTTP header structure instead of the ones defined by netfilter There are two existing strutures which defines the GRE and PPTP header. So use these two structures instead of the ones defined by netfilter to keep consitent with other codes. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 10:36:52 +02:00
Gao Feng	ecc6569f35	netfilter: gre: Use consistent GRE_* macros instead of ones defined by netfilter. There are already some GRE_* macros in kernel, so it is unnecessary to define these macros. And remove some useless macros Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-07 10:36:48 +02:00
Wei Yongjun	751eb6b604	ipv6: addrconf: fix dev refcont leak when DAD failed In general, when DAD detected IPv6 duplicate address, ifp->state will be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed work, the call tree should be like this: ndisc_recv_ns -> addrconf_dad_failure <- missing ifp put -> addrconf_mod_dad_work -> schedule addrconf_dad_work() -> addrconf_dad_stop() <- missing ifp hold before call it addrconf_dad_failure() called with ifp refcont holding but not put. addrconf_dad_work() call addrconf_dad_stop() without extra holding refcount. This will not cause any issue normally. But the race between addrconf_dad_failure() and addrconf_dad_work() may cause ifp refcount leak and netdevice can not be unregister, dmesg show the following messages: IPv6: eth0: IPv6 duplicate address fe80::XX:XXXX:XXXX:XX detected! ... unregister_netdevice: waiting for eth0 to become free. Usage count = 1 Cc: stable@vger.kernel.org Fixes: `c15b1ccadb` ("ipv6: move DAD and addrconf_verify processing to workqueue") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-06 14:17:49 -07:00
Mark Tomlinson	5a56a0b3a4	net: Don't delete routes in different VRFs When deleting an IP address from an interface, there is a clean-up of routes which refer to this local address. However, there was no check to see that the VRF matched. This meant that deletion wasn't confined to the VRF it should have been. To solve this, a new field has been added to fib_info to hold a table id. When removing fib entries corresponding to a local ip address, this table id is also used in the comparison. The table id is populated when the fib_info is created. This was already done in some places, but not in ip_rt_ioctl(). This has now been fixed. Fixes: `021dd3b8a1` ("net: Add routes to the table associated with the device") Acked-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-06 13:56:13 -07:00
David S. Miller	c7ee5672f3	RxRPC rewrite -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIVAwUAV8yHP/Sw1s6N8H32AQL7cQ//S3MP4x690tV49UocezdAV8PuV3HpM5kt a+43v8RqjTHaZQxOwQaG71OpAM2gH1z7QPSq6jsLy0PEmaPiP/MZOAWqUudBAyqo eVsEMEVlw0f7PogyKqr06uPoVo21IdLNX9E89CqaGqFuDfsKlj1bKQHH4/248Arr i4ztzid/Mj98fHvzqMdr631c06GvLozU/5X6xE7hzkGkqVmRtjIB6qETGqerwx/p GJlcTZVFw2EviS6/Ft/t26xgVsOg1ogzXjWLUufnnJ1GpRaucqMfHwYD0WqQV3sB Bu6WRx2JgXRPBi5m7gWymkgT0pUNRiDFuWN6qdlJbHJgKGuVojF6tnh2go2bj9Cq q/GLbi8Y810v64293i1vdz6yyM1PzDG648+6z8vbTpsLI7cHDq5csPYMHRIh34IM FQmSZblKIuALD8BXqW1lXrqVKU0WEFVI9WjcRk9OSIanqPrQyP8xOAJVp03uIGe/ uxkheJBy7XvghwKWZNZ1y0A0g6NxY0gNMhVsmW77VbNfOsKOpEcpSCvMwjgMOvDi npynisM8tf2cx973SZYXjvvl2LSMHjiVQ/tf0iTMYmBjtU7ft5i8p3w290yAD+JQ JKqpKq3TXVty5MMEXl5bStIr/HkgPk7e6v1sH9WIREu9m9gROANGe/9Mr4VvweHk jEuFHy2EZNE= =3Don -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160904-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Split output code from sendmsg code Here's a set of small patches that split the packet transmission code from the sendmsg code and simply rearrange the new file to make it more logically laid out ready for being rewritten. An enum is also moved out of the header file to there as it's only used there. This needs to be applied on top of the just-posted fixes patch set. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-06 13:53:29 -07:00
David S. Miller	0122c6d5fa	RxRPC rewrite -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIVAwUAV8yHKPSw1s6N8H32AQKbWA/+MEjzVzIOpp4edvAj5DzdO5pls3NvhUAg 2sP15E9nOL9JHrOD5snCVSRd9ROEGRE0/S9WXUjeb2VKz7C2pmTixo+MjiJRMzAR TZZlE2Ydx0h+A7WiywKoTY4g7VICL+UC+4XcheMzTLNS2mzqKb2GGOm3BwvGaFeT RKRVIlwIHziShaIs7K7ZkfKxGaDIL/9x344uPfFHaDKb33aOBnTuY6HlFf5Yu2qm pzh+R7cBYJMvhEqd71ESYPSbSGnjBN5zRzjZDSvXeI/30k9Ee6mFastv4fdG3Mrk 0WOLxml9yLTlJnfXeuN0T9B0C/ur4oD4hKEDREXVTxcTEXRq/VxSJ3cYxeM3DlAz U795lcveiYajRv7F73jcfNuaEENQg5HsZuaFs+CJgVxQpsqs9IOpEl9CNrVsvjmf 9crgamUj34ehZ5lgsV/Qbm7OFk16dmQ59ImGClQFgsIU6hEWCP0VgwKXWTwP7YZD Ucp1zWp/XTAtbrRNdkye/Z0WE5QOtoUWzftPrf95TP7LFewp0DGZ+miBV/atSHbZ bADXR0SOtax8u9bfs1HYTadwHk1LJYuVXNYn/KN06c9q1WKuKH/qg0JAjv/KUnJm Nnx0TFh4kUg999+3CtxqmupiqkD5SCckCN2ggifgDCmQhEwr1CYrGKWpZvNMq+S6 nVNTUr7PYSI= =RRYk -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20160904-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Small fixes Here's a set of small fix patches: (1) Fix some uninitialised variables. (2) Set the client call state before making it live by attaching it to the conn struct. (3) Randomise the epoch and starting client conn ID values, and don't change the epoch when the client conn ID rolls round. (4) Replace deprecated create_singlethread_workqueue() calls. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-06 13:46:26 -07:00
Chuck Lever	05c974669e	xprtrdma: Fix receive buffer accounting An RPC can terminate before its reply arrives, if a credential problem or a soft timeout occurs. After this happens, xprtrdma reports it is out of Receive buffers. A Receive buffer is posted before each RPC is sent, and returned to the buffer pool when a reply is received. If no reply is received for an RPC, that Receive buffer remains posted. But xprtrdma tries to post another when the next RPC is sent. If this happens a few dozen times, there are no receive buffers left to be posted at send time. I don't see a way for a transport connection to recover at that point, and it will spit warnings and unnecessarily delay RPCs on occasion for its remaining lifetime. Commit `1e465fd4ff` ("xprtrdma: Replace send and receive arrays") removed a little bit of logic to detect this case and not provide a Receive buffer so no more buffers are posted, and then transport operation continues correctly. We didn't understand what that logic did, and it wasn't commented, so it was removed as part of the overhaul to support backchannel requests. Restore it, but be wary of the need to keep extra Receives posted to deal with backchannel requests. Fixes: `1e465fd4ff` ("xprtrdma: Replace send and receive arrays") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-09-06 15:59:35 -04:00
Chuck Lever	78d506e1b7	xprtrdma: Revert `3d4cf35bd4` ("xprtrdma: Reply buffer exhaustion...") Receive buffer exhaustion, if it were to actually occur, would be catastrophic. However, when there are no reply buffers to post, that means all of them have already been posted and are waiting for incoming replies. By design, there can never be more RPCs in flight than there are available receive buffers. A receive buffer can be left posted after an RPC exits without a received reply; say, due to a credential problem or a soft timeout. This does not result in fewer posted receive buffers than there are pending RPCs, and there is already logic in xprtrdma to deal appropriately with this case. It also looks like the "+ 2" that was removed was accidentally accommodating the number of extra receive buffers needed for receiving backchannel requests. That will need to be addressed by another patch. Fixes: `3d4cf35bd4` ("xprtrdma: Reply buffer exhaustion can be...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-09-06 15:59:35 -04:00
Dave Jones	03c2778a93	ipv6: release dst in ping_v6_sendmsg Neither the failure or success paths of ping_v6_sendmsg release the dst it acquires. This leads to a flood of warnings from "net/core/dst.c:288 dst_release" on older kernels that don't have `8bf4ada2e2` backported. That patch optimistically hoped this had been fixed post 3.10, but it seems at least one case wasn't, where I've seen this triggered a lot from machines doing unprivileged icmp sockets. Cc: Martin Lau <kafai@fb.com> Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-06 12:54:17 -07:00
David S. Miller	60175ccdf4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Most relevant updates are the removal of per-conntrack timers to use a workqueue/garbage collection approach instead from Florian Westphal, the hash and numgen expression for nf_tables from Laura Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag, removal of ip_conntrack sysctl and many other incremental updates on our Netfilter codebase. More specifically, they are: 1) Retrieve only 4 bytes to fetch ports in case of non-linear skb transport area in dccp, sctp, tcp, udp and udplite protocol conntrackers, from Gao Feng. 2) Missing whitespace on error message in physdev match, from Hangbin Liu. 3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang. 4) Add nf_ct_expires() helper function and use it, from Florian Westphal. 5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also from Florian. 6) Rename nf_tables set implementation to nft_set_{name}.c 7) Introduce the hash expression to allow arbitrary hashing of selector concatenations, from Laura Garcia Liebana. 8) Remove ip_conntrack sysctl backward compatibility code, this code has been around for long time already, and we have two interfaces to do this already: nf_conntrack sysctl and ctnetlink. 9) Use nf_conntrack_get_ht() helper function whenever possible, instead of opencoding fetch of hashtable pointer and size, patch from Liping Zhang. 10) Add quota expression for nf_tables. 11) Add number generator expression for nf_tables, this supports incremental and random generators that can be combined with maps, very useful for load balancing purpose, again from Laura Garcia Liebana. 12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King. 13) Introduce a nft_chain_parse_hook() helper function to parse chain hook configuration, this is used by a follow up patch to perform better chain update validation. 14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the nft_set_hash implementation to honor the NLM_F_EXCL flag. 15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(), patch from Florian Westphal. 16) Don't use the DYING bit to know if the conntrack event has been already delivered, instead a state variable to track event re-delivery states, also from Florian. 17) Remove the per-conntrack timer, use the workqueue approach that was discussed during the NFWS, from Florian Westphal. 18) Use the netlink conntrack table dump path to kill stale entries, again from Florian. 19) Add a garbage collector to get rid of stale conntracks, from Florian. 20) Reschedule garbage collector if eviction rate is high. 21) Get rid of the __nf_ct_kill_acct() helper. 22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger. 23) Make nf_log_set() interface assertive on unsupported families. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-06 12:45:26 -07:00
Liping Zhang	d1a6cba576	netfilter: nft_chain_route: re-route before skb is queued to userspace Imagine such situation, user add the following nft rules, and queue the packets to userspace for further check: # ip rule add fwmark 0x0/0x1 lookup eth0 # ip rule add fwmark 0x1/0x1 lookup eth1 # nft add table filter # nft add chain filter output {type route hook output priority 0 \;} # nft add rule filter output mark set 0x1 # nft add rule filter output queue num 0 But after we reinject the skbuff, the packet will be sent via the wrong route, i.e. in this case, the packet will be routed via eth0 table, not eth1 table. Because we skip to do re-route when verdict is NF_QUEUE, even if the mark was changed. Acctually, we should not touch sk_buff if verdict is NF_DROP or NF_STOLEN, and when re-route fails, return NF_DROP with error code. This is consistent with the mangle table in iptables. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-06 18:02:37 +02:00
Liping Zhang	5210d393ef	netfilter: nf_tables_trace: fix endiness when dump chain policy NFTA_TRACE_POLICY attribute is big endian, but we forget to call htonl to convert it. Fortunately, this attribute is parsed as big endian in libnftnl. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-09-05 19:28:23 +02:00
David Howells	3dc20f090d	rxrpc Move enum rxrpc_command to sendmsg.c Move enum rxrpc_command to sendmsg.c as it's now only used in that file. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-04 21:41:39 +01:00
David Howells	df423a4af1	rxrpc: Rearrange net/rxrpc/sendmsg.c Rearrange net/rxrpc/sendmsg.c to be in a more logical order. This makes it easier to follow and eliminates forward declarations. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-04 21:41:39 +01:00
David Howells	0b58b8a18b	rxrpc: Split sendmsg from packet transmission code Split the sendmsg code from the packet transmission code (mostly to be found in output.c). Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-04 21:41:39 +01:00
David Howells	090f85deb6	rxrpc: Don't change the epoch It seems the local epoch should only be changed on boot, so remove the code that changes it for client connections. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-04 21:41:39 +01:00
David Howells	5f2d9c4438	rxrpc: Randomise epoch and starting client conn ID values Create a random epoch value rather than a time-based one on startup and set the top bit to indicate that this is the case. Also create a random starting client connection ID value. This will be incremented from here as new client connections are created. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-04 21:41:39 +01:00
Linus Torvalds	6e1ce3c345	af_unix: split 'u->readlock' into two: 'iolock' and 'bindlock' Right now we use the 'readlock' both for protecting some of the af_unix IO path and for making the bind be single-threaded. The two are independent, but using the same lock makes for a nasty deadlock due to ordering with regards to filesystem locking. The bind locking would want to nest outside the VSF pathname locking, but the IO locking wants to nest inside some of those same locks. We tried to fix this earlier with commit `c845acb324` ("af_unix: Fix splice-bind deadlock") which moved the readlock inside the vfs locks, but that caused problems with overlayfs that will then call back into filesystem routines that take the lock in the wrong order anyway. Splitting the locks means that we can go back to having the bind lock be the outermost lock, and we don't have any deadlocks with lock ordering. Acked-by: Rainer Weikusat <rweikusat@cyberadapt.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-04 13:29:29 -07:00
Linus Torvalds	38f7bd94a9	Revert "af_unix: Fix splice-bind deadlock" This reverts commit `c845acb324`. It turns out that it just replaces one deadlock with another one: we can still get the wrong lock ordering with the readlock due to overlayfs calling back into the filesystem layer and still taking the vfs locks after the readlock. The proper solution ends up being to just split the readlock into two pieces: the bind lock (taken outside the vfs locks) and the IO lock (taken inside the filesystem locks). The two locks are independent anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-04 13:29:29 -07:00
Mahesh Bandewar	24b27fc4cd	bonding: Fix bonding crash Following few steps will crash kernel - (a) Create bonding master > modprobe bonding miimon=50 (b) Create macvlan bridge on eth2 > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \ type macvlan (c) Now try adding eth2 into the bond > echo +eth2 > /sys/class/net/bond0/bonding/slaves <crash> Bonding does lots of things before checking if the device enslaved is busy or not. In this case when the notifier call-chain sends notifications, the bond_netdev_event() assumes that the rx_handler /rx_handler_data is registered while the bond_enslave() hasn't progressed far enough to register rx_handler for the new slave. This patch adds a rx_handler check that can be performed right at the beginning of the enslave code to avoid getting into this situation. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-04 11:41:12 -07:00
WANG Cong	bc51dddf98	netns: avoid disabling irq for netns id We never read or change netns id in hardirq context, the only place we read netns id in softirq context is in vxlan_xmit(). So, it should be enough to just disable BH. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-04 11:39:59 -07:00
WANG Cong	38f507f1ba	vxlan: call peernet2id() in fdb notification netns id should be already allocated each time we change netns, that is, in dev_change_net_namespace() (more precisely in rtnl_fill_ifinfo()). It is safe to just call peernet2id() here. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-04 11:39:58 -07:00
Joe Stringer	76644232e6	openvswitch: Free tmpl with tmpl_free. When an error occurs during conntrack template creation as part of actions validation, we need to free the template. Previously we've been using nf_ct_put() to do this, but nf_ct_tmpl_free() is more appropriate. Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-04 11:38:10 -07:00
David Howells	af338a9ea6	rxrpc: The client call state must be changed before attachment to conn We must set the client call state to RXRPC_CALL_CLIENT_SEND_REQUEST before attaching the call to the connection struct, not after, as it's liable to receive errors and conn aborts as soon as the assignment is made - and these will cause its state to be changed outside of the initiating thread's control. Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-04 13:10:10 +01:00
Paolo Abeni	a41bd25ae6	sunrpc: fix UDP memory accounting The commit `f9b2ee714c` ("SUNRPC: Move UDP receive data path into a workqueue context"), as a side effect, moved the skb_free_datagram() call outside the scope of the related socket lock, but UDP sockets require such lock to be held for proper memory accounting. Fix it by replacing skb_free_datagram() with skb_free_datagram_locked(). Fixes: `f9b2ee714c` ("SUNRPC: Move UDP receive data path into a workqueue context") Reported-and-tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2016-09-03 10:00:49 -04:00
Jon Paul Maloy	e0a05ebe26	tipc: send broadcast nack directly upon sequence gap detection Because of the risk of an excessive number of NACK messages and retransissions, receivers have until now abstained from sending broadcast NACKS directly upon detection of a packet sequence number gap. We have instead relied on such gaps being detected by link protocol STATE message exchange, something that by necessity delays such detection and subsequent retransmissions. With the introduction of unicast NACK transmission and rate control of retransmissions we can now remove this limitation. We now allow receiving nodes to send NACKS immediately, while coordinating the permission to do so among the nodes in order to avoid NACK storms. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-02 17:10:25 -07:00
Jon Paul Maloy	7c4a54b963	tipc: rate limit broadcast retransmissions As cluster sizes grow, so does the amount of identical or overlapping broadcast NACKs generated by the packet receivers. This often leads to 'NACK crunches' resulting in huge numbers of redundant retransmissions of the same packet ranges. In this commit, we introduce rate control of broadcast retransmissions, so that a retransmitted range cannot be retransmitted again until after at least 10 ms. This reduces the frequency of duplicate, redundant retransmissions by an order of magnitude, while having a significant positive impact on overall throughput and scalability. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-02 17:10:24 -07:00
Jon Paul Maloy	02d11ca200	tipc: transfer broadcast nacks in link state messages When we send broadcasts in clusters of more 70-80 nodes, we sometimes see the broadcast link resetting because of an excessive number of retransmissions. This is caused by a combination of two factors: 1) A 'NACK crunch", where loss of broadcast packets is discovered and NACK'ed by several nodes simultaneously, leading to multiple redundant broadcast retransmissions. 2) The fact that the NACKS as such also are sent as broadcast, leading to excessive load and packet loss on the transmitting switch/bridge. This commit deals with the latter problem, by moving sending of broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused bits in word 8 of the said message for this purpose, and introduce a new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change backwards compatible. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-02 17:10:24 -07:00
David Howells	00b5407e42	rxrpc: Fix uninitialised variable warning Fix the following uninitialised variable warning: ../net/rxrpc/call_event.c: In function 'rxrpc_process_call': ../net/rxrpc/call_event.c:879:58: warning: 'error' may be used uninitialized in this function [-Wmaybe-uninitialized] _debug("post net error %d", error); ^ Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-02 22:39:44 +01:00
Arnd Bergmann	30787a4170	rxrpc: fix undefined behavior in rxrpc_mark_call_released gcc -Wmaybe-initialized correctly points out a newly introduced bug through which we can end up calling rxrpc_queue_call() for a dead connection: net/rxrpc/call_object.c: In function 'rxrpc_mark_call_released': net/rxrpc/call_object.c:600:5: error: 'sched' may be used uninitialized in this function [-Werror=maybe-uninitialized] This sets the 'sched' variable to zero to restore the previous behavior. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `f5c17aaeb2` ("rxrpc: Calls should only have one terminal state") Signed-off-by: David Howells <dhowells@redhat.com>	2016-09-02 22:39:44 +01:00
Sabrina Dubroca	2f86953e74	l2tp: fix use-after-free during module unload Tunnel deletion is delayed by both a workqueue (l2tp_tunnel_delete -> wq -> l2tp_tunnel_del_work) and RCU (sk_destruct -> RCU -> l2tp_tunnel_destruct). By the time l2tp_tunnel_destruct() runs to destroy the tunnel and finish destroying the socket, the private data reserved via the net_generic mechanism has already been freed, but l2tp_tunnel_destruct() actually uses this data. Make sure tunnel deletion for the netns has completed before returning from l2tp_exit_net() by first flushing the tunnel removal workqueue, and then waiting for RCU callbacks to complete. Fixes: `167eb17e0b` ("l2tp: create tunnel sockets in the right namespace") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-02 11:44:44 -07:00
Eli Cooper	ab34380162	ipv6: Don't unset flowi6_proto in ipxip6_tnl_xmit() Commit `8eb30be035` ("ipv6: Create ip6_tnl_xmit") unsets flowi6_proto in ip4ip6_tnl_xmit() and ip6ip6_tnl_xmit(). Since xfrm_selector_match() relies on this info, IPv6 packets sent by an ip6tunnel cannot be properly selected by their protocols after removing it. This patch puts flowi6_proto back. Cc: stable@vger.kernel.org Fixes: `8eb30be035` ("ipv6: Create ip6_tnl_xmit") Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 23:41:24 -07:00
Nikolay Aleksandrov	b6cb5ac833	net: bridge: add per-port multicast flood flag Add a per-port flag to control the unknown multicast flood, similar to the unknown unicast flood flag and break a few long lines in the netlink flag exports. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 22:48:33 -07:00
Nikolay Aleksandrov	8addd5e7d3	net: bridge: change unicast boolean to exact pkt_type Remove the unicast flag and introduce an exact pkt_type. That would help us for the upcoming per-port multicast flood flag and also slightly reduce the tests in the input fast path. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 22:48:33 -07:00
Gao Feng	635c223cfa	rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly The original codes depend on that the function parameters are evaluated from left to right. But the parameter's evaluation order is not defined in C standard actually. When flow_keys_have_l4(&keys) is invoked before ___skb_get_hash(skb, &keys, hashrnd) with some compilers or environment, the keys passed to flow_keys_have_l4 is not initialized. Fixes: `6db61d79c1` ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash") Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 22:45:03 -07:00
Roopa Prabhu	d297653dd6	rtnetlink: fdb dump: optimize by saving last interface markers fdb dumps spanning multiple skb's currently restart from the first interface again for every skb. This results in unnecessary iterations on the already visited interfaces and their fdb entries. In large scale setups, we have seen this to slow down fdb dumps considerably. On a system with 30k macs we see fdb dumps spanning across more than 300 skbs. To fix the problem, this patch replaces the existing single fdb marker with three markers: netdev hash entries, netdevs and fdb index to continue where we left off instead of restarting from the first netdev. This is consistent with link dumps. In the process of fixing the performance issue, this patch also re-implements fix done by commit `472681d57a` ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump") (with an internal fix from Wilson Kok) in the following ways: - change ndo_fdb_dump handlers to return error code instead of the last fdb index - use cb->args strictly for dump frag markers and not error codes. This is consistent with other dump functions. Below results were taken on a system with 1000 netdevs and 35085 fdb entries: before patch: $time bridge fdb show \| wc -l 15065 real 1m11.791s user 0m0.070s sys 1m8.395s (existing code does not return all macs) after patch: $time bridge fdb show \| wc -l 35085 real 0m2.017s user 0m0.113s sys 0m1.942s Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 16:56:15 -07:00
David Howells	d001648ec7	rxrpc: Don't expose skbs to in-kernel users [ver #2 ] Don't expose skbs to in-kernel users, such as the AFS filesystem, but instead provide a notification hook the indicates that a call needs attention and another that indicates that there's a new call to be collected. This makes the following possibilities more achievable: (1) Call refcounting can be made simpler if skbs don't hold refs to calls. (2) skbs referring to non-data events will be able to be freed much sooner rather than being queued for AFS to pick up as rxrpc_kernel_recv_data will be able to consult the call state. (3) We can shortcut the receive phase when a call is remotely aborted because we don't have to go through all the packets to get to the one cancelling the operation. (4) It makes it easier to do encryption/decryption directly between AFS's buffers and sk_buffs. (5) Encryption/decryption can more easily be done in the AFS's thread contexts - usually that of the userspace process that issued a syscall - rather than in one of rxrpc's background threads on a workqueue. (6) AFS will be able to wait synchronously on a call inside AF_RXRPC. To make this work, the following interface function has been added: int rxrpc_kernel_recv_data( struct socket sock, struct rxrpc_call call, void buffer, size_t bufsize, size_t _offset, bool want_more, u32 *_abort_code); This is the recvmsg equivalent. It allows the caller to find out about the state of a specific call and to transfer received data into a buffer piecemeal. afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction logic between them. They don't wait synchronously yet because the socket lock needs to be dealt with. Five interface functions have been removed: rxrpc_kernel_is_data_last() rxrpc_kernel_get_abort_code() rxrpc_kernel_get_error_number() rxrpc_kernel_free_skb() rxrpc_kernel_data_consumed() As a temporary hack, sk_buffs going to an in-kernel call are queued on the rxrpc_call struct (->knlrecv_queue) rather than being handed over to the in-kernel user. To process the queue internally, a temporary function, temp_deliver_data() has been added. This will be replaced with common code between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a future patch. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 16:43:27 -07:00
Neal Cardwell	28b346cbc0	tcp: fastopen: fix rcv_wup initialization for TFO server on SYN/data Yuchung noticed that on the first TFO server data packet sent after the (TFO) handshake, the server echoed the TCP timestamp value in the SYN/data instead of the timestamp value in the final ACK of the handshake. This problem did not happen on regular opens. The tcp_replace_ts_recent() logic that decides whether to remember an incoming TS value needs tp->rcv_wup to hold the latest receive sequence number that we have ACKed (latest tp->rcv_nxt we have ACKed). This commit fixes this issue by ensuring that a TFO server properly updates tp->rcv_wup to match tp->rcv_nxt at the time it sends a SYN/ACK for the SYN/data. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Fixes: `168a8f5805` ("tcp: TCP Fast Open Server - main code path") Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 16:40:15 -07:00
Nikolay Aleksandrov	85a3d4a935	net: bridge: don't increment tx_dropped in br_do_proxy_arp pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the skb isn't changed/dropped and processing continues so we shouldn't increment tx_dropped. CC: Kyeyoon Park <kyeyoonp@codeaurora.org> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Stephen Hemminger <stephen@networkplumber.org> CC: bridge@lists.linux-foundation.org Fixes: `958501163d` ("bridge: Add support for IEEE 802.11 Proxy ARP") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 16:35:30 -07:00
Nicolas Dichtel	29c994e361	netconf: add a notif when settings are created All changes are notified, but the initial state was missing. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 15:18:08 -07:00
Nicolas Dichtel	d26c638c16	ipv6: add missing netconf notif when 'all' is updated The 'default' value was not advertised. Fixes: `f3a1bfb11c` ("rtnl/ipv6: use netconf msg to advertise forwarding status") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 15:18:08 -07:00
stephen hemminger	f5bb341e1d	l2tp: make nla_policy const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:01 -07:00
stephen hemminger	4f70c96ffd	tcp: make nla_policy const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:01 -07:00
stephen hemminger	6501f34ff7	ila: make nla_policy const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:01 -07:00
stephen hemminger	3f18ff2b42	fou: make nla_policy const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:00 -07:00
stephen hemminger	3ee5256da0	netns: make nla_policy const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:00 -07:00
stephen hemminger	deeb91f59d	batman: make netlink attributes const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:00 -07:00
stephen hemminger	85bae4bd8a	drop_monitor: make genl_multicast_group const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:00 -07:00
stephen hemminger	12d8de6d95	net: make genetlink ctrl ops const Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 14:09:00 -07:00
stephen hemminger	ce927bf174	mpls: get rid of trivial returns return at end of function is useless. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 10:13:15 -07:00
Parthasarathy Bhuvaragan	d2f394dc48	tipc: fix random link resets while adding a second bearer In a dual bearer configuration, if the second tipc link becomes active while the first link still has pending nametable "bulk" updates, it randomly leads to reset of the second link. When a link is established, the function named_distribute(), fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL) with NAME_DISTRIBUTOR message for each PUBLICATION. However, the function named_distribute() allocates the buffer by increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR). This consumes the space allocated for TUNNEL_PROTOCOL. When establishing the second link, the link shall tunnel all the messages in the first link queue including the "bulk" update. As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds the link mtu the transmission fails (-EMSGSIZE). Thus, the synch point based on the message count of the tunnel packets is never reached leading to link timeout. In this commit, we adjust the size of name distributor message so that they can be tunnelled. Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-09-01 10:12:26 -07:00
WANG Cong	c0338aff22	kcm: fix a socket double free Dmitry reported a double free on kcm socket, which could be easily reproduced by: #include <unistd.h> #include <sys/syscall.h> int main() { int fd = syscall(SYS_socket, 0x29ul, 0x5ul, 0x0ul, 0, 0, 0); syscall(SYS_ioctl, fd, 0x89e2ul, 0x20a98000ul, 0, 0, 0); return 0; } This is because on the error path, after we install the new socket file, we call sock_release() to clean up the socket, which leaves the fd pointing to a freed socket. Fix this by calling sys_close() on that fd directly. Fixes: `ab7ac4eb98` ("kcm: Kernel Connection Multiplexor module") Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-31 21:00:19 -07:00
Vivien Didelot	8df3025520	net: dsa: add MDB support Add SWITCHDEV_OBJ_ID_PORT_MDB support to the DSA layer. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-31 14:15:42 -07:00
Davide Caratti	9264251ee2	bridge: re-introduce 'fix parsing of MLDv2 reports' commit `bc8c20acae` ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave") seems to have accidentally reverted commit `47cc84ce0c` ("bridge: fix parsing of MLDv2 reports"). This commit brings back a change to br_ip6_multicast_mld2_report() where parsing of MLDv2 reports stops when the first group is successfully added to the MDB cache. Fixes: `bc8c20acae` ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-31 09:29:58 -07:00
David Ahern	48d2ab609b	net: mpls: Fixups for GSO As reported by Lennert the MPLS GSO code is failing to properly segment large packets. There are a couple of problems: 1. the inner protocol is not set so the gso segment functions for inner protocol layers are not getting run, and 2 MPLS labels for packets that use the "native" (non-OVS) MPLS code are not properly accounted for in mpls_gso_segment. The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment to call the gso segment functions for the higher layer protocols. That means skb_mac_gso_segment is called twice -- once with the network protocol set to MPLS and again with the network protocol set to the inner protocol. This patch sets the inner skb protocol addressing item 1 above and sets the network_header and inner_network_header to mark where the MPLS labels start and end. The MPLS code in OVS is also updated to set the two network markers. >From there the MPLS GSO code uses the difference between the network header and the inner network header to know the size of the MPLS header that was pushed. It then pulls the MPLS header, resets the mac_len and protocol for the inner protocol and then calls skb_mac_gso_segment to segment the skb. Afterward the inner protocol segmentation is done the skb protocol is set to mpls for each segment and the network and mac headers restored. Reported-by: Lennert Buytenhek <buytenh@wantstofly.org> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 22:27:18 -07:00
Roopa Prabhu	14972cbd34	net: lwtunnel: Handle fragmentation Today mpls iptunnel lwtunnel_output redirect expects the tunnel output function to handle fragmentation. This is ok but can be avoided if we did not do the mpls output redirect too early. ie we could wait until ip fragmentation is done and then call mpls output for each ip fragment. To make this work we will need, 1) the lwtunnel state to carry encap headroom 2) and do the redirect to the encap output handler on the ip fragment (essentially do the output redirect after fragmentation) This patch adds tunnel headroom in lwtstate to make sure we account for tunnel data in mtu calculations during fragmentation and adds new xmit redirect handler to redirect to lwtunnel xmit func after ip fragmentation. This includes IPV6 and some mtu fixes and testing from David Ahern. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 22:27:18 -07:00
Eric Dumazet	41852497a9	net: batch calls to flush_all_backlogs() After commit `145dd5f9c8` ("net: flush the softnet backlog in process context"), we can easily batch calls to flush_all_backlogs() for all devices processed in rollback_registered_many() Tested: Before patch, on an idle host. modprobe dummy numdummies=10000 perf stat -e context-switches -a rmmod dummy Performance counter stats for 'system wide': 1,211,798 context-switches 1.302137465 seconds time elapsed After patch: perf stat -e context-switches -a rmmod dummy Performance counter stats for 'system wide': 225,523 context-switches 0.721623566 seconds time elapsed Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 22:17:20 -07:00
David S. Miller	2df5d103a6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Allow nf_tables reject expression from input, forward and output hooks, since only there the routing information is available, otherwise we crash. 2) Fix unsafe list iteration when flushing timeout and accouting objects. 3) Fix refcount leak on timeout policy parsing failure. 4) Unlink timeout object for unconfirmed conntracks too 5) Missing validation of pkttype mangling from bridge family. 6) Fix refcount leak on ebtables on second lookup for the specific bridge match extension, this patch from Sabrina Dubroca. 7) Remove unnecessary ip_hdr() in nf_tables_netdev family. Patches from 1-5 and 7 from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 22:02:09 -07:00
David S. Miller	15543692a0	Three little fixes: * revert a recent wext patch, which Ben Hutchings noticed was wrong, and it turns out not to be necessary for any driver * fix an infinite loop that can occur under certain conditions in mac80211's TDLS code (depending on regulatory information) * add a cfg80211_get_station() static inline when cfg80211 isn't built, to allow other modules to not have to depend on it for it -----BEGIN PGP SIGNATURE----- iQIcBAABCgAGBQJXxSR4AAoJEGt7eEactAAdl3EP/AiUwqrYqbLnnFy6C7obFS3p eBBMxQAZbT+q+fFlZvqRrt5tPdkYriPLhm/0sAzuapnyS+Q6seNJ/vPoo91uC1jU ZI/j97v9NwUtRLfNCq+0Jwvs7ma0U1VEcPV9wDdV5JgnKk0Z1CUIcsErYr1+v0YQ EpRwxczhzJNTULW36UP7RvVQpxwIGldPhxSZ0t1uHWaYTFliaTlnJUAk0ql44Lmm WLvoMSjFgX99P11ToCe81MPEzF2IXILvxPwtNZmn5tldEN2xknKEoEmmbN65fYDf OIJIJ3s1CijQvnkgXtU0RWWCMnyOoJjsLckgSDdy0euhbS5xRIfxBN2n+kqaI9WV a/aIvWNNhvAy2vNdWUJk0FrVBnDjlTtG1afIEAgJyP7uxTQqepQfyaRENLtH+kKe lWbOITUZztyagGIn8Bv1pDrrqwO+fSjiEsVEVAQMMmNKBpUWf8urhDQmabCLYGDB Nxh2e3wjv5ZQ+55uJIGRDCcPIrddh86FVtQBqTID+86r4a1RwPaWfhzFZYVj84fg 504UzwtYlw1ITUhGbdMwribLVkwtBMVuEvpPrh6avwzS8wAH4upkhp4GGl/tfd/Z De0LqpxCKbDiI+VmmDo8FD4nx4wu4nTYIaecLjNoUSXhbjbPyI6V8/hIXqKiQUA3 ObkKlGicZJmhMa4zna0q =dFzv -----END PGP SIGNATURE----- Merge tag 'mac80211-for-davem-2016-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Three little fixes: * revert a recent wext patch, which Ben Hutchings noticed was wrong, and it turns out not to be necessary for any driver * fix an infinite loop that can occur under certain conditions in mac80211's TDLS code (depending on regulatory information) * add a cfg80211_get_station() static inline when cfg80211 isn't built, to allow other modules to not have to depend on it for it ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 21:34:48 -07:00
Linus Torvalds	0cf21c6609	NFS client bugfixes for 4.8 Highlights include: Stable patches: - Fix a refcount leak in nfs_callback_up_net - Fix an Oopsable condition when the flexfile pNFS driver connection to the DS fails - Fix an Oopsable condition in NFSv4.1 server callback races - Ensure pNFS clients stop doing I/O to the DS if their lease has expired, as required by the NFSv4.1 protocol Bugfixes: - Fix potential looping in the NFSv4.x migration code - Patch series to close callback races for OPEN, LAYOUTGET and LAYOUTRETURN - Silence WARN_ON when NFSv4.1 over RDMA is in use - Fix a LAYOUTCOMMIT race in the pNFS/blocks client - Fix pNFS timeout issues when the DS fails -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJXxbnyAAoJEGcL54qWCgDykWoP/jqgBBR/cSaOtx+5m39wlf0P pTdQkgcpWnhBS90tKZtC6zfJ2DFVt8sUNVn9+mVzT4Q7TgEcAmENQ//s0igxHLbl bkXPvULydvD05Db8m1xmq2snj72tWbpg3CaA7nfx6yiP63k237QxhyNZVkmEQDur ynU8dPzmxRaSTQdVgatdS0zqx8sF47OFnXVxkV0ssBKORGsWj3yKDcs293NZNFAM Ztkih5oW1mm+BtWUQVNrjRnfZFG+PxAxWv090JM6wABDRbDHwSaKmwmI0kWRKXoH DHrj4i/Wzws65Fg5AyVPSRkF8YvHSVsLnw/FlwKKZFsrWjU6WtLdLSzgzwQ47x98 tQk/YGgNyiiD1cAcw+l0d3Ct1SO4AptNuisdJK0cn3iCdsbh6Y0eW6yRRtQY6jQI 8qOyMTT8fp9ooEQK+nMNOhJVVlsG0hbvWAt/uiiBdPhjAfVB0UFRuua/vNKUO7yv hJkDY9i7EkMXKACf5BCpBuvYdU7rwqp43K9x34029A5vFTKOhJZS4hnAIocDd/WF Hw7yqHdpkvI5RgFbBV5tmfZPyS65k8AzzTtT1QHKlH0qEtN2iMaXsXM9EzK5bKfW 85Cc6yzRk7NzDZKmZFs/T8zCYdzet48sCY7wVyOQjL0aIkIDNNcZhex+C1GuD1dp Ld0H5f9eZdwv/OAqJ8tm =U+XK -----END PGP SIGNATURE----- Merge tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - Fix a refcount leak in nfs_callback_up_net - Fix an Oopsable condition when the flexfile pNFS driver connection to the DS fails - Fix an Oopsable condition in NFSv4.1 server callback races - Ensure pNFS clients stop doing I/O to the DS if their lease has expired, as required by the NFSv4.1 protocol Bugfixes: - Fix potential looping in the NFSv4.x migration code - Patch series to close callback races for OPEN, LAYOUTGET and LAYOUTRETURN - Silence WARN_ON when NFSv4.1 over RDMA is in use - Fix a LAYOUTCOMMIT race in the pNFS/blocks client - Fix pNFS timeout issues when the DS fails" * tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.x: Fix a refcount leak in nfs_callback_up_net NFS4: Avoid migration loops pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN NFSv4.1: Defer bumping the slot sequence number until we free the slot NFSv4.1: Delay callback processing when there are referring triples NFSv4.1: Fix Oopsable condition in server callback races SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use pnfs/blocklayout: update last_write_offset atomically with extents pNFS: The client must not do I/O to the DS if it's lease has expired pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT calls pNFS/flexfiles: Set reasonable default retrans values for the data channel NFS: Allow the mount option retrans=0 pNFS/flexfiles: Fix layoutstat periodic reporting	2016-08-30 11:14:02 -07:00
David Howells	4de48af663	rxrpc: Pass struct socket * to more rxrpc kernel interface functions Pass struct socket * to more rxrpc kernel interface functions. They should be starting from this rather than the socket pointer in the rxrpc_call struct if they need to access the socket. I have left: rxrpc_kernel_is_data_last() rxrpc_kernel_get_abort_code() rxrpc_kernel_get_error_number() rxrpc_kernel_free_skb() rxrpc_kernel_data_consumed() unmodified as they're all about to be removed (and, in any case, don't touch the socket). Signed-off-by: David Howells <dhowells@redhat.com>	2016-08-30 16:07:53 +01:00
David Howells	ea82aaec98	rxrpc: Use call->peer rather than going to the connection Use call->peer rather than call->conn->params.peer as call->conn may become NULL. Signed-off-by: David Howells <dhowells@redhat.com>	2016-08-30 16:07:53 +01:00
David Howells	8324f0bcfb	rxrpc: Provide a way for AFS to ask for the peer address of a call Provide a function so that kernel users, such as AFS, can ask for the peer address of a call: void rxrpc_kernel_get_peer(struct rxrpc_call call, struct sockaddr_rxrpc _srx); In the future the kernel service won't get sk_buffs to look inside. Further, this allows us to hide any canonicalisation inside AF_RXRPC for when IPv6 support is added. Also propagate this through to afs_find_server() and issue a warning if we can't handle the address family yet. Signed-off-by: David Howells <dhowells@redhat.com>	2016-08-30 16:07:53 +01:00
David Howells	e34d4234b0	rxrpc: Trace rxrpc_call usage Add a trace event for debuging rxrpc_call struct usage. Signed-off-by: David Howells <dhowells@redhat.com>	2016-08-30 16:02:36 +01:00
David Howells	f5c17aaeb2	rxrpc: Calls should only have one terminal state Condense the terminal states of a call state machine to a single state, plus a separate completion type value. The value is then set, along with error and abort code values, only when the call is transitioned to the completion state. Helpers are provided to simplify this. Signed-off-by: David Howells <dhowells@redhat.com>	2016-08-30 15:58:31 +01:00
David Howells	ccbd3dbe85	rxrpc: Fix a potential NULL-pointer deref in rxrpc_abort_calls The call pointer in a channel on a connection will be NULL if there's no active call on that channel. rxrpc_abort_calls() needs to check for this before trying to take the call's state_lock. Signed-off-by: David Howells <dhowells@redhat.com>	2016-08-30 15:56:12 +01:00
Gao Feng	779994fa36	netfilter: log: Check param to avoid overflow in nf_log_set The nf_log_set is an interface function, so it should do the strict sanity check of parameters. Convert the return value of nf_log_set as int instead of void. When the pf is invalid, return -EOPNOTSUPP. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:52:32 +02:00
Gao Feng	3cb27991aa	netfilter: log_arp: Use ARPHRD_ETHER instead of literal '1' There is one macro ARPHRD_ETHER which defines the ethernet proto for ARP, so we could use it instead of the literal number '1'. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:51:08 +02:00
Florian Westphal	ad66713f5a	netfilter: remove __nf_ct_kill_acct helper After timer removal this just calls nf_ct_delete so remove the __ prefix version and make nf_ct_kill a shorthand for nf_ct_delete. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:43:10 +02:00
Florian Westphal	c023c0e4a0	netfilter: conntrack: resched gc again if eviction rate is high If we evicted a large fraction of the scanned conntrack entries re-schedule the next gc cycle for immediate execution. This triggers during tests where load is high, then drops to zero and many connections will be in TW/CLOSE state with < 30 second timeouts. Without this change it will take several minutes until conntrack count comes back to normal. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:43:09 +02:00
Florian Westphal	b87a2f9199	netfilter: conntrack: add gc worker to remove timed-out entries Conntrack gc worker to evict stale entries. GC happens once every 5 seconds, but we only scan at most 1/64th of the table (and not more than 8k) buckets to avoid hogging cpu. This means that a complete scan of the table will take several minutes of wall-clock time. Considering that the gc run will never have to evict any entries during normal operation because those will happen from packet path this should be fine. We only need gc to make sure userspace (conntrack event listeners) eventually learn of the timeout, and for resource reclaim in case the system becomes idle. We do not disable BH and cond_resched for every bucket so this should not introduce noticeable latencies either. A followup patch will add a small change to speed up GC for the extreme case where most entries are timed out on an otherwise idle system. v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on nulls value change in gc worker, suggested by Eric Dumazet. v3: don't call cancel_delayed_work_sync twice (again, Eric). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:43:09 +02:00
Florian Westphal	2344d64ec7	netfilter: evict stale entries on netlink dumps When dumping we already have to look at the entire table, so we might as well toss those entries whose timeout value is in the past. We also look at every entry during resize operations. However, eviction there is not as simple because we hold the global resize lock so we can't evict without adding a 'expired' list to drop from later. Considering that resizes are very rare it doesn't seem worth doing it. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:43:09 +02:00
Florian Westphal	f330a7fdbe	netfilter: conntrack: get rid of conntrack timer With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as Eric Dumazet pointed out during netfilter workshop 2016. Eric also says: "Another reason was the fact that Thomas was about to change max timer range [..]" (`500462a9de`, 'timers: Switch to a non-cascading wheel'). Remove the timer and use a 32bit jiffies value containing timestamp until entry is valid. During conntrack lookup, even before doing tuple comparision, check the timeout value and evict the entry in case it is too old. The dying bit is used as a synchronization point to avoid races where multiple cpus try to evict the same entry. Because lookup is always lockless, we need to bump the refcnt once when we evict, else we could try to evict already-dead entry that is being recycled. This is the standard/expected way when conntrack entries are destroyed. Followup patches will introduce garbage colliction via work queue and further places where we can reap obsoleted entries (e.g. during netlink dumps), this is needed to avoid expired conntracks from hanging around for too long when lookup rate is low after a busy period. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:43:09 +02:00
Florian Westphal	616b14b469	netfilter: don't rely on DYING bit to detect when destroy event was sent The reliable event delivery mode currently (ab)uses the DYING bit to detect which entries on the dying list have to be skipped when re-delivering events from the eache worker in reliable event mode. Currently when we delete the conntrack from main table we only set this bit if we could also deliver the netlink destroy event to userspace. If we fail we move it to the dying list, the ecache worker will reattempt event delivery for all confirmed conntracks on the dying list that do not have the DYING bit set. Once timer is gone, we can no longer use if (del_timer()) to detect when we 'stole' the reference count owned by the timer/hash entry, so we need some other way to avoid racing with other cpu. Pablo suggested to add a marker in the ecache extension that skips entries that have been unhashed from main table but are still waiting for the last reference count to be dropped (e.g. because one skb waiting on nfqueue verdict still holds a reference). We do this by adding a tristate. If we fail to deliver the destroy event, make a note of this in the eache extension. The worker can then skip all entries that are in a different state. Either they never delivered a destroy event, e.g. because the netlink backend was not loaded, or redelivery took place already. Once the conntrack timer is removed we will now be able to replace del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid racing with other cpu that tries to evict the same conntrack. Because DYING will then be set right before we report the destroy event we can no longer skip event reporting when dying bit is set. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:43:08 +02:00
Florian Westphal	95a8d19f28	netfilter: restart search if moved to other chain In case nf_conntrack_tuple_taken did not find a conflicting entry check that all entries in this hash slot were tested and restart in case an entry was moved to another chain. Reported-by: Eric Dumazet <edumazet@google.com> Fixes: `ea781f197d` ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()") Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:43:08 +02:00
Liping Zhang	c73c248490	netfilter: nf_tables_netdev: remove redundant ip_hdr assignment We have already use skb_header_pointer to get the ip header pointer, so there's no need to use ip_hdr again. Moreover, in NETDEV INGRESS hook, ip header maybe not linear, so use ip_hdr is not appropriate, remove it. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2016-08-30 11:41:04 +02:00
Arik Nemtsov	554d072e7b	mac80211: TDLS: don't require beaconing for AP BW Stop downgrading TDLS chandef when reaching the AP BW. The AP provides the necessary regulatory protection in this case. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=153961, which reported an infinite loop here. Reported-by: Kamil Toman <kamil.toman@gmail.com> Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2016-08-30 08:03:41 +02:00
David S. Miller	6abdd5f593	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-30 00:54:02 -04:00
Arnd Bergmann	0b498a5277	net_sched: fix use of uninitialized ethertype variable in cls_flower The addition of VLAN support caused a possible use of uninitialized data if we encounter a zero TCA_FLOWER_KEY_ETH_TYPE key, as pointed out by "gcc -Wmaybe-uninitialized": net/sched/cls_flower.c: In function 'fl_change': net/sched/cls_flower.c:366:22: error: 'ethertype' may be used uninitialized in this function [-Werror=maybe-uninitialized] This changes the code to only set the ethertype field if it was nonzero, as before the patch. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `9399ae9a6c` ("net_sched: flower: Add vlan support") Cc: Hadar Hen Zion <hadarh@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-29 00:30:23 -04:00
Eric Dumazet	c9c3321257	tcp: add tcp_add_backlog() When TCP operates in lossy environments (between 1 and 10 % packet losses), many SACK blocks can be exchanged, and I noticed we could drop them on busy senders, if these SACK blocks have to be queued into the socket backlog. While the main cause is the poor performance of RACK/SACK processing, we can try to avoid these drops of valuable information that can lead to spurious timeouts and retransmits. Cause of the drops is the skb->truesize overestimation caused by : - drivers allocating ~2048 (or more) bytes as a fragment to hold an Ethernet frame. - various pskb_may_pull() calls bringing the headers into skb->head might have pulled all the frame content, but skb->truesize could not be lowered, as the stack has no idea of each fragment truesize. The backlog drops are also more visible on bidirectional flows, since their sk_rmem_alloc can be quite big. Let's add some room for the backlog, as only the socket owner can selectively take action to lower memory needs, like collapsing receive queues or partial ofo pruning. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-29 00:20:24 -04:00
Tom Herbert	96a5908347	kcm: Remove TCP specific references from kcm and strparser kcm and strparser need to work with any type of stream socket not just TCP. Eliminate references to TCP and call generic proto_ops functions of read_sock and peek_len. Also in strp_init check if the socket support the proto_ops read_sock and peek_len. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-28 23:32:41 -04:00
Tom Herbert	3203558589	tcp: Set read_sock and peek_len proto_ops In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to tcp_peek_len (which is just a stub function that calls tcp_inq). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-28 23:32:41 -04:00
Richard Alpe	832629ca5c	tipc: add UDP remoteip dump to netlink API When using replicast a UDP bearer can have an arbitrary amount of remote ip addresses associated with it. This means we cannot simply add all remote ip addresses to an existing bearer data message as it might fill the message, leaving us with a truncated message that we can't safely resume. To handle this we introduce the new netlink command TIPC_NL_UDP_GET_REMOTEIP. This command is intended to be called when the bearer data message has the TIPC_NLA_UDP_MULTI_REMOTEIP flag set, indicating there are more than one remote ip (replicast). Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-26 21:38:41 -07:00
Richard Alpe	fdb3accc2c	tipc: add the ability to get UDP options via netlink Add UDP bearer options to netlink bearer get message. This is used by the tipc user space tool to display UDP options. The UDP bearer information is passed using either a sockaddr_in or sockaddr_in6 structs. This means the user space receiver should intermediately store the retrieved data in a large enough struct (sockaddr_strage) before casting to the proper IP version type. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-26 21:38:41 -07:00
Richard Alpe	c9b64d492b	tipc: add replicast peer discovery Automatically learn UDP remote IP addresses of communicating peers by looking at the source IP address of incoming TIPC link configuration messages (neighbor discovery). This makes configuration slightly easier and removes the problematic scenario where a node receives directly addressed neighbor discovery messages sent using replicast which the node cannot "reply" to using mutlicast, leaving the link FSM in a limbo state. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-26 21:38:41 -07:00
Richard Alpe	ef20cd4dd1	tipc: introduce UDP replicast This patch introduces UDP replicast. A concept where we emulate multicast by sending multiple unicast messages to configured peers. The purpose of replicast is mainly to be able to use TIPC in cloud environments where IP multicast is disabled. Using replicas to unicast multicast messages is costly as we have to copy each skb and send the copies individually. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-26 21:38:41 -07:00
Richard Alpe	1ca73e3fa1	tipc: refactor multicast ip check Add a function to check if a tipc UDP media address is a multicast address or not. This is a purely cosmetic change. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-26 21:38:40 -07:00
Richard Alpe	ce984da36e	tipc: split UDP send function Split the UDP send function into two. One callback that prepares the skb and one transmit function that sends the skb. This will come in handy in later patches, when we introduce UDP replicast. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-26 21:38:40 -07:00

... 5 6 7 8 9 ...

43844 Commits