Commit Graph

43844 Commits

Author SHA1 Message Date
Marcel Holtmann f4cdbb3f25 Bluetooth: Handle HCI raw socket transition from unbound to bound
In case an unbound HCI raw socket is later on bound, ensure that the
monitor notification messages indicate a close and re-open. None of
the userspace tools use the socket this, but it is actually possible
to use an ioctl on an unbound socket and then later bind it.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann f81f5b2db8 Bluetooth: Send control open and close messages for HCI raw sockets
When opening and closing HCI raw sockets their main usage is for legacy
userspace. To track interaction with the modern mgmt interface, send
open and close monitoring messages for these action.

The HCI raw sockets is special since it supports unbound ioctl operation
and for that special case delay the notification message until at least
one ioctl has been executed. The difference between a bound and unbound
socket will be detailed by the fact the HCI index is present or not.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann d0bef1d26f Bluetooth: Add extra channel checks for control open/close messages
The control open and close monitoring events require special channel
checks to ensure messages are only send when the right events happen.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 5a6d2cf5f1 Bluetooth: Assign the channel early when binding HCI sockets
Assignment of the hci_pi(sk)->channel should be done early when binding
the HCI socket. This avoids confusion with the RAW channel that is used
for legacy access.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 0ef2c42f8c Bluetooth: Send control open and close only when cookie is present
Only when the cookie has been assigned, then send the open and close
monitor messages. Also if the socket is bound to a device, then include
the index into the message.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 9e8305b39b Bluetooth: Use numbers for subsystem version string
Instead of keeping a version string around, use version and revision
numbers and then stringify them for use as module parameter.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann df1cb87af9 Bluetooth: Introduce helper functions for socket cookie handling
Instead of manually allocating cookie information each time, use helper
functions for generating and releasing cookies.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 9db5c62951 Bluetooth: Use command status event for Set IO Capability errors
In case of failure, the Set IO Capability command is suppose to return
command status and not command complete.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 56f787c502 Bluetooth: Fix wrong Get Clock Information return parameters
The address information of the Get Clock Information return parameters
is copying from a different memory location. It uses &cmd->param while
it actually needs to be cmd->param.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 5504c3a310 Bluetooth: Use individual flags for certain management events
Instead of hiding everything behind a general managment events flag,
introduce indivdual flags that allow fine control over which events are
send to a given management channel.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Johan Hedberg 37d3a1fab5 Bluetooth: mgmt: Fix sending redundant event for Advertising Instance
When an Advertising Instance is removed, the Advertising Removed event
shouldn't be sent to the same socket that issued the Remove
Advertising command (it gets a command complete event instead). The
mgmt_advertising_removed() function already has a parameter for
skipping a specific socket, but there was no code to propagate the
right value to this parameter. This patch fixes the issue by making
sure the intermediate hci_req_clear_adv_instance() function gets the
socket pointer.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 38ceaa00d0 Bluetooth: Add support for sending MGMT commands and events to monitor
This adds support for tracing all management commands and events via the
monitor interface.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 249fa1699f Bluetooth: Add support for sending MGMT open and close to monitor
This sends new notifications to the monitor support whenever a
management channel has been opened or closed. This allows tracing of
control channels really easily.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 03c979c471 Bluetooth: Introduce helper to pack mgmt version information
The mgmt version information will be also needed for the control
changell tracing feature. This provides a helper to pack them.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 70ecce91e3 Bluetooth: Store control socket cookie and comm information
To further allow unique identification and tracking of control socket,
store cookie and comm information when binding the socket.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Marcel Holtmann 47b0f573f2 Bluetooth: Check SOL_HCI for raw socket options
The SOL_HCI level should be enforced when using socket options on the
HCI raw socket interface.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
Aristeu Rozanski bd89bb6daa mac802154: use rate limited warnings for malformed frames
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Aristeu Rozanski ca1de81aa2 mac802154: don't warn on unsupported frames
Just because we don't support certain types of frames yet doesn't mean
we have to flood the message log with warnings about "invalid" frames.

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Alexander Aring 5ddedce3b7 6lowpan: ndisc: no overreact if no short address is available
This patch removes handling to remove short address for a neigbour entry
if RS/RA/NS/NA doesn't contain a short address. If these messages
doesn't has any short address option, the existing short address from
ndisc cache will be used. The current behaviour will set that the
neigbour doesn't has a short address anymore.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Alexander Aring abbcc341ad mac802154: set phy net namespace for new ifaces
This patch sets the net namespace when creating SoftMAC interfaces. This
is important if the namespace at phy layer was switched before.
Currently we losing interfaces in some namespace and it's not possible
to recover that.

Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-09-19 20:19:34 +02:00
Marcel Holtmann e64c97b53b Bluetooth: Add combined LED trigger for controller power
Instead of just having a LED trigger for power on a specific controller,
this adds the LED trigger "bluetooth-power" that combines the power
states of all controllers into a single trigger. This simplifies the
trigger selection and also supports multiple controllers per host
system via a single LED.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2016-09-19 20:19:34 +02:00
David Vrabel d48f9ce73c sunrpc: fix write space race causing stalls
Write space becoming available may race with putting the task to sleep
in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
race does not work.

This (edited) partial trace illustrates the problem:

   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
   [2] xs_write_space <-xs_tcp_write_space
   [3] xprt_write_space <-xs_write_space
   [4] rpc_task_sleep: task:43546@5 ...
   [5] xs_write_space <-xs_tcp_write_space

[1] Task 43546 runs but is out of write space.

[2] Space becomes available, xs_write_space() clears the
    SOCKWQ_ASYNC_NOSPACE bit.

[3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
    this has not yet been queued and the wake up is lost.

[4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
    which queues task 43546.

[5] The call to sk->sk_write_space() at the end of xs_nospace() (which
    is supposed to handle the above race) does not call
    xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
    thus the task is not woken.

Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
so the second call to sk->sk_write_space() calls xprt_write_space().

Suggested-by: Trond Myklebust <trondmy@primarydata.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
cc: stable@vger.kernel.org # 4.4
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:21:36 -04:00
Chuck Lever 496b77a5c5 xprtrdma: Eliminate rpcrdma_receive_worker()
Clean up: the extra layer of indirection doesn't add value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 1519e9697d xprtrdma: Rename rpcrdma_receive_wc()
Clean up: When converting xprtrdma to use the new CQ API, I missed a
spot. The naming convention elsewhere is:

  {svc_rdma,rpcrdma}_wc_{operation}

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever eeb30613e1 xprtrmda: Report address of frmr, not mw
Tie frwr debugging messages together by always reporting the address
of the frwr.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 44829d02d2 xprtrdma: Support larger inline thresholds
The Version One default inline threshold is still 1KB. But allow
testing with thresholds up to 64KB.

This maximum is somewhat arbitrary. There's no fundamental
architectural limit I'm aware of, but it's good to keep the size of
Receive buffers reasonable. Now that Send can use a s/g list, a
Send buffer is only as large as each RPC requires. Receive buffers
are always the size of the inline threshold, however.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 655fec6987 xprtrdma: Use gathered Send for large inline messages
An RPC Call message that is sent inline but that has a data payload
(ie, one or more items in rq_snd_buf's page list) must be "pulled
up:"

- call_allocate has to reserve enough RPC Call buffer space to
accommodate the data payload

- call_transmit has to memcopy the rq_snd_buf's page list and tail
into its head iovec before it is sent

As the inline threshold is increased beyond its current 1KB default,
however, this means data payloads of more than a few KB are copied
by the host CPU. For example, if the inline threshold is increased
just to 4KB, then NFS WRITE requests up to 4KB would involve a
memcpy of the NFS WRITE's payload data into the RPC Call buffer.
This is an undesirable amount of participation by the host CPU.

The inline threshold may be much larger than 4KB in the future,
after negotiation with a peer server.

Instead of copying the components of rq_snd_buf into its head iovec,
construct a gather list of these components, and send them all in
place. The same approach is already used in the Linux server's
RPC-over-RDMA reply path.

This mechanism also eliminates the need for rpcrdma_tail_pullup,
which is used to manage the XDR pad and trailing inline content when
a Read list is present.

This requires that the pages in rq_snd_buf's page list be DMA-mapped
during marshaling, and unmapped when a data-bearing RPC is
completed. This is slightly less efficient for very small I/O
payloads, but significantly more efficient as data payload size and
inline threshold increase past a kilobyte.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever c8b920bb49 xprtrdma: Basic support for Remote Invalidation
Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
as part of a Receive completion. Local invalidation can be skipped
for that rkey.

Use an out-of-band signaling mechanism to indicate to the server
that the client is prepared to receive RDMA Send With Invalidate.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 87cfb9a0c8 xprtrdma: Client-side support for rpcrdma_connect_private
Send an RDMA-CM private message on connect, and look for one during
a connection-established event.

Both sides can communicate their various implementation limits.
Implementations that don't support this sideband protocol ignore it.

Once the client knows the server's inline threshold maxima, it can
adjust the use of Reply chunks, and eliminate most use of Position
Zero Read chunks. Moderately-sized I/O can be done using a pure
inline RDMA Send instead of RDMA operations that require memory
registration.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 6ea8e71150 xprtrdma: Move recv_wr to struct rpcrdma_rep
Clean up: The fields in the recv_wr do not vary. There is no need to
initialize them before each ib_post_recv(). This removes a large-ish
data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 90aab60296 xprtrdma: Move send_wr to struct rpcrdma_req
Clean up: Most of the fields in each send_wr do not vary. There is
no need to initialize them before each ib_post_send(). This removes
a large-ish data structure from the stack.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever b157380af1 xprtrdma: Simplify rpcrdma_ep_post_recv()
Clean up.

Since commit fc66448549 ("xprtrdma: Split the completion queue"),
rpcrdma_ep_post_recv() no longer uses the "ep" argument.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 13650c23f1 xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf
Clean up. The "ia" argument is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:38 -04:00
Chuck Lever 54cbd6b0c6 xprtrdma: Delay DMA mapping Send and Receive buffers
Currently, each regbuf is allocated and DMA mapped at the same time.
This is done during transport creation.

When a device driver is unloaded, every DMA-mapped buffer in use by
a transport has to be unmapped, and then remapped to the new
device if the driver is loaded again. Remapping will have to be done
_after_ the connect worker has set up the new device.

But there's an ordering problem:

call_allocate, which invokes xprt_rdma_allocate which calls
rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
the connect worker can run to set up the new device.

Instead, at transport creation, allocate each buffer, but leave it
unmapped. Once the RPC carries these buffers into ->send_request, by
which time a transport connection should have been established,
check to see that the RPC's buffers have been DMA mapped. If not,
map them there.

When device driver unplug support is added, it will simply unmap all
the transport's regbufs, but it doesn't have to deallocate the
underlying memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever 99ef4db329 xprtrdma: Replace DMA_BIDIRECTIONAL
The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
Fortunately, xprtrdma now knows which direction I/O is going as
soon as it allocates each regbuf.

The RPC Call and Reply buffers are no longer the same regbuf. They
can each be labeled correctly now. The RPC Reply buffer is never
part of either a Send or Receive WR, but it can be part of Reply
chunk, which is mapped and registered via ->ro_map . So it is not
DMA mapped when it is allocated (DMA_NONE), to avoid a double-
mapping.

Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
contents are never modified by the host CPU, DMA-API-HOWTO.txt
suggests that a DMA sync before posting each buffer should be
unnecessary. (See my_card_interrupt_handler).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever 08cf2efd54 xprtrdma: Use smaller buffers for RPC-over-RDMA headers
Commit 949317464b ("xprtrdma: Limit number of RDMA segments in
RPC-over-RDMA headers") capped the number of chunks that may appear
in RPC-over-RDMA headers. The maximum header size can be estimated
and fixed to avoid allocating buffer space that is never used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever 9c40c49f14 xprtrdma: Initialize separate RPC call and reply buffers
RPC-over-RDMA needs to separate its RPC call and reply buffers.

 o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
   Send operation using DMA_TO_DEVICE

 o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
   as part of a Reply chunk using DMA_FROM_DEVICE

The two mappings are for data movement in opposite directions.

DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.

On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.

Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.

Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.

Some incidental changes worth noting:

- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
  the iov.length field, so eliminate rg_size

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever 5a6d1db455 SUNRPC: Add a transport-specific private field in rpc_rqst
Currently there's a hidden and indirect mechanism for finding the
rpcrdma_req that goes with an rpc_rqst. It depends on getting from
the rq_buffer pointer in struct rpc_rqst to the struct
rpcrdma_regbuf that controls that buffer, and then to the struct
rpcrdma_req it goes with.

This was done back in the day to avoid the need to add a per-rqst
pointer or to alter the buf_free API when support for RPC-over-RDMA
was introduced.

I'm about to change the way regbuf's work to support larger inline
thresholds. Now is a good time to replace this indirect mechanism
with something that is more straightforward. I guess this should be
considered a clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever 68778945e4 SUNRPC: Separate buffer pointers for RPC Call and Reply messages
For xprtrdma, the RPC Call and Reply buffers are involved in real
I/O operations.

To start with, the DMA direction of the I/O for a Call is opposite
that of a Reply.

In the current arrangement, the Reply buffer address is on a
four-byte alignment just past the call buffer. Would be friendlier
on some platforms if that was at a DMA cache alignment instead.

Because the current arrangement allocates a single memory region
which contains both buffers, the RPC Reply buffer often contains a
page boundary in it when the Call buffer is large enough (which is
frequent).

It would be a little nicer for setting up DMA operations (and
possible registration of the Reply buffer) if the two buffers were
separated, well-aligned, and contained as few page boundaries as
possible.

Now, I could just pad out the single memory region used for the pair
of buffers. But frequently that would mean a lot of unused space to
ensure the Reply buffer did not have a page boundary.

Add a separate pointer to rpc_rqst that points right to the RPC
Reply buffer. This makes no difference to xprtsock, but it will help
xprtrdma in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever 3435c74aed SUNRPC: Generalize the RPC buffer release API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Instead of passing just the rq_buffer into the buf_free method, pass
the task structure and let buf_free take care of freeing both
XDR buffers at once.

There's a micro-optimization here. In the common case, both
xprt_release and the transport's buf_free method were checking if
rq_buffer was NULL. Now the check is done only once per RPC.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever 5fe6eaa1f9 SUNRPC: Generalize the RPC buffer allocation API
xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Transports that want to allocate separate Call and Reply buffers
will ignore the "size" argument anyway.  Don't bother passing it.

The buf_alloc method can't return two pointers. Instead, make the
method's return value an error code, and set the rq_buffer pointer
in the method itself.

This gives call_allocate an opportunity to terminate an RPC instead
of looping forever when a permanent problem occurs. If a request is
just bogus, or the transport is in a state where it can't allocate
resources for any request, there needs to be a way to kill the RPC
right there and not loop.

This immediately fixes a rare problem in the backchannel send path,
which loops if the server happens to send a CB request whose
call+reply size is larger than a page (which it shouldn't do yet).

One more issue: looks like xprt_inject_disconnect was incorrectly
placed in the failure path in call_allocate. It needs to be in the
success path, as it is for other call-sites.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever b9c5bc03be SUNRPC: Refactor rpc_xdr_buf_init()
Clean up: there is some XDR initialization logic that is common
to the forward channel and backchannel. Move it to an XDR header
so it can be shared.

rpc_rqst::rq_buffer points to a buffer containing big-endian data.
Update its annotation as part of the clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Chuck Lever eb342e9a38 xprtrdma: Eliminate INLINE_THRESHOLD macros
Clean up: r_xprt is already available everywhere these macros are
invoked, so just dereference that directly.

RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:37 -04:00
Andy Adamson fda0ab4117 SUNRPC: rpc_clnt_add_xprt setup function for NFS layer
Use a setup function to call into the NFS layer to test an rpc_xprt
for session trunking so as to not leak the rpc_xprt_switch into
the nfs layer.

Search for the address in the rpc_xprt_switch first so as not to
put an unnecessary EXCHANGE_ID on the wire.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson 39e5d2df95 SUNRPC search xprt switch for sockaddr
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson dd69171769 SUNRPC rpc_clnt_xprt_switch_add_xprt
Give the NFS layer access to the rpc_xprt_switch_add_xprt function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson 3b58a8a904 SUNRPC rpc_clnt_xprt_switch_put
Give the NFS layer access to the xprt_switch_put function

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Andy Adamson 7705f6abbb SUNRPC remove rpc_task_release_client from rpc_task_set_client
rpc_task_set_client is only called from rpc_run_task after
rpc_new_task and rpc_task_release_client is not needed as the
task is new.

When called from rpc_new_task, rpc_task_set_client also removed the
assigned rpc_xprt which is not desired.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Trond Myklebust d002526886 SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create()
Clean up.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:36 -04:00
Amitoj Kaur Chawla 2813b626e3 sunrpc: Remove unnecessary variable
The variable `err` is not used anywhere and just returns the
predefined value `0` at the end of the function. Hence, remove the
variable and return 0 explicitly.

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 13:08:35 -04:00
Ilan Tayari b588479358 xfrm: Fix memory leak of aead algorithm name
commit 1a6509d991 ("[IPSEC]: Add support for combined mode algorithms")
introduced aead. The function attach_aead kmemdup()s the algorithm
name during xfrm_state_construct().
However this memory is never freed.
Implementation has since been slightly modified in
commit ee5c23176f ("xfrm: Clone states properly on migration")
without resolving this leak.
This patch adds a kfree() call for the aead algorithm name.

Fixes: 1a6509d991 ("[IPSEC]: Add support for combined mode algorithms")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-19 12:08:58 +02:00
David S. Miller e867e87ae8 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV90ZzvSw1s6N8H32AQIBsg//StI2J/tyGGTvvVXoeWPAGeDZp8907Qbc
 O6671V3feft0HvenQ/uFfC9PdpRTYHTlQhWidk9/wTx5nvg/5b7HGHSD8ghPnCRj
 Jyu4XbpXM6vw650qSBpRAg5TLIAkSsrdjaANzFSYw70sX6eTLdKJF3YFyoDRD2Cc
 T2Y0XIUAcACl5/A/KIHRaKjTtmzhoc5Vih9lOWQRZ/dw2u0A6GdwN0qoFfHqXwvI
 t1S4MtSkXrO8D9uWqmFhTpRzSAzO57L7bccVJbr+OzdNAm/Bkicjrxyzo6YepDAg
 VRdYaNtKCR9vPRFb9vZFlZy3vyUHb/22mrgM6/V5GG5qd/NFSnouFZ/WGxLnv1Yt
 5X+F0qJXXgxSmaGt0f6cV7d9OO24HU/YWGl7wUCHsMZ4bWDySBUWtk0X4USpCsiD
 +AUxYJTEUNKZcCsW8XmC+cfnkRLNPycX603/pCH7aiPdQ5qY+ue89ICxu3PrJrf/
 S8PuQPIoEWmEdTQ9QweG9FpIuQc2LO9fmzOJdiZXHYbGAKFOfNNz5Z64tSVRvPVE
 WbxW0HYSNCJmBt15OI3hyONmbgACgmuD+e3EFWzCATz94FwMogLBFNqPcw1YpbwX
 48RdcP62gx7FMZ4MdqfWllviZKjTlTVmacCRghwGFBOuZL3ifu1gfhsTbrAxqXCs
 CSlnXgZV7gI=
 =/2cJ
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160917-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Tracepoint addition and improvement

Here is a set of patches that add some more tracepoints and improve a couple
of existing ones.  New additions include:

 (1) Connection refcount tracking.

 (2) Client connection state machine tracking.

 (3) Tx and Rx packet lifecycle.

 (4) ACK reception and transmission.

 (5) recvmsg processing.

Updates include:

 (1) Print the symbolic packet name in the Rx packet tracepoint.

 (2) Additional call refcount trace events.

 (3) Improvements to sk_buff tracking with AF_RXRPC.

In addition:

 (1) Config option to inject packet loss during both transmission and
     reception.

 (2) Removal of some printks.

This series needs to be applied on top of the previously posted fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:52:21 -04:00
David S. Miller 5b0c6fc8ef RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV90ZnPSw1s6N8H32AQKguA//Q7lvTa3n3DFvMQcIPsyJZ6VniUqksTA1
 wmQrw4GHXRUgM8UWz7G9Y5aqxUp2q6y6Vm9BeHkQ2bYZjSrOx5Dc/AImWhBn5au+
 h+HZYcEs4mFM5AoVT7GK8o/nNODDjNt2qwcH/Nf8+SM7Xf52zYHelrSteLZZ1YWO
 S1m4pa5YUBa/ICD3+K52lWq9SCG0VMmy41UHDXU6uSakfN62rn9ZYCcNeathlJGS
 2D0cG3GzYYHiBZ1CkmdPQgGQhTM3wzI+0OvbNnidFlF78zVDlxX8C9zgWs/VlTCg
 02ok4+ftzDojgXH9W+DziEiyCNq14GTDbrdSai5WA+vHVGagr6OoSwDWgPHkXBvW
 pYQh7jqRoBOKN2fsVkU0t19hPc2CCVGYMh49A6AFv8lgS+nWoDXOlmZ0snh8deQg
 Z0HO5mx+V+4yJplBlwH6ncvbRB9ywpsvIuLriZXC/aJg6aY4a8nrU35d1+6xUaM7
 RBMud0uj+7oU+sC9N7CuM8m8HpBOg6+qAsbsfATSwadMRcMdS4LSoXcBg0WvljIH
 JmtL924yEnMDw1yPkmuDBcQ9K6DuxeOYZg4A2756tBtGulxuVjntmI1MVAQlsbqH
 CnNPWpxIDoLRQsHVcYWS5O1F16drGobzFhmj7Hf/6HmGa28x7nQhDafzfFj/3Dos
 MAdM2pdO2x8=
 =MjVT
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160917-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes & miscellany

Here are some more AF_RXRPC fix patches with a couple of miscellaneous
changes also.  Fixes include:

 (1) Make RxRPC IPv6 support conditional on IPv6 being available.

 (2) Move the condition check in rxrpc_locate_data() into the caller and
     check the error return.

 (3) Fix the detection of the last received packet in recvmsg.

 (4) Account calls that need acceptance and clean up any unaccepted ones if
     the socket gets closed.

 (5) Fix the cleanup of client connections.

 (6) Fix the soft-ACK parsing and the retransmission of packets based on
     those ACKs.

 (7) Suppress transmission of an ACK when there's no pending ACK to
     transmit because another thread stole it.

And some miscellany:

 (8) Whitespace removal.

 (9) Switch-value consistency in rxrpc_send_call_packet().

(10) Fix the basic transmission packet size to allow for spur-of-the-moment
     jumbo DATA packet production.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:51:21 -04:00
Florian Westphal 48da34b7a7 sched: add and use qdisc_skb_head helpers
This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head.

Its similar to the skb_buff_head api, but does not use skb->prev pointers.

Qdiscs will commonly enqueue at the tail of a list and dequeue at head.
While skb_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of next element.

The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal ed760cb8aa sched: replace __skb_dequeue with __qdisc_dequeue_head
After previous patch these functions are identical.
Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

Next patch will then make __qdisc_dequeue_head handle
single-linked list instead of strcut sk_buff_head argument.

Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal ec32336879 sched: remove qdisc arg from __qdisc_dequeue_head
Moves qdisc stat accouting to qdisc_dequeue_head.

The only direct caller of the __qdisc_dequeue_head version open-codes
this now.

This allows us to later use __qdisc_dequeue_head as a replacement
of __skb_dequeue() (which operates on sk_buff_head list).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal 97d0678f91 sched: don't use skb queue helpers
A followup change will replace the sk_buff_head in the qdisc
struct with a slightly different list.

Use of the sk_buff_head helpers will thus cause compiler
warnings.

Open-code these accesses in an extra change to ease review.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Florian Westphal 1486587b2f pie: use qdisc_dequeue_head wrapper
Doesn't change generated code.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
Christophe Jaillet e8bc8f9a67 sctp: Remove some redundant code
In commit 311b21774f ("sctp: simplify sk_receive_queue locking"), a call
to 'skb_queue_splice_tail_init()' has been made explicit. Previously it was
hidden in 'sctp_skb_list_tail()'

Now, the code around it looks redundant. The '_init()' part of
'skb_queue_splice_tail_init()' should already do the same.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:34:01 -04:00
Mahesh Bandewar e8bffe0cf9 net: Add _nf_(un)register_hooks symbols
Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow
caller to hold RTNL mutex.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:25:22 -04:00
Mahesh Bandewar d409b84768 ipv6: Export p6_route_input_lookup symbol
Make ip6_route_input_lookup available outside of ipv6 the module
similar to ip_route_input_noref in the IPv4 world.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:25:22 -04:00
Nogah Frankel 69ae6ad2ff net: core: Add offload stats to if_stats_msg
Add a nested attribute of offload stats to if_stats_msg
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats only per packets that went via
slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:33:42 -04:00
David S. Miller c13ed534b8 This time we have various things - all across the board:
* MU-MIMO sniffer support in mac80211
  * a create_singlethread_workqueue() cleanup
  * interface dump filtering that was documented but not implemented
  * support for the new radiotap timestamp field
  * send delBA in two unexpected conditions (as required by the spec)
  * connect keys cleanups - allow only WEP with index 0-3
  * per-station aggregation limit to work around broken APs
  * debugfs improvement for the integrated codel algorithm
 and various other small improvements and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJX2+umAAoJEGt7eEactAAdIMkP/jMmpbxkzD64L7nTkO4APGva
 r6RmMM1SmgVD/CtVkjlBLuvo5YOTWv/vWvy6KoUESOINAx/e6T3T7bmmCOXzbsOL
 e5/YYcS1AOqgn5SdhgIj1E5cpdYIhlUGRlNJ0qEjeLLrh4/TLUNbCcuPhOYybUMz
 fUrdPKgDeWb7x9EHLENhPsVtCXWwKnkDIS4qclPZCWgRj46XM4pNB4OlvCUzGY6k
 bOqGJfrtjYjgKFDmPFqfYA4JDA56980qqO41+eEKXeMvDKNs+pSiNco130Q+uU3E
 o7tk9DMnAnCy2GihpV1ZYVkLr6O+7o9xVuenj3NRlhyd1mn2gXxLcO4AkHcrZBkf
 Ei+2L+KgnWELyqiSOaGTJKlugsgS4DDoNnFEIVjSweQ9DIoBA/Gj/6+4uZeHXJ3M
 bEjtHnCLi5CuI067uBoevwXFoMi1poWra2KnZKOZzFS5OL3xHv4//x/Wmnn2/5Jz
 ffEwVyRmTY76sLWfnwXUDClrFWAYQrpNyTryc+k3cpYKzhnseiqt+z43cBuISm00
 uh5B9PpPB8RhtUnXrL/SHRyf8YEluaidTsI2lc1LvwXOc0+Zbp73mTCgP+rzLs9p
 K2qVRiozpIXanW6hKmmaDwjKlcAKKLP0xN2v90MqwQt4YdLIKlXnll1AH2BawzuP
 OWB3n8D0I6y0PWH+Yo8o
 =s1MY
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This time we have various things - all across the board:
 * MU-MIMO sniffer support in mac80211
 * a create_singlethread_workqueue() cleanup
 * interface dump filtering that was documented but not implemented
 * support for the new radiotap timestamp field
 * send delBA in two unexpected conditions (as required by the spec)
 * connect keys cleanups - allow only WEP with index 0-3
 * per-station aggregation limit to work around broken APs
 * debugfs improvement for the integrated codel algorithm
and various other small improvements and cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:29:08 -04:00
David S. Miller 7ac327318e Two more fixes:
* reject aggregation sessions for TSID/TID 8-16 that we
    can never use anyway and which could confuse drivers
  * check return value of skb_linearize()
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJX2+l7AAoJEGt7eEactAAdedMP/0k225qWRR+19lavkZ3/lSqb
 YiQdYNwKCZf9WR+5FiQKFcTV4XMCfKiWX2eJJWqFI5r+V4mfTJAZDehZpddCvJou
 KW3dZYe25+DaFApdhQiJXTxKSeJAbRB7EWT4i0P54Ned0YUeFwwOjvxuH+fq2VY0
 tScX5JogPuGFVzXhlqUVZTjbHlBkcILGv7QjOXXL9yf4d2x57uXyabOB14l351RF
 4hqXm7u+JQedbGjhYHDMKnyuwXQW4mbhELBdyMh7heLPqufWf4Xakrwh8WYSKuBb
 H5dICeq+/JPKnwOStmwtz1FQVIlU08PMLnWzQbByrkDvQPCIkMHLu6h6oNierqAH
 z4vyy8w05KzFNG3R5egygopXbncZvSOCcOPkDQxbTYmh2tK97OMUjXbUfPEUXZs2
 WbSuECm+RVSNKtLlUip7TrM3iYI7xmCN/qPB4Wuq+z/JiSOrnP+G047iUbhdDpnE
 fwLmRWWdEGkjnv3TFdzzxLVAJPCQWyuLfxaUBlHlYRNqBY8ufTR3FHUKj4dVwqJK
 EqnSiDUqbZE6qJw516MEp+CqZGb8hzeMCeAu+docWMa3CmnNlglZtYgNiou45y9Z
 9/8dAO93/qAzsaw9A/XprnIVPcu1vsFSicrhjOdZsnYoVJDGoW6rjzz6HpCk9QxB
 7FjxZ5LilJ5Ws4ltlwr7
 =3dek
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Two more fixes:
 * reject aggregation sessions for TSID/TID 8-16 that we
   can never use anyway and which could confuse drivers
 * check return value of skb_linearize()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:26:49 -04:00
Eric Dumazet 695b4ec0f0 pkt_sched: fq: use proper locking in fq_dump_stats()
When fq is used on 32bit kernels, we need to lock the qdisc before
copying 64bit fields.

Otherwise "tc -s qdisc ..." might report bogus values.

Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:15:08 -04:00
Thadeu Lima de Souza Cascardo db74a3335e openvswitch: use percpu flow stats
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there were no corruption on the slab cache.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Cc: pravin shelar <pshelar@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:14:01 -04:00
Thadeu Lima de Souza Cascardo 40773966cc openvswitch: fix flow stats accounting when node 0 is not possible
On a system with only node 1 as possible, all statistics is going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:14:01 -04:00
Xin Long 41001faf95 sctp: not return ENOMEM err back in sctp_packet_transmit
As David and Marcelo's suggestion, ENOMEM err shouldn't return back to
user in transmit path. Instead, sctp's retransmit would take care of
the chunks that fail to send because of ENOMEM.

This patch is only to do some release job when alloc_skb fails, not to
return ENOMEM back any more.

Besides, it also cleans up sctp_packet_transmit's err path, and fixes
some issues in err path:

 - It didn't free the head skb in nomem: path.
 - No need to check nskb in no_route: path.
 - It should goto err: path if alloc_skb fails for head.
 - Not all the NOMEMs should free nskb.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:33 -04:00
Xin Long 83dbc3d4a3 sctp: make sctp_outq_flush/tail/uncork return void
sctp_outq_flush return value is meaningless now, this patch is
to make sctp_outq_flush return void, as well as sctp_outq_fail
and sctp_outq_uncork.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:33 -04:00
Xin Long 645194409b sctp: save transmit error to sk_err in sctp_outq_flush
Every time when sctp calls sctp_outq_flush, it sends out the chunks of
control queue, retransmit queue and data queue. Even if some trunks are
failed to transmit, it still has to flush all the transports, as it's
the only chance to clean that transmit_list.

So the latest transmit error here should be returned back. This transmit
error is an internal error of sctp stack.

I checked all the places where it uses the transmit error (the return
value of sctp_outq_flush), most of them are actually just save it to
sk_err.

Except for sctp_assoc/endpoint_bh_rcv, they will drop the chunk if
it's failed to send a REPLY, which is actually incorrect, as we can't
be sure the error that sctp_outq_flush returns is from sending that
REPLY.

So it's meaningless for sctp_outq_flush to return error back.

This patch is to save transmit error to sk_err in sctp_outq_flush, the
new error can update the old value. Eventually, sctp_wait_for_* would
check for it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long b61c654f9b sctp: free msg->chunks when sctp_primitive_SEND return err
Last patch "sctp: do not return the transmit err back to sctp_sendmsg"
made sctp_primitive_SEND return err only when asoc state is unavailable.
In this case, chunks are not enqueued, they have no chance to be freed if
we don't take care of them later.

This Patch is actually to revert commit 1cd4d5c432 ("sctp: remove the
unused sctp_datamsg_free()"), commit 69b5777f2e ("sctp: hold the chunks
only after the chunk is enqueued in outq") and commit 8b570dc9f7 ("sctp:
only drop the reference on the datamsg after sending a msg"), to use
sctp_datamsg_free to free the chunks of current msg.

Fixes: 8b570dc9f7 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long 66388f2c08 sctp: do not return the transmit err back to sctp_sendmsg
Once a chunk is enqueued successfully, sctp queues can take care of it.
Even if it is failed to transmit (like because of nomem), it should be
put into retransmit queue.

If sctp report this error to users, it confuses them, they may resend
that msg, but actually in kernel sctp stack is in charge of retransmit
it already.

Besides, this error probably is not from the failure of transmitting
current msg, but transmitting or retransmitting another msg's chunks,
as sctp_outq_flush just tries to send out all transports' chunks.

This patch is to make sctp_cmd_send_msg return avoid, and not return the
transmit err back to sctp_sendmsg

Fixes: 8b570dc9f7 ("sctp: only drop the reference on the datamsg after sending a msg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Xin Long 2c89791eeb sctp: remove the unnecessary state check in sctp_outq_tail
Data Chunks are only sent by sctp_primitive_SEND, in which sctp checks
the asoc's state through statetable before calling sctp_outq_tail. So
there's no need to check the asoc's state again in sctp_outq_tail.

Besides, sctp_do_sm is protected by lock_sock, even if sending msg is
interrupted by timer events, the event's processes still need to acquire
lock_sock first. It means no others CMDs can be enqueue into side effect
list before CMD_SEND_MSG to change asoc->state, so it's safe to remove it.

This patch is to remove redundant asoc->state check from sctp_outq_tail.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-18 22:02:32 -04:00
Alexei Starovoitov 8d79266bc4 ip6_tunnel: add collect_md mode to IPv6 tunnels
Similar to gre, vxlan, geneve tunnels allow IPIP6 and IP6IP6 tunnels
to operate in 'collect metadata' mode.
Unlike ipv4 code here it's possible to reuse ip6_tnl_xmit() function
for both collect_md and traditional tunnels.
bpf_skb_[gs]et_tunnel_key() helpers and ovs (in the future) are the users.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:13:07 -04:00
Alexei Starovoitov cfc7381b30 ip_tunnel: add collect_md mode to IPIP tunnel
Similar to gre, vxlan, geneve tunnels allow IPIP tunnels to
operate in 'collect metadata' mode.
bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away.
ovs can use it as well in the future (once appropriate ovs-vport
abstractions and user apis are added).
Note that just like in other tunnels we cannot cache the dst,
since tunnel_info metadata can be different for every packet.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:13:07 -04:00
Julia Lawall eb94737d71 l2tp: constify net_device_ops structures
Check for net_device_ops structures that are only stored in the netdev_ops
field of a net_device structure.  This field is declared const, so
net_device_ops structures that have this property can be declared as const
also.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct net_device_ops i@p = { ... };

@ok@
identifier r.i;
struct net_device e;
position p;
@@
e.netdev_ops = &i@p;

@bad@
position p != {r.p,ok.p};
identifier r.i;
struct net_device_ops e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
 struct net_device_ops i = { ... };
// </smpl>

The result of size on this file before the change is:
   text	      data     bss     dec         hex	  filename
   3401        931      44    4376        1118	net/l2tp/l2tp_eth.o

and after the change it is:
   text	     data        bss	    dec	    hex	filename
   3993       347         44       4384    1120	net/l2tp/l2tp_eth.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:07:23 -04:00
Alan Cox 5ff904d55d llc: switch type to bool as the timeout is only tested versus 0
(As asked by Dave in Februrary)

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:05:05 -04:00
Eric Dumazet 3613b3dbd1 tcp: prepare skbs for better sack shifting
With large BDP TCP flows and lossy networks, it is very important
to keep a low number of skbs in the write queue.

RACK and SACK processing can perform a linear scan of it.

We should avoid putting any payload in skb->head, so that SACK
shifting can be done if needed.

With this patch, we allow to pack ~0.5 MB per skb instead of
the 64KB initially cooked at tcp_sendmsg() time.

This gives a reduction of number of skbs in write queue by eight.
tcp_rack_detect_loss() likes this.

We still allow payload in skb->head for first skb put in the queue,
to not impact RPC workloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 10:05:05 -04:00
phil.turnbull@oracle.com 8ab86c00e3 irda: Free skb on irda_accept error path.
skb is not freed if newsk is NULL. Rework the error path so free_skb is
unconditionally called on function exit.

Fixes: c3ea9fa274 ("[IrDA] af_irda: IRDA_ASSERT cleanups")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 09:59:31 -04:00
Eric Dumazet ffb4d6c850 tcp: fix overflow in __tcp_retransmit_skb()
If a TCP socket gets a large write queue, an overflow can happen
in a test in __tcp_retransmit_skb() preventing all retransmits.

The flow then stalls and resets after timeouts.

Tested:

sysctl -w net.core.wmem_max=1000000000
netperf -H dest -- -s 1000000000

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 09:59:30 -04:00
David Howells 8a681c3605 rxrpc: Add config to inject packet loss
Add a configuration option to inject packet loss by discarding
approximately every 8th packet received and approximately every 8th DATA
packet transmitted.

Note that no locking is used, but it shouldn't really matter.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells 71f3ca408f rxrpc: Improve skb tracing
Improve sk_buff tracing within AF_RXRPC by the following means:

 (1) Use an enum to note the event type rather than plain integers and use
     an array of event names rather than a big multi ?: list.

 (2) Distinguish Rx from Tx packets and account them separately.  This
     requires the call phase to be tracked so that we know what we might
     find in rxtx_buffer[].

 (3) Add a parameter to rxrpc_{new,see,get,free}_skb() to indicate the
     event type.

 (4) A pair of 'rotate' events are added to indicate packets that are about
     to be rotated out of the Rx and Tx windows.

 (5) A pair of 'lost' events are added, along with rxrpc_lose_skb() for
     packet loss injection recording.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells ba39f3a0ed rxrpc: Remove printks from rxrpc_recvmsg_data() to fix uninit var
Remove _enter/_debug/_leave calls from rxrpc_recvmsg_data() of which one
uses an uninitialised variable.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:04 +01:00
David Howells 849979051c rxrpc: Add a tracepoint to follow what recvmsg does
Add a tracepoint to follow what recvmsg does within AF_RXRPC.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells 58dc63c998 rxrpc: Add a tracepoint to follow packets in the Rx buffer
Add a tracepoint to follow the life of packets that get added to a call's
receive buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells f3639df2d9 rxrpc: Add a tracepoint to log ACK transmission
Add a tracepoint to log information about ACK transmission.

Signed-off-by: David Howels <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells ec71eb9ada rxrpc: Add a tracepoint to log received ACK packets
Add a tracepoint to log information from received ACK packets.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells a124fe3ee5 rxrpc: Add a tracepoint to follow the life of a packet in the Tx buffer
Add a tracepoint to follow the insertion of a packet into the transmit
buffer, its transmission and its rotation out of the buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells 363deeab6d rxrpc: Add connection tracepoint and client conn state tracepoint
Add a pair of tracepoints, one to track rxrpc_connection struct ref
counting and the other to track the client connection cache state.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells a84a46d730 rxrpc: Add some additional call tracing
Add additional call tracepoint points for noting call-connected,
call-released and connection-failed events.

Also fix one tracepoint that was using an integer instead of the
corresponding enum value as the point type.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells a3868bfc8d rxrpc: Print the packet type name in the Rx packet trace
Print a symbolic packet type name for each valid received packet in the
trace output, not just a number.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 11:24:03 +01:00
David Howells 182f505624 rxrpc: Fix the basic transmit DATA packet content size at 1412 bytes
Fix the basic transmit DATA packet content size at 1412 bytes so that they
can be arbitrarily assembled into jumbo packets.

In the future, I'm thinking of moving to keeping a jumbo packet header at
the beginning of each packet in the Tx queue and creating the packet header
on the spot when kernel_sendmsg() is invoked.  That way, jumbo packets can
be assembled on the spur of the moment for (re-)transmission.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:54:32 +01:00
David Howells 2311e327cd rxrpc: Be consistent about switch value in rxrpc_send_call_packet()
rxrpc_send_call_packet() should use type in both its switch-statements
rather than using pkt->whdr.type.  This might give the compiler an easier
job of uninitialised variable checking.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:54:21 +01:00
David Howells 27d0fc431c rxrpc: Don't transmit an ACK if there's no reason set
Don't transmit an ACK if call->ackr_reason in unset.  There's the
possibility of a race between recvmsg() sending an ACK and the background
processing thread trying to send the same one.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:55 +01:00
David Howells dfa7d92040 rxrpc: Fix retransmission algorithm
Make the retransmission algorithm use for-loops instead of do-loops and
move the counter increments into the for-statement increment slots.

Though the do-loops are slighly more efficient since there will be at least
one pass through the each loop, the counter increments are harder to get
right as the continue-statements skip them.

Without this, if there are any positive acks within the loop, the do-loop
will cycle forever because the counter increment is never done.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells d01dc4c3c1 rxrpc: Fix the parsing of soft-ACKs
The soft-ACK parser doesn't increment the pointer into the soft-ACK list,
resulting in the first ACK/NACK value being applied to all the relevant
packets in the Tx queue.  This has the potential to miss retransmissions
and cause excessive retransmissions.

Fix this by incrementing the pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells 78883793f8 rxrpc: Fix unexposed client conn release
If the last call on a client connection is release after the connection has
had a bunch of calls allocated but before any DATA packets are sent (so
that it's not yet marked RXRPC_CONN_EXPOSED), an assertion will happen in
rxrpc_disconnect_client_call().

	af_rxrpc: Assertion failed - 1(0x1) >= 2(0x2) is false
	------------[ cut here ]------------
	kernel BUG at ../net/rxrpc/conn_client.c:753!

This is because it's expecting the conn to have been exposed and to have 2
or more refs - but this isn't necessarily the case.

Simply remove the assertion.  This allows the conn to be moved into the
inactive state and deleted if it isn't resurrected before the final put is
called.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells 357f5ef646 rxrpc: Call rxrpc_release_call() on error in rxrpc_new_client_call()
Call rxrpc_release_call() on getting an error in rxrpc_new_client_call()
rather than trying to do the cleanup ourselves.  This isn't a problem,
provided we set RXRPC_CALL_HAS_USERID only if we actually add the call to
the calls tree as cleanup code fragments that would otherwise cause
problems are conditional.

Without this, we miss some of the cleanup.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:21 +01:00
David Howells 66d58af7f4 rxrpc: Fix the putting of client connections
In rxrpc_put_one_client_conn(), if a connection has RXRPC_CONN_COUNTED set
on it, then it's accounted for in rxrpc_nr_client_conns and may be on
various lists - and this is cleaned up correctly.

However, if the connection doesn't have RXRPC_CONN_COUNTED set on it, then
the put routine returns rather than just skipping the extra bit of cleanup.

Fix this by making the extra bit of clean up conditional instead and always
killing off the connection.

This manifests itself as connections with a zero usage count hanging around
in /proc/net/rxrpc_conns because the connection allocated, but discarded,
due to a race with another process that set up a parallel connection, which
was then shared instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:53:20 +01:00
David Howells 0360da6db7 rxrpc: Purge the to_be_accepted queue on socket release
Purge the queue of to_be_accepted calls on socket release.  Note that
purging sock_calls doesn't release the ref owned by to_be_accepted.

Probably the sock_calls list is redundant given a purges of the recvmsg_q,
the to_be_accepted queue and the calls tree.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:51:54 +01:00
David Howells e6f3afb3fc rxrpc: Record calls that need to be accepted
Record calls that need to be accepted using sk_acceptq_added() otherwise
the backlog counter goes negative because sk_acceptq_removed() is called.
This causes the preallocator to malfunction.

Calls that are preaccepted by AFS within the kernel aren't affected by
this.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:51:54 +01:00
David Howells 816c9fce12 rxrpc: Fix handling of the last packet in rxrpc_recvmsg_data()
The code for determining the last packet in rxrpc_recvmsg_data() has been
using the RXRPC_CALL_RX_LAST flag to determine if the rx_top pointer points
to the last packet or not.  This isn't a good idea, however, as the input
code may be running simultaneously on another CPU and that sets the flag
*before* updating the top pointer.

Fix this by the following means:

 (1) Restrict the use of RXRPC_CALL_RX_LAST to the input routines only.
     There's otherwise a synchronisation problem between detecting the flag
     and checking tx_top.  This could probably be dealt with by appropriate
     application of memory barriers, but there's a simpler way.

 (2) Set RXRPC_CALL_RX_LAST after setting rx_top.

 (3) Make rxrpc_rotate_rx_window() consult the flags header field of the
     DATA packet it's about to discard to see if that was the last packet.
     Use this as the basis for ending the Rx phase.  This shouldn't be a
     problem because the recvmsg side of things is guaranteed to see the
     packets in order.

 (4) Make rxrpc_recvmsg_data() return 1 to indicate the end of the data if:

     (a) the packet it has just processed is marked as RXRPC_LAST_PACKET

     (b) the call's Rx phase has been ended.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:51:54 +01:00
David Howells 2e2ea51dec rxrpc: Check the return value of rxrpc_locate_data()
Check the return value of rxrpc_locate_data() in rxrpc_recvmsg_data().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:50:49 +01:00
David Howells 4b22457c06 rxrpc: Move the check of rx_pkt_offset from rxrpc_locate_data() to caller
Move the check of rx_pkt_offset from rxrpc_locate_data() to the caller,
rxrpc_recvmsg_data(), so that it's more clear what's going on there.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:50:48 +01:00
David Howells fabf920180 rxrpc: Remove some whitespace.
Remove a tab that's on a line that should otherwise be blank.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-17 10:50:15 +01:00
David Howells d19127473a rxrpc: Make IPv6 support conditional on CONFIG_IPV6
Add CONFIG_AF_RXRPC_IPV6 and make the IPv6 support code conditional on it.
This is then made conditional on CONFIG_IPV6.

Without this, the following can be seen:

   net/built-in.o: In function `rxrpc_init_peer':
>> peer_object.c:(.text+0x18c3c8): undefined reference to `ip6_route_output_flags'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17 03:58:45 -04:00
Linus Torvalds 87ee1280ff Fix a memory corruption bug that I introduced in 4.7.
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX3GKYAAoJECebzXlCjuG+7jEQAIgMFl0feSlYrWqPjHZuUXcu
 uNdvnxKgWWP+JJp5pEh+uFLUsXqamfMqcEj4B0OgcoZcsYMDVsAEafeSWkdNVQeO
 Wr+j+IBcC1oECGA5MbZWzHBzw1CMq5/l411pvU+2xeZl8iG83MCl9su8dgf4ctAe
 hc7asU5fiQ+nazhfBaKpVzMAQO+G76E5J90B/Y+VbjYvtQoITUaT3R/PqBIKoG6j
 ezlElPHKddD2Yd8JnqPfnQ521I4yQL2AhrtJzQayCAnDlBXJ0B41IchcFRxPJPBX
 teNybHjHX9XQMGyE13EM4MO43AM9Bfcn7hAUPppag4tnvmS34fkIbBsWwWJjTmhR
 sjfLoBbP11KgGUjg7vRbNNsQ1zHhDyvss5ChUQAqfkDDH7CpnmOPXUqKKSGIfVvw
 fCpBMqxh5xwSzX3m2Wm2wVsUYH0rSyTKmcomdrZw0o577L2HsoaqWa2hcD/v7fE2
 MtPnYLszJXJA3k1ZLoJ6L4sFphznOwzwrj8oLpan5YGsJnonyHcequZw9Z4YoMYG
 YPztd/kwctIezKhHyRqmr8lghTJ5vjLK49FfJEjrDAb8vfXq3Gwb7ZkwtZBH9RSw
 CV2vmJ1ny9PRRCKw0UCYAfUv5ucS2JiRbk31qjoIMYpDWJQghrZHnltRxgKg5UDW
 ZxTGNLwH/WXRVHJzTaeE
 =gr4D
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd bugfix from Bruce Fields:
 "Fix a memory corruption bug that I introduced in 4.7"

* tag 'nfsd-4.8-2' of git://linux-nfs.org/~bfields/linux:
  svcauth_gss: Revert 64c59a3726 ("Remove unnecessary allocation")
2016-09-16 17:00:26 -07:00
Luca Coelho fbd05e4a6e cfg80211: add helper to find an IE that matches a byte-array
There are a few places where an IE that matches not only the EID, but
also other bytes inside the element, needs to be found.  To simplify
that and reduce the amount of similar code, implement a new helper
function to match the EID and an extra array of bytes.

Additionally, simplify cfg80211_find_vendor_ie() by using the new
match function.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-16 14:49:52 +02:00
Emmanuel Grumbach c68df2e7be mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE
In 46fa38e84b ("mac80211: allow software PS-Poll/U-APSD with
AP_LINK_PS"), Johannes allowed to use mac80211's code for handling
stations that go to PS or send PS-Poll / uAPSD trigger frames for
devices that enable RSS.

This means that mac80211 doesn't look at frames anymore but rather
relies on a notification that will come from the device when a PS
transition occurs or when a PS-Poll / trigger frame is detected by
the device.

iwlwifi will need this capability but still needs mac80211 to take
care of the TIM IE. Today, if a driver sets AP_LINK_PS, mac80211
will not update the TIM IE. Change mac80211 to check existence of
the set_tim driver callback rather than using AP_LINK_PS to decide
if the driver handles the TIM IE internally or not.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[reword commit message a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-16 14:49:23 +02:00
John Crispin cafdc45c94 net-next: dsa: add Qualcomm tag RX/TX handler
Add support for the 2-bytes Qualcomm tag that gigabit switches such as
the QCA8337/N might insert when receiving packets, or that we need
to insert while targeting specific switch ports. The tag is inserted
directly behind the ethernet header.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:31:51 -04:00
Mark Tomlinson d6f64d725b net: VRF: Pass original iif to ip_route_input()
The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
the global VRF, this replaces skb->dev with the VRF master interface.
When calling ip_route_input_noref() from here, the checks for forwarding
look at this master device instead of the initial ingress interface.
This will allow packets to be routed which normally would be dropped.
For example, an interface that is not assigned an IP address should
drop packets, but because the checking is against the master device, the
packet will be forwarded.

The fix here is to still call l3mdev_ip_rcv(), but remember the initial
net_device. This is passed to the other functions within ip_rcv_finish,
so they still see the original interface.

Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:24:07 -04:00
David S. Miller 6777ae8083 Here are two batman-adv bugfix patches:
- Fix reference counting for last_bonding_candidate, by Sven Eckelmann
 
  - Fix head room reservation for ELP packets, by Linus Luessing
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdBQJX2UCwFhxzd0BzaW1vbnd1bmRlcmxpY2guZGUACgkQoSvjmEKS
 nqHp+w/+MGG87sjUfObJKrSTmsIDgWHl2VSQwKBivbkvIuBm9gxIhFblaBQEJd5A
 88Z5LOGWpnxh2rfFIPurTLxkYAZyARpIxWOLVtmPE3TdfMi4savTsvkOywd0ZHqE
 kXZH1QHE/Y3CQBf9FaM5cgTiwAXjZ+KWr5kRg3WNmH0oQUepndUBb5AAbjpD2G3f
 Gt4TsfunplXCmA/5uJMISOjKiub8usUhXVXBHVpxR+ZItdoIolwN197wDzXT8pWK
 FJe+Flqrvxi3n0kNFknUzCDdt09TFLms4m0AEu+8f4P6t1mR7v+YHNM4IlBx8BX0
 6Kwiz009h9+5JFZvOSTCy6tkrodn5cAk9LNNsam5uPTWXeY76gFOzegdHIcIaBxq
 MLEqnTUgONqOakZs+4NopUz0HUvtJXDGYJcy7SDLYIYEwhaHP7seyuPkjA+xap2p
 uPZR3dTYESdeNGs/uPTEGVh1q8w6xXqXe9IRzP1KvGAOsg5IXCboLvJHpVMc25aT
 4CT9qz5H94sVqdR6d5pBb+0CLoxnhbl+4IjKnHwMBzVM/0LVDJZRmx9qgiNHJ+0f
 PGQeQEbAfrihORRn2jCEq0YLCHKhjvLw9O7GVj353wmiVkvvuR+MQEv+bcDgu0DE
 mH9maOpcOahud6S+x+m9of3f3ytwWR4UO2Qk5t1RdIuVRSD7sXM=
 =mHPX
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20160914' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are two batman-adv bugfix patches:

 - Fix reference counting for last_bonding_candidate, by Sven Eckelmann

 - Fix head room reservation for ELP packets, by Linus Luessing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:17:33 -04:00
Eric Dumazet 76f0dcbb5a tcp: fix a stale ooo_last_skb after a replace
When skb replaces another one in ooo queue, I forgot to also
update tp->ooo_last_skb as well, if the replaced skb was the last one
in the queue.

To fix this, we simply can re-use the code that runs after an insertion,
trying to merge skbs at the right of current skb.

This not only fixes the bug, but also remove all small skbs that might
be a subset of the new one.

Example:

We receive segments 2001:3001,  4001:5001

Then we receive 2001:8001 : We should replace 2001:3001 with the big
skb, but also remove 4001:50001 from the queue to save space.

packetdrill test demonstrating the bug

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.100 < . 1:1(0) ack 1 win 1024
+0 accept(3, ..., ...) = 4

+0.01 < . 1001:2001(1000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001>

+0.01 < . 1001:3001(2000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001>

Fixes: 9f5afeae51 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 04:09:49 -04:00
David S. Miller 364eac0c8b RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV9h/5PSw1s6N8H32AQLJCQ//RFbu0SNSoJnnZbOTwkxBaGYnGg4KbNVt
 iR4zumQfFssyYr7WcH1S6kuPzM/dJfjkRYqollyUGCEfnWyDwyfnjM9Na9PQoZ9F
 k7xnbim8N65njHLdGF6QMhenmoRXSBVCN2E0uPTbBXurFHJ8ZgQQs+DhogalvGUl
 2TL/aMdpqRoo1Vg0/APVOKeLGqgHEhrXxelTZB/74IXyYT+rzjfzu+ZfwxUAijsM
 d+FBSwY+D8RYSV4LXQzMNNFCwNORbG2Rse2nEqd7bVqdVywWsuhbgeESjx1Y3+ge
 /mofVyxrpoblT9qsScbISbIQEe6cLxRiQgQHEudennRI2/3EbpNSijhNFWVon2Em
 NAa7r+tfOPtVx5JTL9NyvwtrXPfAgDi7Stpml3Yhhr/CjRHYK9kfKysowMXL5vOz
 NHD0fUozNLecpGCmdxG+alwf5BJ5q9DRPP7bI7KE/4FfVsYe8bO6pQ9G+myeP/A4
 h8DuvK4xSJUEpEa5dpDLA1wSC2XH5PgYdXIr2DFBaFjllIdf1cGNKKIYRF2eIX/I
 obVD7c72oZV1kIK3URTLJpE+CdA4KgTFuL3YqIxquA2Iedb1t2uOrQcp4WwaWf7V
 REY9KbBn1F0+yJfO3Fjckerzle+MlAmrAHQpkZUduo5JzdRm3DY++YzswXQT5Fpr
 8S3T30nwY0k=
 =T1wh
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160913-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Support IPv6

Here is a set of patches that add IPv6 support.  They need to be applied on
top of the just-posted miscellaneous fix patches.  They are:

 (1) Make autobinding of an unconnected socket work when sendmsg() is
     called to initiate a client call.

 (2) Don't specify the protocol when creating the client socket, but rather
     take the default instead.

 (3) Use rxrpc_extract_addr_from_skb() in a couple of places that were
     doing the same thing manually.  This allows the IPv6 address
     extraction to be done in fewer places.

 (4) Add IPv6 support.  With this, calls can be made to IPv6 servers from
     userspace AF_RXRPC programs; AFS, however, can't use IPv6 yet as the
     RPC calls need to be upgradeable.
====================

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 01:57:19 -04:00
David S. Miller 39caa8bf6d RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV9h5A/Sw1s6N8H32AQJOPA//UI0606GZV2zjGqvWYbwquxjhWbbiVfEx
 CB5BeiQjKs8MxrJeHT/+bh6Z1Y6YorkyrVCc7kI1RQ+yiN0hw49bhFfF9Kr46DBF
 gYI2VdiKjIFEgC9fTenLkhMDQC7Hhf9O50hzk9QcC4y7w1Lhytah97d9w+Df0ECy
 a2QLMe2Ad9K5qR08ih3yTH7+G9K1m4/iqIrON2Hd9Opb+oFJgOiixvUVPr9f/6Xd
 /2YeAPDy/2A1MQ2nNE+oSW4C5uD+mJICqjjSw9YyhYl31lIfwBZ7+DE9hjR1qCXj
 UzMJLKrutXQQ1U7/Fbbke6UU5yKVm1djQB1qTF8t1hCHp/q88E7T06UUU9oBDqe0
 98CjPofEXBcqn9hjrXIvJgxCEISTPHx9ikaq0i5yF/6pSHZ9G8gLUfrqbMwipkfk
 mXItd6HAHXhX7cS5u76v7I4c9u5olexX5cJ91/ibtOdsupiJTMLwCx4twR6knEcS
 /6SSqjklFL4f6HjuNlNJ8m2dB98DII+Ym0qo/ZQy4KUm/+0yzrkpGHvt32CR4wng
 qjtDN+KgxNss1duu4zkHgQe22u3iSRToxwydWTIQYY6tx4e08X1eSIFRL5ddYpEC
 bjnOtmniAyDP5YF1jRwFDLS3YzT9Uvrf0TVAOvU7/FjPh3KCGa8fn38xIbEsX6eI
 1uadG1bf9wg=
 =vHfH
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160913-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Miscellaneous fixes

Here's a set of miscellaneous fix patches.  There are a couple of points of
note:

 (1) There is one non-fix patch that adjusts the call ref tracking
     tracepoint to make kernel API-held refs on calls more obvious.  This
     is a prerequisite for the patch that fixes prealloc refcounting.

 (2) The final patch alters how jumbo packets that partially exceed the
     receive window are handled.  Previously, space was being left in the
     Rx buffer for them, but this significantly hurts performance as the Rx
     window can't be increased to match the OpenAFS Tx window size.

     Instead, the excess subpackets are discarded and an EXCEEDS_WINDOW ACK
     is generated for the first.  To avoid the problem of someone trying to
     run the kernel out of space by feeding the kernel a series of
     overlapping maximal jumbo packets, we stop allowing jumbo packets on a
     call if we encounter more than three jumbo packets with duplicate or
     excessive subpackets.
====================

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 01:52:20 -04:00
David S. Miller 3e454fd5db A few more fixes:
* better mesh path fixing, from Thomas
  * fix TIM IE recalculation after sending frames
    to a sleeping station, from Felix
  * fix sequence number assignment while sending
    frames to a sleeping station, also from Felix
  * validate number of probe response CSA counter
    offsets, fixing a copy/paste bug (from myself)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJX2FrGAAoJEGt7eEactAAd/ZUP/ix4CokfC/WYxAYoOI6R2X+X
 UVvkFgG8hsZMzm2PZ4bg14jx9cQdnisgzxJnDgwoYvAnjBR0mbPl50LXoJ+JTlVr
 JnoIeMEKXqtxP099B4ssSCpW8YEr6Ex8RFweqN5vLuyIhlW63To1pOMvj9+zvUDB
 3P7CmX2D7cWYys8ciDEaNhHziJRAdd9AruWf3emS1IhScDw05NlDQlff6FZpG/gW
 JMVGjKs+v5DdT3SH677/S6KfpwkPEpIoWzgYH32OW7nXxfuYZCoMWSdeoCfgNe+i
 WgZ154RPBD/YcXsCdNtw36uped7qlrS55Fnlyi1bKMx7vdKLJHABh+FB7qxFpO2F
 2gh3G/KBcuGFfOihPXp11m3LlUn+ml1Ri3F8ffwvoO2ZoZGDplrH6oX2+Q9Hlm7U
 zvhuVzLriq0xHGZhv4EXqMt+70Zk0GvwzBviSjYXEwUJ+h/FkCyrVyUCpfmko9lf
 8QMhI+rm0b5BtOvTkGI4ghEbpz80x1UnnUEGiMdzFhFIunSy43EzJ/RbpiAiaOCI
 pou/moi9+0yMAuJQkMEK2Yg+DdluTsSNuMHB7NKTx8o/wzwgFWFJ6jHlQO72SM43
 k0Ku66LyJfovmv03dPaTxDSISaoxsn+1+SlsUPAF0usLyLka9A0WBjOE8edfDI0h
 hAamoQ0QHd4OlHtgKib6
 =xzhM
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A few more fixes:
 * better mesh path fixing, from Thomas
 * fix TIM IE recalculation after sending frames
   to a sleeping station, from Felix
 * fix sequence number assignment while sending
   frames to a sleeping station, also from Felix
 * validate number of probe response CSA counter
   offsets, fixing a copy/paste bug (from myself)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-16 01:34:08 -04:00
Lance Richardson 2679d04041 openvswitch: avoid deferred execution of recirc actions
The ovs kernel data path currently defers the execution of all
recirc actions until stack utilization is at a minimum.
This is too limiting for some packet forwarding scenarios due to
the small size of the deferred action FIFO (10 entries). For
example, broadcast traffic sent out more than 10 ports with
recirculation results in packet drops when the deferred action
FIFO becomes full, as reported here:

     http://openvswitch.org/pipermail/dev/2016-March/067672.html

Since the current recursion depth is available (it is already tracked
by the exec_actions_level pcpu variable), we can use it to determine
whether to execute recirculation actions immediately (safe when
recursion depth is low) or defer execution until more stack space is
available.

With this change, the deferred action fifo size becomes a non-issue
for currently failing scenarios because it is no longer used when
there are three or fewer recursions through ovs_execute_actions().

Suggested-by: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 20:35:52 -04:00
Or Gerlitz a53d850a79 net/sched: cls_flower: Remove an unused field from the filter key structure
Commit c3f8324188 "net: Add full IPv6 addresses to flow_keys" added an
unused instance of struct flow_dissector_key_addrs into struct fl_flow_key,
remove it.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 20:27:23 -04:00
Or Gerlitz aa72d70837 net/sched: cls_flower: Support masking for matching on tcp/udp ports
Add the definitions for src/dst udp/tcp port masks and use
them when setting && dumping the relevant keys.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 20:27:23 -04:00
Jamal Hadi Salim 86da71b573 net_sched: Introduce skbmod action
This action is intended to be an upgrade from a usability perspective
from pedit (as well as operational debugability).
Compare this:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action pedit munge offset -14 u8 set 0x02 \
munge offset -13 u8 set 0x15 \
munge offset -12 u8 set 0x15 \
munge offset -11 u8 set 0x15 \
munge offset -10 u16 set 0x1515 \
pipe

to:

sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbmod dmac 02:15:15:15:15:15

Also try to do a MAC address swap with pedit or worse
try to debug a policy with destination mac, source mac and
etherype. Then make few rules out of those and you'll get my point.

In the future common use cases on pedit can be migrated to this action
(as an example different fields in ip v4/6, transports like tcp/udp/sctp
etc). For this first cut, this allows modifying basic ethernet header.

The most important ethernet use case at the moment is when redirecting or
mirroring packets to a remote machine. The dst mac address needs a re-write
so that it doesnt get dropped or confuse an interconnecting (learning) switch
or dropped by a target machine (which looks at the dst mac). And at times
when flipping back the packet a swap of the MAC addresses is needed.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:33:47 -04:00
Daniel Borkmann f53d8c7b18 bpf: use skb_at_tc_ingress helper in tcf_bpf
We have a small skb_at_tc_ingress() helper for testing for ingress, so
make use of it. cls_bpf already uses it and so should act_bpf.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:29:47 -04:00
Daniel Borkmann 04b3f8de4b bpf: drop unnecessary test in cls_bpf_classify and tcf_bpf
The skb_mac_header_was_set() test in cls_bpf's and act_bpf's fast-path is
actually unnecessary and can be removed altogether. This was added by
commit a166151cbe ("bpf: fix bpf helpers to use skb->mac_header relative
offsets"), which was later on improved by 3431205e03 ("bpf: make programs
see skb->data == L2 for ingress and egress"). We're always guaranteed to
have valid mac header at the time we invoke cls_bpf_classify() or tcf_bpf().

Reason is that since 6d1ccff627 ("net: reset mac header in dev_start_xmit()")
we do skb_reset_mac_header() in __dev_queue_xmit() before we could call
into sch_handle_egress() or any subsequent enqueue. sch_handle_ingress()
always sees a valid mac header as well (things like skb_reset_mac_len()
would badly fail otherwise). Thus, drop the unnecessary test in classifier
and action case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:29:47 -04:00
Hadar Hen Zion 07c0f09e23 net/sched: act_tunnel_key: Remove rcu_read_lock protection
Remove rcu_read_lock protection from tunnel_key_dump and use
rtnl_dereference, dump operation is protected by  rtnl lock.

Also, remove rcu_read_lock from tunnel_key_release and use
rcu_dereference_protected.

Both operations are running exclusively and a writer couldn't modify
t->params while those functions are executed.

Fixes: 54d94fd89d90 ('net/sched: Introduce act_tunnel_key')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-15 19:18:18 -04:00
Johannes Berg ec53c832ee cfg80211: remove unnecessary pointer-of
For an array, there's no need to use &array, so just use the
plain wiphy->addresses[i].addr here to silence smatch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:20 +02:00
Rajkumar Manoharan e8a24cd4b8 mac80211: allow driver to handle packet-loss mechanism
Based on consecutive msdu failures, mac80211 triggers CQM packet-loss
mechanism. Drivers like ath10k that have its own connection monitoring
algorithm, offloaded to firmware for triggering station kickout. In case
of station kickout, driver will report low ack status by mac80211 API
(ieee80211_report_low_ack).

This flag will enable the driver to completely rely on firmware events
for station kickout and bypass mac80211 packet loss mechanism.

Signed-off-by: Rajkumar Manoharan <rmanohar@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:20 +02:00
Johannes Berg c7e9dbcf09 mac80211: remove sta_remove_debugfs driver callback
No drivers implement this, relying either on the recursive
directory removal to remove their debugfs, or not having any
to start with. Remove the dead driver callback.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:19 +02:00
Johannes Berg 8826fef95b mac80211: remove pointless chanctx NULL check
If chanctx is derived as container_of() from a non-NULL pointer,
it can't ever be NULL. Since we checked conf before, that's true
here, so remove the useless NULL check.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:19 +02:00
Johannes Berg 5140974dca mac80211: remove unused assignment
The next line overwrites this assignment, so remove it; there's
no real value in using it for the next assignment either.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:18 +02:00
Johannes Berg 53b18980fd nl80211: always check nla_put* return values
A few instances were found where we didn't check them, add the
missing checks even though they'll probably never trigger as
the message should be large enough here.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:17 +02:00
Johannes Berg 76e1fb4b55 nl80211: always check nla_nest_start() return value
If the message got full during nla_nest_start(), it can return
NULL. None of the cases here seem like that can really happen,
but check the return value nonetheless.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:17 +02:00
Johannes Berg 58bd7f1158 mac80211: fix scan completed tracing
Passing the 'info' pointer where a 'info->aborted' is expected will
always lead to tracing to erroneously record that the scan was aborted,
fix that by passing the correct info->aborted. The remaining data will
be collected in cfg80211, so I haven't duplicated it here.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:16 +02:00
Johannes Berg 93db1d9e6c mac80211: fix possible out-of-bounds access
In the unlikely situation that the supplicant has negotiated
admission for the background AC (which it has no reason to as
it's not supposed to be requiring admission control to start
with, and we'd ignore such a requirement anyway), the loop
here may terminate with non_acm_ac == 4, which leads to an
array overrun.

Check this explicitly just for completeness.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:46:16 +02:00
Johannes Berg f1c1f17ac5 cfg80211: allow connect keys only with default (TX) key
There's no point in allowing connect keys when one of them
isn't also configured as the TX key, it would just confuse
drivers and probably cause them to pick something for TX.
Disallow this confusing and erroneous configuration.

As wpa_supplicant will always send NL80211_ATTR_KEYS, even
when there are no keys inside, allow that and treat it as
though the attribute isn't present at all.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 16:45:41 +02:00
Johannes Berg 85d5313ed7 mac80211: reject TSPEC TIDs (TSIDs) for aggregation
Since mac80211 doesn't currently support TSIDs 8-15 which can
only be used after QoS TSPEC negotiation (and not even after
WMM negotiation), reject attempts to set up aggregation
sessions for them, which might confuse drivers. In mac80211
we do correctly handle that, but the TSIDs should never get
used anyway, and drivers might not be able to handle it.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-15 10:08:52 +02:00
Johannes Berg 0b97a484e5 mac80211: check skb_linearize() return value
The A-MSDU TX code (within TXQs) didn't always check the return value
of skb_linearize() properly, resulting in potentially passing a frag-
list SKB down to the driver even when it said it can't handle it. Fix
that.

Fixes: 6e0456b545 ("mac80211: add A-MSDU tx support")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-14 12:08:33 +02:00
David Howells 75b54cb57c rxrpc: Add IPv6 support
Add IPv6 support to AF_RXRPC.  With this, AF_RXRPC sockets can be created:

	service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET6);

instead of:

	service = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);

The AFS filesystem doesn't support IPv6 at the moment, though, since that
requires upgrades to some of the RPC calls.

Note that a good portion of this patch is replacing "%pI4:%u" in print
statements with "%pISpc" which is able to handle both protocols and print
the port.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells 1c2bc7b948 rxrpc: Use rxrpc_extract_addr_from_skb() rather than doing this manually
There are two places that want to transmit a packet in response to one just
received and manually pick the address to reply to out of the sk_buff.
Make them use rxrpc_extract_addr_from_skb() instead so that IPv6 is handled
automatically.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells aaa31cbc66 rxrpc: Don't specify protocol to when creating transport socket
Pass 0 as the protocol argument when creating the transport socket rather
than IPPROTO_UDP.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells cd5892c756 rxrpc: Create an address for sendmsg() to bind unbound socket with
Create an address for sendmsg() to bind unbound socket with rather than
using a completely blank address otherwise the transport socket creation
will fail because it will try to use address family 0.

We use the address family specified in the protocol argument when the
AF_RXRPC socket was created and SOCK_DGRAM as the default.  For anything
else, bind() must be used.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 23:09:13 +01:00
David Howells 75e4212639 rxrpc: Correctly initialise, limit and transmit call->rx_winsize
call->rx_winsize should be initialised to the sysctl setting and the sysctl
setting should be limited to the maximum we want to permit.  Further, we
need to place this in the ACK info instead of the sysctl setting.

Furthermore, discard the idea of accepting the subpackets of a jumbo packet
that lie beyond the receive window when the first packet of the jumbo is
within the window.  Just discard the excess subpackets instead.  This
allows the receive window to be opened up right to the buffer size less one
for the dead slot.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:45 +01:00
David Howells 3432a757b1 rxrpc: Fix prealloc refcounting
The preallocated call buffer holds a ref on the calls within that buffer.
The ref was being released in the wrong place - it worked okay for incoming
calls to the AFS cache manager service, but doesn't work right for incoming
calls to a userspace service.

Instead of releasing an extra ref service calls in rxrpc_release_call(),
the ref needs to be released during the acceptance/rejectance process.  To
this end:

 (1) The prealloc ref is now normally released during
     rxrpc_new_incoming_call().

 (2) For preallocated kernel API calls, the kernel API's ref needs to be
     released when the call is discarded on socket close.

 (3) We shouldn't take a second ref in rxrpc_accept_call().

 (4) rxrpc_recvmsg_new_call() needs to get a ref of its own when it adds
     the call to the to_be_accepted socket queue.

In doing (4) above, we would prefer not to put the call's refcount down to
0 as that entails doing cleanup in softirq context, but it's unlikely as
there are several refs held elsewhere, at least one of which must be put by
someone in process context calling rxrpc_release_call().  However, it's not
a problem if we do have to do that.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:37 +01:00
David Howells cbd00891de rxrpc: Adjust the call ref tracepoint to show kernel API refs
Adjust the call ref tracepoint to show references held on a call by the
kernel API separately as much as possible and add an additional trace to at
the allocation point from the preallocation buffer for an incoming call.

Note that this doesn't show the allocation of a client call for the kernel
separately at the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:30 +01:00
David Howells 01fd074224 rxrpc: Allow tx_winsize to grow in response to an ACK
Allow tx_winsize to grow when the ACK info packet shows a larger receive
window at the other end rather than only permitting it to shrink.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:38:24 +01:00
David Howells 89a80ed4c0 rxrpc: Use skb->len not skb->data_len
skb->len should be used rather than skb->data_len when referring to the
amount of data in a packet.  This will only cause a malfunction in the
following cases:

 (1) We receive a jumbo packet (validation and splitting both are wrong).

 (2) We see if there's extra ACK info in an ACK packet (we think it's not
     there and just ignore it).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:22 +01:00
David Howells b25de36053 rxrpc: Add missing unlock in rxrpc_call_accept()
Add a missing unlock in rxrpc_call_accept() in the path taken if there's no
call to wake up.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:22 +01:00
David Howells 33b603fda8 rxrpc: Requeue call for recvmsg if more data
rxrpc_recvmsg() needs to make sure that the call it has just been
processing gets requeued for further attention if the buffer has been
filled and there's more data to be consumed.  The softirq producer only
queues the call and wakes the socket if it fills the first slot in the
window, so userspace might end up sleeping forever otherwise, despite there
being data available.

This is not a problem provided the userspace buffer is big enough or it
empties the buffer completely before more data comes in.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
David Howells 91c2c7b656 rxrpc: The IDLE ACK packet should use rxrpc_idle_ack_delay
The IDLE ACK packet should use the rxrpc_idle_ack_delay setting when the
timer is set for it.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
David Howells bc4abfcf51 rxrpc: Add missing wakeup on Tx window rotation
We need to wake up the sender when Tx window rotation due to an incoming
ACK makes space in the buffer otherwise the sender is liable to just hang
endlessly.

This problem isn't noticeable if the Tx phase transfers no more than will
fit in a single window or the Tx window rotates fast enough that it doesn't
get full.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
David Howells 08a39685a7 rxrpc: Make sure we initialise the peer hash key
Peer records created for incoming connections weren't getting their hash
key set.  This meant that incoming calls wouldn't see more than one DATA
packet - which is not a problem for AFS CM calls with small request data
blobs.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-13 22:36:21 +01:00
Johannes Berg 89b706fb28 cfg80211: reduce connect key caching struct size
After the previous patches, connect keys can only (correctly)
be used for storing static WEP keys. Therefore, remove all the
data for dealing with key index 4/5 and reduce the size of the
key material to the maximum for WEP keys.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:54 +02:00
Johannes Berg e9c8f8d3a4 cfg80211: validate key index better
Don't accept it if a key_idx < 0 snuck through, reject WEP keys with
key index 4 and 5 (which are used for IGTKs) and don't allow IGTKs
with key indices other than 4 and 5. This makes the key data match
expectations better.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:53 +02:00
Johannes Berg 9381e267b6 cfg80211: wext: only allow WEP keys to be configured before connected
When not connected, anything but WEP keys shouldn't be allowed to be
configured for later - only static WEP keys make sense at this point.
Change wext to reject anything else just like nl80211 does.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:52 +02:00
Johannes Berg 386b1f2738 nl80211: only allow WEP keys during connect command
This was already documented that way in nl80211.h, but the
parsing code still accepted other key types. Change it to
really only accept WEP keys as documented.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:52 +02:00
Johannes Berg 42ee231cd1 nl80211: fix connect keys range check
Only key index 0-3 should be accepted, 4/5 are for IGTKs and
cannot be used as connect keys. Fix the range checking to not
allow such erroneous configurations.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:51 +02:00
Johannes Berg b6b5555bc8 cfg80211: disallow shared key authentication with key index 4
Key index 4 can only be used for an IGTK, so the range checks
for shared key authentication should treat 4 as an error, fix
that in the code.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:20:51 +02:00
Johannes Berg ad5987b47e nl80211: validate number of probe response CSA counters
Due to an apparent copy/paste bug, the number of counters for the
beacon configuration were checked twice, instead of checking the
number of probe response counters. Fix this to check the number of
probe response counters before parsing those.

Cc: stable@vger.kernel.org
Fixes: 9a774c78e2 ("cfg80211: Support multiple CSA counters")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 20:19:27 +02:00
Xin Long 715f5552b1 sctp: hold the transport before using it in sctp_hash_cmp
Since commit 4f00878126 ("sctp: apply rhashtable api to send/recv
path"), sctp uses transport rhashtable with .obj_cmpfn sctp_hash_cmp,
in which it compares the members of the transport with the rhashtable
args to check if it's the right transport.

But sctp uses the transport without holding it in sctp_hash_cmp, it can
cause a use-after-free panic. As after it gets transport from hashtable,
another CPU may close the sk and free the asoc. In sctp_association_free,
it frees all the transports, meanwhile, the assoc's refcnt may be reduced
to 0, assoc can be destroyed by sctp_association_destroy.

So after that, transport->assoc is actually an unavailable memory address
in sctp_hash_cmp. Although sctp_hash_cmp is under rcu_read_lock, it still
can not avoid this, as assoc is not freed by RCU.

This patch is to hold the transport before checking it's members with
sctp_transport_hold, in which it checks the refcnt first, holds it if
it's not 0.

Fixes: 4f00878126 ("sctp: apply rhashtable api to send/recv path")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:44:58 -04:00
Wei Yongjun c20cb81193 tipc: fix possible memory leak in tipc_udp_enable()
'ub' is malloced in tipc_udp_enable() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: ba5aa84a2d ("tipc: split UDP nl address parsing")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:28:32 -04:00
Vivien Didelot 308433155a net: bridge: add helper to call /sbin/bridge-stp
If /sbin/bridge-stp is available on the system, bridge tries to execute
it instead of the kernel implementation when starting/stopping STP.

If anything goes wrong with /sbin/bridge-stp, bridge silently falls back
to kernel STP, making hard to debug userspace STP.

This patch adds a br_stp_call_user helper to start/stop userspace STP
and debug errors from the program: abnormal exit status is stored in the
lower byte and normal exit status is stored in higher byte.

Below is a simple example on a kernel with dynamic debug enabled:

    # ln -s /bin/false /sbin/bridge-stp
    # brctl stp br0 on
    br0: failed to start userspace STP (256)
    # dmesg
    br0: /sbin/bridge-stp exited with code 1
    br0: failed to start userspace STP (256)
    br0: using kernel STP

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:21:31 -04:00
David S. Miller 67b9f0b737 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Endianess fix for the new nf_tables netlink trace infrastructure,
   NFTA_TRACE_POLICY endianess was not correct, patch from Liping Zhang.

2) Fix broken re-route after userspace queueing in nf_tables route
   chain. This patch is large but it is simple since it is just getting
   this code in sync with iptable_mangle. Also from Liping.

3) NAT mangling via ctnetlink lies to userspace when nf_nat_setup_info()
   fails to setup the NAT conntrack extension. This problem has been
   there since the beginning, but it can now show up after rhashtable
   conversion.

4) Fix possible NULL pointer dereference due to failures in allocating
   the synproxy and seqadj conntrack extensions, from Gao feng.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:17:24 -04:00
Johannes Berg 4854f175c3 mac80211: remove useless open_count check
__ieee80211_suspend() checks early on if there's anything
to do by checking open_count, so there's no need to check
again later in the function. Remove the useless check.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 15:39:29 +02:00
Gao Feng 4440a2ab3b netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions
When memory is exhausted, nfct_seqadj_ext_add may fail to add the
synproxy and seqadj extensions. The function nf_ct_seqadj_init doesn't
check if get valid seqadj pointer by the nfct_seqadj.

Now drop the packet directly when fail to add seqadj extension to
avoid dereference NULL pointer in nf_ct_seqadj_init from
init_conntrack().

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-13 10:50:56 +02:00
Laura Garcia Liebana 14e2dee099 netfilter: nft_hash: fix hash overflow validation
The overflow validation in the init() function establishes that the
maximum value that the hash could reach is less than U32_MAX, which is
likely to be true.

The fix detects the overflow when the maximum hash value is less than
the offset itself.

Fixes: 70ca767ea1 ("netfilter: nft_hash: Add hash offset value")
Reported-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-13 10:49:23 +02:00
Johannes Berg 11d62caf93 mac80211: simplify TDLS RA lookup
smatch pointed out that the second check of "tdls_auth" was
pointless since if it was true, we returned from the function
already. We can further simplify the code by moving the first
check (if it's a TDLS peer at all) into the outer if, to only
handle that inside. This simplifies the control flow here.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 08:28:22 +02:00
Toke Høiland-Jørgensen 8d51dbb8c7 mac80211: Re-structure aqm debugfs output and keep CoDel stats per txq
Currently the 'aqm' stats in mac80211 only keeps overlimit drop stats,
not CoDel stats. This moves the CoDel stats into the txqi structure to
keep them per txq in order to show them in debugfs.

In addition, the aqm debugfs output is restructured by splitting it up
into three files: One global per phy, one per netdev and one per
station, in the appropriate directories. The files are all called aqm,
and are only created if the driver supports the wake_tx_queue op (rather
than emitting an error on open as previously).

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-13 08:20:16 +02:00
David S. Miller b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Linus Torvalds 2c937eb4dd NFS client bugfixes for 4.8
Highlights include:
 
 Stable patches:
 - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct state
   accounting
 - Fix the CREATE_SESSION slot number
 
 Bugfixes:
 - sunrpc: fix a UDP memory accounting regression
 - NFS: Fix an error reporting regression in nfs_file_write()
 - pNFS: Fix further layout stateid issues
 - RPC/rdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
 - RPC/rdma: Fix receive buffer accounting
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJX1wEwAAoJEGcL54qWCgDysPMP/iEgzv6Peky9DVYG35btxZXC
 QQxZDfvOa3Xxe9cH0JwfyisaDHw2gO5RQqFFCCxA/x0dZsf2s3Nrjt6C9yH8q7qF
 i8c1OQ8oEBMgM+BsByCQniUubSaAvs2jVVpAs7G+eOYPSqxFKzsHJwDqqRp4aZrW
 YDohIumsHFoKl1GYCx9jv44wtmQQJjgIJ0Uq8SJvMkSzzRaGgVIeCbfpRgtqVD3g
 mU8k3XV0C+fnLgtwtlG1dkqbnuNSp1gT72f8joId+SJjtnGgjxqi0eIn48vY5k4N
 SJ5+4N6Uko87k9uQ2zn1UTR2Jrltn7mtMI7RHJVuiLnbZjAsf0lfOIF3sgItAwhS
 G0F/EHzMbt3+vs4P9EsGJgTcViVplgJeXw0hQIqXbJN0IwsXG0/UYGuPUFxtMOHQ
 +ko8BYJaNWcQCVdkFc5rVyt/tM6rKDahLlA3sIn3bCGssL67CYgkfNsBIoOEmjp9
 u4XTYwJYD2hXMpskc8W623voQ2/VDbbWB6bphmZH9EeOvlzRB5TW5OvEB0VE805+
 WYZal32LNnaUE4rpUtr78rYEvzPqn7tb9+OglP/tYa1QB3A0nwC9f74CDQ6s08oR
 K00fVXu9yffkBty8Cm0e4HpUcjT+95BMVdJUJU3lhbUbu+eq74L/32OSjuGmdRWf
 c4S6sHfgCeX6uJPCb2rD
 =j4kB
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct
     state accounting
   - Fix the CREATE_SESSION slot number

  Bugfixes:
   - sunrpc: fix a UDP memory accounting regression
   - NFS: Fix an error reporting regression in nfs_file_write()
   - pNFS: Fix further layout stateid issues
   - RPC/rdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer
     exhaustion...")
   - RPC/rdma: Fix receive buffer accounting"

* tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: Fix the CREATE_SESSION slot number accounting
  xprtrdma: Fix receive buffer accounting
  xprtrdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
  pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
  pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
  pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
  pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
  NFS: Fix error reporting in nfs_file_write()
  sunrpc: fix UDP memory accounting
2016-09-12 14:13:45 -07:00
Chuck Lever bf2c4b6f9b svcauth_gss: Revert 64c59a3726 ("Remove unnecessary allocation")
rsc_lookup steals the passed-in memory to avoid doing an allocation of
its own, so we can't just pass in a pointer to memory that someone else
is using.

If we really want to avoid allocation there then maybe we should
preallocate somwhere, or reference count these handles.

For now we should revert.

On occasion I see this on my server:

kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851!
kernel: invalid opcode: 0000 [#1] SMP
kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15
kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
kernel: Workqueue: events do_cache_clean [sunrpc]
kernel: task: ffff8808541d8000 task.stack: ffff880854344000
kernel: RIP: 0010:[<ffffffff811e7075>]  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP: 0018:ffff880854347d70  EFLAGS: 00010246
kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600
kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064
kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780
kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500
kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0
kernel: FS:  0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0
kernel: Stack:
kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0
kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006
kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f
kernel: Call Trace:
kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss]
kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss]
kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc]
kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc]
kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0
kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0
kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0
kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0
kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40
kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110
kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff
kernel: RIP  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP <ffff880854347d70>
kernel: ---[ end trace 3fdec044969def26 ]---

It seems to be most common after a server reboot where a client has been
using a Kerberos mount, and reconnects to continue its workload.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-12 16:57:16 -04:00
Pablo Neira Ayuso ecfcdfec7e netfilter: nf_nat: handle NF_DROP from nfnetlink_parse_nat_setup()
nf_nat_setup_info() returns NF_* verdicts, so convert them to error
codes that is what ctnelink expects. This has passed overlook without
having any impact since this nf_nat_setup_info() has always returned
NF_ACCEPT so far. Since 870190a9ec ("netfilter: nat: convert nat bysrc
hash to rhashtable"), this is problem.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 20:32:57 +02:00
Liping Zhang 2e917d602a netfilter: nft_numgen: fix race between num generate and store it
After we generate a new number, we still use the priv->counter and
store it to the dreg. This is not correct, another cpu may already
change it to a new number. So we must use the generated number, not
the priv->counter itself.

Fixes: 91dbc6be0a ("netfilter: nf_tables: add number generator expression")
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 20:00:23 +02:00
Florian Westphal 8e8118f893 netfilter: conntrack: remove packet hotpath stats
These counters sit in hot path and do show up in perf, this is especially
true for 'found' and 'searched' which get incremented for every packet
processed.

Information like

searched=212030105
new=623431
found=333613
delete=623327

does not seem too helpful nowadays:

- on busy systems found and searched will overflow every few hours
(these are 32bit integers), other more busy ones every few days.

- for debugging there are better methods, such as iptables' trace target,
the conntrack log sysctls.  Nowadays we also have perf tool.

This removes packet path stat counters except those that
are expected to be 0 (or close to 0) on a normal system, e.g.
'insert_failed' (race happened) or 'invalid' (proto tracker rejects).

The insert stat is retained for the ctnetlink case.
The found stat is retained for the tuple-is-taken check when NAT has to
determine if it needs to pick a different source address.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:59:39 +02:00
Gao Feng 23d07508d2 netfilter: Add the missed return value check of nft_register_chain_type
There are some codes of netfilter module which did not check the return
value of nft_register_chain_type. Add the checks now.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:45 +02:00
Gao Feng 4e6577de71 netfilter: Add the missed return value check of register_netdevice_notifier
There are some codes of netfilter module which did not check the return
value of register_netdevice_notifier. Add the checks now.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:43 +02:00
Pablo Neira cf71c03edf netfilter: nf_conntrack: simplify __nf_ct_try_assign_helper() return logic
Instead of several goto's just to return the result, simply return it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:34 +02:00
Pablo Neira Ayuso 71212c9b04 netfilter: nf_tables: don't drop IPv6 packets that cannot parse transport
This is overly conservative and not flexible at all, so better let them
go through and let the filtering policy decide what to do with them. We
use skb_header_pointer() all over the place so we would just fail to
match when trying to access fields from malformed traffic.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:32 +02:00
Pablo Neira Ayuso 10151d7b03 netfilter: nf_tables_bridge: use nft_set_pktinfo_ipv{4, 6}_validate
Consolidate pktinfo setup and validation by using the new generic
functions so we converge to the netdev family codebase.

We only need a linear IPv4 and IPv6 header from the reject expression,
so move nft_bridge_iphdr_validate() and nft_bridge_ip6hdr_validate()
to net/bridge/netfilter/nft_reject_bridge.c.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:15 +02:00
Pablo Neira Ayuso ddc8b6027a netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()
These functions are extracted from the netdev family, they initialize
the pktinfo structure and validate that the IPv4 and IPv6 headers are
well-formed given that these functions are called from a path where
layer 3 sanitization did not happen yet.

These functions are placed in include/net/netfilter/nf_tables_ipv{4,6}.h
so they can be reused by a follow up patch to use them from the bridge
family too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:09 +02:00
Pablo Neira Ayuso beac5afa2d netfilter: nf_tables: ensure proper initialization of nft_pktinfo fields
This patch introduces nft_set_pktinfo_unspec() that ensures proper
initialization all of pktinfo fields for non-IP traffic. This is used
by the bridge, netdev and arp families.

This new function relies on nft_set_pktinfo_proto_unspec() to set a new
tprot_set field that indicates if transport protocol information is
available. Remain fields are zeroed.

The meta expression has been also updated to check to tprot_set in first
place given that zero is a valid tprot value. Even a handcrafted packet
may come with the IPPROTO_RAW (255) protocol number so we can't rely on
this value as tprot unset.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:51:57 +02:00
Pablo Neira Ayuso dbd2be0646 netfilter: nft_dynset: allow to invert match criteria
The dynset expression matches if we can fit a new entry into the set.
If there is no room for it, then it breaks the rule evaluation.

This patch introduces the inversion flag so you can add rules to
explicitly drop packets that don't fit into the set. For example:

 # nft filter input flow table xyz size 4 { ip saddr timeout 120s counter } overflow drop

This is useful to provide a replacement for connlimit.

For the rule above, every new entry uses the IPv4 address as key in the
set, this entry gets a timeout of 120 seconds that gets refresh on every
packet seen. If we get new flow and our set already contains 4 entries
already, then this packet is dropped.

You can already express this in positive logic, assuming default policy
to drop:

 # nft filter input flow table xyz size 4 { ip saddr timeout 10s counter } accept

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:49:50 +02:00
Laura Garcia Liebana 70ca767ea1 netfilter: nft_hash: Add hash offset value
Add support to pass through an offset to the hash value. With this
feature, the sysadmin is able to generate a hash with a given
offset value.

Example:

	meta mark set jhash ip saddr mod 2 seed 0xabcd offset 100

This option generates marks according to the source address from 100 to
101.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
2016-09-12 18:37:12 +02:00
Linus Torvalds da499f8f53 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Mostly small sets of driver fixes scattered all over the place.

   1) Mediatek driver fixes from Sean Wang.  Forward port not written
      correctly during TX map, missed handling of EPROBE_DEFER, and
      mistaken use of put_page() instead of skb_free_frag().

   2) Fix socket double-free in KCM code, from WANG Cong.

   3) QED driver fixes from Sudarsana Reddy Kalluru, including a fix for
      using the dcbx buffers before initializing them.

   4) Mellanox Switch driver fixes from Jiri Pirko, including a fix for
      double fib removals and an error handling fix in
      mlxsw_sp_module_init().

   5) Fix kernel panic when enabling LLDP in i40e driver, from Dave
      Ertman.

   6) Fix padding of TSO packets in thunderx driver, from Sunil Goutham.

   7) TCP's rcv_wup not initialized properly when using fastopen, from
      Neal Cardwell.

   8) Don't use uninitialized flow keys in flow dissector, from Gao
      Feng.

   9) Use after free in l2tp module unload, from Sabrina Dubroca.

  10) Fix interrupt registry ordering issues in smsc911x driver, from
      Jeremy Linton.

  11) Fix crashes in bonding having to do with enslaving and rx_handler,
      from Mahesh Bandewar.

  12) AF_UNIX deadlock fixes from Linus.

  13) In mlx5 driver, don't read skb->xmit_mode after it might have been
      freed from the TX reclaim path.  From Tariq Toukan.

  14) Fix a bug from 2015 in TCP Yeah where the congestion window does
      not increase, from Artem Germanov.

  15) Don't pad frames on receive in NFP driver, from Jakub Kicinski.

  16) Fix chunk fragmenting in SCTP wrt. GSO, from Marcelo Ricardo
      Leitner.

  17) Fix deletion of VRF routes, from Mark Tomlinson.

  18) Fix device refcount leak when DAD fails in ipv6, from Wei Yongjun"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (101 commits)
  net/mlx4_en: Fix panic on xmit while port is down
  net/mlx4_en: Fixes for DCBX
  net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_state()
  net/mlx4_en: Fix the return value of mlx4_en_dcbnl_set_all()
  net: ethernet: renesas: sh_eth: add POST registers for rz
  drivers: net: phy: mdio-xgene: Add hardware dependency
  dwc_eth_qos: do not register semi-initialized device
  sctp: identify chunks that need to be fragmented at IP level
  mlxsw: spectrum: Set port type before setting its address
  mlxsw: spectrum_router: Fix error path in mlxsw_sp_router_init
  nfp: don't pad frames on receive
  nfp: drop support for old firmware ABIs
  nfp: remove linux/version.h includes
  tcp: cwnd does not increase in TCP YeAH
  net/mlx5e: Fix parsing of vlan packets when updating lro header
  net/mlx5e: Fix global PFC counters replication
  net/mlx5e: Prevent casting overflow
  net/mlx5e: Move an_disable_cap bit to a new position
  net/mlx5e: Fix xmit_more counter race issue
  tcp: fastopen: avoid negative sk_forward_alloc
  ...
2016-09-12 07:56:06 -07:00
Pedersen, Thomas 5df20f2141 mac80211: make mpath path fixing more robust
A fixed mpath was not quite being treated as such:

1) if a PERR frame was received, a fixed mpath was
   deactivated.

2) queued path discovery for fixed mpath was potentially
   being considered, changing mpath state.

3) other mpath flags were potentially being inherited when
   fixing the mpath. Just assign PATH_FIXED and SN_VALID.

This solves several issues when fixing a mesh path in one
direction. The reverse direction mpath should probably
also be fixed, or root announcements at least be enabled.

Signed-off-by: Thomas Pedersen <twp@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 12:27:14 +02:00
Felix Fietkau df6ef5d8a8 mac80211: fix sequence number assignment for PS response frames
When using intermediate queues, sequence number allocation is deferred
until dequeue. This doesn't work for PS response frames, which bypass
those queues.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:56:49 +02:00
Felix Fietkau 83843c80dc mac80211: fix tim recalculation after PS response
Handle the case where the mac80211 intermediate queues are empty and the
driver has buffered frames

Fixes: ba8c3d6f16 ("mac80211: add an intermediate software queue implementation")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:54:42 +02:00
Johannes Berg 53f249747d mac80211: send delBA on unexpected BlockAck Request
If we don't have a BA session, send delBA, as requested by the
IEEE 802.11 spec. Apply the same limit of sending such a delBA
only once as in the previous patch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:46:31 +02:00
Johannes Berg bfe40fa395 mac80211: send delBA on unexpected BlockAck data frames
When we receive data frames with ACK policy BlockAck, send
delBA as requested by the 802.11 spec. Since this would be
happening for every frame inside an A-MPDU if it's really
received outside a session, limit it to a single attempt.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:46:21 +02:00
Johannes Berg 99ee7cae3b mac80211: add support for radiotap timestamp field
Use the existing device timestamp from the RX status information
to add support for the new radiotap timestamp field. Currently
only 32-bit counters are supported, but we also add the radiotap
mactime where applicable. This new field allows more flexibility
in where the timestamp is taken etc. The non-timestamp data in
the field is taken from a new field in the hw struct.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:45:45 +02:00
Aviya Erenfeld 42bd20d998 mac80211: add support for MU-MIMO air sniffer
add support to MU-MIMO air sniffer according groupID:
in monitor mode, use a given MU-MIMO groupID to monitor stations
that belongs to that group using MU-MIMO.

add support for following a station according to its MAC address
using VHT MU-MIMO sniffer:
the monitors wait until they get an action MU-MIMO notification
frame, then parses it in order to find the groupID that corresponds
to the given MAC address and monitors packets destined to that
groupID using VHT MU-MIMO.

Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:44:52 +02:00
Maxim Altshul 480dd46b9d mac80211: RX BA support for sta max_rx_aggregation_subframes
The ability to change the max_rx_aggregation frames is useful
in cases of IOP.

There exist some devices (latest mobile phones and some AP's)
that tend to not respect a BA sessions maximum size (in Kbps).
These devices won't respect the AMPDU size that was negotiated during
association (even though they do respect the maximal number of packets).

This violation is characterized by a valid number of packets in
a single AMPDU. Even so, the total size will exceed the size negotiated
during association.

Eventually, this will cause some undefined behavior, which in turn
causes the hw to drop packets, causing the throughput to plummet.

This patch will make the subframe limitation to be held by each station,
instead of being held only by hw.

Signed-off-by: Maxim Altshul <maxim.altshul@ti.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:36:21 +02:00
Bhaktipriya Shridhar e481901384 cfg80211: Remove deprecated create_singlethread_workqueue
The workqueue "cfg80211_wq" is involved in cleanup, scan and event related
works. It queues multiple work items &rdev->event_work,
&rdev->dfs_update_channels_wk,
&wiphy_to_rdev(request->wiphy)->scan_done_wk,
&wiphy_to_rdev(wiphy)->sched_scan_results_wk, which require strict
execution ordering.
Hence, an ordered dedicated workqueue has been used.

Since it's a wireless driver, WQ_MEM_RECLAIM has been set to ensure
forward progress under memory pressure.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:24:48 +02:00
Aviya Erenfeld d82121845d mac80211: refactor monitor representation in sdata
Insert the u32 monitor flags variable in a new structure
that represents a monitor interface.
This will allow to add more configuration variables to
that structure which will happen in an upcoming change.

Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:24:47 +02:00
Denis Kenzior b7fb44daca nl80211: Allow GET_INTERFACE dumps to be filtered
This patch allows GET_INTERFACE dumps to be filtered based on
NL80211_ATTR_WIPHY or NL80211_ATTR_WDEV.  The documentation for
GET_INTERFACE mentions that this is possible:
"Request an interface's configuration; either a dump request on
a %NL80211_ATTR_WIPHY or ..."

However, this behavior has not been implemented until now.

Johannes: rewrite most of the patch:
 * use nl80211_dump_wiphy_parse() to also allow passing an interface
   to be able to dump its siblings
 * fix locking (must hold rtnl around using nl80211_fam.attrbuf)
 * make init self-contained instead of relying on other cb->args

Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-12 11:24:46 +02:00
David Ahern 8a966fc016 net: ipv6: Remove l3mdev_get_saddr6
No longer needed

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:53 -07:00
David Ahern d66f6c0a8f net: ipv4: Remove l3mdev_get_saddr
No longer needed

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:53 -07:00
David Ahern e0d56fdd73 net: l3mdev: remove redundant calls
A previous patch added l3mdev flow update making these hooks
redundant. Remove them.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:52 -07:00
David Ahern 4c1feac58e net: vrf: Flip IPv6 output path from FIB lookup hook to out hook
Flip the IPv6 output path to use the l3mdev tx out hook. The VRF dst
is not returned on the first FIB lookup. Instead, the dst on the
skb is switched at the beginning of the IPv6 output processing to
send the packet to the VRF driver on xmit.

Link scope addresses (linklocal and multicast) need special handling:
specifically the oif the flow struct can not be changed because we
want the lookup tied to the enslaved interface. ie., the source address
and the returned route MUST point to the interface scope passed in.
Convert the existing vrf_get_rt6_dst to handle only link scope addresses.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:52 -07:00
David Ahern ebfc102c56 net: vrf: Flip IPv4 output path from FIB lookup hook to out hook
Flip the IPv4 output path to use the l3mdev tx out hook. The VRF dst
is not returned on the first FIB lookup. Instead, the dst on the
skb is switched at the beginning of the IPv4 output processing to
send the packet to the VRF driver on xmit.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:52 -07:00
David Ahern 5f02ce24c2 net: l3mdev: Allow the l3mdev to be a loopback
Allow an L3 master device to act as the loopback for that L3 domain.
For IPv4 the device can also have the address 127.0.0.1.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:52 -07:00
David Ahern a8e3e1a9f0 net: l3mdev: Add hook to output path
This patch adds the infrastructure to the output path to pass an skb
to an l3mdev device if it has a hook registered. This is the Tx parallel
to l3mdev_ip{6}_rcv in the receive path and is the basis for removing
the existing hook that returns the vrf dst on the fib lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:52 -07:00
David Ahern 9ee0034b8f net: flow: Add l3mdev flow update
Add l3mdev hook to set FLOWI_FLAG_SKIP_NH_OIF flag and update oif/iif
in flow struct if its oif or iif points to a device enslaved to an L3
Master device. Only 1 needs to be converted to match the l3mdev FIB
rule. This moves the flow adjustment for l3mdev to a single point
catching all lookups. It is redundant for existing hooks (those are
removed in later patches) but is needed for missed lookups such as
PMTU updates.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 23:12:51 -07:00
Eric Dumazet 2594a2a928 tcp: better use ooo_last_skb in tcp_data_queue_ofo()
Willem noticed that we could avoid an rbtree lookup if the
the attempt to coalesce incoming skb to the last skb failed
for some reason.

Since most ooo additions are at the tail, this is definitely
worth adding a test and fast path.

Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:43:41 -07:00
Thadeu Lima de Souza Cascardo ed227099da openvswitch: use alias for genetlink family names
When userspace tries to create datapaths and the module is not loaded,
it will simply fail. With this patch, the module will be automatically
loaded.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:42:46 -07:00
Javier Martinez Canillas 65b323e2ff xfrm: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:11 -07:00
Javier Martinez Canillas aebf5de07a sctp: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:11 -07:00
Javier Martinez Canillas 0013de38a8 net: sched: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:11 -07:00
Javier Martinez Canillas 9dd79945b0 l2tp: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:11 -07:00
Javier Martinez Canillas 6ca40d4e84 ipv4: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:10 -07:00
Javier Martinez Canillas 181402a5c7 net: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:10 -07:00
Javier Martinez Canillas 9a81c34ace lec: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:10 -07:00
Javier Martinez Canillas a73ec314a0 appletalk: use IS_ENABLED() instead of checking for built-in or module
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.

Using the macro makes the code more readable by helping abstract away some
of the Kconfig built-in and module enable details.

Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 21:19:10 -07:00
Amir Vadai d0f6dd8a91 net/sched: Introduce act_tunnel_key
This action could be used before redirecting packets to a shared tunnel
device, or when redirecting packets arriving from a such a device.

The action will release the metadata created by the tunnel device
(decap), or set the metadata with the specified values for encap
operation.

For example, the following flower filter will forward all ICMP packets
destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
redirecting, a metadata for the vxlan tunnel is created using the
tunnel_key action and it's arguments:

$ tc filter add dev net0 protocol ip parent ffff: \
    flower \
      ip_proto 1 \
      dst_ip 11.11.11.2 \
    action tunnel_key set \
      src_ip 11.11.0.1 \
      dst_ip 11.11.0.2 \
      id 11 \
    action mirred egress redirect dev vxlan0

Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 20:53:56 -07:00
Amir Vadai bc3103f1ed net/sched: cls_flower: Classify packet in ip tunnels
Introduce classifying by metadata extracted by the tunnel device.
Outer header fields - source/dest ip and tunnel id, are extracted from
the metadata when classifying.

For example, the following will add a filter on the ingress Qdisc of shared
vxlan device named 'vxlan0'. To forward packets with outer src ip
11.11.0.2, dst ip 11.11.0.1 and tunnel id 11. The packets will be
forwarded to tap device 'vnet0' (after metadata is released):

$ tc filter add dev vxlan0 protocol ip parent ffff: \
    flower \
      enc_src_ip 11.11.0.2 \
      enc_dst_ip 11.11.0.1 \
      enc_key_id 11 \
      dst_ip 11.11.11.1 \
    action tunnel_key release \
    action mirred egress redirect dev vnet0

The action tunnel_key, will be introduced in the next patch in this
series.

Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 20:53:55 -07:00
Amir Vadai d817f432c2 net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id()
Add utility functions to convert a 32 bits key into a 64 bits tunnel and
vice versa.
These functions will be used instead of cloning code in GRE and VXLAN,
and in tc act_iptunnel which will be introduced in a following patch in
this patchset.

Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10 20:53:55 -07:00
Daniel Borkmann f3694e0012 bpf: add BPF_CALL_x macros for declaring helpers
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.

This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps detecting more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.

BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.

I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Daniel Borkmann 374fb54eea bpf: add own ctx rewriter on ifindex for clsact progs
When fetching ifindex, we don't need to test dev for being NULL since
we're always guaranteed to have a valid dev for clsact programs. Thus,
avoid this test in fast path.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Daniel Borkmann f035a51536 bpf: add BPF_SIZEOF and BPF_FIELD_SIZEOF macros
Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit
which otherwise often result in overly long bytes_to_bpf_size(sizeof())
and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro
helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF())
check in convert_bpf_extensions(), but we should rather make that generic
as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF()
users to detect any rewriter size issues at compile time. Note, there are
currently none, but we want to assert that it stays this way.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Daniel Borkmann 6088b5823b bpf: minor cleanups in helpers
Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
anyway, so we can drop the unneeded test for dev; also few more other
misc bits addressed here.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:03 -07:00
Eric Dumazet bf8d85d4f9 ip_tunnel: do not clear l4 hashes
If skb has a valid l4 hash, there is no point clearing hash and force
a further flow dissection when a tunnel encapsulation is added.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:33:11 -07:00
David S. Miller fa5f4aaf6e RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAV9FCuvSw1s6N8H32AQKo5w/8CySGsorFk67/QiQGdBt+URd8cxR2NuvF
 i3P7Kbo30ycJO7Q4Uc4DvO3kTqWiNMbXWVgGLfA64HDFojjuuXfQdFwf98FZ2WtQ
 OxQUV5fzSPFwlDktd5nWm5qTCdv7+lIvBCVsEPuX2pSkc7HesiYMsZt2ilOac9Ho
 Meon2/S1oq3hctZv2DTiaI+Ae8YBMar7GSUfylRGa2TkXCgG8eYcjGyGigLJ2F03
 e+/8w6+jtrW5hASCJPI9re+qiYgmnYa7UVpwrVjM1dVOYYZfmU02Jq6HgW9bSd24
 MYk6neksMGVpQbVmAbj5/MmxUg98q8UpY9ygt2IWP4UvGNDYBGCiSbfyQoTnoWUP
 02k3E6HnFfs8SPbxuNmA4uB2BHL2y87+G8u1g0IUZkT8i3zFwLd01UBwJqB23tYE
 EIRAad1xWwGaSJGyFgsmry1RJsitSUAG9w/68Ni1IMQxsHsIROTz6TNBki1tMcOh
 AAsbj4iJ0rJ2Ca/Xbk9kAdPzEr85ZA3Za5BwA9ZDwZjmt2X1RrzuK9gIaKB8hsWS
 zVjRjpvSOaTyx97rtEVfkT310GMGYC5r9ba+kE4ukGeHWKRVkMk5tkADZw9RFKdf
 ubXN/zyfv4YABHHUIfQn5UgHHmxl4GpN0CD+cY7hPtmB9J2wvsadckqrzBOFIQL+
 dg7jZAb+fjc=
 =GfEj
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160908' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Rewrite data and ack handling

This patch set constitutes the main portion of the AF_RXRPC rewrite.  It
consists of five fix/helper patches:

 (1) Fix ASSERTCMP's and ASSERTIFCMP's handling of signed values.

 (2) Update some protocol definitions slightly.

 (3) Use of an hlist for RCU purposes.

 (4) Removal of per-call sk_buff accounting (not really needed when skbs
     aren't being queued on the main queue).

 (5) Addition of a tracepoint to log incoming packets in the data_ready
     callback and to log the end of the data_ready callback.

And then there are two patches that form the main part:

 (6) Preallocation of resources for incoming calls so that in patch (7) the
     data_ready handler can be made to fully instantiate an incoming call
     and make it live.  This extends through into AFS so that AFS can
     preallocate its own incoming call resources.

     The preallocation size is capped at the listen() backlog setting - and
     that is capped at a sysctl limit which can be set between 4 and 32.

     The preallocation is (re)charged either by accepting/rejecting pending
     calls or, in the case of AFS, manually.  If insufficient preallocation
     resources exist, a BUSY packet will be transmitted.

     The advantage of using this preallocation is that once a call is set
     up in the data_ready handler, DATA packets can be queued on it
     immediately rather than the DATA packets being queued for a background
     work item to do all the allocation and then try and sort out the DATA
     packets whilst other DATA packets may still be coming in and going
     either to the background thread or the new call.

 (7) Rewrite the handling of DATA, ACK and ABORT packets.

     In the receive phase, DATA packets are now held in per-call circular
     buffers with deduplication, out of sequence detection and suchlike
     being done in data_ready.  Since there is only one producer and only
     once consumer, no locks need be used on the receive queue.

     Received ACK and ABORT packets are now parsed and discarded in
     data_ready to recycle resources as fast as possible.

     sk_buffs are no longer pulled, trimmed or cloned, but rather the
     offset and size of the content is tracked.  This particularly affects
     jumbo DATA packets which need insertion into the receive buffer in
     multiple places.  Annotations are kept to track which bit is which.

     Packets are no longer queued on the socket receive queue; rather,
     calls are queued.  Dummy packets to convey events therefore no longer
     need to be invented and metadata packets can be discarded as soon as
     parsed rather then being pushed onto the socket receive queue to
     indicate terminal events.

     The preallocation facility added in (6) is now used to set up incoming
     calls with very little locking required and no calls to the allocator
     in data_ready.

     Decryption and verification is now handled in recvmsg() rather than in
     a background thread.  This allows for the future possibility of
     decrypting directly into the user buffer.

     With this patch, the code is a lot simpler and most of the mass of
     call event and state wangling code in call_event.c is gone.

With this, the majority of the AF_RXRPC rewrite is complete.  However,
there are still things to be done, including:

 (*) Limit the number of active service calls to prevent an attacker from
     filling up a server's memory.

 (*) Limit the number of calls on the rebuff-with-BUSY queue.

 (*) Transmit delayed/deferred ACKs from recvmsg() if possible, rather than
     punting to the background thread.  Ideally, the background thread
     shouldn't run at all, but data_ready can't call kernel_sendmsg() and
     we can't rely on recvmsg() attending to the call in a timely fashion.

 (*) Prevent the call at the front of the socket queue from hogging
     recvmsg()'s attention if there's a sufficiently continuous supply of
     data.

 (*) Distribute ICMP errors by connection rather than by call.  Possibly
     parse the ICMP packet to try and pin down the exact connection and
     call.

 (*) Encrypt/decrypt directly between user buffers and socket buffers where
     possible.

 (*) IPv6.

 (*) Service ID upgrade.  This is a facility whereby a special flag bit is
     set in the DATA packet header when making a call that tells the server
     that it is allowed to change the service ID to an upgraded one and
     reply with an equivalent call from the upgraded service.

     This is used, for example, to override certain AFS calls so that IPv6
     addresses can be returned.

 (*) Allow userspace to preallocate call user IDs for incoming calls.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:24:21 -07:00
Marcelo Ricardo Leitner 7303a14750 sctp: identify chunks that need to be fragmented at IP level
Previously, without GSO, it was easy to identify it: if the chunk didn't
fit and there was no data chunk in the packet yet, we could fragment at
IP level. So if there was an auth chunk and we were bundling a big data
chunk, it would fragment regardless of the size of the auth chunk. This
also works for the context of PMTU reductions.

But with GSO, we cannot distinguish such PMTU events anymore, as the
packet is allowed to exceed PMTU.

So we need another check: to ensure that the chunk that we are adding,
actually fits the current PMTU. If it doesn't, trigger a flush and let
it be fragmented at IP level in the next round.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:18:33 -07:00
Colin Ian King 05f1b12f71 net: x25: remove null checks on arrays calling_ae and called_ae
dtefacs.calling_ae and called_ae are both 20 element __u8 arrays and
cannot be null and hence are redundant checks. Remove these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 18:13:30 -07:00
stephen hemminger b8b867e132 rtnetlink: remove unused ifla_stats_policy
This structure is defined but never used. Flagged with W=1

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 16:52:43 -07:00
Guillaume Nault 73483c1289 ipv6: report NLM_F_CREATE and NLM_F_EXCL flags in RTM_NEWROUTE events
Since commit 37a1d3611c ("ipv6: include NLM_F_REPLACE in route
replace notifications"), RTM_NEWROUTE notifications have their
NLM_F_REPLACE flag set if the new route replaced a preexisting one.
However, other flags aren't set.

This patch reports the missing NLM_F_CREATE and NLM_F_EXCL flag bits.

NLM_F_APPEND is not reported, because in ipv6 a NLM_F_CREATE request
is interpreted as an append request (contrary to ipv4, "prepend" is not
supported, so if NLM_F_EXCL is not set then NLM_F_APPEND is implicit).

As a result, the possible flag combination can now be reported
(iproute2's terminology into parentheses):

  * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
    ("add").
  * NLM_F_CREATE: route did already exist, new route added after
    preexisting ones ("append").
  * NLM_F_REPLACE: route did already exist, new route replaced the
    first preexisting one ("change").

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 16:50:23 -07:00
Guillaume Nault b93e1fa710 ipv4: fix value of ->nlmsg_flags reported in RTM_NEWROUTE events
fib_table_insert() inconsistently fills the nlmsg_flags field in its
notification messages.

Since commit b8f5583135 ("[RTNETLINK]: Fix sending netlink message
when replace route."), the netlink message has its nlmsg_flags set to
NLM_F_REPLACE if the route replaced a preexisting one.

Then commit a2bb6d7d6f ("ipv4: include NLM_F_APPEND flag in append
route notifications") started setting nlmsg_flags to NLM_F_APPEND if
the route matched a preexisting one but was appended.

In other cases (exclusive creation or prepend), nlmsg_flags is 0.

This patch sets ->nlmsg_flags in all situations, preserving the
semantic of the NLM_F_* bits:

  * NLM_F_CREATE: a new fib entry has been created for this route.
  * NLM_F_EXCL: no other fib entry existed for this route.
  * NLM_F_REPLACE: this route has overwritten a preexisting fib entry.
  * NLM_F_APPEND: the new fib entry was added after other entries for
    the same route.

As a result, the possible flag combination can now be reported
(iproute2's terminology into parentheses):

  * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
    ("add").
  * NLM_F_CREATE | NLM_F_APPEND: route did already exist, new route
    added after preexisting ones ("append").
  * NLM_F_CREATE: route did already exist, new route added before
    preexisting ones ("prepend").
  * NLM_F_REPLACE: route did already exist, new route replaced the
    first preexisting one ("change").

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 16:50:23 -07:00
Liping Zhang fe01111d23 netfilter: nft_queue: check the validation of queues_total and queuenum
Although the validation of queues_total and queuenum is checked in nft
utility, but user can add nft rules via nfnetlink, so it is necessary
to check the validation at the nft_queue expr init routine too.

Tested by run ./nft-test.py any/queue.t:
  any/queue.t: 6 unit tests, 0 error, 0 warning

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-09 15:54:48 +02:00
thomas.zeitlhofer+lkml@ze-it.at 1fb81e09d4 vti: use right inner_mode for inbound inter address family policy checks
In case of inter address family tunneling (IPv6 over vti4 or IPv4 over
vti6), the inbound policy checks in vti_rcv_cb() and vti6_rcv_cb() are
using the wrong address family. As a result, all inbound inter address
family traffic is dropped.

Use the xfrm_ip2inner_mode() helper, as done in xfrm_input() (i.e., also
increment LINUX_MIB_XFRMINSTATEMODEERROR in case of error), to select the
inner_mode that contains the right address family for the inbound policy
checks.

Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-09 09:02:08 +02:00
Mathias Krause 2f30ea5090 xfrm_user: propagate sec ctx allocation errors
When we fail to attach the security context in xfrm_state_construct()
we'll return 0 as error value which, in turn, will wrongly claim success
to userland when, in fact, we won't be adding / updating the XFRM state.

This is a regression introduced by commit fd21150a0f ("[XFRM] netlink:
Inline attach_encap_tmpl(), attach_sec_ctx(), and attach_one_addr()").

Fix it by propagating the error returned by security_xfrm_state_alloc()
in this case.

Fixes: fd21150a0f ("[XFRM] netlink: Inline attach_encap_tmpl()...")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-09-09 09:02:08 +02:00
Eric Dumazet e895cdce68 ipv4: accept u8 in IP_TOS ancillary data
In commit f02db315b8 ("ipv4: IP_TOS and IP_TTL can be specified as
ancillary data") Francesco added IP_TOS values specified as integer.

However, kernel sends to userspace (at recvmsg() time) an IP_TOS value
in a single byte, when IP_RECVTOS is set on the socket.

It can be very useful to reflect all ancillary options as given by the
kernel in a subsequent sendmsg(), instead of aborting the sendmsg() with
EINVAL after Francesco patch.

So this patch extends IP_TOS ancillary to accept an u8, so that an UDP
server can simply reuse same ancillary block without having to mangle
it.

Jesper can then augment
https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c
to add TOS reflection ;)

Fixes: f02db315b8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Francesco Fusco <ffusco@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 17:45:57 -07:00
Yaogong Wang 9f5afeae51 tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 17:25:58 -07:00
Artem Germanov db7196a0d0 tcp: cwnd does not increase in TCP YeAH
Commit 76174004a0
(tcp: do not slow start when cwnd equals ssthresh )
introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual
ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP
connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth
~100KB/s.
   That is caused by stalled cwnd when cwnd equals ssthresh. This patch
fixes it by proper increasing cwnd in this case.

Signed-off-by: Artem Germanov <agermanov@anchorfree.com>
Acked-by: Dmitry Adamushko <d.adamushko@anchorfree.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 17:16:12 -07:00
Eric Garver 018c1dda5f openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes
Add support for 802.1ad including the ability to push and pop double
tagged vlans. Add support for 802.1ad to netlink parsing and flow
conversion. Uses double nested encap attributes to represent double
tagged vlan. Inner TPID encoded along with ctci in nested attributes.

This is based on Thomas F Herbert's original v20 patch. I made some
small clean ups and bug fixes.

Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 17:10:28 -07:00
Lorenzo Colitti d545caca82 net: inet: diag: expose the socket mark to privileged processes.
This adds the capability for a process that has CAP_NET_ADMIN on
a socket to see the socket mark in socket dumps.

Commit a52e95abf7 ("net: diag: allow socket bytecode filters to
match socket marks") recently gave privileged processes the
ability to filter socket dumps based on mark. This patch is
complementary: it ensures that the mark is also passed to
userspace in the socket's netlink attributes.  It is useful for
tools like ss which display information about sockets.

Tested: https://android-review.googlesource.com/270210
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 16:13:09 -07:00
Eric Dumazet 76061f631c tcp: fastopen: avoid negative sk_forward_alloc
When DATA and/or FIN are carried in a SYN/ACK message or SYN message,
we append an skb in socket receive queue, but we forget to call
sk_forced_mem_schedule().

Effect is that the socket has a negative sk->sk_forward_alloc as long as
the message is not read by the application.

Josh Hunt fixed a similar issue in commit d22e153718 ("tcp: fix tcp
fin memory accounting")

Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 16:08:10 -07:00
David S. Miller 40e3012e6e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
ipsec 2016-09-08

1) Fix a crash when xfrm_dump_sa returns an error.
   From Vegard Nossum.

2) Remove some incorrect WARN() on normal error handling.
   From Vegard Nossum.

3) Ignore socket policies when rebuilding hash tables,
   socket policies are not inserted into the hash tables.
   From Tobias Brunner.

4) Initialize and check tunnel pointers properly before
   we use it. From Alexey Kodanev.

5) Fix l3mdev oif setting on xfrm dst lookups.
   From David Ahern.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 13:12:37 -07:00
David S. Miller 575f9c43e7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
ipsec-next 2016-09-08

1) Constify the xfrm_replay structures. From Julia Lawall

2) Protect xfrm state hash tables with rcu, lookups
   can be done now without acquiring xfrm_state_lock.
   From Florian Westphal.

3) Protect xfrm policy hash tables with rcu, lookups
   can be done now without acquiring xfrm_policy_lock.
   From Florian Westphal.

4) We don't need to have a garbage collector list per
   namespace anymore, so use a global one instead.
   From Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 13:09:41 -07:00
David Howells 248f219cb8 rxrpc: Rewrite the data and ack handling code
Rewrite the data and ack handling code such that:

 (1) Parsing of received ACK and ABORT packets and the distribution and the
     filing of DATA packets happens entirely within the data_ready context
     called from the UDP socket.  This allows us to process and discard ACK
     and ABORT packets much more quickly (they're no longer stashed on a
     queue for a background thread to process).

 (2) We avoid calling skb_clone(), pskb_pull() and pskb_trim().  We instead
     keep track of the offset and length of the content of each packet in
     the sk_buff metadata.  This means we don't do any allocation in the
     receive path.

 (3) Jumbo DATA packet parsing is now done in data_ready context.  Rather
     than cloning the packet once for each subpacket and pulling/trimming
     it, we file the packet multiple times with an annotation for each
     indicating which subpacket is there.  From that we can directly
     calculate the offset and length.

 (4) A call's receive queue can be accessed without taking locks (memory
     barriers do have to be used, though).

 (5) Incoming calls are set up from preallocated resources and immediately
     made live.  They can than have packets queued upon them and ACKs
     generated.  If insufficient resources exist, DATA packet #1 is given a
     BUSY reply and other DATA packets are discarded).

 (6) sk_buffs no longer take a ref on their parent call.

To make this work, the following changes are made:

 (1) Each call's receive buffer is now a circular buffer of sk_buff
     pointers (rxtx_buffer) rather than a number of sk_buff_heads spread
     between the call and the socket.  This permits each sk_buff to be in
     the buffer multiple times.  The receive buffer is reused for the
     transmit buffer.

 (2) A circular buffer of annotations (rxtx_annotations) is kept parallel
     to the data buffer.  Transmission phase annotations indicate whether a
     buffered packet has been ACK'd or not and whether it needs
     retransmission.

     Receive phase annotations indicate whether a slot holds a whole packet
     or a jumbo subpacket and, if the latter, which subpacket.  They also
     note whether the packet has been decrypted in place.

 (3) DATA packet window tracking is much simplified.  Each phase has just
     two numbers representing the window (rx_hard_ack/rx_top and
     tx_hard_ack/tx_top).

     The hard_ack number is the sequence number before base of the window,
     representing the last packet the other side says it has consumed.
     hard_ack starts from 0 and the first packet is sequence number 1.

     The top number is the sequence number of the highest-numbered packet
     residing in the buffer.  Packets between hard_ack+1 and top are
     soft-ACK'd to indicate they've been received, but not yet consumed.

     Four macros, before(), before_eq(), after() and after_eq() are added
     to compare sequence numbers within the window.  This allows for the
     top of the window to wrap when the hard-ack sequence number gets close
     to the limit.

     Two flags, RXRPC_CALL_RX_LAST and RXRPC_CALL_TX_LAST, are added also
     to indicate when rx_top and tx_top point at the packets with the
     LAST_PACKET bit set, indicating the end of the phase.

 (4) Calls are queued on the socket 'receive queue' rather than packets.
     This means that we don't need have to invent dummy packets to queue to
     indicate abnormal/terminal states and we don't have to keep metadata
     packets (such as ABORTs) around

 (5) The offset and length of a (sub)packet's content are now passed to
     the verify_packet security op.  This is currently expected to decrypt
     the packet in place and validate it.

     However, there's now nowhere to store the revised offset and length of
     the actual data within the decrypted blob (there may be a header and
     padding to skip) because an sk_buff may represent multiple packets, so
     a locate_data security op is added to retrieve these details from the
     sk_buff content when needed.

 (6) recvmsg() now has to handle jumbo subpackets, where each subpacket is
     individually secured and needs to be individually decrypted.  The code
     to do this is broken out into rxrpc_recvmsg_data() and shared with the
     kernel API.  It now iterates over the call's receive buffer rather
     than walking the socket receive queue.

Additional changes:

 (1) The timers are condensed to a single timer that is set for the soonest
     of three timeouts (delayed ACK generation, DATA retransmission and
     call lifespan).

 (2) Transmission of ACK and ABORT packets is effected immediately from
     process-context socket ops/kernel API calls that cause them instead of
     them being punted off to a background work item.  The data_ready
     handler still has to defer to the background, though.

 (3) A shutdown op is added to the AF_RXRPC socket so that the AFS
     filesystem can shut down the socket and flush its own work items
     before closing the socket to deal with any in-progress service calls.

Future additional changes that will need to be considered:

 (1) Make sure that a call doesn't hog the front of the queue by receiving
     data from the network as fast as userspace is consuming it to the
     exclusion of other calls.

 (2) Transmit delayed ACKs from within recvmsg() when we've consumed
     sufficiently more packets to avoid the background work item needing to
     run.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
David Howells 00e907127e rxrpc: Preallocate peers, conns and calls for incoming service requests
Make it possible for the data_ready handler called from the UDP transport
socket to completely instantiate an rxrpc_call structure and make it
immediately live by preallocating all the memory it might need.  The idea
is to cut out the background thread usage as much as possible.

[Note that the preallocated structs are not actually used in this patch -
 that will be done in a future patch.]

If insufficient resources are available in the preallocation buffers, it
will be possible to discard the DATA packet in the data_ready handler or
schedule a BUSY packet without the need to schedule an attempt at
allocation in a background thread.

To this end:

 (1) Preallocate rxrpc_peer, rxrpc_connection and rxrpc_call structs to a
     maximum number each of the listen backlog size.  The backlog size is
     limited to a maxmimum of 32.  Only this many of each can be in the
     preallocation buffer.

 (2) For userspace sockets, the preallocation is charged initially by
     listen() and will be recharged by accepting or rejecting pending
     new incoming calls.

 (3) For kernel services {,re,dis}charging of the preallocation buffers is
     handled manually.  Two notifier callbacks have to be provided before
     kernel_listen() is invoked:

     (a) An indication that a new call has been instantiated.  This can be
     	 used to trigger background recharging.

     (b) An indication that a call is being discarded.  This is used when
     	 the socket is being released.

     A function, rxrpc_kernel_charge_accept() is called by the kernel
     service to preallocate a single call.  It should be passed the user ID
     to be used for that call and a callback to associate the rxrpc call
     with the kernel service's side of the ID.

 (4) Discard the preallocation when the socket is closed.

 (5) Temporarily bump the refcount on the call allocated in
     rxrpc_incoming_call() so that rxrpc_release_call() can ditch the
     preallocation ref on service calls unconditionally.  This will no
     longer be necessary once the preallocation is used.

Note that this does not yet control the number of active service calls on a
client - that will come in a later patch.

A future development would be to provide a setsockopt() call that allows a
userspace server to manually charge the preallocation buffer.  This would
allow user call IDs to be provided in advance and the awkward manual accept
stage to be bypassed.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
David Howells 49e19ec7d3 rxrpc: Add tracepoints to record received packets and end of data_ready
Add two tracepoints:

 (1) Record the RxRPC protocol header of packets retrieved from the UDP
     socket by the data_ready handler.

 (2) Record the outcome of the data_ready handler.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
David Howells 2ab27215ea rxrpc: Remove skb_count from struct rxrpc_call
Remove the sk_buff count from the rxrpc_call struct as it's less useful
once we stop queueing sk_buffs.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:12 +01:00
David Howells de8d6c7401 rxrpc: Convert rxrpc_local::services to an hlist
Convert the rxrpc_local::services list to an hlist so that it can be
accessed under RCU conditions more readily.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:11 +01:00
David Howells cf13258fd4 rxrpc: Fix ASSERTCMP and ASSERTIFCMP to handle signed values
Fix ASSERTCMP and ASSERTIFCMP to be able to handle signed values by casting
both parameters to the type of the first before comparing.  Without this,
both values are cast to unsigned long, which means that checks for values
less than zero don't work.

The downside of this is that the state enum values in struct rxrpc_call and
struct rxrpc_connection can't be bitfields as __typeof__ can't handle them.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-08 11:10:11 +01:00
subashab@codeaurora.org 0f76d25644 net: xfrm: Change u32 sysctl entries to use proc_douintvec
proc_dointvec limits the values to INT_MAX in u32 sysctl entries.
proc_douintvec allows to write upto UINT_MAX.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-07 23:17:53 -07:00
Lorenzo Colitti f95bf34622 net: diag: make udp_diag_destroy work for mapped addresses.
udp_diag_destroy does look up the IPv4 UDP hashtable for mapped
addresses, but it gets the IPv4 address to look up from the
beginning of the IPv6 address instead of the end.

Tested: https://android-review.googlesource.com/269874
Fixes: 5d77dca828 ("net: diag: support SOCK_DESTROY for UDP sockets")
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-07 17:31:30 -07:00
Andrey Vagin 733ade23de netlink: don't forget to release a rhashtable_iter structure
This bug was detected by kmemleak:
unreferenced object 0xffff8804269cc3c0 (size 64):
  comm "criu", pid 1042, jiffies 4294907360 (age 13.713s)
  hex dump (first 32 bytes):
    a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00  .2.,............
    00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
  backtrace:
    [<ffffffff8184dffa>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff8124720f>] kmem_cache_alloc_trace+0x10f/0x280
    [<ffffffffa02864cc>] __netlink_diag_dump+0x26c/0x290 [netlink_diag]

v2: don't remove a reference on a rhashtable_iter structure to
    release it from netlink_diag_dump_done

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Fixes: ad20207432 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-07 17:29:38 -07:00
David Howells 5a42976d4f rxrpc: Add tracepoint for working out where aborts happen
Add a tracepoint for working out where local aborts happen.  Each
tracepoint call is labelled with a 3-letter code so that they can be
distinguished - and the DATA sequence number is added too where available.

rxrpc_kernel_abort_call() also takes a 3-letter code so that AFS can
indicate the circumstances when it aborts a call.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 16:34:40 +01:00
David Howells e8d6bbb05a rxrpc: Fix returns of call completion helpers
rxrpc_set_call_completion() returns bool, not int, so the ret variable
should match this.

rxrpc_call_completed() and __rxrpc_call_completed() should return the value
of rxrpc_set_call_completion().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 16:34:30 +01:00
David Howells 8d94aa381d rxrpc: Calls shouldn't hold socket refs
rxrpc calls shouldn't hold refs on the sock struct.  This was done so that
the socket wouldn't go away whilst the call was in progress, such that the
call could reach the socket's queues.

However, we can mark the socket as requiring an RCU release and rely on the
RCU read lock.

To make this work, we do:

 (1) rxrpc_release_call() removes the call's call user ID.  This is now
     only called from socket operations and not from the call processor:

	rxrpc_accept_call() / rxrpc_kernel_accept_call()
	rxrpc_reject_call() / rxrpc_kernel_reject_call()
	rxrpc_kernel_end_call()
	rxrpc_release_calls_on_socket()
	rxrpc_recvmsg()

     Though it is also called in the cleanup path of
     rxrpc_accept_incoming_call() before we assign a user ID.

 (2) Pass the socket pointer into rxrpc_release_call() rather than getting
     it from the call so that we can get rid of uninitialised calls.

 (3) Fix call processor queueing to pass a ref to the work queue and to
     release that ref at the end of the processor function (or to pass it
     back to the work queue if we have to requeue).

 (4) Skip out of the call processor function asap if the call is complete
     and don't requeue it if the call is complete.

 (5) Clean up the call immediately that the refcount reaches 0 rather than
     trying to defer it.  Actual deallocation is deferred to RCU, however.

 (6) Don't hold socket refs for allocated calls.

 (7) Use the RCU read lock when queueing a message on a socket and treat
     the call's socket pointer according to RCU rules and check it for
     NULL.

     We also need to use the RCU read lock when viewing a call through
     procfs.

 (8) Transmit the final ACK/ABORT to a client call in rxrpc_release_call()
     if this hasn't been done yet so that we can then disconnect the call.
     Once the call is disconnected, it won't have any access to the
     connection struct and the UDP socket for the call work processor to be
     able to send the ACK.  Terminal retransmission will be handled by the
     connection processor.

 (9) Release all calls immediately on the closing of a socket rather than
     trying to defer this.  Incomplete calls will be aborted.

The call refcount model is much simplified.  Refs are held on the call by:

 (1) A socket's user ID tree.

 (2) A socket's incoming call secureq and acceptq.

 (3) A kernel service that has a call in progress.

 (4) A queued call work processor.  We have to take care to put any call
     that we failed to queue.

 (5) sk_buffs on a socket's receive queue.  A future patch will get rid of
     this.

Whilst we're at it, we can do:

 (1) Get rid of the RXRPC_CALL_EV_RELEASE event.  Release is now done
     entirely from the socket routines and never from the call's processor.

 (2) Get rid of the RXRPC_CALL_DEAD state.  Calls now end in the
     RXRPC_CALL_COMPLETE state.

 (3) Get rid of the rxrpc_call::destroyer work item.  Calls are now torn
     down when their refcount reaches 0 and then handed over to RCU for
     final cleanup.

 (4) Get rid of the rxrpc_call::deadspan timer.  Calls are cleaned up
     immediately they're finished with and don't hang around.
     Post-completion retransmission is handled by the connection processor
     once the call is disconnected.

 (5) Get rid of the dead call expiry setting as there's no longer a timer
     to set.

 (6) rxrpc_destroy_all_calls() can just check that the call list is empty.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:33:20 +01:00
David Howells 6543ac5235 rxrpc: Use rxrpc_is_service_call() rather than rxrpc_conn_is_service()
Use rxrpc_is_service_call() rather than rxrpc_conn_is_service() if the call
is available just in case call->conn is NULL.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:30:22 +01:00
David Howells 8b7fac50ab rxrpc: Pass the connection pointer to rxrpc_post_packet_to_call()
Pass the connection pointer to rxrpc_post_packet_to_call() as the call
might get disconnected whilst we're looking at it, but the connection
pointer determined by rxrpc_data_read() is guaranteed by RCU for the
duration of the call.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:30:22 +01:00
David Howells 278ac0cdd5 rxrpc: Cache the security index in the rxrpc_call struct
Cache the security index in the rxrpc_call struct so that we can get at it
even when the call has been disconnected and the connection pointer
cleared.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:30:22 +01:00
David Howells f4fdb3525b rxrpc: Use call->peer rather than call->conn->params.peer
Use call->peer rather than call->conn->params.peer to avoid the possibility
of call->conn being NULL and, whilst we're at it, check it for NULL before we
access it.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:30:22 +01:00
David Howells fff72429c2 rxrpc: Improve the call tracking tracepoint
Improve the call tracking tracepoint by showing more differentiation
between some of the put and get events, including:

  (1) Getting and putting refs for the socket call user ID tree.

  (2) Getting and putting refs for queueing and failing to queue the call
      processor work item.

Note that these aren't necessarily used in this patch, but will be taken
advantage of in future patches.

An enum is added for the event subtype numbers rather than coding them
directly as decimal numbers and a table of 3-letter strings is provided
rather than a sequence of ?: operators.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 15:30:22 +01:00
David Howells e796cb4192 rxrpc: Delete unused rxrpc_kernel_free_skb()
Delete rxrpc_kernel_free_skb() as it's unused.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 14:43:43 +01:00
David Howells 71a17de307 rxrpc: Whitespace cleanup
Remove some whitespace.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-07 14:43:39 +01:00
Marco Angaroni 1bcabc81ee netfilter: nf_ct_sip: allow tab character in SIP headers
Current parsing methods for SIP headers do not allow the presence of
tab characters between header name and header value. As a result Call-ID
SIP headers like the following are discarded by IPVS SIP persistence
engine:

"Call-ID\t: mycallid@abcde"
"Call-ID:\tmycallid@abcde"

In above examples Call-IDs are represented as strings in C language.
Obviously in real message we have byte "09" before/after colon (":").

Proposed fix is in nf_conntrack_sip module.
Function sip_skip_whitespace() should skip tabs in addition to spaces,
since in SIP grammar whitespace (WSP) corresponds to space or tab.

Below is an extract of relevant SIP ABNF syntax.

Call-ID  =  ( "Call-ID" / "i" ) HCOLON callid
callid   =  word [ "@" word ]

HCOLON  =  *( SP / HTAB ) ":" SWS
SWS     =  [LWS] ; sep whitespace
LWS     =  [*WSP CRLF] 1*WSP ; linear whitespace
WSP     =  SP / HTAB
word    =  1*(alphanum / "-" / "." / "!" / "%" / "*" /
           "_" / "+" / "`" / "'" / "~" /
           "(" / ")" / "<" / ">" /
           ":" / "\" / DQUOTE /
           "/" / "[" / "]" / "?" /
           "{" / "}" )

Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 13:53:43 +02:00
Pablo Neira Ayuso 22609b43b1 netfilter: nft_quota: introduce nft_overquota()
This is patch renames the existing function to nft_overquota() and make
it return a boolean that tells us if we have exceeded our byte quota.
Just a cleanup.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 11:02:06 +02:00
Pablo Neira Ayuso db6d857b81 netfilter: nft_quota: fix overquota logic
Use xor to decide to break further rule evaluation or not, since the
existing logic doesn't achieve the expected inversion.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 11:00:56 +02:00
Laura Garcia Liebana 0d9932b287 netfilter: nft_numgen: rename until attribute by modulus
The _until_ attribute is renamed to _modulus_ as the behaviour is similar to
other expresions with number limits (ex. nft_hash).

Renaming is possible because there isn't a kernel release yet with these
changes.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:55:46 +02:00
Gao Feng ddb075b0cd netfilter: ftp: Remove the useless code
There are some debug code which are commented out in find_pattern by #if 0.
Now remove them.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:38:00 +02:00
Gao Feng 723eb299de netfilter: ftp: Remove the useless dlen==0 condition check in find_pattern
The caller function "help" has already make sure the datalen could not be zero
before invoke find_pattern as a parameter by the following codes

        if (dataoff >= skb->len) {
                pr_debug("ftp: dataoff(%u) >= skblen(%u)\n", dataoff,
                         skb->len);
                return NF_ACCEPT;
        }
        datalen = skb->len - dataoff;

And the latter codes "ends_in_nl = (fb_ptr[datalen - 1] == '\n');" use datalen
directly without checking if it is zero.

So it is unneccessary to check it in find_pattern too.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:37:59 +02:00
Marco Angaroni f0608ceaa7 netfilter: nf_ct_sip: correct allowed characters in Call-ID SIP header
Current parsing methods for SIP header Call-ID do not check correctly all
characters allowed by RFC 3261. In particular "," character is allowed
instead of "'" character. As a result Call-ID headers like the following
are discarded by IPVS SIP persistence engine.

Call-ID: -.!%*_+`'~()<>:\"/[]?{}

Above example is composed using all non-alphanumeric characters listed
in RFC 3261 for Call-ID header syntax.

Proposed fix is in nf_conntrack_sip module; function iswordc() checks this
range: (c >= '(' && c <= '/') which includes these characters: ()*+,-./
They are all allowed except ",". Instead "'" is not included in the list.

Below is an extract of relevant SIP ABNF syntax.

Call-ID  =  ( "Call-ID" / "i" ) HCOLON callid
callid   =  word [ "@" word ]

HCOLON  =  *( SP / HTAB ) ":" SWS
SWS     =  [LWS] ; sep whitespace
LWS     =  [*WSP CRLF] 1*WSP ; linear whitespace
WSP     =  SP / HTAB
word    =  1*(alphanum / "-" / "." / "!" / "%" / "*" /
           "_" / "+" / "`" / "'" / "~" /
           "(" / ")" / "<" / ">" /
           ":" / "\" / DQUOTE /
           "/" / "[" / "]" / "?" /
           "{" / "}" )

Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:37:58 +02:00
Marco Angaroni 68cb9fe47e netfilter: nf_ct_sip: correct parsing of continuation lines in SIP headers
Current parsing methods for SIP headers do not properly manage
continuation lines: in case of Call-ID header the first character of
Call-ID header value is truncated. As a result IPVS SIP persistence
engine hashes over a call-id that is not exactly the one present in
the originale message.

Example: "Call-ID: \r\n abcdeABCDE1234"
results in extracted call-id equal to "bcdeABCDE1234".

In above example Call-ID is represented as a string in C language.
Obviously in real message the first bytes after colon (":") are
"20 0d 0a 20".

Proposed fix is in nf_conntrack_sip module.
Since sip_follow_continuation() function walks past the leading
spaces or tabs of the continuation line, sip_skip_whitespace()
should simply return the ouput of sip_follow_continuation().
Otherwise another iteration of the for loop is done and dptr
is incremented by one pointing to the second character of the
first word in the header.

Below is an extract of relevant SIP ABNF syntax.

Call-ID  =  ( "Call-ID" / "i" ) HCOLON callid
callid   =  word [ "@" word ]

HCOLON  =  *( SP / HTAB ) ":" SWS
SWS     =  [LWS] ; sep whitespace
LWS     =  [*WSP CRLF] 1*WSP ; linear whitespace
WSP     =  SP / HTAB
word    =  1*(alphanum / "-" / "." / "!" / "%" / "*" /
           "_" / "+" / "`" / "'" / "~" /
           "(" / ")" / "<" / ">" /
           ":" / "\" / DQUOTE /
           "/" / "[" / "]" / "?" /
           "{" / "}" )

Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:37:57 +02:00
Gao Feng c579a9e7d5 netfilter: gre: Use consistent GRE and PTTP header structure instead of the ones defined by netfilter
There are two existing strutures which defines the GRE and PPTP header.
So use these two structures instead of the ones defined by netfilter to
keep consitent with other codes.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:36:52 +02:00
Gao Feng ecc6569f35 netfilter: gre: Use consistent GRE_* macros instead of ones defined by netfilter.
There are already some GRE_* macros in kernel, so it is unnecessary
to define these macros. And remove some useless macros

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-07 10:36:48 +02:00
Wei Yongjun 751eb6b604 ipv6: addrconf: fix dev refcont leak when DAD failed
In general, when DAD detected IPv6 duplicate address, ifp->state
will be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a
delayed work, the call tree should be like this:

ndisc_recv_ns
  -> addrconf_dad_failure        <- missing ifp put
     -> addrconf_mod_dad_work
       -> schedule addrconf_dad_work()
         -> addrconf_dad_stop()  <- missing ifp hold before call it

addrconf_dad_failure() called with ifp refcont holding but not put.
addrconf_dad_work() call addrconf_dad_stop() without extra holding
refcount. This will not cause any issue normally.

But the race between addrconf_dad_failure() and addrconf_dad_work()
may cause ifp refcount leak and netdevice can not be unregister,
dmesg show the following messages:

IPv6: eth0: IPv6 duplicate address fe80::XX:XXXX:XXXX:XX detected!
...
unregister_netdevice: waiting for eth0 to become free. Usage count = 1

Cc: stable@vger.kernel.org
Fixes: c15b1ccadb ("ipv6: move DAD and addrconf_verify processing
to workqueue")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 14:17:49 -07:00
Mark Tomlinson 5a56a0b3a4 net: Don't delete routes in different VRFs
When deleting an IP address from an interface, there is a clean-up of
routes which refer to this local address. However, there was no check to
see that the VRF matched. This meant that deletion wasn't confined to
the VRF it should have been.

To solve this, a new field has been added to fib_info to hold a table
id. When removing fib entries corresponding to a local ip address, this
table id is also used in the comparison.

The table id is populated when the fib_info is created. This was already
done in some places, but not in ip_rt_ioctl(). This has now been fixed.

Fixes: 021dd3b8a1 ("net: Add routes to the table associated with the device")
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 13:56:13 -07:00
David S. Miller c7ee5672f3 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAV8yHP/Sw1s6N8H32AQL7cQ//S3MP4x690tV49UocezdAV8PuV3HpM5kt
 a+43v8RqjTHaZQxOwQaG71OpAM2gH1z7QPSq6jsLy0PEmaPiP/MZOAWqUudBAyqo
 eVsEMEVlw0f7PogyKqr06uPoVo21IdLNX9E89CqaGqFuDfsKlj1bKQHH4/248Arr
 i4ztzid/Mj98fHvzqMdr631c06GvLozU/5X6xE7hzkGkqVmRtjIB6qETGqerwx/p
 GJlcTZVFw2EviS6/Ft/t26xgVsOg1ogzXjWLUufnnJ1GpRaucqMfHwYD0WqQV3sB
 Bu6WRx2JgXRPBi5m7gWymkgT0pUNRiDFuWN6qdlJbHJgKGuVojF6tnh2go2bj9Cq
 q/GLbi8Y810v64293i1vdz6yyM1PzDG648+6z8vbTpsLI7cHDq5csPYMHRIh34IM
 FQmSZblKIuALD8BXqW1lXrqVKU0WEFVI9WjcRk9OSIanqPrQyP8xOAJVp03uIGe/
 uxkheJBy7XvghwKWZNZ1y0A0g6NxY0gNMhVsmW77VbNfOsKOpEcpSCvMwjgMOvDi
 npynisM8tf2cx973SZYXjvvl2LSMHjiVQ/tf0iTMYmBjtU7ft5i8p3w290yAD+JQ
 JKqpKq3TXVty5MMEXl5bStIr/HkgPk7e6v1sH9WIREu9m9gROANGe/9Mr4VvweHk
 jEuFHy2EZNE=
 =3Don
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160904-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Split output code from sendmsg code

Here's a set of small patches that split the packet transmission code from
the sendmsg code and simply rearrange the new file to make it more
logically laid out ready for being rewritten.  An enum is also moved out of
the header file to there as it's only used there.  This needs to be applied
on top of the just-posted fixes patch set.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 13:53:29 -07:00
David S. Miller 0122c6d5fa RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAV8yHKPSw1s6N8H32AQKbWA/+MEjzVzIOpp4edvAj5DzdO5pls3NvhUAg
 2sP15E9nOL9JHrOD5snCVSRd9ROEGRE0/S9WXUjeb2VKz7C2pmTixo+MjiJRMzAR
 TZZlE2Ydx0h+A7WiywKoTY4g7VICL+UC+4XcheMzTLNS2mzqKb2GGOm3BwvGaFeT
 RKRVIlwIHziShaIs7K7ZkfKxGaDIL/9x344uPfFHaDKb33aOBnTuY6HlFf5Yu2qm
 pzh+R7cBYJMvhEqd71ESYPSbSGnjBN5zRzjZDSvXeI/30k9Ee6mFastv4fdG3Mrk
 0WOLxml9yLTlJnfXeuN0T9B0C/ur4oD4hKEDREXVTxcTEXRq/VxSJ3cYxeM3DlAz
 U795lcveiYajRv7F73jcfNuaEENQg5HsZuaFs+CJgVxQpsqs9IOpEl9CNrVsvjmf
 9crgamUj34ehZ5lgsV/Qbm7OFk16dmQ59ImGClQFgsIU6hEWCP0VgwKXWTwP7YZD
 Ucp1zWp/XTAtbrRNdkye/Z0WE5QOtoUWzftPrf95TP7LFewp0DGZ+miBV/atSHbZ
 bADXR0SOtax8u9bfs1HYTadwHk1LJYuVXNYn/KN06c9q1WKuKH/qg0JAjv/KUnJm
 Nnx0TFh4kUg999+3CtxqmupiqkD5SCckCN2ggifgDCmQhEwr1CYrGKWpZvNMq+S6
 nVNTUr7PYSI=
 =RRYk
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160904-1' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Small fixes

Here's a set of small fix patches:

 (1) Fix some uninitialised variables.

 (2) Set the client call state before making it live by attaching it to the
     conn struct.

 (3) Randomise the epoch and starting client conn ID values, and don't
     change the epoch when the client conn ID rolls round.

 (4) Replace deprecated create_singlethread_workqueue() calls.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 13:46:26 -07:00
Chuck Lever 05c974669e xprtrdma: Fix receive buffer accounting
An RPC can terminate before its reply arrives, if a credential
problem or a soft timeout occurs. After this happens, xprtrdma
reports it is out of Receive buffers.

A Receive buffer is posted before each RPC is sent, and returned to
the buffer pool when a reply is received. If no reply is received
for an RPC, that Receive buffer remains posted. But xprtrdma tries
to post another when the next RPC is sent.

If this happens a few dozen times, there are no receive buffers left
to be posted at send time. I don't see a way for a transport
connection to recover at that point, and it will spit warnings and
unnecessarily delay RPCs on occasion for its remaining lifetime.

Commit 1e465fd4ff ("xprtrdma: Replace send and receive arrays")
removed a little bit of logic to detect this case and not provide
a Receive buffer so no more buffers are posted, and then transport
operation continues correctly. We didn't understand what that logic
did, and it wasn't commented, so it was removed as part of the
overhaul to support backchannel requests.

Restore it, but be wary of the need to keep extra Receives posted
to deal with backchannel requests.

Fixes: 1e465fd4ff ("xprtrdma: Replace send and receive arrays")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-06 15:59:35 -04:00
Chuck Lever 78d506e1b7 xprtrdma: Revert 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion...")
Receive buffer exhaustion, if it were to actually occur, would be
catastrophic. However, when there are no reply buffers to post, that
means all of them have already been posted and are waiting for
incoming replies. By design, there can never be more RPCs in flight
than there are available receive buffers.

A receive buffer can be left posted after an RPC exits without a
received reply; say, due to a credential problem or a soft timeout.
This does not result in fewer posted receive buffers than there are
pending RPCs, and there is already logic in xprtrdma to deal
appropriately with this case.

It also looks like the "+ 2" that was removed was accidentally
accommodating the number of extra receive buffers needed for
receiving backchannel requests. That will need to be addressed by
another patch.

Fixes: 3d4cf35bd4 ("xprtrdma: Reply buffer exhaustion can be...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-06 15:59:35 -04:00
Dave Jones 03c2778a93 ipv6: release dst in ping_v6_sendmsg
Neither the failure or success paths of ping_v6_sendmsg release
the dst it acquires.  This leads to a flood of warnings from
"net/core/dst.c:288 dst_release" on older kernels that
don't have 8bf4ada2e2 backported.

That patch optimistically hoped this had been fixed post 3.10, but
it seems at least one case wasn't, where I've seen this triggered
a lot from machines doing unprivileged icmp sockets.

Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 12:54:17 -07:00
David S. Miller 60175ccdf4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree.  Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.

More specifically, they are:

1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
   transport area in dccp, sctp, tcp, udp and udplite protocol
   conntrackers, from Gao Feng.

2) Missing whitespace on error message in physdev match, from Hangbin Liu.

3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.

4) Add nf_ct_expires() helper function and use it, from Florian Westphal.

5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
   from Florian.

6) Rename nf_tables set implementation to nft_set_{name}.c

7) Introduce the hash expression to allow arbitrary hashing of selector
   concatenations, from Laura Garcia Liebana.

8) Remove ip_conntrack sysctl backward compatibility code, this code has
   been around for long time already, and we have two interfaces to do
   this already: nf_conntrack sysctl and ctnetlink.

9) Use nf_conntrack_get_ht() helper function whenever possible, instead
   of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.

10) Add quota expression for nf_tables.

11) Add number generator expression for nf_tables, this supports
    incremental and random generators that can be combined with maps,
    very useful for load balancing purpose, again from Laura Garcia Liebana.

12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.

13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
    configuration, this is used by a follow up patch to perform better chain
    update validation.

14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
    nft_set_hash implementation to honor the NLM_F_EXCL flag.

15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
    patch from Florian Westphal.

16) Don't use the DYING bit to know if the conntrack event has been already
    delivered, instead a state variable to track event re-delivery
    states, also from Florian.

17) Remove the per-conntrack timer, use the workqueue approach that was
    discussed during the NFWS, from Florian Westphal.

18) Use the netlink conntrack table dump path to kill stale entries,
    again from Florian.

19) Add a garbage collector to get rid of stale conntracks, from
    Florian.

20) Reschedule garbage collector if eviction rate is high.

21) Get rid of the __nf_ct_kill_acct() helper.

22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.

23) Make nf_log_set() interface assertive on unsupported families.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 12:45:26 -07:00
Liping Zhang d1a6cba576 netfilter: nft_chain_route: re-route before skb is queued to userspace
Imagine such situation, user add the following nft rules, and queue
the packets to userspace for further check:
  # ip rule add fwmark 0x0/0x1 lookup eth0
  # ip rule add fwmark 0x1/0x1 lookup eth1
  # nft add table filter
  # nft add chain filter output {type route hook output priority 0 \;}
  # nft add rule filter output mark set 0x1
  # nft add rule filter output queue num 0

But after we reinject the skbuff, the packet will be sent via the
wrong route, i.e. in this case, the packet will be routed via eth0
table, not eth1 table. Because we skip to do re-route when verdict
is NF_QUEUE, even if the mark was changed.

Acctually, we should not touch sk_buff if verdict is NF_DROP or
NF_STOLEN, and when re-route fails, return NF_DROP with error code.
This is consistent with the mangle table in iptables.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-06 18:02:37 +02:00
Liping Zhang 5210d393ef netfilter: nf_tables_trace: fix endiness when dump chain policy
NFTA_TRACE_POLICY attribute is big endian, but we forget to call
htonl to convert it. Fortunately, this attribute is parsed as big
endian in libnftnl.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-05 19:28:23 +02:00
David Howells 3dc20f090d rxrpc Move enum rxrpc_command to sendmsg.c
Move enum rxrpc_command to sendmsg.c as it's now only used in that file.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
David Howells df423a4af1 rxrpc: Rearrange net/rxrpc/sendmsg.c
Rearrange net/rxrpc/sendmsg.c to be in a more logical order.  This makes it
easier to follow and eliminates forward declarations.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
David Howells 0b58b8a18b rxrpc: Split sendmsg from packet transmission code
Split the sendmsg code from the packet transmission code (mostly to be
found in output.c).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
David Howells 090f85deb6 rxrpc: Don't change the epoch
It seems the local epoch should only be changed on boot, so remove the code
that changes it for client connections.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
David Howells 5f2d9c4438 rxrpc: Randomise epoch and starting client conn ID values
Create a random epoch value rather than a time-based one on startup and set
the top bit to indicate that this is the case.

Also create a random starting client connection ID value.  This will be
incremented from here as new client connections are created.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 21:41:39 +01:00
Linus Torvalds 6e1ce3c345 af_unix: split 'u->readlock' into two: 'iolock' and 'bindlock'
Right now we use the 'readlock' both for protecting some of the af_unix
IO path and for making the bind be single-threaded.

The two are independent, but using the same lock makes for a nasty
deadlock due to ordering with regards to filesystem locking.  The bind
locking would want to nest outside the VSF pathname locking, but the IO
locking wants to nest inside some of those same locks.

We tried to fix this earlier with commit c845acb324 ("af_unix: Fix
splice-bind deadlock") which moved the readlock inside the vfs locks,
but that caused problems with overlayfs that will then call back into
filesystem routines that take the lock in the wrong order anyway.

Splitting the locks means that we can go back to having the bind lock be
the outermost lock, and we don't have any deadlocks with lock ordering.

Acked-by: Rainer Weikusat <rweikusat@cyberadapt.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 13:29:29 -07:00
Linus Torvalds 38f7bd94a9 Revert "af_unix: Fix splice-bind deadlock"
This reverts commit c845acb324.

It turns out that it just replaces one deadlock with another one: we can
still get the wrong lock ordering with the readlock due to overlayfs
calling back into the filesystem layer and still taking the vfs locks
after the readlock.

The proper solution ends up being to just split the readlock into two
pieces: the bind lock (taken *outside* the vfs locks) and the IO lock
(taken *inside* the filesystem locks).  The two locks are independent
anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 13:29:29 -07:00
Mahesh Bandewar 24b27fc4cd bonding: Fix bonding crash
Following few steps will crash kernel -

  (a) Create bonding master
      > modprobe bonding miimon=50
  (b) Create macvlan bridge on eth2
      > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \
	   type macvlan
  (c) Now try adding eth2 into the bond
      > echo +eth2 > /sys/class/net/bond0/bonding/slaves
      <crash>

Bonding does lots of things before checking if the device enslaved is
busy or not.

In this case when the notifier call-chain sends notifications, the
bond_netdev_event() assumes that the rx_handler /rx_handler_data is
registered while the bond_enslave() hasn't progressed far enough to
register rx_handler for the new slave.

This patch adds a rx_handler check that can be performed right at the
beginning of the enslave code to avoid getting into this situation.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:41:12 -07:00
WANG Cong bc51dddf98 netns: avoid disabling irq for netns id
We never read or change netns id in hardirq context,
the only place we read netns id in softirq context
is in vxlan_xmit(). So, it should be enough to just
disable BH.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:39:59 -07:00
WANG Cong 38f507f1ba vxlan: call peernet2id() in fdb notification
netns id should be already allocated each time we change
netns, that is, in dev_change_net_namespace() (more precisely
in rtnl_fill_ifinfo()). It is safe to just call peernet2id() here.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:39:58 -07:00
Joe Stringer 76644232e6 openvswitch: Free tmpl with tmpl_free.
When an error occurs during conntrack template creation as part of
actions validation, we need to free the template. Previously we've been
using nf_ct_put() to do this, but nf_ct_tmpl_free() is more appropriate.

Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-04 11:38:10 -07:00
David Howells af338a9ea6 rxrpc: The client call state must be changed before attachment to conn
We must set the client call state to RXRPC_CALL_CLIENT_SEND_REQUEST before
attaching the call to the connection struct, not after, as it's liable to
receive errors and conn aborts as soon as the assignment is made - and
these will cause its state to be changed outside of the initiating thread's
control.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-04 13:10:10 +01:00
Paolo Abeni a41bd25ae6 sunrpc: fix UDP memory accounting
The commit f9b2ee714c ("SUNRPC: Move UDP receive data path
into a workqueue context"), as a side effect, moved the
skb_free_datagram() call outside the scope of the related socket
lock, but UDP sockets require such lock to be held for proper
memory accounting.
Fix it by replacing skb_free_datagram() with
skb_free_datagram_locked().

Fixes: f9b2ee714c ("SUNRPC: Move UDP receive data path into a workqueue context")
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-09-03 10:00:49 -04:00
Jon Paul Maloy e0a05ebe26 tipc: send broadcast nack directly upon sequence gap detection
Because of the risk of an excessive number of NACK messages and
retransissions, receivers have until now abstained from sending
broadcast NACKS directly upon detection of a packet sequence number
gap. We have instead relied on such gaps being detected by link
protocol STATE message exchange, something that by necessity delays
such detection and subsequent retransmissions.

With the introduction of unicast NACK transmission and rate control
of retransmissions we can now remove this limitation. We now allow
receiving nodes to send NACKS immediately, while coordinating the
permission to do so among the nodes in order to avoid NACK storms.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:25 -07:00
Jon Paul Maloy 7c4a54b963 tipc: rate limit broadcast retransmissions
As cluster sizes grow, so does the amount of identical or overlapping
broadcast NACKs generated by the packet receivers. This often leads to
'NACK crunches' resulting in huge numbers of redundant retransmissions
of the same packet ranges.

In this commit, we introduce rate control of broadcast retransmissions,
so that a retransmitted range cannot be retransmitted again until after
at least 10 ms. This reduces the frequency of duplicate, redundant
retransmissions by an order of magnitude, while having a significant
positive impact on overall throughput and scalability.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:24 -07:00
Jon Paul Maloy 02d11ca200 tipc: transfer broadcast nacks in link state messages
When we send broadcasts in clusters of more 70-80 nodes, we sometimes
see the broadcast link resetting because of an excessive number of
retransmissions. This is caused by a combination of two factors:

1) A 'NACK crunch", where loss of broadcast packets is discovered
   and NACK'ed by several nodes simultaneously, leading to multiple
   redundant broadcast retransmissions.

2) The fact that the NACKS as such also are sent as broadcast, leading
   to excessive load and packet loss on the transmitting switch/bridge.

This commit deals with the latter problem, by moving sending of
broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
bits in word 8 of the said message for this purpose, and introduce a
new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
backwards compatible.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:24 -07:00
David Howells 00b5407e42 rxrpc: Fix uninitialised variable warning
Fix the following uninitialised variable warning:

../net/rxrpc/call_event.c: In function 'rxrpc_process_call':
../net/rxrpc/call_event.c:879:58: warning: 'error' may be used uninitialized in this function [-Wmaybe-uninitialized]
    _debug("post net error %d", error);
                                                          ^

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-02 22:39:44 +01:00
Arnd Bergmann 30787a4170 rxrpc: fix undefined behavior in rxrpc_mark_call_released
gcc -Wmaybe-initialized correctly points out a newly introduced bug
through which we can end up calling rxrpc_queue_call() for a dead
connection:

net/rxrpc/call_object.c: In function 'rxrpc_mark_call_released':
net/rxrpc/call_object.c:600:5: error: 'sched' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This sets the 'sched' variable to zero to restore the previous
behavior.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: f5c17aaeb2 ("rxrpc: Calls should only have one terminal state")
Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-02 22:39:44 +01:00
Sabrina Dubroca 2f86953e74 l2tp: fix use-after-free during module unload
Tunnel deletion is delayed by both a workqueue (l2tp_tunnel_delete -> wq
 -> l2tp_tunnel_del_work) and RCU (sk_destruct -> RCU ->
l2tp_tunnel_destruct).

By the time l2tp_tunnel_destruct() runs to destroy the tunnel and finish
destroying the socket, the private data reserved via the net_generic
mechanism has already been freed, but l2tp_tunnel_destruct() actually
uses this data.

Make sure tunnel deletion for the netns has completed before returning
from l2tp_exit_net() by first flushing the tunnel removal workqueue, and
then waiting for RCU callbacks to complete.

Fixes: 167eb17e0b ("l2tp: create tunnel sockets in the right namespace")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 11:44:44 -07:00
Eli Cooper ab34380162 ipv6: Don't unset flowi6_proto in ipxip6_tnl_xmit()
Commit 8eb30be035 ("ipv6: Create ip6_tnl_xmit") unsets
flowi6_proto in ip4ip6_tnl_xmit() and ip6ip6_tnl_xmit().
Since xfrm_selector_match() relies on this info, IPv6 packets
sent by an ip6tunnel cannot be properly selected by their
protocols after removing it. This patch puts flowi6_proto back.

Cc: stable@vger.kernel.org
Fixes: 8eb30be035 ("ipv6: Create ip6_tnl_xmit")
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 23:41:24 -07:00
Nikolay Aleksandrov b6cb5ac833 net: bridge: add per-port multicast flood flag
Add a per-port flag to control the unknown multicast flood, similar to the
unknown unicast flood flag and break a few long lines in the netlink flag
exports.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 22:48:33 -07:00
Nikolay Aleksandrov 8addd5e7d3 net: bridge: change unicast boolean to exact pkt_type
Remove the unicast flag and introduce an exact pkt_type. That would help us
for the upcoming per-port multicast flood flag and also slightly reduce the
tests in the input fast path.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 22:48:33 -07:00
Gao Feng 635c223cfa rps: flow_dissector: Fix uninitialized flow_keys used in __skb_get_hash possibly
The original codes depend on that the function parameters are evaluated from
left to right. But the parameter's evaluation order is not defined in C
standard actually.

When flow_keys_have_l4(&keys) is invoked before ___skb_get_hash(skb, &keys,
hashrnd) with some compilers or environment, the keys passed to
flow_keys_have_l4 is not initialized.

Fixes: 6db61d79c1 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash")

Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 22:45:03 -07:00
Roopa Prabhu d297653dd6 rtnetlink: fdb dump: optimize by saving last interface markers
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.

In the process of fixing the performance issue, this patch also
re-implements fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.

Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065

real    1m11.791s
user    0m0.070s
sys 1m8.395s

(existing code does not return all macs)

after patch:
$time bridge fdb show | wc -l
35085

real    0m2.017s
user    0m0.113s
sys 0m1.942s

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:56:15 -07:00
David Howells d001648ec7 rxrpc: Don't expose skbs to in-kernel users [ver #2]
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook the indicates that a call needs
attention and another that indicates that there's a new call to be
collected.

This makes the following possibilities more achievable:

 (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

 (2) skbs referring to non-data events will be able to be freed much sooner
     rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
     will be able to consult the call state.

 (3) We can shortcut the receive phase when a call is remotely aborted
     because we don't have to go through all the packets to get to the one
     cancelling the operation.

 (4) It makes it easier to do encryption/decryption directly between AFS's
     buffers and sk_buffs.

 (5) Encryption/decryption can more easily be done in the AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.

 (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

To make this work, the following interface function has been added:

     int rxrpc_kernel_recv_data(
		struct socket *sock, struct rxrpc_call *call,
		void *buffer, size_t bufsize, size_t *_offset,
		bool want_more, u32 *_abort_code);

This is the recvmsg equivalent.  It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.

afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them.  They don't wait synchronously yet because the socket
lock needs to be dealt with.

Five interface functions have been removed:

	rxrpc_kernel_is_data_last()
    	rxrpc_kernel_get_abort_code()
    	rxrpc_kernel_get_error_number()
    	rxrpc_kernel_free_skb()
    	rxrpc_kernel_data_consumed()

As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user.  To process the queue internally, a temporary function,
temp_deliver_data() has been added.  This will be replaced with common code
between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:43:27 -07:00
Neal Cardwell 28b346cbc0 tcp: fastopen: fix rcv_wup initialization for TFO server on SYN/data
Yuchung noticed that on the first TFO server data packet sent after
the (TFO) handshake, the server echoed the TCP timestamp value in the
SYN/data instead of the timestamp value in the final ACK of the
handshake. This problem did not happen on regular opens.

The tcp_replace_ts_recent() logic that decides whether to remember an
incoming TS value needs tp->rcv_wup to hold the latest receive
sequence number that we have ACKed (latest tp->rcv_nxt we have
ACKed). This commit fixes this issue by ensuring that a TFO server
properly updates tp->rcv_wup to match tp->rcv_nxt at the time it sends
a SYN/ACK for the SYN/data.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:40:15 -07:00
Nikolay Aleksandrov 85a3d4a935 net: bridge: don't increment tx_dropped in br_do_proxy_arp
pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the
skb isn't changed/dropped and processing continues so we shouldn't
increment tx_dropped.

CC: Kyeyoon Park <kyeyoonp@codeaurora.org>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:35:30 -07:00
Nicolas Dichtel 29c994e361 netconf: add a notif when settings are created
All changes are notified, but the initial state was missing.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 15:18:08 -07:00
Nicolas Dichtel d26c638c16 ipv6: add missing netconf notif when 'all' is updated
The 'default' value was not advertised.

Fixes: f3a1bfb11c ("rtnl/ipv6: use netconf msg to advertise forwarding status")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 15:18:08 -07:00
stephen hemminger f5bb341e1d l2tp: make nla_policy const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:01 -07:00
stephen hemminger 4f70c96ffd tcp: make nla_policy const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:01 -07:00
stephen hemminger 6501f34ff7 ila: make nla_policy const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:01 -07:00
stephen hemminger 3f18ff2b42 fou: make nla_policy const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:00 -07:00
stephen hemminger 3ee5256da0 netns: make nla_policy const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:00 -07:00
stephen hemminger deeb91f59d batman: make netlink attributes const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:00 -07:00
stephen hemminger 85bae4bd8a drop_monitor: make genl_multicast_group const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:00 -07:00
stephen hemminger 12d8de6d95 net: make genetlink ctrl ops const
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 14:09:00 -07:00
stephen hemminger ce927bf174 mpls: get rid of trivial returns
return at end of function is useless.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 10:13:15 -07:00
Parthasarathy Bhuvaragan d2f394dc48 tipc: fix random link resets while adding a second bearer
In a dual bearer configuration, if the second tipc link becomes
active while the first link still has pending nametable "bulk"
updates, it randomly leads to reset of the second link.

When a link is established, the function named_distribute(),
fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
with NAME_DISTRIBUTOR message for each PUBLICATION.
However, the function named_distribute() allocates the buffer by
increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
This consumes the space allocated for TUNNEL_PROTOCOL.

When establishing the second link, the link shall tunnel all the
messages in the first link queue including the "bulk" update.
As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
the link mtu the transmission fails (-EMSGSIZE).

Thus, the synch point based on the message count of the tunnel
packets is never reached leading to link timeout.

In this commit, we adjust the size of name distributor message so that
they can be tunnelled.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 10:12:26 -07:00
WANG Cong c0338aff22 kcm: fix a socket double free
Dmitry reported a double free on kcm socket, which could
be easily reproduced by:

	#include <unistd.h>
	#include <sys/syscall.h>

	int main()
	{
	  int fd = syscall(SYS_socket, 0x29ul, 0x5ul, 0x0ul, 0, 0, 0);
	  syscall(SYS_ioctl, fd, 0x89e2ul, 0x20a98000ul, 0, 0, 0);
	  return 0;
	}

This is because on the error path, after we install
the new socket file, we call sock_release() to clean
up the socket, which leaves the fd pointing to a freed
socket. Fix this by calling sys_close() on that fd
directly.

Fixes: ab7ac4eb98 ("kcm: Kernel Connection Multiplexor module")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-31 21:00:19 -07:00
Vivien Didelot 8df3025520 net: dsa: add MDB support
Add SWITCHDEV_OBJ_ID_PORT_MDB support to the DSA layer.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-31 14:15:42 -07:00
Davide Caratti 9264251ee2 bridge: re-introduce 'fix parsing of MLDv2 reports'
commit bc8c20acae ("bridge: multicast: treat igmpv3 report with
INCLUDE and no sources as a leave") seems to have accidentally reverted
commit 47cc84ce0c ("bridge: fix parsing of MLDv2 reports"). This
commit brings back a change to br_ip6_multicast_mld2_report() where
parsing of MLDv2 reports stops when the first group is successfully
added to the MDB cache.

Fixes: bc8c20acae ("bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-31 09:29:58 -07:00
David Ahern 48d2ab609b net: mpls: Fixups for GSO
As reported by Lennert the MPLS GSO code is failing to properly segment
large packets. There are a couple of problems:

1. the inner protocol is not set so the gso segment functions for inner
   protocol layers are not getting run, and

2  MPLS labels for packets that use the "native" (non-OVS) MPLS code
   are not properly accounted for in mpls_gso_segment.

The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
to call the gso segment functions for the higher layer protocols. That
means skb_mac_gso_segment is called twice -- once with the network
protocol set to MPLS and again with the network protocol set to the
inner protocol.

This patch sets the inner skb protocol addressing item 1 above and sets
the network_header and inner_network_header to mark where the MPLS labels
start and end. The MPLS code in OVS is also updated to set the two
network markers.

>From there the MPLS GSO code uses the difference between the network
header and the inner network header to know the size of the MPLS header
that was pushed. It then pulls the MPLS header, resets the mac_len and
protocol for the inner protocol and then calls skb_mac_gso_segment
to segment the skb.

Afterward the inner protocol segmentation is done the skb protocol
is set to mpls for each segment and the network and mac headers
restored.

Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:27:18 -07:00
Roopa Prabhu 14972cbd34 net: lwtunnel: Handle fragmentation
Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is ok but can be
avoided if we did not do the mpls output redirect too early.
ie we could wait until ip fragmentation is done and then call
mpls output for each ip fragment.

To make this work we will need,
1) the lwtunnel state to carry encap headroom
2) and do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)

This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.

This includes IPV6 and some mtu fixes and testing from David Ahern.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:27:18 -07:00
Eric Dumazet 41852497a9 net: batch calls to flush_all_backlogs()
After commit 145dd5f9c8 ("net: flush the softnet backlog in process
context"), we can easily batch calls to flush_all_backlogs() for all
devices processed in rollback_registered_many()

Tested:

Before patch, on an idle host.

modprobe dummy numdummies=10000
perf stat -e context-switches -a rmmod dummy

 Performance counter stats for 'system wide':

         1,211,798      context-switches

       1.302137465 seconds time elapsed

After patch:

perf stat -e context-switches -a rmmod dummy

 Performance counter stats for 'system wide':

           225,523      context-switches

       0.721623566 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:17:20 -07:00
David S. Miller 2df5d103a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Allow nf_tables reject expression from input, forward and output hooks,
   since only there the routing information is available, otherwise we crash.

2) Fix unsafe list iteration when flushing timeout and accouting objects.

3) Fix refcount leak on timeout policy parsing failure.

4) Unlink timeout object for unconfirmed conntracks too

5) Missing validation of pkttype mangling from bridge family.

6) Fix refcount leak on ebtables on second lookup for the specific
   bridge match extension, this patch from Sabrina Dubroca.

7) Remove unnecessary ip_hdr() in nf_tables_netdev family.

Patches from 1-5 and 7 from Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:02:09 -07:00
David S. Miller 15543692a0 Three little fixes:
* revert a recent wext patch, which Ben Hutchings noticed was
    wrong, and it turns out not to be necessary for any driver
 
  * fix an infinite loop that can occur under certain conditions
    in mac80211's TDLS code (depending on regulatory information)
 
  * add a cfg80211_get_station() static inline when cfg80211 isn't
    built, to allow other modules to not have to depend on it for it
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJXxSR4AAoJEGt7eEactAAdl3EP/AiUwqrYqbLnnFy6C7obFS3p
 eBBMxQAZbT+q+fFlZvqRrt5tPdkYriPLhm/0sAzuapnyS+Q6seNJ/vPoo91uC1jU
 ZI/j97v9NwUtRLfNCq+0Jwvs7ma0U1VEcPV9wDdV5JgnKk0Z1CUIcsErYr1+v0YQ
 EpRwxczhzJNTULW36UP7RvVQpxwIGldPhxSZ0t1uHWaYTFliaTlnJUAk0ql44Lmm
 WLvoMSjFgX99P11ToCe81MPEzF2IXILvxPwtNZmn5tldEN2xknKEoEmmbN65fYDf
 OIJIJ3s1CijQvnkgXtU0RWWCMnyOoJjsLckgSDdy0euhbS5xRIfxBN2n+kqaI9WV
 a/aIvWNNhvAy2vNdWUJk0FrVBnDjlTtG1afIEAgJyP7uxTQqepQfyaRENLtH+kKe
 lWbOITUZztyagGIn8Bv1pDrrqwO+fSjiEsVEVAQMMmNKBpUWf8urhDQmabCLYGDB
 Nxh2e3wjv5ZQ+55uJIGRDCcPIrddh86FVtQBqTID+86r4a1RwPaWfhzFZYVj84fg
 504UzwtYlw1ITUhGbdMwribLVkwtBMVuEvpPrh6avwzS8wAH4upkhp4GGl/tfd/Z
 De0LqpxCKbDiI+VmmDo8FD4nx4wu4nTYIaecLjNoUSXhbjbPyI6V8/hIXqKiQUA3
 ObkKlGicZJmhMa4zna0q
 =dFzv
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2016-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Three little fixes:
 * revert a recent wext patch, which Ben Hutchings noticed was
   wrong, and it turns out not to be necessary for any driver

 * fix an infinite loop that can occur under certain conditions
   in mac80211's TDLS code (depending on regulatory information)

 * add a cfg80211_get_station() static inline when cfg80211 isn't
   built, to allow other modules to not have to depend on it for it
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 21:34:48 -07:00
Linus Torvalds 0cf21c6609 NFS client bugfixes for 4.8
Highlights include:
 
 Stable patches:
 - Fix a refcount leak in nfs_callback_up_net
 - Fix an Oopsable condition when the flexfile pNFS driver connection to
   the DS fails
 - Fix an Oopsable condition in NFSv4.1 server callback races
 - Ensure pNFS clients stop doing I/O to the DS if their lease has expired,
   as required by the NFSv4.1 protocol
 
 Bugfixes:
 - Fix potential looping in the NFSv4.x migration code
 - Patch series to close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
 - Silence WARN_ON when NFSv4.1 over RDMA is in use
 - Fix a LAYOUTCOMMIT race in the pNFS/blocks client
 - Fix pNFS timeout issues when the DS fails
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJXxbnyAAoJEGcL54qWCgDykWoP/jqgBBR/cSaOtx+5m39wlf0P
 pTdQkgcpWnhBS90tKZtC6zfJ2DFVt8sUNVn9+mVzT4Q7TgEcAmENQ//s0igxHLbl
 bkXPvULydvD05Db8m1xmq2snj72tWbpg3CaA7nfx6yiP63k237QxhyNZVkmEQDur
 ynU8dPzmxRaSTQdVgatdS0zqx8sF47OFnXVxkV0ssBKORGsWj3yKDcs293NZNFAM
 Ztkih5oW1mm+BtWUQVNrjRnfZFG+PxAxWv090JM6wABDRbDHwSaKmwmI0kWRKXoH
 DHrj4i/Wzws65Fg5AyVPSRkF8YvHSVsLnw/FlwKKZFsrWjU6WtLdLSzgzwQ47x98
 tQk/YGgNyiiD1cAcw+l0d3Ct1SO4AptNuisdJK0cn3iCdsbh6Y0eW6yRRtQY6jQI
 8qOyMTT8fp9ooEQK+nMNOhJVVlsG0hbvWAt/uiiBdPhjAfVB0UFRuua/vNKUO7yv
 hJkDY9i7EkMXKACf5BCpBuvYdU7rwqp43K9x34029A5vFTKOhJZS4hnAIocDd/WF
 Hw7yqHdpkvI5RgFbBV5tmfZPyS65k8AzzTtT1QHKlH0qEtN2iMaXsXM9EzK5bKfW
 85Cc6yzRk7NzDZKmZFs/T8zCYdzet48sCY7wVyOQjL0aIkIDNNcZhex+C1GuD1dp
 Ld0H5f9eZdwv/OAqJ8tm
 =U+XK
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a refcount leak in nfs_callback_up_net
   - Fix an Oopsable condition when the flexfile pNFS driver connection
     to the DS fails
   - Fix an Oopsable condition in NFSv4.1 server callback races
   - Ensure pNFS clients stop doing I/O to the DS if their lease has
     expired, as required by the NFSv4.1 protocol

  Bugfixes:
   - Fix potential looping in the NFSv4.x migration code
   - Patch series to close callback races for OPEN, LAYOUTGET and
     LAYOUTRETURN
   - Silence WARN_ON when NFSv4.1 over RDMA is in use
   - Fix a LAYOUTCOMMIT race in the pNFS/blocks client
   - Fix pNFS timeout issues when the DS fails"

* tag 'nfs-for-4.8-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.x: Fix a refcount leak in nfs_callback_up_net
  NFS4: Avoid migration loops
  pNFS/flexfiles: Fix an Oopsable condition when connection to the DS fails
  NFSv4.1: Remove obsolete and incorrrect assignment in nfs4_callback_sequence
  NFSv4.1: Close callback races for OPEN, LAYOUTGET and LAYOUTRETURN
  NFSv4.1: Defer bumping the slot sequence number until we free the slot
  NFSv4.1: Delay callback processing when there are referring triples
  NFSv4.1: Fix Oopsable condition in server callback races
  SUNRPC: Silence WARN_ON when NFSv4.1 over RDMA is in use
  pnfs/blocklayout: update last_write_offset atomically with extents
  pNFS: The client must not do I/O to the DS if it's lease has expired
  pNFS: Handle NFS4ERR_OLD_STATEID correctly in LAYOUTSTAT calls
  pNFS/flexfiles: Set reasonable default retrans values for the data channel
  NFS: Allow the mount option retrans=0
  pNFS/flexfiles: Fix layoutstat periodic reporting
2016-08-30 11:14:02 -07:00
David Howells 4de48af663 rxrpc: Pass struct socket * to more rxrpc kernel interface functions
Pass struct socket * to more rxrpc kernel interface functions.  They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.

I have left:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

unmodified as they're all about to be removed (and, in any case, don't
touch the socket).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
David Howells ea82aaec98 rxrpc: Use call->peer rather than going to the connection
Use call->peer rather than call->conn->params.peer as call->conn may become
NULL.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
David Howells 8324f0bcfb rxrpc: Provide a way for AFS to ask for the peer address of a call
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:

   void rxrpc_kernel_get_peer(struct rxrpc_call *call,
			      struct sockaddr_rxrpc *_srx);

In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.

Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
David Howells e34d4234b0 rxrpc: Trace rxrpc_call usage
Add a trace event for debuging rxrpc_call struct usage.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:02:36 +01:00
David Howells f5c17aaeb2 rxrpc: Calls should only have one terminal state
Condense the terminal states of a call state machine to a single state,
plus a separate completion type value.  The value is then set, along with
error and abort code values, only when the call is transitioned to the
completion state.

Helpers are provided to simplify this.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 15:58:31 +01:00
David Howells ccbd3dbe85 rxrpc: Fix a potential NULL-pointer deref in rxrpc_abort_calls
The call pointer in a channel on a connection will be NULL if there's no
active call on that channel.  rxrpc_abort_calls() needs to check for this
before trying to take the call's state_lock.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 15:56:12 +01:00
Gao Feng 779994fa36 netfilter: log: Check param to avoid overflow in nf_log_set
The nf_log_set is an interface function, so it should do the strict sanity
check of parameters. Convert the return value of nf_log_set as int instead
of void. When the pf is invalid, return -EOPNOTSUPP.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:52:32 +02:00
Gao Feng 3cb27991aa netfilter: log_arp: Use ARPHRD_ETHER instead of literal '1'
There is one macro ARPHRD_ETHER which defines the ethernet proto for ARP,
so we could use it instead of the literal number '1'.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:51:08 +02:00
Florian Westphal ad66713f5a netfilter: remove __nf_ct_kill_acct helper
After timer removal this just calls nf_ct_delete so remove the __ prefix
version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:10 +02:00
Florian Westphal c023c0e4a0 netfilter: conntrack: resched gc again if eviction rate is high
If we evicted a large fraction of the scanned conntrack entries re-schedule
the next gc cycle for immediate execution.

This triggers during tests where load is high, then drops to zero and
many connections will be in TW/CLOSE state with < 30 second timeouts.

Without this change it will take several minutes until conntrack count
comes back to normal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:09 +02:00
Florian Westphal b87a2f9199 netfilter: conntrack: add gc worker to remove timed-out entries
Conntrack gc worker to evict stale entries.

GC happens once every 5 seconds, but we only scan at most 1/64th of the
table (and not more than 8k) buckets to avoid hogging cpu.

This means that a complete scan of the table will take several minutes
of wall-clock time.

Considering that the gc run will never have to evict any entries
during normal operation because those will happen from packet path
this should be fine.

We only need gc to make sure userspace (conntrack event listeners)
eventually learn of the timeout, and for resource reclaim in case the
system becomes idle.

We do not disable BH and cond_resched for every bucket so this should
not introduce noticeable latencies either.

A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.

v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
nulls value change in gc worker, suggested by Eric Dumazet.

v3: don't call cancel_delayed_work_sync twice (again, Eric).

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:09 +02:00
Florian Westphal 2344d64ec7 netfilter: evict stale entries on netlink dumps
When dumping we already have to look at the entire table, so we might
as well toss those entries whose timeout value is in the past.

We also look at every entry during resize operations.
However, eviction there is not as simple because we hold the
global resize lock so we can't evict without adding a 'expired' list
to drop from later.  Considering that resizes are very rare it doesn't
seem worth doing it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:09 +02:00
Florian Westphal f330a7fdbe netfilter: conntrack: get rid of conntrack timer
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de, 'timers: Switch to
a non-cascading wheel').

Remove the timer and use a 32bit jiffies value containing timestamp until
entry is valid.

During conntrack lookup, even before doing tuple comparision, check
the timeout value and evict the entry in case it is too old.

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage colliction via work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps), this is needed to avoid expired conntracks from hanging
around for too long when lookup rate is low after a busy period.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:09 +02:00
Florian Westphal 616b14b469 netfilter: don't rely on DYING bit to detect when destroy event was sent
The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the eache worker in reliable event mode.

Currently when we delete the conntrack from main table we only set this
bit if we could also deliver the netlink destroy event to userspace.

If we fail we move it to the dying list, the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.

Once timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with other cpu.

Pablo suggested to add a marker in the ecache extension that skips
entries that have been unhashed from main table but are still waiting
for the last reference count to be dropped (e.g. because one skb waiting
on nfqueue verdict still holds a reference).

We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
eache extension.  The worker can then skip all entries that are in
a different state.  Either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery took
place already.

Once the conntrack timer is removed we will now be able to replace
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with other cpu that tries to evict the same conntrack.

Because DYING will then be set right before we report the destroy event
we can no longer skip event reporting when dying bit is set.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:08 +02:00
Florian Westphal 95a8d19f28 netfilter: restart search if moved to other chain
In case nf_conntrack_tuple_taken did not find a conflicting entry
check that all entries in this hash slot were tested and restart
in case an entry was moved to another chain.

Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: ea781f197d ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:08 +02:00
Liping Zhang c73c248490 netfilter: nf_tables_netdev: remove redundant ip_hdr assignment
We have already use skb_header_pointer to get the ip header pointer,
so there's no need to use ip_hdr again. Moreover, in NETDEV INGRESS
hook, ip header maybe not linear, so use ip_hdr is not appropriate,
remove it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:41:04 +02:00
Arik Nemtsov 554d072e7b mac80211: TDLS: don't require beaconing for AP BW
Stop downgrading TDLS chandef when reaching the AP BW. The AP provides
the necessary regulatory protection in this case.

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=153961, which
reported an infinite loop here.

Reported-by: Kamil Toman <kamil.toman@gmail.com>
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-08-30 08:03:41 +02:00
David S. Miller 6abdd5f593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All three conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 00:54:02 -04:00
Arnd Bergmann 0b498a5277 net_sched: fix use of uninitialized ethertype variable in cls_flower
The addition of VLAN support caused a possible use of uninitialized
data if we encounter a zero TCA_FLOWER_KEY_ETH_TYPE key, as pointed
out by "gcc -Wmaybe-uninitialized":

net/sched/cls_flower.c: In function 'fl_change':
net/sched/cls_flower.c:366:22: error: 'ethertype' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This changes the code to only set the ethertype field if it
was nonzero, as before the patch.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9399ae9a6c ("net_sched: flower: Add vlan support")
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-29 00:30:23 -04:00
Eric Dumazet c9c3321257 tcp: add tcp_add_backlog()
When TCP operates in lossy environments (between 1 and 10 % packet
losses), many SACK blocks can be exchanged, and I noticed we could
drop them on busy senders, if these SACK blocks have to be queued
into the socket backlog.

While the main cause is the poor performance of RACK/SACK processing,
we can try to avoid these drops of valuable information that can lead to
spurious timeouts and retransmits.

Cause of the drops is the skb->truesize overestimation caused by :

- drivers allocating ~2048 (or more) bytes as a fragment to hold an
  Ethernet frame.

- various pskb_may_pull() calls bringing the headers into skb->head
  might have pulled all the frame content, but skb->truesize could
  not be lowered, as the stack has no idea of each fragment truesize.

The backlog drops are also more visible on bidirectional flows, since
their sk_rmem_alloc can be quite big.

Let's add some room for the backlog, as only the socket owner
can selectively take action to lower memory needs, like collapsing
receive queues or partial ofo pruning.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-29 00:20:24 -04:00
Tom Herbert 96a5908347 kcm: Remove TCP specific references from kcm and strparser
kcm and strparser need to work with any type of stream socket not just
TCP. Eliminate references to TCP and call generic proto_ops functions of
read_sock and peek_len. Also in strp_init check if the socket support
the proto_ops read_sock and peek_len.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Tom Herbert 3203558589 tcp: Set read_sock and peek_len proto_ops
In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Richard Alpe 832629ca5c tipc: add UDP remoteip dump to netlink API
When using replicast a UDP bearer can have an arbitrary amount of
remote ip addresses associated with it. This means we cannot simply
add all remote ip addresses to an existing bearer data message as it
might fill the message, leaving us with a truncated message that we
can't safely resume. To handle this we introduce the new netlink
command TIPC_NL_UDP_GET_REMOTEIP. This command is intended to be
called when the bearer data message has the
TIPC_NLA_UDP_MULTI_REMOTEIP flag set, indicating there are more than
one remote ip (replicast).

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe fdb3accc2c tipc: add the ability to get UDP options via netlink
Add UDP bearer options to netlink bearer get message. This is used by
the tipc user space tool to display UDP options.

The UDP bearer information is passed using either a sockaddr_in or
sockaddr_in6 structs. This means the user space receiver should
intermediately store the retrieved data in a large enough struct
(sockaddr_strage) before casting to the proper IP version type.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe c9b64d492b tipc: add replicast peer discovery
Automatically learn UDP remote IP addresses of communicating peers by
looking at the source IP address of incoming TIPC link configuration
messages (neighbor discovery).

This makes configuration slightly easier and removes the problematic
scenario where a node receives directly addressed neighbor discovery
messages sent using replicast which the node cannot "reply" to using
mutlicast, leaving the link FSM in a limbo state.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe ef20cd4dd1 tipc: introduce UDP replicast
This patch introduces UDP replicast. A concept where we emulate
multicast by sending multiple unicast messages to configured peers.

The purpose of replicast is mainly to be able to use TIPC in cloud
environments where IP multicast is disabled. Using replicas to unicast
multicast messages is costly as we have to copy each skb and send the
copies individually.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe 1ca73e3fa1 tipc: refactor multicast ip check
Add a function to check if a tipc UDP media address is a multicast
address or not. This is a purely cosmetic change.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:40 -07:00
Richard Alpe ce984da36e tipc: split UDP send function
Split the UDP send function into two. One callback that prepares the
skb and one transmit function that sends the skb. This will come in
handy in later patches, when we introduce UDP replicast.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:40 -07:00