Commit 79b16aadea
("udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb()")
introduce 'sk' but we already have one inner 'sk'.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vitaly Kuznetsov says:
====================
hv_netvsc: linearize SKBs bigger than MAX_PAGE_BUFFER_COUNT-2 pages
This patch series fixes the same issue which was fixed in Xen with commit
97a6d1bb2b ("xen-netfront: Fix handling packets on
compound pages with skb_linearize").
It is relatively easy to create a packet which is small in size but occupies
more than 30 (MAX_PAGE_BUFFER_COUNT-2) pages. Here is a kernel-mode reproducer
which tries sending a packet with only 34 bytes of payload (but on 34 pages)
and fails:
static int __init sendfb_init(void)
{
struct socket *sock;
int i, ret;
struct sockaddr_in in4_addr = { 0 };
struct page *pages[17];
unsigned long flags;
ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
if (ret) {
pr_err("failed to create socket: %d!\n", ret);
return ret;
}
in4_addr.sin_family = AF_INET;
/* www.google.com, 74.125.133.99 */
in4_addr.sin_addr.s_addr = cpu_to_be32(0x4a7d8563);
in4_addr.sin_port = cpu_to_be16(80);
ret = sock->ops->connect(sock, (struct sockaddr *)&in4_addr, sizeof(in4_addr), 0);
if (ret) {
pr_err("failed to connect: %d!\n", ret);
return ret;
}
/* We can send up to 17 frags */
flags = MSG_MORE;
for (i = 0; i < 17; i++) {
if (i == 16)
flags = MSG_EOR;
pages[i] = alloc_pages(GFP_KERNEL | __GFP_COMP, 1);
if (!pages[i]) {
pr_err("out of memory!");
goto free_pages;
}
sock->ops->sendpage(sock, pages[i], PAGE_SIZE -1, 2, flags);
}
free_pages:
for (; i > 0; i--)
__free_pages(pages[i - 1], 1);
printk("sendfb_init: test done\n");
return -1;
}
module_init(sendfb_init);
MODULE_LICENSE("GPL");
A try to load such module results in multiple
'kernel: hv_netvsc vmbus_15 eth0: Packet too big: 100' messages as all retries
fail as well. It should also be possible to trigger the issue from userspace, I
expect e.g. NFS under heavy load to get stuck sometimes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In netvsc_start_xmit() we can handle packets which are scattered around not
more than MAX_PAGE_BUFFER_COUNT-2 pages. It is, however, easy to create a
packet which is not big in size but occupies more pages (e.g. if it uses frags
on compound pages boundaries). When we drop such packet it cases sender to try
resending it but in most cases it will try resending the same packet which will
also get dropped, this will cause the particular connection to stick. To solve
the issue we can try linearizing skb.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
... which validly uses dev_kfree_skb_any() instead of dev_kfree_skb().
Setting ret to -EFAULT and -ENOMEM have no real meaning here (we need to set
it to anything but -EAGAIN) as we drop the packet and return NETDEV_TX_OK
anyway.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah says:
====================
sfc: Nic specific sriov functions, netdev_ops and sriov_configure
First two patches among the series of patches to support SRIOV on EF10.
First patch declares nic specific sriov functions in nic specific headers,
creates only one instance of the netdev_ops, removes sriov functionality
from Falcon code.
Second patch adds support for sriov_configure.
The Virtual Functions can be enabled but they do not bind to the SFC
driver just yet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the use of sriov_configure on EF10
to enable Virtual Functions while the driver is loaded.
Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By putting all the efx_{siena,ef10}_sriov_* declarations in
{siena,ef10}_sriov.h, ensure they cannot be called from nic-generic code.
Also fixes up an instance of this, where mcdi.c was calling
efx_siena_sriov_flr.
The single instance of netdev_ops should call general high level
functions that can then call something adapter specific in efx_nic_type.
We should only do adapter specialisation via efx_nic_type.
Removal of sriov functionality from the Falcon code means that tests
are needed for the presence of some callbacks.
Signed-off-by: Shradha Shah <sshah@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck says:
====================
Replace wmb()/rmb() with dma_wmb()/dma_rmb() where appropriate
This is a start of a side project cleaning up the drivers that can make use
of the dma_wmb and dma_rmb calls. The general idea is to start removing
the unnecessary wmb/rmb calls from a number of drivers and to make use of
the lighter weight dma_wmb/dma_rmb calls as this should allow for an
overall improvement in performance as each barrier can cost a significant
number of cycles and on architectures such as x86 this is unnecessary.
These changes are what I would consider low hanging fruit. The likelihood
of the changes introducing an error should be low since the use of the
barriers in these cases are fairly obvious.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This change replaces calls to rmb with dma_rmb in the case where we want to
order all follow-on descriptor reads after the check for the descriptor
status bit.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change updates several spots where a wmb was being used to instead use
a dma_wmb to flush out writes before updating the control portion of the
descriptor.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch goes through and replaces wmb/rmb with dma_wmb/dma_rmb in cases
where the barrier is being used to order writes or reads to just memory and
doesn't involve any programmed I/O.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bond_3ad_bind_slave() calls ad_initialize_port() and then immediately
assigns correct values making some of that initialization unnecessary.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch breaks the rich assignments into it's own statements
and removes some duplicate code where admin-key, & oper-key are
updated.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 79b16aadea ("udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb().")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket parameter might legally be NULL, thus sock_net is sometimes
causing a NULL pointer dereference. Using net_device pointer in dst_entry
is more reliable.
Fixes: b6a7719aed ("ipv4: hash net ptr into fragmentation bucket selection")
Reported-by: Rick Jones <rick.jones2@hp.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an userdata set extension and allow the user to attach arbitrary
data to set elements. This is intended to hold TLV encoded data like
comments or DNS annotations that have no meaning to the kernel.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a new "dynset" expression for dynamic set updates.
A new set op ->update() is added which, for non existant elements,
invokes an initialization callback and inserts the new element.
For both new or existing elements the extenstion pointer is returned
to the caller to optionally perform timer updates or other actions.
Element removal is not supported so far, however that seems to be a
rather exotic need and can be added later on.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently a set binding is assumed to be related to a lookup and, in
case of maps, a data load.
In order to use bindings for set updates, the loop detection checks
must be restricted to map operations only. Add a flags member to the
binding struct to hold the set "action" flags such as NFT_SET_MAP,
and perform loop detection based on these.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use atomic operations for the element count to avoid races with async
updates.
To properly handle the transactional semantics during netlink updates,
deleted but not yet committed elements are accounted for seperately and
are treated as being already removed. This means for the duration of
a netlink transaction, the limit might be exceeded by the amount of
elements deleted. Set implementations must be prepared to handle this.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The NFT_SET_TIMEOUT flag is ignore in nft_select_set_ops, which may
lead to selection of a set implementation that doesn't actually
support timeouts.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_bridge_info->mask is used for several things, for example to
remember if skb->pkt_type was set to OTHER_HOST.
For a bridge, OTHER_HOST is expected case. For ip forward its a non-starter
though -- routing expects PACKET_HOST.
Bridge netfilter thus changes OTHER_HOST to PACKET_HOST before hook
invocation and then un-does it after hook traversal.
This information is irrelevant outside of br_netfilter.
After this change, ->mask now only contains flags that need to be
known outside of br_netfilter in fast-path.
Future patch changes mask into a 2bit state field in sk_buff, so that
we can remove skb->nf_bridge pointer for good and consider all remaining
places that access nf_bridge info content a not-so fastpath.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
->mask is a bit info field that mixes various use cases.
In particular, we have flags that are mutually exlusive, and flags that
are only used within br_netfilter while others need to be exposed to
other parts of the kernel.
Remove BRNF_8021Q/PPPoE flags. They're mutually exclusive and only
needed within br_netfilter context.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Don't access skb->nf_bridge directly, this pointer will be removed soon.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
right now we store this in the nf_bridge_info struct, accessible
via skb->nf_bridge. This patch prepares removal of this pointer from skb:
Instead of using skb->nf_bridge->x, we use helpers to obtain the in/out
device (or ifindexes).
Followup patches to netfilter will then allow nf_bridge_info to be
obtained by a call into the br_netfilter core, rather than keeping a
pointer to it in sk_buff.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
br_netfilter maintains an extra state, nf_bridge_info, which is attached
to skb via skb->nf_bridge pointer.
Amongst other things we use skb->nf_bridge->data to store the original
mac header for every processed skb.
This is required for ip refragmentation when using conntrack
on top of bridge, because ip_fragment doesn't copy it from original skb.
However there is no need anymore to do this unconditionally.
Move this to the one place where its needed -- when br_netfilter calls
ip_fragment().
Also switch to percpu storage for this so we can handle fragmenting
without accessing nf_bridge meta data.
Only user left is neigh resolution when DNAT is detected, to hold
the original source mac address (neigh resolution builds new mac header
using bridge mac), so rename ->data and reduce its size to whats needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently in xt_socket, we take advantage of early demuxed sockets
since commit 00028aa370 ("netfilter: xt_socket: use IP early demux")
in order to avoid a second socket lookup in the fast path, but we
only make partial use of this:
We still unnecessarily parse headers, extract proto, {s,d}addr and
{s,d}ports from the skb data, accessing possible conntrack information,
etc even though we were not even calling into the socket lookup via
xt_socket_get_sock_{v4,v6}() due to skb->sk hit, meaning those cycles
can be spared.
After this patch, we only proceed the slower, manual lookup path
when we have a skb->sk miss, thus time to match verdict for early
demuxed sockets will improve further, which might be i.e. interesting
for use cases such as mentioned in 681f130f39 ("netfilter: xt_socket:
add XT_SOCKET_NOWILDCARD flag").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The change to only export WEXT symbols when required could break
the build if CONFIG_CFG80211_WEXT was explicitly disabled while
a driver like orinoco selected it.
Fix this by hiding the symbol when it's required so it can't be
disabled in that case.
Fixes: 2afe38d15c ("cfg80211-wext: export symbols only when needed")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the two places changed, we now use netvsc_xmit_completion() which properly
frees hv_netvsc_packet in or not in skb headroom.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sum of RNDIS msg and PPI struct sizes is used in multiple places, so we define
a macro for them.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies. This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts. If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.
The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.
Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fast Open has been using the experimental option with a magic number
(RFC6994) to request and grant Fast Open cookies. This patch enables
the server to support the official IANA option 34 in RFC7413 in
addition.
The change has passed all existing Fast Open tests with both
old and new options at Google.
Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes byte backlog accounting for the first of two chained netem instances.
Bytes backlog reported now corresponds to the number of queued packets.
When two netem instances are chained, for instance to apply rate and queue
limitation followed by packet delay, the number of backlogged bytes reported
by the first netem instance is wrong. It reports the sum of bytes in the queues
of the first and second netem. The first netem reports the correct number of
backlogged packets but not bytes. This is shown in the example below.
Consider a chain of two netem schedulers created using the following commands:
$ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
$ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms
Start an iperf session to send packets out on the specified interface and
monitor the backlog using tc:
$ tc -s qdisc show dev veth2
Output using unpatched netem:
qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
backlog 172694b 73p requeues 0
qdisc netem 2: parent 1: limit 1000 delay 50.0ms
Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
backlog 63588b 42p requeues 0
The interface used to produce this output has an MTU of 1500. The output for
backlogged bytes behind netem 1 is 172694b. This value is not correct. Consider
the total number of sent bytes and packets. By dividing the number of sent
bytes by the number of sent packets, we get an average packet size of ~=1504.
If we divide the number of backlogged bytes by packets, we get ~=2365. This is
due to the first netem incorrectly counting the 63588b which are in netem 2's
queue as being in its own queue. To verify this is the case, we subtract them
from the reported value and divide by the number of packets as follows:
172694 - 63588 = 109106 bytes actualled backlogged in netem 1
109106 / 73 packets ~= 1494 bytes (which matches our MTU)
The root cause is that the byte accounting is not done at the
same time with packet accounting. The solution is to update the backlog value
every time the packet queue is updated.
Signed-off-by: Joseph D Beshay <joseph.beshay@utdallas.edu>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Read Local Out Of Band Extended Data mgmt command is specified to
return the SSP values when given a BR/EDR address type as input
parameter. The returned values may include either the 192-bit variants
of C and R, or their 256-bit variants, or both, depending on the status
of Secure Connections and Secure Connections Only modes. If SSP is not
enabled the command will only return the Class of Device value (like it
has done so far).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Also move 'group' description to match the order of the net_device structure.
Fixes: 7a66bbc96c ("net: remove iflink field from struct net_device")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel says:
====================
netns: enhance netlink interface for nsid
The first patch is a small cleanup. The second patch implements notifications
for netns id events. And the last one allows to dump existing netns id from
userland.
iproute2 patches are available, I can send them on demand.
v2: drop the first patch (the fix is now in net-next)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Which this patch, it's possible to dump the list of ids allocated for peer
netns.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, netns ids that are created and deleted are advertised into the
group RTNLGRP_NSID.
Because callers of rtnl_net_notifyid() already know the id of the peer, there is
no need to call __peernet2id() in rtnl_net_fill().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to initialize err, it will be overridden by the value of nlmsg_parse().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since Bluetooth 4.1 there are two additional values for SSP OOB data,
namely C-256 and R-256. This patch updates the EIR definitions to take
into account both the 192 and 256 bit variants of C and R.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Prevent UDP tunnels from operating on garbage socket
So this should do the rest of the work such that when we encapsulate
into a UDP tunnel, the output path works on the UDP tunnel's socket
rather than skb->sk.
Part of this work is based upon changes done by Jiri Pirko some time
ago.
Basically the first step is to pass the socket through the nf_hook
okfn(), and then next we do the same for the UDP tunnel xmit routines.
Signed-off-by: David S. Miller <davem@davemloft.net>
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.
Based upon a patch by Jiri Pirko.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
On the output paths in particular, we have to sometimes deal with two
socket contexts. First, and usually skb->sk, is the local socket that
generated the frame.
And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.
We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.
The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device. We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.
Signed-off-by: David S. Miller <davem@davemloft.net>
It is currently always set to NULL, but nf_queue is adjusted to be
prepared for it being set to a real socket by taking and releasing a
reference to that socket when necessary.
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we can consolidate where we setup new nf_hook_state objects,
to make sure the entire thing is initialized.
The only other place an nf_hook_object is instantiated is nf_queue,
wherein a structure copy is used.
Signed-off-by: David S. Miller <davem@davemloft.net>
Return a negative error code on failure.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
{ ... return ret; }
|
ret = 0
)
... when != ret = e1
when != &ret
*if(...)
{
... when != ret = e2
when forall
return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>