Commit Graph

47464 Commits

Author SHA1 Message Date
David Ahern 1dfa76390b net: ipv4: add check for l3slave for index returned in IP_PKTINFO
Similar to the loopback device, for packets sent through a VRF device
the index returned in ipi_ifindex needs to be the saved index in
rt_iif.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13 20:05:13 -07:00
David Ahern 9438c871b2 net: ipv4: remove unnecessary check on orig_oif
rt_iif is going to be set to either 0 or orig_oif. If orig_oif
is 0 it amounts to the same end result so remove the check.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13 20:03:03 -07:00
Jason Wang 7c4974786f net: export some generic xdp helpers
This patch tries to export some generic xdp helpers to drivers. This
can let driver to do XDP for a specific skb. This is useful for the
case when the packet is hard to be processed at page level directly
(e.g jumbo/GSO frame).

With this patch, there's no need for driver to forbid the XDP set when
configuration is not suitable. Instead, it can defer the XDP for
packets that is hard to be processed directly after skb is created.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13 19:56:07 -07:00
Jakub Sitnicki d0225784be rtnelink: Move link dump consistency check out of the loop
Calls to rtnl_dump_ifinfo() are protected by RTNL lock. So are the
{list,unlist}_netdevice() calls where we bump the net->dev_base_seq
number.

For this reason net->dev_base_seq can't change under out feet while
we're looping over links in rtnl_dump_ifinfo(). So move the check for
net->dev_base_seq change (since the last time we were called) out of the
loop.

This way we avoid giving a wrong impression that there are concurrent
updates to the link list going on while we're iterating over them.

Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13 19:43:57 -07:00
Joe Perches 0ed80da518 openvswitch: Remove unnecessary newlines from OVS_NLERR uses
OVS_NLERR already adds a newline so these just add blank
lines to the logging.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:52:47 -07:00
Tonghao Zhang 939912216f net: skb_needs_check() removes CHECKSUM_UNNECESSARY check for tx.
Because we remove the UFO support, we will also remove the
CHECKSUM_UNNECESSARY check in skb_needs_check().

Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:45:52 -07:00
David Ahern 839da4d989 net: ipv4: set orig_oif based on fib result for local traffic
Attempts to connect to a local address with a socket bound
to a device with the local address hangs if there is no listener:

  $ ip addr sh dev eth1
  3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:e0:f9:1c:00:37 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.4/24 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 2001:db8:1::4/120 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe1c:37/64 scope link
       valid_lft forever preferred_lft forever

  $ vrf-test -I eth1 -r 10.100.1.4
  <hangs when there is no server>

(don't let the command name fool you; vrf-test works without vrfs.)

The problem is that the original intended device, eth1 in this case, is
lost when the tcp reset is sent, so the socket lookup does not find a
match for the reset and the connect attempt hangs. Fix by adjusting
orig_oif for local traffic to the device from the fib lookup result.

With this patch you get the more user friendly:
  $ vrf-test -I eth1 -r 10.100.1.4
  connect failed: 111: Connection refused

orig_oif is saved to the newly created rtable as rt_iif and when set
it is used as the dif for socket lookups. It is set based on flowi4_oif
passed in to ip_route_output_key_hash_rcu and will be set to either
the loopback device, an l3mdev device, nothing (flowi4_oif = 0 which
is the case in the example above) or a netdev index depending on the
lookup path. In each case, resetting orig_oif to the device in the fib
result for the RTN_LOCAL case allows the actual device to be preserved
as the skb tx and rx is done over the loopback or VRF device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:40:39 -07:00
Jiri Pirko 7b06e8aed2 net: sched: remove cops->tcf_cl_offload
cops->tcf_cl_offload is no longer needed, as the drivers check what they
can and cannot offload using the classid identify helpers. So remove this.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 13:47:01 -07:00
Jiri Pirko a2e8da9378 net: sched: use newly added classid identity helpers
Instead of checking handle, which does not have the inner class
information and drivers wrongly assume clsact->egress as ingress, use
the newly introduced classid identification helpers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 13:47:01 -07:00
Xin Long 327c0dab8d sctp: fix some indents in sm_make_chunk.c
There are some bad indents of functions' defination in sm_make_chunk.c.
They have been there since beginning, it was probably caused by that
the typedef sctp_chunk_t was replaced with struct sctp_chunk.

So it's the best time to fix them in this patchset, it's also to fix
some bad indents in other functions' defination in sm_make_chunk.c.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long 172a1599ba sctp: remove the typedef sctp_disposition_t
This patch is to remove the typedef sctp_disposition_t, and
replace with enum sctp_disposition in the places where it's
using this typedef.

It's also to fix the indent for many functions' defination.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long 8ee821aea3 sctp: remove the typedef sctp_sm_table_entry_t
This patch is to remove the typedef sctp_sm_table_entry_t, and
replace with struct sctp_sm_table_entry in the places where it's
using this typedef.

It is also to fix some indents.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long e08af95df1 sctp: remove the typedef sctp_verb_t
This patch is to remove the typedef sctp_verb_t, and
replace with enum sctp_verb in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long c488b7704e sctp: remove the typedef sctp_arg_t
This patch is to remove the typedef sctp_arg_t, and
replace with union sctp_arg in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long a85bbeb221 sctp: remove the typedef sctp_cmd_seq_t
This patch is to remove the typedef sctp_cmd_seq_t, and
replace with struct sctp_cmd_seq in the places where it's
using this typedef.

Note that it doesn't fix many indents although it should,
as sctp_disposition_t's removal would mess them up again.
So better to fix them when removing sctp_disposition_t in
the later patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long e2c3108ab2 sctp: remove the typedef sctp_cmd_t
This patch is to remove the typedef sctp_cmd_t, and
replace with enum sctp_cmd in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long b7ef2618a0 sctp: remove the typedef sctp_socket_type_t
This patch is to remove the typedef sctp_socket_type_t, and
replace with enum sctp_socket_type in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:44 -07:00
Xin Long d38ef5ae35 sctp: remove the typedef sctp_dbg_objcnt_entry_t
This patch is to remove the typedef sctp_dbg_objcnt_entry_t, and
replace with struct sctp_dbg_objcnt_entry in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:43 -07:00
Xin Long a05437ac5d sctp: remove the typedef sctp_cmsgs_t
This patch is to remove the typedef sctp_cmsgs_t, and
replace with struct sctp_cmsgs in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:43 -07:00
Xin Long edf903f83e sctp: remove the typedef sctp_sender_hb_info_t
This patch is to remove the typedef sctp_sender_hb_info_t, and
replace with struct sctp_sender_hb_info in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 10:02:43 -07:00
David S. Miller 3b2b69efec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mainline had UFO fixes, but UFO is removed in net-next so we
take the HEAD hunks.

Minor context conflict in bcmsysport statistics bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 12:11:16 -07:00
Willem de Bruijn c27927e372 packet: fix tp_reserve race in packet_set_ring
Updates to tp_reserve can race with reads of the field in
packet_set_ring. Avoid this by holding the socket lock during
updates in setsockopt PACKET_RESERVE.

This bug was discovered by syzkaller.

Fixes: 8913336a7e ("packet: add PACKET_RESERVE sockopt")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:52:12 -07:00
Willem de Bruijn 85f1bd9a7b udp: consistently apply ufo or fragmentation
When iteratively building a UDP datagram with MSG_MORE and that
datagram exceeds MTU, consistently choose UFO or fragmentation.

Once skb_is_gso, always apply ufo. Conversely, once a datagram is
split across multiple skbs, do not consider ufo.

Sendpage already maintains the first invariant, only add the second.
IPv6 does not have a sendpage implementation to modify.

A gso skb must have a partial checksum, do not follow sk_no_check_tx
in udp_send_skb.

Found by syzkaller.

Fixes: e89e9cf539 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:52:12 -07:00
Florian Westphal 8caa38b56c rtnetlink: fallback to UNSPEC if current family has no doit callback
We need to use PF_UNSPEC in case the requested family has no doit
callback, otherwise this now fails with EOPNOTSUPP instead of running the
unspec doit callback, as before.

Fixes: 6853dd4881 ("rtnetlink: protect handler table with rcu")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:50:22 -07:00
Florian Westphal d38a65125f rtnetlink: init handler refcounts to 1
If using CONFIG_REFCOUNT_FULL=y we get following splat:
 refcount_t: increment on 0; use-after-free.
WARNING: CPU: 0 PID: 304 at lib/refcount.c:152 refcount_inc+0x47/0x50
Call Trace:
 rtnetlink_rcv_msg+0x191/0x260
 ...

This warning is harmless (0 is "no callback running", not "memory
was freed").

Use '1' as the new 'no handler is running' base instead of 0 to avoid
this.

Fixes: 019a316992 ("rtnetlink: add reference counting to prevent module unload while dump is in progress")
Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:50:22 -07:00
Florian Westphal 8515ae3843 rtnetlink: switch rtnl_link_get_slave_info_data_size to rcu
David Ahern reports following splat:
 RTNL: assertion failed at net/core/dev.c (5717)
 netdev_master_upper_dev_get+0x5f/0x70
 if_nlmsg_size+0x158/0x240
 rtnl_calcit.isra.26+0xa3/0xf0

rtnl_link_get_slave_info_data_size currently assumes RTNL protection, but
there appears to be no hard requirement for this, so use rcu instead.

At the time of this writing, there are three 'get_slave_size' callbacks
(now invoked under rcu): bond_get_slave_size, vrf_get_slave_size and
br_port_get_slave_size, all return constant only (i.e. they don't sleep).

Fixes: 6853dd4881 ("rtnetlink: protect handler table with rcu")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:50:22 -07:00
Florian Westphal 5c2bb9b6e2 rtnetlink: do not use RTM_GETLINK directly
Userspace sends RTM_GETLINK type, but the kernel substracts
RTM_BASE from this, i.e. 'type' doesn't contain RTM_GETLINK
anymore but instead RTM_GETLINK - RTM_BASE.

This caused the calcit callback to not be invoked when it
should have been (and vice versa).

While at it, also fix a off-by one when checking family index. vs
handler array size.

Fixes: e1fa6d216d ("rtnetlink: call rtnl_calcit directly")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:50:22 -07:00
Florian Westphal 377cb24884 rtnetlink: use rcu_dereference_raw to silence rcu splat
Ido reports a rcu splat in __rtnl_register.
The splat is correct; as rtnl_register doesn't grab any logs
and doesn't use rcu locks either.  It has always been like this.
handler families are not registered in parallel so there are no
races wrt. the kmalloc ordering.

The only reason to use rcu_dereference in the first place was to
avoid sparse from complaining about this.

Thus this switches to _raw() to not have rcu checks here.

The alternative is to add rtnl locking to register/unregister,
however, I don't see a compelling reason to do so as this has been
lockless for the past twenty years or so.

Fixes: 6853dd4881 ("rtnetlink: protect handler table with rcu")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:50:21 -07:00
John Crispin 2d57164564 net: core: fix compile error inside flow_dissector due to new dsa callback
The following error was introduced by
commit 43e665287f ("net-next: dsa: fix flow dissection")
due to a missing #if guard

net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:448:18: error: 'struct net_device' has no member named 'dsa_ptr'
ops = skb->dev->dsa_ptr->tag_ops;
                ^
make[3]: *** [net/core/flow_dissector.o] Error 1

Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:16:05 -07:00
John Crispin 43e665287f net-next: dsa: fix flow dissection
RPS and probably other kernel features are currently broken on some if not
all DSA devices. The root cause of this is that skb_hash will call the
flow_dissector. At this point the skb still contains the magic switch
header and the skb->protocol field is not set up to the correct 802.3
value yet. By the time the tag specific code is called, removing the header
and properly setting the protocol an invalid hash is already set. In the
case of the mt7530 this will result in all flows always having the same
hash.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:51:47 -07:00
John Crispin 2dd592b274 net-next: tag_mtk: add flow_dissect callback to the ops struct
The MT7530 inserts the 4 magic header in between the 802.3 address and
protocol field. The patch implements the callback that can be called by
the flow dissector to figure out the real protocol and offset of the
network header. With this patch applied we can properly parse the packet
and thus make hashing function properly.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:51:47 -07:00
John Crispin 68277a2c9d net-next: dsa: move struct dsa_device_ops to the global header file
We need to access this struct from within the flow_dissector to fix
dissection for packets coming in on DSA devices.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:51:47 -07:00
Xin Long 96d9703050 net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
Commit 55917a21d0 ("netfilter: x_tables: add context to know if
extension runs from nft_compat") introduced a member nft_compat to
xt_tgchk_param structure.

But it didn't set it's value for ipt_init_target. With unexpected
value in par.nft_compat, it may return unexpected result in some
target's checkentry.

This patch is to set all it's fields as 0 and only initialize the
non-zero fields in ipt_init_target.

v1->v2:
  As Wang Cong's suggestion, fix it by setting all it's fields as
  0 and only initializing the non-zero fields.

Fixes: 55917a21d0 ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:46:44 -07:00
Nikolay Borisov 1714020e42 igmp: Fix regression caused by igmp sysctl namespace code.
Commit dcd87999d4 ("igmp: net: Move igmp namespace init to correct file")
moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This
function is only called as part of per-namespace initialization, only if
CONFIG_IP_MULTICAST is defined, otherwise igmp_mc_init() call in ip_init is
compiled out, casuing the igmp pernet ops to not be registerd and those sysctl
being left initialized with 0. However, there are certain functions, such as
ip_mc_join_group which are always compiled and make use of some of those
sysctls. Let's do a partial revert of the aforementioned commit and move the
sysctl initialization into inet_init_net, that way they will always have
sane values.

Fixes: dcd87999d4 ("igmp: net: Move igmp namespace init to correct file")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595
Reported-by: Gerardo Exequiel Pozzi <vmlinuz386@gmail.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:46:44 -07:00
Bhumika Goyal 800bb47e71 net: atm: make atmdev_ops const
Make these const as they are only stored in the ops field of a atm_dev
structure, which is const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:43:50 -07:00
David Ahern 6eb7939371 net: ipv6: lower ndisc notifier priority below addrconf
ndisc_notify is used to send unsolicited neighbor advertisements
(e.g., on a link up). Currently, the ndisc notifier is run before the
addrconf notifer which means NA's are not sent for link-local addresses
which are added by the addrconf notifier.

Fix by lowering the priority of the ndisc notifier. Setting the priority
to ADDRCONF_NOTIFY_PRIORITY - 5 means it runs after addrconf and before
the route notifier which is ADDRCONF_NOTIFY_PRIORITY - 10.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:40:04 -07:00
Jon Paul Maloy ed43594aed tipc: remove premature ESTABLISH FSM event at link synchronization
When a link between two nodes come up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.

However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before  initializing synchronization.

Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it inhibits tipc_node_link_up() to
perform the test of its tasks, and the link endpoint in question hence
is never taken into traffic.

This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.

We fix this by removing the unnecessary event call.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:38:06 -07:00
Florian Westphal 165b911725 net: call newid/getid without rtnl mutex held
Both functions take nsid_lock and don't rely on rtnl lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Florian Westphal 62256f98f2 rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED
Allow callers to tell rtnetlink core that its doit callback
should be invoked without holding rtnl mutex.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Florian Westphal 6853dd4881 rtnetlink: protect handler table with rcu
Note that netlink dumps still acquire rtnl mutex via the netlink
dump infrastructure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Florian Westphal 0cc09020ae rtnetlink: small rtnl lock pushdown
instead of rtnl lock/unload at the top level, push it down
to the called function.

This is just an intermediate step, next commit switches protection
of the rtnl_link ops table to rcu, in which case (for dumps) the
rtnl lock is acquired only by the netlink dumper infrastructure
(current lock/unlock/dump/lock/unlock rtnl sequence becomes
 rcu lock/rcu unlock/dump).

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Florian Westphal 019a316992 rtnetlink: add reference counting to prevent module unload while dump is in progress
I don't see what prevents rmmod (unregister_all is called) while a dump
is active.

Even if we'd add rtnl lock/unlock pair to unregister_all (as done here),
thats not enough either as rtnl_lock is released right before the dump
process starts.

So this adds a refcount:
 * acquire rtnl mutex
 * bump refcount
 * release mutex
 * start the dump

... and make unregister_all remove the callbacks (no new dumps possible)
and then wait until refcount is 0.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Florian Westphal b97bac64a5 rtnetlink: make rtnl_register accept a flags parameter
This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.

This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Florian Westphal e1fa6d216d rtnetlink: call rtnl_calcit directly
There is only a single place in the kernel that regisers the "calcit"
callback (to determine min allocation for dumps).

This is in rtnetlink.c for PF_UNSPEC RTM_GETLINK.
The function that checks for calcit presence at run time will first check
the requested family (which will always fail for !PF_UNSPEC as no callsite
registers this), then falls back to checking PF_UNSPEC.

Therefore we can just check if type is RTM_GETLINK and then do a direct
call.  Because of fallback to PF_UNSPEC all RTM_GETLINK types used this
regardless of family.

This has the advantage that we don't need to allocate space for
the function pointer for all the other families.

A followup patch will drop the calcit function pointer from the
rtnl_link callback structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:57:38 -07:00
Daniel Borkmann 92b31a9af7 bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.

This has the downside of having unnecessarily increased register
pressure, meaning complex program would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.

Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
file reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).

The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)

  [1] https://github.com/borkmann/llvm/tree/bpf-insns

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:53:56 -07:00
Willem de Bruijn ccaffff182 sock: fix zerocopy panic in mem accounting
Only call mm_unaccount_pinned_pages when releasing a struct ubuf_info
that has initialized its field uarg->mmp.

Before this patch, a vhost-net with experimental_zcopytx can crash in

  mm_unaccount_pinned_pages
  sock_zerocopy_put
  skb_zcopy_clear
  skb_release_data

Only sock_zerocopy_alloc initializes this field. Move the unaccount
call from generic sock_zerocopy_put to its specific callback
sock_zerocopy_callback.

Fixes: a91dbff551 ("sock: ulimit on MSG_ZEROCOPY pages")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:49:17 -07:00
David S. Miller 3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Linus Torvalds 4530cca198 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "The pull requests are getting smaller, that's progress I suppose :-)

   1) Fix infinite loop in CIPSO option parsing, from Yujuan Qi.

   2) Fix remote checksum handling in VXLAN and GUE tunneling drivers,
      from Koichiro Den.

   3) Missing u64_stats_init() calls in several drivers, from Florian
      Fainelli.

   4) TCP can set the congestion window to an invalid ssthresh value
      after congestion window reductions, from Yuchung Cheng.

   5) Fix BPF jit branch generation on s390, from Daniel Borkmann.

   6) Correct MIPS ebpf JIT merge, from David Daney.

   7) Correct byte order test in BPF test_verifier.c, from Daniel
      Borkmann.

   8) Fix various crashes and leaks in ASIX driver, from Dean Jenkins.

   9) Handle SCTP checksums properly in mlx4 driver, from Davide
      Caratti.

  10) We can potentially enter tcp_connect() with a cached route
      already, due to fastopen, so we have to explicitly invalidate it.

  11) skb_warn_bad_offload() can bark in legitimate situations, fix from
      Willem de Bruijn"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  net: avoid skb_warn_bad_offload false positives on UFO
  qmi_wwan: fix NULL deref on disconnect
  ppp: fix xmit recursion detection on ppp channels
  rds: Reintroduce statistics counting
  tcp: fastopen: tcp_connect() must refresh the route
  net: sched: set xt_tgchk_param par.net properly in ipt_init_target
  net: dsa: mediatek: add adjust link support for user ports
  net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets
  qed: Fix a memory allocation failure test in 'qed_mcp_cmd_init()'
  hysdn: fix to a race condition in put_log_buffer
  s390/qeth: fix L3 next-hop in xmit qeth hdr
  asix: Fix small memory leak in ax88772_unbind()
  asix: Ensure asix_rx_fixup_info members are all reset
  asix: Add rx->ax_skb = NULL after usbnet_skb_return()
  bpf: fix selftest/bpf/test_pkt_md_access on s390x
  netvsc: fix race on sub channel creation
  bpf: fix byte order test in test_verifier
  xgene: Always get clk source, but ignore if it's missing for SGMII ports
  MIPS: Add missing file for eBPF JIT.
  bpf, s390: fix build for libbpf and selftest suite
  ...
2017-08-09 10:14:04 -07:00
Vincent Bernat feca7d8c13 net: ipv6: avoid overhead when no custom FIB rules are installed
If the user hasn't installed any custom rules, don't go through the
whole FIB rules layer. This is pretty similar to f4530fa574 (ipv4:
Avoid overhead when no custom FIB rules are installed).

Using a micro-benchmark module [1], timing ip6_route_output() with
get_cycles(), with 40,000 routes in the main routing table, before this
patch:

    min=606 max=12911 count=627 average=1959 95th=4903 90th=3747 50th=1602 mad=821
    table=254 avgdepth=21.8 maxdepth=39
    value │                         ┊                            count
      600 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                                         199
      880 │▒▒▒░░░░░░░░░░░░░░░░                                      43
     1160 │▒▒▒░░░░░░░░░░░░░░░░░░░░                                  48
     1440 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░                               43
     1720 │▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░                          59
     2000 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                      50
     2280 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                    26
     2560 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                  31
     2840 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░               28
     3120 │▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░              17
     3400 │▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             17
     3680 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             8
     3960 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           11
     4240 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░            6
     4520 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           6
     4800 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           9

After:

    min=544 max=11687 count=627 average=1776 95th=4546 90th=3585 50th=1227 mad=565
    table=254 avgdepth=21.8 maxdepth=39
    value │                         ┊                            count
      540 │▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒                                        201
      800 │▒▒▒▒▒░░░░░░░░░░░░░░░░                                    63
     1060 │▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░                               68
     1320 │▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░                            39
     1580 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                         32
     1840 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                       32
     2100 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                    34
     2360 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░                 33
     2620 │▒▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░               26
     2880 │▒░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░              22
     3140 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░              9
     3400 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             8
     3660 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░             9
     3920 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░            8
     4180 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           8
     4440 │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░           8

At the frequency of the host during the bench (~ 3.7 GHz), this is
about a 100 ns difference on the median value.

A next step would be to collapse local and main tables, as in
0ddcf43d5d (ipv4: FIB Local/MAIN table collapse).

[1]: https://github.com/vincentbernat/network-lab/blob/master/lab-routes-ipv6/kbench_mod.c

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08 21:40:08 -07:00
Willem de Bruijn 8d63bee643 net: avoid skb_warn_bad_offload false positives on UFO
skb_warn_bad_offload triggers a warning when an skb enters the GSO
stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
checksum offload set.

Commit b2504a5dbe ("net: reduce skb_warn_bad_offload() noise")
observed that SKB_GSO_DODGY producers can trigger the check and
that passing those packets through the GSO handlers will fix it
up. But, the software UFO handler will set ip_summed to
CHECKSUM_NONE.

When __skb_gso_segment is called from the receive path, this
triggers the warning again.

Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
Tx these two are equivalent. On Rx, this better matches the
skb state (checksum computed), as CHECKSUM_NONE here means no
checksum computed.

See also this thread for context:
http://patchwork.ozlabs.org/patch/799015/

Fixes: b2504a5dbe ("net: reduce skb_warn_bad_offload() noise")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08 21:39:01 -07:00