Commit Graph

5908 Commits

Author SHA1 Message Date
Stefano Brivio 5f2fb802ee ipv6: old_dport should be a __be16 in __ip6_datagram_connect()
Fixes: 2f987a76a9 ("net: ipv6: keep sk status consistent after datagram connect failure")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-20 12:43:43 -04:00
Pablo Neira Ayuso 20710b3b81 netfilter: ctnetlink: synproxy support
This patch exposes synproxy information per-conntrack. Moreover, send
sequence adjustment events once server sends us the SYN,ACK packet, so
we can synchronize the sequence adjustment too for packets going as
reply from the server, as part of the synproxy logic.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-20 14:39:31 +01:00
Stefano Brivio f8a554b4aa vti6: Fix dev->max_mtu setting
We shouldn't allow a tunnel to have IP_MAX_MTU as MTU, because
another IPv6 header is going on top of our packets. Without this
patch, we might end up building packets bigger than IP_MAX_MTU.

Fixes: b96f9afee4 ("ipv4/6: use core net MTU range checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Stefano Brivio 7a67e69a33 vti6: Keep set MTU on link creation or change, validate it
In vti6_link_config(), if MTU is already given on link creation
or change, validate and use it instead of recomputing it. To do
that, we need to propagate the knowledge that MTU was set by
userspace all the way down to vti6_link_config().

To keep this simple, vti6_dev_init() sets the new 'keep_mtu'
argument of vti6_link_config() to true: on initialization, we
don't have convenient access to netlink attributes there, but we
will anyway check whether dev->mtu is set in vti6_link_config().
If it's non-zero, it was set to the value of the IFLA_MTU
attribute during creation. Otherwise, determine a reasonable
value.

Fixes: ed1efb2aef ("ipv6: Add support for IPsec virtual tunnel interfaces")
Fixes: 53c81e95df ("ip6_vti: adjust vti mtu according to mtu of lower device")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Stefano Brivio c6741fbed6 vti6: Properly adjust vti6 MTU from MTU of lower device
If a lower device is found, we don't need to subtract
LL_MAX_HEADER to calculate our MTU: just use its MTU, the link
layer headers are already taken into account by it.

If the lower device is not found, start from ETH_DATA_LEN
instead, and only in this case subtract a worst-case
LL_MAX_HEADER.

We then need to subtract our additional IPv6 header from the
calculation.

While at it, note that vti6 doesn't have a hardware header, so
it doesn't need to set dev->hard_header_len. And as
vti6_link_config() now always sets the MTU, there's no need to
set a default value in vti6_dev_setup().

This makes the behaviour consistent with IPv4 vti, after
commit a32452366b ("vti4: Don't count header length twice."),
which was accidentally reverted by merge commit f895f0cfbb
("Merge branch 'master' of
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec").

While commit 53c81e95df ("ip6_vti: adjust vti mtu according to
mtu of lower device") improved on the original situation, this
was still not ideal. As reported in that commit message itself,
if we start from an underlying veth MTU of 9000, we end up with
an MTU of 8832, that is, 9000 - LL_MAX_HEADER - sizeof(ipv6hdr).
This should simply be 8880, or 9000 - sizeof(ipv6hdr) instead:
we found the lower device (veth) and we know we don't have any
additional link layer header, so there's no need to subtract an
hypothetical worst-case number.

Fixes: 53c81e95df ("ip6_vti: adjust vti mtu according to mtu of lower device")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Tonghao Zhang 1e80295158 udp: Move the udp sysctl to namespace.
This patch moves the udp_rmem_min, udp_wmem_min
to namespace and init the udp_l3mdev_accept explicitly.

The udp_rmem_min/udp_wmem_min affect udp rx/tx queue,
with this patch namespaces can set them differently.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16 12:03:30 -04:00
David Ahern 1893ff2027 net/ipv6: Add l3mdev check to ipv6_chk_addr_and_flags
Lookup the L3 master device for the passed in device. Only consider
addresses on netdev's with the same master device. If the device is
not enslaved or is NULL, then the l3mdev is NULL which means only
devices not enslaved (ie, in the default domain) are considered.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16 11:28:38 -04:00
David Ahern 232378e8db net/ipv6: Change address check to always take a device argument
ipv6_chk_addr_and_flags determines if an address is a local address and
optionally if it is an address on a specific device. For example, it is
called by ip6_route_info_create to determine if a given gateway address
is a local address. The address check currently does not consider L3
domains and as a result does not allow a route to be added in one VRF
if the nexthop points to an address in a second VRF. e.g.,

    $ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23
    Error: Invalid gateway address.

where 2001:db8:102::23 is an address on an interface in vrf r1.

ipv6_chk_addr_and_flags needs to allow callers to always pass in a device
with a separate argument to not limit the address to the specific device.
The device is used used to determine the L3 domain of interest.

To that end add an argument to skip the device check and update callers
to always pass a device where possible and use the new argument to mean
any address in the domain.

Update a handful of users of ipv6_chk_addr with a NULL dev argument. This
patch handles the change to these callers without adding the domain check.

ip6_validate_gw needs to handle 2 cases - one where the device is given
as part of the nexthop spec and the other where the device is resolved.
There is at least 1 VRF case where deferring the check to only after
the route lookup has resolved the device fails with an unintuitive error
"RTNETLINK answers: No route to host" as opposed to the preferred
"Error: Gateway can not be a local address." The 'no route to host'
error is because of the fallback to a full lookup. The check is done
twice to avoid this error.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16 11:28:38 -04:00
David Ahern 9fbb704c33 net/ipv6: Refactor gateway validation on route add
Move gateway validation code from ip6_route_info_create into
ip6_validate_gw. Code move plus adjustments to handle the potential
reset of dev and idev and to make checkpatch happy.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16 11:28:38 -04:00
David S. Miller d2ddf628e9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2018-03-13

1) Refuse to insert 32 bit userspace socket policies on 64
   bit systems like we do it for standard policies. We don't
   have a compat layer, so inserting socket policies from
   32 bit userspace will lead to a broken configuration.

2) Make the policy hold queue work without the flowcache.
   Dummy bundles are not chached anymore, so we need to
   generate a new one on each lookup as long as the SAs
   are not yet in place.

3) Fix the validation of the esn replay attribute. The
   The sanity check in verify_replay() is bypassed if
   the XFRM_STATE_ESN flag is not set. Fix this by doing
   the sanity check uncoditionally.
   From Florian Westphal.

4) After most of the dst_entry garbage collection code
   is removed, we may leak xfrm_dst entries as they are
   neither cached nor tracked somewhere. Fix this by
   reusing the 'uncached_list' to track xfrm_dst entries
   too. From Xin Long.

5) Fix a rcu_read_lock/rcu_read_unlock imbalance in
   xfrm_get_tos() From Xin Long.

6) Fix an infinite loop in xfrm_get_dst_nexthop. On
   transport mode we fetch the child dst_entry after
   we continue, so this pointer is never updated.
   Fix this by fetching it before we continue.

7) Fix ESN sequence number gap after IPsec GSO packets.
    We accidentally increment the sequence number counter
    on the xfrm_state by one packet too much in the ESN
    case. Fix this by setting the sequence number to the
    correct value.

8) Reset the ethernet protocol after decapsulation only if a
   mac header was set. Otherwise it breaks configurations
   with TUN devices. From Yossi Kuperman.

9) Fix __this_cpu_read() usage in preemptible code. Use
   this_cpu_read() instead in ipcomp_alloc_tfms().
   From Greg Hackmann.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-13 10:38:07 -04:00
Paolo Abeni 2f987a76a9 net: ipv6: keep sk status consistent after datagram connect failure
On unsuccesful ip6_datagram_connect(), if the failure is caused by
ip6_datagram_dst_update(), the sk peer information are cleared, but
the sk->sk_state is preserved.

If the socket was already in an established status, the overall sk
status is inconsistent and fouls later checks in datagram code.

Fix this saving the old peer information and restoring them in
case of failure. This also aligns ipv6 datagram connect() behavior
with ipv4.

v1 -> v2:
 - added missing Fixes tag

Fixes: 85cb73ff9b ("net: ipv6: reset daddr and dport in sk if connect() fails")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 15:10:54 -04:00
David S. Miller bbfa047a25 ipv6: Use ip6_multipath_hash_policy() in rt6_multipath_hash().
Make use of the new helper.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 11:09:33 -04:00
William Tu e41c7c68ea ip6erspan: make sure enough headroom at xmit.
The patch adds skb_cow_header() to ensure enough headroom
at ip6erspan_tunnel_xmit before pushing the erspan header
to the skb.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 13:03:57 -05:00
William Tu d6aa71197f ip6erspan: improve error handling for erspan version number.
When users fill in incorrect erspan version number through
the struct erspan_metadata uapi, current code skips pushing
the erspan header but continue pushing the gre header, which
is incorrect.  The patch fixes it by returning error.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 13:03:56 -05:00
William Tu 3b04caab81 ip6gre: add erspan v2 to tunnel lookup
The patch adds the erspan v2 proto in ip6gre_tunnel_lookup
so the erspan v2 tunnel can be found correctly.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 13:03:56 -05:00
Eric Dumazet 79134e6ce2 net: do not create fallback tunnels for non-default namespaces
fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
ip6tnl0, ip6gre0) are automatically created when the corresponding
module is loaded.

These tunnels are also automatically created when a new network
namespace is created, at a great cost.

In many cases, netns are used for isolation purposes, and these
extra network devices are a waste of resources. We are using
thousands of netns per host, and hit the netns creation/delete
bottleneck a lot. (Many thanks to Kirill for recent work on this)

Add a new sysctl so that we can opt-out from this automatic creation.

Note that these tunnels are still created for the initial namespace,
to be the least intrusive for typical setups.

Tested:
lpk43:~# cat add_del_unshare.sh
for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
done
wait

lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
lpk43:~# time ./add_del_unshare.sh

real	0m37.521s
user	0m0.886s
sys	7m7.084s
lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
lpk43:~# time ./add_del_unshare.sh

real	0m4.761s
user	0m0.851s
sys	1m8.343s
lpk43:~#

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:23:11 -05:00
Lorenzo Bianconi 9f62c15f28 ipv6: fix access to non-linear packet in ndisc_fill_redirect_hdr_option()
Fix the following slab-out-of-bounds kasan report in
ndisc_fill_redirect_hdr_option when the incoming ipv6 packet is not
linear and the accessed data are not in the linear data region of orig_skb.

[ 1503.122508] ==================================================================
[ 1503.122832] BUG: KASAN: slab-out-of-bounds in ndisc_send_redirect+0x94e/0x990
[ 1503.123036] Read of size 1184 at addr ffff8800298ab6b0 by task netperf/1932

[ 1503.123220] CPU: 0 PID: 1932 Comm: netperf Not tainted 4.16.0-rc2+ #124
[ 1503.123347] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 1503.123527] Call Trace:
[ 1503.123579]  <IRQ>
[ 1503.123638]  print_address_description+0x6e/0x280
[ 1503.123849]  kasan_report+0x233/0x350
[ 1503.123946]  memcpy+0x1f/0x50
[ 1503.124037]  ndisc_send_redirect+0x94e/0x990
[ 1503.125150]  ip6_forward+0x1242/0x13b0
[...]
[ 1503.153890] Allocated by task 1932:
[ 1503.153982]  kasan_kmalloc+0x9f/0xd0
[ 1503.154074]  __kmalloc_track_caller+0xb5/0x160
[ 1503.154198]  __kmalloc_reserve.isra.41+0x24/0x70
[ 1503.154324]  __alloc_skb+0x130/0x3e0
[ 1503.154415]  sctp_packet_transmit+0x21a/0x1810
[ 1503.154533]  sctp_outq_flush+0xc14/0x1db0
[ 1503.154624]  sctp_do_sm+0x34e/0x2740
[ 1503.154715]  sctp_primitive_SEND+0x57/0x70
[ 1503.154807]  sctp_sendmsg+0xaa6/0x1b10
[ 1503.154897]  sock_sendmsg+0x68/0x80
[ 1503.154987]  ___sys_sendmsg+0x431/0x4b0
[ 1503.155078]  __sys_sendmsg+0xa4/0x130
[ 1503.155168]  do_syscall_64+0x171/0x3f0
[ 1503.155259]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ 1503.155436] Freed by task 1932:
[ 1503.155527]  __kasan_slab_free+0x134/0x180
[ 1503.155618]  kfree+0xbc/0x180
[ 1503.155709]  skb_release_data+0x27f/0x2c0
[ 1503.155800]  consume_skb+0x94/0xe0
[ 1503.155889]  sctp_chunk_put+0x1aa/0x1f0
[ 1503.155979]  sctp_inq_pop+0x2f8/0x6e0
[ 1503.156070]  sctp_assoc_bh_rcv+0x6a/0x230
[ 1503.156164]  sctp_inq_push+0x117/0x150
[ 1503.156255]  sctp_backlog_rcv+0xdf/0x4a0
[ 1503.156346]  __release_sock+0x142/0x250
[ 1503.156436]  release_sock+0x80/0x180
[ 1503.156526]  sctp_sendmsg+0xbb0/0x1b10
[ 1503.156617]  sock_sendmsg+0x68/0x80
[ 1503.156708]  ___sys_sendmsg+0x431/0x4b0
[ 1503.156799]  __sys_sendmsg+0xa4/0x130
[ 1503.156889]  do_syscall_64+0x171/0x3f0
[ 1503.156980]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ 1503.157158] The buggy address belongs to the object at ffff8800298ab600
                which belongs to the cache kmalloc-1024 of size 1024
[ 1503.157444] The buggy address is located 176 bytes inside of
                1024-byte region [ffff8800298ab600, ffff8800298aba00)
[ 1503.157702] The buggy address belongs to the page:
[ 1503.157820] page:ffffea0000a62a00 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
[ 1503.158053] flags: 0x4000000000008100(slab|head)
[ 1503.158171] raw: 4000000000008100 0000000000000000 0000000000000000 00000001800e000e
[ 1503.158350] raw: dead000000000100 dead000000000200 ffff880036002600 0000000000000000
[ 1503.158523] page dumped because: kasan: bad access detected

[ 1503.158698] Memory state around the buggy address:
[ 1503.158816]  ffff8800298ab900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1503.158988]  ffff8800298ab980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1503.159165] >ffff8800298aba00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1503.159338]                    ^
[ 1503.159436]  ffff8800298aba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1503.159610]  ffff8800298abb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1503.159785] ==================================================================
[ 1503.159964] Disabling lock debugging due to kernel taint

The test scenario to trigger the issue consists of 4 devices:
- H0: data sender, connected to LAN0
- H1: data receiver, connected to LAN1
- GW0 and GW1: routers between LAN0 and LAN1. Both of them have an
  ethernet connection on LAN0 and LAN1
On H{0,1} set GW0 as default gateway while on GW0 set GW1 as next hop for
data from LAN0 to LAN1.
Moreover create an ip6ip6 tunnel between H0 and H1 and send 3 concurrent
data streams (TCP/UDP/SCTP) from H0 to H1 through ip6ip6 tunnel (send
buffer size is set to 16K). While data streams are active flush the route
cache on HA multiple times.
I have not been able to identify a given commit that introduced the issue
since, using the reproducer described above, the kasan report has been
triggered from 4.14 and I have not gone back further.

Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:16:19 -05:00
Kirill Tkhai 1fd2c55705 net: Convet ipv6_net_ops
These pernet_operations are similar to ipv4_net_ops.
They are safe to be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-08 12:36:45 -05:00
Kirill Tkhai 997266a4a0 net: Convert ip6 tables pernet_operations
The pernet_operations:

    ip6table_filter_net_ops
    ip6table_mangle_net_ops
    ip6table_nat_net_ops
    ip6table_raw_net_ops
    ip6table_security_net_ops

have exit methods, which call ip6t_unregister_table().
ip6table_filter_net_ops has init method registering
filter table.

Since there must not be in-flight ipv6 packets at the time
of pernet_operations execution and since pernet_operations
don't send ipv6 packets each other, these pernet_operations
are safe to be async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-08 12:36:43 -05:00
Eric Dumazet a366e300ae ip6mr: remove synchronize_rcu() in favor of SOCK_RCU_FREE
Kirill found that recently added synchronize_rcu() call in
ip6mr_sk_done()
was slowing down netns dismantle and posted a patch to use it only if
the socket
was found.

I instead suggested to get rid of this call, and use instead
SOCK_RCU_FREE

We might later change IPv4 side to use the same technique and unify
both stacks. IPv4 does not use synchronize_rcu() but has a call_rcu()
that could be replaced by SOCK_RCU_FREE.

Tested:
 time for i in {1..1000}; do unshare -n /bin/false;done

 Before : real 7m18.911s
 After : real 10.187s

Fixes: 8571ab479a ("ip6mr: Make mroute_sk rcu-based")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Yuval Mintz <yuvalm@mellanox.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 18:13:41 -05:00
Stefano Brivio e9fa1495d7 ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes
Currently, administrative MTU changes on a given netdevice are
not reflected on route exceptions for MTU-less routes, with a
set PMTU value, for that device:

 # ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a proto kernel src 2001:db8::a metric 256 pref medium
 # ping6 -c 1 -q -s10000 2001:db8::b > /dev/null
 # ip netns exec a ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
     cache expires 571sec mtu 4926 pref medium
 # ip link set dev vti_a mtu 3000
 # ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
     cache expires 571sec mtu 4926 pref medium
 # ip link set dev vti_a mtu 9000
 # ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
     cache expires 571sec mtu 4926 pref medium

The first issue is that since commit fb56be83e4 ("net-ipv6: on
device mtu change do not add mtu to mtu-less routes") we don't
call rt6_exceptions_update_pmtu() from rt6_mtu_change_route(),
which handles administrative MTU changes, if the regular route
is MTU-less.

However, PMTU exceptions should be always updated, as long as
RTAX_MTU is not locked. Keep the check for MTU-less main route,
as introduced by that commit, but, for exceptions,
call rt6_exceptions_update_pmtu() regardless of that check.

Once that is fixed, one problem remains: MTU changes are not
reflected if the new MTU is higher than the previous one,
because rt6_exceptions_update_pmtu() doesn't allow that. We
should instead allow PMTU increase if the old PMTU matches the
local MTU, as that implies that the old MTU was the lowest in the
path, and PMTU discovery might lead to different results.

The existing check in rt6_mtu_change_route() correctly took that
case into account (for regular routes only), so factor it out
and re-use it also in rt6_exceptions_update_pmtu().

While at it, fix comments style and grammar, and try to be a bit
more descriptive.

Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: fb56be83e4 ("net-ipv6: on device mtu change do not add mtu to mtu-less routes")
Fixes: f5bbe7ee79 ("ipv6: prepare rt6_mtu_change() for exception table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 13:17:50 -05:00
Gustavo A. R. Silva 9a21ac9432 ipv6: ndisc: use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 12:19:47 -05:00
Yossi Kuperman 87cdf3148b xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_proto
Artem Savkov reported that commit 5efec5c655 leads to a packet loss under
IPSec configuration. It appears that his setup consists of a TUN device,
which does not have a MAC header.

Make sure MAC header exists.

Note: TUN device sets a MAC header pointer, although it does not have one.

Fixes: 5efec5c655 ("xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version")
Reported-by: Artem Savkov <artem.savkov@gmail.com>
Tested-by: Artem Savkov <artem.savkov@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-07 10:54:29 +01:00
David S. Miller 0f3e9c97eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All of the conflicts were cases of overlapping changes.

In net/core/devlink.c, we have to make care that the
resouce size_params have become a struct member rather
than a pointer to such an object.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 01:20:46 -05:00
Florian Westphal 0d7df906a0 netfilter: x_tables: ensure last rule in base chain matches underflow/policy
Harmless from kernel point of view, but again iptables assumes that
this is true when decoding ruleset coming from kernel.

If a (syzkaller generated) ruleset doesn't have the underflow/policy
stored as the last rule in the base chain, then iptables will abort()
because it doesn't find the chain policy.

libiptc assumes that the policy is the last rule in the basechain, which
is only true for iptables-generated rulesets.

Unfortunately this needs code duplication -- the functions need the
struct layout of the rule head, but that is different for
ip/ip6/arptables.

NB: pr_warn could be pr_debug but in case this break rulesets somehow its
useful to know why blob was rejected.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05 23:15:44 +01:00
Florian Westphal 9782a11efc netfilter: compat: prepare xt_compat_init_offsets to return errors
should have no impact, function still always returns 0.
This patch is only to ease review.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05 23:15:43 +01:00
Florian Westphal c84ca954ac netfilter: x_tables: add counters allocation wrapper
allows to have size checks in a single spot.
This is supposed to reduce oom situations when fuzz-testing xtables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05 23:15:43 +01:00
Florian Westphal 1b293e30f7 netfilter: x_tables: move hook entry checks into core
Allow followup patch to change on location instead of three.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05 23:15:43 +01:00
Florian Westphal 07a9da51b4 netfilter: x_tables: check standard verdicts in core
Userspace must provide a valid verdict to the standard target.

The verdict can be either a jump (signed int > 0), or a return code.

Allowed return codes are either RETURN (pop from stack), NF_ACCEPT, DROP
and QUEUE (latter is allowed for legacy reasons).

Jump offsets (verdict > 0) are checked in more detail later on when
loop-detection is performed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05 23:15:43 +01:00
Xin Long f31e5f1a89 netfilter: unlock xt_table earlier in __do_replace
Now it's doing cleanup_entry for oldinfo under the xt_table lock,
but it's not really necessary. After the replacement job is done
in xt_replace_table, oldinfo is not used elsewhere any more, and
it can be freed without xt_table lock safely.

The important thing is that rtnl_lock is called in some xt_target
destroy, which means rtnl_lock, a big lock is used in xt_table
lock, a smaller one. It usually could be the reason why a dead
lock may happen.

Besides, all xt_target/match checkentry is called out of xt_table
lock. It's better also to move all cleanup_entry calling out of
xt_table lock, just as do_replace_finish does for ebtables.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05 23:15:43 +01:00
Kirill Tkhai c60a246cd3 net: Convert arp_tables_net_ops and ip6_tables_net_ops
These pernet_operations call xt_proto_init() and xt_proto_fini(),
which just register and unregister /proc entries.
They are safe to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-05 10:48:27 -05:00
Kirill Tkhai 3822034569 net: Convert log pernet_operations
These pernet_operations use nf_log_set() and nf_log_unset()
in their methods:

	nf_log_bridge_net_ops
	nf_log_arp_net_ops
	nf_log_ipv4_net_ops
	nf_log_ipv6_net_ops
	nf_log_netdev_net_ops

Nobody can send such a packet to a net before it's became
registered, nobody can send a packet after all netdevices
are unregistered. So, these pernet_operations are able
to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-05 10:48:27 -05:00
William Tu 77a5196a80 gre: add sequence number for collect md mode.
Currently GRE sequence number can only be used in native
tunnel mode.  This patch adds sequence number support for
gre collect metadata mode.  RFC2890 defines GRE sequence
number to be specific to the traffic flow identified by the
key.  However, this patch does not implement per-key seqno.
The sequence number is shared in the same tunnel device.
That is, different tunnel keys using the same collect_md
tunnel share single sequence number.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 18:35:02 -05:00
Daniel Axtens 80f5974d15 net: xfrm: use skb_gso_validate_network_len() to check gso sizes
Replace skb_gso_network_seglen() with
skb_gso_validate_network_len(), as it considers the GSO_BY_FRAGS
case.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
Daniel Axtens 779b7931b2 net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
If you take a GSO skb, and split it into packets, will the network
length (L3 headers + L4 headers + payload) of those packets be small
enough to fit within a given MTU?

skb_gso_validate_mtu gives you the answer to that question. However,
we recently added to add a way to validate the MAC length of a split GSO
skb (L2+L3+L4+payload), and the names get confusing, so rename
skb_gso_validate_mtu to skb_gso_validate_network_len

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
David Ahern b4bac172e9 net/ipv6: Add support for path selection using hash of 5-tuple
Some operators prefer IPv6 path selection to use a standard 5-tuple
hash rather than just an L3 hash with the flow the label. To that end
add support to IPv6 for multipath hash policy similar to bf4e0a3db9
("net: ipv4: add support for ECMP hash policy choice"). The default
is still L3 which covers source and destination addresses along with
flow label and IPv6 protocol.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 13:04:23 -05:00
David Ahern b75cc8f90f net/ipv6: Pass skb to route lookup
IPv6 does path selection for multipath routes deep in the lookup
functions. The next patch adds L4 hash option and needs the skb
for the forward path. To get the skb to the relevant FIB lookup
functions it needs to go through the fib rules layer, so add a
lookup_data argument to the fib_lookup_arg struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 13:04:22 -05:00
David Ahern 9a2a537acc net/ipv6: Make rt6_multipath_hash similar to fib_multipath_hash
Make rt6_multipath_hash more of a direct parallel to fib_multipath_hash
and reduce stack and overhead in the process: get_hash_from_flowi6 is
just a wrapper around __get_hash_from_flowi6 with another stack
allocation for flow_keys. Move setting the addresses, protocol and
label into rt6_multipath_hash and allow it to make the call to
flow_hash_from_keys.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 13:04:21 -05:00
David Ahern 6f74b6c259 net: Align ip_multipath_l3_keys and ip6_multipath_l3_keys
Symmetry is good and allows easy comparison that ipv4 and ipv6 are
doing the same thing. To that end, change ip_multipath_l3_keys to
set addresses at the end after the icmp compares, and move the
initialization of ipv6 flow keys to rt6_multipath_hash.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 13:04:21 -05:00
David S. Miller 4a0c7191c7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Put back reference on CLUSTERIP configuration structure from the
   error path, patch from Florian Westphal.

2) Put reference on CLUSTERIP configuration instead of freeing it,
   another cpu may still be walking over it, also from Florian.

3) Refetch pointer to IPv6 header from nf_nat_ipv6_manip_pkt() given
   packet manipulation may reallocation the skbuff header, from Florian.

4) Missing match size sanity checks in ebt_among, from Florian.

5) Convert BUG_ON to WARN_ON in ebtables, from Florian.

6) Sanity check userspace offsets from ebtables kernel, from Florian.

7) Missing checksum replace call in flowtable IPv4 DNAT, from Felix
   Fietkau.

8) Bump the right stats on checksum error from bridge netfilter,
   from Taehee Yoo.

9) Unset interface flag in IPv6 fib lookups otherwise we get
   misleading routing lookup results, from Florian.

10) Missing sk_to_full_sk() in ip6_route_me_harder() from Eric Dumazet.

11) Don't allow devices to be part of multiple flowtables at the same
    time, this may break setups.

12) Missing netlink attribute validation in flowtable deletion.

13) Wrong array index in nf_unregister_net_hook() call from error path
    in flowtable addition path.

14) Fix FTP IPVS helper when NAT mangling is in place, patch from
    Julian Anastasov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-02 20:32:15 -05:00
Sabrina Dubroca f1c02cfb7b ipv6: allow userspace to add IFA_F_OPTIMISTIC addresses
According to RFC 4429 (section 3.1), adding new IPv6 addresses as
optimistic addresses is acceptable, as long as the implementation
follows some rules:

   * Optimistic DAD SHOULD only be used when the implementation is aware
        that the address is based on a most likely unique interface
        identifier (such as in [RFC2464]), generated randomly [RFC3041],
        or by a well-distributed hash function [RFC3972] or assigned by
        Dynamic Host Configuration Protocol for IPv6 (DHCPv6) [RFC3315].
        Optimistic DAD SHOULD NOT be used for manually entered
        addresses.

Thus, it seems reasonable to allow userspace to set the optimistic flag
when adding new addresses.

We must not let userspace set NODAD + OPTIMISTIC, since if the kernel is
not performing DAD we would never clear the optimistic flag. We must
also ignore userspace's request to add OPTIMISTIC flag to addresses that
have already completed DAD (addresses that don't have the TENTATIVE
flag, or that have the DADFAILED flag).

Then we also need to clear the OPTIMISTIC flag on permanent addresses
when DAD fails. Otherwise, IFA_F_OPTIMISTIC addresses added by userspace
can still be used after DAD has failed, because in
ipv6_chk_addr_and_flags(), IFA_F_OPTIMISTIC overrides IFA_F_TENTATIVE.

Setting IFA_F_OPTIMISTIC from userspace is conditional on
CONFIG_IPV6_OPTIMISTIC_DAD and the optimistic_dad sysctl.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:43:06 -05:00
Yuval Mintz 7b0db85737 ipmr, ip6mr: Unite dumproute flows
The various MFC entries are being held in the same kind of mr_tables
for both ipmr and ip6mr, and their traversal logic is identical.
Also, with the exception of the addresses [and other small tidbits]
the major bulk of the nla setting is identical.

Unite as much of the dumping as possible between the two.
Notice this requires creating an mr_table iterator for each, as the
for-each preprocessor macro can't be used by the common logic.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 889cd83cbe ip6mr: Remove MFC_NOTIFY and refactor flags
MFC_NOTIFY exists in ip6mr, probably as some legacy code
[was already removed for ipmr in commit
06bd6c0370 ("net: ipmr: remove unused MFC_NOTIFY flag and make the flags enum").
Remove it from ip6mr as well, and move the enum into a common file;
Notice MFC_OFFLOAD is currently only used by ipmr.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 3feda6b46f ipmr, ip6mr: Unite vif seq functions
Same as previously done with the mfc seq, the logic for the vif seq is
refactored to be shared between ipmr and ip6mr.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz c8d6196803 ipmr, ip6mr: Unite mfc seq logic
With the exception of the final dump, ipmr and ip6mr have the exact same
seq logic for traversing a given mr_table. Refactor that code and make
it common.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 845c9a7ae7 ipmr, ip6mr: Unite logic for searching in MFC cache
ipmr and ip6mr utilize the exact same methods for searching the
hashed resolved connections, difference being only in the construction
of the hash comparison key.

In order to unite the flow, introduce an mr_table operation set that
would contain the protocol specific information required for common
flows, in this case - the hash parameters and a comparison key
representing a (*,*) route.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 494fff5637 ipmr, ip6mr: Make mfc_cache a common structure
mfc_cache and mfc6_cache are almost identical - the main difference is
in the origin/group addresses and comparison-key. Make a common
structure encapsulating most of the multicast routing logic  - mr_mfc
and convert both ipmr and ip6mr into using it.

For easy conversion [casting, in this case] mr_mfc has to be the first
field inside every multicast routing abstraction utilizing it.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 0bbbf0e7d0 ipmr, ip6mr: Unite creation of new mr_table
Now that both ipmr and ip6mr are using the same mr_table structure,
we can have a common function to allocate & initialize a new instance.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz b70432f731 mroute*: Make mr_table a common struct
Following previous changes to ip6mr, mr_table and mr6_table are
basically the same [up to mr6_table having additional '6' suffixes to
its variable names].
Move the common structure definition into a common header; This
requires renaming all references in ip6mr to variables that had the
distinct suffix.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 87c418bf13 ip6mr: Align hash implementation to ipmr
Since commit 8fb472c09b ("ipmr: improve hash scalability") ipmr has
been using rhashtable as a basis for its mfc routes, but ip6mr is
currently still using the old private MFC hash implementation.

Align ip6mr to the current ipmr implementation.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 8571ab479a ip6mr: Make mroute_sk rcu-based
In ipmr the mr_table socket is handled under RCU. Introduce the same
for ip6mr.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Yuval Mintz 6853f21f76 ipmr,ipmr6: Define a uniform vif_device
The two implementations have almost identical structures - vif_device and
mif_device. As a step toward uniforming the mr_tables, eliminate the
mif_device and relocate the vif_device definition into a new common
header file.

Also, introduce a common initializing function for setting most of the
vif_device fields in a new common source file. This requires modifying
the ipv{4,6] Kconfig and ipv4 makefile as we're introducing a new common
config option - CONFIG_IP_MROUTE_COMMON.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-01 13:13:23 -05:00
Roopa Prabhu 5e5d6fed37 ipv6: route: dissect flow in input path if fib rules need it
Dissect flow in fwd path if fib rules require it. Controlled by
a flag to avoid penatly for the common case. Flag is set when fib
rules with sport, dport and proto match that require flow dissect
are installed. Also passes the dissected hash keys to the multipath
hash function when applicable to avoid dissecting the flow again.
icmp packets will continue to use inner header for hash
calculations.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28 22:44:44 -05:00
Roopa Prabhu bb0ad1987e ipv6: fib6_rules: support for match on sport, dport and ip proto
support to match on src port, dst port and ip protocol.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28 22:44:43 -05:00
Stephen Hemminger 82695b30ff inet: whitespace cleanup
Ran simple script to find/remove trailing whitespace and blank lines
at EOF because that kind of stuff git whines about and editors leave
behind.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28 11:43:28 -05:00
Petr Machata d1b2a6c4be net: GRE: Add is_gretap_dev, is_ip6gretap_dev
Determining whether a device is a GRE device is easily done by
inspecting struct net_device.type. However, for the tap variants, the
type is just ARPHRD_ETHER.

Therefore introduce two predicate functions that use netdev_ops to tell
the tap devices.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:46:26 -05:00
Xin Long 2b3957c34b sit: fix IFLA_MTU ignored on NEWLINK
Commit 128bb975dc ("ip6_gre: init dev->mtu and dev->hard_header_len
correctly") fixed IFLA_MTU ignored on NEWLINK for ip6_gre. The same
mtu fix is also needed for sit.

Note that dev->hard_header_len setting for sit works fine, no need to
fix it. sit is actually ipv4 tunnel, it can't call ip6_tnl_change_mtu
to set mtu.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:36:28 -05:00
Xin Long a6aa804462 ip6_tunnel: fix IFLA_MTU ignored on NEWLINK
Commit 128bb975dc ("ip6_gre: init dev->mtu and dev->hard_header_len
correctly") fixed IFLA_MTU ignored on NEWLINK for ip6_gre. The same
mtu fix is also needed for ip6_tunnel.

Note that dev->hard_header_len setting for ip6_tunnel works fine,
no need to fix it.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 14:36:28 -05:00
Kirill Tkhai 9532ce17f7 net: Convert defrag6_net_ops
These pernet_operations only unregister nf hooks.
So, they are able to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:39 -05:00
Kirill Tkhai afd7b3eb13 net: Convert ila_net_ops
These pernet_operations register and unregister nf hooks.
Also they populate and depopulate ila_net_id-pointed hash
table. The table is changed by hooks during skb processing
and via netlink request. It looks impossible for another
net pernet_operations to force the table reading or writing,
so, they are able to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:39 -05:00
Kirill Tkhai 989d9812b7 net: Convert sit_net_ops
These pernet_operations are similar to ip6_tnl_net_ops. Exit method
unregisters all net sit devices, and it looks like another
pernet_operations are not interested in foreign net sit list.
Init method registers netdevice. So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai 5ecc29550a net: Convert vti6_net_ops
These pernet_operations are similar to ip6_tnl_net_ops. Exit method
unregisters all net vti6 tunnels, and it looks like another
pernet_operations are not interested in foreign net vti6 list.
Init method registers netdevice. So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai 66997ba083 net: Convert ip6_tnl_net_ops
These pernet_operations are similar to ip6gre_net_ops. Exit method
unregisters all net ip6_tnl tunnels, and it looks like another
pernet_operations are not interested in foreign net tunnels list.
So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai 5c155c5024 net: Convert ip6gre_net_ops
These pernet_operations are similar to bond_net_ops. Exit method
unregisters all net ip6gre devices, and it looks like another
pernet_operations are not interested in foreign net ip6gre list
or net_generic()->tunnels_wc. Init method registers net device.
So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai 02df428ca2 net: Convert simple pernet_operations
These pernet_operations make pretty simple actions
like variable initialization on init, debug checks
on exit, and so on, and they obviously are able
to be executed in parallel with any others:

vrf_net_ops
lockd_net_ops
grace_net_ops
xfrm6_tunnel_net_ops
kcm_net_ops
tcf_net_ops

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
Shannon Nelson 5211fcfb81 esp: check the NETIF_F_HW_ESP_TX_CSUM bit before segmenting
If I understand correctly, we should not be asking for a
checksum offload on an ipsec packet if the netdev isn't
advertising NETIF_F_HW_ESP_TX_CSUM.  In that case, we should
clear the NETIF_F_CSUM_MASK bits.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-02-27 10:46:01 +01:00
Eric Dumazet 7d98386d55 netfilter: use skb_to_full_sk in ip6_route_me_harder
For some reason, Florian forgot to apply to ip6_route_me_harder
the fix that went in commit 29e09229d9 ("netfilter: use
skb_to_full_sk in ip_route_me_harder")

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener") 
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-25 20:51:13 +01:00
Florian Westphal 47b7e7f828 netfilter: don't set F_IFACE on ipv6 fib lookups
"fib" starts to behave strangely when an ipv6 default route is
added - the FIB lookup returns a route using 'oif' in this case.

This behaviour was inherited from ip6tables rpfilter so change
this as well.

Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1221
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-25 20:15:53 +01:00
Florian Westphal b078556aec netfilter: ipv6: fix use-after-free Write in nf_nat_ipv6_manip_pkt
l4proto->manip_pkt() can cause reallocation of skb head so pointer
to the ipv6 header must be reloaded.

Reported-and-tested-by: <syzbot+10005f4292fc9cc89de7@syzkaller.appspotmail.com>
Fixes: 58a317f106 ("netfilter: ipv6: add IPv6 NAT support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-25 20:03:54 +01:00
David S. Miller f74290fdb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00
Arnd Bergmann ca79bec237 ipv6 sit: work around bogus gcc-8 -Wrestrict warning
gcc-8 has a new warning that detects overlapping input and output arguments
in memcpy(). It triggers for sit_init_net() calling ipip6_tunnel_clone_6rd(),
which is actually correct:

net/ipv6/sit.c: In function 'sit_init_net':
net/ipv6/sit.c:192:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict]

The problem here is that the logic detecting the memcpy() arguments finds them
to be the same, but the conditional that tests for the input and output of
ipip6_tunnel_clone_6rd() to be identical is not a compile-time constant.

We know that netdev_priv(t->dev) is the same as t for a tunnel device,
and comparing "dev" directly here lets the compiler figure out as well
that 'dev == sitn->fb_tunnel_dev' when called from sit_init_net(), so
it no longer warns.

This code is old, so Cc stable to make sure that we don't get the warning
for older kernels built with new gcc.

Cc: Martin Sebor <msebor@gmail.com>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83456
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23 10:53:26 -05:00
David S. Miller 943a0d4a9b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains large batch with Netfilter fixes for
your net tree, mostly due to syzbot report fixups and pr_err()
ratelimiting, more specifically, they are:

1) Get rid of superfluous unnecessary check in x_tables before vmalloc(),
   we don't hit BUG there anymore, patch from Michal Hock, suggested by
   Andrew Morton.

2) Race condition in proc file creation in ipt_CLUSTERIP, from Cong Wang.

3) Drop socket lock that results in circular locking dependency, patch
   from Paolo Abeni.

4) Drop packet if case of malformed blob that makes backpointer jump
   in x_tables, from Florian Westphal.

5) Fix refcount leak due to race in ipt_CLUSTERIP in
   clusterip_config_find_get(), from Cong Wang.

6) Several patches to ratelimit pr_err() for x_tables since this can be
   a problem where CAP_NET_ADMIN semantics can protect us in untrusted
   namespace, from Florian Westphal.

7) Missing .gitignore update for new autogenerated asn1 state machine
   for the SNMP NAT helper, from Zhu Lingshan.

8) Missing timer initialization in xt_LED, from Paolo Abeni.

9) Do not allow negative port range in NAT, also from Paolo.

10) Lock imbalance in the xt_hashlimit rate match mode, patch from
    Eric Dumazet.

11) Initialize workqueue before timer in the idletimer match,
    from Eric Dumazet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:49:55 -05:00
David S. Miller f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Kirill Tkhai 4d6b80762b net: Convert ip_tables_net_ops, udplite6_net_ops and xt_net_ops
ip_tables_net_ops and udplite6_net_ops create and destroy /proc entries.
xt_net_ops does nothing.

So, we are able to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:12 -05:00
Kirill Tkhai 5fc094f5b8 net: Convert ip6_frags_ops
Exit methods calls inet_frags_exit_net() with global ip6_frags
as argument. So, after we make the pernet_operations async,
a pair of exit methods may be called to iterate this hash table.
Since there is inet_frag_worker(), which already may work
in parallel with inet_frags_exit_net(), and it can make the same
cleanup, that inet_frags_exit_net() does, it's safe. So we may
mark these pernet_operations as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:11 -05:00
Kirill Tkhai d16784d9fb net: Convert fib6_net_ops, ipv6_addr_label_ops and ip6_segments_ops
These pernet_operations register and unregister tables
and lists for packets forwarding. All of the entities
are per-net. Init methods makes simple initializations,
and since net is not visible for foreigners at the time
it is working, it can't race with anything. Exit method
is executed when there are only local devices, and there
mustn't be packets in-flight. Also, it looks like no one
pernet_operations want to send ipv6 packets to foreign
net. The same reasons are for ipv6_addr_label_ops and
ip6_segments_ops. So, we are able to mark all them as
async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:11 -05:00
Kirill Tkhai b489141369 net: Convert xfrm6_net_ops
These pernet_operations create sysctl tables and
initialize net::xfrm.xfrm6_dst_ops used for routing.
It doesn't look like another pernet_operations send
ipv6 packets to foreign net namespaces, so it should
be safe to mark the pernet_operations as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:11 -05:00
Kirill Tkhai a7852a76f4 net: Convert ip6_flowlabel_net_ops
These pernet_operations create and destroy /proc entries.
ip6_fl_purge() makes almost the same actions as timer
ip6_fl_gc_timer does, and as it can be executed in parallel
with ip6_fl_purge(), two parallel ip6_fl_purge() may be
executed. So, we can mark it async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:11 -05:00
Kirill Tkhai ac34cb6c0c net: Convert ping_v6_net_ops
These pernet_operations only register and unregister /proc
entries, so it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:11 -05:00
Kirill Tkhai 58708caef5 net: Convert ipv6_sysctl_net_ops
These pernet_operations create and destroy sysctl tables.
They are not touched by another net pernet_operations.
So, it's possible to execute them in parallel with others.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:11 -05:00
Kirill Tkhai fef65a2c6c net: Convert tcpv6_net_ops
These pernet_operations create and destroy net::ipv6.tcp_sk
socket, which is used in tcp_v6_send_response() only. It looks
like foreign pernet_operations don't want to set ipv6 connection
inside destroyed net, so this socket may be created in destroyed
in parallel with anything else. inet_twsk_purge() is also safe
for that, as described in patch for tcp_sk_ops. So, it's possible
to mark them as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:10 -05:00
Kirill Tkhai 7b7dd180b8 net: Convert fib6_rules_net_ops
These pernet_operations register and unregister
net::ipv6.fib6_rules_ops, which are used for
routing. It looks like there are no pernet_operations,
which send ipv6 packages to another net, so we
are able to mark them as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:10 -05:00
Kirill Tkhai 85ca51b2a2 net: Convert ipv6_inetpeer_ops
net->ipv6.peers is dereferenced in three places via inet_getpeer_v6(),
and it's used to handle skb. All the users of inet_getpeer_v6() do not
look like be able to be called from foreign net pernet_operations, so
we may mark them as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:10 -05:00
Kirill Tkhai 509114112d net: Convert raw6_net_ops, udplite6_net_ops, ipv6_proc_ops, if6_proc_net_ops and ip6_route_net_late_ops
These pernet_operations create and destroy /proc entries
and safely may be converted and safely may be mark as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:10 -05:00
Kirill Tkhai 1a2e93329d net: Convert icmpv6_sk_ops, ndisc_net_ops and igmp6_net_ops
These pernet_operations create and destroy net::ipv6.icmp_sk
socket, used to send ICMP or error reply.

Nobody can dereference the socket to handle a packet before
net is initialized, as there is no routing; nobody can do
that in parallel with exit, as all of devices are moved
to init_net or destroyed and there are no packets it-flight.
So, it's possible to mark these pernet_operations as async.

The same for ndisc_net_ops and for igmp6_net_ops. The last
one also creates and destroys /proc entries.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:10 -05:00
Kirill Tkhai b01a59a488 net: Convert ip6mr_net_ops
These pernet_operations create and destroy /proc entries,
populate and depopulate net::rules_ops and multiroute table.
All the structures are pernet, and they are not touched
by foreign net pernet_operations. So, it's possible to mark
them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:10 -05:00
Kirill Tkhai 753d525a08 net: Convert inet6_net_ops
init method initializes sysctl defaults, allocates
percpu arrays and creates /proc entries.
exit method reverts the above.

There are no pernet_operations, which are interested
in the above entities of foreign net namespace, so
inet6_net_ops are able to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:09 -05:00
Alexey Kodanev 15f35d49c9 udplite: fix partial checksum initialization
Since UDP-Lite is always using checksum, the following path is
triggered when calculating pseudo header for it:

  udp4_csum_init() or udp6_csum_init()
    skb_checksum_init_zero_check()
      __skb_checksum_validate_complete()

The problem can appear if skb->len is less than CHECKSUM_BREAK. In
this particular case __skb_checksum_validate_complete() also invokes
__skb_checksum_complete(skb). If UDP-Lite is using partial checksum
that covers only part of a packet, the function will return bad
checksum and the packet will be dropped.

It can be fixed if we skip skb_checksum_init_zero_check() and only
set the required pseudo header checksum for UDP-Lite with partial
checksum before udp4_csum_init()/udp6_csum_init() functions return.

Fixes: ed70fcfcee ("net: Call skb_checksum_init in IPv4")
Fixes: e4f45b7f40 ("net: Call skb_checksum_init in IPv6")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-16 15:57:42 -05:00
Xin Long 510c321b55 xfrm: reuse uncached_list to track xdsts
In early time, when freeing a xdst, it would be inserted into
dst_garbage.list first. Then if it's refcnt was still held
somewhere, later it would be put into dst_busy_list in
dst_gc_task().

When one dev was being unregistered, the dev of these dsts in
dst_busy_list would be set with loopback_dev and put this dev.
So that this dev's removal wouldn't get blocked, and avoid the
kmsg warning:

  kernel:unregister_netdevice: waiting for veth0 to become \
  free. Usage count = 2

However after Commit 52df157f17 ("xfrm: take refcnt of dst
when creating struct xfrm_dst bundle"), the xdst will not be
freed with dst gc, and this warning happens.

To fix it, we need to find these xdsts that are still held by
others when removing the dev, and free xdst's dev and set it
with loopback_dev.

But unfortunately after flow_cache for xfrm was deleted, no
list tracks them anymore. So we need to save these xdsts
somewhere to release the xdst's dev later.

To make this easier, this patch is to reuse uncached_list to
track xdsts, so that the dev refcnt can be released in the
event NETDEV_UNREGISTER process of fib_netdev_notifier.

Thanks to Florian, we could move forward this fix quickly.

Fixes: 52df157f17 ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
Reported-by: Jianlin Shi <jishi@redhat.com>
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Tested-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-02-16 07:03:33 +01:00
Florian Westphal b26066447b netfilter: x_tables: use pr ratelimiting in all remaining spots
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-14 21:05:38 +01:00
Florian Westphal cc48baefdf netfilter: x_tables: rate-limit table mismatch warnings
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-14 21:05:36 +01:00
Florian Westphal 57ebd808a9 netfilter: add back stackpointer size checks
The rationale for removing the check is only correct for rulesets
generated by ip(6)tables.

In iptables, a jump can only occur to a user-defined chain, i.e.
because we size the stack based on number of user-defined chains we
cannot exceed stack size.

However, the underlying binary format has no such restriction,
and the validation step only ensures that the jump target is a
valid rule start point.

IOW, its possible to build a rule blob that has no user-defined
chains but does contain a jump.

If this happens, no jump stack gets allocated and crash occurs
because no jumpstack was allocated.

Fixes: 7814b6ec6d ("netfilter: xtables: don't save/restore jumpstack offset")
Reported-by: syzbot+e783f671527912cd9403@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-14 20:47:41 +01:00
Paolo Abeni 01ea306f2a netfilter: drop outermost socket lock in getsockopt()
The Syzbot reported a possible deadlock in the netfilter area caused by
rtnl lock, xt lock and socket lock being acquired with a different order
on different code paths, leading to the following backtrace:
Reviewed-by: Xin Long <lucien.xin@gmail.com>

======================================================
WARNING: possible circular locking dependency detected
4.15.0+ #301 Not tainted
------------------------------------------------------
syzkaller233489/4179 is trying to acquire lock:
  (rtnl_mutex){+.+.}, at: [<0000000048e996fd>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

but task is already holding lock:
  (&xt[i].mutex){+.+.}, at: [<00000000328553a2>]
xt_find_table_lock+0x3e/0x3e0 net/netfilter/x_tables.c:1041

which lock already depends on the new lock.
===

Since commit 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock
only in the required scope"), we already acquire the socket lock in
the innermost scope, where needed. In such commit I forgot to remove
the outer-most socket lock from the getsockopt() path, this commit
addresses the issues dropping it now.

v1 -> v2: fix bad subj, added relavant 'fixes' tag

Fixes: 22265a5c3c ("netfilter: xt_TEE: resolve oif using netdevice notifiers")
Fixes: 202f59afd4 ("netfilter: ipt_CLUSTERIP: do not hold dev")
Fixes: 3f34cfae1230 ("netfilter: on sockopt() acquire sock lock only in the required scope")
Reported-by: syzbot+ddde1c7b7ff7442d7f2d@syzkaller.appspotmail.com
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-02-14 20:44:42 +01:00
David Ahern 9942895b5e net: Move ipv4 set_lwt_redirect helper to lwtunnel
IPv4 uses set_lwt_redirect to set the lwtunnel redirect functions as
needed. Move it to lwtunnel.h as lwtunnel_set_redirect and change
IPv6 to also use it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:43:32 -05:00
Eric Dumazet e0f9759f53 tcp: try to keep packet if SYN_RCV race is lost
배석진 reported that in some situations, packets for a given 5-tuple
end up being processed by different CPUS.

This involves RPS, and fragmentation.

배석진 is seeing packet drops when a SYN_RECV request socket is
moved into ESTABLISH state. Other states are protected by socket lock.

This is caused by a CPU losing the race, and simply not caring enough.

Since this seems to occur frequently, we can do better and perform
a second lookup.

Note that all needed memory barriers are already in the existing code,
thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
and reqsk_put(). The second lookup must find the new socket,
unless it has already been accepted and closed by another cpu.

Note that the fragmentation could be avoided in the first place by
use of a correct TCP MSS option in the SYN{ACK} packet, but this
does not mean we can not be more robust.

Many thanks to 배석진 for a very detailed analysis.

Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:21:45 -05:00
Kirill Tkhai 0bc9be6718 net: Convert addrconf_ops
These pernet_operations (un)register sysctl, which
are not touched by anybody else.

So, it's safe to make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13 10:36:09 -05:00
Denys Vlasenko 9b2c45d479 net: make getname() functions return length rather than use int* parameter
Changes since v1:
Added changes in these files:
    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

    text    data     bss      dec     hex filename
30108430 2633624  873672 33615726 200ef6e vmlinux.before.o
30108109 2633612  873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-12 14:15:04 -05:00
Song Liu 5c487bb9ad tcp: tracepoint: only call trace_tcp_send_reset with full socket
tracepoint tcp_send_reset requires a full socket to work. However, it
may be called when in TCP_TIME_WAIT:

        case TCP_TW_RST:
                tcp_v6_send_reset(sk, skb);
                inet_twsk_deschedule_put(inet_twsk(sk));
                goto discard_it;

To avoid this problem, this patch checks the socket with sk_fullsock()
before calling trace_tcp_send_reset().

Fixes: c24b14c46b ("tcp: add tracepoint trace_tcp_send_reset")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07 22:00:42 -05:00
David Ahern 44750f8483 net/ipv6: onlink nexthop checks should default to main table
Because of differences in how ipv4 and ipv6 handle fib lookups,
verification of nexthops with onlink flag need to default to the main
table rather than the local table used by IPv4. As it stands an
address within a connected route on device 1 can be used with
onlink on device 2. Updating the table properly rejects the route
due to the egress device mismatch.

Update the extack message as well to show it could be a device
mismatch for the nexthop spec.

Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07 21:52:42 -05:00
David Ahern 58e354c01b net/ipv6: Handle reject routes with onlink flag
Verification of nexthops with onlink flag need to handle unreachable
routes. The lookup is only intended to validate the gateway address
is not a local address and if the gateway resolves the egress device
must match the given device. Hence, hitting any default reject route
is ok.

Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07 21:52:06 -05:00