Commit Graph

39482 Commits

Author SHA1 Message Date
Alexander Couzens 06a15f51cf l2tp: protect tunnel->del_work by ref_count
There is a small chance that tunnel_free() is called before tunnel->del_work scheduled
resulting in a zero pointer dereference.

Signed-off-by: Alexander Couzens <lynxis@fe80.eu>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:39:10 -07:00
Bendik Rønning Opstad d2e1339f40 tcp: Fix CWV being too strict on thin streams
Application limited streams such as thin streams, that transmit small
amounts of payload in relatively few packets per RTT, can be prevented
from growing the CWND when in congestion avoidance. This leads to
increased sojourn times for data segments in streams that often transmit
time-dependent data.

Currently, a connection is considered CWND limited only after having
successfully transmitted at least one packet with new data, while at the
same time failing to transmit some unsent data from the output queue
because the CWND is full. Applications that produce small amounts of
data may be left in a state where it is never considered to be CWND
limited, because all unsent data is successfully transmitted each time
an incoming ACK opens up for more data to be transmitted in the send
window.

Fix by always testing whether the CWND is fully used after successful
packet transmissions, such that a connection is considered CWND limited
whenever the CWND has been filled. This is the correct behavior as
specified in RFC2861 (section 3.1).

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Cc: Mads Johannessen <madsjoh@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:36:30 -07:00
David Ahern 17fb0b2b90 net: Remove redundant oif checks in rt6_device_match
The oif has already been checked that it is non-zero; the 2 additional
checks on oif within that if (oif) {...} block are redundant.

CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:30:24 -07:00
Eric Dumazet 7c85af8810 tcp: avoid reorders for TFO passive connections
We found that a TCP Fast Open passive connection was vulnerable
to reorders, as the exchange might look like

[1] C -> S S <FO ...> <request>
[2] S -> C S. ack request <options>
[3] S -> C . <answer>

packets [2] and [3] can be generated at almost the same time.

If C receives the 3rd packet before the 2nd, it will drop it as
the socket is in SYN_SENT state and expects a SYNACK.

S will have to retransmit the answer.

Current OOO avoidance in linux is defeated because SYNACK
packets are attached to the LISTEN socket, while DATA packets
are attached to the children. They might be sent by different cpus,
and different TX queues might be selected.

It turns out that for TFO, we created a child, which is a
full blown socket in TCP_SYN_RECV state, and we simply can attach
the SYNACK packet to this socket.

This means that at the time tcp_sendmsg() pushes DATA packet,
skb->ooo_okay will be set iff the SYNACK packet had been sent
and TX completed.

This removes the reorder source at the host level.

We also removed the export of tcp_try_fastopen(), as it is no
longer called from IPv6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:11:19 -07:00
Karl Heiss 635682a144 sctp: Prevent soft lockup when sctp_accept() is called during a timeout event
A case can occur when sctp_accept() is called by the user during
a heartbeat timeout event after the 4-way handshake.  Since
sctp_assoc_migrate() changes both assoc->base.sk and assoc->ep, the
bh_sock_lock in sctp_generate_heartbeat_event() will be taken with
the listening socket but released with the new association socket.
The result is a deadlock on any future attempts to take the listening
socket lock.

Note that this race can occur with other SCTP timeouts that take
the bh_lock_sock() in the event sctp_accept() is called.

 BUG: soft lockup - CPU#9 stuck for 67s! [swapper:0]
 ...
 RIP: 0010:[<ffffffff8152d48e>]  [<ffffffff8152d48e>] _spin_lock+0x1e/0x30
 RSP: 0018:ffff880028323b20  EFLAGS: 00000206
 RAX: 0000000000000002 RBX: ffff880028323b20 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff880028323be0 RDI: ffff8804632c4b48
 RBP: ffffffff8100bb93 R08: 0000000000000000 R09: 0000000000000000
 R10: ffff880610662280 R11: 0000000000000100 R12: ffff880028323aa0
 R13: ffff8804383c3880 R14: ffff880028323a90 R15: ffffffff81534225
 FS:  0000000000000000(0000) GS:ffff880028320000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
 CR2: 00000000006df528 CR3: 0000000001a85000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process swapper (pid: 0, threadinfo ffff880616b70000, task ffff880616b6cab0)
 Stack:
 ffff880028323c40 ffffffffa01c2582 ffff880614cfb020 0000000000000000
 <d> 0100000000000000 00000014383a6c44 ffff8804383c3880 ffff880614e93c00
 <d> ffff880614e93c00 0000000000000000 ffff8804632c4b00 ffff8804383c38b8
 Call Trace:
 <IRQ>
 [<ffffffffa01c2582>] ? sctp_rcv+0x492/0xa10 [sctp]
 [<ffffffff8148c559>] ? nf_iterate+0x69/0xb0
 [<ffffffff814974a0>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff8148c716>] ? nf_hook_slow+0x76/0x120
 [<ffffffff814974a0>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff8149757d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff81497808>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff81496ccd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff81497255>] ? ip_rcv+0x275/0x350
 [<ffffffff8145cfeb>] ? __netif_receive_skb+0x4ab/0x750
 ...

With lockdep debugging:

 =====================================
 [ BUG: bad unlock balance detected! ]
 -------------------------------------
 CslRx/12087 is trying to release lock (slock-AF_INET) at:
 [<ffffffffa01bcae0>] sctp_generate_timeout_event+0x40/0xe0 [sctp]
 but there are no more locks to release!

 other info that might help us debug this:
 2 locks held by CslRx/12087:
 #0:  (&asoc->timers[i]){+.-...}, at: [<ffffffff8108ce1f>] run_timer_softirq+0x16f/0x3e0
 #1:  (slock-AF_INET){+.-...}, at: [<ffffffffa01bcac3>] sctp_generate_timeout_event+0x23/0xe0 [sctp]

Ensure the socket taken is also the same one that is released by
saving a copy of the socket before entering the timeout event
critical section.

Signed-off-by: Karl Heiss <kheiss@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 21:03:40 -07:00
Karl Heiss f05940e618 sctp: Whitespace fix
Fix indentation in sctp_generate_heartbeat_event.

Signed-off-by: Karl Heiss <kheiss@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 21:03:40 -07:00
Ian Wilson 34c2d9fb04 bridge: Allow forward delay to be cfgd when STP enabled
Allow bridge forward delay to be configured when Spanning Tree is enabled.

Signed-off-by: Ian Wilson <iwilson@brocade.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-27 19:09:38 -07:00
Jiri Benc b1be00a6c3 vxlan: support both IPv4 and IPv6 sockets in a single vxlan device
For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is
much more user friendly: it's not necessary to create two vxlan interfaces
and pay attention to using the right one in routing rules.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 22:40:55 -07:00
David S. Miller 4963ed48f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/arp.c

The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 16:08:27 -07:00
Linus Torvalds 518a7cb698 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) When we run a tap on netlink sockets, we have to copy mmap'd SKBs
    instead of cloning them.  From Daniel Borkmann.

 2) When converting classical BPF into eBPF, fix the setting of the
    source reg to BPF_REG_X.  From Tycho Andersen.

 3) Fix igmpv3/mldv2 report parsing in the bridge multicast code, from
    Linus Lussing.

 4) Fix dst refcounting for ipv6 tunnels, from Martin KaFai Lau.

 5) Set NLM_F_REPLACE flag properly when replacing ipv6 routes, from
    Roopa Prabhu.

 6) Add some new cxgb4 PCI device IDs, from Hariprasad Shenai.

 7) Fix headroom tests and SKB leaks in ipv6 fragmentation code, from
    Florian Westphal.

 8) Check DMA mapping errors in bna driver, from Ivan Vecera.

 9) Several 8139cp bug fixes (dev_kfree_skb_any in interrupt context,
    misclearing of interrupt status in TX timeout handler, etc.) from
    David Woodhouse.

10) In tipc, reset SKB header pointer after skb_linearize(), from Erik
    Hugne.

11) Fix autobind races et al. in netlink code, from Herbert Xu with
    help from Tejun Heo and others.

12) Missing SET_NETDEV_DEV in sunvnet driver, from Sowmini Varadhan.

13) Fix various races in timewait timer and reqsk_queue_hadh_req, from
    Eric Dumazet.

14) Fix array overruns in mac80211, from Johannes Berg and Dan
    Carpenter.

15) Fix data race in rhashtable_rehash_one(), from Dmitriy Vyukov.

16) Fix race between poll_one_napi and napi_disable, from Neil Horman.

17) Fix byte order in geneve tunnel port config, from John W Linville.

18) Fix handling of ARP replies over lightweight tunnels, from Jiri
    Benc.

19) We can loop when fib rule dumps cross multiple SKBs, fix from Wilson
    Kok and Roopa Prabhu.

20) Several reference count handling bug fixes in the PHY/MDIO layer
    from Russel King.

21) Fix lockdep splat in ppp_dev_uninit(), from Guillaume Nault.

22) Fix crash in icmp_route_lookup(), from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net: Fix panic in icmp_route_lookup
  net: update docbook comment for __mdiobus_register()
  ppp: fix lockdep splat in ppp_dev_uninit()
  net: via/Kconfig: GENERIC_PCI_IOMAP required if PCI not selected
  phy: marvell: add link partner advertised modes
  net: fix net_device refcounting
  phy: add phy_device_remove()
  phy: fixed-phy: properly validate phy in fixed_phy_update_state()
  net: fix phy refcounting in a bunch of drivers
  of_mdio: fix MDIO phy device refcounting
  phy: add proper phy struct device refcounting
  phy: fix mdiobus module safety
  net: dsa: fix of_mdio_find_bus() device refcount leak
  phy: fix of_mdio_find_bus() device refcount leak
  ip6_tunnel: Reduce log level in ip6_tnl_err() to debug
  ip6_gre: Reduce log level in ip6gre_err() to debug
  fib_rules: fix fib rule dumps across multiple skbs
  bnx2x: byte swap rss_key to comply to Toeplitz specs
  net: revert "net_sched: move tp->root allocation into fw_init()"
  lwtunnel: remove source and destination UDP port config option
  ...
2015-09-26 06:01:33 -04:00
David Ahern bdb06cbf77 net: Fix panic in icmp_route_lookup
Andrey reported a panic:

[ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4
[ 7249.865559] IP: [<c16afeca>] icmp_route_lookup+0xaa/0x320
[ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000
[ 7249.865637] Oops: 0000 [#1]
...
[ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.3.0-999-generic #201509220155
[ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014  08/02/2006
[ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000
[ 7249.866949] EIP: 0060:[<c16afeca>] EFLAGS: 00210246 CPU: 0
[ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320
[ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00
[ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974
[ 7249.867077]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0
[ 7249.867141] Stack:
[ 7249.867165]  320310ee 00000000 00000042 320310ee 00000000 c1aeca00
f3920240 f0c69180
[ 7249.867268]  f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000
e659266c f483ba54
[ 7249.867361]  8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0
c16b0e00 00000064
[ 7249.867448] Call Trace:
[ 7249.867494]  [<f855058b>] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e]
[ 7249.867534]  [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867576]  [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867615]  [<c16b0e00>] ? icmp_send+0xa0/0x380
[ 7249.867648]  [<c16b102f>] icmp_send+0x2cf/0x380
[ 7249.867681]  [<f89c8126>] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4]
[ 7249.867714]  [<f89cd0da>] reject_tg+0x7a/0x9f [ipt_REJECT]
[ 7249.867746]  [<f88c29a7>] ipt_do_table+0x317/0x70c [ip_tables]
[ 7249.867780]  [<f895e0a6>] ? __nf_conntrack_find_get+0x166/0x3b0
[nf_conntrack]
[ 7249.867838]  [<f895eea8>] ? nf_conntrack_in+0x398/0x600 [nf_conntrack]
[ 7249.867889]  [<f84c0035>] iptable_filter_hook+0x35/0x80 [iptable_filter]
[ 7249.867933]  [<c16776a1>] nf_iterate+0x71/0x80
[ 7249.867970]  [<c1677715>] nf_hook_slow+0x65/0xc0
[ 7249.868002]  [<c1681811>] __ip_local_out_sk+0xc1/0xd0
[ 7249.868034]  [<c1680f30>] ? ip_forward_options+0x1a0/0x1a0
[ 7249.868066]  [<c1681836>] ip_local_out_sk+0x16/0x30
[ 7249.868097]  [<c1684054>] ip_send_skb+0x14/0x80
[ 7249.868129]  [<c16840f4>] ip_push_pending_frames+0x34/0x40
[ 7249.868163]  [<c16844a2>] ip_send_unicast_reply+0x282/0x310
[ 7249.868196]  [<c16a0863>] tcp_v4_send_reset+0x1b3/0x380
[ 7249.868227]  [<c16a1b63>] tcp_v4_rcv+0x323/0x990
[ 7249.868257]  [<c16776a1>] ? nf_iterate+0x71/0x80
[ 7249.868289]  [<c167dc2b>] ip_local_deliver_finish+0x8b/0x230
[ 7249.868322]  [<c167df4c>] ip_local_deliver+0x4c/0xa0
[ 7249.868353]  [<c167dba0>] ? ip_rcv_finish+0x390/0x390
[ 7249.868384]  [<c167d88c>] ip_rcv_finish+0x7c/0x390
[ 7249.868415]  [<c167e280>] ip_rcv+0x2e0/0x420
...

Prior to the VRF change the oif was not set in the flow struct, so the
VRF support should really have only added the vrf_master_ifindex lookup.

Fixes: 613d09b30f ("net: Use VRF device index for lookups on TX")
Cc: Andrey Melnikov <temnota.am@gmail.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 21:44:02 -07:00
Eric Dumazet 1b70e977ce inet: constify inet_rtx_syn_ack() sock argument
SYNACK packets are sent on behalf on unlocked listeners
or fastopen sockets. Mark socket as const to catch future changes
that might break the assumption.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet ea3bea3a1d tcp/dccp: constify rtx_synack() and friends
This is done to make sure we do not change listener socket
while sending SYNACK packets while socket lock is not held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet 802885fc04 dccp: constify dccp_make_response() socket argument
Like tcp_make_synack() the only time we might change the socket is
when calling sock_wmalloc(), which is using atomic operation to
update sk->sk_wmem_alloc

Also use MAX_DCCP_HEADER as both IPv4/IPv6 use this value for max_header.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet 0f935dbedc tcp: constify tcp_v{4|6}_send_synack() socket argument
This documents fact that listener lock might not be held
at the time SYNACK are sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet 1c1e9d2b67 ipv6: constify ip6_xmit() sock argument
This is to document that socket lock might not be held at this point.

skb_set_owner_w() and ipv6_local_error() are using proper atomic ops
or spinlocks, so we promote the socket to non const when calling them.

netfilter hooks should never assume socket lock is held,
we also promote the socket to non const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet 5d062de7f8 tcp: constify tcp_make_synack() socket argument
listener socket is not locked when tcp_make_synack() is called.

We better make sure no field is written.

There is one exception : Since SYNACK packets are attached to the listener
at this moment (or SYN_RECV child in case of Fast Open),
sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using
atomic operations so this is safe.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet 6ac705b180 tcp: remove tcp_ecn_make_synack() socket argument
SYNACK packets might be sent without holding socket lock.

For DCTCP/ECN sake, we should call INET_ECN_xmit() while
socket lock is owned, and only when we init/change congestion control.

This also fixies a bug if congestion module is changed from
dctcp to another one on a listener : we now clear ECN bits
properly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet 37bfbdda0b tcp: remove tcp_synack_options() socket argument
We do not use the socket in this function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet cfe673b0ae ip: constify ip_build_and_send_pkt() socket argument
This function is used to build and send SYNACK packets,
possibly on behalf of unlocked listener socket.

Make sure we did not miss a write by making this socket const.

We no longer can use ip_select_ident() and have to either
set iph->id to 0 or directly call __ip_select_ident()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet b83e3deb97 tcp: md5: constify tcp_md5_do_lookup() socket argument
When TCP new listener is done, these functions will be called
without socket lock being held. Make sure they don't change
anything.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet 30d50c61df ipv6: constify inet6_csk_route_req() socket argument
socket is not modified, make it const so that callers can
do the same if they need.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet 3aef934f4d ipv6: constify ip6_dst_lookup_{flow|tail}() sock arguments
ip6_dst_lookup_flow() and ip6_dst_lookup_tail() do not touch
socket, lets add a const qualifier.

This will permit the same change in inet6_csk_route_req()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet e5895bc600 inet: constify inet_csk_route_req() socket argument
This is used by TCP listener core, and listener socket shall
not be modified by inet_csk_route_req().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet 6f9c961546 inet: constify ip_route_output_flow() socket argument
Very soon, TCP stack might call inet_csk_route_req(), which
calls inet_csk_route_req() with an unlocked listener socket,
so we need to make sure ip_route_output_flow() is not trying to
change any field from its socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet b1964b5fce tcp: constify tcp_openreq_init_rwin()
Soon, listener socket wont be locked when tcp_openreq_init_rwin()
is called. We need to read socket fields once, as their value
could change under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Eric Dumazet b40cf18ef7 tcp: constify listener socket in tcp_v[46]_init_req()
Soon, listener socket spinlock will no longer be held,
add const arguments to tcp_v[46]_init_req() to make clear these
functions can not mess socket fields.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Michal Kubeček 6ea29da1d0 net: remove unused argument of __netdev_find_adj()
The __netdev_find_adj() helper does not use its first argument, only the
device to find and list to walk through.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:35:15 -07:00
stephen hemminger 163c2e252f l2tp: auto load IP modules
When creating a IP encapsulated tunnel the necessary l2tp module
should be loaded. It already works for UDP encapsulation, it just
doesn't work for direct IP encap.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:27:22 -07:00
stephen hemminger f1f39f9110 l2tp: auto load type modules
It should not be necessary to do explicit module loading when
configuring L2TP. Modules should be loaded as needed instead
(as is done already with netlink and other tunnel types).

This patch adds a new module alias type and code to load
the sub module on demand.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:27:22 -07:00
Florian Fainelli f37db85d0c net: dsa: Set a "dsa" device_type
Provide a device_type information for slave network devices created by
DSA, this is useful for user-space application to easily locate/search
for devices of a specific kind.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:25:16 -07:00
Linus Torvalds 101688f534 NFS client bugfixes for Linux 4.3
Highlights include:
 
 Stable patches:
 - fix v4.2 SEEK on files over 2 gigs
 - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
 - Fix recovery of recalled read delegations
 
 Bugfixes:
 - Fix a case where NFSv4 fails to send CLOSE after a server reboot
 - Fix sunrpc to wait for connections to complete before retrying
 - Fix sunrpc races between transport connect/disconnect and shutdown
 - Fix an infinite loop when layoutget fail with BAD_STATEID
 - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
 - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
 - Fix layoutreturn/close ordering issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWBWKdAAoJEGcL54qWCgDy2rUP/iIWUQSpUPfKKw7xquQUQe4j
 ci4nFxpJ/zhKj1u7x3wrkxZAXcEooYo+ZJ7ayzROKcfQL/sUSWGbSLdr3mqrQynv
 b0SDmnJK9V+CdBQrA+Jp5UGQxcumpMxsAfqVznT0qkf/wDp44DCVgDz5Aj8cRbWU
 6xPfMgVLEnXiId9IgKqg3sJ2NmvMZXuI9sHM6hp6OzRmQDjTcx+LgRz7tnQHgaEk
 zGz8R6eDm3OA0wfApqZwJ6JY793HsDdy30W9L0Yi2PVGXfzwoEB8AqgLVwSDIY1B
 5hG5zn3tg9PSz9vhJ7M2h4AgFHdB3w3XGdJUafwqZEeqEIagw1iFCWlMyo/lE2dG
 G7oob9Jiiwxjc3RDWn2wGaafymrrWZwl2nYzC4O3UvJ3hVJ0mEl1iJagK1m8LzfN
 fmnP7tTyPuoOXkzDogZ0YI3FrngO6430PoR2hUPkS1yce/a+IV0HQEmXbSDSwN80
 1d9zyC9TnPj6rFjZeaGxGK17BpkC0oIQCPq4OSJB4396wzAwMqoJjJVVWWeAK4UC
 PxzoXqAAaBFguSsDbuBMcXgiuUw/7DIZ/pdzsWSiCFgocgF5ZdJdieCNtGk0nbLM
 37R7HCauF93JDrkpUMKPnLXScb2IbEh31pFtKzptJYKwMxEiScXXiP3NE9hfX65i
 2zLkl2aBvd154RvVKNbp
 =GdeV
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - fix v4.2 SEEK on files over 2 gigs
   - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
   - Fix recovery of recalled read delegations

  Bugfixes:
   - Fix a case where NFSv4 fails to send CLOSE after a server reboot
   - Fix sunrpc to wait for connections to complete before retrying
   - Fix sunrpc races between transport connect/disconnect and shutdown
   - Fix an infinite loop when layoutget fail with BAD_STATEID
   - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
   - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
   - Fix layoutreturn/close ordering issues"

* tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS41: make close wait for layoutreturn
  NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
  NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
  NFSv4: Recovery of recalled read delegations is broken
  NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
  NFS: Do cleanup before resetting pageio read/write to mds
  SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
  SUNRPC: Lock the transport layer on shutdown
  nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
  SUNRPC: Ensure that we wait for connections to complete before retrying
  SUNRPC: drop null test before destroy functions
  nfs: fix v4.2 SEEK on files over 2 gigs
  SUNRPC: Fix races between socket connection and destroy code
  nfs: fix pg_test page count calculation
  Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
2015-09-25 11:33:52 -07:00
Chuck Lever bb6c96d728 xprtrdma: Replace global lkey with lkey local to PD
The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 10:46:51 -04:00
Russell King 9861f72074 net: fix net_device refcounting
of_find_net_device_by_node() uses class_find_device() internally to
lookup the corresponding network device.  class_find_device() returns
a reference to the embedded struct device, with its refcount
incremented.

Add a comment to the definition in net/core/net-sysfs.c indicating the
need to drop this refcount, and fix the DSA code to drop this refcount
when the OF-generated platform data is cleaned up and freed.  Also
arrange for the ref to be dropped when handling errors.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 23:04:53 -07:00
Russell King e496ae690b net: dsa: fix of_mdio_find_bus() device refcount leak
Current users of of_mdio_find_bus() leak a struct device refcount, as
they fail to clean up the reference obtained inside class_find_device().

Fix the DSA code to properly refcount the returned MDIO bus by:
1. taking a reference on the struct device whenever we assign it to
   pd->chip[x].host_dev.
2. dropping the reference when we overwrite the existing reference.
3. dropping the reference when we free the data structure.
4. dropping the initial reference we obtained after setting up the
   platform data structure, or on failure.

In step 2 above, where we obtain a new MDIO bus, there is no need to
take a reference on it as we would only have to drop it immediately
after assignment again, iow:

	put_device(cd->host_dev);	/* drop original assignment ref */
	cd->host_dev = get_device(&mdio_bus_switch->dev); /* get our ref */
	put_device(&mdio_bus_switch->dev); /* drop of_mdio_find_bus ref */

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 23:04:52 -07:00
Jiri Pirko f623ab7f51 switchdev: reduce transaction phase enum down to a boolean
Now, since we have only 2 values for transaction phase, just use bool.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko 79a62eb22a dsa: use prepare/commit switchdev transaction helpers
The enum is going to disappear, use the helpers instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko 9f6467cf22 switchdev: remove "ABORT" transaction phase
No longer used by drivers, as transaction queue with item destructors
takes care of abort phase internally in switchdev code. So kill it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko f8db83486e switchdev: move transaction phase enum under transaction structure
Before it disappears completely, move transaction phase enum under
transaction structure and make attr/obj structures a bit cleaner.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Jiri Pirko 7ea6eb3f56 switchdev: introduce transaction item queue for attr_set and obj_add
Now, the memory allocation in prepare/commit state is done separatelly
in each driver (rocker). Introduce the similar mechanism in generic
switchdev code, in form of queue. That can be used not only for memory
allocations, but also for different items. Abort item destruction
is handled as well.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Jiri Pirko 69f5df491e switchdev: rename "trans" to "trans_ph".
This is temporary, name "trans" will be used for something else and
"trans_ph" will eventually disappear.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Pablo Neira Ayuso c3456026ad Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Second Round of IPVS Updates for v4.4

please consider these bug fixes and extensive clean-ups of IPVS
from Eric Biederman for v4.4.

His excellent description of the changes, which is part of an even larger
set of clean-up work, is as follows:

  I am gradually working my way through the netfilter stack passing struct
  down into the netfilter hooks and from the netfilter hooks and from there
  down into the functions that actually care.  This removes the need for
  netfilter functions to guess how to figure out how to compute which
  network namespace they are in and instead provides a simple and reliable
  method to do so.

  The cleanups stand on their own but this is part of a larger effort to
  have routes with an output device that is not in the current network
  namespace.

  The IPVS code has been a bit more of a challenge than most.  Just passing
  struct net through to where it is needed did not feel clean to me.  The
  practical issue is that the ipvs code in most places actually wants
  struct netns_ipvs and not struct net.

  So as part of this process I have turned the relationship between struct
  net and the structs netns_ipvs, ip_vs_conn_param, ip_vs_conn, and
  ip_vs_service inside out.  I have modified the ipvs functions to take a
  struct netns_ipvs not a struct net.  The net is code with fewer
  conversions from one type of structure to another.  I did wind up adding
  a struct netns_ipvs parameter to quite a few functions that did not have
  it before so I could pass the structure down from the netfilter hooks to
  where it is actually needed to avoid guessing.

  I have broken up the work in a bunch of small patches so there is at
  least a chance and reviewing that each step I took is correct.  The
  series compiles at each step so bisecting it should not be a problem if
  something weird comes up.

  The first two changes in this series are actually bug fixes.  The first
  is a compile fix for a bug in sctp that came in, in the last round of
  ipvs changes merged into nf-next.  The second fixes an older bug where in
  pathological circumstances the wrong network namespace could be used when
  a proc file is written to.

  The rest of the patchset is a bunch of boring changes getting pushing
  struct netns_ipvs (and by extension ipvs->net) where it needs to be.
  Either by replacing struct net pointers or adding new struct netns_ipvs
  pointers.  With a handful of other minor cleanups (like removing
  skb_net).

I have decided include the bug fixes in this pull request. Patch one
relates to a bug that was added to nf-next recently and is thus not
applicable to nf . Patch two could arguably be promoted to a fix for v4.3
and stable though it does not appear to be severe enough to warrant that
course of action; let me know if you would like me to reconsider.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-25 01:38:58 +02:00
Matt Bennett 17a10c9215 ip6_tunnel: Reduce log level in ip6_tnl_err() to debug
Currently error log messages in ip6_tnl_err are printed at 'warn'
level. This is different to other tunnel types which don't print
any messages. These log messages don't provide any information that
couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 16:08:37 -07:00
David S. Miller deccbe80be Just two small fixes:
* VHT MCS mask array overrun, reported by Dan Carpenter
  * reset CQM history to always get a notification, from Sara Sharon
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJWAWENAAoJEDBSmw7B7bqr7agQAI/IVkuKlzu24XFg6PbaUNcL
 cD+jZplmg3OeeUf/Gtv7KVLAbRlo1X+qIiuquewzGhqAOGSm3KmypQmKQy7Y1gMO
 Nq4uJwdkJpo2aQtQH97jXwWfk40d+BqSXNbnpobvzrMWaMXHO+DT18Icc3iJqO9l
 cQOUTNplDMTpXgjHXjf6296MVP30lbUnoI6BPJGb3LS3BvfgX8KYR4XGlRCQ5R1F
 4gZE0IokY+PQmbZvBzXD3dBGuwnztezX8g25IytgFs+UDyeg8xn/OI+Q3e4HMpqD
 4SeNpD3AD1jxEedpr9trCaLCm8NQdnfU4LDE0y5qaTC2cWPTw1qZs4alNlbD+L3L
 odjiP1aU2X9rWTf78/WcJwn+aufBNGz20M9at5IWkUCi88taSGWzZAviRhhGpuzd
 4c2v6DLcaAqtTTGhhN6CUE4f9+JzeICgDjqSmdR90caMAtOPFjS2ZpbMstKVxO1V
 fjJ4Ys/VRsySrSIji/ryPNUSigeNhEJm9jImObcKhsEv6aqdIbYzdDWekbSVDtvB
 nJQJqlI1RlNn+HO+VXvPDIBuSpOXddVO6142yLQtte4ReL9NbZFU0lx7U5N6JoIk
 QfXbC9eD/LxdRONxLvakblWWbm+FflCaTiEi9dL+zQeAAY3Zgyfi2vo4bHdwi3XL
 GMNNo05q5WSFGkfktb8v
 =4Ui5
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two small fixes:
 * VHT MCS mask array overrun, reported by Dan Carpenter
 * reset CQM history to always get a notification, from Sara Sharon
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:36:20 -07:00
Matt Bennett a46496ce38 ip6_gre: Reduce log level in ip6gre_err() to debug
Currently error log messages in ip6gre_err are printed at 'warn'
level. This is different to most other tunnel types which don't
print any messages. These log messages don't provide any information
that couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:28:43 -07:00
Wilson Kok 41fc014332 fib_rules: fix fib rule dumps across multiple skbs
dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:21:54 -07:00
Eric Dumazet d8ed625044 tcp: factorize sk_txhash init
Neal suggested to move sk_txhash init into tcp_create_openreq_child(),
called both from IPv4 and IPv6.

This opportunity was missed in commit 58d607d3e5 ("tcp: provide
skb->hash to synack packets")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:52:30 -07:00
WANG Cong d8aecb1011 net: revert "net_sched: move tp->root allocation into fw_init()"
fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:33:30 -07:00
Jiri Benc b194f30c61 lwtunnel: remove source and destination UDP port config option
The UDP tunnel config is asymmetric wrt. to the ports used. The source and
destination ports from one direction of the tunnel are not related to the
ports of the other direction. We need to be able to respond to ARP requests
using the correct ports without involving routing.

As the consequence, UDP ports need to be fixed property of the tunnel
interface and cannot be set per route. Remove the ability to set ports per
route. This is still okay to do, as no kernel has been released with these
attributes yet.

Note that the ability to specify source and destination ports is preserved
for other users of the lwtunnel API which don't use routes for tunnel key
specification (like openvswitch).

If in the future we rework ARP handling to allow port specification, the
attributes can be added back.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:31:37 -07:00
Jiri Benc 63d008a4e9 ipv4: send arp replies to the correct tunnel
When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.

We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.

The only thing to do is to "reverse" the ip_tunnel_info.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:31:36 -07:00
Jiri Benc 38cf595b19 ipv6: remove unused neigh parameter from ndisc functions
Since commit 12fd84f438 ("ipv6: Remove unused neigh argument for
icmp6_dst_alloc() and its callers."), the neigh parameter of ndisc_send_na
and ndisc_send_ns is unused.

CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:26:08 -07:00
Jiri Benc 92c14d9b5e genetlink: simplify genl_notify
The genl_notify function has too many arguments for no real reason - all
callers use genl_info to get them anyway. Just pass the genl_info down to
genl_notify.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:25:23 -07:00
Herbert Xu da314c9923 netlink: Replace rhash_portid with bound
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09 ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:07:08 -07:00
Eric W. Biederman 57781c1cee ipvs: Pass ipvs into ip_vs_gather_frags
This will be needed later when the network namespace guessing is
removed from ip_defrag.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 9cfdd75b7c ipvs: Remove skb_sknet
This function adds no real value and it obscures what the code is doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 7d1f88eca0 ipvs: Pass ipvs not net to ip_vs_protocol_net_(init|cleanup)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 69f390934b ipvs: Remove net argument from ip_vs_tcp_conn_listen
The argument is unnecessary and in practice confusing,
and has caused the callers to do all manner of silly things.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman a43d1a6b97 ipvs: Pass ipvs through ip_vs_route_me_harder into sysctl_snat_reroute
This removes the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 7b5f689a2c ipvs: Pass ipvs into ip_vs_out_icmp and ip_vs_out_icmp_v6
This removes the need to compute ipvs with the hack "net_ipvs(skb_net(skb))"

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 6f2bcea991 ipvs: Pass ipvs into ip_vs_in_icmp and ip_vs_in_icmp_v6
With ipvs passed into ip_vs_in_icmp and ip_vs_in_icmp_v6
they no longer need to call the hack that is skb_net.

Additionally ipvs_in_icmp no longer needs to call dev_net(skb->dev)
and can use the ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 6e385bb3ef ipvs: Pass ipvs into ip_vs_in
Derive ipvs from state->net in the callers of ip_vs_in and pass it
into ip_vs_out.  Removing the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman 1b75097dd7 ipvs: Pass ipvs into ip_vs_out
Derive ipvs from state->net in the callers of ip_vs_out and pass it
into ip_vs_out.  Removing the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 2300f0451e ipvs: Pass ipvs not net into sysctl_nat_icmp_send
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 51efbcbbb2 ipvs: Simplify ipvs and net access in ip_vs_leave
Stop using the hack skb_net(skb) to compute the network namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 5703294874 ipvs: Wrap sysctl_cache_bypass and remove ifdefs in ip_vs_leave
With sysctl_cache_bypass now a compile time constant the compiler can
figue out that it can elimiate all of the code that depends on
sysctl_cache_bypass being true.

Also remove the duplicate computation of net previously necessitated
by #ifdef CONFIG_SYSCTL

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 7c08d78e6f ipvs: Better derivation of ipvs in ip_vs_in_stats and ip_vs_out_stats
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman 20868a40d0 ipvs: Pass ipvs into ensure_mtu_is adequate
This allows two different ways for computing/guessing net to be
removed from ensure_mtu_is_adequate.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman f5745f8ae6 ipvs: Pass ipvs into __ip_vs_get_out_rt_v6
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman ecfe87b884 ipvs: Pass ipvs into __ip_vs_get_out_rt
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 361c3f5293 ipvs: Better derivation of ipvs in ip_vs_tunnel_xmit
Don't use "net_ipvs(skb_net(skb))" as skb_net is a bad hack.  Instead
use cp->ipvs and ipvs->net for the net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman d8f44c335a ipvs: Pass ipvs into .conn_schedule and ip_vs_try_to_schedule
This moves the hack "net_ipvs(skb_net(skb))" up one level where it
will be easier to remove.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 2f3edc6a5b ipvs: Pass ipvs not net into ip_vs_conn_net_init and ip_vs_conn_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman d889717aaf ipvs: Pass ipvs not net into ip_vs_conn_net_flush
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 754b81a357 ipvs: Pass ipvs not net to ip_vs_conn_hashkey
Use the address of struct netns_ipvs in the hash not the address of
struct net.  Both addresses are equally valid candidates and by using
the address of struct netns_ipvs there becomes no need deal with
struct net in this part of the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman 0cf705c8c2 ipvs: Pass ipvs into conn_out_get
Move the hack of relying on "net_ipvs(skb_net(skb))" to derive the
ipvs up a layer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman ab16197642 ipvs: Pass ipvs into .conn_in_get and ip_vs_conn_in_get_proto
Stop relying on "net_ipvs(skb_net(skb))" to derive the ipvs as
skb_net is a hack.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman f5099dd4d9 ipvs: Pass ipvs into ip_vs_conn_fill_param_proto
Move the ugly hack net_ipvs(skb_net(skb)) up a layer in the call stack
so it is easier to remove.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 1281a9c2d1 ipvs: Pass ipvs not net into init_netns and exit_netns
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman c70bd6800a ipvs: Pass ipvs not net into [un]register_ip_vs_proto_netns
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman b5dd212cc1 ipvs: Pass ipvs not net into ip_vs_app_net_init and ip_vs_app_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 09858708e6 ipvs: Pass ipvs not net into ip_vs_app_inc_release
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 9f8128a56e ipvs: Pass ipvs not net to register_ip_vs_app and unregister_ip_vs_app
Also move the tests for net_ipvs being NULL into __ip_vs_ftp_init
and __ip_vs_ftp_exit.  The only places where they possibly make
sense.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 3250dc9c52 ipvs: Pass ipvs not net to register_ip_vs_app_inc
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman a080ce38a0 ipvs: Pass ipvs not net into ip_vs_app_inc_new
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman 19648918fb ipvs: Pass ipvs not net into register_app and unregister_app
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman a4dd0360c6 ipvs: Pass ipvs not net to ip_vs_estimator_net_init and ip_vs_estimator_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 70a131a2c8 ipvs: Pass ipvs not net to estimation_timer
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 3d99376689 ipvs: Pass ipvs not net into ip_vs_control_net_(init|cleanup)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 8b8237a581 ipvs: Pass ipvs not net to ip_vs_control_net_(init|cleanup)_sysctl
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 423b55954d ipvs: Pass ipvs not net to ip_vs_random_drop_entry
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman 0f34d54bf4 ipvs: Pass ipvs not net to ip_vs_start_estimator aned ip_vs_stop_estimator
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman cacd1e60f1 ipvs: Pass ipvs not net to ip_vs_genl_set_config
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman ebea1f7c0b ipvs: Pass ipvs not net to ip_vs_sync_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 802cb43703 ipvs: Pass ipvs not net to ip_vs_sync_net_init
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 1fc12004d2 ipvs: Pass ipvs not net to ip_vs_proc_sync_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 4f30665bac ipvs: Pass ipvs not net to ip_vs_proc_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman b61a8c1a40 ipvs: Pass ipvs not net to ip_vs_sync_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 72e9481e28 ipvs: Pass ipvs not net to ip_vs_sync_conn_v0
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 7d537f3ab7 ipvs: Pass ipvs not net to ip_vs_process_message
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman 37b68e6ded ipvs: Store ipvs not net in struct ip_vs_sync_thread_data
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of tinfo->net to access tinfo->ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman fd124e2f8b ipvs: Pass ipvs not net to make_receive_sock
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 68c76b6aa0 ipvs: Pass ipvs not net to make_send_sock
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman b3cf3cbfb5 ipvs: Pass ipvs not net to stop_sync_thread
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 6ac121d710 ipvs: Pass ipvs not net to start_sync_thread
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman df04ffb766 ipvs: Pass ipvs not net to ip_vs_genl_del_daemon
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman d8443c5f2b ipvs: Pass ipvs not net to ip_vs_genl_new_daemon
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 34c2f5146c ipvs: Pass ipvs not net to ip_vs_genl_find_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman 613fb830b7 ipvs: Pass ipvs not net to ip_vs_genl_parse_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman af5403419d ipvs: Pass ipvs not net to __ip_vs_get_timeouts
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman 08fff4c357 ipvs: Pass ipvs not net to __ip_vs_get_dest_entries
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman b2876b7773 ipvs: Pass ipvs not net to __ip_vs_get_service_entries
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman f1faa1e749 ipvs: Pass ipvs not net to ip_vs_set_timeout
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 18d6ade63c ipvs: Pass ipvs not net to ip_vs_proto_data_get
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman a47b430080 ipvs: Cache ipvs in ip_vs_in_icmp and ip_vs_in_icmp_v6
Storte the value of net_ipvs in a variable named ipvs so that when
there are more users struct netns_ipvs in ip_vs_in_cmp and
ip_vs_in_icmp_v6 they won't need to compute the value again.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman c60856c687 ipvs: Pass ipvs not net to ip_vs_zero_all
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 56d2169b77 ipvs: Pass ipvs not net to ip_vs_service_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman ef7c599d91 ipvs: Pass ipvs not net to ip_vs_flush
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 5060bd8307 ipvs: Pass ipvs not net to ip_vs_add_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman cd58278bd4 ipvs: Cache ipvs in ip_vs_genl_set_cmd
Compute ipvs early in ip_vs_genl_set_cmd and use the cached value to
access ipvs->sync_state.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman 8e743f1b45 ipvs: Pass ipvs not net to ip_vs_dest_trash_expire
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 79ac82e0aa ipvs: Pass ipvs not net to __ip_vs_del_dest
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 6c0e14f507 ipvs: Pass ipvs not net to ip_vs_trash_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman dc2add6f2e ipvs: Pass ipvs not net to ip_vs_find_dest
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 48aed1b029 ipvs: Pass ipvs not net to ip_vs_has_real_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 0a4fd6ce92 ipvs: Pass ipvs not net to ip_vs_service_find
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman bb2e2a8c95 ipvs: Pass ipvs not net to __ip_vs_service_find
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman ba61f39034 ipvs: Pass ipvs not net to ip_vs_svc_hashkey
Use the address of ipvs not the address of net when computing the
hash value.  This removes an unncessary dependency on struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman 1ed8b94780 ipvs: Pass ipvs not net to __ip_vs_svc_fwm_find
ipvs is what the code actually wants to use.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman f6510b245e ipvs: Pass ipvs not net to ip_vs_svc_fwm_hashkey
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 3109d2f2d1 ipvs: Store ipvs not net in struct ip_vs_service
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of param->net to access param->ipvs->net instead.

In functions where we are searching for an svc and filtering by net
filter by ipvs instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 19913dec1b ipvs: Pass ipvs not net to ip_vs_fill_conn
ipvs is what is actually desired so change the parameter and the modify
the callers to pass struct netns_ipvs.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman e64e2b460c ipvs: Store ipvs not net in struct ip_vs_conn_param
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of param->net to access param->ipvs->net instead.

When lookup up struct ip_vs_conn in a hash table replace comparisons
of cp->net with comparisons of cp->ipvs which is possible
now that ipvs is present in ip_vs_conn_param.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 58dbc6f260 ipvs: Store ipvs not net in struct ip_vs_conn
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of conn->net to access conn->ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman d484fc3812 ipvs: Use state->net in the ipvs forward functions
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 717e917ddf ipvs: Don't use current in proc_do_defense_mode
Instead store ipvs in extra2 so that proc_do_defense_mode can easily
find the ipvs that it's value is associated with.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman 1daea8ed16 ipvs: Hoist computation of ipvs earlier in sctp_conn_schedule
The addition of sysctl_sloppy_sctp in sctp_conn_schedule resulted
in a use of ipvs before it was computed.  Hoist the computation of
ipvs earlier to avoid this problem.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:32 +09:00
Siva Mannem dcd45e0649 bridge: don't age externally added FDB entries
Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Premkumar Jonnala <pjonnala@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:35:58 -07:00
Scott Feldman a79e88d9fb bridge: define some min/max/default ageing time constants
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:35:58 -07:00
David Woodhouse d3869efe7a Fix AF_PACKET ABI breakage in 4.2
Commit 7d82410950 ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.

This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:33:55 -07:00
Neil Horman 2d8bff1269 netpoll: Close race condition between poll_one_napi and napi_disable
Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:

CPU0				CPU1
ndo_tx_timeout			napi_poll_dev
 napi_disable			 poll_one_napi
  test_and_set_bit (ret 0)
				  test_bit (ret 1)
   reset adapter		   napi_poll_routine

If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware  (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do.  The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.

Additionaly, there is another, more critical race between netpoll and
napi_disable.  The disabled napi state is actually identical to the scheduled
state for a given napi instance.  The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access.  In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.

The fix should be pretty easy.  netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit.  That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion

Change notes:
V2)
	Remove a trailing whtiespace
	Resubmit with proper subject prefix

V3)
	Clean up spacing nits

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:32:50 -07:00
Daniel Borkmann 5cf8ca0e47 cls_bpf: further limit exec opcodes subset
Jamal suggested to further limit the currently allowed subset of opcodes
that may be used by a direct action return code as the intention is not
to replace the full action engine, but rather to have a minimal set that
can be used in the fast-path on things like ingress for some features
that cls_bpf supports.

Classifiers can, of course, still be chained together that have direct
action mode with those that have a full exec pass. For more complex
scenarios that go beyond this minimal set here, the full tcf_exts_exec()
path must be used.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann ef146fa40c cls_bpf: make binding to classid optional
The binding to a particular classid was so far always mandatory for
cls_bpf, but it doesn't need to be. Therefore, lift this restriction
as similarly done in other classifiers.

Only a couple of qdiscs make use of class from the tcf_result, others
don't strictly care, so let the user choose his needs (those that read
out class can handle situations where it could be NULL).

An explicit check for tcf_unbind_filter() is also not needed here, as
the previous r->class was 0, so the xchg() will return that and
therefore a callback to the qdisc's unbind_tcf() is skipped.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann bf007d1c75 cls_bpf: also dump TCA_BPF_FLAGS
In commit 43388da42a49 ("cls_bpf: introduce integrated actions") we
have added TCA_BPF_FLAGS. We can also retrieve this information from
the prog, dump it back to user space as well. It's useful in tc when
displaying/dumping filter info.

Also, remove tp from cls_bpf_prog_from_efd(), came in as a conflict
from a rebase and it's unused here (later work may add it along with
a real user).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:01 -07:00
Daniel Borkmann 927ab1d764 sched, bpf: let stack handle !IFF_UP devs on bpf_clone_redirect
Similarly as already the case in bpf_redirect()/skb_do_redirect()
pair, let the stack deal with devs that are !IFF_UP.

dev_forward_skb() as well as dev_queue_xmit() will free the skb
and increment drop counter internally in such cases, so we can
spare the condition in bpf_clone_redirect().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:25:51 -07:00
Eric Dumazet 675ee231d9 tcp: add proper TS val into RST packets
RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

  Once TSopt has been successfully negotiated, that is both <SYN> and
  <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
  segment for the duration of the connection, and SHOULD be sent in an
  <RST> segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

   When an <RST> segment is
   received, it MUST NOT be subjected to the PAWS check by verifying an
   acceptable value in SEG.TSval, and information from the Timestamps
   option MUST NOT be used to update connection state information.
   SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:24:07 -07:00
Tom Herbert 644d0e6569 ipv6 Use get_hash_from_flowi6 for rt6 hash
In rt6_info_hash_nhsfn replace the custom hashing over flowi6 that is
using xor with a call to common function get_hash_from_flowi6.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:21:09 -07:00
Neil Armstrong fbd03513bf net: dsa: Fix Marvell Egress Trailer check
The Marvell Egress rx trailer check must be fixed to
correctly detect bad bits in the third byte of the
Eggress trailer as described in the Table 28 of the
88E6060 datasheet.
The current code incorrectly omits to check the third
byte and checks the fourth byte twice.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 17:37:03 -07:00
Jesse Gross ae5f2fb1d5 openvswitch: Zero flows on allocation.
When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 17:33:41 -07:00
Andrea Arcangeli ac5be6b47e userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults.  No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
David S. Miller 99cb99aa05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree
in this 4.4 development cycle, they are:

1) Schedule ICMP traffic to IPVS instances, this introduces a new schedule_icmp
   proc knob to enable/disable it. By default is off to retain the old
   behaviour. Patchset from Alex Gartrell.

I'm also including what Alex originally said for the record:

"The configuration of ipvs at Facebook is relatively straightforward.  All
ipvs instances bgp advertise a set of VIPs and the network prefers the
nearest one or uses ECMP in the event of a tie.  For the uninitiated, ECMP
deterministically and statelessly load balances by hashing the packet
(usually a 5-tuple of protocol, saddr, daddr, sport, and dport) and using
that number as an index (basic hash table type logic).

The problem is that ICMP packets (which contain really important
information like whether or not an MTU has been exceeded) will get a
different hash value and may end up at a different ipvs instance.  With no
information about where to route these packets, they are dropped, creating
ICMP black holes and breaking Path MTU discovery.  Suddenly, my mom's
pictures can't load and I'm fielding midday calls that I want nothing to do
with.

To address this, this patch set introduces the ability to schedule icmp
packets which is gated by a sysctl net.ipv4.vs.schedule_icmp.  If set to 0,
the old behavior is maintained -- otherwise ICMP packets are scheduled."

2) Add another proc entry to ignore tunneled packets to avoid routing loops
   from IPVS, also from Alex.

3) Fifteen patches from Eric Biederman to:

* Stop passing nf_hook_ops as parameter to the hook and use the state hook
  object instead all around the netfilter code, so only the private data
  pointer is passed to the registered hook function.

* Now that we've got state->net, propagate the netns pointer to netfilter hook
  clients to avoid its computation over and over again. A good example of how
  this has been simplified is the former TEE target (now nf_dup infrastructure)
  since it has killed the ugly pick_net() function.

There's another round of netns updates from Eric Biederman making the line. To
avoid the patchbomb again to almost all the networking mailing list (that is 84
patches) I'd suggest we send you a pull request with no patches or let me know
if you prefer a better way.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 13:11:43 -07:00
Sara Sharon babc305e21 mac80211: reset CQM history upon reconfiguration
The current behavior of notifying CQM events is inconsistent:
Upon first configuration there is a cqm event with the current
status according to threshold configured, regardless of signal
stability.
When there is reconfiguration no event is sent unless there is
a significant change to the signal level according to the new
configuration.

Since the current reconfiguration behavior might cause missing
CQM events in case the current signal did not change but is on
the other side of the new threshold, fix that by resetting the
stored signal level upon reconfiguration.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:22:50 +02:00
Helmut Schaa 4dc792b8f0 mac80211: Split sending tx'ed frames to monitor interfaces into its own function
This allows ieee80211_tx_monitor to be used directly for sending 802.11 frames
to all monitor interfaces.

Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:28 +02:00
Rafał Miłecki d55d0d598e nl80211: put current TX power in interface info
Many drivers implement reading current TX power (using either cfg80211
or ieee80211 op) but userspace can't get it using nl80211. Right now the
only way to access it is to call some wext ioctl.
Let's put TX power in interface info reply (callback is wdev specific)
just like we do with current channel.
To be consistent (e.g. NL80211_CMD_SET_WIPHY) let's use mBm as na unit.

Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:27 +02:00
Johannes Berg 338c17ae31 mac80211: use DECLARE_EWMA for ave_beacon_signal
It doesn't seem problematic to change the weight for the average
beacon signal from 3 to 4, so use DECLARE_EWMA. This also makes
the code easier to maintain since bugs like the one fixed in the
previous patch can't happen as easily.

With a fix from Avraham Stern to invert the sign since EMWA uses
unsigned values only.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:26 +02:00
Johannes Berg 8ec6d97871 mac80211: fix driver RSSI event calculations
The ifmgd->ave_beacon_signal value cannot be taken as is for
comparisons, it must be divided by since it's represented
like that for better accuracy of the EWMA calculations. This
would lead to invalid driver RSSI events. Fix the used value.

Fixes: 615f7b9bb1 ("mac80211: add driver RSSI threshold events")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:26 +02:00
Johannes Berg 8e0d7fe07c mac80211: remove last_beacon/ave_beacon debugfs files
These file aren't really useful:
 - if per beacon data is required then you need to use
   radiotap or similar anyway, debugfs won't help much
 - average beacon signal is reported in station info in
   nl80211 and can be looked up with iw

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:25 +02:00
Bob Copeland c85fb53c4f mac80211: implement VHT support for mesh
Implement the basics required for supporting very high throughput
with mesh: include VHT information elements in beacons, probe
responses, and peering action frames, and check for compatible VHT
configurations when peering.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:25 +02:00
Chun-Yeow Yeoh f020ae40b0 mac80211: zero center freq segment 2 in VHT oper IE
Clear the Channel Center Frequency Segment 2 in VHT operation
IEs to avoid sending non-zero values if the SKB wasn't zeroed
before adding the VHT operation IE.

Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@gmail.com>
[change commit message a bit - not necessarily just mesh related]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:24 +02:00
Emmanuel Grumbach 99e7ca44bb mac80211: allow the driver to advertise A-MSDU within A-MPDU Rx support
Drivers may be interested in receiving A-MSDU within A-MDPU.
Not all the devices may be able to do so, make it configurable.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:24 +02:00
Johannes Berg 46cad4b7a1 mac80211: remove direct probe step before authentication
The direct probe step before authentication was done mostly for
two reasons:
 1) the BSS data could be stale
 2) the beacon might not have included all IEs

The concern (1) doesn't really seem to be relevant any more as
we time out BSS information after about 30 seconds, and in fact
the original patch only did the direct probe if the data was
older than the BSS timeout to begin with. This condition got
(likely inadvertedly) removed later though.

Analysing this in more detail shows that since we mostly use
data from the association response, the only real reason for
needing the probe response was that the code validates the WMM
parameters, and those are optional in beacons. As the previous
patches removed that behaviour, we can now remove the direct
probe step entirely.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:23 +02:00
Emmanuel Grumbach e3abc8ff0f mac80211: allow to transmit A-MSDU within A-MPDU
Advertise the capability to send A-MSDU within A-MPDU
in the AddBA request sent by mac80211. Let the driver
know about the peer's capabilities.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:23 +02:00
Andrei Otcheretianski 1b09b5568e mac80211: introduce per vif frame registration API
Currently the cfg80211's frame registration api receives wdev, however
mac80211 assumes per device filter configuration and ignores wdev.
Per device filtering is too wasteful, especially for multi-channel
devices.
Introduce new per vif frame registration API and use it for probe
request registrations in ieee80211_mgmt_frame_register()
Also call directly to ieee80211_configure_filter instead of using a work
since it is now allowed to sleep in ieee80211_mgmt_frame_register.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:22 +02:00
Johannes Berg 7bdbe400d1 nl80211: support vendor dumpit commands
In order to transfer many items in vendor commands, support the
dumpit netlink method for them.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:22 +02:00
Arik Nemtsov dd55ab59b6 mac80211: TDLS: check reg with IR-relax on chandef upgrade
When checking if a TDLS chandef can be upgraded, IR-relaxation can be
taken into account to allow more channels.

Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:21 +02:00
Arik Nemtsov 82c0cc90d6 mac80211: debugfs: add file to disallow TDLS wider-bw
Sometimes we are interested in testing TDLS performance in a specific
width setting. Add the ability to disable the wider-band feature, thereby
allowing the TDLS channel width to be controlled by the BSS width.

Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:21 +02:00
Andrei Otcheretianski fc58c47ef1 mac80211: process skb_queue while scanning in HW
Queued frames aren't processed during scan, which results in an inability
to complete the BA session establishment until the scan ends. Since we
can't tx frames until the BA agreement setup is complete, it might result
in a very large latency during scan.
Fix this by allowing to process queued skbs while scanning in HW. This
should be ok since the devices which support hw scan should be able
to handle tx/rx while scanning.
During SW scan, mac80211 drops any txed frames besides probes and NDPs,
so it is still needed to delay processing of the queued frames till the
SW scan is done.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:20 +02:00
Johannes Berg 8de1c63ba1 wireless: make __freq_reg_info static
As pointed out by sparse, this symbol should be static, make it so.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:20 +02:00
Johannes Berg 2df1b131b5 mac80211: fix VHT MCS mask array overrun
The HT MCS mask has 9 bytes, the VHT one only has 8 streams.
Split the loops to handle this correctly.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:19:48 +02:00
Eric Dumazet 29c6852602 inet: fix races in reqsk_queue_hash_req()
Before allowing lockless LISTEN processing, we need to make
sure to arm the SYN_RECV timer before the req socket is visible
in hash tables.

Also, req->rsk_hash should be written before we set rsk_refcnt
to a non zero value.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:32:29 -07:00
Eric Dumazet ed2e923945 tcp/dccp: fix timewait races in timer handling
When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.

To make things more readable I introduced inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:32:29 -07:00
Yuchung Cheng f9b9958229 tcp: send loss probe after 1s if no RTT available
This patch makes TLP to use 1 sec timer by default when RTT is
not available due to SYN/ACK retransmission or SYN cookies.

Prior to this change, the lack of RTT prevents TLP so the first
data packets sent can only be recovered by fast recovery or RTO.
If the fast recovery fails to trigger the RTO is 3 second when
SYN/ACK is retransmitted. With this patch we can trigger fast
recovery in 1sec instead.

Note that we need to check Fast Open more properly. A Fast Open
connection could be (accepted then) closed before it receives
the final ACK of 3WHS so the state is FIN_WAIT_1. Without the
new check, TLP will retransmit FIN instead of SYN/ACK.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Yuchung Cheng 0f1c28ae74 tcp: usec resolution SYN/ACK RTT
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.

This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.

For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)

For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.

If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().

One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Ursula Braun 91e60eb60b s390/iucv: do not use arrays as argument
The iucv code uses arrays as arguments. Even though this does not
really cause a problem, it could be misleading, since the compiler
turns array arguments into just a pointer argument. To be more
precise this patch changes the array arguments into pointers.

Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:03:04 -07:00
David S. Miller 5dcd246107 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-09-18

Here's the first bluetooth-next pull request for the 4.4 kernel:

 - ieee802154 cleanups & fixes
 - debugfs support for the at86rf230 driver
 - Support for quirky (seemingly counterfeit) CSR Bluetooth controllers
 - Power management and device config improvements for Intel controllers
 - Fix for devices with incorrect advertising data length
 - Fix for closing HCI user channel socket

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:00:44 -07:00
Herbert Xu 1f770c0a09 netlink: Fix autobind race condition that leads to zero port ID
The commit c0bb07df7d ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:55:31 -07:00
Nicolas Dichtel bc22a0e2ea iptunnel: make rx/tx bytes counters consistent
This was already done a long time ago in
commit 64194c31a0 ("inet: Make tunnel RX/TX byte counters more consistent")
but tx path was broken (at least since 3.10).

Before the patch the gre header was included on tx.

After the patch:
$ ping -c1 192.168.0.121 ; ip -s l ls dev gre1
PING 192.168.0.121 (192.168.0.121) 56(84) bytes of data.
64 bytes from 192.168.0.121: icmp_req=1 ttl=64 time=2.95 ms

--- 192.168.0.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.955/2.955/2.955/0.000 ms
7: gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1468 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/gre 10.16.0.249 peer 10.16.0.121
    RX: bytes  packets  errors  dropped overrun mcast
    84         1        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    84         1        0       0       0       0

Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:36:22 -07:00
David S. Miller ac81374493 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patch contains Netfilter fixes for your net tree, they are:

1) nf_log_unregister() should only set to NULL the logger that is being
   unregistered, instead of everything else. Patch from Florian Westphal.

2) Fix a crash when accessing physoutdev from PREROUTING in br_netfilter.
   This is partially reverting the patch to shrink nf_bridge_info to 32 bytes.
   Also from Florian.

3) Use existing match/target extensions in the internal nft_compat extension
   lists when the extension is family unspecific (ie. NFPROTO_UNSPEC).

4) Wait for rcu grace period before leaving nf_log_unregister().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:32:20 -07:00
Erik Hugne 4e3ae00100 tipc: reinitialize pointer after skb linearize
The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Tamás Végh <tamas.vegh@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:31:20 -07:00
Ksenija Stanojevic 22a3f9a204 rxrpc: Replace get_seconds with ktime_get_seconds
Replace time_t type and get_seconds function which are not y2038 safe
on 32-bit systems. Function ktime_get_seconds use monotonic instead of
real time and therefore will not cause overflow.

Signed-off-by: Ksenija Stanojevic <ksenija.stanojevic@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 21:53:56 -07:00
Nikola Forró 0315e38270 net: Fix behaviour of unreachable, blackhole and prohibit routes
Man page of ip-route(8) says following about route types:

  unreachable - these destinations are unreachable.  Packets are dis‐
  carded and the ICMP message host unreachable is generated.  The local
  senders get an EHOSTUNREACH error.

  blackhole - these destinations are unreachable.  Packets are dis‐
  carded silently.  The local senders get an EINVAL error.

  prohibit - these destinations are unreachable.  Packets are discarded
  and the ICMP message communication administratively prohibited is
  generated.  The local senders get an EACCES error.

In the inet6 address family, this was correct, except the local senders
got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route.
In the inet address family, all three route types generated ICMP message
net unreachable, and the local senders got ENETUNREACH error.

In both address families all three route types now behave consistently
with documentation.

Signed-off-by: Nikola Forró <nforro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 21:45:08 -07:00
Trond Myklebust 4b0ab51db3 SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
Under all conditions, it should be quite sufficient just to mark
the socket as disconnected. It will then be closed by the
transport shutdown or reconnect code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-19 16:38:35 -04:00
Trond Myklebust 79234c3db6 SUNRPC: Lock the transport layer on shutdown
Avoid all races with the connect/disconnect handlers by taking the
transport lock.

Reported-by:"Suzuki K. Poulose" <suzuki.poulose@arm.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-19 16:37:43 -04:00
Eric W. Biederman 0a031ac5c0 netfilter: Use nf_ct_net instead of dev_net(out) in nf_nat_masquerade_ipv6
Use nf_ct_net(ct) instead of guessing that the netdevice out can
reliably report the network namespace the conntrack operation is
happening in.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:28 +02:00
Eric W. Biederman c7af6483b9 netfilter: Pass net into nf_xfrm_me_harder
Instead of calling dev_net on a likley looking network device
pass state->net into nf_xfrm_me_harder.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:22 +02:00
Eric W. Biederman 06198b34a3 netfilter: Pass priv instead of nf_hook_ops to netfilter hooks
Only pass the void *priv parameter out of the nf_hook_ops.  That is
all any of the functions are interested now, and by limiting what is
passed it becomes simpler to change implementation details.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:16 +02:00
Eric W. Biederman 176971b338 ipvs: Read hooknum from state rather than ops->hooknum
This should be more cache efficient as state is more likely to be in
core, and the netfilter core will stop passing in ops soon.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:10 +02:00
Eric W. Biederman a31f1adc09 netfilter: nf_conntrack: Add a struct net parameter to l4_pkt_to_tuple
As gre does not have the srckey in the packet gre_pkt_to_tuple
needs to perform a lookup in it's per network namespace tables.

Pass in the proper network namespace to all pkt_to_tuple
implementations to ensure gre (and any similar protocols) can get this
right.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 22:00:04 +02:00
Eric W. Biederman a4ffe319ae act_connmark: Remember the struct net instead of guessing it.
Stop guessing the struct net instead of remember it.  Guessing is just
silly and will be problematic in the future when I implement routes
between network namespaces.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:59:31 +02:00
Eric W. Biederman 206e8c0075 netfilter: Pass net to nf_dup_ipv4 and nf_dup_ipv6
This allows them to stop guessing the network namespace with pick_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:59:11 +02:00
Eric W. Biederman 88182a0e0c netfilter: nf_tables: Use pkt->net instead of computing net from the passed net_devices
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:49 +02:00
Eric W. Biederman 686c9b5080 netfilter: x_tables: Use par->net instead of computing from the passed net devices
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:25 +02:00
Eric W. Biederman 156c196f60 netfilter: x_tables: Pass struct net in xt_action_param
As xt_action_param lives on the stack this does not bloat any
persistent data structures.

This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:14 +02:00
Eric W. Biederman 6aa187f21c netfilter: nf_tables: kill nft_pktinfo.ops
- Add nft_pktinfo.pf to replace ops->pf
- Add nft_pktinfo.hook to replace ops->hooknum

This simplifies the code, makes it more readable, and likely reduces
cache line misses.  Maintainability is enhanced as the details of
nft_hook_ops are of no concern to the recpients of nft_pktinfo.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:58:01 +02:00
Eric W. Biederman 082a758f04 inet netfilter: Prefer state->hook to ops->hooknum
The values of nf_hook_state.hook and nf_hook_ops.hooknum must be the
same by definition.

We are more likely to access the fields in nf_hook_state over the
fields in nf_hook_ops so with a little luck this results in
fewer cache line misses, and slightly more consistent code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:57:51 +02:00
Eric W. Biederman 6cb8ff3f1a inet netfilter: Remove hook from ip6t_do_table, arp_do_table, ipt_do_table
The values of ops->hooknum and state->hook are guaraneted to be equal
making the hook argument to ip6t_do_table, arp_do_table, and
ipt_do_table is unnecessary. Remove the unnecessary hook argument.

In the callers use state->hook instead of ops->hooknum for clarity and
to reduce the number of cachelines the callers touch.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:57:43 +02:00
Eric W. Biederman 97b59c3a91 netfilter: ebtables: Simplify the arguments to ebt_do_table
Nearly everything thing of interest to ebt_do_table is already present
in nf_hook_state.  Simplify ebt_do_table by just passing in the skb,
nf_hook_state, and the table.  This make the code easier to read and
maintenance easier.

To support this create an nf_hook_state on the stack in ebt_broute
(the only caller without a nf_hook_state already available).  This new
nf_hook_state adds no new computations to ebt_broute, but does use a
few more bytes of stack.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:57:35 +02:00
Pablo Neira Ayuso 36aea585a1 Merge tag 'ipvs-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
IPVS Updates for v4.4

please consider these IPVS Updates for v4.4.

The updates include the following from Alex Gartrell:
* Scheduling of ICMP
* Sysctl to ignore tunneled packets; and hence some packet-looping scenarios
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-18 21:05:03 +02:00
Szymon Janc 6818375e97 Bluetooth: Fix reporting incorrect EIR in device found mgmt event
Some remote devices (ie Gigaset G-Tag) misbehave with ADV data length.
This can lead to incorrect EIR format in device found event when
ADV_DATA and SCAN_RSP are merged (terminator field before SCAN_RSP
part).

Fix this by inspecting ADV_DATA and correct its length if terminator
is found.

> HCI Event: LE Meta Event (0x3e) plen 42              [hci0] 32.172182
      LE Advertising Report (0x02)
        Num reports: 1
        Event type: Connectable undirected - ADV_IND (0x00)
        Address type: Public (0x00)
        Address: 7C:2F:80:94:97:5A (Gigaset Communications GmbH)
        Data length: 30
        Flags: 0x06
          LE General Discoverable Mode
          BR/EDR Not Supported
        Company: Gigaset Communications GmbH (384)
          Data: 021512348094975abbc5
        16-bit Service UUIDs (partial): 1 entry
          Battery Service (0x180f)
        RSSI: -65 dBm (0xbf)
> HCI Event: LE Meta Event (0x3e) plen 27              [hci0] 32.172191
      LE Advertising Report (0x02)
        Num reports: 1
        Event type: Scan response - SCAN_RSP (0x04)
        Address type: Public (0x00)
        Address: 7C:2F:80:94:97:5A (Gigaset Communications GmbH)
        Data length: 15
        Name (complete): Gigaset G-tag
        RSSI: -59 dBm (0xc5)

Note "Data length: 30" in ADV_DATA which results in 9 extra zero bytes
after Battery Service UUID. Terminator field present in the middle of
EIR in Device Found event resulted in userspace stop parsing EIR and
skipping device name.

@ Device Found: 7C:2F:80:94:97:5A (1) rssi -59 flags 0x0000
      02 01 06 0d ff 80 01 02 15 12 34 80 94 97 5a bb  ..........4...Z.
      c5 03 02 0f 18 00 00 00 00 00 00 00 00 00 0e 09  ................
      47 69 67 61 73 65 74 20 47 2d 74 61 67           Gigaset G-tag

With this fix EIR with merged ADV_DATA and SCAN_RSP in device found
event is properly formatted:

@ Device Found: 7C:2F:80:94:97:5A (1) rssi -59 flags 0x0000
      02 01 06 0d ff 80 01 02 15 12 34 80 94 97 5a bb  ..........4...Z.
      c5 03 02 0f 18 0e 09 47 69 67 61 73 65 74 20 47  .......Gigaset G
      2d 74 61 67                                      -tag

Signed-off-by: Szymon Janc <ext.szymon.janc@tieto.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-09-18 09:53:20 +02:00
Szymon Janc e781b7f7fc Bluetooth: Add BT_ERR_RATELIMITED
This patch adds ratelimited version of the BT_ERR macro.

Signed-off-by: Szymon Janc <ext.szymon.janc@tieto.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-09-18 09:53:19 +02:00
Eric Dumazet 47bbbb30b4 sch_dsmark: improve memory locality
Memory placement in sch_dsmark is silly : Better place mask/value
in the same cache line.

Also, we can embed small arrays in the first cache line and
remove a potential cache miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:37:19 -07:00
Eric Dumazet c2e7204d18 tcp_cubic: do not set epoch_start in the future
Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.

Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given CUBIC bug has been there
forever before Jana noticed it.

Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.

This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.

Fixes: 30927520db ("tcp_cubic: better follow cubic curve after idle period")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Jana Iyengar <jri@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:35:07 -07:00
David S. Miller 1dbb2413cb Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2015-09-17

Here's one important patch for the 4.3-rc series that fixes an issue
with Bluetooth LE encryption failing because of a too early check for
the SMP context.

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:25:51 -07:00
Sasha Levin 34f5b00664 atm: deal with setting entry before mkip was called
If we didn't call ATMARP_MKIP before ATMARP_ENCAP the VCC descriptor is
non-existant and we'll end up dereferencing a NULL ptr:

[1033173.491930] kasan: GPF could be caused by NULL-ptr deref or user memory accessirq event stamp: 123386
[1033173.493678] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[1033173.493689] Modules linked in:
[1033173.493697] CPU: 9 PID: 23815 Comm: trinity-c64 Not tainted 4.2.0-next-20150911-sasha-00043-g353d875-dirty #2545
[1033173.493706] task: ffff8800630c4000 ti: ffff880063110000 task.ti: ffff880063110000
[1033173.493823] RIP: clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
[1033173.493826] RSP: 0018:ffff880063117a88  EFLAGS: 00010203
[1033173.493828] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 000000000000000c
[1033173.493830] RDX: 0000000000000002 RSI: ffffffffb3f10720 RDI: 0000000000000014
[1033173.493832] RBP: ffff880063117b80 R08: ffff88047574d9a4 R09: 0000000000000000
[1033173.493834] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000c622f53
[1033173.493836] R13: ffff8800cb905500 R14: ffff8808d6da2000 R15: 00000000fffffdfd
[1033173.493840] FS:  00007fa56b92d700(0000) GS:ffff880478000000(0000) knlGS:0000000000000000
[1033173.493843] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1033173.493845] CR2: 0000000000000000 CR3: 00000000630e8000 CR4: 00000000000006a0
[1033173.493855] Stack:
[1033173.493862]  ffffffffb0b60444 000000000000eaea 0000000041b58ab3 ffffffffb3c3ce32
[1033173.493867]  ffffffffb0b6f3e0 ffffffffb0b60444 ffffffffb5ea2e50 1ffff1000c622f5e
[1033173.493873]  ffff8800630c4cd8 00000000000ee09a ffffffffb3ec4888 ffffffffb5ea2de8
[1033173.493874] Call Trace:
[1033173.494108] do_vcc_ioctl (net/atm/ioctl.c:170)
[1033173.494113] vcc_ioctl (net/atm/ioctl.c:189)
[1033173.494116] svc_ioctl (net/atm/svc.c:605)
[1033173.494200] sock_do_ioctl (net/socket.c:874)
[1033173.494204] sock_ioctl (net/socket.c:958)
[1033173.494244] do_vfs_ioctl (fs/ioctl.c:43 fs/ioctl.c:607)
[1033173.494290] SyS_ioctl (fs/ioctl.c:622 fs/ioctl.c:613)
[1033173.494295] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
[1033173.494362] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 50 09 00 00 49 8b 9e 60 06 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 14 48 89 fa 48 c1 ea 03 <0f> b6 04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 14 09 00
All code

========
   0:   fa                      cli
   1:   48 c1 ea 03             shr    $0x3,%rdx
   5:   80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)
   9:   0f 85 50 09 00 00       jne    0x95f
   f:   49 8b 9e 60 06 00 00    mov    0x660(%r14),%rbx
  16:   48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
  1d:   fc ff df
  20:   48 8d 7b 14             lea    0x14(%rbx),%rdi
  24:   48 89 fa                mov    %rdi,%rdx
  27:   48 c1 ea 03             shr    $0x3,%rdx
  2b:*  0f b6 04 02             movzbl (%rdx,%rax,1),%eax               <-- trapping instruction
  2f:   48 89 fa                mov    %rdi,%rdx
  32:   83 e2 07                and    $0x7,%edx
  35:   38 d0                   cmp    %dl,%al
  37:   7f 08                   jg     0x41
  39:   84 c0                   test   %al,%al
  3b:   0f 85 14 09 00 00       jne    0x955

Code starting with the faulting instruction
===========================================
   0:   0f b6 04 02             movzbl (%rdx,%rax,1),%eax
   4:   48 89 fa                mov    %rdi,%rdx
   7:   83 e2 07                and    $0x7,%edx
   a:   38 d0                   cmp    %dl,%al
   c:   7f 08                   jg     0x16
   e:   84 c0                   test   %al,%al
  10:   0f 85 14 09 00 00       jne    0x92a
[1033173.494366] RIP clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
[1033173.494368]  RSP <ffff880063117a88>

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:13:32 -07:00
Florian Westphal 1d325d217c ipv6: ip6_fragment: fix headroom tests and skb leak
David Woodhouse reports skb_under_panic when we try to push ethernet
header to fragmented ipv6 skbs:

 skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000
 data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
[..]
ip6_finish_output2+0x196/0x4da

David further debugged this:
  [..] offending fragments were arriving here with skb_headroom(skb)==10.
  Which is reasonable, being the Solos ADSL card's header of 8 bytes
  followed by 2 bytes of PPP frame type.

The problem is that if netfilter ipv6 defragmentation is used, skb_cow()
in ip6_forward will only see reassembled skb.

Therefore, headroom is overestimated by 8 bytes (we pulled fragment
header) and we don't check the skbs in the frag_list either.

We can't do these checks in netfilter defrag since outdev isn't known yet.

Furthermore, existing tests in ip6_fragment did not consider the fragment
or ipv6 header size when checking headroom of the fraglist skbs.

While at it, also fix a skb leak on memory allocation -- ip6_fragment
must consume the skb.

I tested this e1000 driver hacked to not allocate additional headroom
(we end up in slowpath, since LL_RESERVED_SPACE is 16).

If 2 bytes of headroom are allocated, fastpath is taken (14 byte
ethernet header was pulled, so 16 byte headroom available in all
fragments).

Reported-by: David Woodhouse <dwmw2@infradead.org>
Diagnosed-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:36:15 -07:00
David Ahern bde6f9ded1 net: Initialize table in fib result
Sergey, Richard and Fabio reported an oops in ip_route_input_noref. e.g., from Richard:

[    0.877040] BUG: unable to handle kernel NULL pointer dereference at 0000000000000056
[    0.877597] IP: [<ffffffff8155b5e2>] ip_route_input_noref+0x1a2/0xb00
[    0.877597] PGD 3fa14067 PUD 3fa6e067 PMD 0
[    0.877597] Oops: 0000 [#1] SMP
[    0.877597] Modules linked in: virtio_net virtio_pci virtio_ring virtio
[    0.877597] CPU: 1 PID: 119 Comm: ifconfig Not tainted 4.2.0+ #1
[    0.877597] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.877597] task: ffff88003fab0bc0 ti: ffff88003faa8000 task.ti: ffff88003faa8000
[    0.877597] RIP: 0010:[<ffffffff8155b5e2>]  [<ffffffff8155b5e2>] ip_route_input_noref+0x1a2/0xb00
[    0.877597] RSP: 0018:ffff88003ed03ba0  EFLAGS: 00010202
[    0.877597] RAX: 0000000000000046 RBX: 00000000ffffff8f RCX: 0000000000000020
[    0.877597] RDX: ffff88003fab50b8 RSI: 0000000000000200 RDI: ffffffff8152b4b8
[    0.877597] RBP: ffff88003ed03c50 R08: 0000000000000000 R09: 0000000000000000
[    0.877597] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003fab6f00
[    0.877597] R13: ffff88003fab5000 R14: 0000000000000000 R15: ffffffff81cb5600
[    0.877597] FS:  00007f6de5751700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000
[    0.877597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.877597] CR2: 0000000000000056 CR3: 000000003fa6d000 CR4: 00000000000006e0
[    0.877597] Stack:
[    0.877597]  0000000000000000 0000000000000046 ffff88003fffa600 ffff88003ed03be0
[    0.877597]  ffff88003f9e2c00 697da8c0017da8c0 ffff880000000000 000000000007fd00
[    0.877597]  0000000000000000 0000000000000046 0000000000000000 0000000400000000
[    0.877597] Call Trace:
[    0.877597]  <IRQ>
[    0.877597]  [<ffffffff812bfa1f>] ? cpumask_next_and+0x2f/0x40
[    0.877597]  [<ffffffff8158e13c>] arp_process+0x39c/0x690
[    0.877597]  [<ffffffff8158e57e>] arp_rcv+0x13e/0x170
[    0.877597]  [<ffffffff8151feec>] __netif_receive_skb_core+0x60c/0xa00
[    0.877597]  [<ffffffff81515795>] ? __build_skb+0x25/0x100
[    0.877597]  [<ffffffff81515795>] ? __build_skb+0x25/0x100
[    0.877597]  [<ffffffff81521ff6>] __netif_receive_skb+0x16/0x70
[    0.877597]  [<ffffffff81522078>] netif_receive_skb_internal+0x28/0x90
[    0.877597]  [<ffffffff8152288f>] napi_gro_receive+0x7f/0xd0
[    0.877597]  [<ffffffffa0017906>] virtnet_receive+0x256/0x910 [virtio_net]
[    0.877597]  [<ffffffffa0017fd8>] virtnet_poll+0x18/0x80 [virtio_net]
[    0.877597]  [<ffffffff815234cd>] net_rx_action+0x1dd/0x2f0
[    0.877597]  [<ffffffff81053228>] __do_softirq+0x98/0x260
[    0.877597]  [<ffffffff8164969c>] do_softirq_own_stack+0x1c/0x30

The root cause is use of res.table uninitialized.

Thanks to Nikolay for noticing the uninitialized use amongst the maze of
gotos.

As Nikolay pointed out the second initialization is not required to fix
the oops, but rather to fix a related problem where a valid lookup should
be invalidated before creating the rth entry.

Fixes: b7503e0cdb ("net: Add FIB table id to rtable")
Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Reported-by: Richard Alpe <richard.alpe@ericsson.com>
Reported-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:34:08 -07:00
Alexei Starovoitov 27b29f6305 bpf: add bpf_redirect() helper
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:09:07 -07:00
Daniel Borkmann 045efa82ff cls_bpf: introduce integrated actions
Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
  /* classify arp, ip, ipv6 into different traffic classes
   * and drop all other packets
   */
  switch (skb->protocol) {
  case htons(ETH_P_ARP):
    skb->tc_classid = 1;
    break;
  case htons(ETH_P_IP):
    skb->tc_classid = 2;
    break;
  case htons(ETH_P_IPV6):
    skb->tc_classid = 3;
    break;
  default:
    return TC_ACT_SHOT;
  }

  return TC_ACT_OK;
}

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:09:06 -07:00
Junwei Zhang f6c53334d6 net: only check perm protocol when register proto
The permanent protocol nodes are at the head of the list,
So only need check all these nodes.

No matter the new node is permanent or not,
insert the new node after the last permanent protocol node,

If the new node conflicts with existing permanent node,
return error.

Signed-off-by: Martin Zhang <martinbj2008@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:02:59 -07:00
Eric Dumazet 58d607d3e5 tcp: provide skb->hash to synack packets
In commit b73c3d0e4f ("net: Save TX flow hash in sock and set in skbuf
on xmit"), Tom provided a l4 hash to most outgoing TCP packets.

We'd like to provide one as well for SYNACK packets, so that all packets
of a given flow share same txhash, to later enable bonding driver to
also use skb->hash to perform slave selection.

Note that a SYNACK retransmit shuffles the tx hash, as Tom did
in commit 265f94ff54 ("net: Recompute sk_txhash on negative routing
advice") for established sockets.

This has nice effect making TCP flows resilient to some kind of black
holes, even at connection establish phase.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:01:04 -07:00
Eric W. Biederman be10de0a32 netfilter: Add blank lines in callers of netfilter hooks
In code review it was noticed that I had failed to add some blank lines
in places where they are customarily used.  Taking a second look at the
code I have to agree blank lines would be nice so I have added them
here.

Reported-by:  Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman 0c4b51f005 netfilter: Pass net into okfn
This is immediately motivated by the bridge code that chains functions that
call into netfilter.  Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn.  For the moment dst_output_okfn
just silently drops the struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman 9dff2c966a netfilter: Use nf_hook_state.net
Instead of saying "net = dev_net(state->in?state->in:state->out)"
just say "state->net".  As that information is now availabe,
much less confusing and much less error prone.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman 29a26a5680 netfilter: Pass struct net into the netfilter hooks
Pass a network namespace parameter into the netfilter hooks.  At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.

This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".

In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.

The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
ipv6/ip6_output.c:ip6_xmit()			sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev

In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)".  I am documenting them in case something odd
pops up and someone starts trying to track down what happened.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman 04eb44890e bridge: Add br_netif_receive_skb remove netif_receive_skb_sk
netif_receive_skb_sk is only called once in the bridge code, replace
it with a bridge specific function that calls netif_receive_skb.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman f2d74cf88c bridge: Cache net in br_nf_pre_routing_finish
This is prep work for passing net to the netfilter hooks.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:36 -07:00
Eric W. Biederman 6532948b2e bridge: Pass net into br_nf_push_frag_xmit
When struct net starts being passed through the ipv4 and ipv6 fragment
routines br_nf_push_frag_xmit will need to take a net parameter.
Prepare br_nf_push_frag_xmit before that is needed and introduce
br_nf_push_frag_xmit_sk for the call sites that still need the old
calling conventions.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:36 -07:00
Eric W. Biederman 8d4df0b930 bridge: Pass net into br_nf_ip_fragment
This is a prep work for passing struct net through ip_do_fragment and
later the netfilter okfn.   Doing this independently makes the later
code changes clearer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:36 -07:00
Eric W. Biederman adb28c9d33 ipv6: Compute net once in raw6_send_hdrinc
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:36 -07:00
Eric W. Biederman 19a0644ca3 ipv6: Cache net in ip6_output
Keep net in a local variable so I can use it in NF_HOOK_COND
when I pass struct net to all of the netfilter hooks.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:36 -07:00
Eric W. Biederman 78126c4190 ipv6: Only compute net once in ip6_finish_output2
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:35 -07:00
Eric W. Biederman 9865249fd1 ipv6: Don't recompute net in ip6_rcv
Avoid silly redundant code

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:35 -07:00
Eric W. Biederman 2b4aa3cec4 net: Remove dev_queue_xmit_sk
A function with weird arguments that it will never use to accomdate a
netfilter callback prototype is absolutely in the core of the
networking stack.  Frankly it does not make sense and it causes a lot
of confusion as to why arguments that are never used are being passed
to the function.

As I am preparing to make a second change to arguments to the okfn even
the names stops making sense.

As I have removed the two callers of this function remove this confusion
from the networking stack.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:35 -07:00
Eric W. Biederman 1f19c578df bridge: Introduce br_send_bpdu_finish
The function dev_queue_xmit_skb_sk is unncessary and very confusing.
Introduce br_send_bpdu_finish to remove the need for dev_queue_xmit_skb_sk,
and have br_send_bpdu_finish call dev_queue_xmit.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:35 -07:00
Eric W. Biederman f9e4306fd8 arp: Introduce arp_xmit_finish
The function dev_queue_xmit_skb_sk is unncessary and very confusing.
Introduce arp_xmit_finish to remove the need for dev_queue_xmit_skb_sk,
and have arp_xmit_finish call dev_queue_xmit.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:35 -07:00
Eric W. Biederman 244ba77985 ipv6: Only compute net once in ip6mr_forward2_finish
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:34 -07:00
Eric W. Biederman 758ccac8e7 ipv4: Only compute net once in ipmr_forward_finish
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:34 -07:00
Eric W. Biederman 38184b3b07 ipv4: Only compute net once in ip_rcv_finish
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:34 -07:00
Eric W. Biederman 4ba1bf4292 ipv4: Only compute net once in ip_finish_output2
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:34 -07:00
Eric W. Biederman 9479b0af48 ipv4: Explicitly compute net in ip_fragment
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:34 -07:00
Eric W. Biederman 26a949dbd5 ipv4: Only compute net once in ip_do_fragment
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:34 -07:00
Eric W. Biederman cc4c851e4b ipv4: Don't recompute net in ipmr_queue_xmit
Calling dev_net(dev) for is just silly.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:33 -07:00
Eric W. Biederman 88f5cc2458 ipv4: Remember the net in ip_output and ip_mc_output
This is a prepatory patch to passing net int the netfilter hooks,
where net will be used again.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:33 -07:00
Eric W. Biederman e707766ce0 ipv4: Compute net once in ip_rcv
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:33 -07:00
Eric W. Biederman f8e1ac7912 ipv4: Compute net once in ip_forward_finish
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:33 -07:00
Eric W. Biederman fcad0ac2da ipv4: Compute net once in ip_forward
Compute struct net from the input device in ip_forward before it is
used.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:33 -07:00
Eric W. Biederman 5a70649e0d net: Merge dst_output and dst_output_sk
Add a sock paramter to dst_output making dst_output_sk superfluous.
Add a skb->sk parameter to all of the callers of dst_output
Have the callers of dst_output_sk call dst_output.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:32 -07:00
Eric W. Biederman a6568b2425 xfrm: Remove unused afinfo method init_dst
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:32 -07:00
David Ahern 58189ca7b2 net: Fix vti use case with oif in dst lookups
Steffen reported that the recent change to add oif to dst lookups breaks
the VTI use case. The problem is that with the oif set in the flow struct
the comparison to the nh_oif is triggered. Fix by splitting the
FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
nh oif (FLOWI_FLAG_SKIP_NH_OIF).

Fixes: 42a7b32b73 ("xfrm: Add oif to dst lookups")

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 16:36:34 -07:00
Thadeu Lima de Souza Cascardo c4047f533f net-sysfs: get_netdev_queue_index() cleanup
Redo commit ed1acc8cd8.

Commit 822b3b2ebf ("net: Add max rate tx queue
attribute") moved get_netdev_queue_index around, but kept the old version.
Probably because of a reuse of the original patch from before Eric's change to
that function.

Remove one inline keyword, and no need for a loop to find
an index into a table.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Fixes: 822b3b2ebf ("net: Add max rate tx queue attribute")
Acked-by:  Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 16:35:15 -07:00
Joe Stringer cc5706056b openvswitch: Fix IPv6 exthdr handling with ct helpers.
Static code analysis reveals the following bug:

        net/openvswitch/conntrack.c:281 ovs_ct_helper()
        warn: unsigned 'protoff' is never less than zero.

This signedness bug breaks error handling for IPv6 extension headers when
using conntrack helpers. Fix the error by using a local signed variable.

Fixes:  cae3a2627520: "openvswitch: Allow attaching helpers to ct
action"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 15:31:49 -07:00
Trond Myklebust 0fdea1e8a2 SUNRPC: Ensure that we wait for connections to complete before retrying
Commit 718ba5b873, moved the responsibility for unlocking the socket to
xs_tcp_setup_socket, meaning that the socket will be unlocked before we
know that it has finished trying to connect. The following patch is based on
an initial patch by Russell King to ensure that we delay clearing the
XPRT_CONNECTING flag until we either know that we failed to initiate
a connection attempt, or the connection attempt itself failed.

Fixes: 718ba5b873 ("SUNRPC: Add helpers to prevent socket create from racing")
Reported-by: Russell King <linux@arm.linux.org.uk>
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tested-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-17 18:01:28 -04:00
Roopa Prabhu 37a1d3611c ipv6: include NLM_F_REPLACE in route replace notifications
This patch adds NLM_F_REPLACE flag to ipv6 route replace notifications.
This makes nlm_flags in ipv6 replace notifications consistent
with ipv4.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 15:00:27 -07:00
Julia Lawall 17a9618e98 SUNRPC: drop null test before destroy functions
Remove unneeded NULL test.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
  \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-17 15:48:24 -04:00
Trond Myklebust 03c78827db SUNRPC: Fix races between socket connection and destroy code
When we're destroying the socket transport, we need to ensure that
we cancel any existing delayed connection attempts, and order them
w.r.t. the call to xs_close().

Reported-by:"Suzuki K. Poulose" <suzuki.poulose@arm.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-17 15:48:23 -04:00
Ilya Dryomov 7f61f54565 libceph: don't access invalid memory in keepalive2 path
This

    struct ceph_timespec ceph_ts;
    ...
    con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);

wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet
ceph_ts becomes invalid on return from prepare_write_keepalive().  As
a result, we send out bogus keepalive2 stamps.  Fix this by encoding
into a ceph_timespec member, similar to how acks are read and written.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2015-09-17 20:14:15 +03:00
Pablo Neira Ayuso ad5001cc7c netfilter: nf_log: wait for rcu grace after logger unregistration
The nf_log_unregister() function needs to call synchronize_rcu() to make sure
that the objects are not dereferenced anymore on module removal.

Fixes: 5962815a6a ("netfilter: nf_log: use an array of loggers instead of list")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-17 13:37:31 +02:00
Marcel Holtmann c5080d42db Bluetooth: Fix typo in smp_ah comment for hash function
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-09-17 13:20:06 +02:00
Stefan Schmidt a1d8d9a57c ieee802154: 6lowpan: use correct ESC value for dispatch
The ESC dispatch value has some history and it originally was 0x7f in rfc4944
(see section-5.1). With the release of rfc6282 this value got part of the
LOWPAN_IPHC range and was no longer available for ESC. Instead 0x40 was used
as replacement (see section-2 in rfc6282).

We have been checking the dispatch byte in an order where IPHC would always be
evaluated before ESC and thus we would never reach the ESC check as the IPHC
range already covers this value.

Signed-off-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-09-17 13:20:04 +02:00
Alexander Aring 776857a87c ieee802154: 6lowpan: fix drop return value
This patch changes the return value of lowpan packet receive handler to
the correct NET_RX_DROP instead RX_DROP.

This issue was detected by sparse and reported from Marcel:
net/ieee802154/6lowpan/rx.c:329:32:    expected int
net/ieee802154/6lowpan/rx.c:329:32:    got restricted lowpan_rx_result ...

Reported-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-09-17 13:20:04 +02:00
Alexander Aring 187625e184 ieee802154: 6lowpan: remove tx full-size calc workaround
This patch removes a workaround for datagram_size calculation while
doing fragmentation on transmit.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-09-17 13:20:04 +02:00