Commit Graph

6800 Commits

Author SHA1 Message Date
Eric Dumazet 845704a535 tcp: avoid looping in tcp_send_fin()
Presence of an unbound loop in tcp_send_fin() had always been hard
to explain when analyzing crash dumps involving gigantic dying processes
with millions of sockets.

Lets try a different strategy :

In case of memory pressure, try to add the FIN flag to last packet
in write queue, even if packet was already sent. TCP stack will
be able to deliver this FIN after a timeout event. Note that this
FIN being delivered by a retransmit, it also carries a Push flag
given our current implementation.

By checking sk_under_memory_pressure(), we anticipate that cooking
many FIN packets might deplete tcp memory.

In the case we could not allocate a packet, even with __GFP_WAIT
allocation, then not sending a FIN seems quite reasonable if it allows
to get rid of this socket, free memory, and not block the process from
eventually doing other useful work.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24 11:06:48 -04:00
Eric Dumazet d83769a580 tcp: fix possible deadlock in tcp_send_fin()
Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
case a huge process is killed by OOM, and tcp_mem[2] is hit.

To be able to free memory we need to make progress, so this
patch allows FIN packets to not care about tcp_mem[2], if
skb allocation succeeded.

In a follow-up patch, we might abort tcp_send_fin() infinite loop
in case TIF_MEMDIE is set on this thread, as memory allocator
did its best getting extra memory already.

This patch reverts d22e153718 ("tcp: fix tcp fin memory accounting")

Fixes: d22e153718 ("tcp: fix tcp fin memory accounting")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 14:13:11 -04:00
jbaron@akamai.com 3c7151275c tcp: add memory barriers to write space paths
Ensure that we either see that the buffer has write space
in tcp_poll() or that we perform a wakeup from the input
side. Did not run into any actual problem here, but thought
that we should make things explicit.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-21 15:57:34 -04:00
Sebastian Pöhn 2ab957492d ip_forward: Drop frames with attached skb->sk
Initial discussion was:
[FYI] xfrm: Don't lookup sk_policy for timewait sockets

Forwarded frames should not have a socket attached. Especially
tw sockets will lead to panics later-on in the stack.

This was observed with TPROXY assigning a tw socket and broken
policy routing (misconfigured). As a result frame enters
forwarding path instead of input. We cannot solve this in
TPROXY as it cannot know that policy routing is broken.

v2:
Remove useless comment

Signed-off-by: Sebastian Poehn <sebastian.poehn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-20 14:07:33 -04:00
Eric Dumazet 521f1cf1db inet_diag: fix access to tcp cc information
Two different problems are fixed here :

1) inet_sk_diag_fill() might be called without socket lock held.
   icsk->icsk_ca_ops can change under us and module be unloaded.
   -> Access to freed memory.
   Fix this using rcu_read_lock() to prevent module unload.

2) Some TCP Congestion Control modules provide information
   but again this is not safe against icsk->icsk_ca_ops
   change and nla_put() errors were ignored. Some sockets
   could not get the additional info if skb was almost full.

Fix this by returning a status from get_info() handlers and
using rcu protection as well.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:28:31 -04:00
Eric Dumazet fad9dfefea tcp: tcp_get_info() should fetch socket fields once
tcp_get_info() can be called without holding socket lock,
so any socket fields can change under us.

Use READ_ONCE() to fetch sk_pacing_rate and sk_max_pacing_rate

Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:28:31 -04:00
WANG Cong 540207ae69 fou: avoid missing unlock in failure path
Fixes: 7a6c8c34e5 ("fou: implement FOU_CMD_GET")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 12:11:19 -04:00
David S. Miller bae97d8410 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

A final pull request, I know it's very late but this time I think it's worth a
bit of rush.

The following patchset contains Netfilter/nf_tables updates for net-next, more
specifically concatenation support and dynamic stateful expression
instantiation.

This also comes with a couple of small patches. One to fix the ebtables.h
userspace header and another to get rid of an obsolete example file in tree
that describes a nf_tables expression.

This time, I decided to paste the original descriptions. This will result in a
rather large commit description, but I think these bytes to keep.

Patrick McHardy says:

====================
netfilter: nf_tables: concatenation support

The following patches add support for concatenations, which allow multi
dimensional exact matches in O(1).

The basic idea is to split the data registers, currently consisting of
4 registers of 16 bytes each, into smaller units, 16 registers of 4
bytes each, and making sure each register store always leaves the
full 32 bit in a well defined state, meaning smaller stores will
zero the remaining bits.

Based on that, we can load multiple adjacent registers with different
values, thereby building a concatenated bigger value, and use that
value for set lookups.

Sets are changed to use variable sized extensions for their key and
data values, removing the fixed limit of 16 bytes while saving memory
if less space is needed.

As a side effect, these patches will allow some nice optimizations in
the future, like using jhash2 in nft_hash, removing the masking in
nft_cmp_fast, optimized data comparison using 32 bit word size etc.
These are not done so far however.

The patches are split up as follows:

 * the first five patches add length validation to register loads and
   stores to make sure we stay within bounds and prepare the validation
   functions for the new addressing mode

 * the next patches prepare for changing to 32 bit addressing by
   introducing a struct nft_regs, which holds the verdict register as
   well as the data registers. The verdict members are moved to a new
   struct nft_verdict to allow to pull struct nft_data out of the stack.

 * the next patches contain preparatory conversions of expressions and
   sets to use 32 bit addressing

 * the next patch introduces so far unused register conversion helpers
   for parsing and dumping register numbers over netlink

 * following is the real conversion to 32 bit addressing, consisting of
   replacing struct nft_data in struct nft_regs by an array of u32s and
   actually translating and validating the new register numbers.

 * the final two patches add support for variable sized data items and
   variable sized keys / data in set elements

The patches have been verified to work correctly with nft binaries using
both old and new addressing.
====================

Patrick McHardy says:

====================
netfilter: nf_tables: dynamic stateful expression instantiation

The following patches are the grand finale of my nf_tables set work,
using all the building blocks put in place by the previous patches
to support something like iptables hashlimit, but a lot more powerful.

Sets are extended to allow attaching expressions to set elements.
The dynset expression dynamically instantiates these expressions
based on a template when creating new set elements and evaluates
them for all new or updated set members.

In combination with concatenations this effectively creates state
tables for arbitrary combinations of keys, using the existing
expression types to maintain that state. Regular set GC takes care
of purging expired states.

We currently support two different stateful expressions, counter
and limit. Using limit as a template we can express the functionality
of hashlimit, but completely unrestricted in the combination of keys.
Using counter we can perform accounting for arbitrary flows.

The following examples from patch 5/5 show some possibilities.
Userspace syntax is still WIP, especially the listing of state
tables will most likely be seperated from normal set listings
and use a more structured format:

1. Limit the rate of new SSH connections per host, similar to iptables
   hashlimit:

        flow ip saddr timeout 60s \
        limit 10/second \
        accept

2. Account network traffic between each set of /24 networks:

        flow ip saddr & 255.255.255.0 . ip daddr & 255.255.255.0 \
        counter

3. Account traffic to each host per user:

        flow skuid . ip daddr \
        counter

4. Account traffic for each combination of source address and TCP flags:

        flow ip saddr . tcp flags \
        counter

The resulting set content after a Xmas-scan look like this:

{
        192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
        192.168.122.1 . ack : counter packets 74 bytes 3848,
        192.168.122.1 . psh | ack : counter packets 35 bytes 3144
}

In the future the "expressions attached to elements" will be extended
to also support user created non-stateful expressions to allow to
efficiently select beween a set of parameter sets, f.i. a set of log
statements with different prefixes based on the interface, which currently
require one rule each. This will most likely have to wait until the next
kernel version though.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14 18:51:19 -04:00
David S. Miller 87ffabb1f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14 15:44:14 -04:00
David S. Miller 6e8a9d9148 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Al Viro says:

====================
netdev-related stuff in vfs.git

There are several commits sitting in vfs.git that probably ought to go in
via net-next.git.  First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the ->sendmsg()
and ->recvmsg() patches a while ago.  It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter.  The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO.  Next there's
afs and rxrpc stuff from dhowells.  And then there's sanitizing kernel_sendmsg
et.al.  + missing inlined helper for "how much data is left in msg->msg_iter" -
this stuff is used in e.g.  cifs stuff, but it belongs in net-next.

That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem

I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 18:18:05 -04:00
Eric Dumazet 789f558cfb tcp/dccp: get rid of central timewait timer
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 16:40:05 -04:00
Kenneth Klette Jonassen 3d0d26c797 tcp: fix bogus RTT for CC when retransmissions are acked
Since retransmitted segments are not used for RTT estimation, previously
SACKed segments present in the rtx queue are used. This estimation can be
several times larger than the actual RTT. When a cumulative ack covers both
previously SACKed and retransmitted segments, CC may thus get a bogus RTT.

Such segments previously had an RTT estimation in tcp_sacktag_one(), so it
seems reasonable to not reuse them in tcp_clean_rtx_queue() at all.

Afaik, this has had no effect on SRTT/RTO because of Karn's check.

Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 13:54:25 -04:00
Patrick McHardy 49499c3e6e netfilter: nf_tables: switch registers to 32 bit addressing
Switch the nf_tables registers from 128 bit addressing to 32 bit
addressing to support so called concatenations, where multiple values
can be concatenated over multiple registers for O(1) exact matches of
multiple dimensions using sets.

The old register values are mapped to areas of 128 bits for compatibility.
When dumping register numbers, values are expressed using the old values
if they refer to the beginning of a 128 bit area for compatibility.

To support concatenations, register loads of less than a full 32 bit
value need to be padded. This mainly affects the payload and exthdr
expressions, which both unconditionally zero the last word before
copying the data.

Userspace fully passes the testsuite using both old and new register
addressing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:29 +02:00
Patrick McHardy a55e22e92f netfilter: nf_tables: get rid of NFT_REG_VERDICT usage
Replace the array of registers passed to expressions by a struct nft_regs,
containing the verdict as a seperate member, which aliases to the
NFT_REG_VERDICT register.

This is needed to seperate the verdict from the data registers completely,
so their size can be changed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-13 17:17:07 +02:00
WANG Cong 7a6c8c34e5 fou: implement FOU_CMD_GET
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong 02d793c5bb fou: add network namespace support
Also convert the spinlock to a mutex.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong 4cbcdf2b6c fou: always use be16 for port
udp_config.local_udp_port is be16. And iproute2 passes
network order for FOU_ATTR_PORT.

This doesn't fix any bug, just for consistency.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong 67270636a8 fou: exit early when parsing config fails
Not a big deal, just for corretness.

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
WANG Cong 9272f04872 fou: avoid calling udp_del_offload() twice
This fixes the following harmless warning:

./ip/ip fou del port 7777
[  122.907516] udp_del_offload: didn't find offload for port 7777

Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-12 21:25:13 -04:00
Al Viro 01e97e6517 new helper: msg_data_left()
convert open-coded instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 15:53:35 -04:00
Eric Dumazet b52e69217b tcp: md5: fix a typo in tcp_v4_md5_lookup()
Lookup key for tcp_md5_do_lookup() has to be taken
from addr_sk, not sk (which can be the listener)

Fixes: fd3a154a00 ("tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 18:11:16 -04:00
Eric Dumazet b50edd7812 tcp: tcp_make_synack() should clear skb->tstamp
I noticed tcpdump was giving funky timestamps for locally
generated SYNACK messages on loopback interface.

11:42:46.938990 IP 127.0.0.1.48245 > 127.0.0.2.23850: S
945476042:945476042(0) win 43690 <mss 65495,nop,nop,sackOK,nop,wscale 7>

20:28:58.502209 IP 127.0.0.2.23850 > 127.0.0.1.48245: S
3160535375:3160535375(0) ack 945476043 win 43690 <mss
65495,nop,nop,sackOK,nop,wscale 7>

This is because we need to clear skb->tstamp before
entering lower stack, otherwise net_timestamp_check()
does not set skb->tstamp.

Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 17:28:57 -04:00
Jesse Gross b736a623bd udptunnels: Call handle_offloads after inserting vlan tag.
handle_offloads() calls skb_reset_inner_headers() to store
the layer pointers to the encapsulated packet. However, we
currently push the vlag tag (if there is one) onto the packet
afterwards. This changes the MAC header for the encapsulated
packet but it is not reflected in skb->inner_mac_header, which
breaks GSO and drivers which attempt to use this for encapsulation
offloads.

Fixes: 1eaa8178 ("vxlan: Add tx-vlan offload support.")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:56:32 -04:00
David S. Miller ca69d7102f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree.
They are:

* nf_tables set timeout infrastructure from Patrick Mchardy.

1) Add support for set timeout support.

2) Add support for set element timeouts using the new set extension
   infrastructure.

4) Add garbage collection helper functions to get rid of stale elements.
   Elements are accumulated in a batch that are asynchronously released
   via RCU when the batch is full.

5) Add garbage collection synchronization helpers. This introduces a new
   element busy bit to address concurrent access from the netlink API and the
   garbage collector.

5) Add timeout support for the nft_hash set implementation. The garbage
   collector peridically checks for stale elements from the workqueue.

* iptables/nftables cgroup fixes:

6) Ignore non full-socket objects from the input path, otherwise cgroup
   match may crash, from Daniel Borkmann.

7) Fix cgroup in nf_tables.

8) Save some cycles from xt_socket by skipping packet header parsing when
   skb->sk is already set because of early demux. Also from Daniel.

* br_netfilter updates from Florian Westphal.

9) Save frag_max_size and restore it from the forward path too.

10) Use a per-cpu area to restore the original source MAC address when traffic
    is DNAT'ed.

11) Add helper functions to access physical devices.

12) Use these new physdev helper function from xt_physdev.

13) Add another nf_bridge_info_get() helper function to fetch the br_netfilter
    state information.

14) Annotate original layer 2 protocol number in nf_bridge info, instead of
    using kludgy flags.

15) Also annotate the pkttype mangling when the packet travels back and forth
    from the IP to the bridge layer, instead of using a flag.

* More nf_tables set enhancement from Patrick:

16) Fix possible usage of set variant that doesn't support timeouts.

17) Avoid spurious "set is full" errors from Netlink API when there are pending
    stale elements scheduled to be released.

18) Restrict loop checks to set maps.

19) Add support for dynamic set updates from the packet path.

20) Add support to store optional user data (eg. comments) per set element.

BTW, I have also pulled net-next into nf-next to anticipate the conflict
resolution between your okfn() signature changes and Florian's br_netfilter
updates.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:46:04 -04:00
Al Viro 237dae8890 Merge branch 'iocb' into for-davem
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:01:38 -04:00
Eric Dumazet dd929c1b3d tcp: do not rearm rsk_timer on FastOpen requests
FastOpen requests are not like other regular request sockets.

They do not yet use rsk_timer : tcp_fastopen_queue_check()
simply manually removes one expired request from fastopenq->rskq_rst
list.

Therefore, tcp_check_req() must not call mod_timer_pending(),
otherwise we crash because rsk_timer was not initialized.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 20:19:42 -04:00
Andi Kleen 5eeb292215 fou: Don't use const __read_mostly
const __read_mostly is a senseless combination. If something
is already const it cannot be __read_mostly. Remove the bogus
__read_mostly in the fou driver.

This fixes section conflicts with LTO.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 15:29:08 -04:00
David Miller c1f8667677 netfilter: Fix switch statement warnings with recent gcc.
More recent GCC warns about two kinds of switch statement uses:

1) Switching on an enumeration, but not having an explicit case
   statement for all members of the enumeration.  To show the
   compiler this is intentional, we simply add a default case
   with nothing more than a break statement.

2) Switching on a boolean value.  I think this warning is dumb
   but nevertheless you get it wholesale with -Wswitch.

This patch cures all such warnings in netfilter.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 15:20:50 -04:00
Pablo Neira Ayuso aadd51aa71 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Resolve conflicts between 5888b93 ("Merge branch 'nf-hook-compress'") and
Florian Westphal br_netfilter works.

Conflicts:
        net/bridge/br_netfilter.c

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 18:30:21 +02:00
Hannes Frederic Sowa 926a882f69 ipv4: ip_tunnel: use net namespace from rtable not socket
The socket parameter might legally be NULL, thus sock_net is sometimes
causing a NULL pointer dereference. Using net_device pointer in dst_entry
is more reliable.

Fixes: b6a7719aed ("ipv4: hash net ptr into fragmentation bucket selection")
Reported-by: Rick Jones <rick.jones2@hp.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 12:09:42 -04:00
Florian Westphal c737b7c451 netfilter: bridge: add helpers for fetching physin/outdev
right now we store this in the nf_bridge_info struct, accessible
via skb->nf_bridge.  This patch prepares removal of this pointer from skb:

Instead of using skb->nf_bridge->x, we use helpers to obtain the in/out
device (or ifindexes).

Followup patches to netfilter will then allow nf_bridge_info to be
obtained by a call into the br_netfilter core, rather than keeping a
pointer to it in sk_buff.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-08 16:49:08 +02:00
Sheng Yong 8bc0034cf6 net: remove extra newlines
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 22:24:37 -04:00
Daniel Lee 2646c831c0 tcp: RFC7413 option support for Fast Open client
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies.  This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts.  If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.

The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.

Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 18:36:39 -04:00
Daniel Lee 7f9b838b71 tcp: RFC7413 option support for Fast Open server
Fast Open has been using the experimental option with a magic number
(RFC6994) to request and grant Fast Open cookies. This patch enables
the server to support the official IANA option 34 in RFC7413 in
addition.

The change has passed all existing Fast Open tests with both
old and new options at Google.

Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 18:36:39 -04:00
David Miller 79b16aadea udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb().
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.

Based upon a patch by Jiri Pirko.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
2015-04-07 15:29:08 -04:00
David Miller 7026b1ddb6 netfilter: Pass socket pointer down through okfn().
On the output paths in particular, we have to sometimes deal with two
socket contexts.  First, and usually skb->sk, is the local socket that
generated the frame.

And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.

We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.

The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device.  We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 15:25:55 -04:00
David S. Miller c85d6975ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx4/cmd.c
	net/core/fib_rules.c
	net/ipv4/fib_frontend.c

The fib_rules.c and fib_frontend.c conflicts were locking adjustments
in 'net' overlapping addition and removal of code in 'net-next'.

The mlx4 conflict was a bug fix in 'net' happening in the same
place a constant was being replaced with a more suitable macro.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-06 22:34:15 -04:00
David S. Miller b85c3dc9bd netfilter: Pass nf_hook_state through arpt_do_table().
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-04 13:26:52 -04:00
David S. Miller 073bfd5686 netfilter: Pass nf_hook_state through nft_set_pktinfo*().
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-04 12:54:27 -04:00
David S. Miller 1c491ba259 netfilter: Pass nf_hook_state through ipt_do_table().
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-04 12:47:04 -04:00
David S. Miller d7cf4081ed netfilter: Pass nf_hook_state through nf_nat_ipv4_{in,out,fn,local_fn}().
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-04 12:45:19 -04:00
David S. Miller 238e54c9cb netfilter: Make nf_hookfn use nf_hook_state.
Pass the nf_hook_state all the way down into the hook
functions themselves.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-04 12:31:38 -04:00
David S. Miller 1d1de89b9a netfilter: Use nf_hook_state in nf_queue_entry.
That way we don't have to reinstantiate another nf_hook_state
on the stack of the nf_reinject() path.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-04 12:25:22 -04:00
Ian Morris 00db41243e ipv4: coding style: comparison for inequality with NULL
The ipv4 code uses a mixture of coding styles. In some instances check
for non-NULL pointer is done as x != NULL and sometimes as x. x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03 12:11:15 -04:00
Ian Morris 51456b2914 ipv4: coding style: comparison for equality with NULL
The ipv4 code uses a mixture of coding styles. In some instances check
for NULL pointer is done as x == NULL and sometimes as !x. !x is
preferred according to checkpatch and this patch makes the code
consistent by adopting the latter form.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-03 12:11:15 -04:00
WANG Cong 419df12fb5 net: move fib_rules_unregister() under rtnl lock
We have to hold rtnl lock for fib_rules_unregister()
otherwise the following race could happen:

fib_rules_unregister():	fib_nl_delrule():
...				...
...				ops = lookup_rules_ops();
list_del_rcu(&ops->list);
				list_for_each_entry(ops->rules) {
fib_rules_cleanup_ops(ops);	  ...
  list_del_rcu();		  list_del_rcu();
				}

Note, net->rules_mod_lock is actually not needed at all,
either upper layer netns code or rtnl lock guarantees
we are safe.

Cc: Alexander Duyck <alexander.h.duyck@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 20:52:34 -04:00
WANG Cong ed785309c9 ipv4: take rtnl_lock and mark mrt table as freed on namespace cleanup
This is the IPv4 part for commit 905a6f96a1
(ipv6: take rtnl_lock and mark mrt6 table as freed on namespace cleanup).

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 20:52:34 -04:00
Neal Cardwell 666b805150 tcp: fix FRTO undo on cumulative ACK of SACKed range
On processing cumulative ACKs, the FRTO code was not checking the
SACKed bit, meaning that there could be a spurious FRTO undo on a
cumulative ACK of a previously SACKed skb.

The FRTO code should only consider a cumulative ACK to indicate that
an original/unretransmitted skb is newly ACKed if the skb was not yet
SACKed.

The effect of the spurious FRTO undo would typically be to make the
connection think that all previously-sent packets were in flight when
they really weren't, leading to a stall and an RTO.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Fixes: e33099f96d ("tcp: implement RFC5682 F-RTO")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:35:58 -04:00
David S. Miller 9f0d34bc34 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	drivers/net/usb/sr9800.c
	drivers/net/usb/usbnet.c
	include/linux/usb/usbnet.h
	net/ipv4/tcp_ipv4.c
	net/ipv6/tcp_ipv6.c

The TCP conflicts were overlapping changes.  In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.

With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:16:53 -04:00
Nicolas Dichtel ee9b9596a8 ipmr,ip6mr: implement ndo_get_iflink
Don't use dev->iflink anymore.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:05:00 -04:00
Nicolas Dichtel 1e99584b91 ipip,gre,vti,sit: implement ndo_get_iflink
Don't use dev->iflink anymore.

CC: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:05:00 -04:00
Nicolas Dichtel a54acb3a6f dev: introduce dev_get_iflink()
The goal of this patch is to prepare the removal of the iflink field. It
introduces a new ndo function, which will be implemented by virtual interfaces.

There is no functional change into this patch. All readers of iflink field
now call dev_get_iflink().

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:04:59 -04:00
Jiri Benc 67b61f6c13 netlink: implement nla_get_in_addr and nla_get_in6_addr
Those are counterparts to nla_put_in_addr and nla_put_in6_addr.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc 930345ea63 netlink: implement nla_put_in_addr and nla_put_in6_addr
IP addresses are often stored in netlink attributes. Add generic functions
to do that.

For nla_put_in_addr, it would be nicer to pass struct in_addr but this is
not used universally throughout the kernel, in way too many places __be32 is
used to store IPv4 address.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc 8f55db4860 tcp: simplify inetpeer_addr_base use
In many places, the a6 field is typecasted to struct in6_addr. As the
fields are in union anyway, just add in6_addr type to the union and get rid
of the typecasting.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Alexander Duyck 6e47d6caff fib_trie: Cleanup ip_fib_net_exit code path
While fixing a recent issue I noticed that we are doing some unnecessary
work inside the loop for ip_fib_net_exit.  As such I am pulling out the
initialization to NULL for the locally stored fib_local, fib_main, and
fib_default.

In addition I am restoring the original code for flushing the table as
there is no need to split up the fib_table_flush and hlist_del work since
the code for packing the tnodes with multiple key vectors was dropped.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:18:56 -04:00
Alexander Duyck ad88d05136 fib_trie: Fix warning on fib4_rules_exit
This fixes the following warning:

 BUG: sleeping function called from invalid context at mm/slub.c:1268
 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0
 INFO: lockdep is turned off.
 CPU: 3 PID: 6 Comm: kworker/u8:0 Tainted: G        W       4.0.0-rc5+ #895
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: netns cleanup_net
  0000000000000006 ffff88011953fa68 ffffffff81a203b6 000000002c3a2c39
  ffff88011952a680 ffff88011953fa98 ffffffff8109daf0 ffff8801186c6aa8
  ffffffff81fbc9e5 00000000000004f4 0000000000000000 ffff88011953fac8
 Call Trace:
  [<ffffffff81a203b6>] dump_stack+0x4c/0x65
  [<ffffffff8109daf0>] ___might_sleep+0x1c3/0x1cb
  [<ffffffff8109db70>] __might_sleep+0x78/0x80
  [<ffffffff8117a60e>] slab_pre_alloc_hook+0x31/0x8f
  [<ffffffff8117d4f6>] __kmalloc+0x69/0x14e
  [<ffffffff818ed0e1>] ? kzalloc.constprop.20+0xe/0x10
  [<ffffffff818ed0e1>] kzalloc.constprop.20+0xe/0x10
  [<ffffffff818ef622>] fib_trie_table+0x27/0x8b
  [<ffffffff818ef6bd>] fib_trie_unmerge+0x37/0x2a6
  [<ffffffff810b06e1>] ? arch_local_irq_save+0x9/0xc
  [<ffffffff818e9793>] fib_unmerge+0x2d/0xb3
  [<ffffffff818f5f56>] fib4_rule_delete+0x1f/0x52
  [<ffffffff817f1c3f>] ? fib_rules_unregister+0x30/0xb2
  [<ffffffff817f1c8b>] fib_rules_unregister+0x7c/0xb2
  [<ffffffff818f64a1>] fib4_rules_exit+0x15/0x18
  [<ffffffff818e8c0a>] ip_fib_net_exit+0x23/0xf2
  [<ffffffff818e91f8>] fib_net_exit+0x32/0x36
  [<ffffffff817c8352>] ops_exit_list+0x45/0x57
  [<ffffffff817c8d3d>] cleanup_net+0x13c/0x1cd
  [<ffffffff8108b05d>] process_one_work+0x255/0x4ad
  [<ffffffff8108af69>] ? process_one_work+0x161/0x4ad
  [<ffffffff8108b4b1>] worker_thread+0x1cd/0x2ab
  [<ffffffff8108b2e4>] ? process_scheduled_works+0x2f/0x2f
  [<ffffffff81090686>] kthread+0xd4/0xdc
  [<ffffffff8109ec8f>] ? local_clock+0x19/0x22
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83
  [<ffffffff81a2c0c8>] ret_from_fork+0x58/0x90
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83

The issue was that as a part of exiting the default rules were being
deleted which resulted in the local trie being unmerged.  By moving the
freeing of the FIB tables up we can avoid the unmerge since there is no
local table left when we call the fib4_rules_exit function.

Fixes: 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:18:56 -04:00
David S. Miller 4ef295e047 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree.
Basically, nf_tables updates to add the set extension infrastructure and finish
the transaction for sets from Patrick McHardy. More specifically, they are:

1) Move netns to basechain and use recently added possible_net_t, from
   Patrick McHardy.

2) Use LOGLEVEL_<FOO> from nf_log infrastructure, from Joe Perches.

3) Restore nf_log_trace that was accidentally removed during conflict
   resolution.

4) nft_queue does not depend on NETFILTER_XTABLES, starting from here
   all patches from Patrick McHardy.

5) Use raw_smp_processor_id() in nft_meta.

Then, several patches to prepare ground for the new set extension
infrastructure:

6) Pass object length to the hash callback in rhashtable as needed by
   the new set extension infrastructure.

7) Cleanup patch to restore struct nft_hash as wrapper for struct
   rhashtable

8) Another small source code readability cleanup for nft_hash.

9) Convert nft_hash to rhashtable callbacks.

And finally...

10) Add the new set extension infrastructure.

11) Convert the nft_hash and nft_rbtree sets to use it.

12) Batch set element release to avoid several RCU grace period in a row
    and add new function nft_set_elem_destroy() to consolidate set element
    release.

13) Return the set extension data area from nft_lookup.

14) Refactor existing transaction code to add some helper functions
    and document it.

15) Complete the set transaction support, using similar approach to what we
    already use, to activate/deactivate elements in an atomic fashion.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 12:43:43 -07:00
Eric Dumazet 41d25fe092 tcp: tcp_syn_flood_action() can be static
After commit 1fb6f159fd ("tcp: add tcp_conn_request"),
tcp_syn_flood_action() is no longer used from IPv6.

We can make it static, by moving it above tcp_conn_request()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 12:17:18 -07:00
WANG Cong f243e5a785 ipmr,ip6mr: call ip6mr_free_table() on failure path
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 12:13:54 -07:00
Christoph Hellwig e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Hannes Frederic Sowa b6a7719aed ipv4: hash net ptr into fragmentation bucket selection
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 14:07:04 -04:00
Joe Perches a81b2ce850 netfilter: Use LOGLEVEL_<FOO> defines
Use the #defines where appropriate.

Miscellanea:

Add explicit #include <linux/kernel.h> where it was not
previously used so that these #defines are a bit more
explicitly defined instead of indirectly included via:
	module.h->moduleparam.h->kernel.h

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-25 12:09:39 +01:00
Eric Dumazet 0144a81ccc tcp: fix ipv4 mapped request socks
ss should display ipv4 mapped request sockets like this :

tcp    SYN-RECV   0      0  ::ffff:192.168.0.1:8080   ::ffff:192.0.2.1:35261

and not like this :

tcp    SYN-RECV   0      0  192.168.0.1:8080   192.0.2.1:35261

We should init ireq->ireq_family based on listener sk_family,
not the actual protocol carried by SYN packet.

This means we can set ireq_family in inet_reqsk_alloc()

Fixes: 3f66b083a5 ("inet: introduce ireq_family")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 00:57:48 -04:00
Eric Dumazet fd3a154a00 tcp: md5: get rid of tcp_v[46]_reqsk_md5_lookup()
With request socks convergence, we no longer need
different lookup methods. A request socket can
use generic lookup function.

Add const qualifier to 2nd tcp_v[46]_md5_lookup() parameter.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:30 -04:00
Eric Dumazet 39f8e58e53 tcp: md5: remove request sock argument of calc_md5_hash()
Since request and established sockets now have same base,
there is no need to pass two pointers to tcp_v4_md5_hash_skb()
or tcp_v6_md5_hash_skb()

Also add a const qualifier to their struct tcp_md5sig_key argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:30 -04:00
Eric Dumazet ff74e23f7e tcp: md5: input path is run under rcu protected sections
It is guaranteed that both tcp_v4_rcv() and tcp_v6_rcv()
run from rcu read locked sections :

ip_local_deliver_finish() and ip6_input_finish() both
use rcu_read_lock()

Also align tcp_v6_inbound_md5_hash() on tcp_v4_inbound_md5_hash()
by returning a boolean.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:29 -04:00
Eric Dumazet 0980c1e308 tcp: use C99 initializers in new_state[]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:29 -04:00
Eric Dumazet 80f03e27a3 tcp: md5: fix rcu lockdep splat
While timer handler effectively runs a rcu read locked section,
there is no explicit rcu_read_lock()/rcu_read_unlock() annotations
and lockdep can be confused here :

net/ipv4/tcp_ipv4.c-906-        /* caller either holds rcu_read_lock() or socket lock */
net/ipv4/tcp_ipv4.c:907:        md5sig = rcu_dereference_check(tp->md5sig_info,
net/ipv4/tcp_ipv4.c-908-                                       sock_owned_by_user(sk) ||
net/ipv4/tcp_ipv4.c-909-                                       lockdep_is_held(&sk->sk_lock.slock));

Let's explicitely acquire rcu_read_lock() in tcp_make_synack()

Before commit fa76ce7328 ("inet: get rid of central tcp/dccp listener
timer"), we were holding listener lock so lockdep was happy.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric DUmazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 21:16:29 -04:00
Michal Kubeček d0c294c53a tcp: prevent fetching dst twice in early demux code
On s390x, gcc 4.8 compiles this part of tcp_v6_early_demux()

        struct dst_entry *dst = sk->sk_rx_dst;

        if (dst)
                dst = dst_check(dst, inet6_sk(sk)->rx_dst_cookie);

to code reading sk->sk_rx_dst twice, once for the test and once for
the argument of ip6_dst_check() (dst_check() is inline). This allows
ip6_dst_check() to be called with null first argument, causing a crash.

Protect sk->sk_rx_dst access by READ_ONCE() both in IPv4 and IPv6
TCP early demux code.

Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Fixes: c7109986db ("ipv6: Early TCP socket demux")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:38:24 -04:00
David S. Miller d5c1d8c567 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/netfilter/nf_tables_core.c

The nf_tables_core.c conflict was resolved using a conflict resolution
from Stephen Rothwell as a guide.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:22:43 -04:00
David S. Miller 40451fd013 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next.
Basically, more incremental updates for br_netfilter from Florian
Westphal, small nf_tables updates (including one fix for rb-tree
locking) and small two-liner to add extra validation for the REJECT6
target.

More specifically, they are:

1) Use the conntrack status flags from br_netfilter to know that DNAT is
   happening. Patch for Florian Westphal.

2) nf_bridge->physoutdev == NULL already indicates that the traffic is
   bridged, so let's get rid of the BRNF_BRIDGED flag. Also from Florian.

3) Another patch to prepare voidization of seq_printf/seq_puts/seq_putc,
   from Joe Perches.

4) Consolidation of nf_tables_newtable() error path.

5) Kill nf_bridge_pad used by br_netfilter from ip_fragment(),
   from Florian Westphal.

6) Access rb-tree root node inside the lock and remove unnecessary
   locking from the get path (we already hold nfnl_lock there), from
   Patrick McHardy.

7) You cannot use a NFT_SET_ELEM_INTERVAL_END when the set doesn't
   support interval, also from Patrick.

8) Enforce IP6T_F_PROTO from ip6t_REJECT to make sure the core is
   actually restricting matches to TCP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:02:46 -04:00
Fan Du c69736696c inet: fix double request socket freeing
Eric Hugne reported following error :

I'm hitting this warning on latest net-next when i try to SSH into a machine
with eth0 added to a bridge (but i think the problem is older than that)

Steps to reproduce:
node2 ~ # brctl addif br0 eth0
[  223.758785] device eth0 entered promiscuous mode
node2 ~ # ip link set br0 up
[  244.503614] br0: port 1(eth0) entered forwarding state
[  244.505108] br0: port 1(eth0) entered forwarding state
node2 ~ # [  251.160159] ------------[ cut here ]------------
[  251.160831] WARNING: CPU: 0 PID: 3 at include/net/request_sock.h:102 tcp_v4_err+0x6b1/0x720()
[  251.162077] Modules linked in:
[  251.162496] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.0.0-rc3+ #18
[  251.163334] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  251.164078]  ffffffff81a8365c ffff880038a6ba18 ffffffff8162ace4 0000000000009898
[  251.165084]  0000000000000000 ffff880038a6ba58 ffffffff8104da85 ffff88003fa437c0
[  251.166195]  ffff88003fa437c0 ffff88003fa74e00 ffff88003fa43bb8 ffff88003fad99a0
[  251.167203] Call Trace:
[  251.167533]  [<ffffffff8162ace4>] dump_stack+0x45/0x57
[  251.168206]  [<ffffffff8104da85>] warn_slowpath_common+0x85/0xc0
[  251.169239]  [<ffffffff8104db65>] warn_slowpath_null+0x15/0x20
[  251.170271]  [<ffffffff81559d51>] tcp_v4_err+0x6b1/0x720
[  251.171408]  [<ffffffff81630d03>] ? _raw_read_lock_irq+0x3/0x10
[  251.172589]  [<ffffffff81534e20>] ? inet_del_offload+0x40/0x40
[  251.173366]  [<ffffffff81569295>] icmp_socket_deliver+0x65/0xb0
[  251.174134]  [<ffffffff815693a2>] icmp_unreach+0xc2/0x280
[  251.174820]  [<ffffffff8156a82d>] icmp_rcv+0x2bd/0x3a0
[  251.175473]  [<ffffffff81534ea2>] ip_local_deliver_finish+0x82/0x1e0
[  251.176282]  [<ffffffff815354d8>] ip_local_deliver+0x88/0x90
[  251.177004]  [<ffffffff815350f0>] ip_rcv_finish+0xf0/0x310
[  251.177693]  [<ffffffff815357bc>] ip_rcv+0x2dc/0x390
[  251.178336]  [<ffffffff814f5da3>] __netif_receive_skb_core+0x713/0xa20
[  251.179170]  [<ffffffff814f7fca>] __netif_receive_skb+0x1a/0x80
[  251.179922]  [<ffffffff814f97d4>] process_backlog+0x94/0x120
[  251.180639]  [<ffffffff814f9612>] net_rx_action+0x1e2/0x310
[  251.181356]  [<ffffffff81051267>] __do_softirq+0xa7/0x290
[  251.182046]  [<ffffffff81051469>] run_ksoftirqd+0x19/0x30
[  251.182726]  [<ffffffff8106cc23>] smpboot_thread_fn+0x153/0x1d0
[  251.183485]  [<ffffffff8106cad0>] ? SyS_setgroups+0x130/0x130
[  251.184228]  [<ffffffff8106935e>] kthread+0xee/0x110
[  251.184871]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[  251.185690]  [<ffffffff81631108>] ret_from_fork+0x58/0x90
[  251.186385]  [<ffffffff81069270>] ? kthread_create_on_node+0x1b0/0x1b0
[  251.187216] ---[ end trace c947fc7b24e42ea1 ]---
[  259.542268] br0: port 1(eth0) entered forwarding state

Remove the double calls to reqsk_put()

[edumazet] :

I got confused because reqsk_timer_handler() _has_ to call
reqsk_put(req) after calling inet_csk_reqsk_queue_drop(), as
the timer handler holds a reference on req.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 21:40:48 -04:00
Alexander Duyck b6f15f828d fib_trie: Fix regression in handling of inflate/halve failure
When I updated the code to address a possible null pointer dereference in
resize I ended up reverting an exception handling fix for the suffix length
in the event that inflate or halve failed.  This change is meant to correct
that by reverting the earlier fix and instead simply getting the parent
again after inflate has been completed to avoid the possible null pointer
issue.

Fixes: ddb4b9a13 ("fib_trie: Address possible NULL pointer dereference in resize")
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:58:32 -04:00
Eric Dumazet 26e3736090 ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets
tcp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New tcp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet b282705336 net: convert syn_wait_lock to a spinlock
This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.

We hold syn_wait_lock for such small sections, that it makes no sense to use
a read/write lock. A spin lock is simply faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet 8b929ab12f inet: remove some sk_listener dependencies
listener can be source of false sharing. request sock has some
useful information like : ireq->ir_iif, ireq->ir_num, ireq->ireq_net

This patch does not solve the major problem of having to read
sk->sk_protocol which is sharing a cache line with sk->sk_wmem_alloc.
(This same field is read later in ip_build_and_send_pkt())

One idea would be to move sk_protocol close to sk_family
(using 8 bits instead of 16 for sk_family seems enough)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet 42cb80a235 inet: remove sk_listener parameter from syn_ack_timeout()
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00
Eric Dumazet 2b41fab70f inet: cache listen_sock_qlen() and read rskq_defer_accept once
Cache listen_sock_qlen() to limit false sharing, and read
rskq_defer_accept once as it might change under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00
David S. Miller c0e41fa76c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Fix missing initialization of tuple structure in nfnetlink_cthelper
   to avoid mismatches when looking up to attach userspace helpers to
   flows, from Ian Wilson.

2) Fix potential crash in nft_hash when we hit -EAGAIN in
   nft_hash_walk(), from Herbert Xu.

3) We don't need to indicate the hook information to update the
   basechain default policy in nf_tables.

4) Restore tracing over nfnetlink_log due to recent rework to
   accomodate logging infrastructure into nf_tables.

5) Fix wrong IP6T_INV_PROTO check in xt_TPROXY.

6) Set IP6T_F_PROTO flag in nft_compat so we can use SYNPROXY6 and
   REJECT6 from xt over nftables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-22 16:57:07 -04:00
Florian Westphal 8d0451638a netfilter: bridge: kill nf_bridge_pad
The br_netfilter frag output function calls skb_cow_head() so in
case it needs a larger headroom to e.g. re-add a previously stripped PPPOE
or VLAN header things will still work (at cost of reallocation).

We can then move nf_bridge_encap_header_len to br_netfilter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-22 19:45:55 +01:00
Eric Dumazet d3593b5cef Revert "selinux: add a skb_owned_by() hook"
This reverts commit ca10b9e9a8.

No longer needed after commit eb8895debe
("tcp: tcp_make_synack() should use sock_wmalloc")

When under SYNFLOOD, we build lot of SYNACK and hit false sharing
because of multiple modifications done on sk_listener->sk_wmem_alloc

Since tcp_make_synack() uses sock_wmalloc(), there is no need
to call skb_set_owner_w() again, as this adds two atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 21:36:53 -04:00
David S. Miller 0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Josh Hunt d22e153718 tcp: fix tcp fin memory accounting
tcp_send_fin() does not account for the memory it allocates properly, so
sk_forward_alloc can be negative in cases where we've sent a FIN:

ss example output (ss -amn | grep -B1 f4294):
tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
Acked-by: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 13:18:52 -04:00
Eric Dumazet fa76ce7328 inet: get rid of central tcp/dccp listener timer
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Eric Dumazet 52452c5425 inet: drop prev pointer handling in request sock
When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.

Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Pablo Neira Ayuso 4017a7ee69 netfilter: restore rule tracing via nfnetlink_log
Since fab4085 ("netfilter: log: nf_log_packet() as real unified
interface"), the loginfo structure that is passed to nf_log_packet() is
used to explicitly indicate the logger type you want to use.

This is a problem for people tracing rules through nfnetlink_log since
packets are always routed to the NF_LOG_TYPE logger after the
aforementioned patch.

We can fix this by removing the trace loginfo structures, but that still
changes the log level from 4 to 5 for tracing messages and there may be
someone relying on this outthere. So let's just introduce a new
nf_log_trace() function that restores the former behaviour.

Reported-by: Markus Kötter <koetter@rrzn.uni-hannover.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-19 11:14:48 +01:00
Eric Dumazet 738e6d30d3 inet: add a schedule point in inet_twsk_purge()
On a large hash table, we can easily spend seconds to
walk over all entries. Add a cond_resched() to yield
cpu if necessary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:38:13 -04:00
Marcelo Ricardo Leitner 54ff9ef36b ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
  reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
  __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
  __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:09 -04:00
Marcelo Ricardo Leitner baf606d9c9 ipv4,ipv6: grab rtnl before locking the socket
There are some setsockopt operations in ipv4 and ipv6 that are grabbing
rtnl after having grabbed the socket lock. Yet this makes it impossible
to do operations that have to lock the socket when already within a rtnl
protected scope, like ndo dev_open and dev_stop.

We normally take coarse grained locks first but setsockopt inverted that.

So this patch invert the lock logic for these operations and makes
setsockopt grab rtnl if it will be needed prior to grabbing socket lock.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:09 -04:00
Eric Dumazet 08d2cc3b26 inet: request sock should init IPv6/IPv4 addresses
In order to be able to use sk_ehashfn() for request socks,
we need to initialize their IPv6/IPv4 addresses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:35 -04:00
Eric Dumazet b4d6444ea3 inet: get rid of last __inet_hash_connect() argument
We now always call __inet_hash_nolisten(), no need to pass it
as an argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:35 -04:00
Eric Dumazet 77a6a471bc ipv6: get rid of __inet6_hash()
We can now use inet_hash() and __inet_hash() instead of private
functions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:35 -04:00
Eric Dumazet d1e559d0b1 inet: add IPv6 support to sk_ehashfn()
Intent is to converge IPv4 & IPv6 inet_hash functions to
factorize code.

IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr
in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set()
helpers.

__inet6_hash can now use sk_ehashfn() instead of a private
inet6_sk_ehashfn() and will simply use __inet_hash() in a
following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:34 -04:00
Eric Dumazet 5b441f76f1 net: introduce sk_ehashfn() helper
Goal is to unify IPv4/IPv6 inet_hash handling, and use common helpers
for all kind of sockets (full sockets, timewait and request sockets)

inet_sk_ehashfn() becomes sk_ehashfn() but still only copes with IPv4

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:34 -04:00
Eric Dumazet 6eada0110c netns: constify net_hash_mix() and various callers
const qualifiers ease code review by making clear
which objects are not written in a function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:34 -04:00
Joe Perches 1ca9e41770 netfilter: Remove uses of seq_<foo> return values
The seq_printf/seq_puts/seq_putc return values, because they
are frequently misused, will eventually be converted to void.

See: commit 1f33c41c03 ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Miscellanea:

o realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-18 10:51:35 +01:00
Eric Dumazet 0470c8ca1d inet: fix request sock refcounting
While testing last patch series, I found req sock refcounting was wrong.

We must set skc_refcnt to 1 for all request socks added in hashes,
but also on request sockets created by FastOpen or syncookies.

It is tricky because we need to defer this initialization so that
future RCU lookups do not try to take a refcount on a not yet
fully initialized request socket.

Also get rid of ireq_refcnt alias.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 13854e5a60 ("inet: add proper refcounting to request sock")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:02:29 -04:00
Eric Dumazet e3d95ad7da inet: avoid fastopen lock for regular accept()
It is not because a TCP listener is FastOpen ready that
all incoming sockets actually used FastOpen.

Avoid taking queue->fastopenq->lock if not needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:01:56 -04:00
Eric Dumazet 9439ce00f2 tcp: rename struct tcp_request_sock listener
The listener field in struct tcp_request_sock is a pointer
back to the listener. We now have req->rsk_listener, so TCP
only needs one boolean and not a full pointer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:01:56 -04:00
Eric Dumazet 4e9a578e5b inet: add rsk_listener field to struct request_sock
Once we'll be able to lookup request sockets in ehash table,
we'll need to get access to listener which created this request.

This avoid doing a lookup to find the listener, which benefits
for a more solid SO_REUSEPORT, and is needed once we no
longer queue request sock into a listener private queue.

Note that 'struct tcp_request_sock'->listener could be reduced
to a single bit, as TFO listener should match req->rsk_listener.
TFO will no longer need to hold a reference on the listener.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:01:56 -04:00
Eric Dumazet e49bb337d7 inet: uninline inet_reqsk_alloc()
inet_reqsk_alloc() is becoming fat and should not be inlined.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:01:56 -04:00
Eric Dumazet 407640de21 inet: add sk_listener argument to inet_reqsk_alloc()
listener socket can be used to set net pointer, and will
be later used to hold a reference on listener.

Add a const qualifier to first argument (struct request_sock_ops *),
and factorize all write_pnet(&ireq->ireq_net, sock_net(sk));

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:01:55 -04:00
Eric Dumazet 7970ddc8f9 tcp: uninline tcp_oow_rate_limited()
tcp_oow_rate_limited() is hardly used in fast path, there is
no point inlining it.

Signed-of-by: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 15:18:00 -04:00
Eric Dumazet 1bfc4438a7 tcp: move tcp_openreq_init() to tcp_input.c
This big helper is called once from tcp_conn_request(), there is no
point having it in an include. Compiler will inline it anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 15:18:00 -04:00
Eric Dumazet cb7cf8a33f inet: Clean up inet_csk_wait_for_connect() vs. might_sleep()
I got the following trace with current net-next kernel :

[14723.885290] WARNING: CPU: 26 PID: 22658 at kernel/sched/core.c:7285 __might_sleep+0x89/0xa0()
[14723.885325] do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810e8734>] prepare_to_wait_exclusive+0x34/0xa0
[14723.885355] CPU: 26 PID: 22658 Comm: netserver Not tainted 4.0.0-dbg-DEV #1379
[14723.885359]  ffffffff81a223a8 ffff881fae9e7ca8 ffffffff81650b5d 0000000000000001
[14723.885364]  ffff881fae9e7cf8 ffff881fae9e7ce8 ffffffff810a72e7 0000000000000000
[14723.885367]  ffffffff81a57620 000000000000093a 0000000000000000 ffff881fae9e7e64
[14723.885371] Call Trace:
[14723.885377]  [<ffffffff81650b5d>] dump_stack+0x4c/0x65
[14723.885382]  [<ffffffff810a72e7>] warn_slowpath_common+0x97/0xe0
[14723.885386]  [<ffffffff810a73e6>] warn_slowpath_fmt+0x46/0x50
[14723.885390]  [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[14723.885393]  [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0
[14723.885396]  [<ffffffff810e8734>] ? prepare_to_wait_exclusive+0x34/0xa0
[14723.885399]  [<ffffffff810ccdc9>] __might_sleep+0x89/0xa0
[14723.885403]  [<ffffffff81581846>] lock_sock_nested+0x36/0xb0
[14723.885406]  [<ffffffff815829a3>] ? release_sock+0x173/0x1c0
[14723.885411]  [<ffffffff815ea1f7>] inet_csk_accept+0x157/0x2a0
[14723.885415]  [<ffffffff810e8900>] ? abort_exclusive_wait+0xc0/0xc0
[14723.885419]  [<ffffffff8161b96d>] inet_accept+0x2d/0x150
[14723.885424]  [<ffffffff8157db6f>] SYSC_accept4+0xff/0x210
[14723.885428]  [<ffffffff8165a451>] ? retint_swapgs+0xe/0x44
[14723.885431]  [<ffffffff810f4c5d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[14723.885437]  [<ffffffff81369c0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[14723.885441]  [<ffffffff8157ef40>] SyS_accept+0x10/0x20
[14723.885444]  [<ffffffff81659872>] system_call_fastpath+0x12/0x17
[14723.885447] ---[ end trace ff74cd83355b1873 ]---

In commit 26cabd3125
Peter added a sched_annotate_sleep() in sk_wait_event()

Is the following patch needed as well ?

Alternative would be to use sk_wait_event() from inet_csk_wait_for_connect()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 15:03:54 -04:00
Eric Dumazet 9f1ab18672 tcp_metrics: fix wrong lockdep annotations
Changes in tcp_metric hash table are protected by tcp_metrics_lock
only, not by genl_mutex

While we are at it use deref_locked() instead of rcu_dereference()
in tcp_new() to avoid unnecessary barrier, as we hold tcp_metrics_lock
as well.

Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 098a697b49 ("tcp_metrics: Use a single hash table for all network namespaces.")
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 16:32:23 -04:00
David S. Miller ca00942a81 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2015-03-16

1) Fix the network header offset in _decode_session6
   when multiple IPv6 extension headers are present.
   From Hajime Tazaki.

2) Fix an interfamily tunnel crash. We set outer mode
   protocol too early and may dispatch to the wrong
   address family. Move the setting of the outer mode
   protocol behind the last accessing of the inner mode
   to fix the crash.

3) Most callers of xfrm_lookup() expect that dst_orig
   is released on error. But xfrm_lookup_route() may
   need dst_orig to handle certain error cases. So
   introduce a flag that tells what should be done in
   case of error. From Huaibin Wang.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 16:16:49 -04:00
Eric Dumazet 13854e5a60 inet: add proper refcounting to request sock
reqsk_put() is the generic function that should be used
to release a refcount (and automatically call reqsk_free())

reqsk_free() might be called if refcount is known to be 0
or undefined.

refcnt is set to one in inet_csk_reqsk_queue_add()

As request socks are not yet in global ehash table,
I added temporary debugging checks in reqsk_put() and reqsk_free()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 15:55:29 -04:00
Eric Dumazet 2c13270b44 inet: factorize sock_edemux()/sock_gen_put() code
sock_edemux() is not used in fast path, and should
really call sock_gen_put() to save some code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 15:55:29 -04:00
Eric Dumazet a58917f584 inet_diag: allow sk_diag_fill() to handle request socks
inet_diag_fill_req() is renamed to inet_req_diag_fill()
and moved up, so that it can be called fom sk_diag_fill()

inet_diag_bc_sk() is ready to handle request socks.

inet_twsk_diag_dump() is no longer needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 15:55:29 -04:00
Eric Dumazet f7e4eb03f9 inet: ip early demux should avoid request sockets
When a request socket is created, we do not cache ip route
dst entry, like for timewait sockets.

Let's use sk_fullsock() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-16 15:55:29 -04:00
Eric Dumazet a4458343ac inet_diag: factorize code in new inet_diag_msg_common_fill() helper
Now the three type of sockets share a common base, we can factorize
code in inet_diag_msg_common_fill().

inet_diag_entry no longer requires saddr_storage & daddr_storage
and the extra copies.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 15:05:10 -04:00
Eric Dumazet a07c92078d inet_diag: adjust inet_sk_diag_fill() bug condition
inet_sk_diag_fill() only copes with non timewait and non request socks

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 15:05:10 -04:00
Eric Dumazet 16f86165bd inet: fill request sock ir_iif for IPv4
Once request socks will be in ehash table, they will need to have
a valid ir_iff field.

This is currently true only for IPv6. This patch extends support
for IPv4 as well.

This means inet_diag_fill_req() can now properly use ir_iif,
which is better for IPv6 link locals anyway, as request sockets
and established sockets will propagate consistent netlink idiag_if.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 15:05:10 -04:00
Eric Dumazet c8e2c80d7e inet_diag: fix possible overflow in inet_diag_dump_one_icsk()
inet_diag_dump_one_icsk() allocates too small skb.

Add inet_sk_attr_size() helper right before inet_sk_diag_fill()
so that it can be updated if/when new attributes are added.

iproute2/ss currently does not use this dump_one() interface,
this might explain nobody noticed this problem yet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 15:54:27 -04:00
Eric W. Biederman 098a697b49 tcp_metrics: Use a single hash table for all network namespaces.
Now that all of the operations are safe on a single hash table
accross network namespaces, allocate a single global hash table
and update the code to use it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 01:57:07 -04:00
Eric W. Biederman 04f721c671 tcp_metrics: Rewrite tcp_metrics_flush_all
Rewrite tcp_metrics_flush_all so that it can cope with entries from
different network namespaces on it's hash chain.

This is based on the logic in tcp_metrics_nl_cmd_del for deleting
a selection of entries from a tcp metrics hash chain.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 01:57:07 -04:00
Eric W. Biederman 8a4bff714f tcp_metrics: Remove the unused return code from tcp_metrics_flush_all
tcp_metrics_flush_all always returns 0.  Remove the unnecessary return code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 01:57:07 -04:00
Eric W. Biederman 849e8a0ca8 tcp_metrics: Add a field tcpm_net and verify it matches on lookup
In preparation for using one tcp metrics hash table for all network
namespaces add a field tcpm_net to struct tcp_metrics_block, and
verify that field on all hash table lookups.

Make the field tcpm_net of type possible_net_t so it takes no space
when network namespaces are disabled.

Further add a function tm_net to read that field so we can be
efficient when network namespaces are disabled and concise
the rest of the time.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 01:57:07 -04:00
Eric W. Biederman 3e5da62d0b tcp_metrics: Mix the network namespace into the hash function.
In preparation for using one hash table for all network namespaces
mix the network namespace into the hash value.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 01:57:07 -04:00
Eric W. Biederman 6493517eae tcp_metrics: panic when tcp_metrics_init fails.
There is not a practical way to cleanup during boot so
just panic if there is a problem initializing tcp_metrics.

That will at least give us a clear place to start debugging
if something does go wrong.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 01:57:07 -04:00
Eric Dumazet 3f66b083a5 inet: introduce ireq_family
Before inserting request socks into general hash table,
fill their socket family.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 22:58:13 -04:00
Eric Dumazet d4f06873b6 inet: get_openreq4() & get_openreq6() do not need listener
ireq->ir_num contains local port, use it.

Also, get_openreq4() dumping listen_sk->refcnt makes litle sense.

inet_diag_fill_req() can also use ireq->ir_num

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 22:58:13 -04:00
Eric Dumazet 41b822c59e inet: prepare sock_edemux() & sock_gen_put() for new SYN_RECV state
sock_edemux() & sock_gen_put() should be ready to cope with request socks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 22:58:13 -04:00
Eric Dumazet bd337c581b ipv6: add missing ireq_net & ir_cookie initializations
I forgot to update dccp_v6_conn_request() & cookie_v6_check().
They both need to set ireq->ireq_net and ireq->ir_cookie

Lets clear ireq->ir_cookie in inet_reqsk_alloc()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90fe ("net: add real socket cookies")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 22:58:12 -04:00
Alexander Duyck 0b65bd97ba fib_trie: Provide a deterministic order for fib_alias w/ tables merged
This change makes it so that we should always have a deterministic ordering
for the main and local aliases within the merged table when two leaves
overlap.

So for example if we have a leaf with a key of 192.168.254.0.  If we
previously added two aliases with a prefix length of 24 from both local and
main the first entry would be first and the second would be second.  When I
was coding this I had added a WARN_ON should such a situation occur as I
wasn't sure how likely it would be.  However this WARN_ON has been
triggered so this is something that should be addressed.

With this patch the ordering of the aliases is as follows.  First they are
sorted on prefix length, then on their table ID, then tos, and finally
priority.  This way what we end up doing is essentially interleaving the
two tables on what used to be leaf_info structure boundaries.

Fixes: 0ddcf43d5 ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 18:26:51 -04:00
Alexander Duyck 3c9e9f7320 fib_trie: Avoid NULL pointer if local table is not allocated
The function fib_unmerge assumed the local table had already been
allocated.  If that is not the case however when custom rules are applied
then this can result in a NULL pointer dereference.

In order to prevent this we must check the value of the local table pointer
and if it is NULL simply return 0 as there is no local table to separate
from the main.

Fixes: 0ddcf43d5 ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 18:26:51 -04:00
Eric W. Biederman 0c5c9fb551 net: Introduce possible_net_t
Having to say
> #ifdef CONFIG_NET_NS
> 	struct net *net;
> #endif

in structures is a little bit wordy and a little bit error prone.

Instead it is possible to say:
> typedef struct {
> #ifdef CONFIG_NET_NS
>       struct net *net;
> #endif
> } possible_net_t;

And then in a header say:

> 	possible_net_t net;

Which is cleaner and easier to use and easier to test, as the
possible_net_t is always there no matter what the compile options.

Further this allows read_pnet and write_pnet to be functions in all
cases which is better at catching typos.

This change adds possible_net_t, updates the definitions of read_pnet
and write_pnet, updates optional struct net * variables that
write_pnet uses on to have the type possible_net_t, and finally fixes
up the b0rked users of read_pnet and write_pnet.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 14:39:40 -04:00
Eric W. Biederman efd7ef1c19 net: Kill hold_net release_net
hold_net and release_net were an idea that turned out to be useless.
The code has been disabled since 2008.  Kill the code it is long past due.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 14:39:40 -04:00
Eric Dumazet c29390c6df xps: must clear sender_cpu before forwarding
John reported that my previous commit added a regression
on his router.

This is because sender_cpu & napi_id share a common location,
so get_xps_queue() can see garbage and perform an out of bound access.

We need to make sure sender_cpu is cleared before doing the transmit,
otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
this bug.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John <jw@nuclearfallout.net>
Bisected-by: John <jw@nuclearfallout.net>
Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 23:51:18 -04:00
Eric Dumazet d77c555d32 net: fix CONFIG_NET_NS=n compilation
I forgot to use write_pnet() in three locations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90fe ("net: add real socket cookies")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 23:28:49 -04:00
Eric Dumazet 33cf7c90fe net: add real socket cookies
A long standing problem in netlink socket dumps is the use
of kernel socket addresses as cookies.

1) It is a security concern.

2) Sockets can be reused quite quickly, so there is
   no guarantee a cookie is used once and identify
   a flow.

3) request sock, establish sock, and timewait socks
   for a given flow have different cookies.

Part of our effort to bring better TCP statistics requires
to switch to a different allocator.

In this patch, I chose to use a per network namespace 64bit generator,
and to use it only in the case a socket needs to be dumped to netlink.
(This might be refined later if needed)

Note that I tried to carry cookies from request sock, to establish sock,
then timewait sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Salo <salo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 21:55:28 -04:00
Alexander Duyck 654eff4516 fib_trie: Only display main table in /proc/net/route
When we merged the tries for local and main I had overlooked the iterator
for /proc/net/route.  As a result it was outputting both local and main
when the two tries were merged.

This patch resolves that by only providing output for aliases that are
actually in the main trie.  As a result we should go back to the original
behavior which I assume will be necessary to maintain legacy support.

Fixes: 0ddcf43d5 ("ipv4: FIB Local/MAIN table collapse")
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 21:24:32 -04:00
Alexander Duyck 61f0d861fc fib_trie: Fix uninitialized variable warning
The 0-day kernel test infrastructure reported a use of uninitialized
variable warning for local_table due to the fact that the local and main
allocations had been swapped from the original setup.  This change corrects
that by making it so that we free the main table if the local table
allocation fails.

Fixes: 0ddcf43d5 ("ipv4: FIB Local/MAIN table collapse")

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 17:33:44 -04:00
Neal Cardwell d578e18ce9 tcp: restore 1.5x per RTT limit to CUBIC cwnd growth in congestion avoidance
Commit 814d488c61 ("tcp: fix the timid additive increase on stretch
ACKs") fixed a bug where tcp_cong_avoid_ai() would either credit a
connection with an increase of snd_cwnd_cnt, or increase snd_cwnd, but
not both, resulting in cwnd increasing by 1 packet on at most every
alternate invocation of tcp_cong_avoid_ai().

Although the commit correctly implemented the CUBIC algorithm, which
can increase cwnd by as much as 1 packet per 1 packet ACKed (2x per
RTT), in practice that could be too aggressive: in tests on network
paths with small buffers, YouTube server retransmission rates nearly
doubled.

This commit restores CUBIC to a maximum cwnd growth rate of 1 packet
per 2 packets ACKed (1.5x per RTT). In YouTube tests this restored
retransmit rates to low levels.

Testing: This patch has been tested in datacenter netperf transfers
and live youtube.com and google.com servers.

Fixes: 9cd981dcf1 ("tcp: fix stretch ACK bugs in CUBIC")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 16:51:51 -04:00
Neal Cardwell 9949afa42b tcp: fix tcp_cong_avoid_ai() credit accumulation bug with decreases in w
The recent change to tcp_cong_avoid_ai() to handle stretch ACKs
introduced a bug where snd_cwnd_cnt could accumulate a very large
value while w was large, and then if w was reduced snd_cwnd could be
incremented by a large delta, leading to a large burst and high packet
loss. This was tickled when CUBIC's bictcp_update() sets "ca->cnt =
100 * cwnd".

This bug crept in while preparing the upstream version of
814d488c61.

Testing: This patch has been tested in datacenter netperf transfers
and live youtube.com and google.com servers.

Fixes: 814d488c61 ("tcp: fix the timid additive increase on stretch ACKs")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 16:51:51 -04:00
Sabrina Dubroca 6dede75b7e fib_trie: call fib_table_flush_external under RTNL
Move rtnl_lock() before the call to fib4_rules_exit so that
fib_table_flush_external is called under RTNL.

Fixes: 104616e74e ("switchdev: don't support custom ip rules, for now")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 16:46:26 -04:00
Alexander Duyck 0ddcf43d5d ipv4: FIB Local/MAIN table collapse
This patch is meant to collapse local and main into one by converting
tb_data from an array to a pointer.  Doing this allows us to point the
local table into the main while maintaining the same variables in the
table.

As such the tb_data was converted from an array to a pointer, and a new
array called data is added in order to still provide an object for tb_data
to point to.

In order to track the origin of the fib aliases a tb_id value was added in
a hole that existed on 64b systems.  Using this we can also reverse the
merge in the event that custom FIB rules are enabled.

With this patch I am seeing an improvement of 20ns to 30ns for routing
lookups as long as custom rules are not enabled, with custom rules enabled
we fall back to split tables and the original behavior.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-11 16:22:14 -04:00
Alexander Duyck ddb4b9a132 fib_trie: Address possible NULL pointer dereference in resize
If the inflate call failed it would return NULL.  As a result tp would be
set to NULL and cause use to trigger a NULL pointer dereference in
should_halve if the inflate failed on the first attempt.

In order to prevent this we should decrement max_work before we actually
attempt to inflate as this will force us to exit before attempting to halve
a node we should have inflated.  In order to keep things symmetric between
inflate and halve I went ahead and also moved the decrement of max_work for
the halve case as well so we take care of that before we actually attempt
to halve the tnode.

Fixes: 88bae714 ("fib_trie: Add key vector to root, return parent key_vector in resize")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 18:36:56 -04:00
Alexander Duyck 3ec320dd5c fib_trie: Correctly handle case of key == 0 in leaf_walk_rcu
In the case of a trie that had no tnodes with a key of 0 the initial
look-up would fail resulting in an out-of-bounds cindex on the first tnode.
This resulted in an entire trie being skipped.

In order resolve this I have updated the cindex logic in the initial
look-up so that if the key is zero we will always traverse the child zero
path.

Fixes: 8be33e95 ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 16:13:55 -04:00
Eric Dumazet 34160ea3f9 inet_diag: add const to inet_diag_req_v2
diag dumpers should not modify the request.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 13:45:28 -04:00
Eric Dumazet e31c5e0e48 inet_diag: cleanups
Remove all inline keywords, add some const, and cleanup style.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 13:45:28 -04:00
David S. Miller 515fb5c317 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter fixes for net-next

The following batch contains a couple of fixes to address some fallout
from the previous pull request, they are:

1) Address link problems in the bridge code after e5de75b. Fix it by
   using rcu hook to address to avoid ifdef pollution and hard
   dependency between bridge and br_netfilter.

2) Address sparse warnings in the netfilter reject code, patch from
   Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 12:48:47 -04:00
Florian Westphal a03a8dbe20 netfilter: fix sparse warnings in reject handling
make C=1 CF=-D__CHECK_ENDIAN__ shows following:

net/bridge/netfilter/nft_reject_bridge.c:65:50: warning: incorrect type in argument 3 (different base types)
net/bridge/netfilter/nft_reject_bridge.c:65:50:    expected restricted __be16 [usertype] protocol [..]
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: cast from restricted __be16
net/bridge/netfilter/nft_reject_bridge.c:102:37: warning: incorrect type in argument 1 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:121:50: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:168:52: warning: incorrect type in argument 3 (different base types) [..]
net/bridge/netfilter/nft_reject_bridge.c:233:52: warning: incorrect type in argument 3 (different base types) [..]

Caused by two (harmless) errors:
1. htons() instead of ntohs()
2. __be16 for protocol in nf_reject_ipXhdr_put API, use u8 instead.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-10 15:01:32 +01:00
Scott Feldman f8f2147150 switchdev: add netlink flags to IPv4 FIB add op
Pass in the netlink flags (NLM_F_*) into switchdev driver for IPv4 FIB add op
to allow driver to 1) optimize hardware updates, 2) handle ip route prepend
and append commands correctly.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09 23:56:52 -04:00
David S. Miller 3cef5c5b0b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/cadence/macb.c

Overlapping changes in macb driver, mostly fixes and cleanups
in 'net' overlapping with the integration of at91_ether into
macb in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09 23:38:02 -04:00
Eric W. Biederman ddb3b6033c net: Remove protocol from struct dst_ops
After my change to neigh_hh_init to obtain the protocol from the
neigh_table there are no more users of protocol in struct dst_ops.
Remove the protocol field from dst_ops and all of it's initializers.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09 16:06:10 -04:00
David S. Miller 5428aef811 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. Basically, improvements for the packet rejection infrastructure,
deprecation of CLUSTERIP, cleanups for nf_tables and some untangling for
br_netfilter. More specifically they are:

1) Send packet to reset flow if checksum is valid, from Florian Westphal.

2) Fix nf_tables reject bridge from the input chain, also from Florian.

3) Deprecate the CLUSTERIP target, the cluster match supersedes it in
   functionality and it's known to have problems.

4) A couple of cleanups for nf_tables rule tracing infrastructure, from
   Patrick McHardy.

5) Another cleanup to place transaction declarations at the bottom of
   nf_tables.h, also from Patrick.

6) Consolidate Kconfig dependencies wrt. NF_TABLES.

7) Limit table names to 32 bytes in nf_tables.

8) mac header copying in bridge netfilter is already required when
   calling ip_fragment(), from Florian Westphal.

9) move nf_bridge_update_protocol() to br_netfilter.c, also from
   Florian.

10) Small refactor in br_netfilter in the transmission path, again from
    Florian.

11) Move br_nf_pre_routing_finish_bridge_slow() to br_netfilter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-09 15:58:21 -04:00
Willem de Bruijn c247f0534c ip: fix error queue empty skb handling
When reading from the error queue, msg_name and msg_control are only
populated for some errors. A new exception for empty timestamp skbs
added a false positive on icmp errors without payload.

`traceroute -M udpconn` only displayed gateways that return payload
with the icmp error: the embedded network headers are pulled before
sock_queue_err_skb, leaving an skb with skb->len == 0 otherwise.

Fix this regression by refining when msg_name and msg_control
branches are taken. The solutions for the two fields are independent.

msg_name only makes sense for errors that configure serr->port and
serr->addr_offset. Test the first instead of skb->len. This also fixes
another issue. saddr could hold the wrong data, as serr->addr_offset
is not initialized  in some code paths, pointing to the start of the
network header. It is only valid when serr->port is set (non-zero).

msg_control support differs between IPv4 and IPv6. IPv4 only honors
requests for ICMP and timestamps with SOF_TIMESTAMPING_OPT_CMSG. The
skb->len test can simply be removed, because skb->dev is also tested
and never true for empty skbs. IPv6 honors requests for all errors
aside from local errors and timestamps on empty skbs.

In both cases, make the policy more explicit by moving this logic to
a new function that decides whether to process msg_control and that
optionally prepares the necessary fields in skb->cb[]. After this
change, the IPv4 and IPv6 paths are more similar.

The last case is rxrpc. Here, simply refine to only match timestamps.

Fixes: 49ca0d8bfa ("net-timestamp: no-payload option")

Reported-by: Jan Niehusmann <jan@gondor.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>

----

Changes
  v1->v2
  - fix local origin test inversion in ip6_datagram_support_cmsg
  - make v4 and v6 code paths more similar by introducing analogous
    ipv4_datagram_support_cmsg
  - fix compile bug in rxrpc
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-08 23:01:54 -04:00
Alexander Duyck 88bae7149a fib_trie: Add key vector to root, return parent key_vector in resize
This change makes it so that the root of the trie contains a key_vector, by
doing this we make room to essentially collapse the entire trie by at least
one cache line as we can store the information about the tnode or leaf that
is pointed to in the root.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:28 -05:00
Alexander Duyck f23e59fbd7 fib_trie: Move parent from key_vector to tnode
This change pulls the parent pointer from the key_vector and places it in
the tnode structure.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:28 -05:00
Alexander Duyck 6e22d174ba fib_trie: Pull empty_children and full_children into tnode
This pulls the information about the child array out of the key_vector and
places it in the tnode since that is where it is needed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:28 -05:00
Alexander Duyck 56ca2adf6a fib_trie: Move rcu from key_vector to tnode, add accessors.
RCU is only needed once for the entire node, not once per key_vector so we
can pull that out and move it to the tnode structure.

In addition add accessors to be used inside the RCU functions so that we
can more easily get from the key vector to either the tnode or the trie
pointers.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:28 -05:00
Alexander Duyck dc35dbeda3 fib_trie: Add tnode struct as a container for fields not needed in key_vector
This change pulls the fields not explicitly needed in the key_vector and
placed them in the new tnode structure.  By doing this we will eventually
be able to reduce the key_vector down to 16 bytes on 64 bit systems, and
12 bytes on 32 bit systems.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:28 -05:00
Alexander Duyck 2e1ac88a48 fib_trie: Rename tnode_child_length to child_length
We are now checking the length of a key_vector instead of a tnode so it
makes sense to probably just rename this to child_length since it would
probably even be applicable to a leaf.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:28 -05:00
Alexander Duyck 754baf8dec fib_trie: replace tnode_get_child functions with get_child macros
I am replacing the tnode_get_child call with get_child since we are
techically pulling the child out of a key_vector now and not a tnode.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:27 -05:00
Alexander Duyck 35c6edac19 fib_trie: Rename tnode to key_vector
Rename the tnode to key_vector.  The key_vector will be the eventual
container for all of the information needed by either a leaf or a tnode.
The final result should be much smaller than the 40 bytes currently needed
for either one.

This also updates the trie struct so that it contains an array of size 1 of
tnode pointers.  This is to bring the structure more inline with how an
actual tnode itself is configured.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:27 -05:00
Alexander Duyck 8d8e810ca8 fib_trie: Return pointer to tnode pointer in resize/inflate/halve
Resize related functions now all return a pointer to the pointer that
references the object that was resized.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:27 -05:00
Alexander Duyck 72be72607a fib_trie: Minor cleanups to fib_table_flush_external
This change just does a couple of minor cleanups on
fib_table_flush_external.  Specifically it addresses the fact that resize
was being called even though nothing was being removed from the table, and
it drops an unecessary indent since we could just call continue on the
inverse of the fi && flag check.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 15:49:27 -05:00
Fan Du 05cbc0db03 ipv4: Create probe timer for tcp PMTU as per RFC4821
As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:57:42 -05:00
Fan Du 6b58e0a5f3 ipv4: Use binary search to choose tcp PMTU probe_size
Current probe_size is chosen by doubling mss_cache,
the probing process will end shortly with a sub-optimal
mss size, and the link mtu will not be taken full
advantage of, in return, this will make user to tweak
tcp_base_mss with care.

Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.

In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.

Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24  -t 60

before this patch:
1.26 Gbits/sec

After this patch: increase 26%
1.59 Gbits/sec

Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 14:57:41 -05:00
David S. Miller 23375a0fd5 ipv4: Fix unused variable warnings in fib_table_flush_external.
net/ipv4/fib_trie.c: In function ‘fib_table_flush_external’:
net/ipv4/fib_trie.c:1572:6: warning: unused variable ‘found’ [-Wunused-variable]
  int found = 0;
      ^
net/ipv4/fib_trie.c:1571:16: warning: unused variable ‘slen’ [-Wunused-variable]
  unsigned char slen;
                ^

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:38:35 -05:00
Scott Feldman 8e05fd7166 fib: hook IPv4 fib for hardware offload
Call into the switchdev driver any time an IPv4 fib entry is
added/modified/deleted from the kernel's FIB.  The switchdev driver may or
may not install the route to the offload device.  In the case where the
driver tries to install the route and something goes wrong (device's routing
table is full, etc), then all of the offloaded routes will be flushed from the
device, route forwarding falls back to the kernel, and no more routes are
offloading.

We can refine this logic later.  For now, use the simplist model of offloading
routes up to the point of failure, and then on failure, undo everything and
mark IPv4 offloading disabled.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Scott Feldman 104616e74e switchdev: don't support custom ip rules, for now
Keep switchdev FIB offload model simple for now and don't allow custom ip
rules.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-06 00:24:58 -05:00
Eric Dumazet 496127290f inet_diag: remove duplicate code from inet_twsk_diag_dump()
timewait sockets now share a common base with established sockets.

inet_twsk_diag_dump() can use inet_diag_bc_sk() instead of duplicating
code, granted that inet_diag_bc_sk() does proper userlocks
initialization.

twsk_build_assert() will catch any future changes that could break
the assumptions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05 22:55:44 -05:00
Eric Dumazet 6c09fa09d4 tcp: align tcp_xmit_size_goal() on tcp_tso_autosize()
With some mss values, it is possible tcp_xmit_size_goal() puts
one segment more in TSO packet than tcp_tso_autosize().

We send then one TSO packet followed by one single MSS.

It is not a serious bug, but we can do slightly better, especially
for drivers using netif_set_gso_max_size() to lower gso_max_size.

Using same formula avoids these corner cases and makes
tcp_xmit_size_goal() a bit faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 605ad7f184 ("tcp: refine TSO autosizing")
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05 22:31:12 -05:00
Alexander Drozdov 3e32e733d1 ipv4: ip_check_defrag should not assume that skb_network_offset is zero
ip_check_defrag() may be used by af_packet to defragment outgoing packets.
skb_network_offset() of af_packet's outgoing packets is not zero.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-05 21:43:48 -05:00
Pablo Neira Ayuso f04e599e20 netfilter: nf_tables: consolidate Kconfig options
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-06 01:21:15 +01:00
Pablo Neira Ayuso 43270b1bc5 netfilter: ipt_CLUSTERIP: deprecate it in favour of xt_cluster
xt_cluster supersedes ipt_CLUSTERIP since it can be also used in
gateway configurations (not only from the backend side).

ipt_CLUSTER is also known to leak the netdev that it uses on
device removal, which requires a rather large fix to workaround
the problem: http://patchwork.ozlabs.org/patch/358629/

So let's deprecate this so we can probably kill code this in the
future.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-06 01:21:05 +01:00
Alexander Duyck 1de3d87bcd fib_trie: Prevent allocating tnode if bits is too big for size_t
This patch adds code to prevent us from attempting to allocate a tnode with
a size larger than what can be represented by size_t.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:18 -05:00
Alexander Duyck 71e8b67d0f fib_trie: Update last spot w/ idx >> n->bits code and explanation
This change updates the fib_table_lookup function so that it is in sync
with the fib_find_node function in terms of the explanation for the index
check based on the bits value.

I have also updated it from doing a mask to just doing a compare as I have
found that seems to provide more options to the compiler as I have seen it
turn this into a shift of the value and test under some circumstances.

In addition I addressed one minor issue in which we kept computing the key
^ n->key when checking the fib aliases.  I pulled the xor out of the loop
in order to reduce the number of memory reads in the lookup.  As a result
we should save a couple cycles since the xor is only done once much earlier
in the lookup.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:18 -05:00
Alexander Duyck a7e5353123 fib_trie: Make fib_table rcu safe
The fib_table was wrapped in several places with an
rcu_read_lock/rcu_read_unlock however after looking over the code I found
several spots where the tables were being accessed as just standard
pointers without any protections.  This change fixes that so that all of
the proper protections are in place when accessing the table to take RCU
replacement or removal of the table into account.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:18 -05:00
Alexander Duyck 41b489fd6c fib_trie: move leaf and tnode to occupy the same spot in the key vector
If we are going to compact the leaf and tnode we first need to make sure
the fields are all in the same place.  In that regard I am moving the leaf
pointer which represents the fib_alias hash list to occupy what is
currently the first key_vector pointer.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:18 -05:00
Alexander Duyck d5d6487cb8 fib_trie: Update insert and delete to make use of tp from find_node
This change makes it so that the insert and delete functions make use of
the tnode pointer returned in the fib_find_node call.  By doing this we
will not have to rely on the parent pointer in the leaf which will be going
away soon.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:18 -05:00
Alexander Duyck d4a975e83f fib_trie: Fib find node should return parent
This change makes it so that the parent pointer is returned by reference in
fib_find_node.  By doing this I can use it to find the parent node when I
am performing an insertion and I don't have to look for it again in
fib_insert_node.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:17 -05:00
Alexander Duyck 8be33e955c fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf
This change makes it so that leaf_walk_rcu takes a tnode and a key instead
of the trie and a leaf.

The main idea behind this is to avoid using the leaf parent pointer as that
can have additional overhead in the future as I am trying to reduce the
size of a leaf down to 16 bytes on 64b systems and 12b on 32b systems.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:17 -05:00
Alexander Duyck 7289e6ddb6 fib_trie: Only resize tnodes once instead of on each leaf removal in fib_table_flush
This change makes it so that we only call resize on the tnodes, instead of
from each of the leaves.  By doing this we can significantly reduce the
amount of time spent resizing as we can update all of the leaves in the
tnode first before we make any determinations about resizing.  As a result
we can simply free the tnode in the case that all of the leaves from a
given tnode are flushed instead of resizing with each leaf removed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 23:35:17 -05:00
Lorenzo Colitti 9145736d48 net: ping: Return EAFNOSUPPORT when appropriate.
1. For an IPv4 ping socket, ping_check_bind_addr does not check
   the family of the socket address that's passed in. Instead,
   make it behave like inet_bind, which enforces either that the
   address family is AF_INET, or that the family is AF_UNSPEC and
   the address is 0.0.0.0.
2. For an IPv6 ping socket, ping_check_bind_addr returns EINVAL
   if the socket family is not AF_INET6. Return EAFNOSUPPORT
   instead, for consistency with inet6_bind.
3. Make ping_v4_sendmsg and ping_v6_sendmsg return EAFNOSUPPORT
   instead of EINVAL if an incorrect socket address structure is
   passed in.
4. Make IPv6 ping sockets be IPv6-only. The code does not support
   IPv4, and it cannot easily be made to support IPv4 because
   the protocol numbers for ICMP and ICMPv6 are different. This
   makes connect(::ffff:192.0.2.1) fail with EAFNOSUPPORT instead
   of making the socket unusable.

Among other things, this fixes an oops that can be triggered by:

    int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP);
    struct sockaddr_in6 sin6 = {
        .sin6_family = AF_INET6,
        .sin6_addr = in6addr_any,
    };
    bind(s, (struct sockaddr *) &sin6, sizeof(sin6));

Change-Id: If06ca86d9f1e4593c0d6df174caca3487c57a241
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 15:46:51 -05:00
Eric W. Biederman 60395a20ff neigh: Factor out ___neigh_lookup_noref
While looking at the mpls code I found myself writing yet another
version of neigh_lookup_noref.  We currently have __ipv4_lookup_noref
and __ipv6_lookup_noref.

So to make my work a little easier and to make it a smidge easier to
verify/maintain the mpls code in the future I stopped and wrote
___neigh_lookup_noref.  Then I rewote __ipv4_lookup_noref and
__ipv6_lookup_noref in terms of this new function.  I tested my new
version by verifying that the same code is generated in
ip_finish_output2 and ip6_finish_output2 where these functions are
inlined.

To get to ___neigh_lookup_noref I added a new neighbour cache table
function key_eq.  So that the static size of the key would be
available.

I also added __neigh_lookup_noref for people who want to to lookup
a neighbour table entry quickly but don't know which neibhgour table
they are going to look up.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04 00:23:23 -05:00
David S. Miller 71a83a6db6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/rocker/rocker.c

The rocker commit was two overlapping changes, one to rename
the ->vport member to ->pport, and another making the bitmask
expression use '1ULL' instead of plain '1'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-03 21:16:48 -05:00
Michal Kubeček acf8dd0a9d udp: only allow UFO for packets from SOCK_DGRAM sockets
If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a
UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to
CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer
checksum is to be computed on segmentation. However, in this case,
skb->csum_start and skb->csum_offset are never set as raw socket
transmit path bypasses udp_send_skb() where they are usually set. As a
result, driver may access invalid memory when trying to calculate the
checksum and store the result (as observed in virtio_net driver).

Moreover, the very idea of modifying the userspace provided UDP header
is IMHO against raw socket semantics (I wasn't able to find a document
clearly stating this or the opposite, though). And while allowing
CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit
too intrusive change just to handle a corner case like this. Therefore
disallowing UFO for packets from SOCK_DGRAM seems to be the best option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 22:19:29 -05:00
Florian Westphal ee586bbc28 netfilter: reject: don't send icmp error if csum is invalid
tcp resets are never emitted if the packet that triggers the
reject/reset has an invalid checksum.

For icmp error responses there was no such check.
It allows to distinguish icmp response generated via

iptables -I INPUT -p udp --dport 42 -j REJECT

and those emitted by network stack (won't respond if csum is invalid,
REJECT does).

Arguably its possible to avoid this by using conntrack and only
using REJECT with -m conntrack NEW/RELATED.

However, this doesn't work when connection tracking is not in use
or when using nf_conntrack_checksum=0.

Furthermore, sending errors in response to invalid csums doesn't make
much sense so just add similar test as in nf_send_reset.

Validate csum if needed and only send the response if it is ok.

Reference: http://bugzilla.redhat.com/show_bug.cgi?id=1169829
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-03 02:10:35 +01:00
Eric W. Biederman bdf53c5849 neigh: Don't require dst in neigh_hh_init
- Add protocol to neigh_tbl so that dst->ops->protocol is not needed
- Acquire the device from neigh->dev

This results in a neigh_hh_init that will cache the samve values
regardless of the packets flowing through it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 16:43:41 -05:00
Eric W. Biederman 59b2af26b9 arp: Kill arp_find
There are no more callers so kill this function.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 16:43:41 -05:00
Eric W. Biederman 21bfb8e933 arp: Remove special case to give AX25 it's open arp operations.
The special case has been pushed out into ax25_neigh_construct so there
is no need to keep this code in arp.c

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 16:43:40 -05:00
Ying Xue 1b78414047 net: Remove iocb argument from sendmsg and recvmsg
After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 13:06:31 -05:00
Eyal Birger b4772ef879 net: use common macro for assering skb->cb[] available size in protocol families
As part of an effort to move skb->dropcount to skb->cb[] use a common
macro in protocol families using skb->cb[] for ancillary data to
validate available room in skb->cb[].

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 00:19:30 -05:00
Eric Dumazet 74abc20ced tcp: cleanup static functions
tcp_fastopen_create_child() is static and should not be exported.

tcp4_gso_segment() and tcp6_gso_segment() should be static.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28 16:56:51 -05:00
Eric Dumazet a0ea700e40 tcp: tso: allow CA_CWR state in tcp_tso_should_defer()
Another TCP issue is triggered by ECN.

Under pressure, receiver gets ECN marks, and send back ACK packets
with ECE TCP flag. Senders enter CA_CWR state.

In this state, tcp_tso_should_defer() is short cut :

if (icsk->icsk_ca_state != TCP_CA_Open)
    goto send_now;

This means that about all ACK packets we receive are triggering
a partial send, and because cwnd is kept small, we can only send
a small amount of data for each incoming ACK,
which in return generate more ACK packets.

Allowing CA_Open and CA_CWR states to enable TSO defer in
tcp_tso_should_defer() brings performance back :
TSO autodefer has more chance to defer under pressure.

This patch increases TSO and LRO/GRO efficiency back to normal levels,
and does not impact overall ECN behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28 15:10:39 -05:00
Eric Dumazet 50c8339e92 tcp: tso: restore IW10 after TSO autosizing
With sysctl_tcp_min_tso_segs being 4, it is very possible
that tcp_tso_should_defer() decides not sending last 2 MSS
of initial window of 10 packets. This also applies if
autosizing decides to send X MSS per GSO packet, and cwnd
is not a multiple of X.

This patch implements an heuristic based on age of first
skb in write queue : If it was sent very recently (less than half srtt),
we can predict that no ACK packet will come in less than half rtt,
so deferring might cause an under utilization of our window.

This is visible on initial send (IW10) on web servers,
but more generally on some RPC, as the last part of the message
might need an extra RTT to get delivered.

Tested:

Ran following packetdrill test
// A simple server-side test that sends exactly an initial window (IW10)
// worth of packets.

`sysctl -e -q net.ipv4.tcp_min_tso_segs=4`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+.1   < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.1   < . 1:1(0) ack 1 win 257
+0    accept(3, ..., ...) = 4

+0    write(4, ..., 14600) = 14600
+0    > . 1:5841(5840) ack 1 win 457
+0    > . 5841:11681(5840) ack 1 win 457
// Following packet should be sent right now.
+0    > P. 11681:14601(2920) ack 1 win 457

+.1   < . 1:1(0) ack 14601 win 257

+0    close(4) = 0
+0    > F. 14601:14601(0) ack 1
+.1   < F. 1:1(0) ack 14602 win 257
+0    > . 14602:14602(0) ack 2

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28 15:10:39 -05:00
Eric Dumazet 5f852eb536 tcp: tso: remove tp->tso_deferred
TSO relies on ability to defer sending a small amount of packets.
Heuristic is to wait for future ACKS in hope to send more packets at once.
Current algorithm uses a per socket tso_deferred field as a pseudo timer.

This pseudo timer relies on future ACK, but there is no guarantee
we receive them in time.

Fix would be to use a real timer, but cost of such timer is probably too
expensive for typical cases.

This patch changes the logic to test the time of last transmit,
because we should not add bursts of more than 1ms for any given flow.

We've used this patch for about two years at Google, before FQ/pacing
as it would reduce a fair amount of bursts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-28 15:10:39 -05:00
Alexander Duyck 79e5ad2ceb fib_trie: Remove leaf_info
At this point the leaf_info hash is redundant.  By adding the suffix length
to the fib_alias hash list we no longer have need of leaf_info as we can
determine the prefix length from fa_slen.  So we can compress things by
dropping the leaf_info structure from fib_trie and instead directly connect
the leaves to the fib_alias hash list.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27 16:37:07 -05:00
Alexander Duyck 9b6ebad5c3 fib_trie: Add slen to fib alias
Make use of an empty spot in the alias to store the suffix length so that
we don't need to pull that information from the leaf_info structure.

This patch also makes a slight change to the user statistics.  Instead of
incrementing semantic_match_miss once per leaf_info miss we now just
increment it once per leaf if a match was not found.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27 16:37:07 -05:00
Alexander Duyck 5786ec6054 fib_trie: Replace plen with slen in leaf_info
This replaces the prefix length variable in the leaf_info structure with a
suffix length value, or host identifier length in bits.  By doing this it
makes it easier to sort out since the tnodes and leaf are carrying this
value as well since it is compatible with the ->pos field in tnodes.

I also cleaned up one spot that had some list manipulation that could be
simplified.  I basically updated it so that we just use hlist_add_head_rcu
instead of calling hlist_add_before_rcu on the first node in the list.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27 16:37:06 -05:00
Alexander Duyck 56315f9e6e fib_trie: Convert fib_alias to hlist from list
There isn't any advantage to having it as a list and by making it an hlist
we make the fib_alias more compatible with the list_info in terms of the
type of list used.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27 16:37:06 -05:00
Madhu Challa 93a714d6b5 multicast: Extend ip address command to enable multicast group join/leave on
Joining multicast group on ethernet level via "ip maddr" command would
not work if we have an Ethernet switch that does igmp snooping since
the switch would not replicate multicast packets on ports that did not
have IGMP reports for the multicast addresses.

Linux vxlan interfaces created via "ip link add vxlan" have the group option
that enables then to do the required join.

By extending ip address command with option "autojoin" we can get similar
functionality for openvswitch vxlan interfaces as well as other tunneling
mechanisms that need to receive multicast traffic. The kernel code is
structured similar to how the vxlan driver does a group join / leave.

example:
ip address add 224.1.1.10/24 dev eth5 autojoin
ip address del 224.1.1.10/24 dev eth5

Signed-off-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27 16:25:25 -05:00
Tom Herbert 723b8e460d udp: In udp_flow_src_port use random hash value if skb_get_hash fails
In the unlikely event that skb_get_hash is unable to deduce a hash
in udp_flow_src_port we use a consistent random value instead.
This is specified in GRE/UDP draft section 3.2.1:
https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27 16:00:01 -05:00
Neal Cardwell 6514890f7a tcp: fix tcp_should_expand_sndbuf() to use tcp_packets_in_flight()
tcp_should_expand_sndbuf() does not expand the send buffer if we have
filled the congestion window.

However, it should use tcp_packets_in_flight() instead of
tp->packets_out to make this check.

Testing has established that the difference matters a lot if there are
many SACKed packets, causing a needless performance shortfall.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-22 23:07:11 -05:00
Eric Dumazet 959d10f6bb igmp: add __ip_mc_{join|leave}_group()
There is a need to perform igmp join/leave operations while RTNL is
held.

Make ip_mc_{join|leave}_group() wrappers around
__ip_mc_{join|leave}_group() to avoid the proliferation of work queues.

For example, vxlan_igmp_join() could possibly be removed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-20 15:24:04 -05:00