There are some percpu_counter list corruption and poison overwritten warnings
in recent kernels, caused by commit fc66f95c.
Commit fc66f95c switched dst entry accounting to a percpu_counter. The kernel
initializes the percpu_counter in ip6_route_net_init, but never destroys it
in ip6_route_net_exit. So when the kernel frees the related data, the freed
percpu_counter is still on the list, and inserting or removing another
percpu_counter corrupts the list. Also, if an insert/remove operation modifies
the ->prev or ->next pointer of the freed entry, the poison overwritten
warning results.
With the following patch, the percpu_counter list corruption and poison
overwritten warnings disappear.
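A minimal userspace sketch of the init/exit pairing described above. It is not
the kernel's percpu_counter code: struct counter, counter_init(),
counter_destroy() and counters_linked() are hypothetical names, and the global
list is a plain circular doubly linked list. The point is only that every
object linked in by the init path must be unlinked by the exit path before its
memory is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a counter that lives on a global list. */
struct counter {
    long count;
    struct counter *prev, *next;
};

/* Circular list head, standing in for the global list of counters. */
static struct counter counters = { 0, &counters, &counters };

/* Init path: link the counter into the global list. */
static void counter_init(struct counter *c)
{
    c->count = 0;
    c->next = counters.next;
    c->prev = &counters;
    counters.next->prev = c;
    counters.next = c;
}

/* Exit path: unlink the counter before its memory is released. */
static void counter_destroy(struct counter *c)
{
    assert(c->prev != NULL && c->next != NULL);
    c->prev->next = c->next;
    c->next->prev = c->prev;
    c->prev = c->next = NULL;
}

/* Number of counters currently linked into the list. */
static int counters_linked(void)
{
    int n = 0;

    for (struct counter *c = counters.next; c != &counters; c = c->next)
        n++;
    return n;
}
```

If counter_destroy() is skipped, the list still points at memory that is about
to be reused, which is the situation the warnings reported.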
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit a18135eb93 (Add UDP_MIB_{SND,RCV}BUFERRORS handling.)
forgot to make the necessary changes in net/ipv6/proc.c to report
additional counters in /proc/net/snmp6
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the tunnels (ipip, gre, sit and ip6) were converted to RCU protection,
a bug was introduced into the SIOCCHGTUNNEL code.
The tunnel is first unlinked, then its addresses are changed, then it is
linked back, probably into another bucket. But while the parms are being
changed, the hash table is unlocked to readers and they can look up the
improper tunnel.
Respective commits are b7285b79 (ipip: get rid of ipip_lock), 1507850b
(gre: get rid of ipgre_lock), 3a43be3c (sit: get rid of ipip6_lock) and
94767632 (ip6tnl: get rid of ip6_tnl_lock).
The quick fix is to wait for quiescent state to pass after unlinking,
but if it is inappropriate I can invent something better, just let me
know.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __rcu annotations to :
struct net_protocol *inet_protos
struct net_protocol *inet6_protos
And use appropriate casts to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After running this bonding setup script
modprobe bonding miimon=100 mode=0 max_bonds=1
ifconfig bond0 10.1.1.1/16
ifenslave bond0 eth1
ifenslave bond0 eth3
on s390 with qeth-driven slaves, modprobe -r fails with this message
unregister_netdevice: waiting for bond0 to become free. Usage count = 1
because the duplicate address is detected twice.
Problem is caused by a missing decrease of ifp->refcnt in addrconf_dad_failure.
An extra call of in6_ifa_put(ifp) solves it.
Problem has been introduced with commit f2344a131b.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a bug in the interaction between ipv6_create_tempaddr and
addrconf_verify. Because ipv6_create_tempaddr uses the cstamp and tstamp
from the public address in creating a private address, if we have not
received a router advertisement in a while, tstamp + temp_valid_lft might be
< now. If this happens, the new address is created inside
ipv6_create_tempaddr, then the loop within addrconf_verify starts again and
the address is immediately deleted. We are left with no temporary addresses
on the interface, and no more will be created until the public IP address is
updated. To avoid this, set the expiry time to be the minimum of the time
left on the public address or the config option PLUS the current age of the
public interface.
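The expiry rule above can be sketched as plain arithmetic. This is an
illustration under stated assumptions, not the patch itself: all times are in
seconds, and temp_valid_lft() and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/*
 * Lifetime given to a new temporary address. The lifetime counts from the
 * public address's timestamp (pub_tstamp), because the temporary address
 * inherits that timestamp, so the configured lifetime is extended by the
 * public address's current age.
 */
static uint32_t temp_valid_lft(uint32_t now, uint32_t pub_tstamp,
                               uint32_t pub_valid_lft,
                               uint32_t cfg_temp_valid_lft)
{
    uint32_t age;

    assert(now >= pub_tstamp);
    age = now - pub_tstamp;
    return min_u32(pub_valid_lft, cfg_temp_valid_lft + age);
}
```

With now = 1000, pub_tstamp = 100, a public lifetime of 2000 and a configured
temporary lifetime of 600, the result is 1500, so the address expires at 1600.
Without adding the age it would have expired at 700, before it was created.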
Signed-off-by: Glenn Wurster <gwurster@scs.carleton.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
If privacy extensions are enabled, but no current temporary address exists,
then create one when we get a router advertisement.
Signed-off-by: Glenn Wurster <gwurster@scs.carleton.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __rcu annotation to :
(struct sock)->sk_filter
And use appropriate rcu primitives to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the previous tproxy related patches split IPv6 defragmentation and
connection tracking, but did not correctly add Kconfig stanzas to handle the
new dependencies correctly. This patch fixes that by making the config options
mirror the setup we have for IPv4: a distinct config option for defragmentation
that is automatically selected by both connection tracking and
xt_TPROXY/xt_socket.
The patch also changes the #ifdefs enclosing IPv6 specific code in xt_socket
and xt_TPROXY: we only compile these in case we have ip6tables support enabled.
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
(struct ip6_tnl)->next is rcu protected :
(struct ip_tunnel)->next is rcu protected :
(struct xfrm6_tunnel)->next is rcu protected :
add __rcu annotation and proper rcu primitives.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP_TRANSPARENT requires root (more precisely CAP_NET_ADMIN privileges)
for IPv6.
However, as far as I can see, this check is missing from the IPv6
implementation.
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6_tunnel device did not unset the IFF_XMIT_DST_RELEASE
flag. This makes the dev layer release the dst before
calling the tunnel.
The tunnel then cannot update any mtu/pmtu info, since
the skb no longer has a dst.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support for IPV6_RECVORIGDSTADDR sockopt for UDP sockets was contributed by
Harry Mason.
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Just like with IPv4, we need access to the UDP hash table to look up local
sockets, but instead of exporting the global udp_table, export a lookup
function.
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The parameters for various UDP lookup functions were non-const, even though
they could be const. TProxy has some const references and instead of
downcasting it, I added const specifiers along the path.
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Like with IPv4, TProxy needs IPv6 defragmentation but does not
require connection tracking. Since defragmentation was coupled
with conntrack, I split off the two, creating an nf_defrag_ipv6 module,
similar to the already existing nf_defrag_ipv4.
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When __inet_inherit_port() is called on a tproxy connection the wrong locks are
held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
implicit assumption that the listener's port number (and thus its bind bucket)
is the same as the child's.
Unfortunately, if you're using the TPROXY target to redirect skbs to a
transparent proxy that assumption is not true anymore and things break.
This patch adds code to __inet_inherit_port() so that it can handle this case
by looking up or creating a new bind bucket for the child socket and updates
callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
failing.
Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>.
See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Function only defined and used in one file.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing profile analysis, I found fib_hash_table was sometimes in a
cache line shared with a possibly often written kernel structure.
(CONFIG_IP_ROUTE_MULTIPATH || !CONFIG_IPV6_MULTIPLE_TABLES)
It's hard to detect because it is not easily reproducible.
Make sure we allocate a full cache line to keep this shared in all cpu
caches.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid two atomic ops on found rule in fib6_rule_lookup()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many of the used macros are just there for userspace compatibility.
Substitute the in-kernel code to directly use the terminal macro
and stuff the defines into #ifndef __KERNEL__ sections.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
struct dst_ops tracks number of allocated dst in an atomic_t field,
subject to high cache line contention in stress workload.
Switch to a percpu_counter, to reduce number of time we need to dirty a
central location. Place it on a separate cache line to avoid dirtying
read only fields.
Stress test :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE, SLUB/NUMA)
Before:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
After:
real 0m45.570s
user 0m15.525s
sys 9m56.669s
With a small reordering of struct neighbour fields, subject of a
following patch, (to separate refcnt from other read mostly fields)
real 0m41.841s
user 0m15.261s
sys 8m45.949s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David
This is the first step for RCU conversion of neigh code.
Next patches will convert hash_buckets[] and "struct neighbour" to RCU
protected objects.
Thanks
[PATCH net-next] net neigh: RCU conversion of neigh hash table
Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
neigh_table", a new structure is defined :
struct neigh_hash_table {
struct neighbour **hash_buckets;
unsigned int hash_mask;
__u32 hash_rnd;
struct rcu_head rcu;
};
And "struct neigh_table" has an RCU protected pointer to such a
neigh_hash_table.
This means the signature of (*hash)() function changed: We need to add a
third parameter with the actual hash_rnd value, since this is not
anymore a neigh_table field.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In various situations, a device provides a packet to our stack and we
drop it before it enters protocol stack :
- softnet backlog full (accounted in /proc/net/softnet_stat)
- bad vlan tag (not accounted)
- unknown/unregistered protocol (not accounted)
We can handle a per-device counter of such dropped frames at core level,
and automatically add it to the device provided stats (rx_dropped), so
that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)
This is a generalization of commit 8990f468a (net: rx_dropped
accounting), thus reverting it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Another exported symbol only used in one file
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipt_LOG & ip6t_LOG use a lot of calls to printk() and use a lock in the hope
that several cpus won't mix their output in syslog.
printk() being very expensive [1], it's better to call it once, on a
prebuilt and complete line. Also, with mixed IPv4 and IPv6 traffic,
separate IPv4/IPv6 locks don't avoid garbage.
I used an allocation of a 1024 bytes structure, sort of seq_printf() but
with a fixed size limit.
Use a static buffer if dynamic allocation failed.
Emit a one-time alert if buffer size happens to be too short.
[1]: printk() has various features like printk_delay()...
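The "build the line first, emit it once" idea can be sketched in userspace as
below. This is an assumption-laden illustration, not the ipt_LOG code: struct
log_buf, log_buf_add() and the LOG_BUF_DATA size are hypothetical, and
vsnprintf() stands in for the kernel's formatting helpers.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define LOG_BUF_DATA 1000   /* hypothetical payload size of the buffer */

struct log_buf {
    size_t len;             /* bytes used so far, excluding the NUL */
    int truncated;          /* set when some output did not fit */
    char data[LOG_BUF_DATA];
};

/* Append formatted text to the buffer, seq_printf() style. */
static void log_buf_add(struct log_buf *b, const char *fmt, ...)
{
    va_list ap;
    size_t room;
    int n;

    assert(b->len < sizeof(b->data));
    room = sizeof(b->data) - b->len;

    va_start(ap, fmt);
    n = vsnprintf(b->data + b->len, room, fmt, ap);
    va_end(ap);

    if (n < 0)
        return;                          /* formatting error */
    if ((size_t)n >= room) {             /* output was cut short */
        b->truncated = 1;
        b->len = sizeof(b->data) - 1;
    } else {
        b->len += (size_t)n;
    }
}
```

Once the line is complete, the caller would emit b->data with a single print
call and could raise the one-time alert if b->truncated is set.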
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
Other seldom used fields are kept in netdev->stats structure, possibly
unsafe.
This is a preliminary work to support lockless transmit path, and
correct RX stats, that are already unsafe.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 15fc1f7056 (sit: percpu stats accounting) forgot the fallback
tunnel case (sit0), and can crash pretty fast.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AnyIP is the capability to receive packets and establish incoming
connections on IPs we have not explicitly configured on the machine.
An example use case is to configure a machine to accept all incoming
traffic on eth0, and leave the policy of whether traffic for a given IP
should be delivered to the machine up to the load balancer.
Can be setup as follows:
ip -6 rule from all iif eth0 lookup 200
ip -6 route add local default dev lo table 200
(in this case for all IPv6 addresses)
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
Other seldom used fields are kept in netdev->stats structure, possibly
unsafe.
This is a preliminary work to support lockless transmit path, and
correct RX stats, that are already unsafe.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 and IPv6 have separate neighbour tables, so
the warning messages should be distinguishable.
[ Add a suitable message prefix on the ipv4 side as well -DaveM ]
Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up a missing exit path in the ipv6 module init routines. In
addrconf_init we call ipv6_addr_label_init which calls register_pernet_subsys
for the ipv6_addr_label_ops structure. But if module loading fails, or if the
ipv6 module is removed, there is no corresponding unregister_pernet_subsys call,
which leaves a now-bogus address on the pernet_list, leading to oopses in
subsequent registrations. This patch cleans up both the failed load path and
the unload path. Tested by myself with good results.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/net/addrconf.h | 1 +
net/ipv6/addrconf.c | 11 ++++++++---
net/ipv6/addrlabel.c | 5 +++++
3 files changed, 14 insertions(+), 3 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Change "return (EXPR);" to "return EXPR;"
return is not a function, parentheses are not required.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Special care should be taken when slow path is hit in ip_fragment() :
When walking through frags, we transfer truesize ownership from skb to
frags. Then if we hit a slow_path condition, we must undo this or risk
uncharging frags->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.
Many thanks to Nick Bowler, who provided a very clean bug report and
test program.
Thanks to Jarek for reviewing my first patch and providing a V2.
While Nick's bisection pointed to commit 2b85a34e91 (net: No more
expensive sock_hold()/sock_put() on each tx), the underlying bug is older
(2.6.12-rc5)
A side effect is to extend work done in commit b2722b1c3a
(ip_fragment: also adjust skb->truesize for packets not owned by a
socket) to ipv6 as well.
Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The family parameter of xfrm_state_find is used to find a state matching a
certain policy. This value is set to the template's family
(encap_family) right before xfrm_state_find is called.
The family parameter is however also used to construct a temporary state
in xfrm_state_find itself which is wrong for inter-family scenarios
because it produces a selector for the wrong family. Since this selector
is included in the xfrm_user_acquire structure, user space programs
misinterpret IPv6 addresses as IPv4 and vice versa.
This patch splits up the original init_tempsel function into one part that
initializes the selector and another that initializes the props and id of
the temporary state, to allow for differing ip address families within the
state.
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Under load, netif_rx() can drop incoming packets, but administrators don't
have a chance to spot which device needs some tuning (RPS activation for
example)
This patch adds rx_dropped accounting in vlans and tunnels.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As RTNL is held while doing tunnels inserts and deletes, we can remove
ip6_tnl_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As RTNL is held while doing tunnels inserts and deletes, we can remove
ipip6_lock spinlock. My initial RCU conversion was conservative and
converted the rwlock to spinlock, with no RTNL requirement.
Use appropriate rcu annotations and modern lockdep checks as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm4_tunnel_register() & xfrm6_tunnel_register() should
use rcu_assign_pointer() to make sure previous writes
(to handler->next) are committed to memory before chain
insertion.
Deregister functions don't need a particular barrier.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).
Problem is that following sequence :
fd = socket(...)
connect(fd, &remote, ...)
not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)
Sequence is :
- autobind() : choose a random local port, insert socket in hash tables
[while local address is INADDR_ANY]
- connect() : set remote address and port, change local address to IP
given by a route lookup.
When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.
One solution to this problem is to rehash datagram socket if needed.
We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.
This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.
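A toy model of the secondary-hash problem and the rehash fix. Everything here
is hypothetical (struct sock2, the 8-slot table, the XOR hash); it only shows
that a socket hashed while its local address was INADDR_ANY sits in the wrong
slot once connect() assigns an address, and that moving it restores the lookup.
Wildcard matching and locking are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 8   /* hypothetical table size, power of two */

struct sock2 {
    uint32_t local_addr;   /* 0 stands in for INADDR_ANY */
    uint16_t local_port;
    struct sock2 *next;
};

static struct sock2 *hash2[SLOTS];

/* Toy hash on (local addr, local port). */
static unsigned slot_of(uint32_t addr, uint16_t port)
{
    return (addr ^ port) & (SLOTS - 1);
}

static void hash2_insert(struct sock2 *sk)
{
    unsigned s = slot_of(sk->local_addr, sk->local_port);

    sk->next = hash2[s];
    hash2[s] = sk;
}

static void hash2_remove(struct sock2 *sk, unsigned s)
{
    struct sock2 **pp;

    assert(s < SLOTS);
    pp = &hash2[s];
    while (*pp && *pp != sk)
        pp = &(*pp)->next;
    if (*pp)
        *pp = sk->next;
}

/* Called once connect() has picked a local address for the socket. */
static void hash2_rehash(struct sock2 *sk, uint32_t new_addr)
{
    unsigned old = slot_of(sk->local_addr, sk->local_port);

    sk->local_addr = new_addr;
    if (slot_of(new_addr, sk->local_port) != old) {
        hash2_remove(sk, old);
        hash2_insert(sk);
    }
}

/* Exact-match lookup in the slot selected by (addr, port). */
static struct sock2 *hash2_lookup(uint32_t addr, uint16_t port)
{
    struct sock2 *sk = hash2[slot_of(addr, port)];

    for (; sk; sk = sk->next)
        if (sk->local_addr == addr && sk->local_port == port)
            return sk;
    return NULL;
}
```

The primary hash is keyed on the local port only, so it needs no such step.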
Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use cmpxchg() to get rid of spinlocks in inet_add_protocol() and
friends.
inet_protos[] & inet6_protos[] are moved to read_mostly section
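A userspace sketch of claiming a table slot with compare-and-exchange instead
of a spinlock. C11 atomics stand in for the kernel's cmpxchg(); proto_add(),
proto_del() and struct proto_handler are hypothetical names, and the grace
period a real removal has to wait for is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_PROTOS 256

struct proto_handler { int id; };   /* hypothetical stand-in */

static _Atomic(const struct proto_handler *) protos[MAX_PROTOS];

/* Claim the slot only if it is still empty: 0 on success, -1 if taken. */
static int proto_add(const struct proto_handler *h, unsigned char num)
{
    const struct proto_handler *expected = NULL;

    assert(h != NULL);
    return atomic_compare_exchange_strong(&protos[num], &expected, h) ? 0 : -1;
}

/* Clear the slot only if it still holds h: 0 on success, -1 otherwise. */
static int proto_del(const struct proto_handler *h, unsigned char num)
{
    const struct proto_handler *expected = h;

    return atomic_compare_exchange_strong(&protos[num], &expected,
                                          (const struct proto_handler *)NULL)
               ? 0 : -1;
}
```

Two writers racing for the same slot cannot both succeed, which is the
property the spinlock used to provide.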
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC5722 prohibits reassembling IPv6 fragments when some data overlaps.
Bug spotted by Zhang Zuotao <zuotao.zhang@6wind.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC5722 prohibits reassembling fragments when some data overlaps.
Bug spotted by Zhang Zuotao <zuotao.zhang@6wind.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to accepting router advertisement, the IPv6 stack does not send router
solicitations if forwarding is enabled.
This patch enables this behavior to be overruled by setting forwarding to the
special value 2.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current IPv6 behavior is to not accept router advertisements while
forwarding, i.e. configured as router.
This does make sense, a router is typically not supposed to be auto
configured. However there are exceptions and we should allow the
current behavior to be overwritten.
Therefore this patch enables the user to overrule the "if forwarding
enabled then don't listen to RAs" rule by setting accept_ra to the
special value of 2.
An alternative would be to ignore the forwarding switch altogether
and solely accept RAs based on the value of accept_ra. However, I
found that if not intended, accepting RAs as a router can lead to
strange unwanted behavior, therefore it seems wise to only do so
if the user explicitly asks for this behavior.
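The resulting rule can be written as a small predicate. This is a sketch of
the logic described above, with ra_accepted() as a hypothetical name; the two
arguments mirror the forwarding and accept_ra settings.

```c
#include <assert.h>

/* Returns nonzero when router advertisements should be processed. */
static int ra_accepted(int forwarding, int accept_ra)
{
    assert(accept_ra >= 0 && accept_ra <= 2);

    /* A router ignores RAs unless accept_ra was explicitly set to 2. */
    if (forwarding && accept_ra < 2)
        return 0;
    return accept_ra != 0;
}
```

So accept_ra=1 keeps the old behavior (RAs are dropped while forwarding), and
only accept_ra=2 overrules the forwarding check.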
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tunnel4_handlers, tunnel64_handlers, tunnel6_handlers and
tunnel46_handlers are protected by RCU, but we don't use appropriate rcu
primitives to scan them. rcu_read_lock() is already held by caller.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tunnel6_handlers chain being scanned for each incoming packet,
make sure it doesn't share an often dirtied cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tunnel4_handlers chain being scanned for each incoming packet,
make sure it doesn't share an often dirtied cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f3c5c1bfd4
(netfilter: xtables: make ip_tables reentrant) forgot to
also compute the jumpstack size in the compat handlers.
Result is that "iptables -I INPUT -j userchain" turns into -j DROP.
Reported by Sebastian Roesner on #netfilter, closes
http://bugzilla.netfilter.org/show_bug.cgi?id=669.
Note: arptables change is compile-tested only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).
Since skb_has_frags() tests the latter, its name is confusing
since it sounds more like it's testing the former.
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 24b36f019 (netfilter: {ip,ip6,arp}_tables: dont block
bottom half more than necessary), lockdep can raise a warning
because we attempt to lock a spinlock with BH enabled, while
the same lock is usually locked by another cpu in a softirq context.
Disable BH again to avoid these lockdep warnings.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Diagnosed-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sysctl handlers output the ipv6 gc_elasticity and min_adv_mss values
divided by HZ. However, they are not in units of jiffies: ip6_rt_min_advmss
refers to a packet size and ip6_rt_gc_elasticity is used as a scaler, as in
expire>>ip6_rt_gc_elasticity. So replace the jiffies conversion
handler with the regular handler for them.
This has an impact on scripts that currently work assuming the
divide by HZ: they will yield different results with this patch in place.
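A small arithmetic sketch of why the jiffies handler misreports these values.
HZ = 1000 is an assumption (the real value is a kernel build option), and the
function names and sample values are illustrative only.

```c
#include <assert.h>

#define HZ 1000   /* assumed tick rate */

/* What a jiffies-converting handler reports: the stored value over HZ. */
static int show_as_jiffies(int stored)
{
    assert(stored >= 0);
    return stored / HZ;
}

/* What a plain integer handler reports: the stored value itself. */
static int show_as_int(int stored)
{
    return stored;
}
```

A stored elasticity of 9 reads back as 0 through the jiffies handler and as 9
through the plain handler; a stored size of 1220 reads back as 1 instead of
1220.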
Signed-off-by: Min Zhang <mzhang@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently disable BH for the whole duration of get_counters()
On machines with a lot of cpus and large tables, this might be too long.
We can disable preemption during the whole function, and disable BH only
while fetching counters for the current cpu.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ipv6_hdr(skb)->payload_len is ZERO and can't be used for accounting, if
the payload is a Jumbo Payload specified in RFC2675.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
If the interface has IPv6 disabled, don't add a multicast or
link-local route since we won't be adding a link-local address.
Reported-by: Mahesh Kelkar <maheshkelkar@gmail.com>
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/vhost/net.c
net/bridge/br_device.c
Fix merge conflict in drivers/vhost/net.c with guidance from
Stephen Rothwell.
Revert the effects of net-2.6 commit 573201f36f
since net-next-2.6 has fixes that make bridge netpoll work properly thus
we don't need it disabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
The input handler for Type 2 Routing Header (mip6_rthdr_input())
checks if the CoA in the packet matches the CoA in the XFRM state.
Current check is buggy: it compares the address in the Type 2
Routing Header, i.e. the HoA, against the expected CoA in the state.
The comparison should be made against the address in the destination
field of the IPv6 header.
The bug remained unnoticed because the main (and possibly only current)
user of the code (UMIP MIPv6 Daemon) initializes the XFRM state with the
unspecified address, i.e. explicitly allows everything.
Yoshifuji-san, can you ack that one?
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A new boolean flag no_autobind is added to struct proto to avoid the autobind
calls when the protocol is TCP. Then sock_rps_record_flow() is called in
TCP's sendmsg() and sendpage() paths.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/net/inet_common.h | 4 ++++
include/net/sock.h | 1 +
include/net/tcp.h | 8 ++++----
net/ipv4/af_inet.c | 15 +++++++++------
net/ipv4/tcp.c | 11 +++++------
net/ipv4/tcp_ipv4.c | 3 +++
net/ipv6/af_inet6.c | 8 ++++----
net/ipv6/tcp_ipv6.c | 3 +++
8 files changed, 33 insertions(+), 20 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
As fragments are sent in order by most OSes, such as Windows, Darwin and
FreeBSD, it is likely that new fragments belong at the end of the
inet_frag_queue.
In the fast path, we check if the skb at the end of the inet_frag_queue is the
prev we expect.
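The fast path can be sketched on a simple offset-sorted list. This is not the
reassembly code itself: struct frag, struct frag_queue, find_prev() and
frag_insert() are hypothetical, and overlap handling is left out.

```c
#include <assert.h>
#include <stddef.h>

struct frag {
    int offset;
    struct frag *next;
};

struct frag_queue {
    struct frag *head;   /* fragments sorted by offset */
    struct frag *tail;   /* last fragment in the list */
};

/* Find the fragment after which 'offset' belongs (NULL means at the head). */
static struct frag *find_prev(const struct frag_queue *q, int offset)
{
    struct frag *prev = NULL, *cur;

    /* Fast path: with in-order arrival the new fragment goes last. */
    if (q->tail && q->tail->offset < offset)
        return q->tail;

    /* Slow path: walk the list from the head. */
    for (cur = q->head; cur && cur->offset < offset; cur = cur->next)
        prev = cur;
    return prev;
}

static void frag_insert(struct frag_queue *q, struct frag *f)
{
    struct frag *prev;

    assert(q != NULL && f != NULL);
    prev = find_prev(q, f->offset);

    f->next = prev ? prev->next : q->head;
    if (prev)
        prev->next = f;
    else
        q->head = f;
    if (!f->next)
        q->tail = f;
}
```

For in-order traffic every insert is answered by the tail check, so the list
walk is only paid for reordered fragments.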
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
While using xfrm by MARK feature in
2.6.34 - 2.6.35 kernels, the mark
is always cleared in flowi structure via memset in
_decode_session4 (net/ipv4/xfrm4_policy.c), so
the policy lookup fails.
IPv6 code is affected by this bug too.
Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should release dst if dst->error is set.
Bug introduced in 2.6.14 by commit e104411b82
([XFRM]: Always release dst_entry on error in xfrm_lookup)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
add fast path for in-order fragments
As fragments are sent in order by most OSes, such as Windows, Darwin and
FreeBSD, it is likely that new fragments belong at the end of the
inet_frag_queue.
In the fast path, we check if the skb at the end of the inet_frag_queue is the
prev we expect.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/net/inet_frag.h | 1 +
net/ipv4/ip_fragment.c | 12 ++++++++++++
net/ipv6/reassembly.c | 11 +++++++++++
3 files changed, 24 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/net/snmp and /proc/net/netstat expose SNMP counters.
Width of these counters is either 32 or 64 bits, depending on the size
of "unsigned long" in kernel.
This means a user program parsing these files must already be prepared to
deal with 64bit values, regardless of the user program being 32 or 64 bit.
This patch introduces 64bit snmp values for IPSTAT mib, where some
counters can wrap pretty fast if they are 32bit wide.
# netstat -s|egrep "InOctets|OutOctets"
InOctets: 244068329096
OutOctets: 244069348848
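To illustrate the wrap problem, the sketch below splits a 64bit octet count
into the number of times a 32bit counter would have wrapped and the value it
would currently show. split_counter() is a hypothetical helper, not part of
the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 64bit total into 32bit wrap count and 32bit visible value. */
static void split_counter(uint64_t total, uint32_t *wraps, uint32_t *low)
{
    assert(wraps != NULL && low != NULL);
    *wraps = (uint32_t)(total >> 32);   /* full 2^32 cycles completed */
    *low = (uint32_t)total;             /* what a 32bit counter shows */
}
```

For the InOctets value above (244068329096), a 32bit counter would already
have wrapped 56 times and would read 3550160520.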
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
max_desync_factor can be configured per-interface, but nothing is
using the value.
Reported-by: Piotr Lewandowski <piotr.lewandowski@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since addresses are only revalidated every 2 minutes, the reported
valid_lft can underflow shortly before the address is deleted.
Clamp it to a minimum of 0, as for prefered_lft.
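A sketch of the clamped subtraction, assuming lifetimes are unsigned 32bit
seconds and that 0xFFFFFFFF marks an infinite lifetime. age_lifetimes() and
clamp_sub() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INFINITY_LIFE_TIME 0xFFFFFFFFu   /* "never expires" marker */

/* Subtract without wrapping below zero. */
static uint32_t clamp_sub(uint32_t lft, uint32_t elapsed)
{
    return lft > elapsed ? lft - elapsed : 0;
}

/* Age both lifetimes by 'elapsed' seconds since the last update. */
static void age_lifetimes(uint32_t *preferred, uint32_t *valid,
                          uint32_t elapsed)
{
    assert(preferred != NULL && valid != NULL);
    if (*preferred != INFINITY_LIFE_TIME)
        *preferred = clamp_sub(*preferred, elapsed);
    if (*valid != INFINITY_LIFE_TIME)
        *valid = clamp_sub(*valid, elapsed);
}
```

With a plain unsigned subtraction, a valid lifetime of 100 reported 110
seconds after the last update would show up as 4294967286 instead of 0.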
Reported-by: Piotr Lewandowski <piotr.lewandowski@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The LOG targets print the entire MAC header as one long string, which is not
very readable:
IN=eth0 OUT= MAC=00:15:f2:24:91:f8:00:1b:24:dc:61:e6:08:00 ...
Add an option to decode known header formats (currently just ARPHRD_ETHER devices)
in their individual fields:
IN=eth0 OUT= MACSRC=00:1b:24:dc:61:e6 MACDST=00:15:f2:24:91:f8 MACPROTO=0800 ...
IN=eth0 OUT= MACSRC=00:1b:24:dc:61:e6 MACDST=00:15:f2:24:91:f8 MACPROTO=86dd ...
The option needs to be explicitly enabled by userspace to avoid breaking
existing parsers.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Remove the comparison within the loop to print the macheader by prepending
the colon to all but the first printk.
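The same trick in userspace form: print the first byte on its own, then
prepend the colon to every following byte, so the loop body needs no
comparison. format_mac() is a hypothetical helper and snprintf() stands in
for printk().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format 'len' bytes as colon-separated hex into 'out' (size 'outlen'). */
static void format_mac(char *out, size_t outlen,
                       const unsigned char *p, size_t len)
{
    size_t used = 0, i;

    /* Each byte needs at most three characters, including the final NUL. */
    assert(len > 0 && outlen >= 3 * len);

    used += (size_t)snprintf(out + used, outlen - used, "%02x", p[0]);
    for (i = 1; i < len; i++)
        used += (size_t)snprintf(out + used, outlen - used, ":%02x", p[i]);
}
```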
Based on suggestion by Jan Engelhardt <jengelh@medozas.de>.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Allows use of ECN when syncookies are in effect by encoding ecn_ok
into the syn-ack tcp timestamp.
While at it, remove an unneeded #ifdef CONFIG_SYN_COOKIES.
With CONFIG_SYN_COOKIES=n, want_cookie is ifdef'd to 0 and gcc
removes the "if (0)".
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed out by Fernando Gont there is no need to encode rcv_wscale
into the cookie.
We did not use the restored rcv_wscale anyway; it is recomputed
via tcp_select_initial_window().
Thus we can save 4 bits in the ts option space by removing rcv_wscale.
In case window scaling was not supported, we set the (invalid) wscale
value 0xf.
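A sketch of packing the negotiated options into the low bits of the
timestamp. The bit positions and all names below are assumptions made for
illustration; the message above only states that the window scale takes 4
bits and that 0xf marks "no window scaling". Mixing these bits into the real
timestamp value is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the option bits. */
#define TS_WSCALE_MASK 0xfu        /* bits 0-3: peer's window scale */
#define TS_SACK        (1u << 4)   /* bit 4: SACK permitted */
#define TS_ECN         (1u << 5)   /* bit 5: ECN negotiated */
#define TS_WSCALE_NONE 0xfu        /* invalid scale: wscale not supported */

struct cookie_opts {
    int wscale_ok, sack_ok, ecn_ok;
    unsigned snd_wscale;
};

/* Pack the options into the low timestamp bits. */
static uint32_t ts_encode(const struct cookie_opts *o)
{
    uint32_t bits;

    assert(!o->wscale_ok || o->snd_wscale < TS_WSCALE_NONE);
    bits = o->wscale_ok ? o->snd_wscale : TS_WSCALE_NONE;
    if (o->sack_ok)
        bits |= TS_SACK;
    if (o->ecn_ok)
        bits |= TS_ECN;
    return bits;
}

/* Recover the options from an echoed timestamp. */
static struct cookie_opts ts_decode(uint32_t ts)
{
    struct cookie_opts o;
    unsigned ws = ts & TS_WSCALE_MASK;

    o.wscale_ok = ws != TS_WSCALE_NONE;
    o.snd_wscale = o.wscale_ok ? ws : 0;
    o.sack_ok = (ts & TS_SACK) != 0;
    o.ecn_ok = (ts & TS_ECN) != 0;
    return o;
}
```

Because valid window scale values never reach 0xf, that value can safely
double as the "not supported" marker without spending an extra bit.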
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9261e53701 (ipv6: making ip and icmp statistics per/namespace)
forgot to remove ipv6_statistics variable.
commit bc417d99bf (ipv6: remove stale MIB definitions) took care of
icmpv6_statistics & icmpv6msg_statistics
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Denis V. Lunev <den@openvz.org>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for 64bit snmp counters for some mibs,
add an 'align' parameter to snmp_mib_init(), instead
of assuming mibs only contain 'unsigned long' fields.
Callers can use __alignof__(type) to provide correct
alignment.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The addition of TLLAO option created a kernel OOPS regression
for the case where neighbor advertisement is being sent via
proxy path. When using proxy, ipv6_get_ifaddr() returns NULL
causing the NULL dereference.
Change causing the bug was:
commit f7734fdf61
Author: Octavian Purdila <opurdila@ixiacom.com>
Date: Fri Oct 2 11:39:15 2009 +0000
make TLLAO option for NA packets configurable
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Discard the ACK if we find options that do not match current sysctl
settings.
Previously it was possible to create a connection with sack, wscale,
etc. enabled even if the feature was disabled via sysctl.
Also remove an unneeded call to tcp_sack_reset() in
cookie_check_timestamp: Both call sites (cookie_v4_check,
cookie_v6_check) zero "struct tcp_options_received", hand it to
tcp_parse_options() (which does not change tcp_opt->num_sacks/dsack)
and then call cookie_check_timestamp().
Even if num_sacks/dsacks were changed, the structure is allocated on
the stack and after cookie_check_timestamp returns only a few selected
members are copied to the inet_request_sock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Third param (work) is unused, remove it.
Remove __inline__ and inline qualifiers.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of doing one atomic operation per frag, we can factorize them.
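The pattern, sketched in user space with C11 atomics (struct frag and both
function names are made up for this example and are not the kernel's
fragment queue code): sum the sizes locally, then touch the shared counter
once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct frag {
	int truesize;
	struct frag *next;
};

/* Before: one atomic operation per fragment. */
static void uncharge_each(atomic_int *mem, const struct frag *f)
{
	for (; f; f = f->next)
		atomic_fetch_sub(mem, f->truesize);
}

/* After: accumulate in a local variable, one atomic operation in total. */
static void uncharge_batched(atomic_int *mem, const struct frag *f)
{
	int sum = 0;

	for (; f; f = f->next) {
		assert(f->truesize >= 0);
		sum += f->truesize;
	}
	atomic_fetch_sub(mem, sum);
}
```

Both variants leave the counter at the same value; the batched one just
avoids bouncing the counter's cache line once per fragment.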
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When syncookies are in effect, req->iif is left uninitialized.
In case of e.g. link-local addresses the route lookup then fails
and no syn-ack is sent.
Rearrange things so ->iif is also initialized in the syncookie case.
want_cookie can only be true when the isn was zero, thus move the want_cookie
check into the "!isn" branch.
Cc: Glenn Griffin <ggriffin.kernel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RCU to avoid atomic ops on idev refcnt in ipv6_get_mtu()
and ip6_dst_hoplimit()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use __in6_dev_get() instead of in6_dev_get()/in6_dev_put()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parameter (work) is unused, remove it.
Reported from Eric Dumazet.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Instead of doing one atomic operation per frag, we can factorize them.
Reported from Eric Dumazet.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
If the returned csum value is 0, ip_summed has already been set to
CHECKSUM_UNNECESSARY in __skb_checksum_complete_head().
So this patch removes the redundant check and returns to the
caller directly.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Remove the useless union keyword in rtable, rt6_info and dn_route.
Since each union has only one member, the union keyword serves no purpose.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 1f8438a853 (icmp: Account for ICMP out errors), I made a typo
on the IPv6 side, using ICMP6_MIB_OUTMSGS instead of ICMP6_MIB_OUTERRORS.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converts queue_lock rwlock to a spinlock.
(readlocked part can be changed by reads of integer values)
One atomic operation instead of four per ipq_enqueue_packet() call.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet. This is bad for performance.
__read_mostly annotation is also a bad choice.
This patch introduces IPS_UNTRACKED bit so that we can use later a
per_cpu untrack structure more easily.
A new helper, nf_ct_untracked_get() returns a pointer to
nf_conntrack_untracked.
Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
IPS_NAT_DONE_MASK bits to untracked status.
nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
- dev_get_by_flags() changed to dev_get_by_flags_rcu()
- ipv6_sock_ac_join() doesn't touch dev & idev refcounts
- ipv6_sock_ac_drop() doesn't touch dev & idev refcounts
- ipv6_sock_ac_close() doesn't touch dev & idev refcounts
- ipv6_dev_ac_dec() doesn't touch idev refcount
- ipv6_chk_acast_addr() doesn't touch idev refcount
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipmr_rules_exit() and ip6mr_rules_exit() free a list of items, but
forget to properly remove these items from list. List head is not
changed and still points to freed memory.
This can trigger a fault later when icmpv6_sk_exit() is called.
Fix is to either reinit list, or use list_del() to properly remove items
from list before freeing them.
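A self-contained sketch of the second option (unlink, then free), using a
small user-space imitation of the kernel's list_head. struct table,
table_new() and tables_exit() are hypothetical names, not the ipmr code.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Hypothetical table; 'list' is deliberately the first member. */
struct table { struct list_head list; int id; };

static struct table *table_new(struct list_head *head, int id)
{
	struct table *t = malloc(sizeof(*t));

	assert(t);
	t->id = id;
	list_add(&t->list, head);
	return t;
}

/* Unlink every entry before freeing it, so 'head' never points at freed memory. */
static void tables_exit(struct list_head *head)
{
	struct list_head *pos = head->next, *n;

	while (pos != head) {
		n = pos->next;			/* save successor first */
		list_del(pos);
		free((struct table *)pos);	/* valid: list is the first member */
		pos = n;
	}
}
```

After tables_exit() the head points back at itself, so later code walking
the list sees it as empty instead of following pointers into freed memory.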
bugzilla report : https://bugzilla.kernel.org/show_bug.cgi?id=16120
Introduced by commit d1db275dd3 (ipv6: ip6mr: support multiple
tables) and commit f0ad0860d0 (ipv4: ipmr: support multiple tables)
Reported-by: Alex Zhavnerchik <alex.vizor@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid two atomic ops per raw_send_hdrinc() call
Avoid two atomic ops per raw6_send_hdrinc() call
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With mtu=9000, mld_newpack() uses order-2 GFP_ATOMIC allocations, which
are very unreliable on machines where PAGE_SIZE=4K.
Limit allocated skbs to be at most one page. (order-0 allocations)
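The "order-2" figure can be checked with a small stand-alone helper.
alloc_order() is a hypothetical user-space function, not a kernel API, and
it ignores the per-skb overhead added on top of the MTU.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
static unsigned int alloc_order(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(size > 0 && page_size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With 4K pages, 9000 bytes does not fit in 8192 (order-1), so it needs four
contiguous pages (order-2), while anything up to 4096 bytes stays at order-0.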
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- ipv6 msstab: account for ipv6 header size
- ipv4 msstab: add mss for Jumbograms.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
caller: if (!th->rst && !th->syn && th->ack)
callee: if (!th->ack)
make the caller only check for !syn (common for 3whs), and move
the !rst / ack test to the callee.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using vmalloc_node(size, numa_node_id()) for temporary storage is not
needed. vmalloc(size) is more respectful of user NUMA policy.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
There are more than a dozen occurrences of following code in the
IPv6 stack:
if (opt && opt->srcrt) {
struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
ipv6_addr_copy(&final, &fl.fl6_dst);
ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
final_p = &final;
}
Replace those with a helper. Note that the helper overrides final_p
in all cases. This is ok as final_p was previously initialized to
NULL when declared.
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use read_pnet() and write_pnet() to reduce number of ifdef CONFIG_NET_NS
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As noticed by Julia Lawall, ipip6_tunnel_add_prl() incorrectly calls
kzalloc(..., GFP_KERNEL) while a spinlock is held. She provided
a patch to use GFP_ATOMIC instead.
One possibility would be to convert this spinlock to a mutex, or
preallocate the thing before taking the lock.
After RCU conversion, it appears we don't need this lock, since
the caller already holds RTNL.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct sk_forward_alloc handling for error_queue would need to use a
backlog of frames that softirq handler could not deliver because socket
is owned by user thread. Or extend backlog processing to be able to
process normal and error packets.
Another possibility is to not use mem charge for error queue, this is
what I implemented in this patch.
Note: this reverts commit 29030374
(net: fix sk_forward_alloc corruptions), since we don't need to lock
the socket anymore.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f3c5c1bfd4 (netfilter: xtables: make ip_tables reentrant)
introduced a performance regression, because stackptr array is shared by
all cpus, adding cache line ping pongs. (16 cpus share a 64 bytes cache
line)
Fix this using alloc_percpu()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-By: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As David found out, sock_queue_err_skb() should be called with the socket
lock held, or we risk sk_forward_alloc corruption, since we use non
atomic operations to update this field.
This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
(BH already disabled)
1) skb_tstamp_tx()
2) Before calling ip_icmp_error(), in __udp4_lib_err()
3) Before calling ipv6_icmp_error(), in __udp6_lib_err()
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f4f914b5 (net: ipv6 bind to device issue) caused
a regression with Mobile IPv6 when it changed the meaning
of fl->oif to become a strict requirement of the route
lookup. Instead, only force strict mode when
sk->sk_bound_dev_if is set on the calling socket, getting
the intended behavior and fixing the regression.
Tested-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we disallow GSO packets on the IPv6 forward path.
This patch fixes this.
Note that I discovered that our existing GSO MTU checks (e.g.,
IPv4 forwarding) are buggy in that they skip the check altogether,
when they really should be checking gso_size + header instead.
I have also been lazy here in that I haven't bothered to segment
the GSO packet by hand before generating an ICMP message. Someone
should add that to be 100% correct.
Reported-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new sock lock primitive was introduced to speed up some user context
socket manipulation. But it cannot safely serialize two threads when one uses
regular lock_sock/release_sock and the other uses lock_sock_bh/unlock_sock_bh.
This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.
After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a smatch warning:
net/ipv4/ipmr.c +1917 __ipmr_fill_mroute(12) error: buffer overflow
'(mrt)->vif_table' 32 <= 32
The ipv6 version had the same issue.
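The warning "32 <= 32" means an index equal to the array size could reach
the table. A generic sketch of the corrected bound check follows; MAXVIFS
is 32 as in the warning, but struct vif and vif_in_use() are simplified
stand-ins and not the actual patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAXVIFS 32

struct vif { int in_use; };

/* Valid indexes are 0..MAXVIFS-1, so idx == MAXVIFS must be rejected too. */
static int vif_in_use(const struct vif *table, int idx)
{
	assert(table != NULL);
	if (idx < 0 || idx >= MAXVIFS)	/* ">" here would be the off-by-one */
		return 0;
	return table[idx].in_use;
}
```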
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch ensures that all places that schedule the DAD timer
look at the address state in a safe manner before scheduling the
timer. This ensures that we don't end up with pending timers
after deleting an address.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes use of the new POSTDAD state. This prevents
a race between DAD completion and failure.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes use of the new state_lock to synchronise between
updates to the ifa state. This fixes the issue where a remotely
triggered address deletion (through DAD failure) coincides with a
local administrative address deletion, causing certain actions to
be performed twice incorrectly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces the boolean dead flag on inet6_ifaddr with
a state enum. This allows us to roll back changes when deleting
an address according to whether DAD has completed or not.
This patch only adds the state field and does not change the logic.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes from net/ (but not any netfilter files)
all the unnecessary return; statements that precede the
last closing brace of void functions.
It does not remove the returns that are immediately
preceded by a label as gcc doesn't like that.
Done via:
$ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb rxhash should be cleared when a skb is handled by a tunnel before
being delivered again, so that correct packet steering can take place.
There are other cleanups and accounting that we can factorize in a new
helper, skb_tunnel_rx()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The duplicate address check code got broken in the conversion
to hlist (2.6.35). The earlier patch did not fix the case where
two addresses match same hash value. Use two exit paths,
rather than depending on state of loop variables (from macro).
Based on earlier fix by Shan Wei.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As addrlabels with an interface index are left alone when the
interface gets removed, this results in addrlabels that can no
longer be removed.
Restrict validation of index to adding new addrlabels.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP-MD5 sessions have intermittent failures when the route cache is
invalidated. ip_queue_xmit() has to find a new route and calls
sk_setup_caps(sk, &rt->u.dst), undoing the
sk->sk_route_caps &= ~NETIF_F_GSO_MASK
that TCP-MD5 applies at various points along its path (from
tcp_transmit_skb() for example).
So we send a few bad packets, and everything is fine again once
tcp_transmit_skb() is called again for this socket.
Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.
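A simplified model of the masking idea: the socket remembers which
capability bits must stay off, and every recomputation of the caps applies
that mask. The FEAT_* bits, struct sock_model and both helpers are
hypothetical and only mirror the sk_route_caps/sk_route_nocaps pair
described above.

```c
#include <assert.h>
#include <stdint.h>

#define FEAT_SG  (1u << 0)	/* hypothetical feature bits */
#define FEAT_GSO (1u << 1)

struct sock_model {
	uint32_t route_caps;	/* capabilities currently usable */
	uint32_t route_nocaps;	/* bits that must always be masked out */
};

/* Record the forbidden bits and drop them from the current caps. */
static void model_nocaps_add(struct sock_model *sk, uint32_t flags)
{
	sk->route_nocaps |= flags;
	sk->route_caps &= ~flags;
}

/* Called whenever a new route is attached: the mask survives the reset. */
static void model_setup_caps(struct sock_model *sk, uint32_t dev_features)
{
	sk->route_caps = dev_features & ~sk->route_nocaps;
}
```

Because the mask lives in the socket, the lower layer can recompute the
caps after a route change without knowing why the upper layer disabled
them.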
Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes from net/ netfilter files
all the unnecessary return; statements that precede the
last closing brace of void functions.
It does not remove the returns that are immediately
preceded by a label as gcc doesn't like that.
Done via:
$ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'
Signed-off-by: Joe Perches <joe@perches.com>
[Patrick: changed to keep return statements in otherwise empty function bodies]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Make sure all printk messages have a severity level.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Change netfilter asserts to standard WARN_ON. This has the
benefit of backtrace info and also causes netfilter errors
to show up on kerneloops.org.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Prepare the arrays for use with the multiregister function. The
future layer-3 xt matches can then be easily added to it without
needing more (un)register code.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Since xt_action_param is writable, let's use it. The pointer to
'bool hotdrop' always worried (8 bytes (64-bit) to write 1 byte!).
Surprisingly results in a reduction in size:
text data bss filename
5457066 692730 357892 vmlinux.o-prev
5456554 692730 357892 vmlinux.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
In future, layer-3 matches will be an xt module of their own, and
need to set the fragoff and thoff fields. Adding more pointers would
needlessly increase memory requirements (esp. so for 64-bit, where
pointers are wider).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The structures carried - besides match/target - almost the same data.
It is possible to combine them, as extensions are evaluated serially,
and so, the callers end up a little smaller.
text data bss filename
-15318 740 104 net/ipv4/netfilter/ip_tables.o
+15286 740 104 net/ipv4/netfilter/ip_tables.o
-15333 540 152 net/ipv6/netfilter/ip6_tables.o
+15269 540 152 net/ipv6/netfilter/ip6_tables.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The ip6mr /proc interface (ip6_mr_cache) can't be extended to dump routes
from any tables but the main table in a backwards compatible fashion since
the output format ends in a variable amount of output interfaces.
Introduce a new netlink interface to dump multicast routes from all tables,
similar to the netlink interface for regular routes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch adds support for multiple independent multicast routing instances,
named "tables".
Userspace multicast routing daemons can bind to a specific table instance by
issuing a setsockopt call using a new option MRT6_TABLE. The table number is
stored in the raw socket data and affects all following ip6mr setsockopt(),
getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT)
is created with a default routing rule pointing to it. Newly created pim6reg
devices have the table number appended ("pim6regX"), with the exception of
devices created in the default table, which are named just "pim6reg" for
compatibility reasons.
Packets are directed to a specific table instance using routing rules,
similar to how regular routing rules work. Currently iif, oif and mark
are supported as keys, source and destination addresses could be supported
additionally.
Example usage:
- bind pimd/xorp/... to a specific table:
uint32_t table = 123;
setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table));
- create routing rules directing packets to the new table:
# ip -6 mrule add iif eth0 lookup 123
# ip -6 mrule add oif eth0 lookup 123
Signed-off-by: Patrick McHardy <kaber@trash.net>
Now that cache entries in unres_queue don't need to be distinguished by their
network namespace pointer anymore, we can remove it from struct mfc6_cache
and pass the namespace as function argument to the functions that need it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
The unres_queue is currently shared between all namespaces. Following patches
will additionally allow creating multiple multicast routing tables in each
namespace. Having a single shared queue for all these users seems excessive,
so move the queue and the cleanup timer to the per-namespace data to unshare it.
As a side-effect, this fixes a bug in the seq file iteration functions: the
first entry returned is always from the current namespace, entries returned
after that may belong to any namespace.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Adding addresses and ports to the short packet log message,
like ipv4/udp.c does it, makes these messages a lot more useful:
[ 822.182450] UDPv6: short packet: From [2001:db8:ffb4:3::1]:47839 23715/178 to [2001:db8:ffb4:3:5054:ff:feff:200]:1234
This requires us to drop logging in case pskb_may_pull() fails,
which also is consistent with ipv4/udp.c
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed when I added support for IPV6_DONTFRAG that if you set
IPV6_RECVERR and tried to send a UDP packet larger than 64K to an
IPv6 destination, you'd correctly get an EMSGSIZE, but reading from
MSG_ERRQUEUE returned the incorrect address in the cmsg:
struct msghdr:
msg_name 0x7fff8f3c96d0
msg_namelen 28
struct sockaddr_in6:
sin6_family 10
sin6_port 7639
sin6_flowinfo 0
sin6_addr ::ffff:38.32.0.0
sin6_scope_id 0 ((null))
It should have returned this in my case:
struct msghdr:
msg_name 0x7fffd866b510
msg_namelen 28
struct sockaddr_in6:
sin6_family 10
sin6_port 7639
sin6_flowinfo 0
sin6_addr 2620:0:a09:e000:21f:29ff:fe57:f88b
sin6_scope_id 0 ((null))
The problem is that ipv6_recv_error() assumes that if the error
wasn't generated by ICMPv6, it's an IPv4 address sitting there,
and proceeds to create a v4-mapped address from it.
Change ipv6_icmp_error() and ipv6_local_error() to set skb->protocol
to htons(ETH_P_IPV6) so that ipv6_recv_error() knows the address
sitting right after the extended error is IPv6, else it will
incorrectly map the first octet into an IPv4-mapped IPv6 address
in the cmsg structure returned in a recvmsg() call to obtain
the error.
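The wrong address in the first dump can be reproduced outside the kernel:
taking the first four octets of 2620:0:a09:e000:21f:29ff:fe57:f88b
(0x26 0x20 0x00 0x00) as if they were an IPv4 address and building a
v4-mapped address from them gives ::ffff:38.32.0.0. map_v4() and mismap()
are hypothetical helper names used only for this demonstration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Build the v4-mapped address ::ffff:a.b.c.d from four octets. */
static void map_v4(struct in6_addr *out, const unsigned char v4[4])
{
	memset(out, 0, sizeof(*out));
	out->s6_addr[10] = 0xff;
	out->s6_addr[11] = 0xff;
	memcpy(&out->s6_addr[12], v4, 4);
}

/*
 * Misread the start of an IPv6 address as an IPv4 address, the way the
 * buggy path did, and return the printable result (NULL on parse error).
 */
static const char *mismap(const char *v6_text, char *buf, socklen_t len)
{
	struct in6_addr real, mapped;

	assert(buf != NULL && len >= INET6_ADDRSTRLEN);
	if (inet_pton(AF_INET6, v6_text, &real) != 1)
		return NULL;
	map_v4(&mapped, real.s6_addr);	/* only the first 4 octets are used */
	return inet_ntop(AF_INET6, &mapped, buf, len);
}
```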
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As per RFC 3493 the default multicast hops setting
for a socket should be "1" just like ipv4.
Ironically we have an IPV6_DEFAULT_MCASTHOPS macro;
it just wasn't being used.
Reported-by: Elliot Hughes <enh@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add hlist_for_each_entry_rcu_bh() and
hlist_for_each_entry_continue_rcu_bh() macros, and use them in
ipv6_get_ifaddr(), if6_get_first() and if6_get_next() to fix lockdep
warnings.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We dereference "sk" unconditionally elsewhere in the function.
This was left over from: b30bd282 "ip6_xmit: remove unnecessary NULL
ptr check". According to that commit message, "the sk argument to
ip6_xmit is never NULL nowadays since the skb->priority assigment
expects a valid socket."
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When queueing a skb to a socket, we can immediately release its dst if
the target socket does not use IP_CMSG_PKTINFO.
tcp_data_queue() can drop dst too.
This lets us benefit from a hot cache line and spares the receiver, possibly
on another cpu, from dirtying this cache line itself.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 95766fff ([UDP]: Add memory accounting.),
each received packet needs one extra lock_sock()/release_sock() pair.
This added latency because of possible backlog handling. Then later,
ticket spinlocks added yet another latency source in case of DDOS.
This patch introduces lock_sock_bh() and unlock_sock_bh()
synchronization primitives, avoiding one atomic operation and backlog
processing.
skb_free_datagram_locked() uses them instead of full blown
lock_sock()/release_sock(). skb is orphaned inside locked section for
proper socket memory reclaim, and finally freed outside of it.
The UDP receive path now takes the socket spinlock only once.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts two commits:
fda48a0d7a
tcp: bind() fix when many ports are bound
and a follow-on fix for it:
6443bb1fc2
ipv6: Fix inet6_csk_bind_conflict()
It causes problems with binding listening sockets when time-wait
sockets from a previous instance still are alive.
It's too late to keep fiddling with this so late in the -rc
series, and we'll deal with it in net-next-2.6 instead.
Signed-off-by: David S. Miller <davem@davemloft.net>
The current socket backlog limit is not enough to really stop DDOS attacks,
because the user thread spends a lot of time processing a full backlog each
round, and the user might spin heavily on the socket lock.
We should add backlog size and receive_queue size (aka rmem_alloc) to
pace writers, and let the user run without being slowed down too much.
Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
stress situations.
Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
receiver can now process ~200.000 pps (instead of ~100 pps before the
patch) on an 8 core machine.
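A user-space model of such a check, assuming the helper simply compares the
sum of both queues plus the incoming buffer against the receive buffer
limit. struct rcvq_model and rcvqueues_full() are illustrative names and
the comparison is an assumption, not a copy of the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>

struct rcvq_model {
	unsigned int backlog_len;	/* bytes queued in the backlog */
	unsigned int rmem_alloc;	/* bytes queued in the receive queue */
	unsigned int rcvbuf;		/* configured receive buffer limit */
};

/* True when accepting 'truesize' more bytes would exceed the limit. */
static bool rcvqueues_full(const struct rcvq_model *sk, unsigned int truesize)
{
	unsigned int qsize = sk->backlog_len + sk->rmem_alloc;

	return qsize + truesize > sk->rcvbuf;
}
```

The check only reads counters, so a producer can drop a packet early
without taking the socket lock at all.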
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Decouple rtnetlink address families from real address families in socket.h to
be able to add rtnetlink interfaces to code that is not a real address family
without increasing AF_MAX/NPROTO.
This will be used to add support for multicast route dumping from all tables
as the proc interface can't be extended to support anything but the main table
without breaking compatibility.
This partially undoes the patch to introduce independent families for routing
rules and converts ipmr routing rules to a new rtnetlink family. Similar to
that patch, values up to 127 are reserved for real address families, values
above that may be used arbitrarily.
Signed-off-by: Patrick McHardy <kaber@trash.net>
fib_rules_register() duplicates the template passed to it without modification,
mark the argument as const. Additionally the templates are only needed when
instantiating a new namespace, so mark them as __net_initdata, which means
they can be discarded when CONFIG_NET_NS=n.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit fda48a0d7a (tcp: bind() fix when many ports are bound)
introduced a bug on IPV6 part.
We should not call ipv6_addr_any(inet6_rcv_saddr(sk2)) but
ipv6_addr_any(inet6_rcv_saddr(sk)) because sk2 can be IPV4, while sk is
IPV6.
Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Finally add support to detect a local IPV6_DONTFRAG event
and return the relevant data to the user if they've enabled
IPV6_RECVPATHMTU on the socket. The next recvmsg() will
return no data, but have an IPV6_PATHMTU as ancillary data.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dontfrag argument to relevant functions for
IPV6_DONTFRAG support, as well as allowing the value
to be passed-in via ancillary cmsg data.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add underlying data structure changes and basic setsockopt()
and getsockopt() support for IPV6_RECVPATHMTU, IPV6_PATHMTU,
and IPV6_DONTFRAG. IPV6_PATHMTU is actually fully functional
at this point.
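From userspace the new options are ordinary IPPROTO_IPV6 socket options.
A small sketch, assuming a libc whose <netinet/in.h> already defines
IPV6_DONTFRAG and IPV6_RECVPATHMTU; the wrapper names set_dontfrag() and
set_recvpathmtu() are made up for this example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask the stack not to fragment packets sent on this socket. */
static int set_dontfrag(int fd, int on)
{
	return setsockopt(fd, IPPROTO_IPV6, IPV6_DONTFRAG, &on, sizeof(on));
}

/* Ask for IPV6_PATHMTU ancillary data to be delivered via recvmsg(). */
static int set_recvpathmtu(int fd, int on)
{
	return setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPATHMTU, &on, sizeof(on));
}
```

Both wrappers return 0 on success and -1 with errno set on failure, like
setsockopt() itself.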
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Port autoselection done by kernel only works when number of bound
sockets is under a threshold (typically 30000).
When this threshold is over, we must check if there is a conflict before
exiting first loop in inet_csk_get_port()
Change inet_csk_bind_conflict() to forbid two reuse-enabled sockets to
bind on same (address,port) tuple (with a non ANY address)
Same change for inet6_csk_bind_conflict()
Reported-by: Gaspar Chilingarov <gasparch@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds IPv6 support for RFC5082 Generalized TTL Security Mechanism.
Note to users of mapped addresses: the IPv6 and IPv4 socket options are separate.
The server does have to deal with both IPv4 and IPv6 socket options
and the client has to handle the different options for each family.
On client:
    int ttl = 255;
    getaddrinfo(argv[1], argv[2], &hint, &result);
    for (rp = result; rp != NULL; rp = rp->ai_next) {
        s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
        if (s < 0) continue;
        if (rp->ai_family == AF_INET) {
            setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
        } else if (rp->ai_family == AF_INET6) {
            setsockopt(s, IPPROTO_IPV6, IPV6_UNICAST_HOPS,
                       &ttl, sizeof(ttl));
        }
        if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
            ...
On server:
    int minttl = 255 - maxhops;
    getaddrinfo(NULL, port, &hints, &result);
    for (rp = result; rp != NULL; rp = rp->ai_next) {
        s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
        if (s < 0) continue;
        if (rp->ai_family == AF_INET6)
            setsockopt(s, IPPROTO_IPV6, IPV6_MINHOPCOUNT,
                       &minttl, sizeof(minttl));
        setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
        if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
            break;
        ...
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The issue arises when two NICs are both assigned the same
IPv6 global address.
If a sender binds to a particular NIC (SO_BINDTODEVICE),
the outgoing traffic is sent via the first one found.
The bound device is thus not taken into account during
routing.
From the ip6_route_output function:
If the binding address is multicast, link-local or loopback,
the RT6_LOOKUP_F_IFACE bit is set, but not for a global address.
So binding a global address will ignore the device bound with SO_BINDTODEVICE,
because the fib6_rule_lookup function path won't check the
flowi::oif field and takes the first route that fits.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Scott Otto <scott.otto@alcatel-lucent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to RFC2460, when a node receives a Too Big message
reporting a PMTU less than the IPv6 Minimum Link MTU, the PMTU
is set to the IPv6 Minimum Link MTU (1280) and a fragment
header should always be included.
After receiving an ICMPv6 Too Big message reporting a PMTU
less than the IPv6 Minimum Link MTU, sctp *can't* send any
data/control chunk whose total length, including the IPv6 header
and IPv6 extension headers, is less than IPV6_MIN_MTU (1280 bytes).
The failure occurs in ip6_fragment(); for the reason, see
the following (taking a SHUTDOWN chunk as an example):
sctp_packet_transmit (SHUTDOWN chunk, len=16 byte)
|------sctp_v6_xmit (local_df=0)
|------ip6_xmit
|------ip6_output (dst_allfrag is true)
|------ip6_fragment
In ip6_fragment(), since local_df=0, the packet is dropped
and EMSGSIZE is returned.
The patch fixes this by adding a check on skb->len.
In this case, IPv6 does not fragment the upper protocol data;
it only adds a fragment header before it.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When building a bundle, we set dst.dev and rt6.rt6i_idev.
We must ensure to set the same device for both fields.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6651ffc8e8
("ipv6: Fix tcp_v6_send_response transport header setting.")
fixed one half of why ipv6 tcp response checksums were
invalid, but it's not the whole story.
If we're going to use CHECKSUM_PARTIAL for these things (which we are
since commit 2e8e18ef52 "tcp: Set
CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"), we can't be setting
buff->csum as we always have been here in tcp_v6_send_response. We
need to leave it at zero.
Kill that line and checksums are good again.
Signed-off-by: David S. Miller <davem@davemloft.net>
My recent patch to remove the open-coded checksum sequence in
tcp_v6_send_response broke it as we did not set the transport
header pointer on the new packet.
Actually, there is code there trying to set the transport
header properly, but it sets it for the wrong skb ('skb'
instead of 'buff').
This bug was introduced by commit
a8fdf2b331 ("ipv6: Fix
tcp_v6_send_response(): it didn't set skb transport header")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since Xtables is now reentrant/nestable, the cloned packet can also go
through Xtables and be subject to rules itself.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently, the table traverser stores return addresses in the ruleset
itself (struct ip6t_entry->comefrom). This has a well-known drawback:
the jumpstack is overwritten on reentry, making it necessary for
targets to return absolute verdicts. Also, the ruleset (which might
be heavy memory-wise) needs to be replicated for each CPU that can
possibly invoke ip6t_do_table.
This patch decouples the jumpstack from struct ip6t_entry and instead
puts it into xt_table_info. Not being restricted by 'comefrom'
anymore, we can set up a stack as needed. By default, there is room
allocated for two entries into the traverser.
arp_tables is not touched though, because there is just one/two
modules and further patches seek to collapse the table traverser
anyhow.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
xt_TEE can be used to clone and reroute a packet. This can for
example be used to copy traffic at a router for logging purposes
to another dedicated machine.
References: http://www.gossamer-threads.com/lists/iptables/devel/68781
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As Herbert Xu said: we should be able to simply replace ipfragok
with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has dropped ipfragok and set the local_df value properly.
The patch kills the ipfragok parameter of .queue_xmit().
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has dropped ipfragok and set the local_df value properly.
So the change of commit 77e2f1(ipv6: Fix ip6_xmit to
send fragments if ipfragok is true) is not needed.
So this patch removes it.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Paris got the following trace with a linux-next kernel:
[ 14.203970] BUG: using smp_processor_id() in preemptible [00000000]
code: avahi-daemon/2093
[ 14.204025] caller is netif_rx+0xfa/0x110
[ 14.204035] Call Trace:
[ 14.204064] [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
[ 14.204070] [<ffffffff8142163a>] netif_rx+0xfa/0x110
[ 14.204090] [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
[ 14.204095] [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
[ 14.204099] [<ffffffff8145d610>] ip_local_out+0x20/0x30
[ 14.204105] [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
[ 14.204119] [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
[ 14.204125] [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
[ 14.204137] [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
[ 14.204149] [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
[ 14.204189] [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
[ 14.204233] [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b
While the current linux-2.6 kernel doesn't emit this warning, the bug is
latent and might cause unexpected failures.
ip_dev_loopback_xmit() runs in process context, preemption enabled, so
must call netif_rx_ni() instead of netif_rx(), to make sure that we
process pending software interrupt.
Same change for ip6_dev_loopback_xmit()
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use KERN_NOTICE instead of KERN_EMERG by default. This only affects
kernel internal logging (like conntrack); user-specified logging rules
contain a separate log level.
Signed-off-by: Patrick McHardy <kaber@trash.net>
xfrm_lookup should be called after ip6_route_output skb_dst_set,
otherwise skb_dst_set of xfrm_lookup is pointless
Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Decouple the address family values used for fib_rules from the real
address families in socket.h. This allows fib_rules to be used for
code that is not a real address family without increasing AF_MAX/NPROTO.
Values up to 127 are reserved for real address families and map directly
to the corresponding AF value, values starting from 128 are for other
uses. rtnetlink is changed to invoke the AF_UNSPEC dumpit/doit handlers
for these families.
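As a rough illustration of the split described above (not the rtnetlink
code itself), handler selection can be sketched as follows. REAL_AF_MAX
and handler_family() are hypothetical names chosen for this example;
only the 127/128 boundary and the AF_UNSPEC fallback come from the text.

```c
#include <assert.h>
#include <sys/socket.h>	/* AF_UNSPEC, AF_INET6 */

/* Hypothetical name for the boundary described above. */
#define REAL_AF_MAX 127

/*
 * Pick the family whose handlers serve a request: real address
 * families use their own, everything from 128 upward falls back
 * to AF_UNSPEC.
 */
static int handler_family(unsigned int family)
{
	return family <= REAL_AF_MAX ? (int)family : AF_UNSPEC;
}
```

With this sketch, handler_family(AF_INET6) returns AF_INET6, while a
value such as 128 is served by the AF_UNSPEC handlers.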
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
All fib_rules implementations need to set the family in their ->fill()
functions. Since the value is available to the generic fib_nl_fill_rule()
function, set it there.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to how IPv4's ip_output.c works, have ip6_output also check
the IPSKB_REROUTED flag. It will be set from xt_TEE for cloned packets
since Xtables can currently only deal with a single packet in flight
at a time.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: David S. Miller <davem@davemloft.net>
[Patrick: changed to use an IP6SKB value instead of IPSKB]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Patrick McHardy notes: "We used to invoke IPv4 POST_ROUTING after
fragmentation as well just to defragment the packets in conntrack
immediately afterwards, but that got changed during the
netfilter-ipsec integration. Ideally IPv6 would behave like IPv4."
This patch makes it so. Sending an oversized frame (e.g. `ping6
-s64000 -c1 ::1`) will now show up in POSTROUTING as a single skb
rather than multiple ones.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The notifier for address down should only be called if the address is
completely gone, not just being marked as tentative on link transition. The
code in net-next would cause bonding/sctp/s390 to see the address disappear
on link down, but they would never see it reappear on link up.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since an address in hash list has to already have a ref count,
no additional ref count is needed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When link goes down, want address to be preserved but in a tentative
state, therefore it has to stay in hash list.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes preserve IPv6 address when link goes down (good).
But would cause address to point to dead dst entry (bad).
The simplest fix is to just not delete route if address is
being held for later use.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.
sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)
This patch converts sk_dst_lock to a spinlock, and uses RCU for readers.
__sk_dst_get() is supposed to be called with rcu_read_lock() held or with
the socket locked by the user, so use the appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))
This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet: Remove unused send_check length argument
This patch removes the unused length argument from the send_check
function in struct inet_connection_sock_af_ops.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Handle CHECKSUM_PARTIAL for SYNACK packets for IPv6
This patch moves the common code between tcp_v6_send_check and
tcp_v6_gso_send_check into a new function __tcp_v6_send_check.
It then uses the new function in tcp_v6_send_synack as well as
tcp_v6_send_response so that they handle CHECKSUM_PARTIAL properly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commits 5051ebd275 and
5051ebd275 ("ipv[46]: udp: optimize unicast RX
path") broke some programs.
After upgrading an L2TP server to 2.6.33 it started to fail, tunnels going up and
down, after the 10th tunnel came up. My modified rp-l2tp uses a global
unconnected socket bound to (INADDR_ANY, 1701) and one connected socket per
tunnel after parameter negotiation.
After ten sockets were open and due to mixed parameters to
udp[46]_lib_lookup2() kernel started to drop packets.
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we will set ip_summed to CHECKSUM_NONE when necessary in
ipq_mangle_ipv6, there is no need to zap CHECKSUM_COMPLETE in
ipq_build_packet_message.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
__xfrm_lookup() is called for each packet transmitted out of
system. The xfrm_find_bundle() does a linear search which can
kill system performance depending on how many bundles are
required per policy.
This modifies __xfrm_lookup() to store bundles directly in
the flow cache. If we did not get a hit, we just create a new
bundle instead of doing slow search. This means that we can now
get multiple xfrm_dst's for same flow (on per-cpu basis).
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ip_append() fails because of socket limit or memory shortage,
increment ICMP_MIB_OUTERRORS counter, so that "netstat -s" can report
these errors.
LANG=C netstat -s | grep "ICMP messages failed"
0 ICMP messages failed
For IPV6, implement ICMP6_MIB_OUTERRORS counter as well.
# grep Icmp6OutErrors /proc/net/dev_snmp6/*
/proc/net/dev_snmp6/eth0:Icmp6OutErrors 0
/proc/net/dev_snmp6/lo:Icmp6OutErrors 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converts the list and the core manipulating with it to be the same as uc_list.
+uses two functions for adding/removing mc address (normal and "global"
variant) instead of a function parameter.
+removes dev_mcast.c completely.
+exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
manipulation with lists on a sandbox (used in bonding and 80211 drivers)
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
addr_bit_test() is used in various places in IPv6 routing table
subsystem. It checks if the given fn_bit is set,
where fn_bit counts bits from MSB in words in network-order.
  fn_bit        :  0 .... 31  32 .... 63  64 .... 95  96 ....127
fn_bit >> 5 gives the offset of the word, and (~fn_bit & 0x1f) gives
the count from the LSB in the network-endian word in question.
  fn_bit >> 5   :      0           1           2           3
  ~fn_bit & 0x1f: 31 ....  0  31 ....  0  31 ....  0  31 ....  0
Thus, the mask was generated as htonl(1 << (~fn_bit & 0x1f)).
This can be optimized by "swizzle" (see include/asm-generic/bitops/le.h).
In little-endian,
htonl(1 << bit) = 1 << (bit ^ BITOP_BE32_SWIZZLE)
where
BITOP_BE32_SWIZZLE is (0x1f & ~7)
So,
htonl(1 << (~fn_bit & 0x1f)) = 1 << ((~fn_bit & 0x1f) ^ (0x1f & ~7))
= 1 << ((~fn_bit ^ ~7) & 0x1f)
= 1 << ((~fn_bit ^ BITOP_BE32_SWIZZLE) & 0x1f)
In big-endian, BITOP_BE32_SWIZZLE is equal to 0.
1 << ((~fn_bit ^ BITOP_BE32_SWIZZLE) & 0x1f)
= 1 << ((~fn_bit) & 0x1f)
= htonl(1 << (~fn_bit & 0x1f))
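The identity above can be checked from user space with a small sketch.
This is not the kernel implementation: be32_swizzle() is a hypothetical
stand-in that detects endianness at run time instead of using the
BITOP_BE32_SWIZZLE macro, and the other helper names are made up for
this example.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htonl() */

/* Stand-in for BITOP_BE32_SWIZZLE: 0x1f & ~7 on little-endian, else 0. */
static unsigned int be32_swizzle(void)
{
	const uint32_t probe = 1;

	/* The first byte in memory is 1 only on a little-endian host. */
	return (*(const unsigned char *)&probe == 1) ? (0x1f & ~7) : 0;
}

/* Original form: build the host-order mask, then byte-swap it. */
static uint32_t mask_htonl(unsigned int fn_bit)
{
	return htonl((uint32_t)1 << (~fn_bit & 0x1f));
}

/* Optimized form: fold the byte swap into the shift count. */
static uint32_t mask_swizzle(unsigned int fn_bit)
{
	return (uint32_t)1 << ((~fn_bit ^ be32_swizzle()) & 0x1f);
}

/* Returns 1 if both forms agree for every bit of a 128-bit address. */
static int masks_agree(void)
{
	unsigned int fn_bit;

	for (fn_bit = 0; fn_bit < 128; fn_bit++)
		if (mask_htonl(fn_bit) != mask_swizzle(fn_bit))
			return 0;
	return 1;
}
```

The shift count is a plain XOR and AND on fn_bit, so the optimized form
avoids the byte swap at run time.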
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
This is ipv6 variant of the commit 5e016cbf6.. ("ipv4: Don't drop
redirected route cache entry unless PTMU actually expired")
by Guenter Roeck <guenter.roeck@ericsson.com>.
Remove cache route entry in ipv6_negative_advice() only if
the timer is expired.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When cache is unresolved, c->mf[6]c_parent is set to 65535 and
minvif, maxvif are not initialized, hence we must avoid
parsing IIF and OIF.
A second problem can happen when the user dumps a cache entry
where a VIF, that was referenced at creation time, has been
removed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a dump is interrupted at the last device in a hash chain and
then continued, "idx" won't get incremented past s_idx, so s_ip_idx
is not reset when moving on to the next device. This means that for all
following devices, only the last n - s_ip_idx addresses are dumped.
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Use list_add_tail() to get the behavior we had before
the list_head conversion for ipv6 address lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_tgchk_param *par) { ... }
// </smpl>
Minus the change it does to xt_ct_find_proto.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
This semantic patch may not be too precise (checking for functions
that use xt_mtchk_param rather than functions referenced by
xt_match.checkentry), but reviewed, it produced the intended result.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_mtchk_param *par) { ... }
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Supplement to 1159683ef4.
Downgrade the log level to INFO for most checkentry messages as they
are, IMO, just an extra information to the -EINVAL code that is
returned as part of a parameter "constraint violation". Leave errors
to real errors, such as being unable to create a LED trigger.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The order of the IPv6 raw table is currently reversed, which makes it
impossible to use the NOTRACK target in IPv6: for example if someone enters
ip6tables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
and if we receive fragmented packets then the first fragment will be
untracked and thus skip nf_ct_frag6_gather (and conntrack), while all
subsequent fragments enter nf_ct_frag6_gather and reassembly will never
successfully be finished.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
hlist_for_each_entry(p...) will not necessarily initialize 'p'
to anything if the hlist is empty. GCC notices this and emits
a warning.
Just return true explicitly when we hit a match, and return
false if we fall out of the loop without one.
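The pattern can be sketched with a plain singly linked list. This is an
illustration only: struct node and list_contains() are made-up names,
and the kernel code uses hlist_for_each_entry() rather than an open-coded
loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node, standing in for an hlist entry. */
struct node {
	int key;
	struct node *next;
};

/*
 * Decide inside the loop instead of inspecting the cursor afterwards,
 * so the result never depends on what the cursor holds when the list
 * is empty.
 */
static bool list_contains(const struct node *head, int key)
{
	const struct node *p;

	for (p = head; p != NULL; p = p->next)
		if (p->key == key)
			return true;	/* hit a match */
	return false;			/* empty list or no match */
}
```

No state is read after the loop ends, so an empty list simply yields
false.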
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reduces timer events while keeping accuracy by rounding
our timer and/or batching several address validations in addrconf_verify().
addrconf_verify() is called at the earliest of the interface addresses'
timeouts, but at the latest after ADDR_CHECK_FREQUENCY (120 secs).
In most cases, all timeouts of interface addresses are long enough
(e.g. several hours or days vs 2 minutes), so this timer is usually called
every ADDR_CHECK_FREQUENCY, and it is okay to be lazy.
(Note this timer could be eliminated if all code paths which modify
variables related to timeouts called us manually, but that is another story.)
However, in other rare but important cases, we try to keep accuracy.
When the real interface address timeout is coming, and the timeout
is just before the rounded timeout, we accept some error.
When a timeout has been reached, we also try to batch several other
events due in the very near future.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable regen_advance is only used in the privacy case.
Move it to simplify code and eliminate ifdef's
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some minor stuff, reformat comments and add whitespace for clarity
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to list macros for the list of addresses per interface
in IPv6.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing hash function has a couple of issues:
* it is hardwired to 16 for IN6_ADDR_HSIZE
* limited to 256 and callers using int
* use jhash2 rather than some old BSD algorithm
No need for random seed since this is local only (based on assigned
addresses) table.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert from reader/writer lock to RCU and spinlock for addrconf
hash list.
Adds an additional helper macro for hlist_for_each_entry_continue_rcu
to handle the continue case.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using hash list macros, simplifies code and helps later RCU.
This patch includes some initialization that is not strictly necessary,
since an empty hlist node/list is all zero; and list is in BSS
and node is allocated with kzalloc.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use list macros instead of open coded linked list.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mfc_parent of cache entries is used to index into the vif_table and is
initialised from mfcctl->mfcc_parent. This can take values of up to 2^16-1,
while the vif_table has only MAXVIFS (32) entries. The same problem
affects ip6mr.
Refuse invalid values to fix a potential out-of-bounds access. Unlike
the other validity checks, this is checked in ipmr_mfc_add() instead of
the setsockopt handler since it's unused in the delete path and might be
uninitialized.
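A minimal sketch of such a bounds check follows. check_mfc_parent() is a
hypothetical helper, not the actual ipmr_mfc_add() code, and the error
code is an assumption made for the example; only MAXVIFS (32) and the
2^16-1 upper value come from the description above.

```c
#include <assert.h>
#include <errno.h>

#define MAXVIFS 32	/* number of vif_table entries, as stated above */

/*
 * Hypothetical helper: refuse a parent index that would fall outside
 * the vif_table. Returning -ENFILE is an assumption of this sketch.
 */
static int check_mfc_parent(unsigned int parent)
{
	if (parent >= MAXVIFS)
		return -ENFILE;
	return 0;
}
```

Doing the check in the add path keeps it away from the delete path,
where the field is not used.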
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the only path leading to ip6_dst_check makes an indirect call
through dst->ops, dst cannot be NULL in ip6_dst_check.
This patch removes this check in case it misleads people who
come across this code.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unused headers in net/ipv6/netfilter/ip6t_LOG.c
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Since generally there could be more netdevices changing type other
than bonding, make this event type name "bonding-unrelated"
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This member is taking up a "long" per match, yet is only used by one
module out of the roughly 90 modules, ip6t_hbh. ip6t_hbh can be
restructured a little to accommodate the lack of the .data member.
This variant uses checking the par->match address, which should avoid
having to add two extra functions, including calls, i.e.
(hbh_mt6: call hbhdst_mt6(skb, par, NEXTHDR_OPT),
dst_mt6: call hbhdst_mt6(skb, par, NEXTHDR_DEST))
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
If we are managing IPv6 addresses using DHCP, it would be nice
for user-space to be notified if an address configured through
DHCP fails DAD. Otherwise user-space would have to poll to see
whether DAD succeeds.
This patch uses the existing notification mechanism and simply
hooks it into the DAD failure code path.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 6b03a53a (tcp: use limited socket backlog) added the possibility
of dropping frames when backlog queue is full.
Commit d218d111 (tcp: Generalized TTL Security Mechanism) added the
possibility of dropping frames when TTL is under a given limit.
This patch adds new SNMP MIB entries, named TCPBacklogDrop and
TCPMinTTLDrop, published in /proc/net/netstat in TcpExt: line
netstat -s | egrep "TCPBacklogDrop|TCPMinTTLDrop"
TCPBacklogDrop: 0
TCPMinTTLDrop: 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPV6_PREFER_SRC_xxx definitions:
| #define IPV6_PREFER_SRC_TMP 0x0001
| #define IPV6_PREFER_SRC_PUBLIC 0x0002
| #define IPV6_PREFER_SRC_COA 0x0004
RT6_LOOKUP_F_xxx definitions:
| #define RT6_LOOKUP_F_SRCPREF_TMP 0x00000008
| #define RT6_LOOKUP_F_SRCPREF_PUBLIC 0x00000010
| #define RT6_LOOKUP_F_SRCPREF_COA 0x00000020
So, we can translate between these two groups by shift operation
instead of multiple 'if's.
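As an illustration, the translation can be sketched as below. The
constants are copied from the definitions above; srcprefs_to_rt6_flags()
is a hypothetical helper name used only for this example.

```c
#include <assert.h>

#define IPV6_PREFER_SRC_TMP		0x0001
#define IPV6_PREFER_SRC_PUBLIC		0x0002
#define IPV6_PREFER_SRC_COA		0x0004

#define RT6_LOOKUP_F_SRCPREF_TMP	0x00000008
#define RT6_LOOKUP_F_SRCPREF_PUBLIC	0x00000010
#define RT6_LOOKUP_F_SRCPREF_COA	0x00000020

/* Hypothetical helper: map IPV6_PREFER_SRC_* bits to RT6_LOOKUP_F_*. */
static unsigned int srcprefs_to_rt6_flags(unsigned int srcprefs)
{
	/* Keep only the three preference bits, then move them up by three. */
	return (srcprefs & 0x7) << 3;
}
```

Each IPV6_PREFER_SRC_xxx bit sits exactly three positions below its
RT6_LOOKUP_F_SRCPREF_xxx counterpart, so one shift handles any
combination of the three bits.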
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make udp adapt to the limited socket backlog change.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make tcp adapt to the limited socket backlog change.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This solves a potential race problem during the cleanup process.
The issue is that addrconf_ifdown() needs to traverse address list,
but then drop lock to call the notifier. The version in -next
could get confused if add/delete happened during this window.
Original code (2.6.32 and earlier) was okay because all addresses
were always deleted.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My recent change in net-next to retain permanent addresses caused a regression.
Device refcount would not go to zero when device was unregistered because
left over anycast reference would hold ipv6 dev reference which would hold
device references...
The correct procedure is to call notify chain when address is no longer
available for use. When interface comes back DAD timer will notify
back that address is available.
Also, link local addresses should be purged when interface is brought
down. The address might be changed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Router Solicitation timer races with device state changes
because it doesn't lock the device. Use local variable to avoid
one repeated dereference.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timer code runs in bottom half, so there is no need for
using _bh form of locking. Also check if device is not ready
to avoid race with address that is no longer active.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I merged the bundle creation code, I introduced a bogus
flowi value in the bundle. Instead of getting from the caller,
it was instead set to the flow in the route object, which is
totally different.
The end result is that the bundles we created never match, and
we instead end up with an ever growing bundle list.
Thanks to Jamal for finding this problem.
Reported-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clients will set their MTU to 1280 if they receive an
ICMPV6_PKT_TOOBIG message with an MTU less than 1280.
To allow encapsulating of packets over a 1280 link
we should always accept packets with a size of 1280
for forwarding even if the path has a lower MTU and
fragment the encapsulated packets afterwards.
In case a forwarded packet is not going to be encapsulated,
an ICMPV6_PKT_TOOBIG message will still be sent by ip6_fragment()
with the correct MTU.
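The forwarding decision described above can be sketched as follows. This
is a simplified model, not the ip6_forward() code: forward_mtu() and
too_big_for_forward() are hypothetical names, and packets are reduced to
a length.

```c
#include <assert.h>

#define IPV6_MIN_MTU 1280

/* Never use a forwarding limit below the IPv6 minimum link MTU. */
static unsigned int forward_mtu(unsigned int path_mtu)
{
	return path_mtu < IPV6_MIN_MTU ? IPV6_MIN_MTU : path_mtu;
}

/* Returns 1 if the packet must be rejected at forwarding time. */
static int too_big_for_forward(unsigned int pkt_len, unsigned int path_mtu)
{
	return pkt_len > forward_mtu(path_mtu);
}
```

With a path MTU of 1200, a 1280-byte packet is still accepted for
forwarding and left to be fragmented after encapsulation, while a
1281-byte packet is rejected.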
Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4291 section 2.4 states that all uncategorized addresses
should be considered as Global Unicast.
This will remove IPV6_ADDR_RESERVED completely
and return IPV6_ADDR_UNICAST in ipv6_addr_type() instead.
Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just pass in the entire repl struct. In case of a new table (e.g.
ip6t_register_table), the repldata has been previously filled with
table->name and table->size already (in ip6t_alloc_initial_table).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The macro is replaced by a list.h-like foreach loop. This makes
the code more inspectable.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The macro is replaced by a list.h-like foreach loop. This makes
the code much more inspectable.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
pass mark to all SA lookups to prepare them for when we add code
to have them search.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuck. It turns out that when we restart sysctls we were restarting
with the values already changed. Which unfortunately meant that
the second time through we thought there was no change and skipped
all kinds of work, despite the fact that there was indeed a change.
I have fixed this the simplest way possible by restoring the changed
values when we restart the sysctl write.
One of my coworkers spotted this bug when after disabling forwarding
on an interface pings were still forwarded.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ICMPV6_PKT_TOOBIG message is received with a MTU below 1280,
all further packets include a fragment header.
Unlike regular defragmentation, conntrack also needs to "reassemble"
those fragments in order to obtain a packet without the fragment
header for connection tracking. Currently nf_conntrack_reasm checks
whether a fragment has either IP6_MF set or an offset != 0, which
makes it ignore those fragments.
Remove the invalid check and make reassembly handle fragment queues
containing only a single fragment.
Reported-and-tested-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
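To illustrate the check removed by the commit above, here is a small sketch. It assumes the fragment header's offset/flags field is already in host byte order; the macro and function names are hypothetical. A packet carrying a fragment header with offset 0 and the more-fragments bit clear fails the old test, yet it is a complete datagram by itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset/flags field, host byte order: upper 13 bits are the offset
 * in 8-byte units, the lowest bit is the more-fragments flag. */
#define FRAG_MF     0x0001u
#define FRAG_OFFSET 0xfff8u

/* Old behaviour: only treat the packet as a fragment if MF is set
 * or the offset is non-zero. */
static bool old_check_accepts(uint16_t frag_off)
{
    return (frag_off & (FRAG_OFFSET | FRAG_MF)) != 0;
}

/* A queue holding just this fragment is complete when the fragment is
 * both the first one (offset 0) and the last one (MF clear). */
static bool complete_on_its_own(uint16_t frag_off)
{
    return (frag_off & FRAG_OFFSET) == 0 && (frag_off & FRAG_MF) == 0;
}
```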
Dunno, what was the idea, it wasn't used for a long time.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP6 MIB statistics have been per-netns for quite some time.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only used for writing, so convert to spinlock
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
call_rcu() will unconditionally reinitialize RCU head anyway.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __percpu sparse annotations to net.
These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.
The macro and type tricks around snmp stats make things a bit
interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
snmp_mib_*() users which used to cast the argument to (void **) are
updated to cast it to (void __percpu **).
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop computing the number of neighbour table settings we have by
counting the number of binary sysctls. This behaviour was silly
and meant that we could not add another neighbour table setting
without also adding another binary sysctl.
Don't pass the binary sysctl path for neighbour table entries
into neigh_sysctl_register. These parameters are no longer
used and so are just dead code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same stuff as in ip_gre patch: receive hook can be called before netns
setup is done, oopsing in net_generic().
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ipcomp_tunnel_attach fails we will call ipcomp_destroy twice.
This may lead to double-frees on certain structures.
As there is no reason to explicitly call ipcomp_destroy, this patch
removes it from ipcomp*.c and lets the standard xfrm_state destruction
take place.
This is based on the discovery and patch by Alexey Dobriyan.
Tested-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When no more memory can be allocated, fq_find() returns NULL and
increments IPSTATS_MIB_REASMFAILS. In this case, ipv6_frag_rcv()
also increments IPSTATS_MIB_REASMFAILS, so the failure is counted twice.
This patch deletes the redundant IPSTATS_MIB_REASMFAILS increment in
fq_find() and removes the unused idev parameter.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally, each connection needs a unique identity. Conntrack zones allow
a numerical zone to be specified using the CT target; connections in
different zones can use the same identity.
Example:
iptables -t raw -A PREROUTING -i veth0 -j CT --zone 1
iptables -t raw -A OUTPUT -o veth1 -j CT --zone 1
Signed-off-by: Patrick McHardy <kaber@trash.net>
The error handlers might need the template to get the conntrack zone
introduced in the next patches to perform a conntrack lookup.
Signed-off-by: Patrick McHardy <kaber@trash.net>
The variable 'copied' is used in udp_recvmsg() to emphasize that the passed
'len' is adjusted to fit the actual datagram length. But the same can be
done by adjusting 'len' directly. This patch thus removes the indirection.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to disable bottom halves; that is already done by the
previous lock. Move some blank lines to group locking in the same
context.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Permanent IPV6 addresses should not be removed when the link is
set to admin down, only when device is removed.
When link is lost permanent addresses should be marked as tentative
so that when link comes back they are subject to duplicate address
detection (if DAD was enabled for that address).
Other routing systems keep manually configured IPv6 addresses
when link is set down.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the fib size exceeds what can be dumped in a single skb, the
dump is suspended and resumed once the last skb has been received
by userspace. When the fib is changed while the dump is suspended,
the walker might contain stale pointers, causing a crash when the
dump is resumed.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffffa01bce04>] fib6_walk_continue+0xbb/0x124 [ipv6]
PGD 5347a067 PUD 65c7067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
...
RIP: 0010:[<ffffffffa01bce04>]
[<ffffffffa01bce04>] fib6_walk_continue+0xbb/0x124 [ipv6]
...
Call Trace:
[<ffffffff8104aca3>] ? mutex_spin_on_owner+0x59/0x71
[<ffffffffa01bd105>] inet6_dump_fib+0x11b/0x1b9 [ipv6]
[<ffffffff81371af4>] netlink_dump+0x5b/0x19e
[<ffffffff8134f288>] ? consume_skb+0x28/0x2a
[<ffffffff81373b69>] netlink_recvmsg+0x1ab/0x2c6
[<ffffffff81372781>] ? netlink_unicast+0xfa/0x151
[<ffffffff813483e0>] __sock_recvmsg+0x6d/0x79
[<ffffffff81348a53>] sock_recvmsg+0xca/0xe3
[<ffffffff81066d4b>] ? autoremove_wake_function+0x0/0x38
[<ffffffff811ed1f8>] ? radix_tree_lookup_slot+0xe/0x10
[<ffffffff810b3ed7>] ? find_get_page+0x90/0xa5
[<ffffffff810b5dc5>] ? filemap_fault+0x201/0x34f
[<ffffffff810ef152>] ? fget_light+0x2f/0xac
[<ffffffff813519e7>] ? verify_iovec+0x4f/0x94
[<ffffffff81349a65>] sys_recvmsg+0x14d/0x223
Store the serial number when beginning to walk the fib and reload
pointers when continuing to walk after a change occurred. Similar
to other dumping functions, this might cause unrelated entries to
be missed when entries are deleted.
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
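The serial-number idea from the commit above can be sketched with a much simpler data structure. This is an analogy under stated assumptions, not the fib6 walker: a sorted array stands in for the tree, and all type and function names are hypothetical. The walker remembers the table's serial number; if it differs on resume, the cached position is discarded and recomputed from the last key that was dumped.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16

/* Hypothetical table: keys sorted ascending, sernum bumped on change. */
struct table {
    int keys[MAX_ENTRIES];
    size_t n;
    unsigned sernum;
};

struct walker {
    unsigned sernum;   /* table serial number seen on the last pass */
    size_t idx;        /* cached position, only valid if sernum matches */
    int last_key;      /* last key handed to the caller */
    int started;
};

static void table_delete(struct table *t, size_t pos)
{
    assert(pos < t->n);
    for (size_t i = pos; i + 1 < t->n; i++)
        t->keys[i] = t->keys[i + 1];
    t->n--;
    t->sernum++;       /* tells suspended walkers their state is stale */
}

/* Emit up to 'budget' keys into 'out', then suspend. */
static size_t walk_continue(const struct table *t, struct walker *w,
                            int *out, size_t budget)
{
    size_t emitted = 0;

    if (!w->started) {
        w->idx = 0;
        w->started = 1;
    } else if (w->sernum != t->sernum) {
        /* Table changed while suspended: reload the position. */
        w->idx = 0;
        while (w->idx < t->n && t->keys[w->idx] <= w->last_key)
            w->idx++;
    }
    w->sernum = t->sernum;

    while (emitted < budget && w->idx < t->n) {
        w->last_key = t->keys[w->idx];
        out[emitted++] = t->keys[w->idx++];
    }
    return emitted;
}
```

With keys 10, 20, 30, 40, dumping two entries, deleting the first entry, and resuming yields 30 and 40; a stale cached index would have skipped 30.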
In POST_ROUTING hook, calling dev_net(in) is going to oops.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The static initial tables are pretty large, and after the net
namespace has been instantiated, they just hang around for nothing.
This commit removes them and creates tables on-demand at runtime when
needed.
Size shrinks by 7735 bytes (x86_64).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The respective xt_table structures already have most of the metadata
needed for hook setup. Add a 'priority' field to struct xt_table so
that xt_hook_link() can be called with a reduced number of arguments.
So should we be having more tables in the future, it comes at no
static cost (only runtime, as before) - space saved:
6807373->6806555.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The calls to ip6t_do_table only show minimal differences, so it seems
like a good cleanup to merge them to a single one too.
Space saving obtained by both patches: 6807725->6807373
("Total" column from `size -A`.)
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
This patch combines all the per-hook functions in a given table into
a single function. Together with the 2nd patch, further
simplifications are possible up to the point of output code reduction.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
As per C99 6.2.4(2) when temporary table data goes out of scope,
the behaviour is undefined:
    if (compat) {
        struct foo tmp;
        ...
        private = &tmp;
    }
    [dereference private]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
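One way to avoid the out-of-scope dereference described in the commit above is to hoist the temporary into the enclosing scope. The sketch below assumes that approach; struct foo's field, read_size() and the halving "conversion" are hypothetical and only serve to make the example compile and run.

```c
#include <assert.h>

struct foo {
    int size;
};

static int read_size(const struct foo *real, int compat)
{
    const struct foo *private = real;
    struct foo tmp;    /* hoisted: lifetime now covers the dereference */

    if (compat) {
        tmp.size = real->size / 2;   /* stand-in for a compat conversion */
        private = &tmp;
    }

    /* Safe in both branches: tmp is still alive here. */
    return private->size;
}
```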
Support initializing selected parameters of new conntrack entries from a
"conntrack template", which is a specially marked conntrack entry attached
to the skb.
Currently the helper and the event delivery masks can be initialized this
way.
Signed-off-by: Patrick McHardy <kaber@trash.net>
I'm not sure about rcu stuff near kmem cache destruction:
* checks for non-empty hashes look bogus, they're done _before_
rcu_barrier()
* unregistering netns ops is done before kmem_cache destroy
(as it should), and unregistering involves rcu barriers by itself
So it looks like nothing should be done.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 0b5ccb2 (title: ipv6: reassembly: use seperate reassembly queues for
conntrack and local delivery) broke the saddr and daddr members of
nf_ct_frag6_queue when creating a new queue. As a result, the hash value
generated by nf_hashfn() did not match the one generated by fq_find(),
so a newly received fragment could not be inserted into the right queue.
This patch fixes the bug by adding a user member to the nf_ct_frag6_queue structure.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
GC is non-existent in netns, so after you hit GC threshold, no new
dst entries will be created until someone triggers cleanup in init_net.
Make xfrm4_dst_ops and xfrm6_dst_ops per-netns.
This is not done in a generic way, because it would waste
(AF_MAX - 2) * sizeof(struct dst_ops) bytes per-netns.
Reorder GC threshold initialization so it'd be done before registering
XFRM policies.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use macro to define high/low thresh value, refer to IPV6_FRAG_TIMEOUT.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The following three macro definitions are never used, so delete them.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add ->net to match destructor list like ->net in constructor list.
Make sure it's set in ebtables/iptables/ip6tables; this requires
propagating netns up to *_unregister_table().
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Some complex match modules (like xt_hashlimit/xt_recent) want netns
information at constructor and destructor time. We probably can play
games at match destruction time, because netns can be passed in object,
but I think it's cleaner to explicitly pass netns.
Add ->net, make sure it's set from ebtables/iptables/ip6tables code.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
__net_init/__net_exit are apparently not going away, so use them
to full extent.
In some cases __net_init was removed, because it was called from
__net_exit code.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we don't increment SYN-ACK timeouts & retransmissions
although we do increment the same stats for SYN. We seem to have lost
the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
(commit 2248761e in the netdev-vger-cvs tree).
This patch fixes this issue. In the process we also rename the v4/v6
syn/ack retransmit functions for clarity. We also add a new
request_socket operation (syn_ack_timeout) so we can keep the code in
inet_connection_sock.c protocol agnostic.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes CERT-FI FICORA #341748
Discovered by Olli Jarva and Tuomo Untinen from the CROSS
project at Codenomicon Ltd.
Just like in CVE-2007-4567, we can't rely upon skb_dst() being
non-NULL at this point. We fixed that in commit
e76b2b2567 ("[IPV6]: Do no rely on
skb->dst before it is assigned.")
However commit 483a47d2fe ("ipv6: added
net argument to IP6_INC_STATS_BH") put a new version of the same bug
into this function.
Complicating analysis further, this bug can only trigger when network
namespaces are enabled in the build. When namespaces are turned off,
the dev_net() does not evaluate its argument, so the dereference
would not occur.
So, for a long time, namespaces couldn't be turned on unless SYSFS was
disabled. Therefore, this code has largely been disabled except by
people turning it on explicitly for namespace development.
With help from Eugene Teo <eugene@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stanse found a potential null dereference in snmp6_unregister_dev.
There is a check for idev being NULL, but it is dereferenced
earlier. But idev cannot be NULL when passed to
snmp6_unregister_dev, so remove the test.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have L3 tunnels with different inner/outer families
(i.e. IPV4/IPV6) which use a multicast address as the outer tunnel
destination address, multicast packets will be looped back to the
sending socket even if IP*_MULTICAST_LOOP is set to disabled.
The mc_loop flag is present in the family specific part of the socket
(e.g. the IPv4 or IPv6 specific part). setsockopt sets the inner
family mc_loop flag. When the packet is pushed through the L3 tunnel
it will eventually be processed by the outer family which, if different,
will check the flag in a different part of the socket than where it was set.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
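The mismatch described in the commit above can be sketched as follows. This is a simplified model, not the kernel's socket layout: the struct, the enum and the helper are hypothetical. The idea is that the output path asks a helper which looks at the socket's own family, instead of reading the flag that belongs to the outer family.

```c
#include <assert.h>
#include <stdbool.h>

enum family { FAM_INET, FAM_INET6 };

/* Hypothetical socket: one mc_loop flag per family specific part. */
struct sock {
    enum family family;    /* family the socket was created with */
    bool v4_mc_loop;
    bool v6_mc_loop;
};

/* Family-aware lookup: use the flag that setsockopt actually wrote. */
static bool sock_mc_loop(const struct sock *sk)
{
    switch (sk->family) {
    case FAM_INET:
        return sk->v4_mc_loop;
    case FAM_INET6:
        return sk->v6_mc_loop;
    }
    return true;           /* unknown family: keep looping back */
}
```

For an IPv6 socket that disabled looping and is tunnelled over IPv4, reading v4_mc_loop directly would still see the untouched default, while the helper returns the value the application set.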
Add rtnetlink init_rcvwnd to set the TCP initial receive window size
advertised by passive and active TCP connections.
The current Linux TCP implementation limits the advertised TCP initial
receive window to the one prescribed by slow start. For short lived
TCP connections used for transaction type of traffic (e.g. http
requests), bounding the advertised TCP initial receive window results
in increased latency to complete the transaction.
Setting the initial congestion window is already supported
using rtnetlink init_cwnd, but the feature is useless without the
ability to set a larger TCP initial receive window.
The rtnetlink init_rcvwnd allows increasing the TCP initial receive
window, allowing TCP connections to advertise a larger TCP receive window
than the one bounded by slow start.
Signed-off-by: Laurent Chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl table was copied, all right, but ->data for net.ipv6.route.gc_min_interval_ms
was not reinitialized for "!= &init_net" case.
In init_net everything works by accident due to correct ->data initialization
in source table.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It creates a regression, triggering badness for SYN_RECV
sockets, for example:
[19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
[19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
[19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32)
[19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR> CR: 24002442 XER: 00000000
[19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000
This is likely caused by the change in the 'estab' parameter
passed to tcp_parse_options() when invoked by the functions
in net/ipv4/tcp_minisocks.c
But even if that is fixed, the ->conn_request() changes made in
this patch series are fundamentally wrong. They try to use the
listening socket's 'dst' to probe the route settings. The
listening socket doesn't even have a route, and you can't
get the right route (the child request one) until much later
after we setup all of the state, and it must be done by hand.
This stuff really isn't ready, so the best thing to do is a
full revert. This reverts the following commits:
f55017a93f022c3f7d821aba721ebacda42ebd67345cda2fd6dc343475ed05eaade2786a2a2d6bf8
Signed-off-by: David S. Miller <davem@davemloft.net>
When fragments from bridge netfilter are passed to IPv4 or IPv6 conntrack
and a reassembly queue with the same fragment key already exists from
reassembling a similar packet received on a different device (for instance with
multicasted fragments), the reassembled packet might continue on a different
codepath than where the head fragment originated. This can cause crashes
in bridge netfilter when a fragment received on a non-bridge device (and
thus with skb->nf_bridge == NULL) continues through the bridge netfilter
code.
Add a new reassembly identifier for packets originating from bridge
netfilter and use it to put those packets in isolated queues.
Fixes http://bugzilla.kernel.org/show_bug.cgi?id=14805
Reported-and-Tested-by: Chong Qiao <qiaochong@loongson.cn>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently the same reassembly queue might be used for packets reassembled
by conntrack in different positions in the stack (PREROUTING/LOCAL_OUT),
as well as local delivery. This can cause "packet jumps" when the fragment
completing a reassembled packet is queued from a different position in the
stack than the previous ones.
Add a "user" identifier to the reassembly queue key to separate the queues
of each caller, similar to what we do for IPv4.
Signed-off-by: Patrick McHardy <kaber@trash.net>
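A minimal sketch of a reassembly queue key that carries a "user" identifier, as described in the commit above. The struct layout, the enum values and the function name are hypothetical and much simpler than the kernel's; the point is that two fragments with the same id and addresses only share a queue when they come from the same caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callers that each get their own set of queues. */
enum frag_user {
    USER_LOCAL_DELIVER,
    USER_CONNTRACK_IN,
    USER_CONNTRACK_OUT
};

struct frag_key {
    uint32_t id;
    uint8_t saddr[16];
    uint8_t daddr[16];
    enum frag_user user;
};

static bool frag_key_match(const struct frag_key *a, const struct frag_key *b)
{
    return a->id == b->id &&
           a->user == b->user &&
           memcmp(a->saddr, b->saddr, 16) == 0 &&
           memcmp(a->daddr, b->daddr, 16) == 0;
}
```

The user value also has to feed into the hash used to pick the bucket, otherwise the lookup and the insertion could disagree about where a queue lives.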
First patch changes __inet_hash_nolisten() and __inet6_hash()
to get a timewait parameter to be able to unhash it from ehash
at the same time the new socket is inserted in the hash.
This makes sure the timewait socket won't be found by a concurrent
writer in __inet_check_established()
Reported-by: kapil dakhane <kdakhane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...
Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c
It is currently possible that several threads issuing a connect() find
the same timewait socket and try to reuse it, leading to list
corruptions.
The condition for the bug is that these threads bound their sockets to the
same address/port as the to-be-found timewait socket, and connected to the
same target. (SO_REUSEADDR needed)
To fix this problem, we could unhash timewait socket while holding
ehash lock, to make sure lookups/changes will be serialized. Only
first thread finds the timewait socket, other ones find the
established socket and return an EADDRNOTAVAIL error.
This second version takes into account Evgeniy's review and makes sure
inet_twsk_put() is called outside of locked sections.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function walks the whole hashtable so there is no point in
passing it a network namespace. Instead I purge all timewait
sockets from dead network namespaces that I find. If the namespace
is one of the ones I am trying to purge I am guaranteed no new timewait
sockets can be formed so this will get them all. If the namespace
is one I am not acting for it might form a few more but I will
call inet_twsk_purge again shortly to get rid of them. In
any event if the network namespace is dead timewait sockets are
useless.
Move the calls of inet_twsk_purge into batch_exit routines so
that if I am killing a bunch of namespaces at once I will just
call inet_twsk_purge once and save a lot of redundant
work.
In my simple 4k network namespace exit test the cleanup time dropped from
roughly 8.2s to 1.6s, while the time spent running inet_twsk_purge fell
to about 2ms: 1ms for ipv4 and 1ms for ipv6.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the code so fib_rules_register always takes a template instead
of the actual fib_rules_ops structure that will be used. This is
required for network namespace support so 2 out of the 3 callers already
do this, it allows the error handling to be made common, and it allows
fib_rules_unregister to free the template for the caller.
Modify fib_rules_unregister to use call_rcu instead of synchronize_rcu
to allow multiple namespaces to be cleaned up in the same rcu grace
period.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d124356ce314fff22a047ea334379d5105b2d834
Author: Patrick McHardy <kaber@trash.net>
Date: Thu Dec 3 12:16:35 2009 +0100
net: fib_rules: allow to delete local rule
Allow to delete the local rule and recreate it with a higher priority. This
can be used to force packets with a local destination out on the wire instead
of routing them to loopback. Additionally this patch allows to recreate rules
with a priority of 0.
Combined with the previous patch to allow oif classification, a socket can
be bound to the desired interface and packets routed to the wire like this:
# move local rule to lower priority
ip rule add pref 1000 lookup local
ip rule del pref 0
# route packets of sockets bound to eth0 to the wire independent
# of the destination address
ip rule add pref 100 oif eth0 lookup 100
ip route add default dev eth0 table 100
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Parse incoming TCP_COOKIE option(s).
Calculate <SYN,ACK> TCP_COOKIE option.
Send optional <SYN,ACK> data.
This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
Requires:
TCPCT part 1a: add request_values parameter for sending SYNACK
TCPCT part 1b: generate Responder Cookie secret
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
TCPCT part 1d: define TCP cookie option, extend existing struct's
TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS
TCPCT part 1f: Initiator Cookie => Responder
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration). There are no additions to
tcp_request_sock, and only 1 pointer in tcp_sock.
This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley):
http://thread.gmane.org/gmane.linux.network/102586
The principal difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data. This is more flexible and
less subject to user configuration error. Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.
"Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
"Re: what a new TCP header might look like", May 12, 1998.
ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
These functions will also be used in subsequent patches that implement
additional features.
Requires:
TCPCT part 1a: add request_values parameter for sending SYNACK
TCPCT part 1b: generate Responder Cookie secret
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
Signed-off-by: William.Allen.Simpson@gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission. Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.
Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.
Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take advantage of the new pernet automatic storage management,
and stop using compatibility network namespace functions.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using the hardcoded truncation for authentication
algorithms, use the truncation length specified on xfrm_state.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generated with the following semantic patch
@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)
@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)
applied over {include,net,drivers/net}.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
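For readers unfamiliar with the helper that the semantic patch above substitutes, here is a userspace sketch of what a net_eq()-style comparison can look like. It is an approximation under stated assumptions: SKETCH_NET_NS is a hypothetical stand-in for the namespace config option, struct net is a dummy, and the kernel's real definition may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical switch standing in for the namespace build option. */
#define SKETCH_NET_NS 1

struct net {
    int id;    /* dummy payload so the struct is not empty */
};

static inline bool net_eq(const struct net *a, const struct net *b)
{
#if SKETCH_NET_NS
    return a == b;         /* namespaces enabled: compare identities */
#else
    (void)a;
    (void)b;
    return true;           /* single namespace: always the same one */
#endif
}
```

Going through one helper instead of open-coded pointer comparisons lets the comparison be compiled away when only a single namespace can exist.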