linux

Commit Graph

Author	SHA1	Message	Date
Neal Cardwell	f3f121359c	ipv6: tcp: clean up tcp_v6_early_demux() icsk variable Remove an icsk variable, which by convention should refer to an inet_connection_sock rather than an inet_sock. In the process, make the tcp_v6_early_demux() code and formatting a bit more like tcp_v4_early_demux(), to ease comparisons and maintenance. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-23 13:03:44 -04:00
Nicolas Dichtel	51ebd31815	ipv6: add support of equal cost multipath (ECMP) Each nexthop is added like a single route in the routing table. All routes that have the same metric/weight and destination but not the same gateway are considering as ECMP routes. They are linked together, through a list called rt6i_siblings. ECMP routes can be added in one shot, with RTA_MULTIPATH attribute or one after the other (in both case, the flag NLM_F_EXCL should not be set). The patch is based on a previous work from Luc Saillard <luc.saillard@6wind.com>. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-23 02:38:32 -04:00
Eric Dumazet	9f0d3c2781	ipv6: addrconf: fix /proc/net/if_inet6 Commit `1d5783030a` (ipv6/addrconf: speedup /proc/net/if_inet6 filling) added bugs hiding some devices from if_inet6 and breaking applications. "ip -6 addr" could still display all IPv6 addresses, while "ifconfig -a" couldnt. One way to reproduce the bug is by starting in a shell : unshare -n /bin/bash ifconfig lo up And in original net namespace, lo device disappeared from if_inet6 Reported-by: Jan Hinnerk Stosch <janhinnerk.stosch@gmail.com> Tested-by: Jan Hinnerk Stosch <janhinnerk.stosch@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mihai Maruseac <mihai.maruseac@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-16 14:41:47 -04:00
Alexey Kuznetsov	4c67525849	tcp: resets are misrouted After commit `e2446eaa` ("tcp_v4_send_reset: binding oif to iif in no sock case").. tcp resets are always lost, when routing is asymmetric. Yes, backing out that patch will result in misrouting of resets for dead connections which used interface binding when were alive, but we actually cannot do anything here. What's died that's died and correct handling normal unbound connections is obviously a priority. Comment to comment: > This has few benefits: > 1. tcp_v6_send_reset already did that. It was done to route resets for IPv6 link local addresses. It was a mistake to do so for global addresses. The patch fixes this as well. Actually, the problem appears to be even more serious than guaranteed loss of resets. As reported by Sergey Soloviev <sol@eqv.ru>, those misrouted resets create a lot of arp traffic and huge amount of unresolved arp entires putting down to knees NAT firewalls which use asymmetric routing. Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>	2012-10-12 13:52:40 -04:00
Eric Dumazet	863472454c	ipv6: gro: fix PV6_GRO_CB(skb)->proto problem It seems IPV6_GRO_CB(skb)->proto can be destroyed in skb_gro_receive() if a new skb is allocated (to serve as an anchor for frag_list) We copy NAPI_GRO_CB() only (not the IPV6 specific part) in : NAPI_GRO_CB(nskb) = NAPI_GRO_CB(p); So we leave IPV6_GRO_CB(nskb)->proto to 0 (fresh skb allocation) instead of IPPROTO_TCP (6) ipv6_gro_complete() isnt able to call ops->gro_complete() [ tcp6_gro_complete() ] Fix this by moving proto in NAPI_GRO_CB() and getting rid of IPV6_GRO_CB Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-08 15:40:43 -04:00
Eric Dumazet	51ec04038c	ipv6: GRO should be ECN friendly IPv4 side of the problem was addressed in commit `a9e050f4e7` (net: tcp: GRO should be ECN friendly) This patch does the same, but for IPv6 : A Traffic Class mismatch doesnt mean flows are different, but instead should force a flush of previous packets. This patch removes artificial packet reordering problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-07 14:44:36 -04:00
Linus Torvalds	283dbd8205	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking changes from David Miller: "The most important bit in here is the fix for input route caching from Eric Dumazet, it's a shame we couldn't fully analyze this in time for 3.6 as it's a 3.6 regression introduced by the routing cache removal. Anyways, will send quickly to -stable after you pull this in. Other changes of note: 1) Fix lockdep splats in team and bonding, from Eric Dumazet. 2) IPV6 adds link local route even when there is no link local address, from Nicolas Dichtel. 3) Fix ixgbe PTP implementation, from Jacob Keller. 4) Fix excessive stack usage in cxgb4 driver, from Vipul Pandya. 5) MAC length computed improperly in VLAN demux, from Antonio Quartulli." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits) ipv6: release reference of ip6_null_entry's dst entry in __ip6_del_rt Remove noisy printks from llcp_sock_connect tipc: prevent dropped connections due to rcvbuf overflow silence some noisy printks in irda team: set qdisc_tx_busylock to avoid LOCKDEP splat bonding: set qdisc_tx_busylock to avoid LOCKDEP splat sctp: check src addr when processing SACK to update transport state sctp: fix a typo in prototype of __sctp_rcv_lookup() ipv4: add a fib_type to fib_info can: mpc5xxx_can: fix section type conflict can: peak_pcmcia: fix error return code can: peak_pci: fix error return code cxgb4: Fix build error due to missing linux/vmalloc.h include. bnx2x: fix ring size for 10G functions cxgb4: Dynamically allocate memory in t4_memory_rw() and get_vpd_params() ixgbe: add support for X540-AT1 ixgbe: fix poll loop for FDIRCTRL.INIT_DONE bit ixgbe: fix PTP ethtool timestamping function ixgbe: (PTP) Fix PPS interrupt code ixgbe: Fix PTP X540 SDP alignment code for PPS signal ...	2012-10-06 03:11:59 +09:00
Andi Kleen	04a6f82cf0	sections: fix section conflicts in net Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-06 03:04:45 +09:00
Gao feng	6825a26c2d	ipv6: release reference of ip6_null_entry's dst entry in __ip6_del_rt as we hold dst_entry before we call __ip6_del_rt, so we should alse call dst_release not only return -ENOENT when the rt6_info is ip6_null_entry. and we already hold the dst entry, so I think it's safe to call dst_release out of the write-read lock. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-04 16:00:07 -04:00
Nicolas Dichtel	62b54dd915	ipv6: don't add link local route when there is no link local address When an address is added on loopback (ip -6 a a 2002::1/128 dev lo), a route to fe80::/64 is added in the main table: unreachable fe80::/64 dev lo proto kernel metric 256 error -101 This route does not match any prefix (no fe80:: address on lo). In fact, addrconf_dev_config() will not add link local address because this function filters interfaces by type. If the link local address is added manually, the route to the link local prefix will be automatically added by addrconf_add_linklocal(). Note also, that this route is not deleted when the address is removed. After looking at the code, it seems that addrconf_add_lroute() is redundant with addrconf_add_linklocal(), because this function will add the link local route when the link local address is configured. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-02 22:36:23 -04:00
Eric Dumazet	861b650101	tcp: gro: add checksuming helpers skb with CHECKSUM_NONE cant currently be handled by GRO, and we notice this deep in GRO stack in tcp[46]_gro_receive() But there are cases where GRO can be a benefit, even with a lack of checksums. This preliminary work is needed to add GRO support to tunnels. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-01 17:00:27 -04:00
Nicolas Dichtel	64c6d08e64	ipv6: del unreachable route when an addr is deleted on lo When an address is added on loopback (ip -6 a a 2002::1/128 dev lo), two routes are added: - one in the local table: local 2002::1 via :: dev lo proto none metric 0 - one the in main table (for the prefix): unreachable 2002::1 dev lo proto kernel metric 256 error -101 When the address is deleted, the route inserted in the main table remains because we use rt6_lookup(), which returns NULL when dst->error is set, which is the case here! Thus, it is better to use ip6_route_lookup() to avoid this kind of filter. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-01 16:49:23 -04:00
Lin Ming	188c517a05	ipv6: return errno pointers consistently for fib6_add_1() fib6_add_1() should consistently return errno pointers, rather than a mixture of NULL and errno pointers. Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-28 18:48:28 -04:00
David S. Miller	6a06e5e1bb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/team/team.c drivers/net/usb/qmi_wwan.c net/batman-adv/bat_iv_ogm.c net/ipv4/fib_frontend.c net/ipv4/route.c net/l2tp/l2tp_netlink.c The team, fib_frontend, route, and l2tp_netlink conflicts were simply overlapping changes. qmi_wwan and bat_iv_ogm were of the "use HEAD" variety. With help from Antonio Quartulli. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-28 14:40:49 -04:00
Eric Dumazet	bcc452935d	ipv6: gre: remove ip6gre_header_parse() dev_parse_header() callers provide 8 bytes of storage, so it's not possible to store an IPv6 address. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-27 18:49:22 -04:00
Konstantin Khlebnikov	4b7cc7fc26	nf_defrag_ipv6: fix oops on module unloading fix copy-paste error introduced in linux-next commit "ipv6: add a new namespace for nf_conntrack_reasm" Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Amerigo Wang <amwang@redhat.com> Cc: David S. Miller <davem@davemloft.net> Acked-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-27 18:14:55 -04:00
stephen hemminger	eccc1bb8d4	tunnel: drop packet if ECN present with not-ECT Linux tunnels were written before RFC6040 and therefore never implemented the corner case of ECN getting set in the outer header and the inner header not being ready for it. Section 4.2. Default Tunnel Egress Behaviour. o If the inner ECN field is Not-ECT, the decapsulator MUST NOT propagate any other ECN codepoint onwards. This is because the inner Not-ECT marking is set by transports that rely on dropped packets as an indication of congestion and would not understand or respond to any other ECN codepoint [RFC4774]. Specifically: * If the inner ECN field is Not-ECT and the outer ECN field is CE, the decapsulator MUST drop the packet. * If the inner ECN field is Not-ECT and the outer ECN field is Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the outgoing packet with the ECN field cleared to Not-ECT. This patch moves the ECN decap logic out of the individual tunnels into a common place. It also adds logging to allow detecting broken systems that set ECN bits incorrectly when tunneling (or an intermediate router might be changing the header). Overloads rx_frame_error to keep track of ECN related error. Thanks to Chris Wright who caught this while reviewing the new VXLAN tunnel. This code was tested by injecting faulty logic in other end GRE to send incorrectly encapsulated packets. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-27 18:12:37 -04:00
stephen hemminger	b0558ef24a	xfrm: remove extranous rcu_read_lock The handlers for xfrm_tunnel are always invoked with rcu read lock already. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-27 18:12:37 -04:00
stephen hemminger	0c5794a66c	gre: remove unnecessary rcu_read_lock/unlock The gre function pointers for receive and error handling are always called (from gre.c) with rcu_read_lock already held. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-27 18:12:37 -04:00
Eric Dumazet	96af69ea2a	ipv6: mip6: fix mip6_mh_filter() mip6_mh_filter() should not modify its input, or else its caller would need to recompute ipv6_hdr() if skb->head is reallocated. Use skb_header_pointer() instead of pskb_may_pull() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-25 16:04:44 -04:00
Eric Dumazet	1b05c4b50e	ipv6: raw: fix icmpv6_filter() icmpv6_filter() should not modify its input, or else its caller would need to recompute ipv6_hdr() if skb->head is reallocated. Use skb_header_pointer() instead of pskb_may_pull() and change the prototype to make clear both sk and skb are const. Also, if icmpv6 header cannot be found, do not deliver the packet, as we do in IPv4. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-25 13:21:49 -04:00
Eric Dumazet	5640f76858	net: use a per task frag allocator We currently use a per socket order-0 page cache for tcp_sendmsg() operations. This page is used to build fragments for skbs. Its done to increase probability of coalescing small write() into single segments in skbs still in write queue (not yet sent) But it wastes a lot of memory for applications handling many mostly idle sockets, since each socket holds one page in sk->sk_sndmsg_page Its also quite inefficient to build TSO 64KB packets, because we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit page allocator more than wanted. This patch adds a per task frag allocator and uses bigger pages, if available. An automatic fallback is done in case of memory pressure. (up to 32768 bytes per frag, thats order-3 pages on x86) This increases TCP stream performance by 20% on loopback device, but also benefits on other network devices, since 8x less frags are mapped on transmit and unmapped on tx completion. Alexander Duyck mentioned a probable performance win on systems with IOMMU enabled. Its possible some SG enabled hardware cant cope with bigger fragments, but their ndo_start_xmit() should already handle this, splitting a fragment in sub fragments, since some arches have PAGE_SIZE=65536 Successfully tested on various ethernet devices. (ixgbe, igb, bnx2x, tg3, mellanox mlx4) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-24 16:31:37 -04:00
David S. Miller	ae4735166e	Merge branch 'master' of git://1984.lsi.us.es/nf-next Pablo Neira Ayuso says: ==================== This patchset contains updates for your net-next tree, they are: * Mostly fixes for the recently pushed IPv6 NAT support: - Fix crash while removing nf_nat modules from Patrick McHardy. - Fix unbalanced rcu_read_unlock from Ulrich Weber. - Merge NETMAP and REDIRECT into one single xt_target module, from Jan Engelhardt. - Fix Kconfig for IPv6 NAT, which allows inconsistent configurations, from myself. * Updates for ipset, all of the from Jozsef Kadlecsik: - Add the new "nomatch" option to obtain reverse set matching. - Support for /0 CIDR in hash:net,iface set type. - One non-critical fix for a rare crash due to pass really wrong configuration parameters. - Coding style cleanups. - Sparse fixes. - Add set revision supported via modinfo.i * One extension for the xt_time match, to support matching during the transition between two days with one single rule, from Florian Westphal. * Fix maximum packet length supported by nfnetlink_queue and add NFQA_CAP_LEN attribute, from myself. You can notice that this batch contains a couple of fixes that may go to 3.6-rc but I don't consider them critical to push them: * The ipset fix for the /0 cidr case, which is triggered with one inconsistent command line invocation of ipset. * The nfnetlink_queue maximum packet length supported since it requires the new NFQA_CAP_LEN attribute to provide a full workaround for the described problem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-24 15:42:04 -04:00
Neal Cardwell	016818d076	tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS When taking SYNACK RTT samples for servers using TCP Fast Open, fix the code to ensure that we only call tcp_valid_rtt_meas() after we receive the ACK that completes the 3-way handshake. Previously we were always taking an RTT sample in tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we receive the SYN. So for TFO we must wait until tcp_rcv_state_process() to take the RTT sample. To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock() before we set the snt_synack timestamp, since tcp_synack_rtt_meas() already ensures that we only take a SYNACK RTT sample if snt_synack is non-zero. To be careful, we only take a snt_synack timestamp when a SYNACK transmit or retransmit succeeds. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-22 15:47:10 -04:00
Neal Cardwell	623df484a7	tcp: extract code to compute SYNACK RTT In preparation for adding another spot where we compute the SYNACK RTT, extract this code so that it can be shared. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-22 15:47:10 -04:00
Wei Yongjun	f950c0ecc7	ipv6: fix return value check in fib6_add() In case of error, the function fib6_add_1() returns ERR_PTR() or NULL pointer. The ERR_PTR() case check is missing in fib6_add(). dpatch engine is used to generated this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-21 13:43:52 -04:00
Jan Engelhardt	2cbc78a29e	netfilter: combine ipt_REDIRECT and ip6t_REDIRECT Combine more modules since the actual code is so small anyway that the kmod metadata and the module in its loaded state totally outweighs the combined actual code size. IP_NF_TARGET_REDIRECT becomes a compat option; IP6_NF_TARGET_REDIRECT is completely eliminated since it has not see a release yet. Signed-off-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-09-21 12:12:05 +02:00
Jan Engelhardt	b3d54b3e40	netfilter: combine ipt_NETMAP and ip6t_NETMAP Combine more modules since the actual code is so small anyway that the kmod metadata and the module in its loaded state totally outweighs the combined actual code size. IP_NF_TARGET_NETMAP becomes a compat option; IP6_NF_TARGET_NETMAP is completely eliminated since it has not see a release yet. Signed-off-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-09-21 12:11:08 +02:00
Pablo Neira Ayuso	b0041d1b8e	netfilter: fix IPv6 NAT dependencies in Kconfig * NF_NAT_IPV6 requires IP6_NF_IPTABLES * IP6_NF_TARGET_MASQUERADE, IP6_NF_TARGET_NETMAP, IP6_NF_TARGET_REDIRECT and IP6_NF_TARGET_NPT require NF_NAT_IPV6. This change just mirrors what IPv4 does in Kconfig, for consistency. Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-09-21 11:33:19 +02:00
Amerigo Wang	6b102865e7	ipv6: unify fragment thresh handling code Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-19 17:23:28 -04:00
Amerigo Wang	d4915c087f	ipv6: make ip6_frag_nqueues() and ip6_frag_mem() static inline Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-19 17:23:28 -04:00
Amerigo Wang	b836c99fd6	ipv6: unify conntrack reassembly expire code with standard one Two years ago, Shan Wei tried to fix this: http://patchwork.ozlabs.org/patch/43905/ The problem is that RFC2460 requires an ICMP Time Exceeded -- Fragment Reassembly Time Exceeded message should be sent to the source of that fragment, if the defragmentation times out. " If insufficient fragments are received to complete reassembly of a packet within 60 seconds of the reception of the first-arriving fragment of that packet, reassembly of that packet must be abandoned and all the fragments that have been received for that packet must be discarded. If the first fragment (i.e., the one with a Fragment Offset of zero) has been received, an ICMP Time Exceeded -- Fragment Reassembly Time Exceeded message should be sent to the source of that fragment. " As Herbert suggested, we could actually use the standard IPv6 reassembly code which follows RFC2460. With this patch applied, I can see ICMP Time Exceeded sent from the receiver when the sender sent out 3/4 fragmented IPv6 UDP packet. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: netfilter-devel@vger.kernel.org Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-19 17:23:28 -04:00
Amerigo Wang	c038a767cd	ipv6: add a new namespace for nf_conntrack_reasm As pointed by Michal, it is necessary to add a new namespace for nf_conntrack_reasm code, this prepares for the second patch. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: netfilter-devel@vger.kernel.org Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-19 17:23:28 -04:00
Li RongQing	3fd91fb358	ipv6: recursive check rt->dst.from when call rt6_check_expired If dst cache dst_a copies from dst_b, and dst_b copies from dst_c, check if dst_a is expired or not, we should not end with dst_a->dst.from, dst_b, we should check dst_c. CC: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-19 15:35:33 -04:00
Nicolas Dichtel	2c20cbd7e3	ipv6: use DST_* macro to set obselete field Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-18 15:57:04 -04:00
Nicolas Dichtel	6f3118b571	ipv6: use net->rt_genid to check dst validity IPv6 dst should take care of rt_genid too. When a xfrm policy is inserted or deleted, all dst should be invalidated. To force the validation, dst entries should be created with ->obsolete set to DST_OBSOLETE_FORCE_CHK. This was already the case for all functions calling ip6_dst_alloc(), except for ip6_rt_copy(). As a consequence, we can remove the specific code in inet6_connection_sock. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-18 15:57:03 -04:00
David S. Miller	b48b63a1f6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/netfilter/nfnetlink_log.c net/netfilter/xt_LOG.c Rather easy conflict resolution, the 'net' tree had bug fixes to make sure we checked if a socket is a time-wait one or not and elide the logging code if so. Whereas on the 'net-next' side we are calculating the UID and GID from the creds using different interfaces due to the user namespace changes from Eric Biederman. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-15 11:43:53 -04:00
Li RongQing	5744dd9b71	ipv6: replace write lock with read lock when get route info geting route info does not write rt->rt6i_table, so replace write lock with read lock Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-13 16:53:46 -04:00
Eric Dumazet	fb0af4c74f	ipv6: route templates can be const We kmemdup() templates, so they can be const. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-13 16:52:55 -04:00
YOSHIFUJI Hideaki / 吉藤英明	91b4b04ff8	ipv6: Compare addresses only bits up to the prefix length (RFC6724). Compare bits up to the source address's prefix length only to allows DNS load balancing to continue to be used as a tie breaker. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-13 16:34:03 -04:00
YOSHIFUJI Hideaki / 吉藤英明	417962a02b	ipv6: Add labels for site-local and 6bone testing addresses (RFC6724) Added labels for site-local addresses (fec0::/10) and 6bone testing addresses (3ffe::/16) in order to depreference them. Note that the RFC introduced new rows for Teredo, ULA and 6to4 addresses in the default policy table. Some of them have different labels from ours. For backward compatibility, we do not change the "default" labels. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-13 16:34:03 -04:00
Amerigo Wang	fdd6681d92	ipv6: remove some useless RCU read lock After this commit: commit `97cac0821a` Author: David S. Miller <davem@davemloft.net> Date: Mon Jul 2 22:43:47 2012 -0700 ipv6: Store route neighbour in rt6_info struct. we no longer use RCU to protect route neighbour. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-10 16:31:18 -04:00
Eric W. Biederman	15e473046c	netlink: Rename pid to portid to avoid confusion It is a frequent mistake to confuse the netlink port identifier with a process identifier. Try to reduce this confusion by renaming fields that hold port identifiers portid instead of pid. I have carefully avoided changing the structures exported to userspace to avoid changing the userspace API. I have successfully built an allyesconfig kernel with this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-10 15:30:41 -04:00
Nicolas Dichtel	b4949ab269	ipv6: fix handling of throw routes It's the same problem that previous fix about blackhole and prohibit routes. When adding a throw route, it was handled like a classic route. Moreover, it was only possible to add this kind of routes by specifying an interface. Before the patch: $ ip route add throw 2001::2/128 RTNETLINK answers: No such device $ ip route add throw 2001::2/128 dev eth0 $ ip -6 route \| grep 2001::2 2001::2 dev eth0 metric 1024 After: $ ip route add throw 2001::2/128 $ ip -6 route \| grep 2001::2 throw 2001::2 dev lo metric 1024 error -11 Reported-by: Markus Stenberg <markus.stenberg@iki.fi> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-07 14:17:10 -04:00
Eric Dumazet	979402b16c	udp: increment UDP_MIB_INERRORS if copy failed In UDP recvmsg(), we miss an increase of UDP_MIB_INERRORS if the copy of skb to userspace failed for whatever reason. Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-07 12:56:00 -04:00
Nicolas Dichtel	ef2c7d7b59	ipv6: fix handling of blackhole and prohibit routes When adding a blackhole or a prohibit route, they were handling like classic routes. Moreover, it was only possible to add this kind of routes by specifying an interface. Bug already reported here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498 Before the patch: $ ip route add blackhole 2001::1/128 RTNETLINK answers: No such device $ ip route add blackhole 2001::1/128 dev eth0 $ ip -6 route \| grep 2001 2001::1 dev eth0 metric 1024 After: $ ip route add blackhole 2001::1/128 $ ip -6 route \| grep 2001 blackhole 2001::1 dev lo metric 1024 error -22 v2: wrong patch v3: add a field fc_type in struct fib6_config to store RTN_* type Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-05 17:49:28 -04:00
Stephen Rothwell	2c96932229	netfilter: ipv6: using csum_ipv6_magic requires net/ip6_checksum.h Fixes this build error: net/ipv6/netfilter/nf_nat_l3proto_ipv6.c: In function 'nf_nat_ipv6_csum_recalc': net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:144:4: error: implicit declaration of function 'csum_ipv6_magic' [-Werror=implicit-function-declaration] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-05 17:46:06 -04:00
Julian Anastasov	d013ef2aba	tcp: fix possible socket refcount problem for ipv6 commit `144d56e910` ("tcp: fix possible socket refcount problem") is missing the IPv6 part. As tcp_release_cb is shared by both protocols we should hold sock reference for the TCP_MTU_REDUCED_DEFERRED bit. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-05 17:16:25 -04:00
Pablo Neira Ayuso	ace1fe1231	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next This merges (`3f509c6` netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation) to Patrick McHardy's IPv6 NAT changes.	2012-09-03 15:34:51 +02:00
Jerry Chu	8336886f78	tcp: TCP Fast Open Server - support TFO listeners This patch builds on top of the previous patch to add the support for TFO listeners. This includes - 1. allocating, properly initializing, and managing the per listener fastopen_queue structure when TFO is enabled 2. changes to the inet_csk_accept code to support TFO. E.g., the request_sock can no longer be freed upon accept(), not until 3WHS finishes 3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg() if it's a TFO socket 4. properly closing a TFO listener, and a TFO socket before 3WHS finishes 5. supporting TCP_FASTOPEN socket option 6. modifying tcp_check_req() to use to check a TFO socket as well as request_sock 7. supporting TCP's TFO cookie option 8. adding a new SYN-ACK retransmit handler to use the timer directly off the TFO socket rather than the listener socket. Note that TFO server side will not retransmit anything other than SYN-ACK until the 3WHS is completed. The patch also contains an important function "reqsk_fastopen_remove()" to manage the somewhat complex relation between a listener, its request_sock, and the corresponding child socket. See the comment above the function for the detail. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-31 20:02:19 -04:00
Sorin Dumitru	eb7e057596	ipv6: remove some deadcode __ipv6_regen_rndid no longer returns anything other than 0 so there's no point in verifying what it returns Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-31 16:33:32 -04:00
Julia Lawall	48f125ce1c	net: ipv6: fix error return code Initialize return variable before exiting on an error path. The initial initialization of the return variable is also dropped, because that value is never used. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> ( if@p1 ($ret < 0\\|ret != 0$) { ... return ret; } \| ret@p1 = 0 ) ... when != ret = e1 when != &ret *if(...) { ... when != ret = e2 when forall return ret; } // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-31 16:27:48 -04:00
Patrick McHardy	8a91bb0c30	netfilter: ip6tables: add stateless IPv6-to-IPv6 Network Prefix Translation target Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:25 +02:00
Patrick McHardy	ed72d9e294	netfilter: ip6tables: add NETMAP target Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:19 +02:00
Patrick McHardy	115e23ac78	netfilter: ip6tables: add REDIRECT target Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:19 +02:00
Patrick McHardy	b3f644fc82	netfilter: ip6tables: add MASQUERADE target Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:18 +02:00
Patrick McHardy	58a317f106	netfilter: ipv6: add IPv6 NAT support Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:17 +02:00
Patrick McHardy	0ad352cb43	netfilter: ipv6: expand skb head in ip6_route_me_harder after oif change Expand the skb headroom if the oif changed due to rerouting similar to how IPv4 packets are handled. Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:15 +02:00
Patrick McHardy	2b60af0178	netfilter: nf_conntrack_ipv6: fix tracking of ICMPv6 error messages containing fragments ICMPv6 error messages are tracked by extracting the conntrack tuple of the inner packet and looking up the corresponding conntrack entry. Tuple extraction uses the ->get_l4proto() callback, which in case of fragments returns NEXTHDR_FRAGMENT instead of the upper protocol, even for the first fragment when the entire next header is present, resulting in a failure to find the correct connection tracking entry. This patch changes ipv6_get_l4proto() to use ipv6_skip_exthdr() instead of nf_ct_ipv6_skip_exthdr() in order to skip fragment headers when the fragment offset is zero. Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:11 +02:00
Patrick McHardy	4cdd34084d	netfilter: nf_conntrack_ipv6: improve fragmentation handling The IPv6 conntrack fragmentation currently has a couple of shortcomings. Fragmentes are collected in PREROUTING/OUTPUT, are defragmented, the defragmented packet is then passed to conntrack, the resulting conntrack information is attached to each original fragment and the fragments then continue their way through the stack. Helper invocation occurs in the POSTROUTING hook, at which point only the original fragments are available. The result of this is that fragmented packets are never passed to helpers. This patch improves the situation in the following way: - If a reassembled packet belongs to a connection that has a helper assigned, the reassembled packet is passed through the stack instead of the original fragments. - During defragmentation, the largest received fragment size is stored. On output, the packet is refragmented if required. If the largest received fragment size exceeds the outgoing MTU, a "packet too big" message is generated, thus behaving as if the original fragments were passed through the stack from an outside point of view. - The ipv6_helper() hook function can't receive fragments anymore for connections using a helper, so it is switched to use ipv6_skip_exthdr() instead of the netfilter specific nf_ct_ipv6_skip_exthdr() and the reassembled packets are passed to connection tracking helpers. The result of this is that we can properly track fragmented packets, but still generate ICMPv6 Packet too big messages if we would have before. This patch is also required as a precondition for IPv6 NAT, where NAT helpers might enlarge packets up to a point that they require fragmentation. In that case we can't generate Packet too big messages since the proper MTU can't be calculated in all cases (f.i. when changing textual representation of a variable amount of addresses), so the packet is transparently fragmented iff the original packet or fragments would have fit the outgoing MTU. IPVS parts by Jesper Dangaard Brouer <brouer@redhat.com>. Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:10 +02:00
David S. Miller	e6acb38480	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace This is an initial merge in of Eric Biederman's work to start adding user namespace support to the networking. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-24 18:54:37 -04:00
Eric Dumazet	748e2d9396	net: reinstate rtnl in call_netdevice_notifiers() Eric Biederman pointed out that not holding RTNL while calling call_netdevice_notifiers() was racy. This patch is a direct transcription his feedback against commit `0115e8e30d` (net: remove delay at device dismantle) Thanks Eric ! Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-23 09:24:42 -07:00
Eric Dumazet	b87fb39e39	ipv6: gre: fix ip6gre_err() ip6gre_err() miscomputes grehlen (sizeof(ipv6h) is 4 or 8, not 40 as expected), and should take into account 'offset' parameter. Also uses pskb_may_pull() to cope with some fragged skbs Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-22 22:48:32 -07:00
Eric Dumazet	0115e8e30d	net: remove delay at device dismantle I noticed extra one second delay in device dismantle, tracked down to a call to dst_dev_event() while some call_rcu() are still in RCU queues. These call_rcu() were posted by rt_free(struct rtable *rt) calls. We then wait a little (but one second) in netdev_wait_allrefs() before kicking again NETDEV_UNREGISTER. As the call_rcu() are now completed, dst_dev_event() can do the needed device swap on busy dst. To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called after a rcu_barrier(), but outside of RTNL lock. Use NETDEV_UNREGISTER_FINAL with care ! Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after IP cache removal. With help from Gao feng Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-22 21:50:36 -07:00
David S. Miller	bf277b0cce	Merge git://1984.lsi.us.es/nf-next Pablo Neira Ayuso says: ==================== This is the first batch of Netfilter and IPVS updates for your net-next tree. Mostly cleanups for the Netfilter side. They are: * Remove unnecessary RTNL locking now that we have support for namespace in nf_conntrack, from Patrick McHardy. * Cleanup to eliminate unnecessary goto in the initialization path of several Netfilter tables, from Jean Sacren. * Another cleanup from Wu Fengguang, this time to PTR_RET instead of if IS_ERR then return PTR_ERR. * Use list_for_each_entry_continue_rcu in nf_iterate, from Michael Wang. * Add pmtu_disc sysctl option to disable PMTU in their tunneling transmitter, from Julian Anastasov. * Generalize application protocol registration in IPVS and modify IPVS FTP helper to use it, from Julian Anastasov. * update Kconfig. The IPVS FTP helper depends on the Netfilter FTP helper for NAT support, from Julian Anastasov. * Add logic to update PMTU for IPIP packets in IPVS, again from Julian Anastasov. * A couple of sparse warning fixes for IPVS and Netfilter from Claudiu Ghioc and Patrick McHardy respectively. Patrick's IPv6 NAT changes will follow after this batch, I need to flush this batch first before refreshing my tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-22 18:48:52 -07:00
David S. Miller	1304a7343b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2012-08-22 14:21:38 -07:00
Neal Cardwell	fae6ef87fa	net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child() This commit removes the sk_rx_dst_set calls from tcp_create_openreq_child(), because at that point the icsk_af_ops field of ipv6_mapped TCP sockets has not been set to its proper final value. Instead, to make sure we get the right sk_rx_dst_set variant appropriate for the address family of the new connection, we have tcp_v{4,6}_syn_recv_sock() directly call the appropriate function shortly after the call to tcp_create_openreq_child() returns. This also moves inet6_sk_rx_dst_set() to avoid a forward declaration with the new approach. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Artem Savkov <artem.savkov@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-20 03:03:33 -07:00
Patrick McHardy	9d7b0fc1ef	net: ipv6: fix oops in inet_putpeer() Commit `97bab73f` (inet: Hide route peer accesses behind helpers.) introduced a bug in xfrm6_policy_destroy(). The xfrm_dst's _rt6i_peer member is not initialized, causing a false positive result from inetpeer_ptr_is_peer(), which in turn causes a NULL pointer dereference in inet_putpeer(). Pid: 314, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #17 To Be Filled By O.E.M. To Be Filled By O.E.M./P4S800D-X EIP: 0060:[<c03abf93>] EFLAGS: 00010246 CPU: 0 EIP is at inet_putpeer+0xe/0x16 EAX: 00000000 EBX: f3481700 ECX: 00000000 EDX: 000dd641 ESI: f3481700 EDI: c05e949c EBP: f551def4 ESP: f551def4 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 00000070 CR3: 3243d000 CR4: 00000750 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 f551df04 c0423de1 00000000 f3481700 f551df18 c038d5f7 f254b9f8 f551df28 f34f85d8 f551df20 c03ef48d f551df3c c0396870 f30697e8 f24e1738 c05e98f4 f5509540 c05cd2b4 f551df7c c0142d2b c043feb5 f5509540 00000000 c05cd2e8 [<c0423de1>] xfrm6_dst_destroy+0x42/0xdb [<c038d5f7>] dst_destroy+0x1d/0xa4 [<c03ef48d>] xfrm_bundle_flo_delete+0x2b/0x36 [<c0396870>] flow_cache_gc_task+0x85/0x9f [<c0142d2b>] process_one_work+0x122/0x441 [<c043feb5>] ? apic_timer_interrupt+0x31/0x38 [<c03967eb>] ? flow_cache_new_hashrnd+0x2b/0x2b [<c0143e2d>] worker_thread+0x113/0x3cc Fix by adding a init_dst() callback to struct xfrm_policy_afinfo to properly initialize the dst's peer pointer. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-20 02:56:56 -07:00
Dan Carpenter	5ef5d6c569	gre: information leak in ip6_tnl_ioctl() There is a one byte hole between p->hop_limit and p->flowinfo where stack memory is leaked to the user. This was introduced in `c12b395a46` "gre: Support GRE over IPv6". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-08-20 02:21:30 -07:00
Dan Carpenter	898132ae76	ipv6: move dereference after check in fl_free() There is a dereference before checking for NULL bug here. Generally free() functions should accept NULL pointers. For example, fl_create() can pass a NULL pointer to fl_free() on the error path. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-08-16 16:04:42 -07:00
Eric W. Biederman	4f82f45730	net ip6 flowlabel: Make owner a union of struct pid * and kuid_t Correct a long standing omission and use struct pid in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS. This guarantees we don't have issues when pid wraparound occurs. Use a kuid_t in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_USER to add user namespace support. In /proc/net/ip6_flowlabel capture the current pid namespace when opening the file and release the pid namespace when the file is closed ensuring we print the pid owner value that is meaning to the reader of the file. Similarly use from_kuid_munged to print uid values that are meaningful to the reader of the file. This requires exporting pid_nr_ns so that ipv6 can continue to built as a module. Yoiks what silliness Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-08-14 21:49:25 -07:00
Eric W. Biederman	a7cb5a49bf	userns: Print out socket uids in a user namespace aware fashion. Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Sridhar Samudrala <sri@us.ibm.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-08-14 21:48:06 -07:00
Ben Hutchings	4acd4945cd	ipv6: addrconf: Avoid calling netdevice notifiers with RCU read-side lock Cong Wang reports that lockdep detected suspicious RCU usage while enabling IPV6 forwarding: [ 1123.310275] =============================== [ 1123.442202] [ INFO: suspicious RCU usage. ] [ 1123.558207] 3.6.0-rc1+ #109 Not tainted [ 1123.665204] ------------------------------- [ 1123.768254] include/linux/rcupdate.h:430 Illegal context switch in RCU read-side critical section! [ 1123.992320] [ 1123.992320] other info that might help us debug this: [ 1123.992320] [ 1124.307382] [ 1124.307382] rcu_scheduler_active = 1, debug_locks = 0 [ 1124.522220] 2 locks held by sysctl/5710: [ 1124.648364] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81768498>] rtnl_trylock+0x15/0x17 [ 1124.882211] #1: (rcu_read_lock){.+.+.+}, at: [<ffffffff81871df8>] rcu_lock_acquire+0x0/0x29 [ 1125.085209] [ 1125.085209] stack backtrace: [ 1125.332213] Pid: 5710, comm: sysctl Not tainted 3.6.0-rc1+ #109 [ 1125.441291] Call Trace: [ 1125.545281] [<ffffffff8109d915>] lockdep_rcu_suspicious+0x109/0x112 [ 1125.667212] [<ffffffff8107c240>] rcu_preempt_sleep_check+0x45/0x47 [ 1125.781838] [<ffffffff8107c260>] __might_sleep+0x1e/0x19b [...] [ 1127.445223] [<ffffffff81757ac5>] call_netdevice_notifiers+0x4a/0x4f [...] [ 1127.772188] [<ffffffff8175e125>] dev_disable_lro+0x32/0x6b [ 1127.885174] [<ffffffff81872d26>] dev_forward_change+0x30/0xcb [ 1128.013214] [<ffffffff818738c4>] addrconf_forward_change+0x85/0xc5 [...] addrconf_forward_change() uses RCU iteration over the netdev list, which is unnecessary since it already holds the RTNL lock. We also cannot reasonably require netdevice notifier functions not to sleep. Reported-by: Cong Wang <amwang@redhat.com> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 17:02:12 -07:00
Igor Maravic	4855d6f311	net: ipv6: proc: Fix error handling Fix error handling in case making of dir dev_snmp6 failes Signed-off-by: Igor Maravic <igorm@etf.rs> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 14:45:07 -07:00
xeb@mail.ru	c12b395a46	gre: Support GRE over IPv6 GRE over IPv6 implementation. Signed-off-by: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 14:28:32 -07:00
Wu Fengguang	19e303d67d	netfilter: PTR_RET can be used This quiets the coccinelle warnings: net/bridge/netfilter/ebtable_filter.c:107:1-3: WARNING: PTR_RET can be used net/bridge/netfilter/ebtable_nat.c:107:1-3: WARNING: PTR_RET can be used net/ipv6/netfilter/ip6table_filter.c:65:1-3: WARNING: PTR_RET can be used net/ipv6/netfilter/ip6table_mangle.c💯1-3: WARNING: PTR_RET can be used net/ipv6/netfilter/ip6table_raw.c:44:1-3: WARNING: PTR_RET can be used net/ipv6/netfilter/ip6table_security.c:62:1-3: WARNING: PTR_RET can be used net/ipv4/netfilter/iptable_filter.c:72:1-3: WARNING: PTR_RET can be used net/ipv4/netfilter/iptable_mangle.c:107:1-3: WARNING: PTR_RET can be used net/ipv4/netfilter/iptable_raw.c:51:1-3: WARNING: PTR_RET can be used net/ipv4/netfilter/iptable_security.c:70:1-3: WARNING: PTR_RET can be used Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-08-14 02:31:47 +02:00
Eric Dumazet	63d02d157e	net: tcp: ipv6_mapped needs sk_rx_dst_set method commit `5d299f3d3c` (net: ipv6: fix TCP early demux) added a regression for ipv6_mapped case. [ 67.422369] SELinux: initialized (dev autofs, type autofs), uses genfs_contexts [ 67.449678] SELinux: initialized (dev autofs, type autofs), uses genfs_contexts [ 92.631060] BUG: unable to handle kernel NULL pointer dereference at (null) [ 92.631435] IP: [< (null)>] (null) [ 92.631645] PGD 0 [ 92.631846] Oops: 0010 [#1] SMP [ 92.632095] Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss snd_pcm snd_timer serio_raw button floppy snd i2c_i801 i2c_core soundcore snd_page_alloc shpchp ide_cd_mod cdrom microcode ehci_hcd ohci_hcd uhci_hcd [ 92.634294] CPU 0 [ 92.634294] Pid: 4469, comm: sendmail Not tainted 3.6.0-rc1 #3 [ 92.634294] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 92.634294] RSP: 0018:ffff880245fc7cb0 EFLAGS: 00010282 [ 92.634294] RAX: ffffffffa01985f0 RBX: ffff88024827ad00 RCX: 0000000000000000 [ 92.634294] RDX: 0000000000000218 RSI: ffff880254735380 RDI: ffff88024827ad00 [ 92.634294] RBP: ffff880245fc7cc8 R08: 0000000000000001 R09: 0000000000000000 [ 92.634294] R10: 0000000000000000 R11: ffff880245fc7bf8 R12: ffff880254735380 [ 92.634294] R13: ffff880254735380 R14: 0000000000000000 R15: 7fffffffffff0218 [ 92.634294] FS: 00007f4516ccd6f0(0000) GS:ffff880256600000(0000) knlGS:0000000000000000 [ 92.634294] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 92.634294] CR2: 0000000000000000 CR3: 0000000245ed1000 CR4: 00000000000007f0 [ 92.634294] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 92.634294] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 92.634294] Process sendmail (pid: 4469, threadinfo ffff880245fc6000, task ffff880254b8cac0) [ 92.634294] Stack: [ 92.634294] ffffffff813837a7 ffff88024827ad00 ffff880254b6b0e8 ffff880245fc7d68 [ 92.634294] ffffffff81385083 00000000001d2680 ffff8802547353a8 ffff880245fc7d18 [ 92.634294] ffffffff8105903a ffff88024827ad60 0000000000000002 00000000000000ff [ 92.634294] Call Trace: [ 92.634294] [<ffffffff813837a7>] ? tcp_finish_connect+0x2c/0xfa [ 92.634294] [<ffffffff81385083>] tcp_rcv_state_process+0x2b6/0x9c6 [ 92.634294] [<ffffffff8105903a>] ? sched_clock_cpu+0xc3/0xd1 [ 92.634294] [<ffffffff81059073>] ? local_clock+0x2b/0x3c [ 92.634294] [<ffffffff8138caf3>] tcp_v4_do_rcv+0x63a/0x670 [ 92.634294] [<ffffffff8133278e>] release_sock+0x128/0x1bd [ 92.634294] [<ffffffff8139f060>] __inet_stream_connect+0x1b1/0x352 [ 92.634294] [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f [ 92.634294] [<ffffffff8104b333>] ? wake_up_bit+0x25/0x25 [ 92.634294] [<ffffffff813325f5>] ? lock_sock_nested+0x74/0x7f [ 92.634294] [<ffffffff8139f223>] ? inet_stream_connect+0x22/0x4b [ 92.634294] [<ffffffff8139f234>] inet_stream_connect+0x33/0x4b [ 92.634294] [<ffffffff8132e8cf>] sys_connect+0x78/0x9e [ 92.634294] [<ffffffff813fd407>] ? sysret_check+0x1b/0x56 [ 92.634294] [<ffffffff81088503>] ? __audit_syscall_entry+0x195/0x1c8 [ 92.634294] [<ffffffff811cc26e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 92.634294] [<ffffffff813fd3e2>] system_call_fastpath+0x16/0x1b [ 92.634294] Code: Bad RIP value. [ 92.634294] RIP [< (null)>] (null) [ 92.634294] RSP <ffff880245fc7cb0> [ 92.634294] CR2: 0000000000000000 [ 92.648982] ---[ end trace 24e2bed94314c8d9 ]--- [ 92.649146] Kernel panic - not syncing: Fatal exception in interrupt Fix this using inet_sk_rx_dst_set(), and export this function in case IPv6 is modular. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-09 20:56:09 -07:00
Pavel Emelyanov	1fb9489bf1	net: Loopback ifindex is constant now As pointed out, there are places, that access net->loopback_dev->ifindex and after ifindex generation is made per-net this value becomes constant equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use it where appropriate. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-09 16:18:07 -07:00
Eric Dumazet	a399a80531	time: jiffies_delta_to_clock_t() helper to the rescue Various /proc/net files sometimes report crazy timer values, expressed in clock_t units. This happens when an expired timer delta (expires - jiffies) is passed to jiffies_to_clock_t(). This function has an overflow in : return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ); commit `cbbc719fcc` (time: Change jiffies_to_clock_t() argument type to unsigned long) only got around the problem. As we cant output negative values in /proc/net/tcp without breaking various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper that caps the negative delta to a 0 value. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Maciej Żenczykowski <maze@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: hank <pyu@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-09 16:17:03 -07:00
Eric Dumazet	5d299f3d3c	net: ipv6: fix TCP early demux IPv6 needs a cookie in dst_check() call. We need to add rx_dst_cookie and provide a family independent sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-06 13:33:21 -07:00
Linus Torvalds	ac694dbdbc	Merge branch 'akpm' (Andrew's patch-bomb) Merge Andrew's second set of patches: - MM - a few random fixes - a couple of RTC leftovers * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits) rtc/rtc-88pm80x: remove unneed devm_kfree rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables tmpfs: distribute interleave better across nodes mm: remove redundant initialization mm: warn if pg_data_t isn't initialized with zero mips: zero out pg_data_t when it's allocated memcg: gix memory accounting scalability in shrink_page_list mm/sparse: remove index_init_lock mm/sparse: more checks on mem_section number mm/sparse: optimize sparse_index_alloc memcg: add mem_cgroup_from_css() helper memcg: further prevent OOM with too many dirty pages memcg: prevent OOM with too many dirty pages mm: mmu_notifier: fix freed page still mapped in secondary MMU mm: memcg: only check anon swapin page charges for swap cache mm: memcg: only check swap cache pages for repeated charging mm: memcg: split swapin charge function into private and public part mm: memcg: remove needless !mm fixup to init_mm when charging mm: memcg: remove unneeded shmem charge type ...	2012-07-31 19:25:39 -07:00
Mel Gorman	99a1dec70d	net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket Introduce sk_gfp_atomic(), this function allows to inject sock specific flags to each sock related allocation. It is only used on allocation paths that may be required for writing pages back to network storage. [davem@davemloft.net: Use sk_gfp_atomic only when necessary] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:46 -07:00
Andrew Morton	c255a45805	memcg: rename config variables Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM [mhocko@suse.cz: fix missed bits] Cc: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 18:42:43 -07:00
Eric Dumazet	cca32e4bf9	net: TCP early demux cleanup early_demux() handlers should be called in RCU context, and as we use skb_dst_set_noref(skb, dst), caller must not exit from RCU context before dst use (skb_dst(skb)) or release (skb_drop(dst)) Therefore, rcu_read_lock()/rcu_read_unlock() pairs around ->early_demux() are confusing and not needed : Protocol handlers are already in an RCU read lock section. (__netif_receive_skb() does the rcu_read_lock() ) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:21 -07:00
Li Wei	8253947e2c	ipv6: fix incorrect route 'expires' value passed to userspace When userspace use RTM_GETROUTE to dump route table, with an already expired route entry, we always got an 'expires' value(2147157) calculated base on INT_MAX. The reason of this problem is in the following satement: rt->dst.expires - jiffies < INT_MAX gcc promoted the type of both sides of '<' to unsigned long, thus a small negative value would be considered greater than INT_MAX. With the help of Eric Dumazet, do the out of bound checks in rtnl_put_cacheinfo(), _after_ conversion to clock_t. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-29 23:18:31 -07:00
Eric Dumazet	c7109986db	ipv6: Early TCP socket demux This is the IPv6 missing bits for infrastructure added in commit `41063e9dd1` (ipv4: Early TCP socket demux.) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-26 15:50:39 -07:00
Eric Dumazet	563d34d057	tcp: dont drop MTU reduction indications ICMP messages generated in output path if frame length is bigger than mtu are actually lost because socket is owned by user (doing the xmit) One example is the ipgre_tunnel_xmit() calling icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); We had a similar case fixed in commit `a34a101e1e` (ipv6: disable GSO on sockets hitting dst_allfrag). Problem of such fix is that it relied on retransmit timers, so short tcp sessions paid a too big latency increase price. This patch uses the tcp_release_cb() infrastructure so that MTU reduction messages (ICMP messages) are not lost, and no extra delay is added in TCP transmits. Reported-by: Maciej Żenczykowski <maze@google.com> Diagnosed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-23 00:58:46 -07:00
David S. Miller	f5b0a87436	net: Document dst->obsolete better. Add a big comment explaining how the field works, and use defines instead of magic constants for the values assigned to it. Suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 13:31:21 -07:00
Yuchung Cheng	2100c8d2d9	net-tcp: Fast Open base This patch impelements the common code for both the client and server. 1. TCP Fast Open option processing. Since Fast Open does not have an option number assigned by IANA yet, it shares the experiment option code 254 by implementing draft-ietf-tcpm-experimental-options with a 16 bits magic number 0xF989. This enables global experiments without clashing the scarce(2) experimental options available for TCP. When the draft status becomes standard (maybe), the client should switch to the new option number assigned while the server supports both numbers for transistion. 2. The new sysctl tcp_fastopen 3. A place holder init function Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:55:36 -07:00
Eric Dumazet	ddbe503203	ipv6: add ipv6_addr_hash() helper Introduce ipv6_addr_hash() helper doing a XOR on all bits of an IPv6 address, with an optimized x86_64 version. Use it in flow dissector, as suggested by Andrew McGregor, to reduce hash collision probabilities in fq_codel (and other users of flow dissector) Use it in ip6_tunnel.c and use more bit shuffling, as suggested by David Laight, as existing hash was ignoring most of them. Use it in sunrpc and use more bit shuffling, using hash_32(). Use it in net/ipv6/addrconf.c, using hash_32() as well. As a cleanup, use it in net/ipv4/tcp_metrics.c Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew McGregor <andrewmcgr@gmail.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Tom Herbert <therbert@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 11:28:46 -07:00
Eric Dumazet	d3818c92af	ipv6: fix inet6_csk_xmit() We should provide to inet6_csk_route_socket a struct flowi6 pointer, so that net6_csk_xmit() works correctly instead of sending garbage. Also add some consts Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-18 08:59:58 -07:00
David S. Miller	a6ff1a2f1e	Merge branch 'nexthop_exceptions' These patches implement the final mechanism necessary to really allow us to go without the route cache in ipv4. We need a place to have long-term storage of PMTU/redirect information which is independent of the routes themselves, yet does not get us back into a situation where we have to write to metrics or anything like that. For this we use an "next-hop exception" table in the FIB nexthops. The one thing I desperately want to avoid is having to create clone routes in the FIB trie for this purpose, because that is very expensive. However, I'm willing to entertain such an idea later if this current scheme proves to have downsides that the FIB trie variant would not have. In order to accomodate this any such scheme, we need to be able to produce a full flow key at PMTU/redirect time. That required an adjustment of the interface call-sites used to propagate these events. For a PMTU/redirect with a fully specified socket, we pass that socket and use it to produce the flow key. Otherwise we use a passed in SKB to formulate the key. There are two cases that need to be distinguished, ICMP message processing (in which case the IP header is at skb->data) and output packet processing (mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)). We also have to make the code able to handle the case where the dst itself passed into the dst_ops->{update_pmtu,redirect} method is invalidated. This matters for calls from sockets that have cached that route. We provide a inet{,6} helper function for this purpose, and edit SCTP specially since it caches routes at the transport rather than socket level. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 10:48:26 -07:00
David S. Miller	6700c2709c	net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 03:29:28 -07:00
Li Wei	a858d64b77	ipv6: fix unappropriate errno returned for non-multicast address We need to check the passed in multicast address and return appropriate errno(EINVAL) if it is not valid. And it's no need to walk through the ipv6_mc_list in this situation. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-17 01:35:03 -07:00
Denis Ovsienko	f0396f60d7	ipv6: fix RTPROT_RA markup of RA routes w/nexthops Userspace implementations of network routing protocols sometimes need to tell RA-originated IPv6 routes from other kernel routes to make proper routing decisions. This makes most sense for RA routes with nexthops, namely, default routes and Route Information routes. The intended mean of preserving RA route origin in a netlink message is through indicating RTPROT_RA as protocol code. Function rt6_fill_node() tried to do that for default routes, but its test condition was taken wrong. This change is modeled after the original mailing list posting by Jeff Haran. It fixes the test condition for default route case and sets the same behaviour for Route Information case (both types use nexthops). Handling of the 3rd RA route type, Prefix Information, is left unchanged, as it stands for interface connected routes (without nexthops). Signed-off-by: Denis Ovsienko <infrastation@yandex.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:55:54 -07:00
David S. Miller	35ad9b9cf7	ipv6: Add helper inet6_csk_update_pmtu(). This is the ipv6 version of inet_csk_update_pmtu(). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 03:44:56 -07:00
Steffen Klassert	8104891b86	ipv6: Initialize the struct rt6_info behind the dst_enty field We start initializing the struct rt6_info at the first field behind the struct dst_enty. This is error prone because it might leave a new field uninitialized. So start initializing the struct rt6_info right behind the dst_entry. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-14 00:29:12 -07:00
David S. Miller	1ed5c48f23	net: Remove checks for dst_ops->redirect being NULL. No longer necessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:41:25 -07:00
David S. Miller	b587ee3ba2	net: Add dummy dst_ops->redirect method where needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:39:24 -07:00
David S. Miller	b94f1c0904	ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect(). And delete rt6_redirect(), since it is no longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:33:37 -07:00
David S. Miller	ec18d9a269	ipv6: Add redirect support to all protocol icmp error handlers. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:25:15 -07:00
David S. Miller	3a5ad2ee5e	ipv6: Add ip6_redirect() and ip6_sk_redirect() helper functions. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:08:07 -07:00
David S. Miller	6e157b6ac6	ipv6: Pull main logic of rt6_redirect() into rt6_do_redirect(). Hook it into dst_ops->redirect as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-12 00:05:02 -07:00
David S. Miller	e8599ff4b1	ipv6: Move bulk of redirect handling into rt6_redirect(). This sets things up so that we can have the protocol error handlers call down into the ipv6 route code for redirects just as ipv4 already does. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 23:43:53 -07:00
David S. Miller	30f2a5f379	ipv6: Export ndisc option parsing from ndisc.c This is going to be used internally by the rt6 redirect code. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 23:39:11 -07:00
Eric Dumazet	46d3ceabd8	tcp: TCP Small Queues This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-11 18:12:59 -07:00
Ben Hutchings	2c53040f01	net: Fix (nearly-)kernel-doc comments for various functions Fix incorrect start markers, wrapped summary lines, missing section breaks, incorrect separators, and some name mismatches. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 23:13:45 -07:00
David S. Miller	87a50699cb	rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo(). Nobody provides non-zero values any longer. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:13 -07:00
David S. Miller	3e12939a2a	inet: Kill FLOWI_FLAG_PRECOW_METRICS. No longer needed. TCP writes metrics, but now in it's own special cache that does not dirty the route metrics. Therefore there is no longer any reason to pre-cow metrics in this way. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:12 -07:00
David S. Miller	1d861aa4b3	inet: Minimize use of cached route inetpeer. Only use it in the absolutely required cases: 1) COW'ing metrics 2) ipv4 PMTU 3) ipv4 redirects Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:11 -07:00
David S. Miller	16d1839907	inet: Remove ->get_peer() method. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:10 -07:00
David S. Miller	81166dd6fa	tcp: Move timestamps from inetpeer to metrics cache. With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:40:08 -07:00
David S. Miller	ab92bb2f67	tcp: Abstract back handling peer aliveness test into helper function. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 20:33:49 -07:00
David S. Miller	d3a5ea6e21	Merge branch 'master' of git://1984.lsi.us.es/nf-next	2012-07-07 16:18:50 -07:00
Eric Dumazet	c56bf6fe78	ipv6: fix a bad cast in ip6_dst_lookup_tail() Fix a bug in ip6_dst_lookup_tail(), where typeof(dst) is "struct dst_entry *", not "struct dst_entry " Reported-by: Fengguang Wu <wfg@linux.intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-06 00:23:41 -07:00
Eric Dumazet	883dd4fb59	ipv6: remove redundant declarations remove redundant declarations, they belong in include/net/tcp.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 23:40:28 -07:00
Steffen Klassert	a2de86f63c	ipv6: Initialize the neighbour pointer of rt6_info on allocation git commit `97cac082` (ipv6: Store route neighbour in rt6_info struct) added a neighbour pointer to rt6_info. Currently we don't initialize this pointer at allocation time. We assume this pointer to be valid if it is not a null pointer, so initialize it on allocation. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 14:20:07 -07:00
RongQing.Li	43264e0bd9	ipv6: remove unnecessary codes in tcp_ipv6.c opt always equals np->opts, so it is meaningless to define opt, and check if opt does not equal np->opts and then try to free opt. Signed-off-by: RongQing.Li <roy.qing.li@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 03:11:15 -07:00
David S. Miller	97cac0821a	ipv6: Store route neighbour in rt6_info struct. This makes for a simplified conversion away from dst_get_neighbour(). All code outside of ipv6 will use neigh lookups via dst_neigh_lookup(). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 02:41:58 -07:00
David S. Miller	1d248b1cf4	net: Pass neighbours and dest address into NETEVENT_REDIRECT events. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 02:21:55 -07:00
David S. Miller	f894cbf847	net: Add optional SKB arg to dst_ops->neigh_lookup(). Causes the handler to use the daddr in the ipv4/ipv6 header when the route gateway is unspecified (local subnet). Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:04:01 -07:00
David S. Miller	5110effee8	net: Do delayed neigh confirmation. When a dst_confirm() happens, mark the confirmation as pending in the dst. Then on the next packet out, when we have the neigh in-hand, do the update. This removes the dependency in dst_confirm() of dst's having an attached neigh. While we're here, remove the explicit 'dst' NULL check, all except 2 or 3 call sites ensure it's not NULL. So just fix those cases up. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:03:06 -07:00
Pablo Neira Ayuso	08911475d1	netfilter: nf_conntrack: generalize nf_ct_l4proto_net This patch generalizes nf_ct_l4proto_net by splitting it into chunks and moving the corresponding protocol part to where it really belongs to. To clarify, note that we follow two different approaches to support per-net depending if it's built-in or run-time loadable protocol tracker. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>	2012-07-04 19:37:22 +02:00
Ville Nuorvala	d0087b29f7	ipv6_tunnel: Allow receiving packets on the fallback tunnel if they pass sanity checks At Facebook, we do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap an extra IP header on incoming packets so they can be routed to the backend. In the v4 tunnel driver, when these packets fall on the default tunl0 device, the behavior is to decapsulate them and drop them back on the stack. So our setup is that tunl0 has the VIP and eth0 has (obviously) the backend's real address. In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same behavior - if you didn't have an explicit tunnel setup, it would drop the packet. This patch brings that v4 feature to the v6 driver. The same IPv6 address checks are performed as with any normal tunnel, but as the fallback tunnel endpoint addresses are unspecified, the checks must be performed on a per-packet basis, rather than at tunnel configuration time. [Patch description modified by phil@ipom.com] Signed-off-by: Ville Nuorvala <ville.nuorvala@gmail.com> Tested-by: Phil Dibowitz <phil@ipom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-29 00:52:32 -07:00
Neal Cardwell	9f10d3f6f9	tcp: plug dst leak in tcp_v6_conn_request() The code in tcp_v6_conn_request() was implicitly assuming that tcp_v6_send_synack() would take care of dst_release(), much as tcp_v4_send_synack() already does. This resulted in tcp_v6_conn_request() leaking a dst if sysctl_tw_recycle is enabled. This commit restructures tcp_v6_send_synack() so that it accepts a dst pointer and takes care of releasing the dst that is passed in, to plug the leak and avoid future surprises by bringing the IPv6 behavior in line with the IPv4 side. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 17:54:03 -07:00
Neal Cardwell	9494218fba	tcp: use inet6_csk_route_req() in tcp_v6_send_synack() With the recent change (earlier in this patch series) to set flowi6_oif to treq->iif in inet6_csk_route_req(), the dst lookup in these two functions is now identical, so tcp_v6_send_synack() can now just call inet6_csk_route_req(), to reduce code duplication and keep things closer to the IPv4 side, which is structured this way. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 17:53:50 -07:00
Neal Cardwell	3840a06e60	tcp: pass fl6 to inet6_csk_route_req() This commit changes inet_csk_route_req() so that it uses a pointer to a struct flowi6, rather than allocating its own on the stack. This brings its behavior in line with its IPv4 cousin, inet_csk_route_req(), and allows a follow-on patch to fix a dst leak. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 17:53:50 -07:00
Neal Cardwell	9247869ee6	tcp: fix inet6_csk_route_req() for link-local addresses Fix inet6_csk_route_req() to use as the flowi6_oif the treq->iif, which is correctly fixed up in tcp_v6_conn_request() to handle the case of link-local addresses. This brings it in line with the tcp_v6_send_synack() code, which is already correctly using the treq->iif in this way. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 17:53:50 -07:00
David S. Miller	b26d344c6b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/caif/caif_hsi.c drivers/net/usb/qmi_wwan.c The qmi_wwan merge was trivial. The caif_hsi.c, on the other hand, was not. It's a conflict between `1c385f1fdf` ("caif-hsi: Replace platform device with ops structure.") in the net-next tree and commit `39abbaef19` ("caif-hsi: Postpone init of HIS until open()") in the net tree. I did my best with that one and will ask Sjur to check it out. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 17:37:00 -07:00
Eric Dumazet	22911fc581	net: skb_free_datagram_locked() doesnt drop all packets dropwatch wrongly diagnose all received UDP packets as drops. This patch removes trace_kfree_skb() done in skb_free_datagram_locked(). Locations calling skb_free_datagram_locked() should do it on their own. As a result, drops are accounted on the right function. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 15:40:57 -07:00
Thomas Graf	74a0bd7d0e	ip6mr: Do not use RTA_PUT() macros Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 15:36:44 -07:00
Gao feng	8fc0278168	netfilter: nf_ct_icmpv6: add icmpv6_kmemdup_sysctl_table function Split sysctl function into smaller chucks to cleanup code and prepare patches to reduce ifdef pollution. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-27 19:16:31 +02:00
Gao feng	f1caad2745	netfilter: nf_conntrack: prepare l4proto->init_net cleanup l4proto->init contain quite redundant code. We can simplify this by adding a new parameter l3proto. This patch prepares that code simplification. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-27 18:31:14 +02:00
David McCullough	4dc27d1cf3	net/ipv6/route.c: packets originating on device match lo Fix to allow IPv6 packets originating locally to match rules with the "iff" set to "lo". This allows IPv6 rule matching work the same as it does for IPv4. From the iproute2 man page: iif NAME select the incoming device to match. If the interface is loop‐ back, the rule only matches packets originating from this host. This means that you may create separate routing tables for for‐ warded and local packets and, hence, completely segregate them. Signed-off-by: David McCullough <david_mccullough@mcafee.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-25 23:54:32 -07:00
Neal Cardwell	437c5b53f6	tcp: heed result of security_inet_conn_request() in tcp_v6_conn_request() If security_inet_conn_request() returns non-zero then TCP/IPv6 should drop the request, just as in TCP/IPv4 and DCCP in both IPv4 and IPv6. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-25 16:05:19 -07:00
David S. Miller	e486463e82	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/usb/qmi_wwan.c net/batman-adv/translation-table.c net/ipv6/route.c qmi_wwan.c resolution provided by Bjørn Mork. batman-adv conflict is dealing merely with the changes of global function names to have a proper subsystem prefix. ipv6's route.c conflict is merely two side-by-side additions of network namespace methods. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-25 15:50:32 -07:00
Eric Dumazet	fa809e2fd6	ipv6: fib: fix fib dump restart Commit `2bec5a369e` (ipv6: fib: fix crash when changing large fib while dumping it) introduced ability to restart the dump at tree root, but failed to skip correctly a count of already dumped entries. Code didn't match Patrick intent. We must skip exactly the number of already dumped entries. Note that like other /proc/net files or netlink producers, we could still dump some duplicates entries. Reported-by: Debabrata Banerjee <dbavatar@gmail.com> Reported-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-25 15:37:19 -07:00
David S. Miller	f9242b6b28	inet: Sanitize inet{,6} protocol demux. Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay are just a straight arrays. Remove all unnecessary hash masking. Document MAX_INET_PROTOS. Use RAW_HTABLE_SIZE when appropriate. Reported-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-19 18:56:21 -07:00
Thomas Graf	d189634eca	ipv6: Move ipv6 proc file registration to end of init order /proc/net/ipv6_route reflects the contents of fib_table_hash. The proc handler is installed in ip6_route_net_init() whereas fib_table_hash is allocated in fib6_net_init() _after_ the proc handler has been installed. This opens up a short time frame to access fib_table_hash with its pants down. Move the registration of the proc files to a later point in the init order to avoid the race. Tested :-) Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-18 18:38:50 -07:00
David S. Miller	82f437b950	Merge branch 'master' of git://1984.lsi.us.es/nf-next Pablo says: ==================== This is the second batch of Netfilter updates for net-next. It contains the kernel changes for the new user-space connection tracking helper infrastructure. More details on this infrastructure are provides here: http://lwn.net/Articles/500196/ Still, I plan to provide some official documentation through the conntrack-tools user manual on how to setup user-space utilities for this. So far, it provides two helper in user-space, one for NFSv3 and another for Oracle/SQLnet/TNS. Yet in my TODO list. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-16 15:23:35 -07:00
Pablo Neira Ayuso	12f7a50533	netfilter: add user-space connection tracking helper infrastructure There are good reasons to supports helpers in user-space instead: * Rapid connection tracking helper development, as developing code in user-space is usually faster. * Reliability: A buggy helper does not crash the kernel. Moreover, we can monitor the helper process and restart it in case of problems. * Security: Avoid complex string matching and mangling in kernel-space running in privileged mode. Going further, we can even think about running user-space helpers as a non-root process. * Extensibility: It allows the development of very specific helpers (most likely non-standard proprietary protocols) that are very likely not to be accepted for mainline inclusion in the form of kernel-space connection tracking helpers. This patch adds the infrastructure to allow the implementation of user-space conntrack helpers by means of the new nfnetlink subsystem `nfnetlink_cthelper' and the existing queueing infrastructure (nfnetlink_queue). I had to add the new hook NF_IP6_PRI_CONNTRACK_HELPER to register ipv[4\|6]_helper which results from splitting ipv[4\|6]_confirm into two pieces. This change is required not to break NAT sequence adjustment and conntrack confirmation for traffic that is enqueued to our user-space conntrack helpers. Basic operation, in a few steps: 1) Register user-space helper by means of `nfct': nfct helper add ftp inet tcp [ It must be a valid existing helper supported by conntrack-tools ] 2) Add rules to enable the FTP user-space helper which is used to track traffic going to TCP port 21. For locally generated packets: iptables -I OUTPUT -t raw -p tcp --dport 21 -j CT --helper ftp For non-locally generated packets: iptables -I PREROUTING -t raw -p tcp --dport 21 -j CT --helper ftp 3) Run the test conntrackd in helper mode (see example files under doc/helper/conntrackd.conf conntrackd 4) Generate FTP traffic going, if everything is OK, then conntrackd should create expectations (you can check that with `conntrack': conntrack -E expect [NEW] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp [DESTROY] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp This confirms that our test helper is receiving packets including the conntrack information, and adding expectations in kernel-space. The user-space helper can also store its private tracking information in the conntrack structure in the kernel via the CTA_HELP_INFO. The kernel will consider this a binary blob whose layout is unknown. This information will be included in the information that is transfered to user-space via glue code that integrates nfnetlink_queue and ctnetlink. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-16 15:40:02 +02:00
David S. Miller	aee289baaa	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/ipv6/route.c Pull in 'net' again to get the revert of Thomas's change which introduced regressions. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-16 01:23:04 -07:00
David S. Miller	e8803b6c38	Revert "ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_route" This reverts commit `2a0c451ade`. It causes crashes, because now ip6_null_entry is used before it is initialized. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-16 01:12:19 -07:00
David S. Miller	42ae66c80d	ipv6: Fix types of ip6_update_pmtu(). The mtu should be a __be32, not the mark. Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-15 20:01:57 -07:00
David S. Miller	7e52b33bd5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/ipv6/route.c This deals with a merge conflict between the net-next addition of the inetpeer network namespace ops, and Thomas Graf's bug fix in `2a0c451ade` which makes sure we don't register /proc/net/ipv6_route before it is actually safe to do so. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-15 15:51:55 -07:00
Thomas Graf	2a0c451ade	ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_route /proc/net/ipv6_route reflects the contents of fib_table_hash. The proc handler is installed in ip6_route_net_init() whereas fib_table_hash is allocated in fib6_net_init() _after_ the proc handler has been installed. This opens up a short time frame to access fib_table_hash with its pants down. fib6_init() as a whole can't be moved to an earlier position as it also registers the rtnetlink message handlers which should be registered at the end. Therefore split it into fib6_init() which is run early and fib6_init_late() to register the rtnetlink message handlers. Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-15 15:30:15 -07:00
David S. Miller	81aded2467	ipv6: Handle PMTU in ICMP error handlers. One tricky issue on the ipv6 side vs. ipv4 is that the ICMP callouts to handle the error pass the 32-bit info cookie in network byte order whereas ipv4 passes it around in host byte order. Like the ipv4 side, we have two helper functions. One for when we have a socket context and one for when we do not. ip6ip6 tunnels are not handled here, because they handle PMTU events by essentially relaying another ICMP packet-too-big message back to the original sender. This patch allows us to get rid of rt6_do_pmtu_disc(). It handles all kinds of situations that simply cannot happen when we do the PMTU update directly using a fully resolved route. In fact, the "plen == 128" check in ip6_rt_update_pmtu() can very likely be removed or changed into a BUG_ON() check. We should never have a prefixed ipv6 route when we get there. Another piece of strange history here is that TCP and DCCP, unlike in ipv4, never invoke the update_pmtu() method from their ICMP error handlers. This is incredibly astonishing since this is the context where we have the most accurate context in which to make a PMTU update, namely we have a fully connected socket and associated cached socket route. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-15 14:54:11 -07:00
David S. Miller	3639339553	ipv4: Handle PMTU in all ICMP error handlers. With ip_rt_frag_needed() removed, we have to explicitly update PMTU information in every ICMP error handler. Create two helper functions to facilitate this. 1) ipv4_sk_update_pmtu() This updates the PMTU when we have a socket context to work with. 2) ipv4_update_pmtu() Raw version, used when no socket context is available. For this interface, we essentially just pass in explicit arguments for the flow identity information we would have extracted from the socket. And you'll notice that ipv4_sk_update_pmtu() is simply implemented in terms of ipv4_update_pmtu() Note that __ip_route_output_key() is used, rather than something like ip_route_output_flow() or ip_route_output_key(). This is because we absolutely do not want to end up with a route that does IPSEC encapsulation and the like. Instead, we only want the route that would get us to the node described by the outermost IP header. Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-14 22:22:07 -07:00
David S. Miller	43b03f1f6d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: MAINTAINERS drivers/net/wireless/iwlwifi/pcie/trans.c The iwlwifi conflict was resolved by keeping the code added in 'net' that turns off the buggy chip feature. The MAINTAINERS conflict was merely overlapping changes, one change updated all the wireless web site URLs and the other changed some GIT trees to be Johannes's instead of John's. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-12 21:59:18 -07:00
Michel Machado	95603e2293	net-next: add dev_loopback_xmit() to avoid duplicate code Add dev_loopback_xmit() in order to deduplicate functions ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c). I was about to reinvent the wheel when I noticed that ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I need and are not IP-only functions, but they were not available to reuse elsewhere. ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I understand that this is harmless, and should be in dev_loopback_xmit(). Signed-off-by: Michel Machado <michel@digirati.com.br> CC: "David S. Miller" <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jpirko@redhat.com> CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-12 18:51:09 -07:00

1 2 3 4 5 ...

3114 Commits