linux

Commit Graph

Author	SHA1	Message	Date
Florian Westphal	9500507c61	netfilter: conntrack: remove timer from ecache extension This brings the (per-conntrack) ecache extension back to 24 bytes in size (was 152 byte on x86_64 with lockdep on). When event delivery fails, re-delivery is attempted via work queue. Redelivery is attempted at least every 0.1 seconds, but can happen more frequently if userspace is not congested. The nf_ct_release_dying_list() function is removed. With this patch, ownership of the to-be-redelivered conntracks (on-dying-list-with-DYING-bit not yet set) is with the work queue, which will release the references once event is out. Joint work with Pablo Neira Ayuso. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-06-25 19:15:38 +02:00
WANG Cong	122ff243f5	ipv4: make ip_local_reserved_ports per netns ip_local_port_range is already per netns, so should ip_local_reserved_ports be. And since it is none by default we don't actually need it when we don't enable CONFIG_SYSCTL. By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port() Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-14 15:31:45 -04:00
Lorenzo Colitti	84f39b08d7	net: support marking accepting TCP sockets When using mark-based routing, sockets returned from accept() may need to be marked differently depending on the incoming connection request. This is the case, for example, if different socket marks identify different networks: a listening socket may want to accept connections from all networks, but each connection should be marked with the network that the request came in on, so that subsequent packets are sent on the correct network. This patch adds a sysctl to mark TCP sockets based on the fwmark of the incoming SYN packet. If enabled, and an unmarked socket receives a SYN, then the SYN packet's fwmark is written to the connection's inet_request_sock, and later written back to the accepted socket when the connection is established. If the socket already has a nonzero mark, then the behaviour is the same as it is today, i.e., the listening socket's fwmark is used. Black-box tested using user-mode linux: - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the mark of the incoming SYN packet. - The socket returned by accept() is marked with the mark of the incoming SYN packet. - Tested with syncookies=1 and syncookies=2. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-13 18:35:09 -04:00
Lorenzo Colitti	e110861f86	net: add a sysctl to reflect the fwmark on replies Kernel-originated IP packets that have no user socket associated with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.) are emitted with a mark of zero. Add a sysctl to make them have the same mark as the packet they are replying to. This allows an administrator that wishes to do so to use mark-based routing, firewalling, etc. for these replies by marking the original packets inbound. Tested using user-mode linux: - ICMP/ICMPv6 echo replies and errors. - TCP RST packets (IPv4 and IPv6). Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-13 18:35:08 -04:00
Cong Wang	ba6b918ab2	ping: move ping_group_range out of CONFIG_SYSCTL Similarly, when CONFIG_SYSCTL is not set, ping_group_range should still work, just that no one can change it. Therefore we should move it out of sysctl_net_ipv4.c. And, it should not share the same seqlock with ip_local_port_range. BTW, rename it to ->ping_group_range instead. Cc: David S. Miller <davem@davemloft.net> Cc: Francois Romieu <romieu@fr.zoreil.com> Reported-by: Stefan de Konink <stefan@konink.de> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-08 22:50:47 -04:00
Cong Wang	c9d8f1a642	ipv4: move local_port_range out of CONFIG_SYSCTL When CONFIG_SYSCTL is not set, ip_local_port_range should still work, just that no one can change it. Therefore we should move it out of sysctl_inet.c. Also, rename it to ->ip_local_ports instead. Cc: David S. Miller <davem@davemloft.net> Cc: Francois Romieu <romieu@fr.zoreil.com> Reported-by: Stefan de Konink <stefan@konink.de> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-08 22:50:47 -04:00
David S. Miller	e86e180b82	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next, most relevantly they are: * cleanup to remove double semicolon from stephen hemminger. * calm down sparse warning in xt_ipcomp, from Fan Du. * nf_ct_labels support for nf_tables, from Florian Westphal. * new macros to simplify rcu dereferences in the scope of nfnetlink and nf_tables, from Patrick McHardy. * Accept queue and drop (including reason for drop) to verdict parsing in nf_tables, also from Patrick. * Remove unused random seed initialization in nfnetlink_log, from Florian Westphal. * Allow to attach user-specific information to nf_tables rules, useful to attach user comments to rule, from me. * Return errors in ipset according to the manpage documentation, from Jozsef Kadlecsik. * Fix coccinelle warnings related to incorrect bool type usage for ipset, from Fengguang Wu. * Add hash:ip,mark set type to ipset, from Vytas Dauksa. * Fix message for each spotted by ipset for each netns that is created, from Ilia Mirkin. * Add forceadd option to ipset, which evicts a random entry from the set if it becomes full, from Josh Hunt. * Minor IPVS cleanups and fixes from Andi Kleen and Tingwei Liu. * Improve conntrack scalability by removing a central spinlock, original work from Eric Dumazet. Jesper Dangaard Brouer took them over to address remaining issues. Several patches to prepare this change come in first place. * Rework nft_hash to resolve bugs (leaking chain, missing rcu synchronization on element removal, etc. from Patrick McHardy. * Restore context in the rule deletion path, as we now release rule objects synchronously, from Patrick McHardy. This gets back event notification for anonymous sets. * Fix NAT family validation in nft_nat, also from Patrick. * Improve scalability of xt_connlimit by using an array of spinlocks and by introducing a rb-tree of hashtables for faster lookup of accounted objects per network. This patch was preceded by several patches and refactorizations to accomodate this change including the use of kmem_cache, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-17 15:06:24 -04:00
Eric Dumazet	d32d9bb85c	flowcache: restore a single flow_cache kmem_cache It is not legal to create multiple kmem_cache having the same name. flowcache can use a single kmem_cache, no need for a per netns one. Fixes: `ca925cf153` ("flowcache: Make flow cache name space aware") Reported-by: Jakub Kicinski <moorray3@wp.pl> Tested-by: Jakub Kicinski <moorray3@wp.pl> Tested-by: Fan Du <fan.du@windriver.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-10 21:45:11 -04:00
Jesper Dangaard Brouer	93bb0ceb75	netfilter: conntrack: remove central spinlock nf_conntrack_lock nf_conntrack_lock is a monolithic lock and suffers from huge contention on current generation servers (8 or more core/threads). Perf locking congestion is clear on base kernel: - 72.56% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock_bh - _raw_spin_lock_bh + 25.33% init_conntrack + 24.86% nf_ct_delete_from_lists + 24.62% __nf_conntrack_confirm + 24.38% destroy_conntrack + 0.70% tcp_packet + 2.21% ksoftirqd/6 [kernel.kallsyms] [k] fib_table_lookup + 1.15% ksoftirqd/6 [kernel.kallsyms] [k] __slab_free + 0.77% ksoftirqd/6 [kernel.kallsyms] [k] inet_getpeer + 0.70% ksoftirqd/6 [nf_conntrack] [k] nf_ct_delete + 0.55% ksoftirqd/6 [ip_tables] [k] ipt_do_table This patch change conntrack locking and provides a huge performance improvement. SYN-flood attack tested on a 24-core E5-2695v2(ES) with 10Gbit/s ixgbe (with tool trafgen): Base kernel: 810.405 new conntrack/sec After patch: 2.233.876 new conntrack/sec Notice other floods attack (SYN+ACK or ACK) can easily be deflected using: # iptables -A INPUT -m state --state INVALID -j DROP # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0 Use an array of hashed spinlocks to protect insertions/deletions of conntracks into the hash table. 1024 spinlocks seem to give good results, at minimal cost (4KB memory). Due to lockdep max depth, 1024 becomes 8 if CONFIG_LOCKDEP=y The hash resize is a bit tricky, because we need to take all locks in the array. A seqcount_t is used to synchronize the hash table users with the resizing process. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-03-07 11:41:13 +01:00
Jesper Dangaard Brouer	b7779d06f9	netfilter: conntrack: spinlock per cpu to protect special lists. One spinlock per cpu to protect dying/unconfirmed/template special lists. (These lists are now per cpu, a bit like the untracked ct) Add a @cpu field to nf_conn, to make sure we hold the appropriate spinlock at removal time. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-03-07 11:40:38 +01:00
Alexander Aring	7240cdec60	6lowpan: handling 6lowpan fragmentation via inet_frag api This patch drops the current way of 6lowpan fragmentation on receiving side and replace it with a implementation which use the inet_frag api. The old fragmentation handling has some race conditions and isn't rfc4944 compatible. Also adding support to match fragments on destination address, source address, tag value and datagram_size which is missing in the current implementation. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-02-28 17:05:22 -05:00
Alexander Aring	633fc86ff6	net: ns: add ieee802154_6lowpan namespace This patch adds necessary ieee802154 6lowpan namespace to provide the inet_frag information. This is a initial support for handling 6lowpan fragmentation with the inet_frag api. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-02-28 17:05:22 -05:00
Steffen Klassert	1a1ccc96ab	xfrm: Remove caching of xfrm_policy_sk_bundles We currently cache socket policy bundles at xfrm_policy_sk_bundles. These cached bundles are never used. Instead we create and cache a new one whenever xfrm_lookup() is called on a socket policy. Most protocols cache the used routes to the socket, so let's remove the unused caching of socket policy bundles in xfrm. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2014-02-19 10:35:43 +01:00
Fan Du	ca925cf153	flowcache: Make flow cache name space aware Inserting a entry into flowcache, or flushing flowcache should be based on per net scope. The reason to do so is flushing operation from fat netns crammed with flow entries will also making the slim netns with only a few flow cache entries go away in original implementation. Since flowcache is tightly coupled with IPsec, so it would be easier to put flow cache global parameters into xfrm namespace part. And one last thing needs to do is bumping flow cache genid, and flush flow cache should also be made in per net style. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2014-02-12 07:02:11 +01:00
Florent Fourcot	6444f72b4b	ipv6: add flowlabel_consistency sysctl With the introduction of IPV6_FL_F_REFLECT, there is no guarantee of flow label unicity. This patch introduces a new sysctl to protect the old behaviour, enable by default. Changelog of V3: * rename ip6_flowlabel_consistency to flowlabel_consistency * use net_info_ratelimited() * checkpatch cleanups Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-19 17:12:31 -08:00
FX Le Bail	ec35b61ea5	IPv6: move the anycast_src_echo_reply sysctl to netns_sysctl_ipv6 This change move anycast_src_echo_reply sysctl with other ipv6 sysctls. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-14 18:18:22 -08:00
Hannes Frederic Sowa	f87c10a8aa	ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-13 11:22:54 -08:00
Patrick McHardy	1d49144c0a	netfilter: nf_tables: add "inet" table for IPv4/IPv6 This patch adds a new table family and a new filter chain that you can use to attach IPv4 and IPv6 rules. This should help to simplify rule-set maintainance in dual-stack setups. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-01-07 23:57:25 +01:00
FX Le Bail	509aba3b0d	IPv6: add the option to use anycast addresses as source addresses in echo reply This change allows to follow a recommandation of RFC4942. - Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses as source addresses for ICMPv6 echo reply. This sysctl is false by default to preserve existing behavior. - Add inline check ipv6_anycast_destination(). - Use them in icmpv6_echo_reply(). Reference: RFC4942 - IPv6 Transition/Coexistence Security Considerations (http://tools.ietf.org/html/rfc4942#section-2.1.6) 2.1.6. Anycast Traffic Identification and Security [...] To avoid exposing knowledge about the internal structure of the network, it is recommended that anycast servers now take advantage of the ability to return responses with the anycast address as the source address if possible. Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-07 15:51:39 -05:00
David S. Miller	855404efae	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== netfilter/IPVS updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: * Add full port randomization support. Some crazy researchers found a way to reconstruct the secure ephemeral ports that are allocated in random mode by sending off-path bursts of UDP packets to overrun the socket buffer of the DNS resolver to trigger retransmissions, then if the timing for the DNS resolution done by a client is larger than usual, then they conclude that the port that received the burst of UDP packets is the one that was opened. It seems a bit aggressive method to me but it seems to work for them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a new NAT mode to fully randomize ports using prandom. * Add a new classifier to x_tables based on the socket net_cls set via cgroups. These includes two patches to prepare the field as requested by Zefan Li. Also from Daniel Borkmann. * Use prandom instead of get_random_bytes in several locations of the netfilter code, from Florian Westphal. * Allow to use the CTA_MARK_MASK in ctnetlink when mangling the conntrack mark, also from Florian Westphal. * Fix compilation warning due to unused variable in IPVS, from Geert Uytterhoeven. * Add support for UID/GID via nfnetlink_queue, from Valentina Giusti. * Add IPComp extension to x_tables, from Fan Du. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-05 20:18:50 -05:00
David S. Miller	1669cb9855	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2013-12-19 1) Use the user supplied policy index instead of a generated one if present. From Fan Du. 2) Make xfrm migration namespace aware. From Fan Du. 3) Make the xfrm state and policy locks namespace aware. From Fan Du. 4) Remove ancient sleeping when the SA is in acquire state, we now queue packets to the policy instead. This replaces the sleeping code. 5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the posibility to sleep. The sleeping code is gone, so remove it. 6) Check user specified spi for IPComp. Thr spi for IPcomp is only 16 bit wide, so check for a valid value. From Fan Du. 7) Export verify_userspi_info to check for valid user supplied spi ranges with pfkey and netlink. From Fan Du. 8) RFC3173 states that if the total size of a compressed payload and the IPComp header is not smaller than the size of the original payload, the IP datagram must be sent in the original non-compressed form. These packets are dropped by the inbound policy check because they are not transformed. Document the need to set 'level use' for IPcomp to receive such packets anyway. From Fan Du. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-19 18:37:49 -05:00
Hannes Frederic Sowa	974eda11c5	inet: make no_pmtu_disc per namespace and kill ipv4_config The other field in ipv4_config, log_martians, was converted to a per-interface setting, so we can just remove the whole structure. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-18 16:58:20 -05:00
Jesper Dangaard Brouer	8cf4d6a224	net: reorder struct netns_ct for better cache-line usage Reorder struct netns_ct so that atomic_t "count" changes don't slowdown users of read mostly fields. This is based on Eric Dumazet's proposed patch: "netfilter: conntrack: remove the central spinlock" http://thread.gmane.org/gmane.linux.network/268758/focus=47306 The tricky part of cache-aligning this structure, that it is getting inlined in struct net (include/net/net_namespace.h), thus changes to other netns_xxx structures affects our alignment. Eric's original patch contained an ambiguity on 32-bit regarding alignment in struct net. This patch also takes 32-bit into account, and in case of changed (struct net) alignment sysctl_xxx entries have been ordered according to how often they are accessed. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-12-13 12:55:55 +01:00
Steffen Klassert	5b8ef3415a	xfrm: Remove ancient sleeping when the SA is in acquire state We now queue packets to the policy if the states are not yet resolved, this replaces the ancient sleeping code. Also the sleeping can cause indefinite task hangs if the needed state does not get resolved. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2013-12-06 07:24:31 +01:00
Fan Du	283bc9f35b	xfrm: Namespacify xfrm state/policy locks By semantics, xfrm layer is fully name space aware, so will the locks, e.g. xfrm_state/pocliy_lock. Ensure exclusive access into state/policy link list for different name space with one global lock is not right in terms of semantics aspect at first place, as they are indeed mutually independent with each other, but also more seriously causes scalability problem. One practical scenario is on a Open Network Stack, more than hundreds of lxc tenants acts as routers within one host, a global xfrm_state/policy_lock becomes the bottleneck. But onces those locks are decoupled in a per-namespace fashion, locks contend is just with in specific name space scope, without causing additional SPD/SAD access delay for other name space. Also this patch improve scalability while as without changing original xfrm behavior. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2013-12-06 06:45:06 +01:00
Eric W. Biederman	a4fe34bf90	tcp_memcontrol: Remove the per netns control. The code that is implemented is per memory cgroup not per netns, and having per netns bits is just confusing. Remove the per netns bits to make it easier to see what is really going on. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-21 18:43:02 -04:00
Pablo Neira Ayuso	ed683f138b	netfilter: nf_tables: add ARP filtering support This patch registers the ARP family and he filter chain type for this family. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-10-14 18:01:03 +02:00
Pablo Neira Ayuso	0628b123c9	netfilter: nfnetlink: add batch support and use it from nf_tables This patch adds a batch support to nfnetlink. Basically, it adds two new control messages: * NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch, the nfgenmsg->res_id indicates the nfnetlink subsystem ID. * NFNL_MSG_BATCH_END, that results in the invocation of the ss->commit callback function. If not specified or an error ocurred in the batch, the ss->abort function is invoked instead. The end message represents the commit operation in nftables, the lack of end message results in an abort. This patch also adds the .call_batch function that is only called from the batch receival path. This patch adds atomic rule updates and dumps based on bitmask generations. This allows to atomically commit a set of rule-set updates incrementally without altering the internal state of existing nf_tables expressions/matches/targets. The idea consists of using a generation cursor of 1 bit and a bitmask of 2 bits per rule. Assuming the gencursor is 0, then the genmask (expressed as a bitmask) can be interpreted as: 00 active in the present, will be active in the next generation. 01 inactive in the present, will be active in the next generation. 10 active in the present, will be deleted in the next generation. ^ gencursor Once you invoke the transition to the next generation, the global gencursor is updated: 00 active in the present, will be active in the next generation. 01 active in the present, needs to zero its future, it becomes 00. 10 inactive in the present, delete now. ^ gencursor If a dump is in progress and nf_tables enters a new generation, the dump will stop and return -EBUSY to let userspace know that it has to retry again. In order to invalidate dumps, a global genctr counter is increased everytime nf_tables enters a new generation. This new operation can be used from the user-space utility that controls the firewall, eg. nft -f restore The rule updates contained in `file' will be applied atomically. cat file ----- add filter INPUT ip saddr 1.1.1.1 counter accept #1 del filter INPUT ip daddr 2.2.2.2 counter drop #2 -EOF- Note that the rule 1 will be inactive until the transition to the next generation, the rule 2 will be evicted in the next generation. There is a penalty during the rule update due to the branch misprediction in the packet matching framework. But that should be quickly resolved once the iteration over the commit list that contain rules that require updates is finished. Event notification happens once the rule-set update has been committed. So we skip notifications is case the rule-set update is aborted, which can happen in case that the rule-set is tested to apply correctly. This patch squashed the following patches from Pablo: * nf_tables: atomic rule updates and dumps * nf_tables: get rid of per rule list_head for commits * nf_tables: use per netns commit list * nfnetlink: add batch support and use it from nf_tables * nf_tables: all rule updates are transactional * nf_tables: attach replacement rule after stale one * nf_tables: do not allow deletion/replacement of stale rules * nf_tables: remove unused NFTA_RULE_FLAGS Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-10-14 18:01:01 +02:00
Pablo Neira Ayuso	99633ab29b	netfilter: nf_tables: complete net namespace support Register family per netnamespace to ensure that sets are only visible in its approapriate namespace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-10-14 18:00:59 +02:00
Eric W. Biederman	0bbf87d852	net ipv4: Convert ipv4.ip_local_port_range to be per netns v3 - Move sysctl_local_ports from a global variable into struct netns_ipv4. - Modify inet_get_local_port_range to take a struct net, and update all of the callers. - Move the initialization of sysctl_local_ports into sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c v2: - Ensure indentation used tabs - Fixed ip.h so it applies cleanly to todays net-next v3: - Compile fixes of strange callers of inet_get_local_port_range. This patch now successfully passes an allmodconfig build. Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range Originally-by: Samya <samya@twitter.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-30 21:59:38 -07:00
David S. Miller	71acc0ddd4	Revert "net: sctp: convert sctp_checksum_disable module param into sctp sysctl" This reverts commit `cda5f98e36`. As per Vlad's request. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-09 13:09:41 -07:00
Daniel Borkmann	cda5f98e36	net: sctp: convert sctp_checksum_disable module param into sctp sysctl Get rid of the last module parameter for SCTP and make this configurable via sysctl for SCTP like all the rest of SCTP's configuration knobs. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-09 11:33:02 -07:00
fan.du	ca4c3fc24e	net: split rt_genid for ipv4 and ipv6 Current net name space has only one genid for both IPv4 and IPv6, it has below drawbacks: - Add/delete an IPv4 address will invalidate all IPv6 routing table entries. - Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table entries even when the policy is only applied for one address family. Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6 separately in a fine granularity. Signed-off-by: Fan Du <fan.du@windriver.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-07-31 14:56:36 -07:00
Pablo Neira Ayuso	de94c4591b	netfilter: {ipt,ebt}_ULOG: rise warning on deprecation This target has been superseded by NFLOG. Spot a warning so we prepare removal in a couple of years. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>	2013-05-23 14:23:16 +02:00
Gao feng	30e0c6a6be	netfilter: nf_log: prepare net namespace support for loggers This patch adds netns support to nf_log and it prepares netns support for existing loggers. It is composed of four major changes. 1) nf_log_register has been split to two functions: nf_log_register and nf_log_set. The new nf_log_register is used to globally register the nf_logger and nf_log_set is used for enabling pernet support from nf_loggers. Per netns is not yet complete after this patch, it comes in separate follow up patches. 2) Add net as a parameter of nf_log_bind_pf. Per netns is not yet complete after this patch, it only allows to bind the nf_logger to the protocol family from init_net and it skips other cases. 3) Adapt all nf_log_packet callers to pass netns as parameter. After this patch, this function only works for init_net. 4) Make the sysctl net/netfilter/nf_log pernet. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 20:12:54 +02:00
Gao feng	f3c1a44a22	netfilter: make /proc/net/netfilter pernet This patch makes this proc dentry pernet. So far only init_net had a /proc/net/netfilter directory. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-05 19:35:02 +02:00
Nicolas Dichtel	63998ac24f	ipv6: provide addr and netconf dump consistency info This patch adds a dev_addr_genid for IPv6. The goal is to use it, combined with dev_base_seq to check if a change occurs during a netlink dump. If a change is detected, the flag NLM_F_DUMP_INTR is set in the first message after the dump was interrupted. Note that only dump of unicast addresses is checked (multicast and anycast are not checked). Reported-by: Junwei Zhang <junwei.zhang@6wind.com> Reported-by: Hongjun Li <hongjun.li@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-24 17:16:29 -04:00
Michal Kubecek	8d068875ca	xfrm: make gc_thresh configurable in all namespaces The xfrm gc threshold can be configured via xfrm{4,6}_gc_thresh sysctl but currently only in init_net, other namespaces always use the default value. This can substantially limit the number of IPsec tunnels that can be effectively used. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2013-02-06 11:36:29 +01:00
Florian Westphal	c539f01717	netfilter: add connlabel conntrack extension similar to connmarks, except labels are bit-based; i.e. all labels may be attached to a flow at the same time. Up to 128 labels are supported. Supporting more labels is possible, but requires increasing the ct offset delta from u8 to u16 type due to increased extension sizes. Mapping of bit-identifier to label name is done in userspace. The extension is enabled at run-time once "-m connlabel" netfilter rules are added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-01-18 00:28:15 +01:00
David S. Miller	4b87f92259	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: Documentation/networking/ip-sysctl.txt drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c Both conflicts were simply overlapping context. A build fix for qlcnic is in here too, simply removing the added devinit annotations which no longer exist. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-15 15:05:59 -05:00
Hannes Frederic Sowa	5d134f1c1f	tcp: make sysctl_tcp_ecn namespace aware As per suggestion from Eric Dumazet this patch makes tcp_ecn sysctl namespace aware. The reason behind this patch is to ease the testing of ecn problems on the internet and allows applications to tune their own use of ecn. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-06 21:09:56 -08:00
Pablo Neira Ayuso	10db9069eb	netfilter: xt_CT: recover NOTRACK target support Florian Westphal reported that the removal of the NOTRACK target (`9655050` netfilter: remove xt_NOTRACK) is breaking some existing setups. That removal was scheduled for removal since long time ago as described in Documentation/feature-removal-schedule.txt What: xt_NOTRACK Files: net/netfilter/xt_NOTRACK.c When: April 2011 Why: Superseded by xt_CT Still, people may have not notice / may have decided to stick to an old iptables version. I agree with him in that some more conservative approach by spotting some printk to warn users for some time is less agressive. Current iptables 1.4.16.3 already contains the aliasing support that makes it point to the CT target, so upgrading would fix it. Still, the policy so far has been to avoid pushing our users to upgrade. As a solution, this patch recovers the NOTRACK target inside the CT target and it now spots a warning. Reported-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-12-24 12:55:09 +01:00
Pablo Neira Ayuso	252b3e8c1b	netfilter: xt_CT: fix crash while destroy ct templates In (`d871bef` netfilter: ctnetlink: dump entries from the dying and unconfirmed lists), we assume that all conntrack objects are inserted in any of the existing lists. However, template conntrack objects were not. This results in hitting BUG_ON in the destroy_conntrack path while removing a rule that uses the CT target. This patch fixes the situation by adding the template lists, which is where template conntrack objects reside now. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-12-16 23:44:12 +01:00
Neil Horman	3c68198e75	sctp: Make hmac algorithm selection for cookie generation dynamic Currently sctp allows for the optional use of md5 of sha1 hmac algorithms to generate cookie values when establishing new connections via two build time config options. Theres no real reason to make this a static selection. We can add a sysctl that allows for the dynamic selection of these algorithms at run time, with the default value determined by the corresponding crypto library availability. This comes in handy when, for example running a system in FIPS mode, where use of md5 is disallowed, but SHA1 is permitted. Note: This new sysctl has no corresponding socket option to select the cookie hmac algorithm. I chose not to implement that intentionally, as RFC 6458 contains no option for this value, and I opted not to pollute the socket option namespace. Change notes: v2) * Updated subject to have the proper sctp prefix as per Dave M. * Replaced deafult selection options with new options that allow developers to explicitly select available hmac algs at build time as per suggestion by Vlad Y. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Vlad Yasevich <vyasevich@gmail.com> CC: "David S. Miller" <davem@davemloft.net> CC: netdev@vger.kernel.org Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-26 02:22:18 -04:00
David S. Miller	6a06e5e1bb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/team/team.c drivers/net/usb/qmi_wwan.c net/batman-adv/bat_iv_ogm.c net/ipv4/fib_frontend.c net/ipv4/route.c net/l2tp/l2tp_netlink.c The team, fib_frontend, route, and l2tp_netlink conflicts were simply overlapping changes. qmi_wwan and bat_iv_ogm were of the "use HEAD" variety. With help from Antonio Quartulli. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-28 14:40:49 -04:00
Amerigo Wang	c038a767cd	ipv6: add a new namespace for nf_conntrack_reasm As pointed by Michal, it is necessary to add a new namespace for nf_conntrack_reasm code, this prepares for the second patch. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Michal Kubeček <mkubecek@suse.cz> Cc: David Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: netfilter-devel@vger.kernel.org Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-19 17:23:28 -04:00
Nicolas Dichtel	b42664f898	netns: move net->ipv4.rt_genid to net->rt_genid This commit prepares the use of rt_genid by both IPv4 and IPv6. Initialization is left in IPv4 part. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-18 15:57:03 -04:00
Pablo Neira Ayuso	ace1fe1231	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next This merges (`3f509c6` netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation) to Patrick McHardy's IPv6 NAT changes.	2012-09-03 15:34:51 +02:00
Patrick McHardy	58a317f106	netfilter: ipv6: add IPv6 NAT support Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:17 +02:00
Patrick McHardy	c7232c9979	netfilter: add protocol independent NAT core Convert the IPv4 NAT implementation to a protocol independent core and address family specific modules. Signed-off-by: Patrick McHardy <kaber@trash.net>	2012-08-30 03:00:14 +02:00
David S. Miller	e6acb38480	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace This is an initial merge in of Eric Biederman's work to start adding user namespace support to the networking. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-24 18:54:37 -04:00
Rami Rosen	f63c45e0e6	packet: fix broken build. This patch fixes a broken build due to a missing header: ... CC net/ipv4/proc.o In file included from include/net/net_namespace.h:15, from net/ipv4/proc.c:35: include/net/netns/packet.h:11: error: field 'sklist_lock' has incomplete type ... The lock of netns_packet has been replaced by a recent patch to be a mutex instead of a spinlock, but we need to replace the header file to be linux/mutex.h instead of linux/spinlock.h as well. See commit 0fa7fa98dbcc2789409ed24e885485e645803d7f: packet: Protect packet sk list with mutex (v2) patch, Signed-off-by: Rami Rosen <rosenr@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-23 09:29:45 -07:00
Pavel Emelyanov	0fa7fa98db	packet: Protect packet sk list with mutex (v2) Change since v1: * Fixed inuse counters access spotted by Eric In patch `eea68e2f` (packet: Report socket mclist info via diag module) I've introduced a "scheduling in atomic" problem in packet diag module -- the socket list is traversed under rcu_read_lock() while performed under it sk mclist access requires rtnl lock (i.e. -- mutex) to be taken. [152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002 [152363.820573] 4 locks held by crtools/12517: [152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [<ffffffff81a2dcb5>] sock_diag_rcv+0x1f/0x3e [152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff81a2de70>] sock_diag_rcv_msg+0xdb/0x11a [152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81a67d01>] netlink_dump+0x23/0x1ab [152363.820693] #3: (rcu_read_lock){.+.+..}, at: [<ffffffff81b6a049>] packet_diag_dump+0x0/0x1af Similar thing was then re-introduced by further packet diag patches (fanount mutex and pgvec mutex for rings) :( Apart from being terribly sorry for the above, I propose to change the packet sk list protection from spinlock to mutex. This lock currently protects two modifications: * sklist * prot inuse counters The sklist modifications can be just reprotected with mutex since they already occur in a sleeping context. The inuse counters modifications are trickier -- the __this_cpu_-s are used inside, thus requiring the caller to handle the potential issues with contexts himself. Since packet sockets' counters are modified in two places only (packet_create and packet_release) we only need to protect the context from being preempted. BH disabling is not required in this case. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-22 22:58:27 -07:00
Eric W. Biederman	e1fc3b14f9	sctp: Make sysctl tunables per net Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 23:32:16 -07:00
Eric W. Biederman	ebb7e95d93	sctp: Add infrastructure for per net sysctls Start with an empty sctp_net_table that will be populated as the various tunable sysctls are made per net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 23:30:37 -07:00
Eric W. Biederman	b01a24078f	sctp: Make the mib per network namespace Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 23:30:36 -07:00
Eric W. Biederman	13d782f6b4	sctp: Make the proc files per network namespace. - Convert all of the files under /proc/net/sctp to be per network namespace. - Don't print anything for /proc/net/sctp/snmp except in the initial network namespaces as the snmp counters still have to be converted to be per network namespace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 23:29:53 -07:00
Eric W. Biederman	2ce9550350	sctp: Make the ctl_sock per network namespace - Kill sctp_get_ctl_sock, it is useless now. - Pass struct net where needed so net->sctp.ctl_sock is accessible. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 23:17:26 -07:00
Eric W. Biederman	4db67e8086	sctp: Make the address lists per network namespace - Move the address lists into struct net - Add per network namespace initialization and cleanup - Pass around struct net so it is everywhere I need it. - Rename all of the global variable references into references to the variables moved into struct net Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-14 23:12:17 -07:00
Eric W. Biederman	7064d16e16	userns: Use kgids for sysctl_ping_group_range - Store sysctl_ping_group_range as a paire of kgid_t values instead of a pair of gid_t values. - Move the kgid conversion work from ping_init_sock into ipv4_ping_group_range - For invalid cases reset to the default disabled state. With the kgid_t conversion made part of the original value sanitation from userspace understand how the code will react becomes clearer and it becomes possible to set the sysctl ping group range from something other than the initial user namespace. Cc: Vasiliy Kulikov <segoon@openwall.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-08-14 21:49:10 -07:00
Eric Dumazet	0c7462a235	ipv4: remove rt_cache_rebuild_count After IP route cache removal, rt_cache_rebuild_count is no longer used. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-30 14:53:22 -07:00
Eric Dumazet	5815d5e7aa	tcp: use hash_32() in tcp_metrics Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not guaranteed to be a power of two. Uses hash_32() instead of custom hash. tcpmhash_entries should be an unsigned int. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-20 10:59:41 -07:00
Eric Dumazet	be9f4a44e7	ipv4: tcp: remove per net tcp_sock tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) Additional features : 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree() 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, sysctl_wmem_default value was copied at boot time, so any further change would not affect tcp_sock limit) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:35:30 -07:00
David S. Miller	51c5d0c4b1	tcp: Maintain dynamic metrics in local cache. Maintain a local hash table of TCP dynamic metrics blobs. Computed TCP metrics are no longer maintained in the route metrics. The table uses RCU and an extremely simple hash so that it has low latency and low overhead. A simple hash is legitimate because we only make metrics blobs for fully established connections. Some tweaking of the default hash table sizes, metric timeouts, and the hash chain length limit certainly could use some tweaking. But the basic design seems sound. With help from Eric Dumazet and Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-10 22:39:57 -07:00
David S. Miller	f4530fa574	ipv4: Avoid overhead when no custom FIB rules are installed. If the user hasn't actually installed any custom rules, or fiddled with the default ones, don't go through the whole FIB rules layer. It's just pure overhead. Instead do what we do with CONFIG_IP_MULTIPLE_TABLES disabled, check the individual tables by hand, one by one. Also, move fib_num_tclassid_users into the ipv4 network namespace. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 22:13:13 -07:00
David S. Miller	67da255210	Merge branch 'master' of git://1984.lsi.us.es/net-next	2012-06-11 12:56:14 -07:00
Gao feng	c8a627ed06	inetpeer: add namespace support for inetpeer now inetpeer doesn't support namespace,the information will be leaking across namespace. this patch move the global vars v4_peers and v6_peers to netns_ipv4 and netns_ipv6 as a field peers. add struct pernet_operations inetpeer_ops to initial pernet inetpeer data. and change family_to_base and inet_getpeer to support namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-08 14:27:23 -07:00
Gao feng	7080ba0955	netfilter: nf_ct_icmp: add namespace support This patch adds namespace support for ICMPv6 protocol tracker. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:40 +02:00
Gao feng	4b626b9c5d	netfilter: nf_ct_icmp: add namespace support This patch adds namespace support for ICMP protocol tracker. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:40 +02:00
Gao feng	0ce490ad43	netfilter: nf_ct_udp: add namespace support This patch adds namespace support for UDP protocol tracker. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:40 +02:00
Gao feng	d2ba1fde42	netfilter: nf_ct_tcp: add namespace support This patch adds namespace support for TCP protocol tracker. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:39 +02:00
Gao feng	15f585bd76	netfilter: nf_ct_generic: add namespace support This patch adds namespace support for the generic layer 4 protocol tracker. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:39 +02:00
Gao feng	524a53e5ad	netfilter: nf_conntrack: prepare namespace support for l3 protocol trackers This patch prepares the namespace support for layer 3 protocol trackers. Basically, this modifies the following interfaces: * nf_ct_l3proto_[un]register_sysctl. * nf_conntrack_l3proto_[un]register. We add a new nf_ct_l3proto_net is used to get the pernet data of l3proto. This adds rhe new struct nf_ip_net that is used to store the sysctl header and l3proto_ipv4,l4proto_tcp(6),l4proto_udp(6),l4proto_icmp(v6) because the protos such tcp and tcp6 use the same data,so making nf_ip_net as a field of netns_ct is the easiest way to manager it. This patch also adds init_net to struct nf_conntrack_l3proto to initial the layer 3 protocol pernet data. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:39 +02:00
Gao feng	2c352f444c	netfilter: nf_conntrack: prepare namespace support for l4 protocol trackers This patch prepares the namespace support for layer 4 protocol trackers. Basically, this modifies the following interfaces: * nf_ct_[un]register_sysctl * nf_conntrack_l4proto_[un]register to include the namespace parameter. We still use init_net in this patch to prepare the ground for follow-up patches for each layer 4 protocol tracker. We add a new net_id field to struct nf_conntrack_l4proto that is used to store the pernet_operations id for each layer 4 protocol tracker. Note that AF_INET6's protocols do not need to do sysctl compat. Thus, we only register compat sysctl when l4proto.l3proto != AF_INET6. Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-06-07 14:58:39 +02:00
Eric Leblond	a900689264	netfilter: nf_ct_helper: allow to disable automatic helper assignment This patch allows you to disable automatic conntrack helper lookup based on TCP/UDP ports, eg. echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper [ Note: flows that already got a helper will keep using it even if automatic helper assignment has been disabled ] Once this behaviour has been disabled, you have to explicitly use the iptables CT target to attach helper to flows. There are good reasons to stop supporting automatic helper assignment, for further information, please read: http://www.netfilter.org/news.html#2012-04-03 This patch also adds one message to inform that automatic helper assignment is deprecated and it will be removed soon (this is spotted only once, with the first flow that gets a helper attached to make it as less annoying as possible). Signed-off-by: Eric Leblond <eric@regit.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2012-05-08 19:35:18 +02:00
Eric W. Biederman	6dceb03687	net ipv6: Don't use sysctl tables with .child entries. The sysctl core no longer natively understands sysctl tables with .child entries. Split the ipv6_table to remove the .child entries. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-20 21:22:29 -04:00
Eric Dumazet	95c9617472	net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-15 12:44:40 -04:00
Paul Gortmaker	187f1882b5	BUG: headers with BUG/BUG_ON etc. need linux/bug.h If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any other BUG variant in a static inline (i.e. not in a #define) then that header really should be including <linux/bug.h> and not just expecting it to be implicitly present. We can make this change risk-free, since if the files using these headers didn't have exposure to linux/bug.h already, they would have been causing compile failures/warnings. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2012-03-04 17:54:34 -05:00
Eric W. Biederman	5ee4433efe	netns: Fail conspicously if someone uses net_generic at an inappropriate time. By definition net_generic should never be called when it can return NULL. Fail conspicously with a BUG_ON to make it clear when people mess up that a NULL return should never happen. Recently there was a bug in the CAIF subsystem where it was registered with register_pernet_device instead of register_pernet_subsys. It was erroneously concluded that net_generic could validly return NULL and that net_assign_generic was buggy (when it was just inefficient). Hopefully this BUG_ON will prevent people to coming to similar erroneous conclusions in the futrue. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Tested-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-27 21:06:02 -05:00
Glauber Costa	3dc43e3e4d	per-netns ipv4 sysctl_tcp_mem This patch allows each namespace to independently set up its levels for tcp memory pressure thresholds. This patch alone does not buy much: we need to make this values per group of process somehow. This is achieved in the patches that follows in this patchset. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 19:04:11 -05:00
Eric Dumazet	dfd56b8b38	net: use IS_ENABLED(CONFIG_IPV6) Instead of testing defined(CONFIG_IPV6) \|\| defined(CONFIG_IPV6_MODULE) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-11 18:25:16 -05:00
David S. Miller	b3613118eb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2011-12-02 13:49:21 -05:00
Pablo Neira Ayuso	70e9942f17	netfilter: nf_conntrack: make event callback registration per-netns This patch fixes an oops that can be triggered following this recipe: 0) make sure nf_conntrack_netlink and nf_conntrack_ipv4 are loaded. 1) container is started. 2) connect to it via lxc-console. 3) generate some traffic with the container to create some conntrack entries in its table. 4) stop the container: you hit one oops because the conntrack table cleanup tries to report the destroy event to user-space but the per-netns nfnetlink socket has already gone (as the nfnetlink socket is per-netns but event callback registration is global). To fix this situation, we make the ctnl_notifier per-netns so the callback is registered/unregistered if the container is created/destroyed. Alex Bligh and Alexey Dobriyan originally proposed one small patch to check if the nfnetlink socket is gone in nfnetlink_has_listeners, but this is a very visited path for events, thus, it may reduce performance and it looks a bit hackish to check for the nfnetlink socket only to workaround this situation. As a result, I decided to follow the bigger path choice, which seems to look nicer to me. Cc: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Alex Bligh <alex@alex.org.uk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-11-22 00:34:47 +01:00
Eric Dumazet	2a24444f8f	ipv6: reduce percpu needs for icmpv6msg mibs Reading /proc/net/snmp6 on a machine with a lot of cpus is very expensive (can be ~88000 us). This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding values for all possible cpus can read 16 Mbytes of memory (32MBytes on non x86 arches) ICMP messages are not considered as fast path on a typical server, and eventually few cpus handle them anyway. We can afford an atomic operation instead of using percpu data. This saves 4096 bytes per cpu and per network namespace. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 00:12:26 -05:00
Eric Dumazet	acb32ba3de	ipv4: reduce percpu needs for icmpmsg mibs Reading /proc/net/snmp on a machine with a lot of cpus is very expensive (can be ~88000 us). This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values for all possible cpus can read 16 Mbytes of memory. ICMP messages are not considered as fast path on a typical server, and eventually few cpus handle them anyway. We can afford an atomic operation instead of using percpu data. This saves 4096 bytes per cpu and per network namespace. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-09 16:04:20 -05:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
Vasiliy Kulikov	c319b4d76b	net: ipv4: add IPPROTO_ICMP socket kind This patch adds IPPROTO_ICMP socket kind. It makes it possible to send ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages without any special privileges. In other words, the patch makes it possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In order not to increase the kernel's attack surface, the new functionality is disabled by default, but is enabled at bootup by supporting Linux distributions, optionally with restriction to a group or a group range (see below). Similar functionality is implemented in Mac OS X: http://www.manpagez.com/man/4/icmp/ A new ping socket is created with socket(PF_INET, SOCK_DGRAM, PROT_ICMP) Message identifiers (octets 4-5 of ICMP header) are interpreted as local ports. Addresses are stored in struct sockaddr_in. No port numbers are reserved for privileged processes, port 0 is reserved for API ("let the kernel pick a free number"). There is no notion of remote ports, remote port numbers provided by the user (e.g. in connect()) are ignored. Data sent and received include ICMP headers. This is deliberate to: 1) Avoid the need to transport headers values like sequence numbers by other means. 2) Make it easier to port existing programs using raw sockets. ICMP headers given to send() are checked and sanitized. The type must be ICMP_ECHO and the code must be zero (future extensions might relax this, see below). The id is set to the number (local port) of the socket, the checksum is always recomputed. ICMP reply packets received from the network are demultiplexed according to their id's, and are returned by recv() without any modifications. IP header information and ICMP errors of those packets may be obtained via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source quenches and redirects are reported as fake errors via the error queue (IP_RECVERR); the next hop address for redirects is saved to ee_info (in network order). socket(2) is restricted to the group range specified in "/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning that nobody (not even root) may create ping sockets. Setting it to "100 100" would grant permissions to the single group (to either make /sbin/ping g+s and owned by this group or to grant permissions to the "netadmins" group), "0 4294967295" would enable it for the world, "100 4294967295" would enable it for the users, but not daemons. The existing code might be (in the unlikely case anyone needs it) extended rather easily to handle other similar pairs of ICMP messages (Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply etc.). Userspace ping util & patch for it: http://openwall.info/wiki/people/segoon/ping For Openwall GNU/*/Linux it was the last step on the road to the setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels) is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs: http://mirrors.kernel.org/openwall/Owl/current/iso/ Initially this functionality was written by Pavel Kankovsky for Linux 2.4.32, but unfortunately it was never made public. All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with the patch. PATCH v3: - switched to flowi4. - minor changes to be consistent with raw sockets code. PATCH v2: - changed ping_debug() to pr_debug(). - removed CONFIG_IP_PING. - removed ping_seq_fops.owner field (unused for procfs). - switched to proc_net_fops_create(). - switched to %pK in seq_printf(). PATCH v1: - fixed checksumming bug. - CAP_NET_RAW may not create icmp sockets anymore. RFC v2: - minor cleanups. - introduced sysctl'able group range to restrict socket(2). Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-13 16:08:13 -04:00
David S. Miller	436c3b66ec	ipv4: Invalidate nexthop cache nh_saddr more correctly. Any operation that: 1) Brings up an interface 2) Adds an IP address to an interface 3) Deletes an IP address from an interface can potentially invalidate the nh_saddr value, requiring it to be recomputed. Perform the recomputation lazily using a generation ID. Reported-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-24 17:42:21 -07:00
Julian Anastasov	2553d064ff	ipvs: move struct netns_ipvs Remove include/net/netns/ip_vs.h because it depends on structures from include/net/ip_vs.h. As ipvs is pointer in struct net it is better to move struct netns_ipvs into include/net/ip_vs.h, so that we can easily use other structures in struct netns_ipvs. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-03-15 09:36:50 +09:00
Pablo Neira Ayuso	a992ca2a04	netfilter: nf_conntrack_tstamp: add flow-based timestamp extension This patch adds flow-based timestamping for conntracks. This conntrack extension is disabled by default. Basically, we use two 64-bits variables to store the creation timestamp once the conntrack has been confirmed and the other to store the deletion time. This extension is disabled by default, to enable it, you have to: echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp This patch allows to save memory for user-space flow-based loogers such as ulogd2. In short, ulogd2 does not need to keep a hashtable with the conntrack in user-space to know when they were created and destroyed, instead we use the kernel timestamp. If we want to have a sane IPFIX implementation in user-space, this nanosecs resolution timestamps are also useful. Other custom user-space applications can benefit from this via libnetfilter_conntrack. This patch modifies the /proc output to display the delta time in seconds since the flow start. You can also obtain the flow-start date by means of the conntrack-tools. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-01-19 16:00:07 +01:00
Patrick McHardy	d862a6622e	netfilter: nf_conntrack: use is_vmalloc_addr() Use is_vmalloc_addr() in nf_ct_free_hashtable() and get rid of the vmalloc flags to indicate that a hash table has been allocated using vmalloc(). Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-01-14 15:45:56 +01:00
Hans Schillstrom	763f8d0ed4	IPVS: netns, svc counters moved in ip_vs_ctl,c Last two global vars to be moved, ip_vs_ftpsvc_counter and ip_vs_nullsvc_counter. [horms@verge.net.au: removed whitespace-change-only hunk] Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	f2431e6e92	IPVS: netns, trash handling trash list per namspace, and reordering of some params in dst struct. [ horms@verge.net.au: Use cancel_delayed_work_sync() instead of cancel_rearming_delayed_work(). Found during merge conflict resoliution ] Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	f6340ee0c6	IPVS: netns, defense work timer. This patch makes defense work timer per name-space, A net ptr had to be added to the ipvs struct, since it's needed by defense_work_handler. [ horms@verge.net.au: Use cancel_delayed_work_sync() instead of cancel_rearming_delayed_work(). Found during merge conflict resoliution ] Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	a0840e2e16	IPVS: netns, ip_vs_ctl local vars moved to ipvs struct. Moving global vars to ipvs struct, except for svc table lock. Next patch for ctl will be drop-rate handling. v3 __ip_vs_mutex remains global ip_vs_conntrack_enabled(struct netns_ipvs ipvs) Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	6e67e586e7	IPVS: netns, connection hash got net as param. Connection hash table is now name space aware. i.e. net ptr >> 8 is xor:ed to the hash, and this is the first param to be compared. The net struct is 0xa40 in size ( a little bit smaller for 32 bit arch:s) and cache-line aligned, so a ptr >> 5 might be a more clever solution ? All lookups where net is compared uses net_eq() which returns 1 when netns is disabled, and the compiler seems to do something clever in that case. ip_vs_conn_fill_param() have net as first param now. Three new inlines added to keep conn struct smaller when names space is disabled. - ip_vs_conn_net() - ip_vs_conn_net_set() - ip_vs_conn_net_eq() v3 moved net compare to the end in "fast path" Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	b17fc9963f	IPVS: netns, ip_vs_stats and its procfs The statistic counter locks for every packet are now removed, and that statistic is now per CPU, i.e. no locks needed. However summing is made in ip_vs_est into ip_vs_stats struct which is moved to ipvs struc. procfs, ip_vs_stats now have a "per cpu" count and a grand total. A new function seq_file_single_net() in ip_vs.h created for handling of single_open_net() since it does not place net ptr in a struct, like others. /var/lib/lxc # cat /proc/net/ip_vs_stats_percpu Total Incoming Outgoing Incoming Outgoing CPU Conns Packets Packets Bytes Bytes 0 0 3 1 9D 34 1 0 1 2 49 70 2 0 1 2 34 76 3 1 2 2 70 74 ~ 1 7 7 18A 18E Conns/s Pkts/s Pkts/s Bytes/s Bytes/s 0 0 0 0 0 v3 ip_vs_stats reamains as before, instead ip_vs_stats_percpu is added. u64 seq lock added v4 Bug correction inbytes and outbytes as own vars.. per_cpu counter for all stats now as suggested by Julian. [horms@verge.net.au: removed whitespace-change-only hunk] Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	f131315fa2	IPVS: netns awareness to ip_vs_sync All global variables moved to struct ipvs, most external changes fixed (i.e. init_net removed) in sync_buf create + 4 replaced by sizeof(struct..) Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	29c2026fd4	IPVS: netns awareness to ip_vs_est All variables moved to struct ipvs, most external changes fixed (i.e. init_net removed) *v3 timer per ns instead of a common timer in estimator. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	ab8a5e8408	IPVS: netns awareness to ip_vs_app All variables moved to struct ipvs, most external changes fixed (i.e. init_net removed) in ip_vs_protocol param struct net *net added to: - register_app() - unregister_app() This affected almost all proto_xxx.c files Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:28 +09:00
Hans Schillstrom	9d934878e7	IPVS: netns preparation for proto_sctp In this phase (one), all local vars will be moved to ipvs struct. Remaining work, add param struct net net to a couple of functions that is common for all protos and use ip_vs_proto_data v3 Removed unuset function set_state_timeout() Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:27 +09:00
Hans Schillstrom	78b16bde10	IPVS: netns preparation for proto_udp In this phase (one), all local vars will be moved to ipvs struct. Remaining work, add param struct net net to a couple of functions that is common for all protos and use ip_vs_proto_data v3 Removed unused function set_state_timeout() Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:27 +09:00
Hans Schillstrom	4a85b96c08	IPVS: netns preparation for proto_tcp In this phase (one), all local vars will be moved to ipvs struct. Remaining work, add param struct net net to a couple of functions that is common for all protos and use all ip_vs_proto_data v3 Removed unused function as sugested by Simon Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:27 +09:00
Hans Schillstrom	252c641032	IPVS: netns, prepare protocol Add support for protocol data per name-space. in struct ip_vs_protocol, appcnt will be removed when all protos are modified for network name-space. This patch causes warnings of unused functions, they will be used when next patch will be applied. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:27 +09:00
Hans Schillstrom	b6e885ddb9	IPVS: netns awarness to lblc sheduler var sysctl_ip_vs_lblc_expiration moved to ipvs struct as sysctl_lblc_expiration procfs updated to handle this. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:27 +09:00
Hans Schillstrom	d0a1eef9c3	IPVS: netns awarness to lblcr sheduler var sysctl_ip_vs_lblcr_expiration moved to ipvs struct as sysctl_lblcr_expiration procfs updated to handle this. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:27 +09:00
Hans Schillstrom	fc723250c9	IPVS: netns to services part 1 Services hash tables got netns ptr a hash arg, While Real Servers (rs) has been moved to ipvs struct. Two new inline functions added to get net ptr from skb. Since ip_vs is called from different contexts there is two places to dig for the net ptr skb->dev or skb->sk this is handled in skb_net() and skb_sknet() Global functions, ip_vs_service_get() ip_vs_lookup_real_service() etc have got struct net net as first param. If possible get net ptr skb etc, - if not &init_net is used at this early stage of patching. ip_vs_ctl.c procfs not ready for netns yet. v3 Comments by Julian - __ip_vs_service_find and __ip_vs_svc_fwm_find are fast path, net_eq(svc->net, net) so the check is at the end now. - net = skb_net(skb) in ip_vs_out moved after check for skb_dst. Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:26 +09:00
Hans Schillstrom	61b1ab4583	IPVS: netns, add basic init per netns. Preparation for network name-space init, in this stage some empty functions exists. In most files there is a check if it is root ns i.e. init_net if (!net_eq(net, &init_net)) return ... this will be removed by the last patch, when enabling name-space. *v3 ip_vs_conn.c merge error corrected. net_ipvs #ifdef removed as sugested by Jan Engelhardt [ horms@verge.net.au: Removed whitespace-change-only hunks ] Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>	2011-01-13 10:30:26 +09:00
Jan Engelhardt	20a95a2169	netns: let net_generic take pointer-to-const args This commit is same in nature as v2.6.37-rc1-755-g3654654; the network namespace itself is not modified when calling net_generic, so the parameter can be const. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-21 10:05:10 -08:00
Eric Dumazet	8e602ce298	netns: reorder fields in struct net In a network bench, I noticed an unfortunate false sharing between 'loopback_dev' and 'count' fields in "struct net". 'count' is written each time a socket is created or destroyed, while loopback_dev might be often read in routing code. Move loopback_dev in a read mostly section of "struct net" Note: struct netns_xfrm is cache line aligned on SMP. (It contains a "struct dst_ops") Move it at the end to avoid holes, and reduce sizeof(struct net) by 128 bytes on ia32. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-17 13:49:14 -07:00
Patrick McHardy	d1db275dd3	ipv6: ip6mr: support multiple tables This patch adds support for multiple independant multicast routing instances, named "tables". Userspace multicast routing daemons can bind to a specific table instance by issuing a setsockopt call using a new option MRT6_TABLE. The table number is stored in the raw socket data and affects all following ip6mr setsockopt(), getsockopt() and ioctl() calls. By default, a single table (RT6_TABLE_DFLT) is created with a default routing rule pointing to it. Newly created pim6reg devices have the table number appended ("pim6regX"), with the exception of devices created in the default table, which are named just "pim6reg" for compatibility reasons. Packets are directed to a specific table instance using routing rules, similar to how regular routing rules work. Currently iif, oif and mark are supported as keys, source and destination addresses could be supported additionally. Example usage: - bind pimd/xorp/... to a specific table: uint32_t table = 123; setsockopt(fd, SOL_IPV6, MRT6_TABLE, &table, sizeof(table)); - create routing rules directing packets to the new table: # ip -6 mrule add iif eth0 lookup 123 # ip -6 mrule add oif eth0 lookup 123 Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-05-11 14:40:55 +02:00
Patrick McHardy	6bd5214339	ipv6: ip6mr: move mroute data into seperate structure Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-05-11 14:40:53 +02:00
Patrick McHardy	f30a778421	ipv6: ip6mr: convert struct mfc_cache to struct list_head Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-05-11 14:40:51 +02:00
Patrick McHardy	c476efbcde	ipv6: ip6mr: move unres_queue and timer to per-namespace data The unres_queue is currently shared between all namespaces. Following patches will additionally allow to create multiple multicast routing tables in each namespace. Having a single shared queue for all these users seems to excessive, move the queue and the cleanup timer to the per-namespace data to unshare it. As a side-effect, this fixes a bug in the seq file iteration functions: the first entry returned is always from the current namespace, entries returned after that may belong to any namespace. Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-05-11 14:40:48 +02:00
Neil Horman	3ee943728f	ipv4: remove ip_rt_secret timer (v4) A while back there was a discussion regarding the rt_secret_interval timer. Given that we've had the ability to do emergency route cache rebuilds for awhile now, based on a statistical analysis of the various hash chain lengths in the cache, the use of the flush timer is somewhat redundant. This patch removes the rt_secret_interval sysctl, allowing us to rely solely on the statistical analysis mechanism to determine the need for route cache flushes. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-08 01:57:52 -07:00
Jiri Pirko	05fceb4ad7	net: disallow to use net_assign_generic externally Now there's no need to use this fuction directly because it's handled by register_pernet_device. So to make this simple and easy to understand, make this static to do not tempt potentional users. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-27 15:49:02 -07:00
Patrick McHardy	f0ad0860d0	ipv4: ipmr: support multiple tables This patch adds support for multiple independant multicast routing instances, named "tables". Userspace multicast routing daemons can bind to a specific table instance by issuing a setsockopt call using a new option MRT_TABLE. The table number is stored in the raw socket data and affects all following ipmr setsockopt(), getsockopt() and ioctl() calls. By default, a single table (RT_TABLE_DEFAULT) is created with a default routing rule pointing to it. Newly created pimreg devices have the table number appended ("pimregX"), with the exception of devices created in the default table, which are named just "pimreg" for compatibility reasons. Packets are directed to a specific table instance using routing rules, similar to how regular routing rules work. Currently iif, oif and mark are supported as keys, source and destination addresses could be supported additionally. Example usage: - bind pimd/xorp/... to a specific table: uint32_t table = 123; setsockopt(fd, IPPROTO_IP, MRT_TABLE, &table, sizeof(table)); - create routing rules directing packets to the new table: # ip mrule add iif eth0 lookup 123 # ip mrule add oif eth0 lookup 123 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-13 14:49:34 -07:00
Patrick McHardy	0c12295a74	ipv4: ipmr: move mroute data into seperate structure Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-13 14:49:34 -07:00
Patrick McHardy	862465f2e7	ipv4: ipmr: convert struct mfc_cache to struct list_head Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-13 14:49:33 -07:00
Patrick McHardy	e258beb22f	ipv4: ipmr: move unres_queue and timer to per-namespace data The unres_queue is currently shared between all namespaces. Following patches will additionally allow to create multiple multicast routing tables in each namespace. Having a single shared queue for all these users seems to excessive, move the queue and the cleanup timer to the per-namespace data to unshare it. As a side-effect, this fixes a bug in the seq file iteration functions: the first entry returned is always from the current namespace, entries returned after that may belong to any namespace. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-13 14:49:32 -07:00
stephen hemminger	808f5114a9	packet: convert socket list to RCU (v3) Convert AF_PACKET to use RCU, eliminating one more reader/writer lock. There is no need for a real sk_del_node_init_rcu(), because sk_del_node_init is doing the equivalent thing to hlst_del_init_rcu already; but added some comments to try and make that obvious. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-22 15:45:56 -08:00
Tejun Heo	7d720c3e4f	percpu: add __percpu sparse annotations to net Add __percpu sparse annotations to net. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. The macro and type tricks around snmp stats make things a bit interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All snmp_mib_() users which used to cast the argument to (void ) are updated to cast it to (void __percpu *). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-16 23:05:38 -08:00
Patrick McHardy	9ab99d5a43	Merge branch 'master' of /repos/git/net-next-2.6 Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-02-10 14:17:10 +01:00
Patrick McHardy	d696c7bdaa	netfilter: nf_conntrack: fix hash resizing with namespaces As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash size is global and not per namespace, but modifiable at runtime through /sys/module/nf_conntrack/hashsize. Changing the hash size will only resize the hash in the current namespace however, so other namespaces will use an invalid hash size. This can cause crashes when enlarging the hashsize, or false negative lookups when shrinking it. Move the hash size into the per-namespace data and only use the global hash size to initialize the per-namespace value when instanciating a new namespace. Additionally restrict hash resizing to init_net for now as other namespaces are not handled currently. Cc: stable@kernel.org Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-08 11:18:07 -08:00
Eric Dumazet	5b3501faa8	netfilter: nf_conntrack: per netns nf_conntrack_cachep nf_conntrack_cachep is currently shared by all netns instances, but because of SLAB_DESTROY_BY_RCU special semantics, this is wrong. If we use a shared slab cache, one object can instantly flight between one hash table (netns ONE) to another one (netns TWO), and concurrent reader (doing a lookup in netns ONE, 'finding' an object of netns TWO) can be fooled without notice, because no RCU grace period has to be observed between object freeing and its reuse. We dont have this problem with UDP/TCP slab caches because TCP/UDP hashtables are global to the machine (and each object has a pointer to its netns). If we use per netns conntrack hash tables, we also must use per netns conntrack slab caches, to guarantee an object can not escape from one namespace to another one. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> [Patrick: added unique slab name allocation] Cc: stable@kernel.org Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-02-08 11:16:56 -08:00
Alexey Dobriyan	d7c7544c3d	netns xfrm: deal with dst entries in netns GC is non-existent in netns, so after you hit GC threshold, no new dst entries will be created until someone triggers cleanup in init_net. Make xfrm4_dst_ops and xfrm6_dst_ops per-netns. This is not done in a generic way, because it woule waste (AF_MAX - 2) * sizeof(struct dst_ops) bytes per-netns. Reorder GC threshold initialization so it'd be done before registering XFRM policies. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-01-24 22:47:53 -08:00
Alexey Dobriyan	e9d3897cc2	netfilter: netns: #ifdef ->iptable_security, ->ip6table_security 'security' tables depend on SECURITY, so ifdef them. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-01-18 08:08:37 +01:00
Eric W. Biederman	d79d792ef9	net: Allow xfrm_user_net_exit to batch efficiently. xfrm.nlsk is provided by the xfrm_user module and is access via rcu from other parts of the xfrm code. Add xfrm.nlsk_stash a copy of xfrm.nlsk that will never be set to NULL. This allows the synchronize_net and netlink_kernel_release to be deferred until a whole batch of xfrm.nlsk sockets have been set to NULL. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-03 12:22:03 -08:00
Eric W. Biederman	65c0cfafce	net: remove [un]register_pernet_gen_... and update the docs. No that all of the callers have been updated to set fields in struct pernet_operations, and simplified to let the network namespace core handle the allocation and freeing of the storage for them, remove the surpurpflous methods and update the docs to the new style. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-01 16:16:00 -08:00
Alexey Dobriyan	86393e52c3	netns: embed ip6_dst_ops directly struct net::ipv6.ip6_dst_ops is separatedly dynamically allocated, but there is no fundamental reason for it. Embed it directly into struct netns_ipv6. For that: * move struct dst_ops into separate header to fix circular dependencies I honestly tried not to, it's pretty impossible to do other way * drop dynamical allocation, allocate together with netns For a change, remove struct dst_ops::dst_net, it's deducible by using container_of() given dst_ops pointer. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-01 17:40:31 -07:00
Cyrill Gorcunov	e04af024b2	net, netns_xt: shrink netns_xt members In case if kernel was compiled without ebtables support there is no need to keep ebt_table pointers in netns_xt structure. Make it config dependent. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-07-05 19:16:18 -07:00
Pablo Neira Ayuso	dd7669a92c	netfilter: conntrack: optional reliable conntrack event delivery This patch improves ctnetlink event reliability if one broadcast listener has set the NETLINK_BROADCAST_ERROR socket option. The logic is the following: if an event delivery fails, we keep the undelivered events in the missed event cache. Once the next packet arrives, we add the new events (if any) to the missed events in the cache and we try a new delivery, and so on. Thus, if ctnetlink fails to deliver an event, we try to deliver them once we see a new packet. Therefore, we may lose state transitions but the userspace process gets in sync at some point. At worst case, if no events were delivered to userspace, we make sure that destroy events are successfully delivered. Basically, if ctnetlink fails to deliver the destroy event, we remove the conntrack entry from the hashes and we insert them in the dying list, which contains inactive entries. Then, the conntrack timer is added with an extra grace timeout of random32() % 15 seconds to trigger the event again (this grace timeout is tunable via /proc). The use of a limited random timeout value allows distributing the "destroy" resends, thus, avoiding accumulating lots "destroy" events at the same time. Event delivery may re-order but we can identify them by means of the tuple plus the conntrack ID. The maximum number of conntrack entries (active or inactive) is still handled by nf_conntrack_max. Thus, we may start dropping packets at some point if we accumulate a lot of inactive conntrack entries that did not successfully report the destroy event to userspace. During my stress tests consisting of setting a very small buffer of 2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket flag, and generating lots of very small connections, I noticed very few destroy entries on the fly waiting to be resend. A simple way to test this patch consist of creating a lot of entries, set a very small Netlink buffer in conntrackd (+ a patch which is not in the git tree to set the BROADCAST_ERROR flag) and invoke `conntrack -F'. For expectations, no changes are introduced in this patch. Currently, event delivery is only done for new expectations (no events from expectation expiration, removal and confirmation). In that case, they need a per-expectation event cache to implement the same idea that is exposed in this patch. This patch can be useful to provide reliable flow-accouting. We still have to add a new conntrack extension to store the creation and destroy time. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-06-13 12:30:52 +02:00
Pablo Neira Ayuso	a0891aa6a6	netfilter: conntrack: move event caching to conntrack extension infrastructure This patch reworks the per-cpu event caching to use the conntrack extension infrastructure. The main drawback is that we consume more memory per conntrack if event delivery is enabled. This patch is required by the reliable event delivery that follows to this patch. BTW, this patch allows you to enable/disable event delivery via /proc/sys/net/netfilter/nf_conntrack_events in runtime, although you can still disable event caching as compilation option. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-06-13 12:26:29 +02:00
Eric Dumazet	ea781f197d	netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu() Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP. This permits an easy conversion from call_rcu() based hash lists to a SLAB_DESTROY_BY_RCU one. Avoiding call_rcu() delay at nf_conn freeing time has numerous gains. First, it doesnt fill RCU queues (up to 10000 elements per cpu). This reduces OOM possibility, if queued elements are not taken into account This reduces latency problems when RCU queue size hits hilimit and triggers emergency mode. - It allows fast reuse of just freed elements, permitting better use of CPU cache. - We delete rcu_head from "struct nf_conn", shrinking size of this structure by 8 or 16 bytes. This patch only takes care of "struct nf_conn". call_rcu() is still used for less critical conntrack parts, that may be converted later if necessary. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2009-03-25 21:05:46 +01:00
Benjamin Thery	6c5143dbcf	netns: ipmr: declare reg_vif_num per-namespace Preliminary work to make IPv4 multicast routing netns-aware. Declare variable 'reg_vif_num' per-namespace, move into struct netns_ipv4. At the moment, this variable is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-22 13:57:40 -08:00
Benjamin Thery	6f9374a934	netns: ipmr: declare mroute_do_assert and mroute_do_pim per-namespace Preliminary work to make IPv4 multicast routing netns-aware. Declare IPv multicast routing variables 'mroute_do_assert' and 'mroute_do_pim' per-namespace in struct netns_ipv4. At the moment, these variables are only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-22 13:57:40 -08:00
Benjamin Thery	1e8fb3b6a4	netns: ipmr: declare counter cache_resolve_queue_len per-namespace Preliminary work to make IPv4 multicast routing netns-aware. Declare variable cache_resolve_queue_len per-namespace: move it into struct netns_ipv4. This variable counts the number of unresolved cache entries queued in the list mfc_unres_queue. This list is kept global to all netns as the number of entries per namespace is limited to 10 (hardcoded in routine ipmr_cache_unresolved). Entries belonging to different namespaces in mfc_unres_queue will be identified by matching the mfc_net member introduced previously in struct mfc_cache. Keeping this list global to all netns, also allows us to keep a single timer (ipmr_expire_timer) to handle their expiration. In some places cache_resolve_queue_len value was tested for arming or deleting the timer. These tests were equivalent to testing mfc_unres_queue value instead and are replaced in this patch. At the moment, cache_resolve_queue_len is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-22 13:57:39 -08:00
Benjamin Thery	2bb8b26c3e	netns: ipmr: dynamically allocate mfc_cache_array Preliminary work to make IPv4 multicast routing netns-aware. Dynamically allocate IPv4 multicast forwarding cache, mfc_cache_array, and move it to struct netns_ipv4. At the moment, mfc_cache_array is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-22 13:57:38 -08:00
Benjamin Thery	cf958ae377	netns: ipmr: dynamically allocate vif_table Preliminary work to make IPv6 multicast routing netns-aware. Dynamically allocate interface table vif_table and move it to struct netns_ipv4, and update MIF_EXISTS() macro. At the moment, vif_table is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-22 13:57:34 -08:00
Benjamin Thery	70a269e6c9	netns: ipmr: allocate mroute_socket per-namespace. Preliminary work to make IPv4 multicast routing netns-aware. Make IPv4 multicast routing mroute_socket per-namespace, moves it into struct netns_ipv4. At the moment, mroute_socket is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-01-22 13:57:34 -08:00
Benjamin Thery	950d5704e5	netns: ip6mr: declare reg_vif_num per-namespace Preliminary work to make IPv6 multicast forwarding netns-aware. Declare variable 'reg_vif_num' per-namespace, moves into struct netns_ipv6. At the moment, this variable is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-12-10 16:29:24 -08:00
Benjamin Thery	a21f3f997c	netns: ip6mr: declare mroute_do_assert and mroute_do_pim per-namespace Preliminary work to make IPv6 multicast forwarding netns-aware. Declare IPv6 multicast forwarding variables 'mroute_do_assert' and 'mroute_do_pim' per-namespace in struct netns_ipv6. At the moment, these variables are only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-12-10 16:28:44 -08:00
Benjamin Thery	4045e57c19	netns: ip6mr: declare counter cache_resolve_queue_len per-namespace Preliminary work to make IPv6 multicast forwarding netns-aware. Declare variable cache_resolve_queue_len per-namespace: moves it into struct netns_ipv6. This variable counts the number of unresolved cache entries queued in the list mfc_unres_queue. This list is kept global to all netns as the number of entries per namespace is limited to 10 (hardcoded in routine ip6mr_cache_unresolved). Entries belonging to different namespaces in mfc_unres_queue will be identified by matching the mfc_net member introduced previously in struct mfc6_cache. Keeping this list global to all netns, also allows us to keep a single timer (ipmr_expire_timer) to handle their expiration. In some places cache_resolve_queue_len value was tested for arming or deleting the timer. These tests were equivalent to testing mfc_unres_queue value instead and are replaced in this patch. At the moment, cache_resolve_queue_len is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-12-10 16:27:21 -08:00
Benjamin Thery	4a6258a0e3	netns: ip6mr: dynamically allocate mfc6_cache_array Preliminary work to make IPv6 multicast forwarding netns-aware. Dynamically allocates IPv6 multicast forwarding cache, mfc6_cache_array, and moves it to struct netns_ipv6. At the moment, mfc6_cache_array is only referenced in init_net. Replace 'ARRAY_SIZE(mfc6_cache_array)' with mfc6_cache_array size: MFC6_LINES. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-12-10 16:24:07 -08:00
Benjamin Thery	4e16880cb4	netns: ip6mr: dynamically allocates vif6_table Preliminary work to make IPv6 multicast forwarding netns-aware. Dynamically allocates interface table vif6_table and moves it to struct netns_ipv6, and updates MIF_EXISTS() macro. At the moment, vif6_table is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-12-10 16:15:08 -08:00
Benjamin Thery	bd91b8bf37	netns: ip6mr: allocate mroute6_socket per-namespace. Preliminary work to make IPv6 multicast forwarding netns-aware. Make IPv6 multicast forwarding mroute6_socket per-namespace, moves it into struct netns_ipv6. At the moment, mroute6_socket is only referenced in init_net. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-12-10 16:07:08 -08:00
David S. Miller	ed77a89c30	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 Conflicts: net/netfilter/nf_conntrack_netlink.c	2008-11-28 02:19:15 -08:00
Alexey Dobriyan	b27aeadb59	netns xfrm: per-netns sysctls Make net.core.xfrm_aevent_etime net.core.xfrm_acq_expires net.core.xfrm_aevent_rseqth net.core.xfrm_larval_drop sysctls per-netns. For that make net_core_path[] global, register it to prevent two /proc/net/core antries and change initcall position -- xfrm_init() is called from fs_initcall, so this one should be fs_initcall at least. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 18:00:48 -08:00
Alexey Dobriyan	59c9940ed0	netns xfrm: per-netns MIBs Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:59:52 -08:00
Alexey Dobriyan	a6483b790f	netns xfrm: per-netns NETLINK_XFRM socket Stub senders to init_net's one temporarily. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:38:20 -08:00
Alexey Dobriyan	66caf628c3	netns xfrm: per-netns policy hash resizing work Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:28:57 -08:00
Alexey Dobriyan	dc2caba7b3	netns xfrm: per-netns policy counts Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:24:15 -08:00
Alexey Dobriyan	a35f6c5de3	netns xfrm: per-netns xfrm_policy_bydst hash Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:23:48 -08:00
Alexey Dobriyan	8b18f8eaf9	netns xfrm: per-netns inexact policies Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:23:26 -08:00
Alexey Dobriyan	8100bea7d6	netns xfrm: per-netns xfrm_policy_byidx hashmask Per-netns hashes are independently resizeable. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:22:58 -08:00
Alexey Dobriyan	93b851c1c9	netns xfrm: per-netns xfrm_policy_byidx hash Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:22:35 -08:00
Alexey Dobriyan	adfcf0b27e	netns xfrm: per-netns policy list Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:22:11 -08:00
Alexey Dobriyan	50a30657fd	netns xfrm: per-netns km_waitq Disallow spurious wakeups in __xfrm_lookup(). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:21:01 -08:00
Alexey Dobriyan	c78371441c	netns xfrm: per-netns state GC work State GC is per-netns, and this is part of it. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:20:36 -08:00
Alexey Dobriyan	b8a0ae20b0	netns xfrm: per-netns state GC list km_waitq is going to be made per-netns to disallow spurious wakeups in __xfrm_lookup(). To not wakeup after every garbage-collected xfrm_state (which potentially can be from different netns) make state GC list per-netns. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:20:11 -08:00
Alexey Dobriyan	6308273385	netns xfrm: per-netns xfrm_hash_work All of this is implicit passing which netns's hashes should be resized. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:19:07 -08:00
Alexey Dobriyan	0bf7c5b019	netns xfrm: per-netns xfrm_state counts Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:18:39 -08:00
Alexey Dobriyan	529983ecab	netns xfrm: per-netns xfrm_state_hmask Since hashtables are per-netns, they can be independently resized. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:18:12 -08:00
Alexey Dobriyan	b754a4fd8f	netns xfrm: per-netns xfrm_state_byspi hash Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:17:47 -08:00
Alexey Dobriyan	d320bbb306	netns xfrm: per-netns xfrm_state_bysrc hash Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:17:24 -08:00
Alexey Dobriyan	73d189dce4	netns xfrm: per-netns xfrm_state_bydst hash Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:16:58 -08:00
Alexey Dobriyan	9d4139c769	netns xfrm: per-netns xfrm_state_all list This is done to get a) simple "something leaked" check b) cover possible DoSes when other netns puts many, many xfrm_states onto a list. c) not miss "alien xfrm_state" check in some of list iterators in future. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:16:11 -08:00
Alexey Dobriyan	d62ddc21b6	netns xfrm: add netns boilerplate Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-25 17:14:31 -08:00
Alexey Dobriyan	b71b30a626	netfilter: netns ebtables: ebtable_nat in netns Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-11-04 14:30:46 +01:00
Alexey Dobriyan	4aad10938d	netfilter: netns ebtables: ebtable_filter in netns Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-11-04 14:29:58 +01:00
Alexey Dobriyan	8157e6d16a	netfilter: netns ebtables: ebtable_broute in netns Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-11-04 14:29:03 +01:00
Neil Horman	1080d709fb	net: implement emergency route cache rebulds when gc_elasticity is exceeded This is a patch to provide on demand route cache rebuilding. Currently, our route cache is rebulid periodically regardless of need. This introduced unneeded periodic latency. This patch offers a better approach. Using code provided by Eric Dumazet, we compute the standard deviation of the average hash bucket chain length while running rt_check_expire. Should any given chain length grow to larger that average plus 4 standard deviations, we trigger an emergency hash table rebuild for that net namespace. This allows for the common case in which chains are well behaved and do not grow unevenly to not incur any latency at all, while those systems (which may be being maliciously attacked), only rebuild when the attack is detected. This patch take 2 other factors into account: 1) chains with multiple entries that differ by attributes that do not affect the hash value are only counted once, so as not to unduly bias system to rebuilding if features like QOS are heavily used 2) if rebuilding crosses a certain threshold (which is adjustable via the added sysctl in this patch), route caching is disabled entirely for that net namespace, since constant rebuilding is less efficient that no caching at all Tested successfully by me. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-27 17:06:14 -07:00
Patrick McHardy	10a03a42d1	netfilter: netns: use NFPROTO_NUMPROTO instead of NUMPROTO for tables array The netfilter families have been decoupled from regular protocol families. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-20 03:31:54 -07:00
Denis V. Lunev	9261e53701	ipv6: making ip and icmp statistics per/namespace Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-08 11:16:45 -07:00
Alexey Dobriyan	0c4c9288ad	netfilter: netns nat: per-netns bysource hash Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:11 +02:00
Alexey Dobriyan	e099a17357	netfilter: netns nat: per-netns NAT table Same story as with iptable_filter, iptables_raw tables. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:10 +02:00
Alexey Dobriyan	d716a4dfbb	netfilter: netns nf_conntrack: per-netns conntrack accounting Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:09 +02:00
Alexey Dobriyan	c2a2c7e0cc	netfilter: netns nf_conntrack: per-netns net.netfilter.nf_conntrack_log_invalid sysctl Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:08 +02:00
Alexey Dobriyan	c04d05529a	netfilter: netns nf_conntrack: per-netns net.netfilter.nf_conntrack_checksum sysctl Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:08 +02:00
Alexey Dobriyan	802507071b	netfilter: netns nf_conntrack: per-netns net.netfilter.nf_conntrack_count sysctl Note, sysctl table is always duplicated, this is simpler and less special-cased. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:08 +02:00
Alexey Dobriyan	0d55af8791	netfilter: netns nf_conntrack: per-netns statistics Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:07 +02:00
Alexey Dobriyan	6058fa6bb9	netfilter: netns nf_conntrack: per-netns event cache Heh, last minute proof-reading of this patch made me think, that this is actually unneeded, simply because "ct" pointers will be different for different conntracks in different netns, just like they are different in one netns. Not so sure anymore. [Patrick: pointers will be different, flushing can only be done while inactive though and thus it needs to be per netns] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:07 +02:00
Alexey Dobriyan	63c9a26264	netfilter: netns nf_conntrack: per-netns unconfirmed list What is confirmed connection in one netns can very well be unconfirmed in another one. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:04 +02:00
Alexey Dobriyan	9b03f38d04	netfilter: netns nf_conntrack: per-netns expectations Make per-netns a) expectation hash and b) expectations count. Expectations always belongs to netns to which it's master conntrack belong. This is natural and doesn't bloat expectation. Proc files and leaf users are stubbed to init_net, this is temporary. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:03 +02:00
Alexey Dobriyan	400dad39d1	netfilter: netns nf_conntrack: per-netns conntrack hash * make per-netns conntrack hash Other solution is to add ->ct_net pointer to tuplehashes and still has one hash, I tried that it's ugly and requires more code deep down in protocol modules et al. * propagate netns pointer to where needed, e. g. to conntrack iterators. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:03 +02:00
Alexey Dobriyan	49ac8713b6	netfilter: netns nf_conntrack: per-netns conntrack count Sysctls and proc files are stubbed to init_net's one. This is temporary. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:03 +02:00
Alexey Dobriyan	dfdb8d7918	netfilter: netns nf_conntrack: add netns boilerplate One comment: #ifdefs around #include is necessary to overcome amazing compile breakages in NOTRACK-in-netns patch (see below). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-10-08 11:35:02 +02:00
Denis V. Lunev	be713a443e	netns: make uplitev6 mib per/namespace Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-07 14:50:06 -07:00
Denis V. Lunev	0c7ed677fb	netns: make udpv6 mib per/namespace Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-07 14:49:36 -07:00
Denis V. Lunev	835bcc0497	netns: move /proc/net/dev_snmp6 to struct net Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-07 14:45:55 -07:00
Stephen Hemminger	417f28bb34	netns: dont alloc ipv6 fib timer list FIB timer list is a trivial size structure, avoid indirection and just put it in existing ns. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:33:45 -07:00
Pavel Emelyanov	923c6586b0	mib: put icmpmsg statistics on struct net Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:04:22 -07:00
Pavel Emelyanov	b60538a0d7	mib: put icmp statistics on struct net Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:04:02 -07:00
Pavel Emelyanov	386019d351	mib: put udplite statistics on struct net Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:03:45 -07:00
Pavel Emelyanov	2f275f91a4	mib: put udp statistics on struct net Similar to... ouch, I repeat myself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:03:27 -07:00
Pavel Emelyanov	61a7e26028	mib: put net statistics on struct net Similar to ip and tcp ones :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:03:08 -07:00
Pavel Emelyanov	a20f5799ca	mib: put ip statistics on struct net Similar to tcp one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:02:42 -07:00
Pavel Emelyanov	57ef42d59d	mib: put tcp statistics on struct net Proc temporary uses stats from init_net. BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:02:08 -07:00
Pavel Emelyanov	852566f53c	mib: add netns/mib.h file The only structure declared within is the netns_mib, which will carry all our mibs within. I didn't put the mibs in the existing netns_xxx structures to make it possible to mark this one as properly aligned and get in a separate "read-mostly" cache-line. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:01:24 -07:00
Denis V. Lunev	e84f84f276	netns: place rt_genid into struct net Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-05 19:04:32 -07:00
Denis V. Lunev	9f5e97e536	netns: make rt_secret_rebuild timer per namespace Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-05 19:02:59 -07:00
Denis V. Lunev	39a23e7508	netns: register net.ipv4.route.flush in each namespace Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-05 19:02:33 -07:00
Pavel Emelyanov	0b4419162a	netns: introduce the net_hash_mix "salt" for hashes There are many possible ways to add this "salt", thus I made this patch to be the last in the series to change it if required. Currently I propose to use the struct net pointer itself as this salt, but since this pointer is most often cache-line aligned, shift this right to eliminate the bits, that are most often zeroed. After this, simply add this mix to prepared hashfn-s. For CONFIG_NET_NS=n case this salt is 0 and no changes in hashfn appear. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-16 17:14:11 -07:00
James Morris	17e6e59f0a	netfilter: ip6_tables: add ip6tables security table This is a port of the IPv4 security table for IPv6. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-09 15:58:05 -07:00
James Morris	560ee653b6	netfilter: ip_tables: add iptables security table for mandatory access control rules The following patch implements a new "security" table for iptables, so that MAC (SELinux etc.) networking rules can be managed separately to standard DAC rules. This is to help with distro integration of the new secmark-based network controls, per various previous discussions. The need for a separate table arises from the fact that existing tools and usage of iptables will likely clash with centralized MAC policy management. The SECMARK and CONNSECMARK targets will still be valid in the mangle table to prevent breakage of existing users. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-09 15:57:24 -07:00
Pavel Emelyanov	dec827d174	[NETNS]: The generic per-net pointers. Add the elastic array of void * pointer to the struct net. The access rules are simple: 1. register the ops with register_pernet_gen_device to get the id of your private pointer 2. call net_assign_generic() to put the private data on the struct net (most preferably this should be done in the ->init callback of the ops registered) 3. do not store any private reference on the net_generic array; 4. do not change this pointer while the net is alive; 5. use the net_generic() to get the pointer. When adding a new pointer, I copy the old array, replace it with a new one and schedule the old for kfree after an RCU grace period. Since the net_generic explores the net->gen array inside rcu read section and once set the net->gen->ptr[x] pointer never changes, this grants us a safe access to generic pointers. Quoting Paul: "... RCU is protecting -only- the net_generic structure that net_generic() is traversing, and the [pointer] returned by net_generic() is protected by a reference counter in the upper-level struct net." Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-15 00:36:08 -07:00
Pavel Emelyanov	0204774191	[NETNS][DCCPV6]: Move the dccp_v6_ctl_sk on the struct net. And replace all its usage with init_net's socket. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-13 22:32:25 -07:00
Pavel Emelyanov	7b1cffa8c9	[NETNS][DCCPV4]: Move the dccp_v4_ctl_sk on the struct net. And replace all its usage with init_net's socket. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-13 22:29:37 -07:00
Pavel Emelyanov	67019cc9ee	[NETNS]: Add an empty netns_dccp structure on struct net. According to the overall struct net design, it will be filled with DCCP-related members. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-13 22:28:42 -07:00
Denis V. Lunev	046ee90235	[NETNS]: Create tcp control socket in the each namespace. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-03 14:31:33 -07:00
Pavel Emelyanov	70ee115942	[SOCK][NETNS]: Add the percpu prot_inuse counter in the struct net. Such an accounting would cost us two more dereferences to get the percpu variable from the struct net, so I make sock_prot_inuse_get and _add calls work differently depending on CONFIG_NET_NS - without it old optimized routines are used. The per-cpu counter for init_net is prepared in core_initcall, so that even af_inet, that starts as fs_initcall, will already have the init_net prepared. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-31 19:42:16 -07:00
Pavel Emelyanov	8efa6e93cb	[NETNS]: Introduce a netns_core structure. There's already some stuff on the struct net, that should better be folded into netns_core structure. I'm making the per-proto inuse counter be per-net also, which is also a candidate for this, so introduce this structure and populate it a bit. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-31 19:41:14 -07:00
Pavel Emelyanov	68528f0998	[NETNS][ICMP]: Make ctl tables for ICMP sysctls per-net. Add some flesh to ipv4_sysctl_init_net and ipv4_sysctl_exit_net, i.e. copy the table, alter .data pointers and register it per-net. Other ipv4_table's sysctls are now global, but this is going to change once sysctl permissions patches migrate from -mm tree to mainline in 2.6.26 merge window :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-26 01:56:24 -07:00
Pavel Emelyanov	a24022e188	[NETNS][ICMP]: Move ICMP sysctls on struct net. Initialization is moved to icmp_sk_init, all the places, that refer to them use init_net for now. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-26 01:55:37 -07:00
Daniel Lezcano	b8ad0cbc58	[NETNS][IPV6] mcast - handle several network namespace This patch make use of the network namespace information at the right places to handle the multicast for several network namespaces. It makes the socket control to be per namespace too. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-07 11:16:55 -08:00
Daniel Lezcano	93ec926b07	[NETNS][IPV6] tcp6 - make socket control per namespace Instead of having a tcp6_socket global to all the namespace, there is tcp6 socket control per namespace. That is consistent with which namespace sent a RST and allows to pass the socket to the underlying function to retrieve the network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-07 11:16:02 -08:00
Daniel Lezcano	1762f7e88e	[NETNS][IPV6] ndisc - make socket control per namespace Make ndisc socket control per namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-07 11:15:34 -08:00
Benjamin Thery	6891a346c3	[NETNS][IPV6] route6 - make garbage collection work with multiple network namespaces This patch makes the necessary changes to make IPv6 dst_entry garbage collection work with multiple network namespaces. In ip6_dst_gc(), static local variables are now declared per-namespace. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-04 13:49:47 -08:00
Benjamin Thery	f2fc6a5458	[NETNS][IPV6] route6 - move ip6_dst_ops inside the network namespace The ip6_dst_ops is moved inside the network namespace structure. All references to this structure are now relative to the initial network namespace. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-04 13:49:23 -08:00
Daniel Lezcano	8ed6778967	[NETNS][IPV6] rt6_info - move rt6_info structure inside the namespace The rt6_info structures are moved inside the network namespace structure. All references to these structures are now relative to the initial network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-04 13:48:30 -08:00
Benjamin Thery	c572872f89	[NETNS][IPV6] rt6_stats - make the stats per network namespace The rt6_stats is now per namespace with this patch. It is allocated when a network namespace is created and freed when the network namespace exits and references are relative to the network namespace. Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-03 23:34:17 -08:00
Daniel Lezcano	dcabb819a6	[NETNS][IPV6] fib6_rules - handle several network namespaces The fib6_rules_ops is moved to the network namespace structure. All references are changed to have it relatively to it. Each time a network namespace is created a new fib6_rules_ops is allocated, initialized and stored into the network namespace structure. The common part of the fib rules is namespace aware, so it is quite easy to retrieve the network namespace from the rules and use it in the different callbacks. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-03 23:33:08 -08:00
Daniel Lezcano	63152fc0de	[NETNS][IPV6] ip6_fib - gc timer per namespace Move the timer initialization at the network namespace creation and store the network namespace in the timer argument. That enables multiple timers (one per network namespace) to do garbage collecting. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-03 23:31:11 -08:00
Daniel Lezcano	58f09b78b7	[NETNS][IPV6] ip6_fib - make it per network namespace The fib table for ipv6 are moved to the network namespace structure. All references to them are made relatively to the network namespace. All external calls to the ip6_fib functions taking the network namespace parameter are made using the init_net variable, so the ip6_fib engine is ready for the namespaces but the callers not yet. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-03 23:25:27 -08:00
Denis V. Lunev	98c6d1b261	[NETNS]: Make icmpv6_sk per namespace. All preparations are done. Now just add a hook to perform an initialization on namespace startup and replace icmpv6_sk macro with proper inline call. Actual namespace the packet belongs too will be passed later along with the one for the routing. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-02-29 11:21:22 -08:00
Denis V. Lunev	4a6ad7a141	[NETNS]: Make icmp_sk per namespace. All preparations are done. Now just add a hook to perform an initialization on namespace startup and replace icmp_sk macro with proper inline call. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-02-29 11:19:58 -08:00
Alexey Dobriyan	9ea0cb2601	[NETFILTER]: arp_tables: per-netns arp_tables FILTER Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:27:41 -08:00
Alexey Dobriyan	8280aa6182	[NETFILTER]: ip6_tables: per-netns IPv6 FILTER, MANGLE, RAW Now it's possible to list and manipulate per-netns ip6tables rules. Filtering decisions are based on init_net's table so far. P.S.: remove init_net check in inet6_create() to see the effect Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:27:39 -08:00
Alexey Dobriyan	9335f047fe	[NETFILTER]: ip_tables: per-netns FILTER, MANGLE, RAW Now, iptables show and configure different set of rules in different netnss'. Filtering decisions are still made by consulting only init_net's set. Changes are identical except naming so no splitting. P.S.: one need to remove init_net checks in nf_sockopt.c and inet_create() to see the effect. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:27:38 -08:00
Alexey Dobriyan	8d87005207	[NETFILTER]: x_tables: per-netns xt_tables In fact all we want is per-netns set of rules, however doing that will unnecessary complicate routines such as ipt_hook()/ipt_do_table, so make full xt_table array per-netns. Every user stubbed with init_net for a while. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:27:35 -08:00
Pavel Emelyanov	3b4bc4a2bf	[NETNS][FRAGS]: Isolate the secret interval from namespaces. Since we have one hashtable to lookup the fragment, having different secret_interval-s for hash rebuild doesn't make sense, so move this one to inet_frags. The inet_frags_ctl becomes empty after this, so remove it. The appropriate ctl table is kept read-only in namespaces. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:10:39 -08:00
Pavel Emelyanov	e4a2d5c2bc	[NETNS][FRAGS]: Duplicate sysctl tables for new namespaces. Each namespace has to have own tables to tune their different parameters, so duplicate the tables and register them. All the tables in sub-namespaces are temporarily made read-only. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:10:37 -08:00
Pavel Emelyanov	ac18e7509e	[NETNS][FRAGS]: Make the inet_frag_queue lookup work in namespaces. Since fragment management code is consolidated, we cannot have the pointer from inet_frag_queue to struct net, since we must know what king of fragment this is. So, I introduce the netns_frags structure. This one is currently empty, but will be eventually filled with per-namespace attributes. Each inet_frag_queue is tagged with this one. The conntrack_reasm is not "netns-izated", so it has one static netns_frags instance to keep working in init namespace. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:10:34 -08:00
Pavel Emelyanov	e0da5a480c	[NETNS]: Create ipv6 devconf-s for namespaces This is the core. Declare and register the pernet subsys for addrconf. The init callback the will create the devconf-s. The init_net will reuse the existing statically declared confs, so that accessing them from inside the ipv6 code will still work. The register_pernet_subsys() is moved above the ipv6_add_dev() call for loopback, because this function will need the net->devconf_dflt pointer to be already set. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:40 -08:00
Denis V. Lunev	6bd48fcf73	[NETNS]: Provide correct namespace for fibnl netlink socket. This patch makes the netlink socket to be per namespace. That allows to have each namespace its own socket for routing queries. Acked-by: Benjamin Thery <benjamin.thery@bull.net> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:32 -08:00
Denis V. Lunev	e4aef8aea3	[NETNS]: Place fib tables into netns. The preparatory work has been done. All we need is to substitute fib_table_hash with net->ipv4.fib_table_hash. Netns context is available when required. Acked-by: Benjamin Thery <benjamin.thery@bull.net> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:31 -08:00
Denis V. Lunev	e4e4971c5f	[NETNS]: Namespacing IPv4 fib rules. The final trick for rules: place fib4_rules_ops into struct net and modify initialization path for this. Acked-by: Benjamin Thery <benjamin.thery@bull.net> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:31 -08:00
Daniel Lezcano	41a76906b3	[NETNS][IPV6]: Make icmpv6_time sysctl per namespace. This patch moves the icmpv6_time sysctl to the network namespace structure. Because the ipv6 protocol is not yet per namespace, the variable is accessed relatively to the initial network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:20 -08:00
Daniel Lezcano	4990509f19	[NETNS][IPV6]: Make sysctls route per namespace. All the sysctl concerning the routes are moved to the network namespace structure. A helper function is called to initialize the variables. Because the ipv6 protocol is not yet per namespace, the variables are accessed relatively from the network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:20 -08:00
Daniel Lezcano	e71e0349eb	[NETNS][IPV6]: Make ip6_frags per namespace. The ip6_frags is moved to the network namespace structure. Because there can be multiple instances of the network namespaces, and the ip6_frags is no longer a global static variable, a helper function has been added to facilitate the initialization of the variables. Until the ipv6 protocol is not per namespace, the variables are accessed relatively from the initial network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:18 -08:00
Daniel Lezcano	99bc9c4e45	[NETNS][IPV6]: Make bindv6only sysctl per namespace. This patch moves the bindv6only sysctl to the network namespace structure. Until the ipv6 protocol is not per namespace, the sysctl variable is always from the initial network namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:18 -08:00
Daniel Lezcano	760f2d0186	[NETNS][IPV6]: Make multiple instance of sysctl tables. Each network namespace wants its own set of sysctl value, eg. we should not be able from a namespace to set a sysctl value for another namespace , especially for the initial network namespace. This patch duplicates the sysctl table when we register a new network namespace for ipv6. The duplicated table are postfixed with the "template" word to notify the developper the table is cloned. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:17 -08:00
Daniel Lezcano	b0f159db7c	[NETNS][IPV6]: Add ipv6 structure for netns. Like the ipv4 part, this patch adds an ipv6 structure in the net structure to aggregate the different resources to make ipv6 per namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:01:15 -08:00
Eric Dumazet	2a75de0c1d	[NETNS]: Should build with CONFIG_SYSCTL=n Previous NETNS patches broke CONFIG_SYSCTL=n case Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:00:40 -08:00
Pavel Emelyanov	752d14dc6a	[IPV4]: Move the devinet pointers on the struct net This is the core. Add all and default pointers on the netns_ipv4 and register a new pernet subsys to initialize them. Also add the ctl_table_header to register the net.ipv4.ip_forward ctl. I don't allocate additional memory for init_net, but use global devinets. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:58:11 -08:00
Pavel Emelyanov	8afd351c77	[NETNS]: Add the netns_ipv4 struct The ipv4 will store its parameters inside this structure. This one is empty now, but it will be eventually filled. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:58:08 -08:00
Denis V. Lunev	2aaef4e47f	[NETNS]: separate af_packet netns data Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:57:15 -08:00
Denis V. Lunev	a0a53c8ba9	[NETNS]: struct net content re-work (v3) Recently David Miller and Herbert Xu pointed out that struct net becomes overbloated and un-maintainable. There are two solutions: - provide a pointer to a network subsystem definition from struct net. This costs an additional dereferrence - place sub-system definition into the structure itself. This will speedup run-time access at the cost of recompilation time The second approach looks better for us. Other sub-systems will follow. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:57:14 -08:00

... 3 4 5 6 7 ...

448 Commits