linux

Commit Graph

Author	SHA1	Message	Date
Stephen Hemminger	a357dde9df	[TCP] illinois: Incorrect beta usage Lachlan Andrew observed that my TCP-Illinois implementation uses the beta value incorrectly: The parameter beta in the paper specifies the amount to decrease by: that is, on loss, W <- W - betaW but in tcp_illinois_ssthresh() uses beta as the amount to decrease to: W <- betaW This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network, hurting performance. Note: since the base beta value is .5, it has no impact on a congested network. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-30 01:10:55 +11:00
Pavel Emelyanov	076931989f	[INET]: Fix inet_diag register vs rcv race The following race is possible when one cpu unregisters the handler while other one is trying to receive a message and call this one: CPU1: CPU2: inet_diag_rcv() inet_diag_unregister() mutex_lock(&inet_diag_mutex); netlink_rcv_skb(skb, &inet_diag_rcv_msg); if (inet_diag_table[nlh->nlmsg_type] == NULL) /* false handler is still registered / ... netlink_dump_start(idiagnl, skb, nlh, inet_diag_dump, NULL); cb = kzalloc(sizeof(cb), GFP_KERNEL); /* sleep here freeing memory * or preempt * or sleep later on nlk->cb_mutex / spin_lock(&inet_diag_register_lock); inet_diag_table[type] = NULL; ... spin_unlock(&inet_diag_register_lock); synchronize_rcu(); / CPU1 is sleeping - RCU quiescent * state is passed / return; / inet_diag_dump is finally called: / inet_diag_dump() handler = inet_diag_table[cb->nlh->nlmsg_type]; BUG_ON(handler == NULL); / OOPS! While we slept the unregister has set * handler to NULL :( */ Grep showed, that the register/unregister functions are called from init/fini module callbacks for tcp_/dccp_diag, so it's OK to use the inet_diag_mutex to synchronize manipulations with the inet_diag_table and the access to it. Besides, as Herbert pointed out, asynchronous dumps should hold this mutex as well, and thus, we provide the mutex as cb_mutex one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-30 00:08:14 +11:00
Adrian Bunk	3660019e5f	[IPV4]: Remove bogus ifdef mess in arp_process The #ifdef's in arp_process() were not only a mess, they were also wrong in the CONFIG_NET_ETHERNET=n and (CONFIG_NETDEV_1000=y or CONFIG_NETDEV_10000=y) cases. Since they are not required this patch removes them. Also removed are some #ifdef's around #include's that caused compile errors after this change. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-26 23:17:53 +08:00
Ilpo Järvinen	7f9c33e515	[TCP] MTUprobe: Cleanup send queue check (no need to loop) The original code has striking complexity to perform a query which can be reduced to a very simple compare. FIN seqno may be included to write_seq but it should not make any significant difference here compared to skb->len which was used previously. One won't end up there with SYN still queued. Use of write_seq check guarantees that there's a valid skb in send_head so I removed the extra check. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: John Heffner <jheffner@psc.edu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-23 19:10:56 +08:00
Ilpo Järvinen	91cc17c0e5	[TCP]: MTUprobe: receiver window & data available checks fixed It seems that the checked range for receiver window check should begin from the first rather than from the last skb that is going to be included to the probe. And that can be achieved without reference to skbs at all, snd_nxt and write_seq provides the correct seqno already. Plus, it SHOULD account packets that are necessary to trigger fast retransmit [RFC4821]. Location of snd_wnd < probe_size/size_needed check is bogus because it will cause the other if() match as well (due to snd_nxt >= snd_una invariant). Removed dead obvious comment. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-23 19:08:16 +08:00
Pavel Emelyanov	d535a916cd	[IPVS]: Fix compiler warning about unused register_ip_vs_protocol This is silly, but I have turned the CONFIG_IP_VS to m, to check the compilation of one (recently sent) fix and set all the CONFIG_IP_VS_PROTO_XXX options to n to speed up the compilation. In this configuration the compiler warns me about CC [M] net/ipv4/ipvs/ip_vs_proto.o net/ipv4/ipvs/ip_vs_proto.c:49: warning: 'register_ip_vs_protocol' defined but not used Indeed. With no protocols selected there are no calls to this function - all are compiled out with ifdefs. Maybe the best fix would be to surround this call with ifdef-s or tune the Kconfig dependences, but I think that marking this register function as __used is enough. No? Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:44:01 -08:00
Jonas Danielsson	b4a9811c42	[ARP]: Fix arp reply when sender ip 0 Fix arp reply when received arp probe with sender ip 0. Send arp reply with target ip address 0.0.0.0 and target hardware address set to hardware address of requester. Previously sent reply with target ip address and target hardware address set to same as source fields. Signed-off-by: Jonas Danielsson <the.sator@gmail.com> Acked-by: Alexey Kuznetov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:38:16 -08:00
YOSHIFUJI Hideaki	354faf0977	[IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:30:31 -08:00
YOSHIFUJI Hideaki	a80cc20da4	[IPV4] TCPMD5: Omit redundant NULL check for kfree() argument. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:30:06 -08:00
Evgeniy Polyakov	1f305323ff	[NETFILTER]: Fix kernel panic with REDIRECT target. When connection tracking entry (nf_conn) is about to copy itself it can have some of its extension users (like nat) as being already freed and thus not required to be copied. Actually looking at this function I suspect it was copied from nf_nat_setup_info() and thus bug was introduced. Report and testing from David <david@unsolicited.net>. [ Patrick McHardy states: I now understand whats happening: - new connection is allocated without helper - connection is REDIRECTed to localhost - nf_nat_setup_info adds NAT extension, but doesn't initialize it yet - nf_conntrack_alter_reply performs a helper lookup based on the new tuple, finds the SIP helper and allocates a helper extension, causing reallocation because of too little space - nf_nat_move_storage is called with the uninitialized nat extension So your fix is entirely correct, thanks a lot :) ] Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 04:27:35 -08:00
Joe Perches	464c4f184a	[IPV4]: Add missing "space" Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 23:46:29 -08:00
Sam Jansen	5487796f0c	[TCP]: Problem bug with sysctl_tcp_congestion_control function From: "Sam Jansen" <sjansen@google.com> sysctl_tcp_congestion_control seems to have a bug that prevents it from actually calling the tcp_set_default_congestion_control function. This is not so apparent because it does not return an error and generally the /proc interface is used to configure the default TCP congestion control algorithm. This is present in 2.6.18 onwards and probably earlier, though I have not inspected 2.6.15--2.6.17. sysctl_tcp_congestion_control calls sysctl_string and expects a successful return code of 0. In such a case it actually sets the congestion control algorithm with tcp_set_default_congestion_control. Otherwise, it returns the value returned by sysctl_string. This was correct in 2.6.14, as sysctl_string returned 0 on success. However, sysctl_string was updated to return 1 on success around about 2.6.15 and sysctl_tcp_congestion_control was not updated. Even though sysctl_tcp_congestion_control returns 1, do_sysctl_strategy converts this return code to '0', so the caller never notices the error. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 23:28:21 -08:00
Ilpo J�rvinen	6e42141009	[TCP] MTUprobe: fix potential sk_send_head corruption When the abstraction functions got added, conversion here was made incorrectly. As a result, the skb may end up pointing to skb which got included to the probe skb and then was freed. For it to trigger, however, skb_transmit must fail sending as well. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 23:24:09 -08:00
Simon Horman	9055fa1f3d	[IPVS]: Move remaining sysctl handlers over to CTL_UNNUMBERED Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED, I stronly doubt that anyone is using the sys_sysctl interface to these variables. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 21:51:13 -08:00
Simon Horman	9e103fa6bd	[IPVS]: Fix sysctl warnings about missing strategy in schedulers sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy Switch these entried over to use CTL_UNNUMBERED as clearly the sys_syscal portion wasn't working. This is along the same lines as Christian Borntraeger's patch that fixes up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 21:50:21 -08:00
Christian Borntraeger	611cd55b15	[IPVS]: Fix sysctl warnings about missing strategy Running the latest git code I get the following messages during boot: sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy I removed the binary sysctl handler for those messages and also removed the definitions in ip_vs.h. The alternative would be to implement a proper strategy handler, but syscall sysctl is deprecated. There are other sysctl definitions that are commented out or work with the default sysctl_data strategy. I did not touch these. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 21:49:25 -08:00
Eric Dumazet	483b23ffa3	[NET]: Corrects a bug in ip_rt_acct_read() It seems that stats of cpu 0 are counted twice, since for_each_possible_cpu() is looping on all possible cpus, including 0 Before percpu conversion of ip_rt_acct, we should also remove the assumption that CPU 0 is online (or even possible) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-18 18:47:38 -08:00
Eric Dumazet	d90bf5a976	[NET]: rt_check_expire() can take a long time, add a cond_resched() On commit 39c90ece7565f5c47110c2fa77409d7a9478bd5b: [IPV4]: Convert rt_check_expire() from softirq processing to workqueue. we converted rt_check_expire() from softirq to workqueue, allowing the function to perform all work it was supposed to do. When the IP route cache is big, rt_check_expire() can take a long time to run. (default settings : 20% of the hash table is scanned at each invocation) Adding cond_resched() helps giving cpu to higher priority tasks if necessary. Using a "if (need_resched())" test before calling "cond_resched();" is necessary to avoid spending too much time doing the resched check. (My tests gave a time reduction from 88 ms to 25 ms per rt_check_expire() run on my i686 test machine) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-14 16:14:05 -08:00
Ilpo J�rvinen	e1cd8f78f8	[TCP] FRTO: Clear frto_highmark only after process_frto that uses it I broke this in commit 3de96471bd7fb76406e975ef6387abe3a0698149: [TCP]: Wrap-safed reordering detection FRTO check tcp_process_frto should always see a valid frto_highmark. An invalid frto_highmark (zero) is very likely what ultimately caused a seqno compare in tcp_frto_enter_loss to do the wrong leading to the LOST-bit leak. Having LOST-bits integry ensured like done after commit 23aeeec365dcf8bc87fae44c533e50d0bb4f23cc: [TCP] FRTO: Plug potential LOST-bit leak won't hurt. It may still be useful in some other, possibly legimate, scenario. Reported by Chazarain Guillaume <guichaz@yahoo.fr>. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-14 15:55:09 -08:00
Ilpo J�rvinen	96a2d41a3e	[TCP]: Make sure write_queue_from does not begin with NULL ptr NULL ptr can be returned from tcp_write_queue_head to cached_skb and then assigned to skb if packets_out was zero. Without this, system is vulnerable to a carefully crafted ACKs which obviously is remotely triggerable. Besides, there's very little that needs to be done in sacktag if there weren't any packets outstanding, just skipping the rest doesn't hurt. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-14 15:47:18 -08:00
Ilpo J�rvinen	23aeeec365	[TCP] FRTO: Plug potential LOST-bit leak It might be possible that, in some extreme scenario that I just cannot now construct in my mind, end_seq <= frto_highmark check does not match causing the lost_out and LOST bits become out-of-sync due to clearing and recounting in the loop. This may fix LOST-bit leak reported by Chazarain Guillaume <guichaz@yahoo.fr>. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-13 21:03:13 -08:00
Ilpo J�rvinen	746aa32d28	[TCP] FRTO: Limit snd_cwnd if TCP was application limited Otherwise TCP might violate packet ordering principles that FRTO is based on. If conventional recovery path is chosen, this won't be significant at all. In practice, any small enough value will be sufficient to provide proper operation for FRTO, yet other users of snd_cwnd might benefit from a "close enough" value. FRTO's formula is now equal to what tcp_enter_cwr() uses. FRTO used to check application limitedness a bit differently but I changed that in commit `575ee7140d` and as a result checking for application limitedness became completely non-existing. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-13 21:01:23 -08:00
Li Zefan	e0bf9cf15f	[NETFILTER]: nf_nat: fix memset error The size passing to memset is the size of a pointer. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-13 02:57:16 -08:00
Pavel Emelyanov	d71209ded2	[INET]: Use list_head-s in inetpeer.c The inetpeer.c tracks the LRU list of inet_perr-s, but makes it by hands. Use the list_head-s for this. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-12 21:27:28 -08:00
Adrian Bunk	22649d1afb	[IPVS]: Remove unused exports. This patch removes the following unused EXPORT_SYMBOL's: - ip_vs_try_bind_dest - ip_vs_find_dest Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-12 21:25:24 -08:00
Denis V. Lunev	2994c63863	[INET]: Small possible memory leak in FIB rules This patch fixes a small memory leak. Default fib rules can be deleted by the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by ip rule flush Such a rule will not be freed as the ref-counter has 2 on start and becomes clearly unreachable after removal. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 22:12:03 -08:00
Pavel Emelyanov	358352b8b8	[INET]: Cleanup the xfrm4_tunnel_(un)register Both check for the family to select an appropriate tunnel list. Consolidate this check and make the for() loop more readable. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:48:54 -08:00
Pavel Emelyanov	99f933263a	[INET]: Add missed tunnel64_err handler The tunnel64_protocol uses the tunnel4_protocol's err_handler and thus calls the tunnel4_protocol's handlers. This is not very good, as in case of (icmp) error the wrong error handlers will be called (e.g. ipip ones instead of sit) and this won't be noticed at all, because the error is not reported. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:47:39 -08:00
Pavel Emelyanov	03f49f3457	[NET]: Make helper to get dst entry and "use" it There are many places that get the dst entry, increase the __use counter and set the "lastuse" time stamp. Make a helper for this. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:28:34 -08:00
Pavel Emelyanov	b1667609cd	[IPV4]: Remove bugus goto-s from ip_route_input_slow Both places look like if (err == XXX) goto yyy; done: while both yyy targets look like err = XXX; goto done; so this is ok to remove the above if-s. yyy labels are used in other places and are not removed. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:26:41 -08:00
Ilpo J�rvinen	fbd52eb2bd	[TCP]: Split SACK FRTO flag clearing (fixes FRTO corner case bug) In case we run out of mem when fragmenting, the clearing of FLAG_ONLY_ORIG_SACKED might get missed which then feeds FRTO with false information. Move clearing outside skb processing loop so that it will get executed even if the skb loop terminates prematurely due to out-of-mem. Besides, now the core of the loop truly deals with a single skb only, which also enables creation a more self-contained of tcp_sacktag_one later on. In addition, small reorganization of if branches was made. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:24:19 -08:00
Ilpo J�rvinen	e49aa5d456	[TCP]: Add unlikely() to sacktag out-of-mem in fragment case Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:23:08 -08:00
Ilpo J�rvinen	c7caf8d3ed	[TCP]: Fix reord detection due to snd_una covered holes Fixes subtle bug like the one with fastpath_cnt_hint happening due to the way the GSO and hints interact. Because hints are not reset when just a GSOed skb is partially ACKed, there's no guarantee that the relevant part of the write queue is going to be processed in sacktag at all (skbs below snd_una) because fastpath hint can fast forward the entrypoint. This was also on the way of future reductions in sacktag's skb processing. Also future cleanups in sacktag can be made after this (in 2.6.25). This may make reordering update in tcp_try_undo_partial redundant but I'm not too sure so I left it there. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:22:18 -08:00
Ilpo J�rvinen	8dd71c5d28	[TCP]: Consider GSO while counting reord in sacktag Reordering detection fails to take account that the reordered skb may have pcount larger than 1. In such case the lowest of them had the largest reordering, the old formula used the highest of them which is pcount - 1 packets less reordered. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-10 21:20:59 -08:00
Eric Dumazet	230140cffa	[INET]: Remove per bucket rwlock in tcp/dccp ehash table. As done two years ago on IP route cache table (commit `22c047ccbc`) , we can avoid using one lock per hash bucket for the huge TCP/DCCP hash tables. On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for litle performance differences. (we hit a different cache line for the rwlock, but then the bucket cache line have a better sharing factor among cpus, since we dirty it less often). For netstat or ss commands that want a full scan of hash table, we perform fewer memory accesses. Using a 'small' table of hashed rwlocks should be more than enough to provide correct SMP concurrency between different buckets, without using too much memory. Sizing of this table depends on num_possible_cpus() and various CONFIG settings. This patch provides some locking abstraction that may ease a future work using a different model for TCP/DCCP table. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:15:11 -08:00
Rumen G. Bogdanovski	efac52762b	[IPVS]: Synchronize closing of Connections This patch makes the master daemon to sync the connection when it is about to close. This makes the connections on the backup to close or timeout according their state. Before the sync was performed only if the connection is in ESTABLISHED state which always made the connections to timeout in the hard coded 3 minutes. However the Andy Gospodarek's patch ([IPVS]: use proper timeout instead of fixed value) effectively did nothing more than increasing this to 15 minutes (Established state timeout). So this patch makes use of proper timeout since it syncs the connections on status changes to FIN_WAIT (2min timeout) and CLOSE (10sec timeout). However if the backup misses CLOSE hopefully it did not miss FIN_WAIT. Otherwise we will just have to wait for the ESTABLISHED state timeout. As it is without this patch. This way the number of the hanging connections on the backup is kept to minimum. And very few of them will be left to timeout with a long timeout. This is important if we want to make use of the fix for the real server overcommit on master/backup fail-over. Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:15:10 -08:00
Rumen G. Bogdanovski	1e356f9cdf	[IPVS]: Bind connections on stanby if the destination exists This patch fixes the problem with node overload on director fail-over. Given the scenario: 2 nodes each accepting 3 connections at a time and 2 directors, director failover occurs when the nodes are fully loaded (6 connections to the cluster) in this case the new director will assign another 6 connections to the cluster, If the same real servers exist there. The problem turned to be in not binding the inherited connections to the real servers (destinations) on the backup director. Therefore: "ipvsadm -l" reports 0 connections: root@test2:~# ipvsadm -l IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP test2.local:5999 wlc -> node473.local:5999 Route 1000 0 0 -> node484.local:5999 Route 1000 0 0 while "ipvs -lnc" is right root@test2:~# ipvsadm -lnc IPVS connection entries pro expire state source virtual destination TCP 14:56 ESTABLISHED 192.168.0.10:39164 192.168.0.222:5999 192.168.0.51:5999 TCP 14:59 ESTABLISHED 192.168.0.10:39165 192.168.0.222:5999 192.168.0.52:5999 So the patch I am sending fixes the problem by binding the received connections to the appropriate service on the backup director, if it exists, else the connection will be handled the old way. So if the master and the backup directors are synchronized in terms of real services there will be no problem with server over-committing since new connections will not be created on the nonexistent real services on the backup. However if the service is created later on the backup, the binding will be performed when the next connection update is received. With this patch the inherited connections will show as inactive on the backup: root@test2:~# ipvsadm -l IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP test2.local:5999 wlc -> node473.local:5999 Route 1000 0 1 -> node484.local:5999 Route 1000 0 1 rumen@test2:~$ cat /proc/net/ip_vs IP Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP C0A800DE:176F wlc -> C0A80033:176F Route 1000 0 1 -> C0A80032:176F Route 1000 0 1 Regards, Rumen Bogdanovski Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2007-11-07 04:15:09 -08:00
Herbert Xu	4999f3621f	[IPSEC]: Fix crypto_alloc_comp error checking The function crypto_alloc_comp returns an errno instead of NULL to indicate error. So it needs to be tested with IS_ERR. This is based on a patch by Vicen� Beltran Querol. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:15:03 -08:00
Pavel Emelyanov	c3e9a353d8	[IPV4]: Compact some ifdefs in the fib code. There are places that check for CONFIG_IP_MULTIPLE_TABLES twice in the same file, but the internals of these #ifdefs can be merged. As a side effect - remove one ifdef from inside a function. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:11:41 -08:00
Eric Dumazet	47a31a6ffc	[IPV4]: Use the {DEFINE\|REF}_PROTO_INUSE infrastructure Trivial patch to make "tcp,udp,udplite,raw" protocols uses the fast "inuse sockets" infrastructure Each protocol use then a static percpu var, instead of a dynamic one. This saves some ram and some cpu cycles Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:58 -08:00
Eric Dumazet	286ab3d460	[NET]: Define infrastructure to keep 'inuse' changes in an efficent SMP/NUMA way. "struct proto" currently uses an array stats[NR_CPUS] to track change on 'inuse' sockets per protocol. If NR_CPUS is big, this means we use a big memory area for this. Moreover, all this memory area is located on a single node on NUMA machines, increasing memory pressure on the boot node. In this patch, I tried to : - Keep a fast !CONFIG_SMP implementation - Keep a fast CONFIG_SMP implementation for often used protocols (tcp,udp,raw,...) - Introduce a NUMA efficient implementation Some helper macros are defined in include/net/sock.h These macros take into account CONFIG_SMP If a "struct proto" is declared without using DEFINE_PROTO_INUSE / REF_PROTO_INUSE macros, it will automatically use a default implementation, using a dynamically allocated percpu zone. This default implementation will be NUMA efficient, but might use 32/64 bytes per possible cpu because of current alloc_percpu() implementation. However it still should be better than previous implementation based on stats[NR_CPUS] field. When a "struct proto" is changed to use the new macros, we use a single static "int" percpu variable, lowering the memory and cpu costs, still preserving NUMA efficiency. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:57 -08:00
Pavel Emelyanov	6a9fb9479f	[IPV4]: Clean the ip_sockglue.c from some ugly ifdefs The #idfed CONFIG_IP_MROUTE is sometimes places inside the if-s, which looks completely bad. Similar ifdefs inside the functions looks a bit better, but they are also not recommended to be used. Provide an ifdef-ed ip_mroute_opt() helper to cleanup the code. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:55 -08:00
Pavel Emelyanov	429f08e950	[IPV4]: Consolidate the ip cork destruction in ip_output.c The ip_push_pending_frames and ip_flush_pending_frames do the same things to flush the sock's cork. Move this into a separate function and save ~80 bytes from the .text Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:25 -08:00
Patrick McHardy	d1332e0ab8	[NETFILTER]: remove unneeded rcu_dereference() calls As noticed by Paul McKenney, the rcu_dereference calls in the init path of NAT modules are unneeded, remove them. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:23 -08:00
Jan Engelhardt	0795c65d9f	[NETFILTER]: Clean up Makefile Sort matches and targets in the NF makefiles. Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:22 -08:00
Alexey Dobriyan	7351a22a3a	[NETFILTER]: ip{,6}_queue: convert to seq_file interface I plan to kill ->get_info which means killing proc_net_create(). Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:20 -08:00
Jens Axboe	c46f2334c8	[SG] Get rid of __sg_mark_end() sg_mark_end() overwrites the page_link information, but all users want __sg_mark_end() behaviour where we just set the end bit. That is the most natural way to use the sg list, since you'll fill it in and then mark the end point. So change sg_mark_end() to only set the termination bit. Add a sg_magic debug check as well, and clear a chain pointer if it is set. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-11-02 08:47:06 +01:00
Adrian Bunk	87ae9afdca	cleanup asm/scatterlist.h includes Not architecture specific code should not #include <asm/scatterlist.h>. This patch therefore either replaces them with #include <linux/scatterlist.h> or simply removes them if they were unused. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-11-02 08:47:06 +01:00
Pavel Emelyanov	6257ff2177	[NET]: Forget the zero_it argument of sk_alloc() Finally, the zero_it argument can be completely removed from the callers and from the function prototype. Besides, fix the checkpatch.pl warnings about using the assignments inside if-s. This patch is rather big, and it is a part of the previous one. I splitted it wishing to make the patches more readable. Hope this particular split helped. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-01 00:39:31 -07:00
Ilpo J�rvinen	261ab365fa	[TCP]: Another TAGBITS -> SACKED_ACKED\|LOST conversion Similar to commit `3eec0047d9`, point of this is to avoid skipping R-bit skbs. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-01 00:10:18 -07:00
Ilpo J�rvinen	e56d6cd605	[TCP]: Process DSACKs that reside within a SACK block DSACK inside another SACK block were missed if start_seq of DSACK was larger than SACK block's because sorting prioritizes full processing of the SACK block before DSACK. After SACK block sorting situation is like this: SSSSSSSSS D SSSSSS SSSSSSS Because write_queue is walked in-order, when the first SACK block has been processed, TCP is already past the skb for which the DSACK arrived and we haven't taught it to backtrack (nor should we), so TCP just continues processing by going to the next SACK block after the DSACK (if any). Whenever such DSACK is present, do an embedded checking during the previous SACK block. If the DSACK is below snd_una, there won't be overlapping SACK block, and thus no problem in that case. Also if start_seq of the DSACK is equal to the actual block, it will be processed first. Tested this by using netem to duplicate 15% of packets, and by printing SACK block when found_dup_sack is true and the selected skb in the dup_sack = 1 branch (if taken): SACK block 0: 4344-5792 (relative to snd_una 2019137317) SACK block 1: 4344-5792 (relative to snd_una 2019137317) equal start seqnos => next_dup = 0, dup_sack = 1 won't occur... SACK block 0: 5792-7240 (relative to snd_una 2019214061) SACK block 1: 2896-7240 (relative to snd_una 2019214061) DSACK skb match 5792-7240 (relative to snd_una) ...and next_dup = 1 case (after the not shown start_seq sort), went to dup_sack = 1 branch. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-01 00:09:37 -07:00
David S. Miller	51c739d1f4	[NET]: Fix incorrect sg_mark_end() calls. This fixes scatterlist corruptions added by commit `68e3f5dd4d` [CRYPTO] users: Fix up scatterlist conversion errors The issue is that the code calls sg_mark_end() which clobbers the sg_page() pointer of the final scatterlist entry. The first part fo the fix makes skb_to_sgvec() do __sg_mark_end(). After considering all skb_to_sgvec() call sites the most correct solution is to call __sg_mark_end() in skb_to_sgvec() since that is what all of the callers would end up doing anyways. I suspect this might have fixed some problems in virtio_net which is the sole non-crypto user of skb_to_sgvec(). Other similar sg_mark_end() cases were converted over to __sg_mark_end() as well. Arguably sg_mark_end() is a poorly named function because it doesn't just "mark", it clears out the page pointer as a side effect, which is what led to these bugs in the first place. The one remaining plain sg_mark_end() call is in scsi_alloc_sgtable() and arguably it could be converted to __sg_mark_end() if only so that we can delete this confusing interface from linux/scatterlist.h Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-30 21:29:29 -07:00
Alexey Dobriyan	07afa04025	[IPVS]: Remove /proc/net/ip_vs_lblcr It's under CONFIG_IP_VS_LBLCR_DEBUG option which never existed. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-30 21:16:27 -07:00
Dirk Hohndel	e403149c92	Kbuild/doc: fix links to Documentation files Fix links to files in Documentation/* in various Kconfig files Signed-off-by: Dirk Hohndel <hohndel@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-30 14:26:30 -07:00
Jean Delvare	0ccfe61803	[TCP]: Saner thash_entries default with much memory. On systems with a very large amount of memory, the heuristics in alloc_large_system_hash() result in a very large TCP established hash table: 16 millions of entries for a 128 GB ia64 system. This makes reading from /proc/net/tcp pretty slow (well over a second) and as a result netstat is slow on these machines. I know that /proc/net/tcp is deprecated in favor of tcp_diag, however at the moment netstat only knows of the former. I am skeptical that such a large TCP established hash is often needed. Just because a system has a lot of memory doesn't imply that it will have several millions of concurrent TCP connections. Thus I believe that we should put an arbitrary high limit to the size of the TCP established hash by default. Users who really need a bigger hash can always use the thash_entries boot parameter to get more. I propose 2 millions of entries as the arbitrary high limit. This makes /proc/net/tcp reasonably fast on the system in question (0.2 s) while being still large enough for me to be confident that network performance won't suffer. This is just one way to limit the hash size, there are others; I am not familiar enough with the TCP code to decide which is best. Thus, I would welcome the proposals of alternatives. [ 2 million is still too large, thus I've modified the limit in the change to be '512 * 1024'. -DaveM ] Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-30 00:59:25 -07:00
Mitsuru Chinen	064f3605be	[IPv4] SNMP: Refer correct memory location to display ICMP out-going statistics While displaying ICMP out-going statistics as Out<name> counters in /proc/net/snmp, the memory location for ICMP in-coming statistics was referred by mistake. Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com> Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-29 22:37:36 -07:00
Matthias M. Dellweg	b0a713e9e6	[TCP] MD5: Remove some more unnecessary casting. while reviewing the tcp_md5-related code further i came across with another two of these casts which you probably have missed. I don't actually think that they impose a problem by now, but as you said we should remove them. Signed-off-by: Matthias M. Dellweg <2500@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-29 22:37:27 -07:00
Xiaoliang (David) Wei	c940587bf6	[TCP] vegas: Fix a bug in disabling slow start by gamma parameter. TCP Vegas implementation has a bug in the process of disabling slow-start with gamma parameter. The bug may lead to extreme unfairness in the presence of early packet loss. See details in: http://www.cs.caltech.edu/~weixl/technical/ns2linux/known_linux/index.html#vegas Switch the order of "if (tp->snd_cwnd <= tp->snd_ssthresh)" statement and "if (diff > gamma)" statement to eliminate the problem. Signed-off-by: Xiaoliang (David) Wei <davidwei79@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-29 22:37:25 -07:00
Andy Gospodarek	5c81833c2f	[IPVS]: use proper timeout instead of fixed value Instead of using the default timeout of 3 minutes, this uses the timeout specific to the protocol used for the connection. The 3 minute timeout seems somewhat arbitrary (though I know it is used other places in the ipvs code) and when failing over it would be much nicer to use one of the configured timeout values. Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-29 22:37:23 -07:00
Herbert Xu	68e3f5dd4d	[CRYPTO] users: Fix up scatterlist conversion errors This patch fixes the errors made in the users of the crypto layer during the sg_init_table conversion. It also adds a few conversions that were missing altogether. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-27 00:52:07 -07:00
Adrian Bunk	72998d8c84	[INET] ESP: Must #include <linux/scatterlist.h> This patch fixes the following compile errors in some configurations: <-- snip --> ... CC net/ipv4/esp4.o /home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv4/esp4.c: In function 'esp_output': /home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv4/esp4.c:113: error: implicit declaration of function 'sg_init_table' make[3]: * [net/ipv4/esp4.o] Error 1 ... /home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv6/esp6.c: In function 'esp6_output': /home/bunk/linux/kernel-2.6/git/linux-2.6/net/ipv6/esp6.c:112: error: implicit declaration of function 'sg_init_table' make[3]: * [net/ipv6/esp6.o] Error 1 <-- snip --> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-26 22:53:58 -07:00
Paul Moore	4be2700fb7	[NetLabel]: correct usage of RCU locking This fixes some awkward, and perhaps even problematic, RCU lock usage in the NetLabel code as well as some other related trivial cleanups found when looking through the RCU locking. Most of the changes involve removing the redundant RCU read locks wrapping spinlocks in the case of a RCU writer. Signed-off-by: Paul Moore <paul.moore@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-26 04:29:08 -07:00
Ryousei Takano	94d3b1e586	[TCP]: fix D-SACK cwnd handling In the current net-2.6 kernel, handling FLAG_DSACKING_ACK is broken. The flag is cleared to 1 just after FLAG_DSACKING_ACK is set. if (found_dup_sack) flag \|= FLAG_DSACKING_ACK; : flag = 1; To fix it, this patch introduces a part of the tcp_sacktag_state patch: http://marc.info/?l=linux-netdev&m=119210560431519&w=2 Signed-off-by: Ryousei Takano <takano-ryousei@aist.go.jp> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-26 04:27:59 -07:00
Adrian Bunk	39296ed669	[INET]: Unexport icmpmsg_statistics This patch removes the unused EXPORT_SYMBOL(icmpmsg_statistics). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-26 04:06:08 -07:00
Adrian Bunk	0f79efdc23	[TCP]: Make tcp_match_skb_to_sack() static. tcp_match_skb_to_sack() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-26 03:57:36 -07:00
David S. Miller	c7da57a183	[TCP]: Fix scatterlist handling in MD5 signature support. Use sg_init_table() and sg_mark_end() as needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-26 00:41:21 -07:00
David S. Miller	ed0e7e0ca3	[IPSEC]: Add missing sg_init_table() calls to ESP. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-26 00:38:39 -07:00
Ryousei Takano	564262c1f0	[TCP]: Fix inconsistency of terms. Fix inconsistency of terms: 1) D-SACK 2) F-RTO Signed-off-by: Ryousei Takano <takano-ryousei@aist.go.jp> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-25 23:03:52 -07:00
Vlad Yasevich	fee9dee730	[UDP]: Make use of inet_iif() when doing socket lookups. UDP currently uses skb->dev->ifindex which may provide the wrong information when the socket bound to a specific interface. This patch makes inet_iif() accessible to UDP and makes UDP use it. The scenario we are trying to fix is when a client is running on the same system and the server and both client and server bind to a non-loopback device. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-25 18:54:46 -07:00
David S. Miller	8a6911b12f	[IPV4]: Remove no longer used snmp4_icmp_list. This was obsoleted by a previous change, but the removal was forgotten. Reported by David Howells and David Stevens. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-25 18:40:05 -07:00
Pavel Emelyanov	03cf786c4e	[IPV4]: Explicitly call fib_get_table() in fib_frontend.c In case the "multiple tables" config option is y, the ip_fib_local_table is not a variable, but a macro, that calls fib_get_table(RT_TABLE_LOCAL). Some code uses this "variable" 3 times in one place, thus implicitly making 3 calls. Fix it. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-23 21:27:58 -07:00
Chuck Lever	c2636b4d9e	[NET]: Treat the sign of the result of skb_headroom() consistently In some places, the result of skb_headroom() is compared to an unsigned integer, and in others, the result is compared to a signed integer. Make the comparisons consistent and correct. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-23 21:27:55 -07:00
Timo Teras	6a5f44d7a0	[IPV4] ip_gre: sendto/recvfrom NBMA address When GRE tunnel is in NBMA mode, this patch allows an application to use a PF_PACKET socket to: - send a packet to specific NBMA address with sendto() - use recvfrom() to receive packet and check which NBMA address it came from This is required to implement properly NHRP over GRE tunnel. Signed-off-by: Timo Teras <timo.teras@iki.fi> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-23 21:27:53 -07:00
Jean Delvare	305e1e9691	[INET]: Let inet_diag and friends autoload By adding module aliases to inet_diag, tcp_diag and dccp_diag, we let them load automatically as needed. This makes tools like "ss" run faster. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-22 02:59:54 -07:00
Matt LaPlante	01dd2fbf0d	typo fixes Most of these fixes were already submitted for old kernel versions, and were approved, but for some reason they never made it into the releases. Because this is a consolidation of a couple old missed patches, it touches both Kconfigs and documentation texts. Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-20 01:34:40 +02:00
Linus Torvalds	804b908adf	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [NET]: Fix possible dev_deactivate race condition [INET]: Justification for local port range robustness. [PACKET]: Kill unused pg_vec_endpage() function [NET]: QoS/Sched as menuconfig [NET]: Fix bug in sk_filter race cures. [PATCH] mac80211: make ieee802_11_parse_elems return void	2007-10-19 11:54:39 -07:00
Pavel Emelyanov	6651fd561b	Use task_pid_nr() in ip_vs_sync.c The sync_master_pid and sync_backup_pid are set in set_sync_pid() and are used later for set/not-set checks and in printk. So it is safe to use the global pid value in this case. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:43 -07:00
Pavel Emelyanov	ba25f9dcc4	Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:43 -07:00
Jiri Slaby	1977f03272	remove asm/bitops.h includes remove asm/bitops.h includes including asm/bitops directly may cause compile errors. don't include it and include linux/bitops instead. next patch will deny including asm header directly. Cc: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:41 -07:00
Anton Arapov	a25de534f8	[INET]: Justification for local port range robustness. There is a justifying patch for Stephen's patches. Stephen's patches disallows using a port range of one single port and brakes the meaning of the 'remaining' variable, in some places it has different meaning. My patch gives back the sense of 'remaining' variable. It should mean how many ports are remaining and nothing else. Also my patch allows using a single port. I sure we must be able to use mentioned port range, this does not restricted by documentation and does not brake current behavior. usefull links: Patches posted by Stephen Hemminger http://marc.info/?l=linux-netdev&m=119206106218187&w=2 http://marc.info/?l=linux-netdev&m=119206109918235&w=2 Andrew Morton's comment http://marc.info/?l=linux-kernel&m=119248225007737&w=2 1. Allows using a port range of one single port. 2. Gives back sense of 'remaining' variable. Signed-off-by: Anton Arapov <aarapov@redhat.com> Acked-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-18 22:00:17 -07:00
Linus Torvalds	a57793651f	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (51 commits) [IPV6]: Fix again the fl6_sock_lookup() fixed locking [NETFILTER]: nf_conntrack_tcp: fix connection reopening fix [IPV6]: Fix race in ipv6_flowlabel_opt() when inserting two labels [IPV6]: Lost locking in fl6_sock_lookup [IPV6]: Lost locking when inserting a flowlabel in ipv6_fl_list [NETFILTER]: xt_sctp: fix mistake to pass a pointer where array is required [NET]: Fix OOPS due to missing check in dev_parse_header(). [TCP]: Remove lost_retrans zero seqno special cases [NET]: fix carrier-on bug? [NET]: Fix uninitialised variable in ip_frag_reasm() [IPSEC]: Rename mode to outer_mode and add inner_mode [IPSEC]: Disallow combinations of RO and AH/ESP/IPCOMP [IPSEC]: Use the top IPv4 route's peer instead of the bottom [IPSEC]: Store afinfo pointer in xfrm_mode [IPSEC]: Add missing BEET checks [IPSEC]: Move type and mode map into xfrm_state.c [IPSEC]: Fix length check in xfrm_parse_spi [IPSEC]: Move ip_summed zapping out of xfrm6_rcv_spi [IPSEC]: Get nexthdr from caller in xfrm6_rcv_spi [IPSEC]: Move tunnel parsing for IPv4 out of xfrm4_input ...	2007-10-18 14:40:30 -07:00
Eric W. Biederman	064b5bba0c	sysctl: remove broken netfilter binary sysctls No one has bothered to set strategy routine for the the netfilter sysctls that return jiffies to be sysctl_jiffies. So it appears the sys_sysctl path is unused and untested, so this patch removes the binary sysctl numbers. Which fixes the netfilter oops in 2.6.23-rc2-mm2 for me. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Patrick McHardy <kaber@trash.net> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	49641b58a7	sysctl: ipv4 remove binary sysctl paths where they are broken Currently tcp_available_congestion_control does not even attempt being read from sys_sysctl, and ipfrag_max_dist while it works allows setting of invalid values using sys_sysctl. So just kill the binary sys_sysctl support for these sysctls. If the support is not important enough to test and get right it probably isn't important enough to keep. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Ilpo Järvinen	df2e014bfb	[TCP]: Remove lost_retrans zero seqno special cases Both high-sack detection and new lowest seq variables have unnecessary zero special case which are now removed by setting safe initial seqnos. This also fixes problem which caused zero received_upto being passed to tcp_mark_lost_retrans which confused after relations within the marker loop causing incorrect TCPCB_SACKED_RETRANS clearing. The problem was noticed because of a performance report from TAKANO Ryousei <takano@axe-inc.co.jp>. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: Ryousei Takano <takano-ryousei@aist.go.jp> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-18 05:07:57 -07:00
David Howells	45542479fb	[NET]: Fix uninitialised variable in ip_frag_reasm() Fix uninitialised variable in ip_frag_reasm(). err should be set to -ENOMEM if the initial call of skb_clone() fails. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 21:37:22 -07:00
Herbert Xu	13996378e6	[IPSEC]: Rename mode to outer_mode and add inner_mode This patch adds a new field to xfrm states called inner_mode. The existing mode object is renamed to outer_mode. This is the first part of an attempt to fix inter-family transforms. As it is we always use the outer family when determining which mode to use. As a result we may end up shoving IPv4 packets into netfilter6 and vice versa. What we really want is to use the inner family for the first part of outbound processing and the outer family for the second part. For inbound processing we'd use the opposite pairing. I've also added a check to prevent silly combinations such as transport mode with inter-family transforms. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 21:35:51 -07:00
Herbert Xu	ed3e37ddb0	[IPSEC]: Use the top IPv4 route's peer instead of the bottom For IPv4 we were using the bottom route's peer instead of the top one. This is wrong because the peer is only used by TCP to keep track of information about the TCP destination address which certainly does not live in the bottom route. This patch fixes that which allows us to get rid of the family check since the bottom route could be IPv6 while the top one must always be IPv4. I've also changed the other fields which are IPv4-specific to get the info from the top route instead of potentially bogus data from the bottom route. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 21:34:46 -07:00
Herbert Xu	17c2a42a24	[IPSEC]: Store afinfo pointer in xfrm_mode It is convenient to have a pointer from xfrm_state to address-specific functions such as the output function for a family. Currently the address-specific policy code calls out to the xfrm state code to get those pointers when we could get it in an easier way via the state itself. This patch adds an xfrm_state_afinfo to xfrm_mode (since they're address-specific) and changes the policy code to use it. I've also added an owner field to do reference counting on the module providing the afinfo even though it isn't strictly necessary today since IPv6 can't be unloaded yet. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 21:33:12 -07:00
Herbert Xu	1bfcb10f67	[IPSEC]: Add missing BEET checks Currently BEET mode does not reinject the packet back into the stack like tunnel mode does. Since BEET should behave just like tunnel mode this is incorrect. This patch fixes this by introducing a flags field to xfrm_mode that tells the IPsec code whether it should terminate and reinject the packet back into the stack. It then sets the flag for BEET and tunnel mode. I've also added a number of missing BEET checks elsewhere where we check whether a given mode is a tunnel or not. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 21:31:50 -07:00
Herbert Xu	c4541b41c0	[IPSEC]: Move tunnel parsing for IPv4 out of xfrm4_input This patch moves the tunnel parsing for IPv4 out of xfrm4_input and into xfrm4_tunnel. This change is in line with what IPv6 does and will allow us to merge the two input functions. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 21:28:53 -07:00
Herbert Xu	04663d0b8b	[IPSEC]: Fix pure tunnel modes involving IPv6 I noticed that my recent patch broke 6-on-4 pure IPsec tunnels (the ones that are only used for incompressible IPsec packets). Subsequent reviews show that I broke 6-on-6 pure tunnels more than three years ago and nobody ever noticed. I suppose every must be testing 6-on-6 IPComp with large pings which are very compressible :) This patch fixes both cases. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 21:28:06 -07:00
Pavel Emelyanov	c95477090a	[INET]: Consolidate frag queues freeing Since we now allocate the queues in inet_fragment.c, we can safely free it in the same place. The ->destructor callback thus becomes optional for inet_frags. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 19:48:26 -07:00
Pavel Emelyanov	48d6005638	[INET]: Remove no longer needed ->equal callback Since this callback is used to check for conflicts in hashtable when inserting a newly created frag queue, we can do the same by checking for matching the queue with the argument, used to create one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 19:47:56 -07:00
Pavel Emelyanov	abd6523d15	[INET]: Consolidate xxx_find() in fragment management Here we need another callback ->match to check whether the entry found in hash matches the key passed. The key used is the same as the creation argument for inet_frag_create. Yet again, this ->match is the same for netfilter and ipv6. Running a frew steps forward - this callback will later replace the ->equal one. Since the inet_frag_find() uses the already consolidated inet_frag_create() remove the xxx_frag_create from protocol codes. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 19:47:21 -07:00
Pavel Emelyanov	c6fda28229	[INET]: Consolidate xxx_frag_create() This one uses the xxx_frag_intern() and xxx_frag_alloc() routines, which are already consolidated, so remove them from protocol code (as promised). The ->constructor callback is used to init the rest of the frag queue and it is the same for netfilter and ipv6. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 19:46:47 -07:00
Pavel Emelyanov	e521db9d79	[INET]: Consolidate xxx_frag_alloc() Just perform the kzalloc() allocation and setup common fields in the inet_frag_queue(). Then return the result to the caller to initialize the rest. The inet_frag_alloc() may return NULL, so check the return value before doing the container_of(). This looks ugly, but the xxx_frag_alloc() will be removed soon. The xxx_expire() timer callbacks are patches, because the argument is now the inet_frag_queue, not the protocol specific queue. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 19:45:23 -07:00
Pavel Emelyanov	2588fe1d78	[INET]: Consolidate xxx_frag_intern This routine checks for the existence of a given entry in the hash table and inserts the new one if needed. The ->equal callback is used to compare two frag_queue-s together, but this one is temporary and will be removed later. The netfilter code and the ipv6 one use the same routine to compare frags. The inet_frag_intern() always returns non-NULL pointer, so convert the inet_frag_queue into protocol specific one (with the container_of) without any checks. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 19:44:34 -07:00
Pavel Emelyanov	fd9e63544c	[INET]: Omit double hash calculations in xxx_frag_intern Since the hash value is already calculated in xxx_find, we can simply use it later. This is already done in netfilter code, so make the same in ipv4 and ipv6. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-17 19:43:37 -07:00
Denis V. Lunev	f1673ca52c	[INET]: kmalloc+memset -> kzalloc in frag_alloc_queue kmalloc + memset -> kzalloc in frag_alloc_queue Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:53:13 -07:00
Pavel Emelyanov	762cc40801	[INET]: Consolidate the xxx_put These ones use the generic data types too, so move them in one place. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:43 -07:00
Pavel Emelyanov	4b6cb5d8e3	[INET]: Small cleanup for xxx_put after evictor consolidation After the evictor code is consolidated there is no need in passing the extra pointer to the xxx_put() functions. The only place when it made sense was the evictor code itself. Maybe this change must got with the previous (or with the next) patch, but I try to make them shorter as much as possible to simplify the review (but they are still large anyway), so this change goes in a separate patch. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:43 -07:00
Pavel Emelyanov	8e7999c44e	[INET]: Consolidate the xxx_evictor The evictors collect some statistics for ipv4 and ipv6, so make it return the number of evicted queues and account them all at once in the caller. The XXX_ADD_STATS_BH() macros are just for this case, but maybe there are places in code, that can make use of them as well. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:42 -07:00
Pavel Emelyanov	1e4b82873a	[INET]: Consolidate the xxx_frag_destroy To make in possible we need to know the exact frag queue size for inet_frags->mem management and two callbacks: * to destoy the skb (optional, used in conntracks only) * to free the queue itself (mandatory, but later I plan to move the allocation and the destruction of frag_queues into the common place, so this callback will most likely be optional too). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:42 -07:00
Pavel Emelyanov	321a3a99e4	[INET]: Consolidate xxx_the secret_rebuild This code works with the generic data types as well, so move this into inet_fragment.c This move makes it possible to hide the secret_timer management and the secret_rebuild routine completely in the inet_fragment.c Introduce the ->hashfn() callback in inet_frags() to get the hashfun for a given inet_frag_queue() object. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:41 -07:00
Pavel Emelyanov	277e650ddf	[INET]: Consolidate the xxx_frag_kill Since now all the xxx_frag_kill functions now work with the generic inet_frag_queue data type, this can be moved into a common place. The xxx_unlink() code is moved as well. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:41 -07:00
Pavel Emelyanov	04128f233f	[INET]: Collect common frag sysctl variables together Some sysctl variables are used to tune the frag queues management and it will be useful to work with them in a common way in the future, so move them into one structure, moreover they are the same for all the frag management codes. I don't place them in the existing inet_frags object, introduced in the previous patch for two reasons: 1. to keep them in the __read_mostly section; 2. not to export the whole inet_frags objects outside. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:40 -07:00
Pavel Emelyanov	7eb95156d9	[INET]: Collect frag queues management objects together There are some objects that are common in all the places which are used to keep track of frag queues, they are: * hash table * LRU list * rw lock * rnd number for hash function * the number of queues * the amount of memory occupied by queues * secret timer Move all this stuff into one structure (struct inet_frags) to make it possible use them uniformly in the future. Like with the previous patch this mostly consists of hunks like - write_lock(&ipfrag_lock); + write_lock(&ip4_frags.lock); To address the issue with exporting the number of queues and the amount of memory occupied by queues outside the .c file they are declared in, I introduce a couple of helpers. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:39 -07:00
Pavel Emelyanov	5ab11c98d3	[INET]: Move common fields from frag_queues in one place. Introduce the struct inet_frag_queue in include/net/inet_frag.h file and place there all the common fields from three structs: * struct ipq in ipv4/ip_fragment.c * struct nf_ct_frag6_queue in nf_conntrack_reasm.c * struct frag_queue in ipv6/reassembly.c After this, replace these fields on appropriate structures with this structure instance and fix the users to use correct names i.e. hunks like - atomic_dec(&fq->refcnt); + atomic_dec(&fq->q.refcnt); (these occupy most of the patch) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:38 -07:00
Ilpo Järvinen	f885c5b08e	[TCP]: high_seq parameter removed (all callers use tp->high_seq) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:37 -07:00
Patrick McHardy	861d048607	[IPV4]: Uninline netfilter okfns Now that we don't pass double skb pointers to nf_hook_slow anymore, gcc can generate tail calls for some of the netfilter hook okfn invocations, so there is no need to inline the functions anymore. This caused huge code bloat since we ended up with one inlined version and one out-of-line version since we pass the address to nf_hook_slow. Before: text data bss dec hex filename 8997385 1016524 524652 10538561 a0ce41 vmlinux After: text data bss dec hex filename 8994009 1016524 524652 10535185 a0c111 vmlinux ------------------------------------------------------- -3376 All cases have been verified to generate tail-calls with and without netfilter. The okfns in ipmr and xfrm4_input still remain inline because gcc can't generate tail-calls for them. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:35 -07:00
Herbert Xu	3db05fea51	[NETFILTER]: Replace sk_buff ** with sk_buff * With all the users of the double pointers removed, this patch mops up by finally replacing all occurances of sk_buff ** in the netfilter API by sk_buff *. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:29 -07:00
Herbert Xu	2ca7b0ac02	[NETFILTER]: Avoid skb_copy/pskb_copy/skb_realloc_headroom This patch replaces unnecessary uses of skb_copy, pskb_copy and skb_realloc_headroom by functions such as skb_make_writable and pskb_expand_head. This allows us to remove the double pointers later. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:28 -07:00
Herbert Xu	af1e1cf073	[IPVS]: Replace local version of skb_make_writable This patch removes the IPVS-specific version of skb_make_writable and replaces it with the netfilter one. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:28 -07:00
Herbert Xu	37d4187922	[NETFILTER]: Do not copy skb in skb_make_writable Now that all callers of netfilter can guarantee that the skb is not shared, we no longer have to copy the skb in skb_make_writable. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:27 -07:00
Herbert Xu	776c729e8d	[IPV4]: Change ip_defrag to return an integer Now that ip_frag always returns the packet given to it on input, we can change it to return an integer indicating error instead. This patch does that and updates all its callers accordingly. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:25 -07:00
Herbert Xu	1706d58763	[IPV4]: Make ip_defrag return the same packet This patch is a bit of a hack. However it is worth it if you consider that this is the only reason why we have to carry around the struct sk_buff ** pointers in netfilter. It makes ip_defrag always return the packet that was given to it on input. It does this by cloning the packet and replacing its original contents with the head fragment if necessary. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:25 -07:00
Al Viro	f53f4137ba	fix endianness bug in inet_lro all uses of and almost all assignments to lro_desc->tcp_ack assume that it's net-endian; one converts net-endian to host-endian and sticks it in lro_desc->tcp_ack. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-14 12:41:52 -07:00
Al Viro	9df7c98a0f	inet_lro: trivial endianness annotations Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-14 12:41:52 -07:00
Ilpo Järvinen	b08d6cb22c	[TCP]: Limit processing lost_retrans loop to work-to-do cases This addition of lost_retrans_low to tcp_sock might be unnecessary, it's not clear how often lost_retrans worker is executed when there wasn't work to do. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-11 17:36:13 -07:00
Ilpo Järvinen	f785a8e28b	[TCP]: Fix lost_retrans loop vs fastpath problems Detection implemented with lost_retrans must work also when fastpath is taken, yet most of the queue is skipped including (very likely) those retransmitted skb's we're interested in. This problem appeared when the hints got added, which removed a need to always walk over the whole write queue head. Therefore decicion for the lost_retrans worker loop entry must be separated from the sacktag processing more than it was necessary before. It turns out to be problematic to optimize the worker loop very heavily because ack_seqs of skb may have a number of discontinuity points. Maybe similar approach as currently is implemented could be attempted but that's becoming more and more complex because the trend is towards less skb walking in sacktag marker. Trying a simple work until all rexmitted skbs heve been processed approach. Maybe after(highest_sack_end_seq, tp->high_seq) checking is not sufficiently accurate and causes entry too often in no-work-to-do cases. Since that's not known, I've separated solution to that from this patch. Noticed because of report against a related problem from TAKANO Ryousei <takano@axe-inc.co.jp>. He also provided a patch to that part of the problem. This patch includes solution to it (though this patch has to use somewhat different placement). TAKANO's description and patch is available here: http://marc.info/?l=linux-netdev&m=119149311913288&w=2 ...In short, TAKANO's problem is that end_seq the loop is using not necessarily the largest SACK block's end_seq because the current ACK may still have higher SACK blocks which are later by the loop. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-11 17:35:41 -07:00
Ilpo Järvinen	4cd829995b	[TCP]: No need to re-count fackets_out/sacked_out at RTO Both sacked_out and fackets_out are directly known from how parameter. Since fackets_out is accurate, there's no need for recounting (sacked_out was previously unnecessarily counted in the loop anyway). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-11 17:34:57 -07:00
Ilpo Järvinen	d193594299	[TCP]: Extract tcp_match_queue_to_sack from sacktag code This is necessary for upcoming DSACK bugfix. Reduces sacktag length which is not very sad thing at all... :-) Notice that there's a need to handle out-of-mem at caller's place. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-11 17:34:25 -07:00
Ilpo Järvinen	f6fb128d27	[TCP]: Kill almost unused variable pcount from sacktag It's on the way for future cutting of that function. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-11 17:33:55 -07:00
Ilpo Järvinen	3eec0047d9	[TCP]: Fix mark_head_lost to ignore R-bit when trying to mark L This condition (plain R) can arise at least in recovery that is triggered after tcp_undo_loss. There isn't any reason why they should not be marked as lost, not marking makes in_flight estimator to return too large values. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-11 17:33:11 -07:00
Ilpo Järvinen	16e906812f	[TCP]: Add bytes_acked (ABC) clearing to FRTO too I was reading tcp_enter_loss while looking for Cedric's bug and noticed bytes_acked adjustment is missing from FRTO side. Since bytes_acked will only be used in tcp_cong_avoid, I think it's safe to assume RTO would be spurious. During FRTO cwnd will be not controlled by tcp_cong_avoid and if FRTO calls for conventional recovery, cwnd is adjusted and the result of wrong assumption is cleared from bytes_acked. If RTO was in fact spurious, we did normal ABC already and can continue without any additional adjustments. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-11 17:32:31 -07:00
David S. Miller	28f7b0360f	[NETLINK]: fib_frontend build fixes 1) fibnl needs to be declared outside of config ifdefs, and also should not be explicitly initialized to NULL 2) nl_fib_input() args are wrong for netlink_kernel_create() input method Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 21:32:39 -07:00
Denis V. Lunev	cd40b7d398	[NET]: make netlink user -> kernel interface synchronious This patch make processing netlink user -> kernel messages synchronious. This change was inspired by the talk with Alexey Kuznetsov about current netlink messages processing. He says that he was badly wrong when introduced asynchronious user -> kernel communication. The call netlink_unicast is the only path to send message to the kernel netlink socket. But, unfortunately, it is also used to send data to the user. Before this change the user message has been attached to the socket queue and sk->sk_data_ready was called. The process has been blocked until all pending messages were processed. The bad thing is that this processing may occur in the arbitrary process context. This patch changes nlk->data_ready callback to get 1 skb and force packet processing right in the netlink_unicast. Kernel -> user path in netlink_unicast remains untouched. EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock drop, but the process remains in the cycle until the message will be fully processed. So, there is no need to use this kludges now. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 21:15:29 -07:00
Stephen Hemminger	227b60f510	[INET]: local port range robustness Expansion of original idea from Denis V. Lunev <den@openvz.org> Add robustness and locking to the local_port_range sysctl. 1. Enforce that low < high when setting. 2. Use seqlock to ensure atomic update. The locking might seem like overkill, but there are cases where sysadmin might want to change value in the middle of a DoS attack. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 17:30:46 -07:00
Herbert Xu	631a6698d0	[IPSEC]: Move IP protocol setting from transforms into xfrm4_input.c This patch makes the IPv4 x->type->input functions return the next protocol instead of setting it directly. This is identical to how we do things in IPv6 and will help us merge common code on the input path. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:56 -07:00
Herbert Xu	ceb1eec829	[IPSEC]: Move IP length/checksum setting out of transforms This patch moves the setting of the IP length and checksum fields out of the transforms and into the xfrmX_output functions. This would help future efforts in merging the transforms themselves. It also adds an optimisation to ipcomp due to the fact that the transport offset is guaranteed to be zero. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:56 -07:00
Herbert Xu	87bdc48d30	[IPSEC]: Get rid of ipv6_{auth,esp,comp}_hdr This patch removes the duplicate ipv6_{auth,esp,comp}_hdr structures since they're identical to the IPv4 versions. Duplicating them would only create problems for ourselves later when we need to add things like extended sequence numbers. I've also added transport header type conversion headers for these types which are now used by the transforms. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:55 -07:00
Herbert Xu	37fedd3aab	[IPSEC]: Use IPv6 calling convention as the convention for x->mode->output The IPv6 calling convention for x->mode->output is more general and could help an eventual protocol-generic x->type->output implementation. This patch adopts it for IPv4 as well and modifies the IPv4 type output functions accordingly. It also rewrites the IPv6 mac/transport header calculation to be based off the network header where practical. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:54 -07:00
Herbert Xu	7b277b1a5f	[IPSEC]: Set skb->data to payload in x->mode->output This patch changes the calling convention so that on entry from x->mode->output and before entry into x->type->output skb->data will point to the payload instead of the IP header. This is essentially a redistribution of skb_push/skb_pull calls with the aim of minimising them on the common path of tunnel + ESP. It'll also let us use the same calling convention between IPv4 and IPv6 with the next patch. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:54 -07:00
Herbert Xu	8bd1707504	[IPSEC] esp: Remove NAT-T checksum invalidation for BEET I pointed this out back when this patch was first proposed but it looks like it got lost along the way. The checksum only needs to be ignored for NAT-T in transport mode where we lose the original inner addresses due to NAT. With BEET the inner addresses will be intact so the checksum remains valid. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:53 -07:00
Ilpo Järvinen	1c1e87edb9	[TCP]: Separate lost_retrans loop into own function Follows own function for each task principle, this is really somewhat separate task being done in sacktag. Also reduces indentation. In addition, added ack_seq local var to break some long lines & fixed coding style things. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:51 -07:00
Pavel Emelyanov	e2da591338	[NETFILTER]: Make netfilter code use the seq_open_private Just switch to the consolidated calls. ipt_recent() has to initialize the private, so use the __seq_open_private() helper. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:34 -07:00
Pavel Emelyanov	cf7732e4cc	[NET]: Make core networking code use seq_open_private This concerns the ipv4 and ipv6 code mostly, but also the netlink and unix sockets. The netlink code is an example of how to use the __seq_open_private() call - it saves the net namespace on this private. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:33 -07:00
Herbert Xu	b7c6538cd8	[IPSEC]: Move state lock into x->type->output This patch releases the lock on the state before calling x->type->output. It also adds the lock to the spots where they're currently needed. Most of those places (all except mip6) are expected to disappear with async crypto. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:03 -07:00
Herbert Xu	007f0211a8	[IPSEC]: Store IPv6 nh pointer in mac_header on output Current the x->mode->output functions store the IPv6 nh pointer in the skb network header. This is inconvenient because the network header then has to be fixed up before the packet can leave the IPsec stack. The mac header field is unused on output so we can use that to store this instead. This patch does that and removes the network header fix-up in xfrm_output. It also uses ipv6_hdr where appropriate in the x->type->output functions. There is also a minor clean-up in esp4 to make it use the same code as esp6 to help any subsequent effort to merge the two. Lastly it kills two redundant skb_set_* statements in BEET that were simply copied over from transport mode. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:55:00 -07:00
Herbert Xu	436a0a4022	[IPSEC]: Move output replay code into xfrm_output The replay counter is one of only two remaining things in the output code that requires a lock on the xfrm state (the other being the crypto). This patch moves it into the generic xfrm_output so we can remove the lock from the transforms themselves. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:54 -07:00
Herbert Xu	406ef77c89	[IPSEC]: Move common output code to xfrm_output Most of the code in xfrm4_output_one and xfrm6_output_one are identical so this patch moves them into a common xfrm_output function which will live in net/xfrm. In fact this would seem to fix a bug as on IPv4 we never reset the network header after a transform which may upset netfilter later on. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:53 -07:00
Herbert Xu	bc31d3b2c7	[IPSEC] ah: Remove keys from ah_data structure The keys are only used during initialisation so we don't need to carry them in esp_data. Since we don't have to allocate them again, there is no need to place a limit on the authentication key length anymore. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:53 -07:00
Herbert Xu	4b7137ff8f	[IPSEC] esp: Remove keys from esp_data structure The keys are only used during initialisation so we don't need to carry them in esp_data. Since we don't have to allocate them again, there is no need to place a limit on the authentication key length anymore. This patch also kills the unused auth.icv member. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:52 -07:00
Stephen Hemminger	cfcabdcc2d	[NET]: sparse warning fixes Fix a bunch of sparse warnings. Mostly about 0 used as NULL pointer, and shadowed variable declarations. One notable case was that hash size should have been unsigned. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:48 -07:00
Ilpo Järvinen	de83c058af	[TCP]: "Annotate" another fackets_out state reset This should no longer be necessary because fackets_out is accurate. It indicates bugs elsewhere, thus report it. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:48 -07:00
Ilpo Järvinen	29d0a309d1	[TCP]: Fix two off-by-one errors in fackets_out adjusting logic 1) Passing wrong skb to tcp_adjust_fackets_out could corrupt fastpath_cnt_hint as tcp_skb_pcount(next_skb) is not included to it if hint points exactly to the next_skb (it's lagging behind, see sacktag). 2) When fastpath_skb_hint is put backwards to avoid dangling skb reference, the skb's pcount must also be removed from count (not included like above). Reported by Cedric Le Goater <legoater@free.fr> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:47 -07:00
Ilpo Järvinen	3de96471bd	[TCP]: Wrap-safed reordering detection FRTO check In case somebody has a suggestion about a better place for this check, which must guarantee execution "early enough" (i.e, before the wrap can occur), I'm very open to them. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:54:00 -07:00
Ilpo Järvinen	0e835331e3	[TCP]: Update comment of SACK block validator Just came across what RFC2018 states about generation of valid SACK blocks in case of reneging. Alter comment a bit to point out clearly. IMHO, there isn't any reason to change code because the validation is there for a purpose (counters will inform user about decision TCP made if this case ever surfaces). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:53:59 -07:00
Ilpo Järvinen	95eacd27e2	[TCP]: fix comments that got messed up during code move Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:53:59 -07:00
Ilpo Järvinen	dc86967b54	[TCP]: No fackets_out/highest_sack tuning when SACK isn't enabled This was found due to bug report from Cedric Le Goater though it turned this turned out to be unrelated bug. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:53:58 -07:00

1 2 3 4 5 ...

1884 Commits