linux_old1

Commit Graph

Author	SHA1	Message	Date
Neal Cardwell	70315d22d3	inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock (not just TIME_WAIT), and for such sockets the tw_substate field holds the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2. This brings the inet_diag state-matching code in line with the field it uses to populate idiag_state. This is also analogous to the info exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state and get_timewait4_sock() exports tw->tw_substate. Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi state time-wait" would also return sockets in state TCP_FIN_WAIT2. This is an old bug that predates `05dbc7b` ("tcp/dccp: remove twchain"). Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-13 22:35:46 -08:00
Daniel Borkmann	b1aac815c0	net: inet_diag: zero out uninitialized idiag_{src,dst} fields Jakub reported while working with nlmon netlink sniffer that parts of the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6. That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3]. In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab] memory through this. At least, in udp_dump_one(), we allocate a skb in ... rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL); ... and then pass that to inet_sk_diag_fill() that puts the whole struct inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0], r->id.idiag_dst[0] and leave the rest untouched: r->id.idiag_src[0] = inet->inet_rcv_saddr; r->id.idiag_dst[0] = inet->inet_daddr; struct inet_diag_msg embeds struct inet_diag_sockid that is correctly / fully filled out in IPv6 case, but for IPv4 not. So just zero them out by using plain memset (for this little amount of bytes it's probably not worth the extra check for idiag_family == AF_INET). Similarly, fix also other places where we fill that out. Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-19 14:55:52 -05:00
Eric Dumazet	c1d607cc4a	inet_diag: use sock_gen_put() TCP listener refactoring, part 6 : Use sock_gen_put() from inet_diag_dump_one_icsk() for future SYN_RECV support. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-17 15:02:02 -04:00
Eric Dumazet	634fb979e8	inet: includes a sock_common in request_sock TCP listener refactoring, part 5 : We want to be able to insert request sockets (SYN_RECV) into main ehash table instead of the per listener hash table to allow RCU lookups and remove listener lock contention. This patch includes the needed struct sock_common in front of struct request_sock This means there is no more inet6_request_sock IPv6 specific structure. Following inet_request_sock fields were renamed as they became macros to reference fields from struct sock_common. Prefix ir_ was chosen to avoid name collisions. loc_port -> ir_loc_port loc_addr -> ir_loc_addr rmt_addr -> ir_rmt_addr rmt_port -> ir_rmt_port iif -> ir_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-10 00:08:07 -04:00
Eric Dumazet	efe4208f47	ipv6: make lookups simpler and faster TCP listener refactoring, part 4 : To speed up inet lookups, we moved IPv4 addresses from inet to struct sock_common Now is time to do the same for IPv6, because it permits us to have fast lookups for all kind of sockets, including upcoming SYN_RECV. Getting IPv6 addresses in TCP lookups currently requires two extra cache lines, plus a dereference (and memory stall). inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6 This patch is way bigger than its IPv4 counter part, because for IPv4, we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6, it's not doable easily. inet6_sk(sk)->daddr becomes sk->sk_v6_daddr inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr at the same offset. We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-09 00:01:25 -04:00
Eric Dumazet	05dbc7b594	tcp/dccp: remove twchain TCP listener refactoring, part 3 : Our goal is to hash SYN_RECV sockets into main ehash for fast lookup, and parallel SYN processing. Current inet_ehash_bucket contains two chains, one for ESTABLISH (and friend states) sockets, another for TIME_WAIT sockets only. As the hash table is sized to get at most one socket per bucket, it makes little sense to have separate twchain, as it makes the lookup slightly more complicated, and doubles hash table memory usage. If we make sure all socket types have the lookup keys at the same offsets, we can use a generic and faster lookup. It turns out TIME_WAIT and ESTABLISHED sockets already have common lookup fields for IPv4. [ INET_TW_MATCH() is no longer needed ] I'll provide a follow-up to factorize IPv6 lookup as well, to remove INET6_TW_MATCH() This way, SYN_RECV pseudo sockets will be supported the same. A new sock_gen_put() helper is added, doing either a sock_put() or inet_twsk_put() [ and will support SYN_RECV later ]. Note this helper should only be called in real slow path, when rcu lookup found a socket that was moved to another identity (freed/reused immediately), but could eventually be used in other contexts, like sock_edemux() Before patch : dmesg \| grep "TCP established" TCP established hash table entries: 524288 (order: 11, 8388608 bytes) After patch : TCP established hash table entries: 524288 (order: 10, 4194304 bytes) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-08 23:19:24 -04:00
Eric Dumazet	96f817fede	tcp: shrink tcp6_timewait_sock by one cache line While working on tcp listener refactoring, I found that it would really make things easier if sock_common could include the IPv6 addresses needed in the lookups, instead of doing very complex games to get their values (depending on sock being SYN_RECV, ESTABLISHED, TIME_WAIT) For this to happen, I need to be sure that tcp6_timewait_sock and tcp_timewait_sock consume same number of cache lines. This is possible if we only use 32bits for tw_ttd, as we remove one 32bit hole in inet_timewait_sock inet_tw_time_stamp() is defined and used, even if its current implementation looks like tcp_time_stamp : We might need finer resolution for tcp_time_stamp in the future. Before patch : sizeof(struct tcp6_timewait_sock) = 0xc8 After patch : sizeof(struct tcp6_timewait_sock) = 0xc0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-03 17:43:39 -04:00
Patrick McHardy	e32123e598	netlink: rename ssk to sk in struct netlink_skb_params Memory mapped netlink needs to store the receiving userspace socket when sending from the kernel to userspace. Rename 'ssk' to 'sk' to avoid confusion. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:56 -04:00
Nandita Dukkipati	6ba8a3b19e	tcp: Tail loss probe (TLP) This patch series implement the Tail loss probe (TLP) algorithm described in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The first patch implements the basic algorithm. TLP's goal is to reduce tail latency of short transactions. It achieves this by converting retransmission timeouts (RTOs) occuring due to tail losses (losses at end of transactions) into fast recovery. TLP transmits one packet in two round-trips when a connection is in Open state and isn't receiving any ACKs. The transmitted packet, aka loss probe, can be either new or a retransmission. When there is tail loss, the ACK from a loss probe triggers FACK/early-retransmit based fast recovery, thus avoiding a costly RTO. In the absence of loss, there is no change in the connection state. PTO stands for probe timeout. It is a timer event indicating that an ACK is overdue and triggers a loss probe packet. The PTO value is set to max(2SRTT, 10ms) and is adjusted to account for delayed ACK timer when there is only one oustanding packet. TLP Algorithm On transmission of new data in Open state: -> packets_out > 1: schedule PTO in max(2SRTT, 10ms). -> packets_out == 1: schedule PTO in max(2RTT, 1.5RTT + 200ms) -> PTO = min(PTO, RTO) Conditions for scheduling PTO: -> Connection is in Open state. -> Connection is either cwnd limited or no new data to send. -> Number of probes per tail loss episode is limited to one. -> Connection is SACK enabled. When PTO fires: new_segment_exists: -> transmit new segment. -> packets_out++. cwnd remains same. no_new_packet: -> retransmit the last segment. Its ACK triggers FACK or early retransmit based recovery. ACK path: -> rearm RTO at start of ACK processing. -> reschedule PTO if need be. In addition, the patch includes a small variation to the Early Retransmit (ER) algorithm, such that ER and TLP together can in principle recover any N-degree of tail loss through fast recovery. TLP is controlled by the same sysctl as ER, tcp_early_retrans sysctl. tcp_early_retrans==0; disables TLP and ER. ==1; enables RFC5827 ER. ==2; delayed ER. ==3; TLP and delayed ER. [DEFAULT] ==4; TLP only. The TLP patch series have been extensively tested on Google Web servers. It is most effective for short Web trasactions, where it reduced RTOs by 15% and improved HTTP response time (average by 6%, 99th percentile by 10%). The transmitted probes account for <0.5% of the overall transmissions. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-12 08:30:34 -04:00
Linus Torvalds	6be35c700f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking changes from David Miller: 1) Allow to dump, monitor, and change the bridge multicast database using netlink. From Cong Wang. 2) RFC 5961 TCP blind data injection attack mitigation, from Eric Dumazet. 3) Networking user namespace support from Eric W. Biederman. 4) tuntap/virtio-net multiqueue support by Jason Wang. 5) Support for checksum offload of encapsulated packets (basically, tunneled traffic can still be checksummed by HW). From Joseph Gasparakis. 6) Allow BPF filter access to VLAN tags, from Eric Dumazet and Daniel Borkmann. 7) Bridge port parameters over netlink and BPDU blocking support from Stephen Hemminger. 8) Improve data access patterns during inet socket demux by rearranging socket layout, from Eric Dumazet. 9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and Jon Maloy. 10) Update TCP socket hash sizing to be more in line with current day realities. The existing heurstics were choosen a decade ago. From Eric Dumazet. 11) Fix races, queue bloat, and excessive wakeups in ATM and associated drivers, from Krzysztof Mazur and David Woodhouse. 12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions in VXLAN driver, from David Stevens. 13) Add "oops_only" mode to netconsole, from Amerigo Wang. 14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also allow DCB netlink to work on namespaces other than the initial namespace. From John Fastabend. 15) Support PTP in the Tigon3 driver, from Matt Carlson. 16) tun/vhost zero copy fixes and improvements, plus turn it on by default, from Michael S. Tsirkin. 17) Support per-association statistics in SCTP, from Michele Baldessari. And many, many, driver updates, cleanups, and improvements. Too numerous to mention individually. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) net/mlx4_en: Add support for destination MAC in steering rules net/mlx4_en: Use generic etherdevice.h functions. net: ethtool: Add destination MAC address to flow steering API bridge: add support of adding and deleting mdb entries bridge: notify mdb changes via netlink ndisc: Unexport ndisc_{build,send}_skb(). uapi: add missing netconf.h to export list pkt_sched: avoid requeues if possible solos-pci: fix double-free of TX skb in DMA mode bnx2: Fix accidental reversions. bna: Driver Version Updated to 3.1.2.1 bna: Firmware update bna: Add RX State bna: Rx Page Based Allocation bna: TX Intr Coalescing Fix bna: Tx and Rx Optimizations bna: Code Cleanup and Enhancements ath9k: check pdata variable before dereferencing it ath5k: RX timestamp is reported at end of frame ath9k_htc: RX timestamp is reported at end of frame ...	2012-12-12 18:07:07 -08:00
Neal Cardwell	5e1f54201c	inet_diag: validate port comparison byte code to prevent unsafe reads Add logic to verify that a port comparison byte code operation actually has the second inet_diag_bc_op from which we read the port for such operations. Previously the code blindly referenced op[1] without first checking whether a second inet_diag_bc_op struct could fit there. So a malicious user could make the kernel read 4 bytes beyond the end of the bytecode array by claiming to have a whole port comparison byte code (2 inet_diag_bc_op structs) when in fact the bytecode was not long enough to hold both. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-09 19:00:48 -05:00
Neal Cardwell	f67caec906	inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run() Add logic to check the address family of the user-supplied conditional and the address family of the connection entry. We now do not do prefix matching of addresses from different address families (AF_INET vs AF_INET6), except for the previously existing support for having an IPv4 prefix match an IPv4-mapped IPv6 address (which this commit maintains as-is). This change is needed for two reasons: (1) The addresses are different lengths, so comparing a 128-bit IPv6 prefix match condition to a 32-bit IPv4 connection address can cause us to unwittingly walk off the end of the IPv4 address and read garbage or oops. (2) The IPv4 and IPv6 address spaces are semantically distinct, so a simple bit-wise comparison of the prefixes is not meaningful, and would lead to bogus results (except for the IPv4-mapped IPv6 case, which this commit maintains). Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-09 18:59:37 -05:00
Neal Cardwell	405c005949	inet_diag: validate byte code to prevent oops in inet_diag_bc_run() Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND operations. Previously we did not validate the inet_diag_hostcond, address family, address length, and prefix length. So a malicious user could make the kernel read beyond the end of the bytecode array by claiming to have a whole inet_diag_hostcond when the bytecode was not long enough to contain a whole inet_diag_hostcond of the given address family. Or they could make the kernel read up to about 27 bytes beyond the end of a connection address by passing a prefix length that exceeded the length of addresses of the given family. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-09 18:59:37 -05:00
Neal Cardwell	1c95df85ca	inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state Fix inet_diag to be aware of the fact that AF_INET6 TCP connections instantiated for IPv4 traffic and in the SYN-RECV state were actually created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This means that for such connections inet6_rsk(req) returns a pointer to a random spot in memory up to roughly 64KB beyond the end of the request_sock. With this bug, for a server using AF_INET6 TCP sockets and serving IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to inet_diag_fill_req() causing an oops or the export to user space of 16 bytes of kernel memory as a garbage IPv6 address, depending on where the garbage inet6_rsk(req) pointed. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-09 18:59:37 -05:00
David S. Miller	d4185bbf62	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c Minor conflict between the BCM_CNIC define removal in net-next and a bug fix added to net. Based upon a conflict resolution patch posted by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-10 18:32:51 -05:00
Cyrill Gorcunov	cacb6ba0f3	net: inet_diag -- Return error code if protocol handler is missed We've observed that in case if UDP diag module is not supported in kernel the netlink returns NLMSG_DONE without notifying a caller that handler is missed. This patch makes __inet_diag_dump to return error code instead. So as example it become possible to detect such situation and handle it gracefully on userspace level. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: David Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Pavel Emelyanov <xemul@parallels.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-04 01:56:49 -04:00
Eric Dumazet	e6c022a4fa	tcp: better retrans tracking for defer-accept For passive TCP connections using TCP_DEFER_ACCEPT facility, we incorrectly increment req->retrans each time timeout triggers while no SYNACK is sent. SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for which we received the ACK from client). Only the last SYNACK is sent so that we can receive again an ACK from client, to move the req into accept queue. We plan to change this later to avoid the useless retransmit (and potential problem as this SYNACK could be lost) TCP_INFO later gives wrong information to user, claiming imaginary retransmits. Decouple req->retrans field into two independent fields : num_retrans : number of retransmit num_timeout : number of timeouts num_timeout is the counter that is incremented at each timeout, regardless of actual SYNACK being sent or not, and used to compute the exponential timeout. Introduce inet_rtx_syn_ack() helper to increment num_retrans only if ->rtx_syn_ack() succeeded. Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans when we re-send a SYNACK in answer to a (retransmitted) SYN. Prior to this patch, we were not counting these retransmits. Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS only if a synack packet was successfully queued. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Elliott Hughes <enh@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-11-03 14:45:00 -04:00
Pavel Emelyanov	e4e541a848	sock-diag: Report shutdown for inet and unix sockets (v2) Make it simple -- just put new nlattr with just sk->sk_shutdown bits. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-23 14:57:52 -04:00
Eric W. Biederman	15e473046c	netlink: Rename pid to portid to avoid confusion It is a frequent mistake to confuse the netlink port identifier with a process identifier. Try to reduce this confusion by renaming fields that hold port identifiers portid instead of pid. I have carefully avoided changing the structures exported to userspace to avoid changing the userspace API. I have successfully built an allyesconfig kernel with this change. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-10 15:30:41 -04:00
Eric W. Biederman	d06ca95643	userns: Teach inet_diag to work with user namespaces Compute the user namespace of the socket that we are replying to and translate the kuids of reported sockets into that user namespace. Cc: Andrew Vagin <avagin@openvz.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-08-14 21:55:20 -07:00
Andrey Vagin	51d7cccf07	net: make sock diag per-namespace Before this patch sock_diag works for init_net only and dumps information about sockets from all namespaces. This patch expands sock_diag for all name-spaces. It creates a netlink kernel socket for each netns and filters data during dumping. v2: filter accoding with netns in all places remove an unused variable. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pavel Emelyanov <xemul@parallels.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-16 22:31:34 -07:00
Thomas Graf	6e277ed59a	inet_diag: Do not use RTA_PUT() macros Also, no need to trim on nlmsg_put() failure, nothing has been added yet. We also want to use nlmsg_end(), nlmsg_new() and nlmsg_free(). Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-27 15:36:44 -07:00
David S. Miller	d106352d9f	inet_diag: Move away from NLMSG_PUT(). And use nlmsg_data() while we're here too, and remove useless casts. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-26 21:28:54 -07:00
David S. Miller	0d6c4a2e46	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/e1000e/param.c drivers/net/wireless/iwlwifi/iwl-agn-rx.c drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c drivers/net/wireless/iwlwifi/iwl-trans.h Resolved the iwlwifi conflict with mainline using 3-way diff posted by John Linville and Stephen Rothwell. In 'net' we added a bug fix to make iwlwifi report a more accurate skb->truesize but this conflicted with RX path changes that happened meanwhile in net-next. In e1000e a conflict arose in the validation code for settings of adapter->itr. 'net-next' had more sophisticated logic so that logic was used. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-07 23:35:40 -04:00
Shan Wei	8dcf01fc00	net: sock_diag_handler structs can be const read only, so change it to const. Signed-off-by: Shan Wei <davidshan@tencent.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-25 20:46:59 -04:00
Shan Wei	62ad6fcd74	udp_diag: implement idiag_get_info for udp/udplite to get queue information When we use netlink to monitor queue information for udp socket, idiag_rqueue and idiag_wqueue of inet_diag_msg are returned with 0. Keep consistent with netstat, just return back allocated rmem/wmem size. Signed-off-by: Shan Wei <davidshan@tencent.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-04-25 20:43:01 -04:00
Pablo Neira Ayuso	80d326fab5	netlink: add netlink_dump_control structure for netlink_dump_start() Davem considers that the argument list of this interface is getting out of control. This patch tries to address this issue following his proposal: struct netlink_dump_control c = { .dump = dump, .done = done, ... }; netlink_dump_start(..., &c); Suggested by David S. Miller. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-02-26 14:10:06 -05:00
Pavel Emelyanov	3b09c84cb6	inet_diag: Rename inet_diag_req_compat into inet_diag_req Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-11 12:56:06 -08:00
Pavel Emelyanov	c8991362a0	inet_diag: Rename inet_diag_req into inet_diag_req_v2 Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-11 12:56:06 -08:00
Pavel Emelyanov	c0636faa53	inet_diag: Add the SKMEMINFO extension Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-30 16:42:19 -05:00
Pavel Emelyanov	f65c1b534b	sock_diag: Generalize requests cookies managements The sk address is used as a cookie between dump/get_exact calls. It will be required for unix socket sdumping, so move it from inet_diag to sock_diag. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-16 13:48:27 -05:00
Pavel Emelyanov	aec8dc62f6	sock_diag: Fix module netlink aliases I've made a mistake when fixing the sock_/inet_diag aliases :( 1. The sock_diag layer should request the family-based alias, not just the IPPROTO_IP one; 2. The inet_diag layer should request for AF_INET+protocol alias, not just the protocol one. Thus fix this. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-16 13:48:27 -05:00
Eric Dumazet	dfd56b8b38	net: use IS_ENABLED(CONFIG_IPV6) Instead of testing defined(CONFIG_IPV6) \|\| defined(CONFIG_IPV6_MODULE) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-11 18:25:16 -05:00
Pavel Emelyanov	1942c518ca	inet_diag: Generalize inet_diag dump and get_exact calls Introduce two callbacks in inet_diag_handler -- one for dumping all sockets (with filters) and the other one for dumping a single sk. Replace direct calls to icsk handlers with indirect calls to callbacks provided by handlers. Make existing TCP and DCCP handlers use provided helpers for icsk-s. The UDP diag module will provide its own. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:08 -05:00
Pavel Emelyanov	3c4d05c805	inet_diag: Introduce the inet socket dumping routine The existing inet_csk_diag_fill dumps the inet connection sock info into the netlink inet_diag_message. Prepare this routine to be able to dump only the inet_sock part of a socket if the icsk part is missing. This will be used by UDP diag module when dumping UDP sockets. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:08 -05:00
Pavel Emelyanov	8d07d1518a	inet_diag: Introduce the byte-code run on an inet socket The upcoming UDP module will require exactly this ability, so just move the existing code to provide one. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:08 -05:00
Pavel Emelyanov	efb3cb428d	inet_diag: Split inet_diag_get_exact into parts Similar to previous patch: the 1st part locks the inet handler and will get generalized and the 2nd one dumps icsk-s and will be used by TCP and DCCP handlers. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:08 -05:00
Pavel Emelyanov	476f7dbff3	inet_diag: Split inet_diag_get_exact into parts The 1st part locks the inet handler and the 2nd one dump the inet connection sock. In the next patches the 1st part will be generalized to call the socket dumping routine indirectly (i.e. TCP/UDP/DCCP) and the 2nd part will be used by TCP and DCCP handlers. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:08 -05:00
Pavel Emelyanov	b005ab4ef8	inet_diag: Export inet diag cookie checking routine The netlink diag susbsys stores sk address bits in the nl message as a "cookie" and uses one when dumps details about particular socket. The same will be required for udp diag module, so introduce a heler in inet_diag module Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:08 -05:00
Pavel Emelyanov	87c22ea52e	inet_diag: Reduce the number of args for bytecode run routine Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:07 -05:00
Pavel Emelyanov	7b35eadd7e	inet_diag: Remove indirect sizeof from inet diag handlers There's an info_size value stored on inet_diag_handler, but for existing code this value is effectively constant, so just use sizeof(struct tcp_info) where required. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:07 -05:00
Pavel Emelyanov	8ef874bfc7	sock_diag: Move the sock_ code to net/core/ This patch moves the sock_ code from inet_diag.c to generic sock_diag.c file and provides necessary request_module-s calls and a pointer on inet_diag_compat dumping routine. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:02 -05:00
Pavel Emelyanov	a029fe26b6	inet_diag: Cleanup type2proto last user Now all the code works with sock_diag_req-compatible structs, so it's possible to stop using the inet_diag_type2proto in inet_csk_diag_fill. Pass the inet_diag_req into it and use the sdiag_protocol field. At the same time remove the explicit ext argument, since it's also on the req. However, this conversion is still required in _compat code, so just move this routine, not remove. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:02 -05:00
Pavel Emelyanov	d23deaa07b	inet_diag: Introduce socket family checks The new API will specify family to work with. Teach the existing socket walking code to bypass not interesting ones. To preserve compatibility with existing behavior the _compat code sets interesting family to AF_UNSPEC to dump them all. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:02 -05:00
Pavel Emelyanov	25c4cd2b6d	inet_diag: Switch the _dump to work with new header Make inet_diag_dumo work with given header instead of calculating one from the nl message. The SOCK_DIAG_BY_FAMILY just passes skb's one through, the compat code converts the old header to new one. Also fix the bytecode calculation to find one at proper offset. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:02 -05:00
Pavel Emelyanov	fe50ce2846	inet_diag: Switch the _get_exact to work with new header Make inet_diag_get_exact work with given header instead of calculating one from the nl message. The SOCK_DIAG_BY_FAMILY just passes skb's one through, the compat code converts the old header to new one. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:01 -05:00
Pavel Emelyanov	126fdc3249	inet_diag: Introduce new inet_diag_req header This one coinsides with the sock_diag_req in the beginning and contains only used fields from its previous analogue. The existing code is patched to use the _compat version of it for now. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:01 -05:00
Pavel Emelyanov	d366477a52	sock_diag: Initial skeleton When receiving the SOCK_DIAG_BY_FAMILY message we have to find the handler for provided family and pass the nl message to it. This patch describes an infrastructure to work with such nandlers and implements stubs for AF_INET(6) ones. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:01 -05:00
Pavel Emelyanov	f13c95f0e2	inet_diag: Switch from _GETSOCK to IPPROTO_ numbers Sorry, but the vger didn't let this message go to the list. Re-sending it with less spam-filter-prone subject. When dumping the AF_INET/AF_INET6 sockets user will also specify the protocol, so prepare the protocol diag handlers to work with IPPROTO_ constants. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:01 -05:00
Pavel Emelyanov	37f352b5e3	inet_diag: Move byte-code finding up the call-stack Current code calculates it at fixed offset. This offset will change, so move the BC calculation upper to make the further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:01 -05:00
Pavel Emelyanov	8d34172dfd	sock_diag: Introduce new message type This type will run the family+protocol based socket dumping. Also prepare the stub function for it. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:58:01 -05:00
Pavel Emelyanov	7f1fb60c4f	inet_diag: Partly rename inet_ to sock_ The ultimate goal is to get the sock_diag module, that works in family+protocol terms. Currently this is suitable to do on the inet_diag basis, so rename parts of the code. It will be moved to sock_diag.c later. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 13:57:36 -05:00
David S. Miller	6dec4ac4ee	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/ipv4/inet_diag.c	2011-11-26 14:47:03 -05:00
Alexey Dobriyan	4e3fd7a06d	net: remove ipv6_addr_copy() C assignment can handle struct in6_addr copying. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-22 16:43:32 -05:00
Maciej Żenczykowski	717b6d8366	net-netlink: fix diag to export IPv4 tos for dual-stack IPv6 sockets Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-22 16:03:10 -05:00
Maciej Żenczykowski	06236ac372	net-netlink: Add a new attribute to expose TCLASS values via netlink commit `3ceca74966` added a TOS attribute. Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket, furthermore they can have different values. As such one cannot in a sane way expose both through a single attribute. Signed-off-by: Maciej Żenczyowski <maze@google.com> CC: Murali Raja <muralira@google.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 01:08:49 -05:00
Murali Raja	3ceca74966	net-netlink: Add a new attribute to expose TOS values via netlink This patch exposes the tos value for the TCP sockets when the TOS flag is requested in the ext_flags for the inet_diag request. This would mainly be used to expose TOS values for both for TCP and UDP sockets. Currently it is supported for TCP. When netlink support for UDP would be added the support to expose the TOS values would alse be done. For IPV4 tos value is exposed and for IPV6 tclass value is exposed. Signed-off-by: Murali Raja <muralira@google.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-12 19:09:18 -04:00
David S. Miller	9f6ec8d697	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rxon.c drivers/net/wireless/rtlwifi/pci.c net/netfilter/ipvs/ip_vs_core.c	2011-06-20 22:29:08 -07:00
Eric Dumazet	eeb1497277	inet_diag: fix inet_diag_bc_audit() A malicious user or buggy application can inject code and trigger an infinite loop in inet_diag_bc_audit() Also make sure each instruction is aligned on 4 bytes boundary, to avoid unaligned accesses. Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-17 16:25:39 -04:00
Greg Rose	c7ac8679be	rtnetlink: Compute and store minimum ifinfo dump size The message size allocated for rtnl ifinfo dumps was limited to a single page. This is not enough for additional interface info available with devices that support SR-IOV and caused a bug in which VF info would not be displayed if more than approximately 40 VFs were created per interface. Implement a new function pointer for the rtnl_register service that will calculate the amount of data required for the ifinfo dump and allocate enough data to satisfy the request. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2011-06-09 20:38:07 -07:00
Eric Dumazet	b71d1d426d	inet: constify ip headers and in6_addr Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-22 11:04:14 -07:00
David S. Miller	b8f3ab4290	Revert "netlink: test for all flags of the NLM_F_DUMP composite" This reverts commit `0ab03c2b14`. It breaks several things including the avahi daemon. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-19 13:34:20 -08:00
Jan Engelhardt	0ab03c2b14	netlink: test for all flags of the NLM_F_DUMP composite Due to NLM_F_DUMP is composed of two bits, NLM_F_ROOT \| NLM_F_MATCH, when doing "if (x & NLM_F_DUMP)", it tests for _either_ of the bits being set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL, non-dump requests with NLM_F_EXCL set are mistaken as dump requests. Substitute the condition to test for _all_ bits being set. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-09 16:25:03 -08:00
Nelson Elhage	22e76c849d	inet_diag: Make sure we actually run the same bytecode we audited. We were using nlmsg_find_attr() to look up the bytecode by attribute when auditing, but then just using the first attribute when actually running bytecode. So, if we received a message with two attribute elements, where only the second had type INET_DIAG_REQ_BYTECODE, we would validate and run different bytecode strings. Fix this by consistently using nlmsg_find_attr everywhere. Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: Thomas Graf <tgraf@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-04 12:26:34 -07:00
Eric Dumazet	a02cec2155	net: return operator cleanup Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-23 14:33:39 -07:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Roel Kluin	b4ced2b768	netlink: With opcode INET_DIAG_BC_S_LE dport was compared in inet_diag_bc_run() The s-port should be compared. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-01-19 14:12:20 -08:00
Eric Dumazet	c720c7e838	inet: rename some inet_sock fields In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. Goal is to transfert fields used at lookup time in the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by rx path) This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-18 18:52:53 -07:00
Eric Dumazet	f373b53b5f	tcp: replace ehash_size by ehash_mask Storing the mask (size - 1) instead of the size allows fast path to be a bit faster. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-13 03:44:02 -07:00
Eric Dumazet	31e6d363ab	net: correct off-by-one write allocations reports commit `2b85a34e91` (net: No more expensive sock_hold()/sock_put() on each tx) changed initial sk_wmem_alloc value. We need to take into account this offset when reporting sk_wmem_alloc to user, in PROC_FS files or various ioctls (SIOCOUTQ/TIOCOUTQ) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-06-18 00:29:12 -07:00
Arnaldo Carvalho de Melo	ac5978e7f8	inet_diag: Remove dup assignments These are later assigned to other values without being used meanwhile. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-04-28 08:03:26 -07:00
Eric Dumazet	c25eb3bfb9	net: Convert TCP/DCCP listening hash tables to use RCU This is the last step to be able to perform full RCU lookups in __inet_lookup() : After established/timewait tables, we add RCU lookups to listening hash table. The only trick here is that a socket of a given type (TCP ipv4, TCP ipv6, ...) can now flight between two different tables (established and listening) during a RCU grace period, so we must use different 'nulls' end-of-chain values for two tables. We define a large value : #define LISTENING_NULLS_BASE (1U << 29) So that slots in listening table are guaranteed to have different end-of-chain values than slots in established table. A reader can still detect it finished its lookup in the right chain. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-23 17:22:55 -08:00
David S. Miller	7e3aab4a9c	inet_diag: Missed conversion after changing inet ehash lockl to spinlocks. They are no longer a rwlocks. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-21 16:39:19 -08:00
Eric Dumazet	5caea4ea70	net: listening_hash get a spinlock per bucket This patch prepares RCU migration of listening_hash table for TCP/DCCP protocols. listening_hash table being small (32 slots per protocol), we add a spinlock for each slot, instead of a single rwlock for whole table. This should reduce hold time of readers, and writers concurrency. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-20 00:40:07 -08:00
Eric Dumazet	3ab5aee7fe	net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls RCU was added to UDP lookups, using a fast infrastructure : - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the price of call_rcu() at freeing time. - hlist_nulls permits to use few memory barriers. This patch uses same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks will even speedup this case. __inet_lookup_established() is pretty fast now we dont have to dirty a contended cache line (read_lock/read_unlock) Only established and timewait hashtable are converted to RCU (bind table and listen table are still using traditional locking) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-16 19:40:17 -08:00
Johannes Berg	95a5afca4a	net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely) Some code here depends on CONFIG_KMOD to not try to load protocol modules or similar, replace by CONFIG_MODULES where more than just request_module depends on CONFIG_KMOD and and also use try_then_request_module in ebtables. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-16 15:24:51 -07:00
Andi Kleen	6be547a61d	inet_diag: Add empty bucket optimization to inet_diag too Skip quickly over empty buckets in inet_diag. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-28 01:09:54 -07:00
Adrian Bunk	0b04082995	net: remove CVS keywords This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-11 21:00:38 -07:00
Pavel Emelyanov	d86e0dac2c	[NETNS]: Tcp-v6 sockets per-net lookup. Add a net argument to inet6_lookup and propagate it further. Actually, this is tcp-v6 implementation of what was done for tcp-v4 sockets in a previous patch. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:28:20 -08:00
Pavel Emelyanov	c67499c0e7	[NETNS]: Tcp-v4 sockets per-net lookup. Add a net argument to inet_lookup and propagate it further into lookup calls. Plus tune the __inet_check_established. The dccp and inet_diag, which use that lookup functions pass the init_net into them. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:28:19 -08:00
Arnaldo Carvalho de Melo	8cf8e5a67f	[INET_DIAG]: Fix inet_diag_lock_handler error path. Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=9825 The inet_diag_lock_handler function uses ERR_PTR to encode errors but its callers were testing against NULL. This only happens when the only inet_diag modular user, DCCP, is not built into the kernel or available as a module. Also there was a problem with not dropping the mutex lock when a handler was not found, also fixed in this patch. This caused an OOPS and ss would then hang on subsequent calls, as &inet_diag_table_mutex was being left locked. Thanks to spike at ml.yaroslavl.ru for report it after trying 'ss -d' on a kernel that doesn't have DCCP available. This bug was introduced in cset `d523a328fb` ("Fix inet_diag dead-lock regression"), after 2.6.24-rc3, so just 2.6.24 seems to be affected. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:27:08 -08:00
Denis V. Lunev	b7c6ba6eb1	[NETNS]: Consolidate kernel netlink socket destruction. Create a specific helper for netlink kernel socket disposal. This just let the code look better and provides a ground for proper disposal inside a namespace. Signed-off-by: Denis V. Lunev <den@openvz.org> Tested-by: Alexey Dobriyan <adobriyan@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:08:07 -08:00
Herbert Xu	d523a328fb	[INET]: Fix inet_diag dead-lock regression The inet_diag register fix broke inet_diag module loading because the loaded module had to take the same mutex that's already held by the loader in order to register the new handler. This patch fixes it by introducing a separate mutex to protect the handling of handlers. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-12-03 15:51:25 +11:00
Pavel Emelyanov	076931989f	[INET]: Fix inet_diag register vs rcv race The following race is possible when one cpu unregisters the handler while other one is trying to receive a message and call this one: CPU1: CPU2: inet_diag_rcv() inet_diag_unregister() mutex_lock(&inet_diag_mutex); netlink_rcv_skb(skb, &inet_diag_rcv_msg); if (inet_diag_table[nlh->nlmsg_type] == NULL) /* false handler is still registered / ... netlink_dump_start(idiagnl, skb, nlh, inet_diag_dump, NULL); cb = kzalloc(sizeof(cb), GFP_KERNEL); /* sleep here freeing memory * or preempt * or sleep later on nlk->cb_mutex / spin_lock(&inet_diag_register_lock); inet_diag_table[type] = NULL; ... spin_unlock(&inet_diag_register_lock); synchronize_rcu(); / CPU1 is sleeping - RCU quiescent * state is passed / return; / inet_diag_dump is finally called: / inet_diag_dump() handler = inet_diag_table[cb->nlh->nlmsg_type]; BUG_ON(handler == NULL); / OOPS! While we slept the unregister has set * handler to NULL :( */ Grep showed, that the register/unregister functions are called from init/fini module callbacks for tcp_/dccp_diag, so it's OK to use the inet_diag_mutex to synchronize manipulations with the inet_diag_table and the access to it. Besides, as Herbert pointed out, asynchronous dumps should hold this mutex as well, and thus, we provide the mutex as cb_mutex one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-30 00:08:14 +11:00
Eric Dumazet	230140cffa	[INET]: Remove per bucket rwlock in tcp/dccp ehash table. As done two years ago on IP route cache table (commit `22c047ccbc`) , we can avoid using one lock per hash bucket for the huge TCP/DCCP hash tables. On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for litle performance differences. (we hit a different cache line for the rwlock, but then the bucket cache line have a better sharing factor among cpus, since we dirty it less often). For netstat or ss commands that want a full scan of hash table, we perform fewer memory accesses. Using a 'small' table of hashed rwlocks should be more than enough to provide correct SMP concurrency between different buckets, without using too much memory. Sizing of this table depends on num_possible_cpus() and various CONFIG settings. This patch provides some locking abstraction that may ease a future work using a different model for TCP/DCCP table. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:15:11 -08:00
Jean Delvare	305e1e9691	[INET]: Let inet_diag and friends autoload By adding module aliases to inet_diag, tcp_diag and dccp_diag, we let them load automatically as needed. This makes tools like "ss" run faster. Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-22 02:59:54 -07:00
Denis V. Lunev	cd40b7d398	[NET]: make netlink user -> kernel interface synchronious This patch make processing netlink user -> kernel messages synchronious. This change was inspired by the talk with Alexey Kuznetsov about current netlink messages processing. He says that he was badly wrong when introduced asynchronious user -> kernel communication. The call netlink_unicast is the only path to send message to the kernel netlink socket. But, unfortunately, it is also used to send data to the user. Before this change the user message has been attached to the socket queue and sk->sk_data_ready was called. The process has been blocked until all pending messages were processed. The bad thing is that this processing may occur in the arbitrary process context. This patch changes nlk->data_ready callback to get 1 skb and force packet processing right in the netlink_unicast. Kernel -> user path in netlink_unicast remains untouched. EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock drop, but the process remains in the cycle until the message will be fully processed. So, there is no need to use this kludges now. Signed-off-by: Denis V. Lunev <den@openvz.org> Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 21:15:29 -07:00
Herbert Xu	0cfad07555	[NETLINK]: Avoid pointer in netlink_run_queue I was looking at Patrick's fix to inet_diag and it occured to me that we're using a pointer argument to return values unnecessarily in netlink_run_queue. Changing it to return the value will allow the compiler to generate better code since the value won't have to be memory-backed. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:51:24 -07:00
Eric W. Biederman	b4b510290b	[NET]: Support multiple network namespaces with netlink Each netlink socket will live in exactly one network namespace, this includes the controlling kernel sockets. This patch updates all of the existing netlink protocols to only support the initial network namespace. Request by clients in other namespaces will get -ECONREFUSED. As they would if the kernel did not have the support for that netlink protocol compiled in. As each netlink protocol is updated to be multiple network namespace safe it can register multiple kernel sockets to acquire a presence in the rest of the network namespaces. The implementation in af_netlink is a simple filter implementation at hash table insertion and hash table look up time. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:49:09 -07:00
Ilpo Järvinen	172589ccdd	[NET]: DIV_ROUND_UP cleanup (part two) Hopefully captured all single statement cases under net/. I'm not too sure if there is some policy about #includes that are "guaranteed" (ie., in the current tree) to be available through some other #included header, so I just added linux/kernel.h to each changed file that didn't #include it previously. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:48:37 -07:00
Patrick McHardy	0a9c730144	[INET_DIAG]: Fix oops in netlink_rcv_skb netlink_run_queue() doesn't handle multiple processes processing the queue concurrently. Serialize queue processing in inet_diag to fix a oops in netlink_rcv_skb caused by netlink_run_queue passing a NULL for the skb. BUG: unable to handle kernel NULL pointer dereference at virtual address 00000054 [349587.500454] printing eip: [349587.500457] c03318ae [349587.500459] *pde = 00000000 [349587.500464] Oops: 0000 [#1] [349587.500466] PREEMPT SMP [349587.500474] Modules linked in: w83627hf hwmon_vid i2c_isa [349587.500483] CPU: 0 [349587.500485] EIP: 0060:[<c03318ae>] Not tainted VLI [349587.500487] EFLAGS: 00010246 (2.6.22.3 #1) [349587.500499] EIP is at netlink_rcv_skb+0xa/0x7e [349587.500506] eax: 00000000 ebx: 00000000 ecx: c148d2a0 edx: c0398819 [349587.500510] esi: 00000000 edi: c0398819 ebp: c7a21c8c esp: c7a21c80 [349587.500517] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 [349587.500521] Process oidentd (pid: 17943, ti=c7a20000 task=cee231c0 task.ti=c7a20000) [349587.500527] Stack: 00000000 c7a21cac f7c8ba78 c7a21ca4 c0331962 c0398819 f7c8ba00 0000004c [349587.500542] f736f000 c7a21cb4 c03988e3 00000001 f7c8ba00 c7a21cc4 c03312a5 0000004c [349587.500558] f7c8ba00 c7a21cd4 c0330681 f7c8ba00 e4695280 c7a21d00 c03307c6 7fffffff [349587.500578] Call Trace: [349587.500581] [<c010361a>] show_trace_log_lvl+0x1c/0x33 [349587.500591] [<c01036d4>] show_stack_log_lvl+0x8d/0xaa [349587.500595] [<c010390e>] show_registers+0x1cb/0x321 [349587.500604] [<c0103bff>] die+0x112/0x1e1 [349587.500607] [<c01132d2>] do_page_fault+0x229/0x565 [349587.500618] [<c03c8d3a>] error_code+0x72/0x78 [349587.500625] [<c0331962>] netlink_run_queue+0x40/0x76 [349587.500632] [<c03988e3>] inet_diag_rcv+0x1f/0x2c [349587.500639] [<c03312a5>] netlink_data_ready+0x57/0x59 [349587.500643] [<c0330681>] netlink_sendskb+0x24/0x45 [349587.500651] [<c03307c6>] netlink_unicast+0x100/0x116 [349587.500656] [<c0330f83>] netlink_sendmsg+0x1c2/0x280 [349587.500664] [<c02fcce9>] sock_sendmsg+0xba/0xd5 [349587.500671] [<c02fe4d1>] sys_sendmsg+0x17b/0x1e8 [349587.500676] [<c02fe92d>] sys_socketcall+0x230/0x24d [349587.500684] [<c01028d2>] syscall_call+0x7/0xb [349587.500691] ======================= [349587.500693] Code: f0 ff 4e 18 0f 94 c0 84 c0 0f 84 66 ff ff ff 89 f0 e8 86 e2 fc ff e9 5a ff ff ff f0 ff 40 10 eb be 55 89 e5 57 89 d7 56 89 c6 53 <8b> 50 54 83 fa 10 72 55 8b 9e 9c 00 00 00 31 c9 8b 03 83 f8 0f Reported by Athanasius <link@miggy.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-09-11 11:33:28 +02:00
Patrick McHardy	af65bdfce9	[NETLINK]: Switch cb_lock spinlock to mutex and allow to override it Switch cb_lock to mutex and allow netlink kernel users to override it with a subsystem specific mutex for consistent locking in dump callbacks. All netlink_dump_start users have been audited not to rely on any side-effects of the previously used spinlock. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:29:03 -07:00
Thomas Graf	c702e8047f	[NETLINK]: Directly return -EINTR from netlink_dump_start() Now that all users of netlink_dump_start() use netlink_run_queue() to process the receive queue, it is possible to return -EINTR from netlink_dump_start() directly, therefore simplying the callers. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:27:33 -07:00
Thomas Graf	ead592ba24	[IPv4] diag: Use netlink_run_queue() to process the receive queue Makes use of netlink_run_queue() to process the receive queue and converts inet_diag_rcv_msg() to use the type safe netlink interface. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:27:32 -07:00
Arnaldo Carvalho de Melo	dc5fc579b9	[NETLINK]: Use nlmsg_trim() where appropriate Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:26:37 -07:00
Arnaldo Carvalho de Melo	b529ccf279	[NETLINK]: Introduce nlmsg_hdr() helper For the common "(struct nlmsghdr *)skb->data" sequence, so that we reduce the number of direct accesses to skb->data and for consistency with all the other cast skb member helpers. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:26:34 -07:00
Arnaldo Carvalho de Melo	27a884dc3c	[SK_BUFF]: Convert skb->tail to sk_buff_data_t So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes on 64bit architectures, allowing us to combine the 4 bytes hole left by the layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4 64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN... :-) Many calculations that previously required that skb->{transport,network, mac}_header be first converted to a pointer now can be done directly, being meaningful as offsets or pointers. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:26:28 -07:00
YOSHIFUJI Hideaki	e905a9edab	[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-02-10 23:19:39 -08:00
Eric Dumazet	dbca9b2750	[NET]: change layout of ehash table ehash table layout is currently this one : First half of this table is used by sockets not in TIME_WAIT state Second half of it is used by sockets in TIME_WAIT state. This is non optimal because of for a given hash or socket, the two chain heads are located in separate cache lines. Moreover the locks of the second half are never used. If instead of this halving, we use two list heads in inet_ehash_bucket instead of only one, we probably can avoid one cache miss, and reduce ram usage, particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep together related chains to speedup lookups and socket state change. In this patch I did not try to align struct inet_ehash_bucket, but a future patch could try to make this structure have a convenient size (a power of two or a multiple of L1_CACHE_SIZE). I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so maybe we dont need to scratch our heads to align the bucket... Note : In case struct inet_ehash_bucket is not a power of two, we could probably change alloc_large_system_hash() (in case it use __get_free_pages()) to free the unused space. It currently allocates a big zone, but the last quarter of it could be freed. Again, this should be a temporary 'problem'. Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-02-08 14:16:46 -08:00
Patrick McHardy	26932566a4	[NETLINK]: Don't BUG on undersized allocations Currently netlink users BUG when the allocated skb for an event notification is undersized. While this is certainly a kernel bug, its not critical and crashing the kernel is too drastic, especially when considering that these errors have appeared multiple times in the past and it BUGs even if no listeners are present. This patch replaces BUG by WARN_ON and changes the notification functions to inform potential listeners of undersized allocations using a unique error code (EMSGSIZE). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-02-08 12:38:41 -08:00

1 2 3 4

163 Commits