linux

Commit Graph

Author	SHA1	Message	Date
David S. Miller	6b6cbc1471	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts were simply overlapping changes. In the net/ipv4/route.c case the code had simply moved around a little bit and the same fix was made in both 'net' and 'net-next'. In the net/sched/sch_generic.c case a fix in 'net' happened at the same time that a new argument was added to qdisc_hash_add(). Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-15 21:16:30 -04:00
Johannes Berg	fceb6435e8	netlink: pass extended ACK struct to parsing functions Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:58:22 -04:00
Johannes Berg	2d4bc93368	netlink: extended ACK reporting Add the base infrastructure and UAPI for netlink extended ACK reporting. All "manual" calls to netlink_ack() pass NULL for now and thus don't get extended ACK reporting. Big thanks goes to Pablo Neira Ayuso for not only bringing up the whole topic at netconf (again) but also coming up with the nlattr passing trick and various other ideas. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:58:20 -04:00
David Ahern	27b3b551d8	rtnetlink: Do not generate notifications for NETDEV_CHANGE_TX_QUEUE_LEN event Changing tx queue length generates identical messages: [LINK]22: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether 02:04:f4:b7:5c:d2 brd ff:ff:ff:ff:ff:ff promiscuity 0 dummy numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 [LINK]22: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether 02:04:f4:b7:5c:d2 brd ff:ff:ff:ff:ff:ff promiscuity 0 dummy numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 Remove NETDEV_CHANGE_TX_QUEUE_LEN from the list of notifiers that generate notifications. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:34 -04:00
David Ahern	b6b36eb23a	rtnetlink: Do not generate notifications for NETDEV_CHANGEUPPER event NETDEV_CHANGEUPPER is an internal event; do not generate userspace notifications. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:34 -04:00
David Ahern	aed0735909	rtnetlink: Do not generate notifications for CHANGELOWERSTATE event CHANGELOWERSTATE is an internal event; do not generate userspace notifications. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:33 -04:00
David Ahern	bf2c2984d3	rtnetlink: Do not generate notifications for PRECHANGEUPPER event PRECHANGEUPPER is an internal event; do not generate userspace notifications. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:33 -04:00
David Ahern	aef091ae58	rtnetlink: Do not generate notifications for POST_TYPE_CHANGE event Changing the master device for a link generates many messages; the one generated for POST_TYPE_CHANGE is redundant: [LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master br1 state UNKNOWN group default link/ether 02:02:02:02:02:03 brd ff:ff:ff:ff:ff:ff [LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master br1 state UNKNOWN group default link/ether 02:02:02:02:02:03 brd ff:ff:ff:ff:ff:ff Remove POST_TYPE_CHANGE from the list of notifiers that generate notifications. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:33 -04:00
David Ahern	cd8966e75e	rtnetlink: Do not generate notifications for CHANGEADDR event Changing hardware address generates redundant messages: [LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether 02:02:02:02:02:02 brd ff:ff:ff:ff:ff:ff [LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether 02:02:02:02:02:02 brd ff:ff:ff:ff:ff:ff Do not send a notification for the CHANGEADDR notifier. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:33 -04:00
David Ahern	46ede612c7	rtnetlink: Do not generate notification for UDP_TUNNEL_PUSH_INFO NETDEV_UDP_TUNNEL_PUSH_INFO is an internal notifier; nothing userspace can do so don't generate a netlink notification. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:33 -04:00
David Ahern	085e1a65f0	rtnetlink: Do not generate notifications for MTU events Changing MTU on a link currently causes 3 messages to be sent to userspace: [LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1490 qdisc noqueue state UNKNOWN group default link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff [LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff [LINK]11: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether f2:52:5c:6d:21:f3 brd ff:ff:ff:ff:ff:ff Remove the messages sent for PRE_CHANGE_MTU and CHANGE_MTU netdev events. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-13 13:16:32 -04:00
Ilan Tayari	eaffadbbb3	gso: Support frag_list splitting with head_frag A driver may use build_skb() for received packets. These SKBs then have a head_frag. Since commit `d7e8883cfc` ("net: make GRO aware of skb->head_frag"), GRO may build frag_list SKBs out of head_frag received SKBs. In such a case, the chained SKBs end up with a head_frag. Commit `07b26c9454` ("gso: Support partial splitting at the frag_list pointer") adds partial segmentation of frag_list SKB chains into individual SKBs. However, this is not done if the chained SKBs have any linear part, because the device may not be able to DMA the private linear buffer. A chained frag_list SKB with head_frag is wrongfully detected in this case as having a private linear part and thus falls back to software GSO, while in fact the linear part is backed by a DMA page just like any other frag. This causes low performance when forwarding those packets that were built with build_skb() Allow partial segmentation at the frag_list pointer for chained SKBs with head_frag. Note that such SKBs can only be created by GRO, when applied to received packets with head_frag. Also note that this change only affects the data path that performs the partial segmentation at frag_list pointer, and not any of the other more common data paths. Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-12 13:53:35 -04:00
Johannes Berg	df7dd8fc96	net: xdp: don't export dev_change_xdp_fd() Since dev_change_xdp_fd() is only used in rtnetlink, which must be built-in, there's no reason to export dev_change_xdp_fd(). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-12 10:29:40 -04:00
Willem de Bruijn	8f917bba00	bpf: pass sk to helper functions BPF helper functions access socket fields through skb->sk. This is not set in ingress cgroup and socket filters. The association is only made in skb_set_owner_r once the filter has accepted the packet. Sk is available as socket lookup has taken place. Temporarily set skb->sk to sk in these cases. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-11 14:54:19 -04:00
Wei Yongjun	cb6bf9cfdb	devlink: fix return value check in devlink_dpipe_header_put() Fix the return value check which testing the wrong variable in devlink_dpipe_header_put(). Fixes: `1555d204e7` ("devlink: Support for pipeline debug (dpipe)") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-11 14:53:32 -04:00
Johannes Berg	be9370a7d8	bpf: remove struct bpf_prog_type_list There's no need to have struct bpf_prog_type_list since it just contains a list_head, the type, and the ops pointer. Since the types are densely packed and not actually dynamically registered, it's much easier and smaller to have an array of type->ops pointer. Also initialize this array statically to remove code needed to initialize it. In order to save duplicating the list, move it to a new header file and include it in the places needing it. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-11 14:38:43 -04:00
David S. Miller	bf74b20d00	Revert "rtnl: Add support for netdev event to link messages" This reverts commit `def12888c1`. As per discussion between Roopa Prabhu and David Ahern, it is advisable that we instead have the code collect the setlink triggered events into a bitmask emitted in the IFLA_EVENT netlink attribute. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-09 14:45:21 -07:00
Chenbo Feng	5daab9db7b	New getsockopt option to get socket cookie Introduce a new getsockopt operation to retrieve the socket cookie for a specific socket based on the socket fd. It returns a unique non-decreasing cookie for each socket. Tested: https://android-review.googlesource.com/#/c/358163/ Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-08 08:07:01 -07:00
David S. Miller	0e4c0ee580	Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2017-04-06 11:57:04 -07:00
David S. Miller	6f14f443d3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Mostly simple cases of overlapping changes (adding code nearby, a function whose name changes, for example). Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-06 08:24:51 -07:00
Vlad Yasevich	def12888c1	rtnl: Add support for netdev event to link messages When netdev events happen, a rtnetlink_event() handler will send messages for every event in it's white list. These messages contain current information about a particular device, but they do not include the iformation about which event just happened. The consumer of the message has to try to infer this information. In some cases (ex: NETDEV_NOTIFY_PEERS), that is not possible. This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT that would have an encoding of the which event triggered this message. This would allow the the message consumer to easily determine if it is interested in a particular event or not. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-05 08:14:14 -07:00
Vlad Yasevich	5138e86f17	rtnetlink: Convert rtnetlink_event to white list The rtnetlink_event currently functions as a blacklist where we block cerntain netdev events from being sent to user space. As a result, events have been added to the system that userspace probably doesn't care about. This patch converts the implementation to the white list so that newly events would have to be specifically added to the list to be sent to userspace. This would force new event implementers to consider whether a given event is usefull to user space or if it's just a kernel event. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-05 08:14:14 -07:00
Alexey Dobriyan	822f9bb104	soreuseport: use "unsigned int" in __reuseport_alloc() Number of sockets is limited by 16-bit, so 64-bit allocation will never happen. 16-bit ops are the worst code density-wise on x86_64 because of additional prefix (66). Space savings: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3) function old new delta reuseport_add_sock 539 536 -3 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-03 19:06:38 -07:00
Alexey Dobriyan	ec2e45a978	flowcache: more "unsigned int" Make ->hash_count, ->low_watermark and ->high_watermark unsigned int and propagate unsignedness to other variables. This change doesn't change code generation because these fields aren't used in 64-bit contexts but make it anyway: these fields can't be negative numbers. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-03 19:04:48 -07:00
Alexey Dobriyan	f31cc7e815	flowcache: make flow_cache_hash_size() return "unsigned int" Hash size can't negative so "unsigned int" is logically correct. Propagate "unsigned int" to loop counters. Space savings: add/remove: 0/0 grow/shrink: 2/2 up/down: 6/-18 (-12) function old new delta flow_cache_flush_tasklet 362 365 +3 __flow_cache_shrink 333 336 +3 flow_cache_cpu_up_prep 178 171 -7 flow_cache_lookup 1159 1148 -11 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-03 19:04:48 -07:00
Alexey Dobriyan	5a17d9ed9a	flowcache: make flow_key_size() return "unsigned int" Flow keys aren't 4GB+ numbers so 64-bit arithmetic is excessive. Space savings (I'm not sure what CSWTCH is): add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-48 (-48) function old new delta flow_cache_lookup 1163 1159 -4 CSWTCH 75997 75953 -44 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-03 19:04:48 -07:00
Simon Horman	ac6a3722fe	flow dissector: correct size of storage for ARP The last argument to __skb_header_pointer() should be a buffer large enough to store struct arphdr. This can be a pointer to a struct arphdr structure. The code was previously using a pointer to a pointer to struct arphdr. By my counting the storage available both before and after is 8 bytes on x86_64. Fixes: `55733350e5` ("flow disector: ARP support") Reported-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-03 14:46:45 -07:00
Al Viro	3278682123	make skb_copy_datagram_msg() et.al. preserve ->msg_iter on error Fixes the mess observed in e.g. rsync over a noisy link we'd been seeing since last Summer. What happens is that we copy part of a datagram before noticing a checksum mismatch. Datagram will be resent, all right, but we want the next try go into the same place, not after it... All this family of primitives (copy/checksum and copy a datagram into destination) is "all or nothing" sort of interface - either we get 0 (meaning that copy had been successful) or we get an error (and no way to tell how much had been copied before we ran into whatever error it had been). Make all of them leave iterator unadvanced in case of errors - all callers must be able to cope with that (an error might've been caught before the iterator had been advanced), it costs very little to arrange, it's safer for callers and actually fixes at least one bug in said callers. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-04-02 12:10:57 -04:00
Alexei Starovoitov	1cf1cae963	bpf: introduce BPF_PROG_TEST_RUN command development and testing of networking bpf programs is quite cumbersome. Despite availability of user space bpf interpreters the kernel is the ultimate authority and execution environment. Current test frameworks for TC include creation of netns, veth, qdiscs and use of various packet generators just to test functionality of a bpf program. XDP testing is even more complicated, since qemu needs to be started with gro/gso disabled and precise queue configuration, transferring of xdp program from host into guest, attaching to virtio/eth0 and generating traffic from the host while capturing the results from the guest. Moreover analyzing performance bottlenecks in XDP program is impossible in virtio environment, since cost of running the program is tiny comparing to the overhead of virtio packet processing, so performance testing can only be done on physical nic with another server generating traffic. Furthermore ongoing changes to user space control plane of production applications cannot be run on the test servers leaving bpf programs stubbed out for testing. Last but not least, the upstream llvm changes are validated by the bpf backend testsuite which has no ability to test the code generated. To improve this situation introduce BPF_PROG_TEST_RUN command to test and performance benchmark bpf programs. Joint work with Daniel Borkmann. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-04-01 12:45:57 -07:00
Paolo Abeni	6c7c98bad4	sock: avoid dirtying sk_stamp, if possible sock_recv_ts_and_drops() unconditionally set sk->sk_stamp for every packet, even if the SOCK_TIMESTAMP flag is not set in the related socket. If selinux is enabled, this cause a cache miss for every packet since sk->sk_stamp and sk->sk_security share the same cacheline. With this change sk_stamp is set only if the SOCK_TIMESTAMP flag is set, and is cleared for the first packet, so that the user perceived behavior is unchanged. This gives up to 5% speed-up under udp-flood with small packets. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-30 20:05:24 -07:00
Andrew Lunn	c6e970a04b	net: break include loop netdevice.h, dsa.h, devlink.h There is an include loop between netdevice.h, dsa.h, devlink.h because of NETDEV_ALIGN, making it impossible to use devlink structures in dsa.h. Break this loop by taking dsa.h out of netdevice.h, add a forward declaration of dsa_switch_tree and netdev_set_default_ethtool_ops() function, which is what netdevice.h requires. No longer having dsa.h in netdevice.h means the includes in dsa.h no longer get included. This breaks a few other files which depend on these includes. Add these directly in the affected file. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-28 22:46:04 -07:00
Arkadi Sharshevsky	1555d204e7	devlink: Support for pipeline debug (dpipe) The pipeline debug is used to export the pipeline abstractions for the main objects - tables, headers and entries. The only support for set is for changing the counter parameter on specific table. The basic structures: Header - can represent a real protocol header information or internal metadata. Generic protocol headers like IPv4 can be shared between drivers. Each driver can add local headers. Field - part of a header. Can represent protocol field or specific ASIC metadata field. Hardware special metadata fields can be mapped to different resources, for example switch ASIC ports can have internal number which from the systems point of view is mapped to netdeivce ifindex. Match - represent specific match rule. Can describe match on specific field or header. The header index should be specified as well in order to support several header instances of the same type (tunneling). Action - represents specific action rule. Actions can describe operations on specific field values for example like set, increment, etc. And header operation like add and delete. Value - represents value which can be associated with specific match or action. Table - represents a hardware block which can be described with match/ action behavior. The match/action can be done on the packets data or on the internal metadata that it gathered along the packets traversal throw the pipeline which is vendor specific and should be exported in order to provide understanding of ASICs behavior. Entry - represents single record in a specific table. The entry is identified by specific combination of values for match/action. Prior to accessing the tables/entries the drivers provide the header/ field data base which is used by driver to user-space. The data base is split between the shared headers and unique headers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-28 17:11:54 -07:00
Sridhar Samudrala	6d4339028b	net: Introduce SO_INCOMING_NAPI_ID This socket option returns the NAPI ID associated with the queue on which the last frame is received. This information can be used by the apps to split the incoming flows among the threads based on the Rx queue on which they are received. If the NAPI ID actually represents a sender_cpu then the value is ignored and 0 is returned. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 20:49:31 -07:00
Sridhar Samudrala	7db6b048da	net: Commonize busy polling code to focus on napi_id instead of socket Move the core functionality in sk_busy_loop() to napi_busy_loop() and make it independent of sk. This enables re-using this function in epoll busy loop implementation. Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 20:49:31 -07:00
Alexander Duyck	37056719bb	net: Track start of busy loop instead of when it should end This patch flips the logic we were using to determine if the busy polling has timed out. The main motivation for this is that we will need to support two different possible timeout values in the future and by recording the start time rather than when we would want to end we can focus on making the end_time specific to the task be it epoll or socket based polling. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 20:49:31 -07:00
Alexander Duyck	2b5cd0dfa3	net: Change return type of sk_busy_loop from bool to void checking the return value of sk_busy_loop. As there are only a few consumers of that data, and the data being checked for can be replaced with a check for !skb_queue_empty() we might as well just pull the code out of sk_busy_loop and place it in the spots that actually need it. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 20:49:30 -07:00
Alexander Duyck	545cd5e5ec	net: Busy polling should ignore sender CPUs This patch is a cleanup/fix for NAPI IDs following the changes that made it so that sender_cpu and napi_id were doing a better job of sharing the same location in the sk_buff. One issue I found is that we weren't validating the napi_id as being valid before we started trying to setup the busy polling. This change corrects that by using the MIN_NAPI_ID value that is now used in both allocating the NAPI IDs, as well as validating them. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 20:49:30 -07:00
Florian Westphal	28ee1b746f	secure_seq: downgrade to per-host timestamp offsets Unfortunately too many devices (not under our control) use tcp_tw_recycle=1, which depends on timestamps being identical of the same saddr. Although tcp_tw_recycle got removed in net-next we can't make such end hosts disappear so downgrade to per-host timestamp offsets. Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Reported-by: Yvan Vanrossomme <yvan@vanrossomme.net> Fixes: `95a22caee3` ("tcp: randomize tcp timestamp offsets for each connection") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 19:27:44 -07:00
Alexander Duyck	95f2552113	net: Do not allow negative values for busy_read and busy_poll sysctl interfaces This change basically codifies what I think was already the limitations on the busy_poll and busy_read sysctl interfaces. We weren't checking the lower bounds and as such could input negative values. The behavior when that was used was dependent on the architecture. In order to prevent any issues with that I am just disabling support for values less than 0 since this way we don't have to worry about any odd behaviors. By limiting the sysctl values this way it also makes it consistent with how we handle the SO_BUSY_POLL socket option since the value appears to be reported as a signed integer value and negative values are rejected. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 15:02:13 -07:00
Alexey Dobriyan	e013fb7c4c	net: make in_aton() 32-bit internally Converting IPv4 address doesn't need 64-bit arithmetic. Space savings: 10 bytes! add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10) function old new delta in_aton 96 86 -10 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-24 13:27:19 -07:00
Eric Dumazet	48481c8fa1	net: neigh: guard against NULL solicit() method Dmitry posted a nice reproducer of a bug triggering in neigh_probe() when dereferencing a NULL neigh->ops->solicit method. This can happen for arp_direct_ops/ndisc_direct_ops and similar, which can be used for NUD_NOARP neighbours (created when dev->header_ops is NULL). Admin can then force changing nud_state to some other state that would fire neigh timer. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-23 21:28:13 -07:00
Chenbo Feng	6acc5c2910	Add a eBPF helper function to retrieve socket uid Returns the owner uid of the socket inside a sk_buff. This is useful to perform per-UID accounting of network traffic or per-UID packet filtering. The socket need to be a fullsock otherwise overflowuid is returned. Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-23 17:01:02 -07:00
Chenbo Feng	91b8270f2a	Add a helper function to get socket cookie in eBPF Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff with a known socket. Generates a new cookie if one was not yet set.If the socket pointer inside sk_buff is NULL, 0 is returned. The helper function coud be useful in monitoring per socket networking traffic statistics and provide a unique socket identifier per namespace. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-23 17:01:02 -07:00
David S. Miller	16ae1f2236	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/genet/bcmmii.c drivers/net/hyperv/netvsc.c kernel/bpf/hashtab.c Almost entirely overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-23 16:41:27 -07:00
Daniel Borkmann	a97e50cc4c	socket, bpf: fix sk_filter use after free in sk_clone_lock In sk_clone_lock(), we create a new socket and inherit most of the parent's members via sock_copy() which memcpy()'s various sections. Now, in case the parent socket had a BPF socket filter attached, then newsk->sk_filter points to the same instance as the original sk->sk_filter. sk_filter_charge() is then called on the newsk->sk_filter to take a reference and should that fail due to hitting max optmem, we bail out and release the newsk instance. The issue is that commit `278571baca` ("net: filter: simplify socket charging") wrongly combined the dismantle path with the failure path of xfrm_sk_clone_policy(). This means, even when charging failed, we call sk_free_unlock_clone() on the newsk, which then still points to the same sk_filter as the original sk. Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually where it tests for present sk_filter and calls sk_filter_uncharge() on it, which potentially lets sk_omem_alloc wrap around and releases the eBPF prog and sk_filter structure from the (still intact) parent. Fix it by making sure that when sk_filter_charge() failed, we reset newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(), so that we don't mess with the parents sk_filter. Only if xfrm_sk_clone_policy() fails, we did reach the point where either the parent's filter was NULL and as a result newsk's as well or where we previously had a successful sk_filter_charge(), thus for that case, we do need sk_filter_uncharge() to release the prior taken reference on sk_filter. Fixes: `278571baca` ("net: filter: simplify socket charging") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-22 15:37:04 -07:00
David Ahern	a7678c70ef	rtnetlink: Add dump all for netconf Use the rtnl_dump_all to dump all netconf handlers that have been registered. Allows userspace to send a dump request for PF_UNSPEC and get all families. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-22 12:45:17 -07:00
Reshetova, Elena	4c355cdfbb	net: convert sk_filter.refcnt from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-22 12:06:08 -07:00
Josh Hunt	a2d133b1d4	sock: introduce SO_MEMINFO getsockopt Allows reading of SK_MEMINFO_VARS via socket option. This way an application can get all meminfo related information in single socket option call instead of multiple calls. Adds helper function, sk_get_meminfo(), and uses that for both getsockopt and sock_diag_put_meminfo(). Suggested by Eric Dumazet. Signed-off-by: Josh Hunt <johunt@akamai.com> Reviewed-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-22 11:18:58 -07:00
Roopa Prabhu	7b8f7a402d	neighbour: fix nlmsg_pid in notifications neigh notifications today carry pid 0 for nlmsg_pid in all cases. This patch fixes it to carry calling process pid when available. Applications (eg. quagga) rely on nlmsg_pid to ignore notifications generated by their own netlink operations. This patch follows the routing subsystem which already sets this correctly. Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-22 10:48:49 -07:00
Tejun Heo	a05d4fd917	cgroup, net_cls: iterate the fds of only the tasks which are being migrated The net_cls controller controls the classid field of each socket which is associated with the cgroup. Because the classid is per-socket attribute, when a task migrates to another cgroup or the configured classid of the cgroup changes, the controller needs to walk all sockets and update the classid value, which was implemented by `3b13758f51` ("cgroups: Allow dynamically changing net_classid"). While the approach is not scalable, migrating tasks which have a lot of fds attached to them is rare and the cost is born by the ones initiating the operations. However, for simplicity, both the migration and classid config change paths call update_classid() which scans all fds of all tasks in the target css. This is an overkill for the migration path which only needs to cover a much smaller subset of tasks which are actually getting migrated in. On cgroup v1, this can lead to unexpected scalability issues when one tries to migrate a task or process into a net_cls cgroup which already contains a lot of fds. Even if the migration traget doesn't have many to get scanned, update_classid() ends up scanning all fds in the target cgroup which can be extremely numerous. Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is even worse. Before `bfc2cf6f61` ("cgroup: call subsys->attach() only for subsystems which are actually affected by migration"), cgroup core would call the ->css_attach callback even for controllers which don't see actual migration to a different css. As net_cls is always disabled but still mounted on cgroup v2, whenever a process is migrated on the cgroup v2 hierarchy, net_cls sees identity migration from root to root and cgroup core used to call ->css_attach callback for those. The net_cls ->css_attach ends up calling update_classid() on the root net_cls css to which all processes on the system belong to as the controller isn't used. This makes any cgroup v2 migration O(total_number_of_fds_on_the_system) which is horrible and easily leads to noticeable stalls triggering RCU stall warnings and so on. The worst symptom is already fixed in upstream by `bfc2cf6f61` ("cgroup: call subsys->attach() only for subsystems which are actually affected by migration"); however, backporting that commit is too invasive and we want to avoid other cases too. This patch updates net_cls's cgrp_attach() to iterate fds of only the processes which are actually getting migrated. This removes the surprising migration cost which is dependent on the total number of fds in the target cgroup. As this leaves write_classid() the only user of update_classid(), open-code the helper into write_classid(). Reported-by: David Goode <dgoode@fb.com> Fixes: `3b13758f51` ("cgroups: Allow dynamically changing net_classid") Cc: stable@vger.kernel.org # v4.4+ Cc: Nina Schiff <ninasc@fb.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-03-22 10:32:46 -07:00

1 2 3 4 5 ...

4740 Commits