linux

Commit Graph

Author	SHA1	Message	Date
Martin KaFai Lau	4a482f34af	cgroup: bpf: Add bpf_skb_in_cgroup_proto Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk belongs to a descendant of a cgroup2. It is similar to the feature added in netfilter: commit `c38c4597e4` ("netfilter: implement xt_cgroup cgroup2 path match") The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY which will be used by the bpf_skb_in_cgroup. Modifications to the bpf verifier are to ensure BPF_MAP_TYPE_CGROUP_ARRAY and bpf_skb_in_cgroup() are always used together. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:32:13 -04:00
Daniel Borkmann	113214be7f	bpf: refactor bpf_prog_get and type check into helper Since bpf_prog_get() and the program type check are used in a couple of places, refactor this into a small helper function that we can make use of. Since the non RO prog->aux part is not used in performance critical paths and a program destruction via RCU is very unlikely when doing the put, we shouldn't have an issue just doing the bpf_prog_get() + prog->type != type check, but actually not taking the ref at all (due to being in fdget() / fdput() section of the bpf fd) is even cleaner and makes the diff smaller as well, so just go for that. Callsites are changed to make use of the new helper where possible. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 16:00:47 -04:00
Jason Wang	08294a26e1	net: introduce NETDEV_CHANGE_TX_QUEUE_LEN This patch introduces a new event, NETDEV_CHANGE_TX_QUEUE_LEN, which is triggered when tx_queue_len is changed. It can be used by net devices that want to do some processing at that time. An example is tun, which may want to resize its tx array when tx_queue_len is changed. Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:32:17 -04:00
Michal Soltys	33ef84a77d	net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc() cl->cl_vt alone is relative only to the current backlog period, while the curve operates on cumulative virtual time. This patch adds missing cl->cl_vtoff. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	ab12cb4742	net/sched/sch_hfsc.c: go passive after vt update When a class is going passive, it should update its cl_vt first to be consistent with the last dequeue operation. Otherwise its cl_vt will be one packet behind and parent's cvtmax might not be updated as well. One possible side effect is if some class goes passive and subsequently goes active /without/ its parent going passive - with cl_vt lagging one packet behind - comparison made in init_vf() will be affected (same period). Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	2354f056f6	net/sched/sch_hfsc.c: remove leftover dlist and droplist This is update to: commit `a09ceb0e08` ("sched: remove qdisc->drop") That commit removed qdisc->drop, but left alone dlist and droplist that no longer serve any meaningful purpose. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	d1d0fc5e4c	net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len() The condition can only succeed on wrong configurations. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Michal Soltys	12d0ad3be9	net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline Realtime scheduling implemented in HFSC uses head of the queue to make the decision about which packet to schedule next. But in case of any head drop, the deadline calculated for the previous head is not necessarily correct for the next head (unless both packets have the same length). Thanks to peek() function used during dequeue - which internally is a dequeue operation - hfsc is almost safe from this issue, as peek() dequeues and isolates the head storing it temporarily until the real dequeue happens. But there is one exception: if after the class activation a drop happens before the first dequeue operation, there's never a chance to do the peek(). Adding peek() call in enqueue - if this is the first packet in a new backlog period AND the scheduler has realtime curve defined - fixes that one corner case. The 1st hfsc_dequeue() will use that peeked packet, similarly as every subsequent hfsc_dequeue() call uses packet peeked by the previous call. Signed-off-by: Michal Soltys <soltys@ziu.info> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 05:03:43 -04:00
Eric Dumazet	19689e38ec	tcp: md5: use kmalloc() backed scratch areas Some arches have virtually mapped kernel stacks, or will soon have. tcp_md5_hash_header() uses an automatic variable to copy tcp header before mangling th->check and calling crypto function, which might be problematic on such arches. David says that using percpu storage is also problematic on non SMP builds. Just use kmalloc() to allocate scratch areas. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-07-01 04:02:55 -04:00
Mateusz Bajorski	153380ec4b	fib_rules: Added NLM_F_EXCL support to fib_nl_newrule When adding a rule with the NLM_F_EXCL flag, check whether the same rule already exists. If it does, exit with -EEXIST. This is already implemented in iproute2: if (cmd == RTM_NEWRULE) { req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL; req.r.rtm_type = RTN_UNICAST; } Tested ipv4 and ipv6 with net-next linux on qemu x86 expected behavior after patch: localhost ~ # ip rule 0: from all lookup local 32766: from all lookup main 32767: from all lookup default localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005 localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005 RTNETLINK answers: File exists localhost ~ # ip rule 0: from all lookup local 1005: from 10.46.177.97 lookup 104 32766: from all lookup main 32767: from all lookup default There was already a topic regarding this but I don't see any changes merged and the problem still occurs. https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain Signed-off-by: Mateusz Bajorski <mateusz.bajorski@nokia.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:23:19 -04:00
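The exclusive-create behaviour described in 153380ec4b can be illustrated with a small userspace sketch. This is not the kernel's fib_nl_newrule() code: `struct rule`, `struct rule_table`, `add_rule()` and `SKETCH_F_EXCL` are hypothetical names invented for the example, and a rule is assumed to be identified by its source, table and preference only.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_F_EXCL 0x1u  /* hypothetical stand-in for NLM_F_EXCL */
#define MAX_RULES 16

struct rule {
    char from[32];   /* source selector, e.g. "10.46.177.97" */
    unsigned table;  /* routing table to look up */
    unsigned pref;   /* rule priority */
};

struct rule_table {
    struct rule rules[MAX_RULES];
    size_t count;
};

/* Two rules are "the same" when every selector field matches. */
static int rule_equal(const struct rule *a, const struct rule *b)
{
    return a->table == b->table && a->pref == b->pref &&
           strcmp(a->from, b->from) == 0;
}

/*
 * Add a rule. With SKETCH_F_EXCL set, an identical existing rule makes the
 * call fail with -EEXIST; without it, duplicates are accepted as before.
 */
int add_rule(struct rule_table *t, const struct rule *r, unsigned flags)
{
    if (flags & SKETCH_F_EXCL) {
        for (size_t i = 0; i < t->count; i++)
            if (rule_equal(&t->rules[i], r))
                return -EEXIST;
    }
    if (t->count == MAX_RULES)
        return -ENOSPC;
    t->rules[t->count++] = *r;
    return 0;
}
```

Calling `add_rule()` twice with the same rule and `SKETCH_F_EXCL` mirrors the two `ip rule add` invocations in the commit message: the first succeeds, the second reports "File exists".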
Andrey Vagin	b1ed4c4fa9	tcp: add an ability to dump and restore window parameters We found that sometimes a restored tcp socket doesn't work. The reason for this bug is incorrect window parameters, and in this case tcp_acceptable_seq() returns tcp_wnd_end(tp) instead of tp->snd_nxt. The other side drops packets with this seq, because seq is less than tp->rcv_nxt ( tcp_sequence() ). Data from a send queue is sent only if there is enough space in a window, so when we restore unacked data, we need to expand a window to fit this data. This was in a first version of this patch: "tcp: extend window to fit all restored unacked data in a send queue" Then Alexey recommended that I restore window parameters instead of adjusting them according to the data in a send queue. This sounds reasonable. rcv_wnd has to be restored, because it was reported to the other side and the offered window is never shrunk. One of the reasons why we need to restore snd_wnd was described above. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 08:15:31 -04:00
Nikolay Aleksandrov	1080ab95e3	net: bridge: add support for IGMP/MLD stats and export them via netlink This patch adds stats support for the currently used IGMP/MLD types by the bridge. The stats are per-port (plus one stat per-bridge) and per-direction (RX/TX). The stats are exported via netlink via the new linkxstats API (RTM_GETSTATS). In order to minimize the performance impact, a new option is used to enable/disable the stats - multicast_stats_enabled, similar to the recent vlan stats. Also in order to avoid multiple IGMP/MLD type lookups and checks, we make use of the current "igmp" member of the bridge private skb->cb region to record the type on Rx (both host-generated and external packets pass by multicast_rcv()). We can do that since the igmp member was used as a boolean and all the valid IGMP/MLD types are positive values. The normal bridge fast-path is not affected at all, the only affected paths are the flooding ones and since we make use of the IGMP/MLD type, we can quickly determine if the packet should be counted using cache-hot data (cb's igmp member). We add counters for: * IGMP Queries * IGMP Leaves * IGMP v1/v2/v3 reports * MLD Queries * MLD Leaves * MLD v1/v2 reports These are invaluable when monitoring or debugging complex multicast setups with bridges. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 06:18:24 -04:00
Nikolay Aleksandrov	80e73cc563	net: rtnetlink: add support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute This patch adds support for the IFLA_STATS_LINK_XSTATS_SLAVE attribute which allows to export per-slave statistics if the master device supports the linkxstats callback. The attribute is passed down to the linkxstats callback and it is up to the callback user to use it (an example has been added to the only current user - the bridge). This allows us to query only specific slaves of master devices like bridge ports and export only what we're interested in instead of having to dump all ports and searching only for a single one. This will be used to export per-port IGMP/MLD stats and also per-port vlan stats in the future, possibly other statistics as well. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 06:15:04 -04:00
Daniel Borkmann	d2485c4242	bpf: add bpf_skb_change_type helper This work adds a helper for changing skb->pkt_type in a controlled way. We only allow a subset of possible values and can extend that in future should other use cases come up. Doing this as a helper has the advantage that errors can be handled gracefully and thus helper kept extensible. It's a write counterpart to pkt_type member we can already read from struct __sk_buff context. Major use case is to change incoming skbs to PACKET_HOST in a programmatic way instead of having to recirculate via redirect(..., BPF_F_INGRESS), for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00
Daniel Borkmann	6578171a7f	bpf: add bpf_skb_change_proto helper This patch adds a minimal helper for doing the groundwork of changing the skb->protocol in a controlled way. Currently supported is v4 to v6 and vice versa transitions, which allows f.e. for a minimal, static nat64 implementation where applications in containers that still require IPv4 can be transparently operated in an IPv6-only environment. For example, host facing veth of the container can transparently do the transitions in a programmatic way with the help of clsact qdisc and cls_bpf. Idea is to separate concerns for keeping complexity of the helper lower, which means that the programs utilize bpf_skb_change_proto(), bpf_skb_store_bytes() and bpf_lX_csum_replace() to get the job done, instead of doing everything in a single helper (and thus partially duplicating helper functionality). Also, bpf_skb_change_proto() shouldn't need to deal with raw packet data as this is done by other helpers. bpf_skb_proto_6_to_4() and bpf_skb_proto_4_to_6() unclone the skb to operate on a private one, push or pop additionally required header space and migrate the gso/gro meta data from the shared info. We do mark the gso type as dodgy so that headers are checked and segs recalculated by the gso/gro engine. The gso_size target is adapted as well. The flags argument added is currently reserved and can be used for future extensions. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00
Daniel Borkmann	80b48c4457	bpf: don't use raw processor id in generic helper Use smp_processor_id() for the generic helper bpf_get_smp_processor_id() instead of the raw variant. This allows for preemption checks when we have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only need to keep the raw variant for socket filters, but we can reuse the helper that is already there from cBPF side. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:54:40 -04:00
David S. Miller	ee58b57100	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Several cases of overlapping changes, except the packet scheduler conflicts which deal with the addition of the free list parameter to qdisc_enqueue(). Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-30 05:03:36 -04:00
Linus Torvalds	32826ac41f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "I've been traveling so this accumulates more than a week or so of bug fixing. It perhaps looks a little worse than it really is. 1) Fix deadlock in ath10k driver, from Ben Greear. 2) Increase scan timeout in iwlwifi, from Luca Coelho. 3) Unbreak STP by properly reinjecting STP packets back into the stack. Regression fix from Ido Schimmel. 4) Mediatek driver fixes (missing malloc failure checks, leaking of scratch memory, wrong indexing when mapping TX buffers, etc.) from John Crispin. 5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic Sowa. 6) Fix hashing of flows in UDP in the reuseport case, from Xuemin Su. 7) Fix netlink notifications in ovs for tunnels, delete link messages are never emitted because of how the device registry state is handled. From Nicolas Dichtel. 8) Conntrack module leaks kmemcache on unload, from Florian Westphal. 9) Prevent endless jump loops in nft rules, from Liping Zhang and Pablo Neira Ayuso. 10) Not early enough spinlock initialization in mlx4, from Eric Dumazet. 11) Bind refcount leak in act_ipt, from Cong WANG. 12) Missing RCU locking in HTB scheduler, from Florian Westphal. 13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU barrier, using heap for SG and IV, and erroneous use of async flag when allocating AEAD context.) 14) RCU handling fix in TIPC, from Ying Xue. 15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in SIT driver, from Simon Horman. 16) Socket timer deadlock fix in TIPC from Jon Paul Maloy. 17) Fix potential deadlock in team enslave, from Ido Schimmel. 18) Memory leak in KCM procfs handling, from Jiri Slaby. 19) ESN generation fix in ipv4 ESP, from Herbert Xu. 20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong WANG. 21) Use after free in netem, from Eric Dumazet. 22) Uninitialized last assert time in multicast router code, from Tom Goff. 23) Skip raw sockets in sock_diag destruction broadcast, from Willem de Bruijn. 24) Fix link status reporting in thunderx, from Sunil Goutham. 25) Limit resegmentation of retransmit queue so that we do not retransmit too large GSO frames. From Eric Dumazet. 26) Delay bpf program release after grace period, from Daniel Borkmann" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits) openvswitch: fix conntrack netlink event delivery qed: Protect the doorbell BAR with the write barriers. neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit() e1000e: keep VLAN interfaces functional after rxvlan off cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag() bpf, perf: delay release of BPF prog after grace period net: bridge: fix vlan stats continue counter tcp: do not send too big packets at retransmit time ibmvnic: fix to use list_for_each_safe() when delete items net: thunderx: Fix TL4 configuration for secondary Qsets net: thunderx: Fix link status reporting net/mlx5e: Reorganize ethtool statistics net/mlx5e: Fix number of PFC counters reported to ethtool net/mlx5e: Prevent adding the same vxlan port net/mlx5e: Check for BlueFlame capability before allocating SQ uar net/mlx5e: Change enum to better reflect usage net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices net/mlx5: Update command strings net: marvell: Add separate config ANEG function for Marvell 88E1111 ...	2016-06-29 11:50:42 -07:00
David S. Miller	751ad819b0	Merge tag 'mac80211-for-davem-2016-06-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just two small fixes * fix mesh peer link counter, decrement wasn't always done at all * fix ethertype (length) for packets without RFC 1042 or bridge tunnel header ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 08:33:46 -04:00
Samuel Gauthier	d913d3a763	openvswitch: fix conntrack netlink event delivery Only the first and last netlink message for a particular conntrack are actually sent. The first message is sent through nf_conntrack_confirm when the conntrack is committed. The last one is sent when the conntrack is destroyed on timeout. The other conntrack state change messages are not advertised. When the conntrack subsystem is used from netfilter, nf_conntrack_confirm is called for each packet, from the postrouting hook, which in turn calls nf_ct_deliver_cached_events to send the state change netlink messages. This commit fixes the problem by calling nf_ct_deliver_cached_events in the non-commit case as well. Fixes: `7f8a436eaa` ("openvswitch: Add conntrack action") CC: Joe Stringer <joestringer@nicira.com> CC: Justin Pettit <jpettit@nicira.com> CC: Andy Zhou <azhou@nicira.com> CC: Thomas Graf <tgraf@suug.ch> Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 08:13:59 -04:00
Eric Dumazet	8a6e9c6703	net_sched: netem: do not call qdisc_drop() with a NULL skb If skb_unshare() fails, we call qdisc_drop() with a NULL skb, which is no longer supported. Fixes: `520ac30f45` ("net_sched: drop packets after root qdisc lock is released") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 08:02:24 -04:00
David Barroso	b560f03ddf	neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit() neigh_xmit() expects to be called inside an RCU-bh read side critical section, and while one of its two current callers gets this right, the other one doesn't. More specifically, neigh_xmit() has two callers, mpls_forward() and mpls_output(), and while both callers call neigh_xmit() under rcu_read_lock(), this provides sufficient protection for neigh_xmit() only in the case of mpls_forward(), as that is always called from softirq context and therefore doesn't need explicit BH protection, while mpls_output() can be called from process context with softirqs enabled. When mpls_output() is called from process context, with softirqs enabled, we can be preempted by a softirq at any time, and RCU-bh considers the completion of a softirq as signaling the end of any pending read-side critical sections, so if we do get a softirq while we are in the part of neigh_xmit() that expects to be run inside an RCU-bh read side critical section, we can end up with an unexpected RCU grace period running right in the middle of that critical section, making things go boom. This patch fixes this impedance mismatch in the callee, by making neigh_xmit() always take rcu_read_{,un}lock_bh() around the code that expects to be treated as an RCU-bh read side critical section, as this seems a safer option than fixing it in the callers. Fixes: `4fd3d7d9e8` ("neigh: Add helper function neigh_xmit") Signed-off-by: David Barroso <dbarroso@fastly.com> Signed-off-by: Lennert Buytenhek <lbuytenhek@fastly.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 07:58:28 -04:00
Felix Fietkau	c041778c96	cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header The PDU length of incoming LLC frames is set to the total skb payload size in __ieee80211_data_to_8023() of net/wireless/util.c which incorrectly includes the length of the IEEE 802.11 header. The resulting LLC frame header has a too large PDU length, causing the llc_fixup_skb() function of net/llc/llc_input.c to reject the incoming skb, effectively breaking STP. Solve the problem by properly subtracting the IEEE 802.11 frame header size from the PDU length, allowing the LLC processor to pick up the incoming control messages. Special thanks to Gerry Rozema for tracking down the regression and proposing a suitable patch. Fixes: `2d1c304cb2` ("cfg80211: add function for 802.3 conversion with separate output buffer") Cc: stable@vger.kernel.org Reported-by: Gerry Rozema <gerryr@rozeware.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>	2016-06-29 11:50:33 +02:00
Nikolay Aleksandrov	565ce8f32a	net: bridge: fix vlan stats continue counter I made a dumb off-by-one mistake when I added the vlan stats counter dumping code. The increment should happen before the check, not after otherwise we miss one entry when we continue dumping. Fixes: `a60c090361` ("bridge: netlink: export per-vlan stats") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 05:33:35 -04:00
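The off-by-one described in 565ce8f32a is easiest to see in a resumable dump loop. The following is a userspace sketch, not the bridge code: `dump_entries()` and its parameters are hypothetical, and it assumes the resume position is stored as the 1-based index of the first entry that did not fit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy entries from src into out, at most cap per call. *prividx carries the
 * resume position between calls (0 means "start from the beginning") and is
 * reset to 0 once the whole list has been dumped.
 * Returns the number of entries written by this call.
 */
size_t dump_entries(const int *src, size_t n, int *out, size_t cap,
                    size_t *prividx)
{
    size_t idx = 0, written = 0;

    assert(cap > 0);  /* a zero-sized buffer could never make progress */

    for (size_t i = 0; i < n; i++) {
        /* Increment first: idx is now the 1-based position of src[i]. */
        if (++idx < *prividx)
            continue;  /* already delivered by an earlier call */
        if (written == cap) {
            *prividx = idx;  /* src[i] did not fit; resume exactly here */
            return written;
        }
        out[written++] = src[i];
    }
    *prividx = 0;
    return written;
}
```

With a post-increment test (`idx++ < *prividx`) the comparison would use the position of the previous entry, so the entry recorded in `*prividx` would itself be skipped on the next call, which is the "miss one entry when we continue dumping" symptom from the commit message.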
Eric Dumazet	a3d2e9f8eb	tcp: do not send too big packets at retransmit time Arjun reported a bug in TCP stack and bisected it to a recent commit. In case where we process SACK, we can coalesce multiple skbs into fat ones (tcp_shift_skb_data()), to lower write queue overhead, because we do not expect to retransmit these packets. However, SACK reneging can happen, forcing the sender to retransmit all these packets. If skb->len is above 64KB, we then send buggy IP packets that could hang TSO engine on cxgb4. Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs so that we cook packets of optimal size vs TCP/pacing. Thanks to Arjun for reporting the bug and running the tests ! Fixes: `10d3be5692` ("tcp-tso: do not split TSO packets at retransmit time") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Arjun V <arjun@chelsio.com> Tested-by: Arjun V <arjun@chelsio.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 05:25:11 -04:00
Richard Alpe	bc3a334cc2	tipc: rename udp_port in struct udp_media_addr Context implies that port in struct "udp_media_addr" is referring to a UDP port. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 05:17:37 -04:00
Richard Alpe	e99429232e	tipc: honor msg2addr return value The UDP msg2addr function tipc_udp_msg2addr() can return -EINVAL which prior to this patch was unhandled in the caller. Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 05:17:37 -04:00
Wei Tang	8a01ed70eb	net: the space is required before the open parenthesis '(' The space is missing before the open parenthesis '(', and this will introduce much more noise when checking patch around. Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 05:15:14 -04:00
Sven Eckelmann	420cb1b764	batman-adv: Clean up untagged vlan when destroying via rtnl-link The untagged vlan object is only destroyed when the interface is removed via the legacy sysfs interface. But it also has to be destroyed when the standard rtnl-link interface is used. Fixes: `5d2c05b213` ("batman-adv: add per VLAN interface attribute framework") Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 04:01:48 -04:00
Sven Eckelmann	3b55e44220	batman-adv: Fix ICMP RR ethernet access after skb_linearize The skb_linearize may reallocate the skb. This makes the calculated pointer for ethhdr invalid. But the pointer is used later to fill in the RR field of the batadv_icmp_packet_rr packet. Instead re-evaluate eth_hdr after the skb_linearize+skb_cow to fix the pointer and avoid the invalid read. Fixes: `da6b8c20a5` ("batman-adv: generalize batman-adv icmp packet handling") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 04:01:48 -04:00
Ben Hutchings	baceced932	batman-adv: Fix double-put of vlan object Each batadv_tt_local_entry hold a single reference to a batadv_softif_vlan. In case a new entry cannot be added to the hash table, the error path puts the reference, but the reference will also now be dropped by batadv_tt_local_entry_release(). Fixes: `a33d970d0b` ("batman-adv: Fix reference counting of vlan object for tt_local_entry") Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 04:01:47 -04:00
Sven Eckelmann	9c4604a298	batman-adv: Fix use-after-free/double-free of tt_req_node The tt_req_node is added and removed from a list inside a spinlock. But the locking is sometimes removed even when the object is still referenced and will be used later via this reference. For example batadv_send_tt_request can create a new tt_req_node (including add to a list) and later re-acquires the lock to remove it from the list and to free it. But at this time another context could have already removed this tt_req_node from the list and freed it. CPU#0 batadv_batman_skb_recv from net_device 0 -> batadv_iv_ogm_receive -> batadv_iv_ogm_process -> batadv_iv_ogm_process_per_outif -> batadv_tvlv_ogm_receive -> batadv_tvlv_ogm_receive -> batadv_tvlv_containers_process -> batadv_tvlv_call_handler -> batadv_tt_tvlv_ogm_handler_v1 -> batadv_tt_update_orig -> batadv_send_tt_request -> batadv_tt_req_node_new spin_lock(...) allocates new tt_req_node and adds it to list spin_unlock(...) return tt_req_node CPU#1 batadv_batman_skb_recv from net_device 1 -> batadv_recv_unicast_tvlv -> batadv_tvlv_containers_process -> batadv_tvlv_call_handler -> batadv_tt_tvlv_unicast_handler_v1 -> batadv_handle_tt_response spin_lock(...) tt_req_node gets removed from list and is freed spin_unlock(...) CPU#0 <- returned to batadv_send_tt_request spin_lock(...) tt_req_node gets removed from list and is freed MEMORY CORRUPTION/SEGFAULT/... spin_unlock(...) This can only be solved via reference counting to allow multiple contexts to handle the list manipulation while making sure that only the last context holding a reference will free the object. Fixes: `a73105b8d4` ("batman-adv: improved client announcement mechanism") Signed-off-by: Sven Eckelmann <sven@narfation.org> Tested-by: Martin Weinelt <martin@darmstadt.freifunk.net> Tested-by: Amadeus Alfa <amadeus@chemnitz.freifunk.net> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 04:01:47 -04:00
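The reference-counting idea from 9c4604a298 can be sketched in plain C. This is a single-threaded illustration under stated assumptions, not the batman-adv implementation: `struct req_node`, `req_node_new()`, `req_node_put()` and `req_node_unlink()` are hypothetical names, the spinlock is omitted, the counter is a plain `int` where kernel code would need an atomic reference count, and the `released` pointer exists only so a caller can observe how often the node is freed.

```c
#include <assert.h>
#include <stdlib.h>

struct req_node {
    int refcount;   /* number of contexts still holding the node */
    int on_list;    /* 1 while the node is linked into the request list */
    int *released;  /* hypothetical hook: incremented when the node is freed */
};

/* Create a node that is already "on the list": one reference belongs to the
 * list, one to the creator that keeps using the node afterwards. */
struct req_node *req_node_new(int *released)
{
    struct req_node *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    n->refcount = 2;
    n->on_list = 1;
    n->released = released;
    return n;
}

/* Drop one reference; only the last holder actually frees the node. */
void req_node_put(struct req_node *n)
{
    assert(n->refcount > 0);
    if (--n->refcount == 0) {
        (*n->released)++;
        free(n);
    }
}

/* Unlink the node (the list lock would be held here). Only the context that
 * really unlinks it drops the list's reference, so a second caller is a no-op. */
void req_node_unlink(struct req_node *n)
{
    if (n->on_list) {
        n->on_list = 0;
        req_node_put(n);
    }
}
```

In the CPU#0/CPU#1 interleaving from the commit message, CPU#1 would call `req_node_unlink()` first, and CPU#0 would later call `req_node_unlink()` followed by `req_node_put()`; the node stays valid for CPU#0 and is freed exactly once.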
Simon Wunderlich	0b3dd7dfb8	batman-adv: replace WARN with rate limited output on non-existing VLAN If a VLAN tagged frame is received and the corresponding VLAN is not configured on the soft interface, it will splat a WARN on every packet received. This is a quite annoying behaviour for some scenarios, e.g. if bat0 is bridged with eth0, and there are arbitrary VLAN tagged frames from Ethernet coming in without having any VLAN configuration on bat0. The code should probably create vlan objects on the fly and transparently transport these VLAN-tagged Ethernet frames, but until this is done, at least the WARN splat should be replaced by a rate limited output. Fixes: `354136bcc3` ("batman-adv: fix kernel crash due to missing NULL checks") Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-29 04:01:47 -04:00
daniel	0888d5f3c0	Bridge: Fix ipv6 mc snooping if bridge has no ipv6 address The bridge is falsely dropping ipv6 multicast packets if there is: 1. No ipv6 address assigned on the bridge. 2. No external mld querier present. 3. The internal querier enabled. When the bridge fails to build mld queries, because it has no ipv6 address, it silently returns, but keeps the local querier enabled. This specific case causes confusing packet loss. Ipv6 multicast snooping can only work if: a) An external querier is present OR b) The bridge has an ipv6 address and is capable of sending its own queries Otherwise it has to forward/flood the ipv6 multicast traffic, because snooping cannot work. This patch fixes the issue by adding a flag to the bridge struct that indicates that there is currently no ipv6 address assigned to the bridge and returns a false state for the local querier in __br_multicast_querier_exists(). Special thanks to Linus Lüssing. Fixes: `d1d81d4c3d` ("bridge: check return value of ipv6_dev_get_saddr()") Signed-off-by: Daniel Danzberger <daniel@dd-wrt.com> Acked-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-28 08:03:04 -04:00
Jouni Malinen	126e755732	mac80211: Fix mesh estab_plinks counting in STA removal case If a user space program (e.g., wpa_supplicant) deletes a STA entry that is currently in NL80211_PLINK_ESTAB state, the number of established plinks counter was not decremented and this could result in rejecting new plink establishment before really hitting the real maximum plink limit. For !user_mpm case, this decrementation is handled by mesh_plink_deactive(). Fix this by decrementing estab_plinks on STA deletion (mesh_sta_cleanup() gets called from there) so that the counter has a correct value and the Beacon frame advertisement in Mesh Configuration element shows the proper value for capability to accept additional peers. Cc: stable@vger.kernel.org Signed-off-by: Jouni Malinen <j@w1.fi> Signed-off-by: Johannes Berg <johannes@sipsolutions.net>	2016-06-28 12:39:50 +02:00
Amitoj Kaur Chawla	56e2f23b72	caif: Remove unneeded header file Drop redundant include of moduleparam.h The Coccinelle semantic patch used to make this change is as follows: @ includesmodule @ @@ #include <linux/module.h> @ depends on includesmodule @ @@ - #include <linux/moduleparam.h> Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-28 05:26:14 -04:00
David Ahern	637c841dd7	net: diag: Add support to filter on device index Add support to inet_diag facility to filter sockets based on device index. If an interface index is in the filter only sockets bound to that index (sk_bound_dev_if) are returned. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-28 05:25:04 -04:00
Tom Goff	70a0dec451	ipmr/ip6mr: Initialize the last assert time of mfc entries. This fixes wrong-interface signaling on 32-bit platforms for entries created when jiffies > 2^31 + MFC_ASSERT_THRESH. Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-28 04:14:09 -04:00
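The 32-bit wraparound behind the commit above can be illustrated with a small sketch. It assumes the kernel check is a wrap-safe "is now later than last assert plus threshold" comparison in the style of `time_after()`, and models 32-bit jiffies with explicit masking. `ASSERT_THRESH`, `should_signal()` and `initial_last_assert()` are hypothetical names and values chosen for illustration, not the kernel's.

```python
MASK32 = 0xFFFFFFFF      # jiffies is 32 bits wide on 32-bit platforms
ASSERT_THRESH = 3000     # hypothetical threshold in ticks (assumption)


def to_signed32(value):
    """Reinterpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def time_after(a, b):
    """Wrap-safe 'a is later than b' for 32-bit tick counters."""
    return to_signed32(b - a) < 0


def should_signal(now, last_assert):
    """True when the threshold has elapsed since the last assert."""
    return time_after(now, (last_assert + ASSERT_THRESH) & MASK32)


def initial_last_assert(now):
    """One way to initialize a new entry so the first check passes."""
    return (now - ASSERT_THRESH - 1) & MASK32
```

With `last_assert` left at zero and `now` beyond 2^31 + `ASSERT_THRESH`, the signed difference flips sign and `should_signal()` stays false for a long time; initializing the field relative to the current tick count avoids that.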
Stefan Hajnoczi	4192f672fa	vsock: make listener child lock ordering explicit There are several places where the listener and pending or accept queue child sockets are accessed at the same time. Lockdep is unhappy that two locks from the same class are held. Tell lockdep that it is safe and document the lock ordering. Originally Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> sent a similar patch asking whether this is safe. I have audited the code and also covered the vsock_pending_work() function. Suggested-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-27 10:44:46 -04:00
Paolo Abeni	48f1dcb55a	ipv6: enforce egress device match in per table nexthop lookups With the commit `8c14586fc3` ("net: ipv6: Use passed in table for nexthop lookups"), nexthop lookup is first performed on route creation in the passed-in table. However, device match is not enforced in table lookup, so the found route can be later discarded due to egress device mismatch and no global lookup will be performed. This causes the following to fail: ip link add dummy1 type dummy ip link add dummy2 type dummy ip link set dummy1 up ip link set dummy2 up ip route add 2001:db8:8086::/48 dev dummy1 metric 20 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20 ip route add 2001:db8:8086::/48 dev dummy2 metric 21 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21 RTNETLINK answers: No route to host This change fixes the issue by enforcing device lookup in ip6_nh_lookup_table() v1->v2: updated commit message title Fixes: `8c14586fc3` ("net: ipv6: Use passed in table for nexthop lookups") Reported-and-tested-by: Beniamino Galvani <bgalvani@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-27 10:37:20 -04:00
David S. Miller	5db15872c5	linux-can-next-for-4.8-20160623 Merge tag 'linux-can-next-for-4.8-20160623' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2016-06-17 this is a pull request of 4 patches for net-next/master. Arnd Bergmann's patch fixes a regression in af_can introduced in linux-can-next-for-4.8-20160617. There are two patches by Ramesh Shanmugasundaram, which add CAN-2.0 support to the rcar_canfd driver. And a patch by Ed Spiridonov that adds better error diagnoses messages to the Ed Spiridonov driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-27 10:33:42 -04:00
Amitoj Kaur Chawla	810bf11033	tipc: Use kmemdup instead of kmalloc and memcpy Replace calls to kmalloc followed by a memcpy with a direct call to kmemdup. The Coccinelle semantic patch used to make this change is as follows: @@ expression from,to,size,flag; statement S; @@ - to = $kmalloc\\|kzalloc$(size,flag); + to = kmemdup(from,size,flag); if (to==NULL \|\| ...) S - memcpy(to, from, size); Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-27 09:56:58 -04:00
David S. Miller	2b7c4f7a0e	RxRPC rewrite Merge tag 'rxrpc-rewrite-20160622-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Get rid of conn bundle and transport structs Here's the next part of the AF_RXRPC rewrite. The primary purpose of this set is to get rid of the rxrpc_conn_bundle and rxrpc_transport structs. This simplifies things for future development of the connection handling. To this end, the following significant changes are made: (1) The rxrpc_connection struct is given pointers to the local and peer endpoints, inside the rxrpc_conn_parameters struct. Pointers to the transport's copy of these pointers are then redirected to the connection struct. (2) Exclusive connection handling is fixed. Exclusive connections should do just one call and then be retired. They are used in security negotiations and, I believe, the idea is to avoid reuse of negotiated security contexts. The current code is doing a single connection per socket and doing all the calls over that. With this change it gets a new connection for each call made. 
(3) A new sendmsg() control message marker is added to make individual calls operate over exclusive connections. This should be used in future in preference to the sockopt that marks a socket as "exclusive connection". (4) IDs for client connections initiated by a machine are now allocated from a global pool using the IDR facility and are unique across all client connections, no matter their destination. The IDR facility is then used to look up a connection on the connection ID alone. Other parameters are then verified afterwards. Note that the IDR facility may use a lot of memory if the IDs it holds are widely scattered. Given this, in a future commit, client connections will be retired if they are more than a certain distance from the last ID allocated. The client epoch is advanced by 1 each time the client ID counter wraps. Connections outside the current epoch will also be retired in a future commit. (5) The connection bundle concept is removed and the client connection tree is moved into the local endpoint. The queue for waiting for a call channel is moved to the rxrpc_connection struct as there can only be one connection for any particular key going to any particular peer now. (6) The rxrpc_transport struct is removed and the service connection tree is moved into the peer struct. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-26 16:01:54 -04:00
Eric Dumazet	4d202a0d31	net_sched: generalize bulk dequeue When qdisc bulk dequeue was added in linux-3.18 (commit `5772e9a346` "qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE"), it was constrained to some specific qdiscs. With some extra care, we can extend this to all qdiscs, so that typical traffic shaping solutions can benefit from small batches (8 packets in this patch). For example, HTB is often used on some multi queue device. And bonding/team are multi queue devices... Idea is to bulk-dequeue packets mapping to the same transmit queue. This brings between 35 and 80 % performance increase in HTB setup under pressure on a bonding setup : 1) NUMA node contention : 610,000 pps -> 1,110,000 pps 2) No node contention : 1,380,000 pps -> 1,930,000 pps Now we should work to add batches on the enqueue() side ;) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-25 12:19:35 -04:00
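The bulk-dequeue idea described in the commit above can be sketched roughly as follows. This is a simplified illustration, not the kernel implementation: packets are modeled as hypothetical `(txq, payload)` tuples, and the sketch peeks at the head of the queue instead of dequeuing and stashing a packet that maps to a different transmit queue. The budget of 8 matches the batch size mentioned in the message.

```python
from collections import deque


def bulk_dequeue(queue, budget=8):
    """Pop up to `budget` packets that map to the same transmit queue.

    `queue` is a deque of (txq, payload) tuples. The batch ends when the
    budget is used up, the queue is empty, or the next packet maps to a
    different transmit queue than the first one.
    """
    if not queue:
        return []
    batch = [queue.popleft()]
    txq = batch[0][0]
    while queue and len(batch) < budget and queue[0][0] == txq:
        batch.append(queue.popleft())
    return batch
```

A caller would then hand the whole batch to the driver while taking the transmit queue lock only once, which is where the batching gain comes from.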
Eric Dumazet	338ed9b4de	net_sched: sch_htb: export class backlog in dumps We already get child qdisc qlen, we also can get its backlog so that class dumps can report it. Also replace qstats by a single drop counter, but move it in a separate cache line so that drops do not dirty useful cache lines. Tested: $ tc -s cl sh dev eth0 class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0) rate 1001Mbit 517543pps backlog 120758b 499p requeues 0 lended: 9021770 borrowed: 0 giants: 0 tokens: 9 ctokens: 9 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-25 12:19:35 -04:00
Eric Dumazet	008830bc32	net_sched: fq_codel: cache skb->truesize into skb->cb Now we defer skb drops, it makes sense to keep a copy of skb->truesize in struct codel_skb_cb to avoid one cache line miss per dropped skb in fq_codel_drop(), to reduce latencies a bit further. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-25 12:19:35 -04:00
Eric Dumazet	520ac30f45	net_sched: drop packets after root qdisc lock is released Qdisc performance suffers when packets are dropped at enqueue() time because drops (kfree_skb()) are done while qdisc lock is held, delaying a dequeue() draining the queue. Nominal throughput can be reduced by 50 % when this happens, at a time we would like the dequeue() to proceed as fast as possible. Even FQ is vulnerable to this problem, while one of FQ goals was to provide some flow isolation. This patch adds a 'struct sk_buff **to_free' parameter to all qdisc->enqueue(), and in qdisc_drop() helper. I measured a performance increase of up to 12 %, but this patch is a prereq so that future batches in enqueue() can fly. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-25 12:19:35 -04:00
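The pattern in the commit above, collecting dropped packets on a `to_free` list while the lock is held and freeing them only after it is released, might look like this in a user-space sketch. `BoundedQueue`, `xmit()` and `free_fn` are hypothetical names for illustration; a plain Python list stands in for the `struct sk_buff **to_free` chain.

```python
import threading
from collections import deque


class BoundedQueue:
    """Toy queue with a packet limit, guarded by a single lock."""

    def __init__(self, limit):
        self.limit = limit
        self.lock = threading.Lock()
        self.packets = deque()

    def enqueue(self, pkt, to_free):
        """Called with self.lock held. Never frees anything itself."""
        if len(self.packets) >= self.limit:
            to_free.append(pkt)   # defer the (possibly slow) free
            return False
        self.packets.append(pkt)
        return True


def xmit(queue, pkt, free_fn):
    """Enqueue under the lock, then free dropped packets outside it."""
    to_free = []
    with queue.lock:
        accepted = queue.enqueue(pkt, to_free)
    for dropped in to_free:       # lock already released here
        free_fn(dropped)
    return accepted
```

Because the expensive free runs after the lock is released, a concurrent dequeue is not delayed by drops on the enqueue side.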
Jarno Rajahalme	7d904c7bcd	openvswitch: Only set mark and labels with a commit flag. Only set conntrack mark or labels when the commit flag is specified. This makes sure we cannot set them before the connection has been persisted, as in that case the mark and labels would be lost in the event of a userspace upcall. OVS userspace already requires the commit flag to accept setting ct_mark and/or ct_labels. Validate for this in the kernel API. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-25 11:55:51 -04:00
Jarno Rajahalme	1c1779fa54	openvswitch: Set mark and labels before confirming. Set conntrack mark and labels right before committing so that the initial conntrack NEW event has the mark and labels. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-25 11:55:51 -04:00
Eric Dumazet	21de12ee55	netem: fix a use after free If the packet was dropped by lower qdisc, then we must not access it later. Save qdisc_pkt_len(skb) in a temp variable. Fixes: `2ccccf5fb4` ("net_sched: update hierarchical backlog too") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-06-23 15:07:44 -04:00

42472 Commits