Since addrconf_get_prefix_route inputs the address prefix to fib6_locate,
which does not uses the data which is out of the prefix_len length,
so do not need to use ipv6_addr_prefix to get address prefix.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves nft_reject_ipv4 to nft_reject and adds support
for IPv6 protocol. This patch uses functions included in nf_reject.h
to implement reject by TCP reset.
The code has to be build as a module if NF_TABLES_IPV6 is also a
module to avoid compilation error due to usage of IPv6 functions.
This has been done in Kconfig by using the construct:
depends on NF_TABLES_IPV6 || !NF_TABLES_IPV6
This seems a bit weird in terms of syntax but works perfectly.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch prepares the addition of TCP reset support in
the nft_reject module by moving reusable code into a header
file.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
recvmsg handler in net/rose/af_rose.c performs size-check ->msg_namelen.
After commit f3d3342602
(net: rework recvmsg handler msg_name and msg_namelen logic), we now
always take the else branch due to namelen being initialized to 0.
Digging in netdev-vger-cvs git repo shows that msg_namelen was
initialized with a fixed-size since at least 1995, so the else branch
was never taken.
Compile tested only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A deadlock might occur if name table is withdrawn in socket release
routine, and while packets are still being received from bearer.
CPU0 CPU1
T0: recv_msg() release()
T1: tipc_recv_msg() tipc_withdraw()
T2: [grab node lock] [grab port lock]
T3: tipc_link_wakeup_ports() tipc_nametbl_withdraw()
T4: [grab port lock]* named_cluster_distribute()
T5: wakeupdispatch() tipc_link_send()
T6: [grab node lock]*
The opposite order of holding port lock and node lock on above two
different paths may result in a deadlock. If socket lock instead of
port lock is used to protect port instance in tipc_withdraw(), the
reverse order of holding port lock and node lock will be eliminated,
as a result, the deadlock is killed as well.
Reported-by: Lars Everbrand <lars.everbrand@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove leftover code that is not used anywhere in current tree.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following are only used in one file:
tcp_connect_init
tcp_set_rto
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In Kconfig, nf_tables depends on NFNETLINK so building nf_tables as
a module or inside kernel depends on the state of NFNETLINK inside
the kernel config. If someone wants to build nf_tables inside the
kernel, it is necessary to also build NFNETLINK inside the kernel.
But NFNETLINK can not be set in the menu so it is necessary to
toggle other nfnetlink subsystems such as logging and nfacct to see
the nf_tables switch.
This patch changes the dependency from 'depend' to 'select' inside
Kconfig to allow to set the build of nftables as modules or inside
kernel independently.
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- reset netfilter-bridge state when removing the batman-adv
header from an incoming packet. This prevents netfilter
bridge from being fooled when the same packet enters a
bridge twice (or more): the first time within the
batman-adv header and the second time without.
- adjust the packet layout to prevent any architecture from
adding padding bytes. All the structs sent over the wire
now have size multiple of 4bytes (unless pack(2) is used).
- fix access to the inner vlan_eth header when reading the
VID in the rx path.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJSvvT6AAoJEEKTMo6mOh1Vn4EP+gKxohskmxmnRd/MnOL1FyyR
+T1+RtNm3Z2+v0OlR1zcHS5MJwt7KDrakS+w4tJTNiZTaPPjHQZu9wAn4wlFt9ix
9GO9nAtTbOcPj72WZsmq3Wslm2qKnF+kmu7hdOu7Grv3n3EEafQ+P8vzw7FACB+N
dMiwkfGdHgSxlM7a1faJM9N2qiSM2vUesElYPyinhjCciHwBuFW3HwK3CxayVwVE
Q20AMlgUd18ZGgy5oqrZz/1LCpaOsydtbcNHOXjF8gX0zwCTmog9GTAD6hT0mtTU
8lo2DHLFH9Hxm1bs8jDkiQEBLK2gXPGb6ic07IRE6g24my62RrAizE/yMmd1WrCP
6I8Qhs58tQ0RUVH8m6XG71TBw3nLk69QbKxHbYiDWlLQhrygx/9+ZXfuu3VUPvGS
9EmNuWw3Io7+KqaT0BJC6CJIkZLTL7oJvgyO2uFGEgkSjY5vxSF8GU0yisP9xilW
bpfhQO64gsmxjY2tjmIVRckVTUOsMnJKVlquq6aHKfgsOQdnTTNOd+l7bD8izIJ8
GwZKZLuzYJqYop4TxpkbNK2ckCl0kcMCHHLInJ/pO1KU0sAs+V1+HHouX6VGr9HT
IsbBvwMslbRPi8uOFAzueHTT/DzBXdP4sMsg5QomBQOqH8LlZuQ4ZBHtUbEQOVrp
xRCy3Fj/wjWzZUFdYE3f
=RDuB
-----END PGP SIGNATURE-----
Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge
Included changes:
- reset netfilter-bridge state when removing the batman-adv
header from an incoming packet. This prevents netfilter
bridge from being fooled when the same packet enters a
bridge twice (or more): the first time within the
batman-adv header and the second time without.
- adjust the packet layout to prevent any architecture from
adding padding bytes. All the structs sent over the wire
now have size multiple of 4bytes (unless pack(2) is used).
- fix access to the inner vlan_eth header when reading the
VID in the rx path.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
This patchset contains four nf_tables fixes, one IPVS fix due to
missing updates in the interaction with the new sedadj conntrack
extension that was added to support the netfilter synproxy code,
and a couple of one-liners to fix netnamespace netfilter issues.
More specifically, they are:
* Fix ipv6_find_hdr() call without offset being explicitly initialized
in nft_exthdr, as required by that function, from Daniel Borkmann.
* Fix oops in nfnetlink_log when using netns and unloading the kernel
module, from Gao feng.
* Fix BUG_ON in nf_ct_timestamp extension after netns is destroyed,
from Helmut Schaa.
* Fix crash in IPVS due to missing sequence adjustment extension being
allocated in the conntrack, from Jesper Dangaard Brouer.
* Add bugtrap to spot a warning in case you deference sequence adjustment
conntrack area when not available, this should help to catch similar
invalid dereferences in the Netfilter tree, also from Jesper.
* Fix incomplete dumping of sets in nf_tables when retrieving by family,
from me.
* Fix oops when updating the table state (dormant <-> active) and having
user (not base ) chains, from me.
* Fix wrong validation in set element data that results in returning
-EINVAL when using the nf_tables dictionary feature with mappings,
also from me.
We don't usually have this amount of fixes by this time (as we're already
in -rc5 of the development cycle), although half of them are related to
nf_tables which is a relatively new thing, and I also believe that holidays
have also delayed the flight of bugfixes to mainstream a bit.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't export ping_table or ping_v4_sendmsg. Both are only used
inside ping code.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
inetpeer_invalidate_family defined but never used
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't export arp_invalidate, only used in arp.c
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make fib_detect_death function static only used in one file.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes dictionary mappings, eg.
add rule ip filter input meta dnat set tcp dport map { 22 => 1.1.1.1, 23 => 2.2.2.2 }
The kernel was returning -EINVAL in nft_validate_data_load() since
the type of the set element data that is passed was the real userspace
datatype instead of NFT_DATA_VALUE.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When batadv_get_vid() is invoked in interface_rx() the
batman-adv header has already been removed, therefore
the header_len argument has to be 0.
Introduced by c018ad3de6
("batman-adv: add the VLAN ID attribute to the TT entry")
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
If an interface enslaved into batman-adv is a bridge (or a
virtual interface built on top of a bridge) the nf_bridge
member of the skbs reaching the soft-interface is filled
with the state about "netfilter bridge" operations.
Then, if one of such skbs is locally delivered, the nf_bridge
member should be cleaned up to avoid that the old state
could mess up with other "netfilter bridge" operations when
entering a second bridge.
This is needed because batman-adv is an encapsulation
protocol.
However at the moment skb->nf_bridge is not released at all
leading to bogus "netfilter bridge" behaviours.
Fix this by cleaning the netfilter state of the skb before
it gets delivered to the upper layer in interface_rx().
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
In e035b77 ("netfilter: nf_tables: nft_meta module get/set ops"),
we got the meta target merged into the existing meta expression.
So let's get rid of this dead code now that we fully support that
feature.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds kernel support for the meta expression in get/set
flavour. The set operation indicates that a given packet has to be
set with a property, currently one of mark, priority, nftrace.
The get op is what was currently working: evaluate the given
packet property.
In the nftrace case, the value is always 1. Such behaviour is copied
from net/netfilter/xt_TRACE.c
The NFTA_META_DREG and NFTA_META_SREG attributes are mutually
exclusives.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make struct batadv_tvlv_tt_change a multiple 4 bytes long
to avoid padding on any architecture.
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Since this is a mac address and always 48 bit, and we can assume that
it is always aligned to 2-byte boundaries, add a pack(2) pragma.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
struct batadv_icmp_header currently has a size of 17, which
will be padded to 20 on some architectures. Fix this by
unrolling the header into the parent structures.
Moreover keep the ICMP parsing functions as generic as they
are now by using a stub icmp_header struct during packet
parsing.
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The size of the batadv_header of 3 is problematic on some architectures
which automatically pad all structures to a 32 bit boundary. To not lose
performance by packing this struct, better embed it into the various
host structures.
Reported-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
The compiler may decide to pad the structure, and then it does not
have the expected size of 46 byte. Fix this by moving it in the
pragma pack(2) part of the code.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
If not table name is specified, the dumping of the existing sets
may be incomplete with a sufficiently large number of sets and
tables. This patch fixes missing reset of the cursors after
finding the location of the last object that has been included
in the previous multi-part message.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
if a dst is not attached to anywhere, it should be released before
exit ipip6_tunnel_xmit, otherwise cause dst memory leakage.
Fixes: 61c1db7fae ("ipv6: sit: add GSO/TSO support")
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a bug fix. The existing code tries to kill many
birds with one stone: Handling binding of actions to
filters, new actions and replacing of action
attributes. A simple test case to illustrate:
XXXX
moja@fe1:~$ sudo tc actions add action drop index 12
moja@fe1:~$ actions get action gact index 12
action order 1: gact action drop
random type none pass val 0
index 12 ref 1 bind 0
moja@fe1:~$ sudo tc actions replace action ok index 12
moja@fe1:~$ actions get action gact index 12
action order 1: gact action drop
random type none pass val 0
index 12 ref 2 bind 0
XXXX
The above shows the refcounf being wrongly incremented on replace.
There are more complex scenarios with binding of actions to filters
that i am leaving out that didnt work as well...
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPVS FTP helper ip_vs_ftp could trigger an OOPS in nf_ct_seqadj_set,
after commit 41d73ec053 (netfilter: nf_conntrack: make sequence number
adjustments usuable without NAT).
This is because, the seqadj ext is now allocated dynamically, and the
IPVS code didn't handle this situation. Fix this in the IPVS nfct
code by invoking the alloc function nfct_seqadj_ext_add().
Fixes: 41d73ec053 (netfilter: nf_conntrack: make sequence number adjustments usuable without NAT)
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Since commit 41d73ec053 (netfilter: nf_conntrack: make sequence
number adjustments usuable without NAT), the sequence number extension
is dynamically allocated.
Instead of dying, give a WARN splash, in case of wrong usage of the
seqadj code, e.g. when forgetting to allocate via nfct_seqadj_ext_add().
Wrong usage have been seen in the IPVS code path.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
net/netfilter/ipvs/ip_vs_sync.c: In function 'sync_thread_master':
net/netfilter/ipvs/ip_vs_sync.c:1640:8: warning: unused variable 'ret' [-Wunused-variable]
Commit 35a2af94c7 ("sched/wait: Make the
__wait_event*() interface more friendly") changed how the interruption
state is returned. However, sync_thread_master() ignores this state,
now causing a compile warning.
According to Julian Anastasov <ja@ssi.bg>, this behavior is OK:
"Yes, your patch looks ok to me. In the past we used ssleep() but IPVS
users were confused why IPVS threads increase the load average. So, we
switched to _interruptible calls and later the socket polling was
added."
Document this, as requested by Peter Zijlstra, to avoid precious developers
disappearing in this pitfall in the future.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
When we set burst to 1514 with low rate in userspace,
the kernel get a value of burst that less than 1514,
which doesn't work.
Because it may make some loss when transform burst
to buffer in userspace. This makes burst lose some
bytes, when the kernel transform the buffer back to
burst.
This patch adds two new attributes to support sending
burst/mtu to kernel directly to avoid the loss.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix checkpatch errors below:
ERROR: that open brace { should be on the previous line
ERROR: open brace '{' following function declarations go on the next line
ERROR: trailing statements should be on next line
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix checkpatch errors below:
ERROR: switch and case should be at the same inden
ERROR: code indent should use tabs where possible
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix checkpatch errors below:
ERROR: "(foo*)" should be "(foo *)"
ERROR: "foo * bar" should be "foo *bar"
ERROR: "foo* bar" should be "foo *bar"
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix checkpatch errors while the space is required or prohibited
to the "=,()++..."
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This cleanup the comments in tcp_yeah.c.
1.The old link is dead,use a new one to instead.
2.'lin' add nothing useful,remove it.
3.do not use C99 // comments.
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix checkpatch errors like:
ERROR: spaces required around that XXX
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the possibly more efficient ether_addr_equal
to instead of memcmp.
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John W. Linville <linville@tuxdriver.com>
Cc: David Miller <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is better to use batadv_compate_eth instead of memcpy for
concise style.
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <antonio@meshcoding.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This module shouldn't be randomly exporting symbols
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this plugin, user could specify IPComp tagged with certain
CPI that host not interested will be DROPped or any other action.
For example:
iptables -A INPUT -p 108 -m ipcomp --ipcompspi 0x87 -j DROP
ip6tables -A INPUT -p 108 -m ipcomp --ipcompspi 0x87 -j DROP
Then input IPComp packet with CPI equates 0x87 will not reach
upper layer anymore.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch add spaces to cleanup checkpatch errors.
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In switch() had do return, and never use the 'return NULL'. The
'break' after return or goto has no effect. Remove it.
v2: make it more readable as suggested by Neil.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes checkpath.pl:
WARNING: Prefer pr_warn(... to pr_warning(...
#447: FILE: ./wpan.c:447:
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We report different pmtu values back on the first write and on further
writes on an corked socket.
Also don't include the dst.header_len (respectively exthdrlen) as this
should already be dealt with by the interface mtu of the outgoing
(virtual) interface and policy of that interface should dictate if
fragmentation should happen.
Instead reduce the pmtu data by IP options as we do for IPv6. Make the
same changes for ip_append_data, where we did not care about options or
dst.header_len at all.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove the unnecessary cast.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to be able to see changes to proxy NDP status on a per
interface basis via netlink (analog to proxy_arp).
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use same field for both IPv4 (proxy_arp) and IPv6 (proxy_ndp)
so fix it before API is set to be a common name
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can use kfree_skb_list() instead of open coding it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde says:
====================
this is a pull request of three patches for net-next/master.
There is a patch by Oliver Hartkopp, to clean up the CAN gw code.
Alexander Shiyan adds device tree support to the mcp251x driver and a
patch by Ezequiel Garcia lets the ti_hecc driver compile on all ARM
platforms.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 93b36cf342 ("ipv6: support IPV6_PMTU_INTERFACE on sockets")
I made a horrible mistake to add ip6_sk_accept_pmtu to the generic
sctp_icmp_frag_needed path. This results in build warnings if IPv6 is
disabled which were luckily caught by Fengguang's kbuild bot. But it
also leads to a kernel panic IPv4 frag-needed packet is received.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit be286bafe1 ("can: gw: add a variable
limit for CAN frame routings") the detection of the frame routing has been
changed. The former solution required dev->header_ops to be unused (== NULL).
I missed to remove the obsolete checks in the original commit - so here it is.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Thanks to commits 41063e9 (ipv4: Early TCP socket demux) and 421b388
(udp: ipv4: Add udp early demux) it is now possible to parse UID and
GID socket info also for incoming TCP and UDP connections. Having
this info available, it is convenient to let NFQUEUE parse it in
order to improve and refine the traffic analysis in userspace.
Signed-off-by: Valentina Giusti <valentina.giusti@bmw-carit.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Compile error reported by Jim Davis on netdev.
ip6_sk_accept_pmtu() needs net/ip6_route.h
Signed-off-by: Scott Feldman <sfeldma@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem noticed a TCP_RR regression caused by TCP autocorking
on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be
right above RTT between hosts.
We can receive a ACK for a packet still in NIC TX ring buffer or in a
softnet completion queue.
Fix this by always pushing the skb if it is at the head of write queue.
Also, as TX completion is lockless, it's safer to perform sk_wmem_alloc
test after setting TSQ_THROTTLED.
erd:~# MIB="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
erd:~# ./netperf -H remote -t TCP_RR -- -o $MIB | tail -n 1
(repeat 3 times)
Before patch :
18,1049.87,41004,39631,6295.47
17,239.52,40804,48,2912.79
18,348.40,40877,54,3573.39
After patch :
18,22.84,4606,38,16.39
17,21.56,2871,36,13.51
17,22.46,2705,37,11.83
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: f54b311142 ("tcp: auto corking")
Signed-off-by: David S. Miller <davem@davemloft.net>
list_for_each_entry(a, &act_base, head) doesn't
exit with a = NULL if we reached the end of the list.
tcf_unregister_action(), tc_lookup_action_n() and tc_lookup_action()
need fixes.
Remove tc_lookup_action_id() as its unused and not worth 'fixing'
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 1f747c26c4 ("net_sched: convert tc_action_ops to use struct list_head")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
list_for_each_entry(t, &tcf_proto_base, head) doesn't
exit with t = NULL if we reached the end of the list.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 3627287463 ("net_sched: convert tcf_proto_ops to use struct
list_head")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes:
1) pass mask rather than size to tcf_hashinfo_init()
2) the cleanup should be in reversed order in mirred_cleanup_module()
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 369ba56787 ("net_sched: init struct tcf_hashinfo at register time")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In nft's nft_exthdr_eval() routine we process IPv6 extension header
through invoking ipv6_find_hdr(), but we call it with an uninitialized
offset variable that contains some stack value. In ipv6_find_hdr()
we then test if the value of offset != 0 and call skb_header_pointer()
on that offset in order to map struct ipv6hdr into it. Fix it up by
initializing offset to 0 as it was probably intended to be.
Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Useful to only set a particular range of the conntrack mark while
leaving exisiting parts of the value alone, e.g. when setting
conntrack marks via NFQUEUE.
Follows same scheme as MARK/CONNMARK targets, i.e. the mask defines
those bits that should be altered. No mask is equal to '~0', ie.
the old value is replaced by new one.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
All these users need an initial seed value for jhash, prandom is
perfectly fine. This avoids draining the entropy pool where
its not strictly required.
nfnetlink_log did not use the random value at all.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
That open brace { should be on the previous line.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spaces required before the open parenthesis '(', before the open
brace '{', after that ',' and around that '?/:'.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return is not a function, parentheses are not required.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because err is always negative, remove unnecessary condition
judgment.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check the return value of request_module during dccp_probe initialisation,
bail out if that call fails.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to net, ixgbe and e1000e.
David provides compiler fixes for e1000e.
Don provides a fix for ixgbe to resolve a compile warning.
John provides a fix to net where it is useful to be able to walk all
upper devices when bringing a device online where the RTNL lock is held.
In this case, it is safe to walk the all_adj_list because the RTNL lock is
used to protect the write side as well. This patch adds a check to see
if the RTNL lock is held before throwing a warning in
netdev_all_upper_get_next_dev_rcu().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Given we allocate memory for each cpu, we can do this
using NUMA affinities, instead of using NUMA policies
of the process changing flow_limit_cpu_bitmap value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2013-12-19
1) Use the user supplied policy index instead of a generated one
if present. From Fan Du.
2) Make xfrm migration namespace aware. From Fan Du.
3) Make the xfrm state and policy locks namespace aware. From Fan Du.
4) Remove ancient sleeping when the SA is in acquire state,
we now queue packets to the policy instead. This replaces the
sleeping code.
5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the
posibility to sleep. The sleeping code is gone, so remove it.
6) Check user specified spi for IPComp. Thr spi for IPcomp is only
16 bit wide, so check for a valid value. From Fan Du.
7) Export verify_userspi_info to check for valid user supplied spi ranges
with pfkey and netlink. From Fan Du.
8) RFC3173 states that if the total size of a compressed payload and the IPComp
header is not smaller than the size of the original payload, the IP datagram
must be sent in the original non-compressed form. These packets are dropped
by the inbound policy check because they are not transformed. Document the need
to set 'level use' for IPcomp to receive such packets anyway. From Fan Du.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_rt_copy only sets dst.from if ort has flag RTF_ADDRCONF and RTF_DEFAULT.
but the prefix routes which did get installed by hand locally can have an
expiration, and no any flag combination which can ensure a potential from
does never expire, so we should always set the new created dst's from.
This also fixes the new created dst is always expired since the ort, which
is created by RA, maybe has RTF_EXPIRES and RTF_ADDRCONF, but no RTF_DEFAULT.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It already has a NULL pointer check of rtab in qdisc_put_rtab().
Remove the check outside of qdisc_put_rtab().
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It already has a NULL pointer check of rtab in qdisc_put_rtab().
Remove the check outside of qdisc_put_rtab().
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].
In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...
rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);
... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:
r->id.idiag_src[0] = inet->inet_rcv_saddr;
r->id.idiag_dst[0] = inet->inet_daddr;
struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.
So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).
Similarly, fix also other places where we fill that out.
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the first size-based qdisc that attempts to
differentiate between small flows and heavy-hitters. The goal is to
catch the heavy-hitters and move them to a separate queue with less
priority so that bulk traffic does not affect the latency of critical
traffic. Currently "less priority" means less weight (2:1 in
particular) in a Weighted Deficit Round Robin (WDRR) scheduler.
In essence, this patch addresses the "delay-bloat" problem due to
bloated buffers. In some systems, large queues may be necessary for
obtaining CPU efficiency, or due to the presence of unresponsive
traffic like UDP, or just a large number of connections with each
having a small amount of outstanding traffic. In these circumstances,
HHF aims to reduce the HoL blocking for latency sensitive traffic,
while not impacting the queues built up by bulk traffic. HHF can also
be used in conjunction with other AQM mechanisms such as CoDel.
To capture heavy-hitters, we implement the "multi-stage filter" design
in the following paper:
C. Estan and G. Varghese, "New Directions in Traffic Measurement and
Accounting", in ACM SIGCOMM, 2002.
Some configurable qdisc settings through 'tc':
- hhf_reset_timeout: period to reset counter values in the multi-stage
filter (default 40ms)
- hhf_admit_bytes: threshold to classify heavy-hitters
(default 128KB)
- hhf_evict_timeout: threshold to evict idle heavy-hitters
(default 1s)
- hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
non-heavy-hitters (default 2)
- hh_flows_limit: max number of heavy-hitter flow entries
(default 2048)
Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
reflects the bandwidth of heavy-hitters that we attempt to capture
(25Mbps with the above default settings).
The false negative rate (heavy-hitter flows getting away unclassified)
is zero by the design of the multi-stage filter algorithm.
With 100 heavy-hitter flows, using four hashes and 4000 counters yields
a false positive rate (non-heavy-hitters mistakenly classified as
heavy-hitters) of less than 1e-4.
Signed-off-by: Terry Lam <vtlam@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing my changes for TSO support in SIT devices,
I was using sit0 tunnel which appears to include nopmtudisc flag.
But using :
ip tun add sittun mode sit remote $REMOTE_IPV4 local $LOCAL_IPV4 \
dev $IFACE
We get a tunnel which rejects too long packets because of the mtu check
which is not yet GSO aware.
erd:~# ip tunnel
sittun: ipv6/ip remote 10.246.17.84 local 10.246.17.83 ttl inherit 6rd-prefix 2002::/16
sit0: ipv6/ip remote any local any ttl 64 nopmtudisc 6rd-prefix 2002::/16
This patch is based on an excellent report from
Michal Shmidt.
In the future, we probably want to extend the MTU check to do the
right thing for GSO packets...
Fixes: ("61c1db7fae21 ipv6: sit: add GSO/TSO support")
Reported-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix spelling errors in bridge driver.
Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sockets marked with IPV6_PMTUDISC_PROBE (or later IPV6_PMTUDISC_INTERFACE)
don't respect this setting when the outgoing interface supports UFO.
We had the same problem in IPv4, which was fixed in commit
daba287b29 ("ipv4: fix DO and PROBE pmtu
mode regarding local fragmentation with UFO/CORK").
Also IPV6_DONTFRAG mode did not care about already corked data, thus
it may generate a fragmented frame even if this socket option was
specified. It also did not care about the length of the ipv6 header and
possible options.
In the error path allow the user to receive the pmtu notifications via
both, rxpmtu method or error queue. The user may opted in for both,
so deliver the notification to both error handlers (the handlers check
if the error needs to be enqueued).
Also report back consistent pmtu values when sending on an already
cork-appended socket.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipgre_header_parse() needs to parse the tunnel's ip header and it
uses mac_header to locate the iphdr. This got broken when gre tunneling
was refactored as mac_header is no longer updated to point to iphdr.
Introduce skb_pop_mac_header() helper to do the mac_header assignment
and use it in ipgre_rcv() to fix msg_name parsing.
Bug introduced in commit c544193214 (GRE: Refactor GRE tunneling code.)
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPV6_PMTU_INTERFACE is the same as IPV6_PMTU_PROBE for ipv6. Add it
nontheless for symmetry with IPv4 sockets. Also drop incoming MTU
information if this mode is enabled.
The additional bit in ipv6_pinfo just eats in the padding behind the
bitfield. There are no changes to the layout of the struct at all.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new mode discards all incoming fragmentation-needed notifications
as I guess was originally intended with this knob. To not break backward
compatibility too much, I only added a special case for mode 2 in the
receiving path.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The other field in ipv4_config, log_martians, was converted to a
per-interface setting, so we can just remove the whole structure.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/i40e/i40e_main.c
drivers/net/macvtap.c
Both minor merge hassles, simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't need to maintain our own singly linked list code.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't need to maintain our own singly linked list code.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we don't need to play with singly linked list,
and since the code is not on hot path, we can use spinlock
instead of rwlock.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks weird to store the lock out of the struct but
still points to a static variable. Just move them into the struct.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These information can be saved in tcf_exts, and this will
simplify the code.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently actions are chained by a singly linked list,
therefore it is a bit hard to add and remove a specific
entry. Convert it to struct list_head so that in the
latter patch we can remove an action without finding
its head.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not used.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This enables userspace to get VLAN TPID as well as the VLAN TCI.
Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
Explicitly defining and zeroing the gap of this makes additional changes
easier.
Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
We may add members to them until current aligned size without forcing
userspace to call getsockopt(..., PACKET_HDRLEN, ...).
Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is useful to be able to walk all upper devices when bringing
a device online where the RTNL lock is held. In this case it
is safe to walk the all_adj_list because the RTNL lock is used
to protect the write side as well.
This patch adds a check to see if the rtnl lock is held before
throwing a warning in netdev_all_upper_get_next_dev_rcu().
Also because we now have a call site for lockdep_rtnl_is_held()
outside COFIG_LOCK_PROVING an inline definition returning 1 is
needed. Similar to the rcu_read_lock_is_held().
Fixes: 2a47fa45d4 ("ixgbe: enable l2 forwarding acceleration for macvlans")
CC: Veaceslav Falico <vfalico@redhat.com>
Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Adds skb_copy_hash to copy rxhash and l4_rxhash from one skb to another.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In several places 'skb->rxhash = 0' is being done to clear the
rxhash value in an skb. This does not clear l4_rxhash which could
still be set so that the rxhash wouldn't be recalculated on subsequent
call to skb_get_rxhash. This patch adds an explict function to clear
all the rxhash related information in the skb properly.
skb_clear_hash_if_not_l4 clears the rxhash only if it is not marked as
l4_rxhash.
Fixed up places where 'skb->rxhash = 0' was being called.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neighbour code sends up an RTM_NEWNEIGH netlink notification if
the NUD state of a neighbour cache entry is changed by a timer (e.g.
from REACHABLE to STALE), even if the lladdr of the entry has not
changed.
But an administrative change to the the NUD state of a neighbour cache
entry that does not change the lladdr (e.g. via "ip -4 neigh change
... nud ...") does not trigger a netlink notification. This means
that netlink listeners will not hear about administrative NUD state
changes such as from a resolved state to PERMANENT.
This patch changes the neighbor code to generate an RTM_NEWNEIGH
message when the NUD state of an entry is changed administratively.
Signed-off-by: Bob Gilligan <gilligan@aristanetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch brings NUMA support and automatic fallback to vmalloc()
in case kmalloc() failed to allocate FQ hash table.
NUMA support depends on XPS being setup for the device before
qdisc allocation. After a XPS change, it might be worth creating
qdisc hierarchy again.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While investigating performance problems on small RPC workloads,
I noticed linux TCP stack was always splitting the last TSO skb
into two parts (skbs). One being a multiple of MSS, and a small one
with the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.
Example with request/response of 4K/4K
IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
We can avoid this split by including the Nagle tests at the right place.
Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workload already.
tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.
This patch tremendously improves performance, as the traffic now looks
like :
IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
Before :
lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774
Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
205719.049006 task-clock # 9.278 CPUs utilized
8,449,968 context-switches # 0.041 M/sec
1,935,997 CPU-migrations # 0.009 M/sec
160,541 page-faults # 0.780 K/sec
548,478,722,290 cycles # 2.666 GHz [83.20%]
455,240,670,857 stalled-cycles-frontend # 83.00% frontend cycles idle [83.48%]
272,881,454,275 stalled-cycles-backend # 49.75% backend cycles idle [66.73%]
166,091,460,030 instructions # 0.30 insns per cycle
# 2.74 stalled cycles per insn [83.39%]
29,150,229,399 branches # 141.699 M/sec [83.30%]
1,943,814,026 branch-misses # 6.67% of all branches [83.32%]
22.173517844 seconds time elapsed
lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests 16851063 0.0
IpExtOutOctets 23878580777 0.0
After patch :
lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877
Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
107496.071918 task-clock # 4.847 CPUs utilized
5,635,458 context-switches # 0.052 M/sec
1,374,707 CPU-migrations # 0.013 M/sec
160,920 page-faults # 0.001 M/sec
281,500,010,924 cycles # 2.619 GHz [83.28%]
228,865,069,307 stalled-cycles-frontend # 81.30% frontend cycles idle [83.38%]
142,462,742,658 stalled-cycles-backend # 50.61% backend cycles idle [66.81%]
95,227,712,566 instructions # 0.34 insns per cycle
# 2.40 stalled cycles per insn [83.43%]
16,209,868,171 branches # 150.795 M/sec [83.20%]
874,252,952 branch-misses # 5.39% of all branches [83.37%]
22.175821286 seconds time elapsed
lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests 11239428 0.0
IpExtOutOctets 23595191035 0.0
Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
scheduler.
Many thanks to Neal for review and ideas.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These function to manipulate multiple addresses are not used anywhere
in current net-next tree. Some out of tree code maybe using these but
too bad; they should submit their code upstream..
Also, make __hw_addr_flush local since only used by dev_addr_lists.c
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
Please pull this batch of updates for the 3.14 stream...
For the Bluetooth bits, Gustavo says:
"This is the first batch of patches intended for 3.14. There is
nothing big here. Most of the code are refactors, clean up, small
fixes, plus some new device id support."
And...
"More patches to 3.14. Here we have the support for Low Energy
Connection Oriented Channels (LE CoC). Basically, as the name says,
this adds supports for connection oriented channels in the same way
we already have them for BR/EDR connections so profiles/protocols
that work on top of BR/EDR can now work on LE plus a plenty of new
possibilities for LE."
For the ath10k bits, Kalle says:
"Janusz and Marek implemented DFS support to ath10k, but the code is
not enabled yet due to missing cfg80211/mac80211 patches (it will be
enabled in the next pull request). Michal did some device reset fixes
and made it possible for ath10k to share an interrupt with another
device. And lots of smaller fixes from different people."
For the iwlwifi bits, Emmanuel says:
"I have here a big rework of the rate control by Eyal. This is obviously
the biggest part of this batch.
I also have enhancement of protection flags by Avri and a few bits for
WoWLAN by Eliad and Luca. Johannes cleans up the debugfs plus a few
fixes. I provided a few things for Bluetooth coexistence.
Besides this we have an implementation for low priority scan."
Along with all that, there are big batches of updates to mwifiex and
ath9k, Jeff Kirsher's FSF address fix patches, and a handful of other
bits here and there.
Please let me know if there are problems!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains two Netfilter fixes for your net
tree, they are:
* Fix endianness in nft_reject, the NFTA_REJECT_TYPE netlink attributes
was not converted to network byte order as needed by all nfnetlink
subsystems, from Eric Leblond.
* Restrict SYNPROXY target to INPUT and FORWARD chains, this avoid a
possible crash due to misconfigurations, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is similar to the set_peek_off patch where calling bind while the
socket is stuck in unix_dgram_recvmsg() will block and cause a hung task
spew after a while.
This is also the last place that did a straightforward mutex_lock(), so
there shouldn't be any more of these patches.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using sk_dst_lock from softirq context is not supported right now.
Instead of adding BH protection everywhere,
udp_sk_rx_dst_set() can instead use xchg(), as suggested
by David.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 9750223102 ("udp: ipv4: must add synchronization in udp_sk_rx_dst_set()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently OVS uses jhash2() for calculating flow hashes in its
internal flow_hash() function. The performance of the flow_hash()
function is critical, as the input data can be hundreds of bytes
long.
OVS is largely deployed in x86_64 based datacenters. Therefore,
we argue that the performance critical fast path of OVS should
exploit underlying CPU features in order to reduce the per packet
processing costs. We replace jhash2 with the hash implementation
provided by the kernel hash lib, which exploits the crc32l
instruction to achieve high performance
Our patch greatly reduces the hash footprint from ~200 cycles of
jhash2() to around ~90 cycles in case of ovs_flow_hash_crc()
(measured with rdtsc over maximum length flow keys on an i7 Intel
CPU).
Additionally, we wrote a microbenchmark to stress the flow table
performance. The benchmark inserts random flows into the flow
hash and then performs lookups. Our hash deployed on a CRC32
capable CPU reduces the lookup for 1000 flows, 100 masks from
~10,100us to ~6,700us, for example.
Thus, simply use the newly introduced arch_fast_hash2() as a
drop-in replacement.
Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userspace can therefore know whether a table is in use or not, and
by how many chains. Suggested by Pablo Neira Ayuso.
Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The HCI User Channel is an admin operation which enforces CAP_NET_ADMIN
when binding the socket. Problem now is that it then requires also
CAP_NET_RAW when calling into hci_sock_sendmsg. This is not intended
and just an oversight since general HCI sockets (which do not require
special permission to bind) and HCI User Channel share the same code
path here.
Remove the extra CAP_NET_RAW check for HCI User Channel write operation
since the permission check has already been enforced when binding the
socket. This also makes it possible to open HCI User Channel from a
privileged process and then hand the file descriptor to an unprivilged
process.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Tested-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
when I modprobe sctp_probe, it failed with "FATAL: ". I found that
sctp should load before sctp_probe register jprobe. So I add a
sctp_setup_jprobe for loading 'sctp' when first failed to register
jprobe, just do this similar to dccp_probe.
v2: add MODULE_SOFTDEP and check of request_module, as suggested by Neil
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of reaquiring the socket lock and taking the normal exit
path when a connection times out, we bail out early with a
return -ETIMEDOUT.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As warned by checkpatch.pl, use #include <linux/uaccess.h>
instead of <asm/uaccess.h>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove a number of needless 'goto exit' in send_stream
when the socket is in an unconnected state.
This patch is cosmetic and does not alter the operation of
TIPC in any way.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We remove a number of unnecessary variables and branches
in TIPC. This patch is cosmetic and does not change the
operation of TIPC in any way.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add VHT MCS/NSS set support for nl80211_set_tx_bitrate_mask().
This should be used mainly for test purpose, to check
different MCS/NSS VHT combinations.
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow to read management keys stored in a station's gtk key
array with a get_key function.
Signed-off-by: Max Stepanov <Max.Stepanov@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Verify that a pairwise key index value on ieee80211_get_key call
doesn't exceed the boundaries of the pairwise key array.
Signed-off-by: Max Stepanov <Max.Stepanov@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The return value of ieee80211_ibss_csa_beacon is not aligned with the
return value of ieee80211_assign_beacon(). For consistency and to be
able to use both functions with similar code, change
ieee80211_ibss_csa_beacon() not to send the bss changed notification
itself, but return what has changed so the caller can send the
notification instead.
Tested by: Simon Wunderlich <sw@simonwunderlich.de>
Acked by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Refactor ieee80211_ibss_process_chanswitch() to use
ieee80211_channel_switch() and avoid code duplication.
Tested by: Simon Wunderlich <sw@simonwunderlich.de>
Acked by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This regression was introduced in "mac80211: cache mesh
beacon".
mesh_sync_offset_adjust_tbtt() was assuming that the
beacon would be rebuilt in every single pre-tbtt
interrupt, but now the beacon update happens on the
workqueue, and it must be ready for immediate delivery to
the driver.
Save a pointer to the meshconf IE in the beacon_data (this
works because both the IE pointer and beacon buffer are
protected by the same rcu_{dereference,assign_pointer}())
for quick updates during pre-tbtt. This is faster and a
little prettier than iterating over the elements to find
the meshconf IE every time.
Signed-off-by: Thomas Pedersen <thomas@cozybit.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Keep the sched scan req when starting sched scan, and reschedule
it in case of HW restart during sched scan.
The upper layer don't have to know about the restart.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This function is not used anywhere else than in cfg.c, so there's no
need to export it.
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We call ieee80211_ibss_disconnect(), which requires sdata to be
locked, so lock the sdata during ieee80211_csa_connection_drop_work().
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order to check against valid IPcomp spi range, export verify_userspi_info
for both pfkey and netlink interface.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
IPComp connection between two hosts is broken if given spi bigger
than 0xffff.
OUTSPI=0x87
INSPI=0x11112
ip xfrm policy update dst 192.168.1.101 src 192.168.1.109 dir out action allow \
tmpl dst 192.168.1.101 src 192.168.1.109 proto comp spi $OUTSPI
ip xfrm policy update src 192.168.1.101 dst 192.168.1.109 dir in action allow \
tmpl src 192.168.1.101 dst 192.168.1.109 proto comp spi $INSPI
ip xfrm state add src 192.168.1.101 dst 192.168.1.109 proto comp spi $INSPI \
comp deflate
ip xfrm state add dst 192.168.1.101 src 192.168.1.109 proto comp spi $OUTSPI \
comp deflate
tcpdump can capture outbound ping packet, but inbound packet is
dropped with XfrmOutNoStates errors. It looks like spi value used
for IPComp is expected to be 16bits wide only.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Can be used to add extra IEs (such as P2P NoA) without having to
reallocate the buffer.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Evan Huus found (by fuzzing in wireshark) that the radiotap
iterator code can access beyond the length of the buffer if
the first bitmap claims an extension but then there's no
data at all. Fix this.
Cc: stable@vger.kernel.org
Reported-by: Evan Huus <eapache@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the AP interface is stopped, free all AP and VLAN keys at
once to only require synchronize_net() once. Since that does
synchronize_net(), also move two such calls into the function
(using the new force_synchronize parameter) to avoid doing it
twice.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Teach sta_info_flush() to optionally also remove stations
from all VLANs associated with an AP interface to optimise
the station removal (in particular, synchronize_net().)
To not have to add the vlans argument throughout, do some
refactoring.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no reason to have one synchronize_net() for each
removed station, refactor the code slightly to have just
a single synchronize_net() for all stations.
Note that this is currently useless as hostapd removes
stations one by one and this coalescing never happens.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no reason to do this inside the sta key removal
since the keys can only be reached through the sta (and
not by the driver at all) so once the sta can no longer
be reached, the keys are safe.
This will allow further optimisation opportunities with
multiple stations.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we can assume that stations are never referenced by the
driver after sta_state returns (and this is true since the
previous iwlmvm patch and for all other drivers) then we
don't need to delay station destruction, and don't need to
play tricks with rcu_barrier() etc.
This should speed up some scenarios like hostapd shutdown.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The pointer should be cleared before synchronize_rcu() so that the
consequently dead station won't be found by any lookups in the TX
or RX paths.
Also check that the station is actually the one being removed, the
check is not needed because each 4-addr VLAN can only have a single
station and non-4-addr VLANs always have a NULL pointer there, but
the code is clearer this way (and we avoid the memory write.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>