In case of memory pressure, mptcp_sendmsg() may call
sk_stream_wait_memory() after succesfully xmitting some
bytes. If the latter fails we currently return to the
user-space the error code, ignoring the succeful xmit.
Address the issue always checking for the xmitted bytes
before mptcp_sendmsg() completes.
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A later patch will refuse to set the action of certain traps in mlxsw
and also to change the policer binding of certain groups. Pass extack so
that failure could be communicated clearly to user space.
Reviewed-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the packet trap that can report packets that were ECN marked due to RED
AQM.
Signed-off-by: Amit Cohen <amitc@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/fib_rules.c:26:7: warning: "CONFIG_IP_MULTIPLE_TABLES" is not defined, evaluates to 0 [-Wundef]
#elif CONFIG_IP_MULTIPLE_TABLES
^~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 8b66a6fd34 ("fib: fix another fib_rules_ops indirect call wrapper problem")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-By: Brian Vazquez <brianvv@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use mptcp_for_each_subflow in mptcp_stream_accept instead of
open-coding.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* handle new SAE (WPA3 authentication) status codes in the correct way
* fix a while that should be an if instead, avoiding infinite loops
* handle beacon filtering changing better
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl8n/fgACgkQB8qZga/f
l8TiOA/+PwGoXiP7WPulI+V8Bzv/Tmg6rqKEUFX1hxaGQOAlr9hqV2jvdxkOZmIM
sQc19vT4TUMiwrdYHOKcKlY5jhi5r8aIzxSA98mGqqKsvERhJWuzoaPR5aHuqOUD
VxUidOieV+H4Ol1YxV1Tx4eCmLR8ZPjg9EebMGNPOrASg7JXCjVXb5Pp8XXmsuDn
uXdPc9QQ+DMLqCEF+1C+jhH2lDUfyk78AgPuXw7E3CP/copiKljKL7AApoFebpkM
Cu0g6hYLBDAnkuZgL7PBw/MQThN5oMLDeX8RNK0uXDy/kdnXd7RL0IMcQpmwTn81
ZeewZjYq/Ha64sscR87kbzzFM15Vo/UhyhDfqQki/2zrQ26GoQdtgbj/OB3CMMd/
8kBtZPoItlIkpO18kN9gBO/nB0cGghtsLn8u8pKdQ5RiIv1Q4JqerN4rHCyMgLrk
FeOH25feAtuJ5MlOkIuHDYTfxNZnw1DE6DQv0hZuui1dbMPHUZuMFrCgZscuVBfj
HOzhcyvs2bdCU4ZKLGiiq1zTxPply9rESmk4tvlDo5strqCKUEez8VgtZIPw1LZp
pD43UHAfRiBEAF1pO/l1E6LyTvHMHYFKMiqzwCEXZXf2L276hlQ1rYoHeWEP0hhO
N03gwHiMs1xug6MeYhsdeyR0tqXb1N7XwnkpSwgtVOH2z+pCBx4=
=9iOZ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
A few more changes, notably:
* handle new SAE (WPA3 authentication) status codes in the correct way
* fix a while that should be an if instead, avoiding infinite loops
* handle beacon filtering changing better
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the function seg6_lwt_headroom out of the seg6_iptunnel.h uapi
header, because it is only used in seg6_iptunnel.c. Moreover, it is only
used in the kernel code, as indicated by the "#ifdef __KERNEL__".
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Ioana-Ruxandra Stăncioi <stancioi@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For retransmitted packets, TCP needs to resort to using TCP timestamps
for computing RTT samples. In the common case where the data and ACK
fall in the same 1-millisecond interval, TCP senders with millisecond-
granularity TCP timestamps compute a ca_rtt_us of 0. This ca_rtt_us
of 0 propagates to rs->rtt_us.
This value of 0 can cause performance problems for congestion control
modules. For example, in BBR, the zero min_rtt sample can bring the
min_rtt and BDP estimate down to 0, reduce snd_cwnd and result in a
low throughput. It would be hard to mitigate this with filtering in
the congestion control module, because the proper floor to apply would
depend on the method of RTT sampling (using timestamp options or
internally-saved transmission timestamps).
This fix applies a floor of 1 for the RTT sample delta from TCP
timestamps, so that seq_rtt_us, ca_rtt_us, and rs->rtt_us will be at
least 1 * (USEC_PER_SEC / TCP_TS_HZ).
Note that the receiver RTT computation in tcp_rcv_rtt_measure() and
min_rtt computation in tcp_update_rtt_min() both already apply a floor
of 1 timestamp tick, so this commit makes the code more consistent in
avoiding this edge case of a value of 0.
Signed-off-by: Jianfeng Wang <jfwang@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using is_broadcast_ether_addr() instead of directly use
memcmp() to determine if the ethernet address is broadcast
address.
spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)
Signed-off-by: Huang Guobin <huangguobin4@huawei.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) UAF in chain binding support from previous batch, from Dan Carpenter.
2) Queue up delayed work to expire connections with no destination,
from Andrew Sy Kim.
3) Use fallthrough pseudo-keyword, from Gustavo A. R. Silva.
4) Replace HTTP links with HTTPS, from Alexander A. Klimov.
5) Remove superfluous null header checks in ip6tables, from
Gaurav Singh.
6) Add extended netlink error reporting for expression.
7) Report EEXIST on overlapping chain, set elements and flowtable
devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a missing return statement to atalk_proc_init so it doesn't return
-ENOMEM when successful. This allows the appletalk module to load
properly.
Fixes: e2bcd8b0ce ("appletalk: use remove_proc_subtree to simplify procfs code")
Link: https://www.downtowndougbrown.com/2020/08/hacking-up-a-fix-for-the-broken-appletalk-kernel-module-in-linux-5-1-and-newer/
Reported-by: Christopher KOBAYASHI <chris@disavowed.jp>
Reported-by: Doug Brown <doug@downtowndougbrown.com>
Signed-off-by: Vincent Duvert <vincent.ldev@duvert.net>
[lukas: add missing tags]
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v5.1+
Cc: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we don't care about vlan depth, we could pass NULL instead of the
address of a unused local variable to skb_network_protocol() as a param.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In fact, skb_pagelen() - skb_headlen() is equal to __skb_pagelen(), use it
directly to avoid unnecessary skb_headlen() call.
Also fix the CHECK note of checkpatch.pl:
Comparison to NULL could be written "!__pskb_pull_tail"
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GRE tunnel can be used to transport traffic that does not rely on a
Internet checksum (e.g. SCTP). The issue can be triggered creating a GRE
or GRETAP tunnel and transmitting SCTP traffic ontop of it where CRC
offload has been disabled. In order to fix the issue we need to
recompute the GRE csum in gre_gso_segment() not relying on the inner
checksum.
The issue is still present when we have the CRC offload enabled.
In this case we need to disable the CRC offload if we require GRE
checksum since otherwise skb_checksum() will report a wrong value.
Fixes: 90017accff ("sctp: Add GSO support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to clear all of the bridge private skb variables as they can be
stale due to the packet being recirculated through the stack and then
transmitted through the bridge device. Similar memset is already done on
bridge's input. We've seen cases where proxyarp_replied was 1 on routed
multicast packets transmitted through the bridge to ports with neigh
suppress which were getting dropped. Same thing can in theory happen with
the port isolation bit as well.
Fixes: 821f1b21ca ("bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"how" was used as a boolean. Change the type to bool, and improve
variable name
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Second parameter of addrconf_ifdown "how" is used as a boolean
internally. It does not make sense to call it with something different
of 0 or 1.
This value is set to 2 in all git history.
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the masks cache size configurable, or with
a size of 0, disable it.
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a counter that counts the number of masks cache hits, and
export it through the megaflow netlink statistics.
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
info cannot be NULL here since its being accessed earlier
in the function: nlmsg_parse(info->nlhdr...). Remove this
redundant NULL check.
Signed-off-by: Gaurav Singh <gaurav1086@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
ovs_ct_put_key() is potentially copying uninitialized kernel stack memory
into socket buffers, since the compiler may leave a 3-byte hole at the end
of `struct ovs_key_ct_tuple_ipv4` and `struct ovs_key_ct_tuple_ipv6`. Fix
it by initializing `orig` with memset().
Fixes: 9dd7f8907c ("openvswitch: Add original direction conntrack tuple to sw_flow_key.")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When openvswitch conntrack offload with act_ct action. Fragment packets
defrag in the ingress tc act_ct action and miss the next chain. Then the
packet pass to the openvswitch datapath without the mru. The over
mtu packet will be dropped in output action in openvswitch for over mtu.
"kernel: net2: dropped over-mtu packet: 1528 > 1500"
This patch add mru in the tc_skb_ext for adefrag and miss next chain
situation. And also add mru in the qdisc_skb_cb. The act_ct set the mru
to the qdisc_skb_cb when the packet defrag. And When the chain miss,
The mru is set to tc_skb_ext which can be got by ovs datapath.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mac80211.h says: Beacon filter support is advertised with the
IEEE80211_VIF_BEACON_FILTER interface capability. The driver needs to
enable beacon filter support whenever power save is enabled, that is
IEEE80211_CONF_PS is set. When power save is enabled, the stack will
not check for beacon loss and the driver needs to notify about loss
of beacons with ieee80211_beacon_loss().
Some controllers may want to dynamically enable the beacon filter
capabilities on power save entry (CONF_PS) and disable it on exit.
This is the case for the wcn36xx driver which only supports beacon
filtering in PS mode (no CONNECTION_MONITOR support).
When the mac80211 beacon monitor timer expires, the beacon filter
flag must be checked again in case it as been changed in between
(e.g. vif moved to PS mode).
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Link: https://lore.kernel.org/r/1592471863-31402-1-git-send-email-loic.poulain@linaro.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Sparse showed up with the following error.
net/mac80211/agg-rx.c:480:43: warning: Using plain integer as NULL pointer
Fixes: 2ab4587675 (mac80211: add support for the ADDBA extension element)
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200803084540.179908-1-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
SAE authentication has been extended with H2E (IEEE 802.11 REVmd) and PK
(WFA) options. Those extensions use special status code values in the
SAE commit messages (Authentication frame with transaction sequence
number 1) to identify which extension is in use. mac80211 was
interpreting those new values as the AP denying authentication and that
resulted in failure to complete SAE authentication in some cases.
Fix this by adding exceptions for the new status code values 126 and
127.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200731183830.18735-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Replace EBUSY by EEXIST in the following cases:
- If the user adds a chain with a different configuration such as different
type, hook and priority.
- If the user adds a non-base chain that clashes with an existing basechain.
- If the user adds a { key : value } mapping element and the key exists
but the value differs.
- If the device already belongs to an existing flowtable.
User describe that this error reporting is confusing:
- https://bugzilla.netfilter.org/show_bug.cgi?id=1176
- https://bugzilla.netfilter.org/show_bug.cgi?id=1413
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch extends 36dd1bcc07 ("netfilter: nf_tables: initial support
for extended ACK reporting") to include netlink extended error reporting
for expressions. This allows userspace to identify what rule expression
is triggering the error.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking fixes from David Miller:
1) Encap offset calculation is incorrect in esp6, from Sabrina Dubroca.
2) Better parameter validation in pfkey_dump(), from Mark Salyzyn.
3) Fix several clang issues on powerpc in selftests, from Tanner Love.
4) cmsghdr_from_user_compat_to_kern() uses the wrong length, from Al
Viro.
5) Out of bounds access in mlx5e driver, from Raed Salem.
6) Fix transfer buffer memleak in lan78xx, from Johan Havold.
7) RCU fixups in rhashtable, from Herbert Xu.
8) Fix ipv6 nexthop refcnt leak, from Xiyu Yang.
9) vxlan FDB dump must be done under RCU, from Ido Schimmel.
10) Fix use after free in mlxsw, from Ido Schimmel.
11) Fix map leak in HASH_OF_MAPS bpf code, from Andrii Nakryiko.
12) Fix bug in mac80211 Tx ack status reporting, from Vasanthakumar
Thiagarajan.
13) Fix memory leaks in IPV6_ADDRFORM code, from Cong Wang.
14) Fix bpf program reference count leaks in mlx5 during
mlx5e_alloc_rq(), from Xin Xiong.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits)
vxlan: fix memleak of fdb
rds: Prevent kernel-infoleak in rds_notify_queue_get()
net/sched: The error lable position is corrected in ct_init_module
net/mlx5e: fix bpf_prog reference count leaks in mlx5e_alloc_rq
net/mlx5e: E-Switch, Specify flow_source for rule with no in_port
net/mlx5e: E-Switch, Add misc bit when misc fields changed for mirroring
net/mlx5e: CT: Support restore ipv6 tunnel
net: gemini: Fix missing clk_disable_unprepare() in error path of gemini_ethernet_port_probe()
ionic: unlock queue mutex in error path
atm: fix atm_dev refcnt leaks in atmtcp_remove_persistent
net: ethernet: mtk_eth_soc: fix MTU warnings
net: nixge: fix potential memory leak in nixge_probe()
devlink: ignore -EOPNOTSUPP errors on dumpit
rxrpc: Fix race between recvmsg and sendmsg on immediate call failure
MAINTAINERS: Replace Thor Thayer as Altera Triple Speed Ethernet maintainer
selftests/bpf: fix netdevsim trap_flow_action_cookie read
ipv6: fix memory leaks on IPV6_ADDRFORM path
net/bpfilter: Initialize pos in __bpfilter_process_sockopt
igb: reinit_locked() should be called with rtnl_lock
e1000e: continue to init PHY even when failed to disable ULP
...
kernel test robot says:
net/mptcp/syncookies.c: In function 'mptcp_join_cookie_init':
include/linux/kernel.h:47:38: warning: division by zero [-Wdiv-by-zero]
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
I forgot that spinock_t size is 0 on UP, so ARRAY_SIZE cannot be used.
Fixes: 9466a1cceb ("mptcp: enable JOIN requests even if cookies are in use")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out that on commit 41d707b733 ("fib: fix fib_rules_ops
indirect calls wrappers") I forgot to include the case when
CONFIG_IP_MULTIPLE_TABLES is not set.
Fixes: 41d707b733 ("fib: fix fib_rules_ops indirect calls wrappers")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes these errors:
net/ipv4/syncookies.c: In function 'tcp_get_cookie_sock':
net/ipv4/syncookies.c:216:19: error: 'struct tcp_request_sock' has no
member named 'drop_req'
216 | if (tcp_rsk(req)->drop_req) {
| ^~
net/ipv4/syncookies.c: In function 'cookie_tcp_reqsk_alloc':
net/ipv4/syncookies.c:289:27: warning: unused variable 'treq'
[-Wunused-variable]
289 | struct tcp_request_sock *treq;
| ^~~~
make[3]: *** [scripts/Makefile.build:280: net/ipv4/syncookies.o] Error 1
make[3]: *** Waiting for unfinished jobs....
Fixes: 9466a1cceb ("mptcp: enable JOIN requests even if cookies are in use")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
* code cleanups and fixups as usual
* AQL & internal TXQ improvements from Felix
* some mesh 802.1X support bits
* some injection improvements from Mathy of KRACK
fame, so we'll see what this results in ;-)
* some more initial S1G supports bits, this time
(some of?) the userspace APIs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl8kTFUACgkQB8qZga/f
l8SQdQ/+MfiHePQzCFcnlXCzBa6wwaPQz5XbbWMS+1yVN5WECHomtF8Hb9YA5ITd
Iw6HXcDzYk1bt1EqIPAQNQ0t2VfEvC+HtQKxL1LAXKYrlpD4JoL7hU2FK6CPA6/e
h8qwrrhOeexeen+0r6LEslxFuVKSbXuBEIszsWkQNmUOoCFWdvbyD7kPNfoQ/wcu
1TYR51Z7TiiyadptP5R0PSxlid1N1qSbnQ9Rq4ow4lfEwCfNb0ksv3eb6paQOeGX
b5MIDpZ9FHVhWAMjzt7b7KgbmPreoHbZ+bSv0YgR+g32yLIpcH3wRuPH/eTIV1uG
xTQtvHjnOtF7HTJzAzaCY4RH0ozf0rwTVwtcUxl/qVcV/JYXLKXnTOrAidj06gTr
Ic5eaUaFALRX82wnmQbjkIezolmdQXom2VEg1yPT1azXnhi/bWQlDW3XcSMAfd8o
E+HiTaGo8jD+tll1kvh5sLokecnmnKFgo5dPqlVOQ5qbJKOiLC/3g7/yjSdCKauG
COQN0b1t+lHuvu8I2tdp+S1YI1sCES6OdHGKdQQiem5D66mUBaiYN+YuqGnvRF75
8ZU9H5Mn9YRT2n6KEGsVfSnrOeqN15ZTRKepkwYgRzPtAGmVY+iNWwhE6I/clQgQ
pt9jzq4jZsFMSO1flBYrFUy+fB4IYYki+VBSk5RFq3Yr8ElN4J0=
=FR6T
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2020-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
We have a number of changes
* code cleanups and fixups as usual
* AQL & internal TXQ improvements from Felix
* some mesh 802.1X support bits
* some injection improvements from Mathy of KRACK
fame, so we'll see what this results in ;-)
* some more initial S1G supports bits, this time
(some of?) the userspace APIs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev protodown is a mechanism that allows protocols to
hold an interface down. It was initially introduced in
the kernel to hold links down by a multihoming protocol.
There was also an attempt to introduce protodown
reason at the time but was rejected. protodown and protodown reason
is supported by almost every switching and routing platform.
It was ok for a while to live without a protodown reason.
But, its become more critical now given more than
one protocol may need to keep a link down on a system
at the same time. eg: vrrp peer node, port security,
multihoming protocol. Its common for Network operators and
protocol developers to look for such a reason on a networking
box (Its also known as errDisable by most networking operators)
This patch adds support for link protodown reason
attribute. There are two ways to maintain protodown
reasons.
(a) enumerate every possible reason code in kernel
- A protocol developer has to make a request and
have that appear in a certain kernel version
(b) provide the bits in the kernel, and allow user-space
(sysadmin or NOS distributions) to manage the bit-to-reasonname
map.
- This makes extending reason codes easier (kind of like
the iproute2 table to vrf-name map /etc/iproute2/rt_tables.d/)
This patch takes approach (b).
a few things about the patch:
- It treats the protodown reason bits as counter to indicate
active protodown users
- Since protodown attribute is already an exposed UAPI,
the reason is not enforced on a protodown set. Its a no-op
if not used.
the patch follows the below algorithm:
- presence of reason bits set indicates protodown
is in use
- user can set protodown and protodown reason in a
single or multiple setlink operations
- setlink operation to clear protodown, will return -EBUSY
if there are active protodown reason bits
- reason is not included in link dumps if not used
example with patched iproute2:
$cat /etc/iproute2/protodown_reasons.d/r.conf
0 mlag
1 evpn
2 vrrp
3 psecurity
$ip link set dev vxlan0 protodown on protodown_reason vrrp on
$ip link set dev vxlan0 protodown_reason mlag on
$ip link show
14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp>
$ip link set dev vxlan0 protodown_reason mlag off
$ip link set dev vxlan0 protodown off protodown_reason vrrp off
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2020-07-31
The following pull-request contains BPF updates for your *net* tree.
We've added 5 non-merge commits during the last 21 day(s) which contain
a total of 5 files changed, 126 insertions(+), 18 deletions(-).
The main changes are:
1) Fix a map element leak in HASH_OF_MAPS map type, from Andrii Nakryiko.
2) Fix a NULL pointer dereference in __btf_resolve_helper_id() when no
btf_vmlinux is available, from Peilin Ye.
3) Init pos variable in __bpfilter_process_sockopt(), from Christoph Hellwig.
4) Fix a cgroup sockopt verifier test by specifying expected attach type,
from Jean-Philippe Brucker.
Note that when net gets merged into net-next later on, there is a small
merge conflict in kernel/bpf/btf.c between commit 5b801dfb7f ("bpf: Fix
NULL pointer dereference in __btf_resolve_helper_id()") from the bpf tree
and commit 138b9a0511 ("bpf: Remove btf_id helpers resolving") from the
net-next tree.
Resolve as follows: remove the old hunk with the __btf_resolve_helper_id()
function. Change the btf_resolve_helper_id() so it actually tests for a
NULL btf_vmlinux and bails out:
int btf_resolve_helper_id(struct bpf_verifier_log *log,
const struct bpf_func_proto *fn, int arg)
{
int id;
if (fn->arg_type[arg] != ARG_PTR_TO_BTF_ID || !btf_vmlinux)
return -EINVAL;
id = fn->btf_id[arg];
if (!id || id > btf_vmlinux->nr_types)
return -EINVAL;
return id;
}
Let me know if you run into any others issues (CC'ing Jiri Olsa so he's in
the loop with regards to merge conflict resolution).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2020-07-31
1) Fix policy matching with mark and mask on userspace interfaces.
From Xin Long.
2) Several fixes for the new ESP in TCP encapsulation.
From Sabrina Dubroca.
3) Fix crash when the hold queue is used. The assumption that
xdst->path and dst->child are not a NULL pointer only if dst->xfrm
is not a NULL pointer is true with the exception of using the
hold queue. Fix this by checking for hold queue usage before
dereferencing xdst->path or dst->child.
4) Validate pfkey_dump parameter before sending them.
From Mark Salyzyn.
5) Fix the location of the transport header with ESP in UDPv6
encapsulation. From Sabrina Dubroca.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS that reports
the earliest departure time(EDT) of the timestamped skb. By tracking EDT
values of the skb from different timestamps, we can observe when and how
much the value changed. This allows to measure the precise delay
injected on the sender host e.g. by a bpf-base throttler.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
JOIN requests do not work in syncookie mode -- for HMAC validation, the
peers nonce and the mptcp token (to obtain the desired connection socket
the join is for) are required, but this information is only present in the
initial syn.
So either we need to drop all JOIN requests once a listening socket enters
syncookie mode, or we need to store enough state to reconstruct the request
socket later.
This adds a state table (1024 entries) to store the data present in the
MP_JOIN syn request and the random nonce used for the cookie syn/ack.
When a MP_JOIN ACK passed cookie validation, the table is consulted
to rebuild the request socket from it.
An alternate approach would be to "cancel" syn-cookie mode and force
MP_JOIN to always use a syn queue entry.
However, doing so brings the backlog over the configured queue limit.
v2: use req->syncookie, not (removed) want_cookie arg
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If SYN packet contains MP_CAPABLE option, keep it enabled.
Syncokie validation and cookie-based socket creation is changed to
instantiate an mptcp request sockets if the ACK contains an MPTCP
connection request.
Rather than extend both cookie_v4/6_check, add a common helper to create
the (mp)tcp request socket.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will be used to initialize the mptcp request socket when a MP_CAPABLE
request was handled in syncookie mode, i.e. when a TCP ACK containing a
MP_CAPABLE option is a valid syncookie value.
Normally (non-cookie case), MPTCP will generate a unique 32 bit connection
ID and stores it in the MPTCP token storage to be able to retrieve the
mptcp socket for subflow joining.
In syncookie case, we do not want to store any state, so just generate the
unique ID and use it in the reply.
This means there is a small window where another connection could generate
the same token.
When Cookie ACK comes back, we check that the token has not been registered
in the mean time. If it was, the connection needs to fall back to TCP.
Changes in v2:
- use req->syncookie instead of passing 'want_cookie' arg to ->init_req()
(Eric Dumazet)
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syncookie code path needs to create an mptcp request sock.
Prepare for this and add mptcp prefix plus needed export of ops struct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When syncookie support is added, we will need to add a variant of
subflow_init_req() helper. It will do almost same thing except
that it will not compute/add a token to the mptcp token tree.
To avoid excess copy&paste, this commit splits away part of the
code into a new helper, __subflow_init_req, that can then be re-used
from the 'no insert' function added in a followup change.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once syncookie support is added, no state will be stored anymore when the
syn/ack is generated in syncookie mode.
When the ACK comes back, the generated key will be taken from the TCP ACK,
the token is re-generated and inserted into the token tree.
This means we can't retry with a new key when the token is already taken
in the syncookie case.
Therefore, move the retry logic to the caller to prepare for syncookie
support in mptcp.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nowadays output function has a 'synack_type' argument that tells us when
the syn/ack is emitted via syncookies.
The request already tells us when timestamps are supported, so check
both to detect special timestamp for tcp option encoding is needed.
We could remove cookie_ts altogether, but a followup patch would
otherwise need to adjust function signatures to pass 'want_cookie' to
mptcp core.
This way, the 'existing' bit can be used.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
rds_notify_queue_get() is potentially copying uninitialized kernel stack
memory to userspace since the compiler may leave a 4-byte hole at the end
of `cmsg`.
In 2016 we tried to fix this issue by doing `= { 0 };` on `cmsg`, which
unfortunately does not always initialize that 4-byte hole. Fix it by using
memset() instead.
Cc: stable@vger.kernel.org
Fixes: f037590fff ("rds: fix a leak of kernel memory")
Fixes: bdbe6fbc6a ("RDS: recv.c")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of the struct_size() helper, in multiple places, instead
of an open-coded version in order to avoid any potential type
mistakes and protect against potential integer overflows.
Also, remove unnecessary object identifier size.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to %pM instead of using custom code.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>