This reverts commit e80661dff2.
It was originally reverted as it broke the abi but can now be safely
brought back.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8cd3d50680e6493d7aa4e714e3b6f343bda2dc9d
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmKErdYACgkQONu9yGCS
aT4LIg//bqU/qlVBJq3v3QCfN93SOXwECuPXi0e8p3aG/ptn1pzV0A5+rDD4TXeU
mpw8oWSNQC0Wn8mag9poCvLFaSHuMIo2wPoPOwpRyb0BfhXCaWE4s1HbX49dNi2l
9V76KC4JH69zaHtEmnBy83zzr4nyRG6q3bTd1hD3uJaSg6+2E+DfIK7MUSAyhyPU
DSi2CSHgxZkBEJIVncCPPhz4HX7RJIav51EwAZlnqW6eX3osEe9Qh1b82qcatXIF
rC/w74mQoRnghL0FRwQ5Z4mTEcgBI3lYoKJMkt8iZgOm9nBq1F62gXysKojmjDYc
AvP97Jt+hF9hDFUfR2wW/sA/rsRhw1mx0LoFu96fc15Nxiyh8b/MFD+b3a2CqtN9
szggwoezx90RbGOsc1zNhoX3VmZ2xMeVyAVg4RCscxXiO5FR1DqV4mUlbXoQx9ah
I9+GYtc0vs9BP9pucK2kx7fzJ+eW93jIZVXHuh6mcspu7sdTBQam+ERH1gMIyCki
OXcg0HCTLQi26MNMkrdX+RebwHxB7tbLXUXqYzQKEWjsLtCyv7LYmsRV+RZSyZ1p
vqKs4gTjLPmcINebsT80mYZSH560lXkjl6Z0H5U23NQO8RG3QewrKZkHqwOn5zyQ
aF5fnmVTVcMbT8BK3qCadr00Giz5usnw8R9am5EY1ELzo48207A=
=35Fl
-----END PGP SIGNATURE-----
Merge 5.15.41 into android13-5.15
Changes in 5.15.41
batman-adv: Don't skb_split skbuffs with frag_list
iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
hwmon: (tmp401) Add OF device ID table
mac80211: Reset MBSSID parameters upon connection
net: Fix features skip in for_each_netdev_feature()
net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted
net: mscc: ocelot: fix VCAP IS2 filters matching on both lookups
net: mscc: ocelot: restrict tc-trap actions to VCAP IS2 lookup 0
net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters
fbdev: simplefb: Cleanup fb_info in .fb_destroy rather than .remove
fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove
fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
platform/surface: aggregator: Fix initialization order when compiling as builtin module
ice: Fix race during aux device (un)plugging
ice: fix PTP stale Tx timestamps cleanup
ipv4: drop dst in multicast routing path
drm/nouveau: Fix a potential theorical leak in nouveau_get_backlight_name()
netlink: do not reset transport header in netlink_recvmsg()
net: chelsio: cxgb4: Avoid potential negative array offset
fbdev: efifb: Fix a use-after-free due early fb_info cleanup
sfc: Use swap() instead of open coding it
net: sfc: fix memory leak due to ptp channel
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
nfs: fix broken handling of the softreval mount option
ionic: fix missing pci_release_regions() on error in ionic_probe()
dim: initialize all struct fields
hwmon: (ltq-cputemp) restrict it to SOC_XWAY
procfs: prevent unprivileged processes accessing fdinfo dir
selftests: vm: Makefile: rename TARGETS to VMTARGETS
arm64: vdso: fix makefile dependency on vdso.so
virtio: fix virtio transitional ids
s390/ctcm: fix variable dereferenced before check
s390/ctcm: fix potential memory leak
s390/lcs: fix variable dereferenced before check
net/sched: act_pedit: really ensure the skb is writable
net: ethernet: mediatek: ppe: fix wrong size passed to memset()
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
drm/vc4: hdmi: Fix build error for implicit function declaration
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
tls: Fix context leak on tls_device_down
drm/vmwgfx: Fix fencing on SVGAv3
gfs2: Fix filesystem block deallocation for short writes
hwmon: (f71882fg) Fix negative temperature
RDMA/irdma: Fix deadlock in irdma_cleanup_cm_core()
iommu: arm-smmu: disable large page mappings for Nvidia arm-smmu
ASoC: max98090: Reject invalid values in custom control put()
ASoC: max98090: Generate notifications on changes for custom control
ASoC: ops: Validate input values in snd_soc_put_volsw_range()
s390: disable -Warray-bounds
ASoC: SOF: Fix NULL pointer exception in sof_pci_probe callback
net: emaclite: Don't advertise 1000BASE-T and do auto negotiation
net: sfp: Add tx-fault workaround for Huawei MA5671A SFP ONT
secure_seq: use the 64 bits of the siphash for port offset calculation
tcp: use different parts of the port_offset for index and offset
tcp: resalt the secret every 10 seconds
tcp: add small random increments to the source port
tcp: dynamically allocate the perturb table used by source ports
tcp: increase source port perturb table to 2^16
tcp: drop the hash_32() part from the index calculation
interconnect: Restore sync state by ignoring ipa-virt in provider count
firmware_loader: use kernel credentials when reading firmware
KVM: PPC: Book3S PR: Enable MSR_DR for switch_mmu_context()
usb: xhci-mtk: fix fs isoc's transfer error
x86/mm: Fix marking of unused sub-pmd ranges
tty/serial: digicolor: fix possible null-ptr-deref in digicolor_uart_probe()
tty: n_gsm: fix buffer over-read in gsm_dlci_data()
tty: n_gsm: fix mux activation issues in gsm_config()
usb: cdc-wdm: fix reading stuck on device close
usb: typec: tcpci: Don't skip cleanup in .remove() on error
usb: typec: tcpci_mt6360: Update for BMC PHY setting
USB: serial: pl2303: add device id for HP LM930 Display
USB: serial: qcserial: add support for Sierra Wireless EM7590
USB: serial: option: add Fibocom L610 modem
USB: serial: option: add Fibocom MA510 modem
slimbus: qcom: Fix IRQ check in qcom_slim_probe
fsl_lpuart: Don't enable interrupts too early
serial: 8250_mtk: Fix UART_EFR register address
serial: 8250_mtk: Fix register address for XON/XOFF character
ceph: fix setting of xattrs on async created inodes
Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
mm/huge_memory: do not overkill when splitting huge_zero_page
drm/vmwgfx: Disable command buffers on svga3 without gbobjects
drm/nouveau/tegra: Stop using iommu_present()
i40e: i40e_main: fix a missing check on list iterator
net: atlantic: always deep reset on pm op, fixing up my null deref regression
net: phy: Fix race condition on link status change
writeback: Avoid skipping inode writeback
cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
net: phy: micrel: Do not use kszphy_suspend/resume for KSZ8061
net: phy: micrel: Pass .probe for KS8737
SUNRPC: Ensure that the gssproxy client can start in a connected state
drm/vmwgfx: Initialize drm_mode_fb_cmd2
Revert "drm/amd/pm: keep the BACO feature enabled for suspend"
dma-buf: call dma_buf_stats_setup after dmabuf is in valid list
mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
ping: fix address binding wrt vrf
usb: gadget: uvc: rename function to be more consistent
usb: gadget: uvc: allow for application to cleanly shutdown
Linux 5.15.41
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia65cdbbddf553237d6a3a38efb9bcb2fcc3990ec
[ Upstream commit 8b796475fd7882663a870456466a4fb315cc1bd6 ]
Currently pedit tries to ensure that the accessed skb offset
is writable via skb_unclone(). The action potentially allows
touching any skb bytes, so it may end-up modifying shared data.
The above causes some sporadic MPTCP self-test failures, due to
this code:
tc -n $ns2 filter add dev ns2eth$i egress \
protocol ip prio 1000 \
handle 42 fw \
action pedit munge offset 148 u8 invert \
pipe csum tcp \
index 100
The above modifies a data byte outside the skb head and the skb is
a cloned one, carrying a TCP output packet.
This change addresses the issue by keeping track of a rough
over-estimate highest skb offset accessed by the action and ensuring
such offset is really writable.
Note that this may cause performance regressions in some scenarios,
but hopefully pedit is not in the critical path.
Fixes: db2c24175d ("act_pedit: access skb->data safely")
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/1fcf78e6679d0a287dd61bb0f04730ce33b3255d.1652194627.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
This reverts commit bcba40bd36.
It breaks the abi and is not really needed for Android systems, so
revert it for now.
Bug: 161946584
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8768fa3c558d43cbe2005c151c494c9fd6146089
commit 719774377622bc4025d2a74f551b5dc2158c6c30 upstream.
Convert nf_conn reference counting from atomic_t to refcount_t based api.
refcount_t api provides more runtime sanity checks and will warn on
certain constructs, e.g. refcount_inc() on a zero reference count, which
usually indicates use-after-free.
For this reason template allocation is changed to init the refcount to
1, the subsequenct add operations are removed.
Likewise, init_conntrack() is changed to set the initial refcount to 1
instead refcount_inc().
This is safe because the new entry is not (yet) visible to other cpus.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ec5b0f605b105457f257f2870acad4a5d463984b ]
While investigating a related syzbot report,
I found that whenever call to tcf_exts_init()
from u32_init_knode() is failing, we end up
with an elevated refcount on ht->refcnt
To avoid that, only increase the refcount after
all possible errors have been evaluated.
Fixes: b9a24bb76b ("net_sched: properly handle failure case of tcf_exts_init()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e8a64bbaaad1f6548cec5508297bc6d45e8ab69e ]
A user may set the SO_TXTIME socket option to ensure a packet is send
at a given time. The taprio scheduler has to confirm, that it is allowed
to send a packet at that given time, by a check against the packet time
schedule. The scheduler drop the packet, if the gates are closed at the
given send time.
The check, if SO_TXTIME is set, may fail since sk_flags are part of an
union and the union is used otherwise. This happen, if a socket is not
a full socket, like a request socket for example.
Add a check to verify, if the union is used for sk_flags.
Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e65812fd22eba32f11abe28cb377cbd64cfb1ba0 ]
Currently, when inserting a new filter that needs to sit at the head
of chain 0, it will first update the heads pointer on all devices using
the (shared) block, and only then complete the initialization of the new
element so that it has a "next" element.
This can lead to a situation that the chain 0 head is propagated to
another CPU before the "next" initialization is done. When this race
condition is triggered, packets being matched on that CPU will simply
miss all other filters, and will flow through the stack as if there were
no other filters installed. If the system is using OVS + TC, such
packets will get handled by vswitchd via upcall, which results in much
higher latency and reordering. For other applications it may result in
packet drops.
This is reproducible with a tc only setup, but it varies from system to
system. It could be reproduced with a shared block amongst 10 veth
tunnels, and an ingress filter mirroring packets to another veth.
That's because using the last added veth tunnel to the shared block to
do the actual traffic, it makes the race window bigger and easier to
trigger.
The fix is rather simple, to just initialize the next pointer of the new
filter instance (tp) before propagating the head change.
The fixes tag is pointing to the original code though this issue should
only be observed when using it unlocked.
Fixes: 2190d1d094 ("net: sched: introduce helpers to work with filter chains")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/b97d5f4eaffeeb9d058155bcab63347527261abf.1649341369.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2105f700b53c24aa48b65c15652acc386044d26a ]
A tc flower filter matching TCA_FLOWER_KEY_VLAN_ETH_TYPE is expected to
match the L2 ethertype following the first VLAN header, as confirmed by
linked discussion with the maintainer. However, such rule also matches
packets that have additional second VLAN header, even though filter has
both eth_type and vlan_ethtype set to "ipv4". Looking at the code this
seems to be mostly an artifact of the way flower uses flow dissector.
First, even though looking at the uAPI eth_type and vlan_ethtype appear
like a distinct fields, in flower they are all mapped to the same
key->basic.n_proto. Second, flow dissector skips following VLAN header as
no keys for FLOW_DISSECTOR_KEY_CVLAN are set and eventually assigns the
value of n_proto to last parsed header. With these, such filters ignore any
headers present between first VLAN header and first "non magic"
header (ipv4 in this case) that doesn't result
FLOW_DISSECT_RET_PROTO_AGAIN.
Fix the issue by extending flow dissector VLAN key structure with new
'vlan_eth_type' field that matches first ethertype following previously
parsed VLAN header. Modify flower classifier to set the new
flow_dissector_key_vlan->vlan_eth_type with value obtained from
TCA_FLOWER_KEY_VLAN_ETH_TYPE/TCA_FLOWER_KEY_CVLAN_ETH_TYPE uAPIs.
Link: https://lore.kernel.org/all/Yjhgi48BpTGh6dig@nanopsycho/
Fixes: 9399ae9a6c ("net_sched: flower: Add vlan support")
Fixes: d64efd0926 ("net/sched: flower: Add supprt for matching on QinQ vlan headers")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bcb74e132a76ce0502bb33d5b65533a4ed72d159 ]
When switching zones or network namespaces without doing a ct clear in
between, it is now leaking a reference to the old ct entry. That's
because tcf_ct_skb_nfct_cached() returns false and
tcf_ct_flow_table_lookup() may simply overwrite it.
The fix is to, as the ct entry is not reusable, free it already at
tcf_ct_skb_nfct_cached().
Reported-by: Florian Westphal <fw@strlen.de>
Fixes: 2f131de361f6 ("net/sched: act_ct: Fix flow table lookup after ct clear or switching zones")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 408bdcfce8dfd6902f75fbcd3b99d8b24b506597 ]
Its the same as nf_conntrack_put(), but without the
need for an indirect call. The downside is a module dependency on
nf_conntrack, but all of these already depend on conntrack anyway.
Cc: Paul Blakey <paulb@mellanox.com>
Cc: dev@openvswitch.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2f131de361f6d0eaff17db26efdb844c178432f8 upstream.
Flow table lookup is skipped if packet either went through ct clear
action (which set the IP_CT_UNTRACKED flag on the packet), or while
switching zones and there is already a connection associated with
the packet. This will result in no SW offload of the connection,
and the and connection not being removed from flow table with
TCP teardown (fin/rst packet).
To fix the above, remove these unneccary checks in flow
table lookup.
Fixes: 46475bb20f ("net/sched: act_ct: Software offload of established flows")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 973bf8fdd12f0e70ea351c018e68edd377a836d1 ]
When adding a tc rule with a qdisc kind that is not supported or not
compiled into the kernel, the kernel emits the following error: "Error:
Specified qdisc not found.". Found via tdc testing when ETS qdisc was not
compiled in and it was not obvious right away what the message meant
without looking at the kernel code.
Change the error message to be more explicit and say the qdisc kind is
unknown.
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 429c3be8a5e2695b5b92a6a12361eb89eb185495 ]
The current implementation of HTB offload doesn't support some
parameters. Instead of ignoring them, actively return the EINVAL error
when they are set to non-defaults.
As this patch goes to stable, the driver API is not changed here. If
future drivers support more offload parameters, the checks can be moved
to the driver side.
Note that the buffer and cbuffer parameters are also not supported, but
the tc userspace tool assigns some default values derived from rate and
ceil, and identifying these defaults in sch_htb would be unreliable, so
they are still ignored.
Fixes: d03b195b5a ("sch_htb: Hierarchical QoS hardware offload")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220125100654.424570-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit de2d807b294d3d2ce5e59043ae2634016765d076 upstream.
The attach callback of struct Qdisc_ops is used by only a few qdiscs:
mq, mqprio and htb. qdisc_graft() contains the following logic
(pseudocode):
if (!qdisc->ops->attach) {
if (ingress)
do ingress stuff;
else
do egress stuff;
}
if (!ingress) {
...
if (qdisc->ops->attach)
qdisc->ops->attach(qdisc);
} else {
...
}
As we see, the attach callback is not called if the qdisc is being
attached to ingress (TC_H_INGRESS). That wasn't a problem for mq and
mqprio, since they contain a check that they are attached to TC_H_ROOT,
and they can't be attached to TC_H_INGRESS anyway.
However, the commit cited below added the attach callback to htb. It is
needed for the hardware offload, but in the non-offload mode it
simulates the "do egress stuff" part of the pseudocode above. The
problem is that when htb is attached to ingress, neither "do ingress
stuff" nor attach() is called. It results in an inconsistency, and the
following message is printed to dmesg:
unregister_netdevice: waiting for lo to become free. Usage count = 2
This commit addresses the issue by running "do ingress stuff" in the
ingress flow even in the attach callback is present, which is fine,
because attach isn't going to be called afterwards.
The bug was found by syzbot and reported by Eric.
Fixes: d03b195b5a ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fb80445c438c78b40b547d12b8d56596ce4ccfeb upstream.
commit 56b765b79e ("htb: improved accuracy at high rates") broke
"overhead X", "linklayer atm" and "mpu X" attributes.
"overhead X" and "linklayer atm" have already been fixed. This restores
the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping:
tc class add ... htb rate X overhead 4 mpu 64
The code being fixed is used by htb, tbf and act_police. Cake has its
own mpu handling. qdisc_calculate_pkt_len still uses the size table
containing values adjusted for mpu by user space.
iproute2 tc has always passed mpu into the kernel via a tc_ratespec
structure, but the kernel never directly acted on it, merely stored it
so that it could be read back by `tc class show`.
Rather, tc would generate length-to-time tables that included the mpu
(and linklayer) in their construction, and the kernel used those tables.
Since v3.7, the tables were no longer used. Along with "mpu", this also
broke "overhead" and "linklayer" which were fixed in 01cb71d2d4
("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84b1
("net_sched: restore "linklayer atm" handling", v3.11).
"overhead" was fixed by simply restoring use of tc_ratespec::overhead -
this had originally been used by the kernel but was initially omitted
from the new non-table-based calculations.
"linklayer" had been handled in the table like "mpu", but the mode was
not originally passed in tc_ratespec. The new implementation was made to
handle it by getting new versions of tc to pass the mode in an extended
tc_ratespec, and for older versions of tc the table contents were analysed
at load time to deduce linklayer.
As "mpu" has always been given to the kernel in tc_ratespec,
accompanying the mpu-based table, we can restore system functionality
with no userspace change by making the kernel act on the tc_ratespec
value.
Fixes: 56b765b79e ("htb: improved accuracy at high rates")
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Vimalkumar <j.vimal@gmail.com>
Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6f022c2ddbcefaee79502ce5386dfe351d457070 ]
Netfilter conntrack maintains NAT flags per connection indicating
whether NAT was configured for the connection. Openvswitch maintains
NAT flags on the per packet flow key ct_state field, indicating
whether NAT was actually executed on the packet.
When a packet misses from tc to ovs the conntrack NAT flags are set.
However, NAT was not necessarily executed on the packet because the
connection's state might still be in NEW state. As such, openvswitch
wrongly assumes that NAT was executed and sets an incorrect flow key
NAT flags.
Fix this, by flagging to openvswitch which NAT was actually done in
act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so
the packet flow key NAT flags will be correctly set.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220106153804.26451-1-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 635d448a1cce4b4ebee52b351052c70434fa90ea ]
Zone id is not restored if we passed ct and ct rejected the connection,
as there is no ct info on the skb.
Save the zone from tc skb cb to tc skb extension and pass it on to
ovs, use that info to restore the zone id for invalid connections.
Fixes: d29334c15d ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3849595866166b23bf6a0cb9ff87e06423167f67 ]
If ct rejects a flow, it removes the conntrack info from the skb.
act_ct sets the post_ct variable so the dissector will see this case
as an +tracked +invalid state, but the zone id is lost with the
conntrack info.
To restore the zone id on such cases, set the last executed zone,
via the tc control block, when passing ct, and read it back in the
dissector if there is no ct info on the skb (invalid connection).
Fixes: 7baf2429a1 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7d18a07897d07495ee140dd319b0e9265c0f68ba upstream.
tx_queue_len can be set to ~0U, we need to be more
careful about overflows.
__fls(0) is undefined, as this report shows:
UBSAN: shift-out-of-bounds in net/sched/sch_qfq.c:1430:24
shift exponent 51770272 is too large for 32-bit type 'int'
CPU: 0 PID: 25574 Comm: syz-executor.0 Not tainted 5.16.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x201/0x2d8 lib/dump_stack.c:106
ubsan_epilogue lib/ubsan.c:151 [inline]
__ubsan_handle_shift_out_of_bounds+0x494/0x530 lib/ubsan.c:330
qfq_init_qdisc+0x43f/0x450 net/sched/sch_qfq.c:1430
qdisc_create+0x895/0x1430 net/sched/sch_api.c:1253
tc_modify_qdisc+0x9d9/0x1e20 net/sched/sch_api.c:1660
rtnetlink_rcv_msg+0x934/0xe60 net/core/rtnetlink.c:5571
netlink_rcv_skb+0x200/0x470 net/netlink/af_netlink.c:2496
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x814/0x9f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0xaea/0xe60 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg net/socket.c:724 [inline]
____sys_sendmsg+0x5b9/0x910 net/socket.c:2409
___sys_sendmsg net/socket.c:2463 [inline]
__sys_sendmsg+0x280/0x370 net/socket.c:2492
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 462dbc9101 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ec624fe740b416fb68d536b37fb8eef46f90b5c2 ]
BPF layer extends the qdisc control block via struct bpf_skb_data_end
and because of that there is no more room to add variables to the
qdisc layer control block without going over the skb->cb size.
Extend the qdisc control block with a tc control block,
and move all tc related variables to there as a pre-step for
extending the tc control block with additional members.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 166b6a46b78bf8b9559a6620c3032f9fe492e082 ]
We need to return EOPNOTSUPP for the unsupported mpls action type when
setup the flow action.
In the original implement, we will return 0 for the unsupported mpls
action type, actually we do not setup it and the following actions
to the flow action entry.
Fixes: 9838b20a7f ("net: sched: take rtnl lock in tc_setup_flow_action()")
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f799ada6bf2397c351220088b9b0980125c77280 ]
Without dropping dst, the packets sent from local mirred/redirected
to ingress will may still use the old dst. ip_rcv() will drop it as
the old dst is for output and its .input is dst_discard.
This patch is to fix by also dropping dst for those packets that are
mirred or redirected from egress to ingress in act_mirred.
Note that we don't drop it for the direction change from ingress to
egress, as on which there might be a user case attaching a metadata
dst by act_tunnel_key that would be used later.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1e080f17750d1083e8a32f7b350584ae1cd7ff20 ]
mq / mqprio make the default child qdiscs visible. They only do
so for the qdiscs which are within real_num_tx_queues when the
device is registered. Depending on order of calls in the driver,
or if user space changes config via ethtool -L the number of
qdiscs visible under tc qdisc show will differ from the number
of queues. This is confusing to users and potentially to system
configuration scripts which try to make sure qdiscs have the
right parameters.
Add a new Qdisc_ops callback and make relevant qdiscs TTRT.
Note that this uncovers the "shortcut" created by
commit 1f27cde313 ("net: sched: use pfifo_fast for non real queues")
The default child qdiscs beyond initial real_num_tx are always
pfifo_fast, no matter what the sysfs setting is. Fixing this
gets a little tricky because we'd need to keep a reference
on whatever the default qdisc was at the time of creation.
In practice this is likely an non-issue the qdiscs likely have
to be configured to non-default settings, so whatever user space
is doing such configuration can replace the pfifos... now that
it will see them.
Reported-by: Matthew Massey <matthewmassey@fb.com>
Reviewed-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
First fragmented packets (frag offset = 0) byte len is zeroed
when stolen by ip_defrag(). And since act_ct update the stats
only afterwards (at end of execute), bytes aren't correctly
accounted for such packets.
To fix this, move stats update to start of action execute.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduction of lockless subqueues broke the class statistics.
Before the change stats were accumulated in `bstats' and `qstats'
on the stack which was then copied to struct gnet_dump.
After the change the `bstats' and `qstats' are initialized to 0
and never updated, yet still fed to gnet_dump. The code updates
the global qdisc->cpu_bstats and qdisc->cpu_qstats instead,
clobbering them. Most likely a copy-paste error from the code in
mqprio_dump().
__gnet_stats_copy_basic() and __gnet_stats_copy_queue() accumulate
the values for per-CPU case but for global stats they overwrite
the value, so only stats from the last loop iteration / tc end up
in sch->[bq]stats.
Use the on-stack [bq]stats variables again and add the stats manually
in the global case.
Fixes: ce679e8df7 ("net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
https://lore.kernel.org/all/20211007175000.2334713-2-bigeasy@linutronix.de/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We observed below report when playing with netlink sock:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:580:10
shift exponent 249 is too large for 32-bit type
CPU: 0 PID: 685 Comm: a.out Not tainted
Call Trace:
dump_stack_lvl+0x8d/0xcf
ubsan_epilogue+0xa/0x4e
__ubsan_handle_shift_out_of_bounds+0x161/0x182
__qdisc_calculate_pkt_len+0xf0/0x190
__dev_queue_xmit+0x2ed/0x15b0
it seems like kernel won't check the stab log value passing from
user, and will use the insane value later to calculate pkt_len.
This patch just add a check on the size/cell_log to avoid insane
calculation.
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot found that forcing a big quantum attribute would crash hosts fast,
essentially using this:
tc qd replace dev eth0 root fq_codel quantum 4294967295
This is because fq_codel_dequeue() would have to loop
~2^31 times in :
if (flow->deficit <= 0) {
flow->deficit += q->quantum;
list_move_tail(&flow->flowchain, &q->old_flows);
goto begin;
}
SFQ max quantum is 2^19 (half a megabyte)
Lets adopt a max quantum of one megabyte for FQ_CODEL.
Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reference counting issue happens in one exception handling path of
cbq_change_class(). When failing to get tcf_block, the function forgets
to decrease the refcount of "rtab" increased by qdisc_put_rtab(),
causing a refcount leak.
Fix this issue by jumping to "failure" label when get tcf_block failed.
Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/1630252681-71588-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
bpf-next 2021-08-31
We've added 116 non-merge commits during the last 17 day(s) which contain
a total of 126 files changed, 6813 insertions(+), 4027 deletions(-).
The main changes are:
1) Add opaque bpf_cookie to perf link which the program can read out again,
to be used in libbpf-based USDT library, from Andrii Nakryiko.
2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu.
3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang.
4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch
to another congestion control algorithm during init, from Martin KaFai Lau.
5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima.
6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta.
7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT}
progs, from Xu Liu and Stanislav Fomichev.
8) Support for __weak typed ksyms in libbpf, from Hao Luo.
9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky.
10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov.
11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann.
12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian,
Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others.
13) Another big batch to revamp XDP samples in order to give them consistent look
and feel, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits)
MAINTAINERS: Remove self from powerpc BPF JIT
selftests/bpf: Fix potential unreleased lock
samples: bpf: Fix uninitialized variable in xdp_redirect_cpu
selftests/bpf: Reduce more flakyness in sockmap_listen
bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS
bpf: selftests: Add dctcp fallback test
bpf: selftests: Add connect_to_fd_opts to network_helpers
bpf: selftests: Add sk_state to bpf_tcp_helpers.h
bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
selftests: xsk: Preface options with opt
selftests: xsk: Make enums lower case
selftests: xsk: Generate packets from specification
selftests: xsk: Generate packet directly in umem
selftests: xsk: Simplify cleanup of ifobjects
selftests: xsk: Decrease sending speed
selftests: xsk: Validate tx stats on tx thread
selftests: xsk: Simplify packet validation in xsk tests
selftests: xsk: Rename worker_* functions that are not thread entry points
selftests: xsk: Disassociate umem size with packets sent
selftests: xsk: Remove end-of-test packet
...
====================
Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In HTB offload mode, qdiscs of leaf classes are grafted to netdev
queues. sch_htb expects the dev_queue field of these qdiscs to point to
the corresponding queues. However, qdisc creation may fail, and in that
case noop_qdisc is used instead. Its dev_queue doesn't point to the
right queue, so sch_htb can lose track of used netdev queues, which will
cause internal inconsistencies.
This commit fixes this bug by keeping track of the netdev queue inside
struct htb_class. All reads of cl->leaf.q->dev_queue are replaced by the
new field, the two values are synced on writes, and WARNs are added to
assert equality of the two values.
The driver API has changed: when TC_HTB_LEAF_DEL needs to move a queue,
the driver used to pass the old and new queue IDs to sch_htb. Now that
there is a new field (offload_queue) in struct htb_class that needs to
be updated on this operation, the driver will pass the old class ID to
sch_htb instead (it already knows the new class ID).
Fixes: d03b195b5a ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210826115425.1744053-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot report an array-index-out-of-bounds in taprio_change
index 16 is out of range for type '__u16 [16]'
that's because mqprio->num_tc is lager than TC_MAX_QUEUE,so we check
the return value of netdev_set_num_tc.
Reported-by: syzbot+2b3e5fb6c7ef285a94f6@syzkaller.appspotmail.com
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While running kselftests, Hangbin observed that sch_ets.sh often crashes,
and splats like the following one are seen in the output of 'dmesg':
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 159f12067 P4D 159f12067 PUD 159f13067 PMD 0
Oops: 0000 [#1] SMP NOPTI
CPU: 2 PID: 921 Comm: tc Not tainted 5.14.0-rc6+ #458
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
RIP: 0010:__list_del_entry_valid+0x2d/0x50
Code: 48 8b 57 08 48 b9 00 01 00 00 00 00 ad de 48 39 c8 0f 84 ac 6e 5b 00 48 b9 22 01 00 00 00 00 ad de 48 39 ca 0f 84 cf 6e 5b 00 <48> 8b 32 48 39 fe 0f 85 af 6e 5b 00 48 8b 50 08 48 39 f2 0f 85 94
RSP: 0018:ffffb2da005c3890 EFLAGS: 00010217
RAX: 0000000000000000 RBX: ffff9073ba23f800 RCX: dead000000000122
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff9073ba23fbc8
RBP: ffff9073ba23f890 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: dead000000000100
R13: ffff9073ba23fb00 R14: 0000000000000002 R15: 0000000000000002
FS: 00007f93e5564e40(0000) GS:ffff9073bba00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000014ad34000 CR4: 0000000000350ee0
Call Trace:
ets_qdisc_reset+0x6e/0x100 [sch_ets]
qdisc_reset+0x49/0x1d0
tbf_reset+0x15/0x60 [sch_tbf]
qdisc_reset+0x49/0x1d0
dev_reset_queue.constprop.42+0x2f/0x90
dev_deactivate_many+0x1d3/0x3d0
dev_deactivate+0x56/0x90
qdisc_graft+0x47e/0x5a0
tc_get_qdisc+0x1db/0x3e0
rtnetlink_rcv_msg+0x164/0x4c0
netlink_rcv_skb+0x50/0x100
netlink_unicast+0x1a5/0x280
netlink_sendmsg+0x242/0x480
sock_sendmsg+0x5b/0x60
____sys_sendmsg+0x1f2/0x260
___sys_sendmsg+0x7c/0xc0
__sys_sendmsg+0x57/0xa0
do_syscall_64+0x3a/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f93e44b8338
Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
RSP: 002b:00007ffc0db737a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000061255c06 RCX: 00007f93e44b8338
RDX: 0000000000000000 RSI: 00007ffc0db73810 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000687880 R14: 0000000000000000 R15: 0000000000000000
Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt iTCO_vendor_support intel_rapl_msr intel_rapl_common joydev i2c_i801 pcspkr i2c_smbus lpc_ich virtio_balloon ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel libata serio_raw virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod
CR2: 0000000000000000
When the change() function decreases the value of 'nstrict', we must take
into account that packets might be already enqueued on a class that flips
from 'strict' to 'quantum': otherwise that class will not be added to the
bandwidth-sharing list. Then, a call to ets_qdisc_reset() will attempt to
do list_del(&alist) with 'alist' filled with zero, hence the NULL pointer
dereference.
For classes flipping from 'strict' to 'quantum', initialize an empty list
and eventually add it to the bandwidth-sharing list, if there are packets
already enqueued. In this way, the kernel will:
a) prevent crashing as described above.
b) avoid retaining the backlog packets (for an arbitrarily long time) in
case no packet is enqueued after a change from 'strict' to 'quantum'.
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Fixes: dcc68b4d80 ("net: sch_ets: Add a new Qdisc")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when creating an ingress qdisc on an indirect device before
the driver registered for callbacks, the driver will not have a chance
to register its filter configuration callbacks.
To fix that, modify the code such that it keeps track of all the ingress
qdiscs that call flow_indr_dev_setup_offload(). When a driver calls
flow_indr_dev_register(), go through the list of tracked ingress qdiscs
and call the driver callback entry point so as to give it a chance to
register its callback.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding support for using the skb->hash value as the flow hash in CAKE,
I accidentally introduced a logic error that broke the host-only isolation
modes of CAKE (srchost and dsthost keywords). Specifically, the flow_hash
variable should stay initialised to 0 in cake_hash() in pure host-based
hashing mode. Add a check for this before using the skb->hash value as
flow_hash.
Fixes: b0c19ed608 ("sch_cake: Take advantage of skb->hash where appropriate")
Reported-by: Pete Heist <pete@heistp.net>
Tested-by: Pete Heist <pete@heistp.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turn BPF_PROG_RUN into a proper always inlined function. No functional and
performance changes are intended, but it makes it much easier to understand
what's going on with how BPF programs are actually get executed. It's more
obvious what types and callbacks are expected. Also extra () around input
parameters can be dropped, as well as `__` variable prefixes intended to avoid
naming collisions, which makes the code simpler to read and write.
This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
BPF_PROG_RUN into a function causes naming conflict compilation error. So
rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org