Add a function to change the owner of a network device when it is moved
between network namespaces.
Currently, when moving network devices between network namespaces the
ownership of the corresponding sysfs entries is not changed. This leads
to problems when tools try to operate on the corresponding sysfs files.
This leads to a bug whereby a network device that is created in a
network namespaces owned by a user namespace will have its corresponding
sysfs entry owned by the root user of the corresponding user namespace.
If such a network device has to be moved back to the host network
namespace the permissions will still be set to the user namespaces. This
means unprivileged users can e.g. trigger uevents for such incorrectly
owned devices. They can also modify the settings of the device itself.
Both of these things are unwanted.
For example, workloads will create network devices in the host network
namespace. Other tools will then proceed to move such devices between
network namespaces owner by other user namespaces. While the ownership
of the device itself is updated in
net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs
entry for the device is not:
drwxr-xr-x 5 nobody nobody 0 Jan 25 18:08 .
drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 ..
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id
drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 power
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down
drwxr-xr-x 4 nobody nobody 0 Jan 25 18:09 queues
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed
drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 statistics
lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:08 subsystem -> ../../../../class/net
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent
However, if a device is created directly in the network namespace then
the device's sysfs permissions will be correctly updated:
drwxr-xr-x 5 root root 0 Jan 25 18:12 .
drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 ..
-r--r--r-- 1 root root 4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 root root 4096 Jan 25 18:12 addr_len
-r--r--r-- 1 root root 4096 Jan 25 18:12 address
-r--r--r-- 1 root root 4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 root root 4096 Jan 25 18:12 carrier
-r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 root root 4096 Jan 25 18:12 dev_id
-r--r--r-- 1 root root 4096 Jan 25 18:12 dev_port
-r--r--r-- 1 root root 4096 Jan 25 18:12 dormant
-r--r--r-- 1 root root 4096 Jan 25 18:12 duplex
-rw-r--r-- 1 root root 4096 Jan 25 18:12 flags
-rw-r--r-- 1 root root 4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 root root 4096 Jan 25 18:12 ifalias
-r--r--r-- 1 root root 4096 Jan 25 18:12 ifindex
-r--r--r-- 1 root root 4096 Jan 25 18:12 iflink
-r--r--r-- 1 root root 4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 root root 4096 Jan 25 18:12 mtu
-r--r--r-- 1 root root 4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 root root 4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 root root 4096 Jan 25 18:12 operstate
-r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 root root 4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 root root 0 Jan 25 18:12 power
-rw-r--r-- 1 root root 4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 root root 0 Jan 25 18:12 queues
-r--r--r-- 1 root root 4096 Jan 25 18:12 speed
drwxr-xr-x 2 root root 0 Jan 25 18:12 statistics
lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 root root 4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 root root 4096 Jan 25 18:12 type
-rw-r--r-- 1 root root 4096 Jan 25 18:12 uevent
Now, when creating a network device in a network namespace owned by a
user namespace and moving it to the host the permissions will be set to
the id that the user namespace root user has been mapped to on the host
leading to all sorts of permission issues:
458752
drwxr-xr-x 5 458752 458752 0 Jan 25 18:12 .
drwxr-xr-x 9 root root 0 Jan 25 18:08 ..
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_len
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 address
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_id
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_port
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dormant
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 duplex
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 flags
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 ifalias
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 ifindex
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 iflink
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 mtu
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 operstate
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 power
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 458752 458752 0 Jan 25 18:12 queues
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 speed
drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 statistics
lrwxrwxrwx 1 root root 0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 458752 458752 4096 Jan 25 18:12 type
-rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 uevent
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to change the owner of a device's power entries. This
needs to happen when the ownership of a device is changed, e.g. when
moving network devices between network namespaces.
This function will be used to correctly account for ownership changes,
e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to change the owner of a device's sysfs entries. This
needs to happen when the ownership of a device is changed, e.g. when
moving network devices between network namespaces.
This function will be used to correctly account for ownership changes,
e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to change the owner of sysfs objects.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
This mirrors how a kobject is added through driver core which in its guts is
done via kobject_add_internal() which in summary creates the main directory via
create_dir(), populates that directory with the groups associated with the
ktype of the kobject (if any) and populates the directory with the basic
attributes associated with the ktype of the kobject (if any). These are the
basic steps that are associated with adding a kobject in sysfs.
Any additional properties are added by the specific subsystem itself (not by
driver core) after it has registered the device. So for the example of network
devices, a network device will e.g. register a queue subdirectory under the
basic sysfs directory for the network device and than further subdirectories
within that queues subdirectory. But that is all specific to network devices
and they call the corresponding sysfs functions to do that directly when they
create those queue objects. So anything that a subsystem adds outside of what
driver core does must also be changed by it (That's already true for removal of
files it created outside of driver core.) and it's the same for ownership
changes.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helpers to change the owner of sysfs groups.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper to change the owner of a sysfs link.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helpers to change the owner of a sysfs files.
This function will be used to correctly account for kobject ownership
changes, e.g. when moving network devices between network namespaces.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Outdated Raspberry Pi 4 firmware might configure the external PHY as
rgmii although the kernel currently sets it as rgmii-rxid. This makes
connections unreliable as ID_MODE_DIS is left enabled. To avoid this,
explicitly clear that bit whenever we don't need it.
Fixes: da38802211 ("net: bcmgenet: Add RGMII_RXID support")
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The put of the flags was added by the commit referenced in fixes tag,
however the size of the message was not extended accordingly.
Fix this by adding size of the flags bitfield to the message size.
Fixes: e382267860 ("net: sched: update action implementations to support flags")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
list_for_each_entry_rcu() has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled.
The devlink->lock is held when devlink_dpipe_table_find()
is called in non RCU read side section. Therefore, pass struct devlink
to devlink_dpipe_table_find() for lockdep checking.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
Lastly, fix the following checkpatch warning:
CHECK: Prefer kernel type 'u32' over 'u_int32_t'
#61: FILE: drivers/net/ethernet/cisco/enic/vnic_devcmd.h:653:
+ u_int32_t val[];
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Martin Habets <mhabets@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
This issue was detected with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are still experiencing some packet loss with the existing advanced
congestion buffering (ACB) settings with the IMP port configured for
2Gb/sec, so revert to conservative link speeds that do not produce
packet loss until this is resolved.
Fixes: 8f1880cbe8 ("net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec")
Fixes: de34d7084e ("net: dsa: bcm_sf2: Only 7278 supports 2Gb/sec IMP port")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 7458bd540f ("net: dsa:
bcm_sf2: Also configure Port 5 for 2Gb/sec on 7278") as it causes
advanced congestion buffering issues with 7278 switch devices when using
their internal Giabit PHY. While this is being debugged, continue with
conservative defaults that work and do not cause packet loss.
Fixes: 7458bd540f ("net: dsa: bcm_sf2: Also configure Port 5 for 2Gb/sec on 7278")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes:
1) Perform garbage collection from workqueue to fix rcu detected
stall in ipset hash set types, from Jozsef Kadlecsik.
2) Fix the forceadd evaluation path, also from Jozsef.
3) Fix nft_set_pipapo selftest, from Stefano Brivio.
4) Crash when add-flush-add element in pipapo set, also from Stefano.
Add test to cover this crash.
5) Remove sysctl entry under mutex in hashlimit, from Cong Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Includes this commit:
platform/chrome: wilco_ec: Include asm/unaligned instead of linux/ path
Fixes a compilation warning.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQCtZK6p/AktxXfkOlzbaomhzOwwgUCXlblyAAKCRBzbaomhzOw
wt4ZAP9/aN/gan3EUCrJvFW0z2uZbHrWfblFSUNKH2//4KPjjAEAmp6HKG4b/D73
7eeJCq9cVzCJoYL2MVhwpIwB2oxoDg4=
=f5X+
-----END PGP SIGNATURE-----
Merge tag 'tag-chrome-platform-fixes-for-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
Pull chrome platform fix from Benson Leung:
"Fix a build warning"
* tag 'tag-chrome-platform-fixes-for-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: wilco_ec: Include asm/unaligned instead of linux/ path
Add missing newlines to netdev_* format strings so the lines
aren't buffered by the printk subsystem.
Nitpicked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before releasing the global mutex, we only unlink the hashtable
from the hash list, its proc file is still not unregistered at
this point. So syzbot could trigger a race condition where a
parallel htable_create() could register the same file immediately
after the mutex is released.
Move htable_remove_proc_entry() back to mutex protection to
fix this. And, fold htable_destroy() into htable_put() to make
the code slightly easier to understand.
Reported-and-tested-by: syzbot+d195fd3b9a364ddd6731@syzkaller.appspotmail.com
Fixes: c4a3922d2d ("netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()")
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Syzbot reported that ethnl_compact_sanity_checks() can be tricked into
reading past the end of ETHTOOL_A_BITSET_VALUE and ETHTOOL_A_BITSET_MASK
attributes and even the message by passing a value between (u32)(-31)
and (u32)(-1) as ETHTOOL_A_BITSET_SIZE.
The problem is that DIV_ROUND_UP(attr_nbits, 32) is 0 for such values so
that zero length ETHTOOL_A_BITSET_VALUE will pass the length check but
ethnl_bitmap32_not_zero() check would try to access up to 512 MB of
attribute "payload".
Prevent this overflow byt limiting the bitset size. Technically, compact
bitset format would allow bitset sizes up to almost 2^18 (so that the
nest size does not exceed U16_MAX) but bitsets used by ethtool are much
shorter. S16_MAX, the largest value which can be directly used as an
upper limit in policy, should be a reasonable compromise.
Fixes: 10b518d4e6 ("ethtool: netlink bitset handling")
Reported-by: syzbot+7fd4ed5b4234ab1fdccd@syzkaller.appspotmail.com
Reported-by: syzbot+709b7a64d57978247e44@syzkaller.appspotmail.com
Reported-by: syzbot+983cb8fb2d17a7af549d@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the lower and upper bounds when there are multiple TCs and
traffic is on the the same TC on the same device.
The lower bound is represented by 'qoffset' and the upper limit for
hash value is 'qcount + qoffset'. This gives a clean Rx to Tx queue
mapping when there are multiple TCs, as the queue indices for upper TCs
will be offset by 'qoffset'.
v2: Fixed commit description based on comments.
Fixes: 1b837d489e ("net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash")
Fixes: eadec877ce ("net: Add support for subordinate traffic classes to netdev_pick_tx")
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change in API of bootconfig (before it comes live in a release)
- Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists
- Set CONFIG_BOOT_CONFIG to 'n' by default
- Show error if "bootconfig" on cmdline but not compiled in
- Prevent redefining the same value
- Have a way to append values
- Added a SELECT BLK_DEV_INITRD to fix a build failure
Synthetic event fixes:
- Switch to raw_smp_processor_id() for recording CPU value in preempt
section. (No care for what the value actually is)
- Fix samples always recording u64 values
- Fix endianess
- Check number of values matches number of fields
- Fix a printing bug
Fix of trace_printk() breaking postponed start up tests
Make a function static that is only used in a single file.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXlW4vxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtioAP0WLEm3dWO0z3321h/a0DSshC+Bslu3
HDPTsGVGrXmvggEA/lr1ikRHd8PsO7zW8BfaZMxoXaTqXiuSrzEWxnMlFw0=
=O8PM
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing and bootconfig updates:
"Fixes and changes to bootconfig before it goes live in a release.
Change in API of bootconfig (before it comes live in a release):
- Have a magic value "BOOTCONFIG" in initrd to know a bootconfig
exists
- Set CONFIG_BOOT_CONFIG to 'n' by default
- Show error if "bootconfig" on cmdline but not compiled in
- Prevent redefining the same value
- Have a way to append values
- Added a SELECT BLK_DEV_INITRD to fix a build failure
Synthetic event fixes:
- Switch to raw_smp_processor_id() for recording CPU value in preempt
section. (No care for what the value actually is)
- Fix samples always recording u64 values
- Fix endianess
- Check number of values matches number of fields
- Fix a printing bug
Fix of trace_printk() breaking postponed start up tests
Make a function static that is only used in a single file"
* tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
bootconfig: Add append value operator support
bootconfig: Prohibit re-defining value on same key
bootconfig: Print array as multiple commands for legacy command line
bootconfig: Reject subkey and value on same parent key
tools/bootconfig: Remove unneeded error message silencer
bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
bootconfig: Set CONFIG_BOOT_CONFIG=n by default
tracing: Clear trace_state when starting trace
bootconfig: Mark boot_config_checksum() static
tracing: Disable trace_printk() on post poned tests
tracing: Have synthetic event test use raw_smp_processor_id()
tracing: Fix number printing bug in print_synth_event()
tracing: Check that number of vals matches number of synth event fields
tracing: Make synth_event trace functions endian-correct
tracing: Make sure synth_event_trace() example always uses u64
This Kselftest kunit update consists of fixes to documentation and
run-time tool from Brendan Higgins and Heidi Fahim.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl5VuwUACgkQCwJExA0N
QxwJiw/+OgVUhIVw4GNvuyDfRruZBR77h41brG3yIlkiJeswxrJBvv6mgQWP69nu
3V2MO7DrJ/Y4LINZ4ElGyiSMpoY+Tpex7GBX0WZy31FVrmOAd4AhZ/fHZar1k4ye
7rnts9Py6PwIYVxO3hcuDAfpIhEa98qKTKhVrLfHxR2CxbcvKDXIWfvz1gcp5M3y
n4D3KVXwmb6yy7q85l8VjwxXevdaFp/bGmRW5HwzpMPJkrtBJWQrFJBGxeX1LVTY
IcNKGu61Efd2KP6K9WF6EyS/seD+GbyuFOMq9xOG3WM6f65EILq6K6A24EGZtUxV
IpJySFvewf+in8lzQql6F0flCvThYXkf2Dofi3yoQAda0XrwcL+Z/rugeLMQoEHN
bYgCKzwW/otwLpJHlWJLPxEnWfuY7A1025xG7Ly+k7qBVsKy2aMZk70gP9uPr6hh
lCp+zRRrnMAwFgKNSD6hVC+yblw0ACXv0UmL+ccUtX5KtSa+yYJ3JFZhOFzhhHug
vwXCF5eLYdGuBVNWAO39kyLyV02nUwXiNaoVW5NF9fNpq6HdA6XWcofcV70AM6WZ
l3s2MDBq7hc7edYknnTHCgaFlHqIlWkFAm828HtJXBV3IpHAagPRFWUVWnkfPlU9
FCQXfnbkteB2ZUlHQwjUGBZzh07ZV0iafzNZcYzgyFCjDlVeHDw=
=Q7Zl
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-kunit-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kunit fixes from Shuah Khan:
"This Kselftest kunit update consists of fixes to documentation and
the run-time tool from Brendan Higgins and Heidi Fahim"
* tag 'linux-kselftest-kunit-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: run kunit_tool from any directory
kunit: test: Improve error messages for kunit_tool when kunitconfig is invalid
Documentation: kunit: fixed sphinx error in code block
This Kselftest update for Linux 5.6-rc4 consists of:
- fixes to TIMEOUT failures and out-of-tree compilation compilation
errors from Michael Ellerman.
- Declutter git status fix from Christophe Leroy
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl5VlisACgkQCwJExA0N
QxzxAhAAol+8YeyQNqkesjUUPZR+hc7fM1G3TfHlwar5ljhlwbIOFCtjp66b9EKA
4Cxy5s2/Vhkbs6CFJPa78UXRoH1enMejff6Dd5njwwNmS+cE1wAatM8RBSJeB4X3
hMjfXCwvjJXqNhayD8n+sHmpEVtCL8SmiG5kKfQu6s+qXN/4EEUw1AaUfms4WO9t
VDDC8Cc8RKhl9ZM1YxZTMoS7xISoWeZM94+aK12kXfL/rlt86k0FcN1FoApf/kIo
15ILTo4cZvWMCLdDxbpw6RSGSdB9+siNFNnWnVp5ytTaD8nVRjLSf/sHlu5B9dvh
VHPA56lofJmXjMxz/cNoHP2jgVsu+hNuG8J3h/GYkaCd6mEG8f5k7kAdqJjQ1D1/
3cA54DtxCxfmDji24bTJaD5+uG60NAAh1EjeNKiWkMK07zsUxzXqDgJLLUM67EFk
cYYwTcT9Yqc/GKVV7e2BkiwOiIYQih0NTg2ugV2HEdmm/1EqycoS0McwzIAIa5+2
k6iUQ3nlpjLnP7vz4950aLVD9a5CsrRM9dY+ngYcbaAX00g9s0G0sLVfRXW6Ls2t
9KMYoio1ERILqwvkHgdDyEXGUW/uMYhVMpbx647ZjtRAVNSVTvxZe4jIewZ3o6lx
6vJ+sxYrrXoyZPPUrQGq3NiHg3Wh8BDw5EZaXuuo8JHbVCpvrMk=
=QRUz
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fixes from Shuah Khan:
- fixes to TIMEOUT failures and out-of-tree compilation compilation
errors from Michael Ellerman.
- declutter git status fix from Christophe Leroy
* tag 'linux-kselftest-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/rseq: Fix out-of-tree compilation
selftests: Install settings files to fix TIMEOUT failures
selftest/lkdtm: Don't pollute 'git status'
This reverts commit ead68df94d.
Using the -Werror flag breaks the build for me due to mostly harmless
KASAN or similar warnings:
arch/x86/kvm/x86.c: In function ‘kvm_timer_init’:
arch/x86/kvm/x86.c:7209:1: error: the frame size of 1112 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
Feel free to add a CONFIG_WERROR if you care strong enough, but don't
break peoples builds for absolutely no good reason.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When queueing a signal, we increment both the users count of pending
signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
of the user struct itself (because we keep a reference to the user in
the signal structure in order to correctly account for it when freeing).
That turns out to be fairly expensive, because both of them are atomic
updates, and particularly under extreme signal handling pressure on big
machines, you can get a lot of cache contention on the user struct.
That can then cause horrid cacheline ping-pong when you do these
multiple accesses.
So change the reference counting to only pin the user for the _first_
pending signal, and to unpin it when the last pending signal is
dequeued. That means that when a user sees a lot of concurrent signal
queuing - which is the only situation when this matters - the only
atomic access needed is generally the 'sigpending' count update.
This was noticed because of a particularly odd timing artifact on a
dual-socket 96C/192T Cascade Lake platform: when you get into bad
contention, on that machine for some reason seems to be much worse when
the contention happens in the upper 32-byte half of the cacheline.
As a result, the kernel test robot will-it-scale 'signal1' benchmark had
an odd performance regression simply due to random alignment of the
'struct user_struct' (and pointed to a completely unrelated and
apparently nonsensical commit for the regression).
Avoiding the double increments (and decrements on the dequeueing side,
of course) makes for much less contention and hugely improved
performance on that will-it-scale microbenchmark.
Quoting Feng Tang:
"It makes a big difference, that the performance score is tripled! bump
from original 17000 to 54000. Also the gap between 5.0-rc6 and
5.0-rc6+Jiri's patch is reduced to around 2%"
[ The "2% gap" is the odd cacheline placement difference on that
platform: under the extreme contention case, the effect of which half
of the cacheline was hot was 5%, so with the reduced contention the
odd timing artifact is reduced too ]
It does help in the non-contended case too, but is not nearly as
noticeable.
Reported-and-tested-by: Feng Tang <feng.tang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Philip Li <philip.li@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Rostecki says:
====================
Feature probes in bpftool related to bpf_probe_write_user and
bpf_trace_printk helpers emit dmesg warnings which might be confusing
for people running bpftool on production environments. This patch series
addresses that by filtering them out by default and introducing the new
positional argument "full" which enables all available probes.
The main motivation behind those changes is ability the fact that some
probes (for example those related to "trace" or "write_user" helpers)
emit dmesg messages which might be confusing for people who are running
on production environments. For details see the Cilium issue[0].
v1 -> v2:
- Do not expose regex filters to users, keep filtering logic internal,
expose only the "full" option for including probes which emit dmesg
warnings.
v2 -> v3:
- Do not use regex for filtering out probes, use function IDs directly.
- Fix bash completion - in v2 only "prefix" was proposed after "macros",
"dev" and "kernel" were not.
- Rephrase the man page paragraph, highlight helper function names.
- Remove tests which parse the plain output of bpftool (except the
header/macros test), focus on testing JSON output instead.
- Add test which compares the output with and without "full" option.
v3 -> v4:
- Use enum to check for helper functions.
- Make selftests compatible with older versions of Python 3.x than 3.7.
[0] https://github.com/cilium/cilium/issues/10048
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add Python module with tests for "bpftool feature" command, which mainly
checks whether the "full" option is working properly.
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200226165941.6379-6-mrostecki@opensuse.org
Update bash completion for "bpftool feature" command with the new
argument: "full".
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200226165941.6379-5-mrostecki@opensuse.org
Update documentation of "bpftool feature" command with information about
new arguments: "full".
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200226165941.6379-4-mrostecki@opensuse.org
Probes related to bpf_probe_write_user and bpf_trace_printk helpers emit
dmesg warnings which might be confusing for people running bpftool on
production environments. This change filters them out by default and
introduces the new positional argument "full" which enables all
available probes.
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200226165941.6379-3-mrostecki@opensuse.org
Remove all calls of print_end_then_start_section function and for loops
out from the do_probe function. Instead, provide separate functions for
each section (like i.e. section_helpers) which are called in do_probe.
This change is motivated by better readability.
Signed-off-by: Michal Rostecki <mrostecki@opensuse.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20200226165941.6379-2-mrostecki@opensuse.org
The dt_binding_check is added to PHONY, but it is invisible when
$(dtstree) is empty. So, it is not specified as phony for
ARCH=x86 etc.
Add it to PHONY outside the ifneq ... endif block.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Rob Herring <robh@kernel.org>
The dtbs_check should be a phony target, but currently it is not
specified so.
'make dtbs_check' works even if a file named 'dtbs_check' exists
because it depends on another phony target, scripts_dtc, but we
should not rely on it.
Add dtbs_check to PHONY.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Rob Herring <robh@kernel.org>
This if_change_rule is not working properly; it cannot detect any
command line change.
The reason is because cmd-check in scripts/Kbuild.include compares
$(cmd_$@) and $(cmd_$1), but cmd_dtc_dt_yaml does not exist here.
For if_change_rule to work properly, the stem part of cmd_* and rule_*
must match. Because this cmd_and_fixdep invokes cmd_dtc, this rule must
be named rule_dtc.
Fixes: 4f0e3a57d6 ("kbuild: Add support for DT binding schema checks")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Rob Herring <robh@kernel.org>
This sentence does not make sense in the section about mandatory-y.
This seems to be a copy-paste mistake of commit fcc8487d47 ("uapi:
export all headers under uapi directories").
The correct description would be "The convention is to list one
mandatory-y per line ...".
I just removed it instead of fixing it. If such information is needed,
it could be commented in include/asm-generic/Kbuild and
include/uapi/asm-generic/Kbuild.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Complete the comments for valid values of KBUILD_VERBOSE,
specifically for KBUILD_VERBOSE=2.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Looks like the iavf code actually experienced a race condition, when a
developer took code before the check for chain 0 was put to helper.
So use tc_cls_can_offload_and_chain0() helper instead of direct check and
move the check to _cb() so this is similar to i40e code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change "/usr/bin/python3" to "/usr/bin/env python3" for
more portable solution in bpf_helpers_doc.py.
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225205426.6975-1-scott.branden@broadcom.com
Add a specific test for the crash reported by Phil Sutter and addressed
in the previous patch. The test cases that, in my intention, should
have covered these cases, that is, the ones from the 'concurrency'
section, don't run these sequences tightly enough and spectacularly
failed to catch this.
While at it, define a convenient way to add these kind of tests, by
adding a "reported issues" test section.
It's more convenient, for this particular test, to execute the set
setup in its own function. However, future test cases like this one
might need to call setup functions, and will typically need no tools
other than nft, so allow for this in check_tools().
The original form of the reproducer used here was provided by Phil.
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil reports that adding elements, flushing and re-adding them
right away:
nft add table t '{ set s { type ipv4_addr . inet_service; flags interval; }; }'
nft add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }'
nft flush set t s
nft add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }'
triggers, almost reliably, a crash like this one:
[ 71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI
[ 71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192
[ 71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
[ 71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[ 71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
[ 71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
[ 71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
[ 71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
[ 71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
[ 71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
[ 71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
[ 71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
[ 71.334596] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[ 71.335780] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
[ 71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 71.339718] Call Trace:
[ 71.340093] nft_pipapo_destroy+0x7a/0x170 [nf_tables_set]
[ 71.340973] nft_set_destroy+0x20/0x50 [nf_tables]
[ 71.341879] nf_tables_trans_destroy_work+0x246/0x260 [nf_tables]
[ 71.342916] process_one_work+0x1d5/0x3c0
[ 71.343601] worker_thread+0x4a/0x3c0
[ 71.344229] kthread+0xfb/0x130
[ 71.344780] ? process_one_work+0x3c0/0x3c0
[ 71.345477] ? kthread_park+0x90/0x90
[ 71.346129] ret_from_fork+0x35/0x40
[ 71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink]
[ 71.348153] ---[ end trace 2eaa8149ca759bcc ]---
[ 71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
[ 71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
[ 71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
[ 71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
[ 71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
[ 71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
[ 71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
[ 71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
[ 71.350025] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[ 71.350026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
[ 71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 71.350030] Kernel panic - not syncing: Fatal exception
[ 71.350412] Kernel Offset: disabled
[ 71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]---
which is caused by dangling elements that have been deactivated, but
never removed.
On a flush operation, nft_pipapo_walk() walks through all the elements
in the mapping table, which are then deactivated by nft_flush_set(),
one by one, and added to the commit list for removal. Element data is
then freed.
On transaction commit, nft_pipapo_remove() is called, and failed to
remove these elements, leading to the stale references in the mapping.
The first symptom of this, revealed by KASan, is a one-byte
use-after-free in subsequent calls to nft_pipapo_walk(), which is
usually not enough to trigger a panic. When stale elements are used
more heavily, though, such as double-free via nft_pipapo_destroy()
as in Phil's case, the problem becomes more noticeable.
The issue comes from that fact that, on a flush operation,
nft_pipapo_remove() won't get the actual key data via elem->key,
elements to be deleted upon commit won't be found by the lookup via
pipapo_get(), and removal will be skipped. Key data should be fetched
via nft_set_ext_key(), instead.
Reported-by: Phil Sutter <phil@nwl.cc>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jozsef Kadlecsik says:
====================
ipset patches for nf
The first one is larger than usual, but the issue could not be solved simpler.
Also, it's a resend of the patch I submitted a few days ago, with a one line
fix on top of that: the size of the comment extensions was not taken into
account at reporting the full size of the set.
- Fix "INFO: rcu detected stall in hash_xxx" reports of syzbot
by introducing region locking and using workqueue instead of timer based
gc of timed out entries in hash types of sets in ipset.
- Fix the forceadd evaluation path - the bug was also uncovered by the syzbot.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Clang warns:
In file included from
../drivers/net/ethernet/mellanox/mlx5/core/main.c:73:
../drivers/net/ethernet/mellanox/mlx5/core/diag/rsc_dump.h:4:9: warning:
'__MLX5_RSC_DUMP_H' is used as a header guard here, followed by #define
of a different macro [-Wheader-guard]
#ifndef __MLX5_RSC_DUMP_H
^~~~~~~~~~~~~~~~~
../drivers/net/ethernet/mellanox/mlx5/core/diag/rsc_dump.h:5:9: note:
'__MLX5_RSC_DUMP__H' is defined here; did you mean '__MLX5_RSC_DUMP_H'?
#define __MLX5_RSC_DUMP__H
^~~~~~~~~~~~~~~~~~
__MLX5_RSC_DUMP_H
1 warning generated.
Make them match to get the intended behavior and remove the warning.
Fixes: 12206b1723 ("net/mlx5: Add support for resource dump")
Link: https://github.com/ClangBuiltLinux/linux/issues/897
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>