Petr Machata says:
====================
net: Resilient NH groups: netdevsim, selftests
Support for resilient next-hop groups was added in a previous patch set.
Resilient next hop groups add a layer of indirection between the SKB hash
and the next hop. Thus the hash is used to reference a hash table bucket,
which is then used to reference a particular next hop. This allows the
system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
This patch set introduces mock offloading of resilient next hop groups by
the netdevsim driver, and a suite of selftests.
- Patch #1 adds a netdevsim-specific lock to protect next-hop hashtable.
Previously, netdevsim relied on RTNL to maintain mutual exclusion.
Patch #2 extracts a helper to make the following patches clearer.
- Patch #3 implements the support for offloading of resilient next-hop
groups.
- Patch #4 introduces a new debugfs interface to set activity on a selected
next-hop bucket. This simulates how HW can periodically report bucket
activity, and buckets thus marked are expected to be exempt from
migration to new next hops when the group changes.
- Patches #5 and #6 clean up the fib_nexthop selftests.
- Patches #7, #8 and #9 add tests for resilient next hop groups. Patch #7
adds resilient-hashing counterparts to fib_nexthops.sh. Patch #8 adds a
new traffic test for resilient next-hop groups. Patch #9 adds a new
traffic test for tunneling.
- Patch #10 actually leverages the netdevsim offload to implement a suite
of algorithmic tests that verify how and when buckets are migrated under
various simulated workload scenarios.
The overall plan is to contribute approximately the following patchsets:
1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (already pushed)
3) Implementation of resilient next hop group (already pushed)
4) Netdevsim offload plus a suite of selftests (this patchset)
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests
Interested parties can look at the complete code at [2].
[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Test various aspects of the resilient nexthop group offload API on top
of the netdevsim implementation. Both good and bad flows are tested.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a resilient nexthop objects version of gre_multipath_nh.sh. Test
that both IPv4 and IPv6 overlays work with resilient nexthop groups
where the nexthops are two GRE tunnels.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Verify that IPv4 and IPv6 multipath forwarding works correctly with
resilient nexthop groups and with different weights.
Test that when the idle timer is not zero, the resilient groups are not
rebalanced - because the nexthop buckets are considered active - and the
initial weights (1:1) are used.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add test cases for resilient nexthop groups. Exhaustive forwarding tests
are added separately under net/forwarding/.
Examples:
# ./fib_nexthops.sh -t basic_res
Basic resilient nexthop group functional tests
----------------------------------------------
TEST: Add a nexthop group with default parameters [ OK ]
TEST: Get a nexthop group with default parameters [ OK ]
TEST: Get a nexthop group with non-default parameters [ OK ]
TEST: Add a nexthop group with 0 buckets [ OK ]
TEST: Replace nexthop group parameters [ OK ]
TEST: Get a nexthop group after replacing parameters [ OK ]
TEST: Replace idle timer [ OK ]
TEST: Get a nexthop group after replacing idle timer [ OK ]
TEST: Replace unbalanced timer [ OK ]
TEST: Get a nexthop group after replacing unbalanced timer [ OK ]
TEST: Replace with no parameters [ OK ]
TEST: Get a nexthop group after replacing no parameters [ OK ]
TEST: Replace nexthop group type - implicit [ OK ]
TEST: Replace nexthop group type - explicit [ OK ]
TEST: Replace number of nexthop buckets [ OK ]
TEST: Get a nexthop group after replacing with invalid parameters [ OK ]
TEST: Dump all nexthop buckets [ OK ]
TEST: Dump all nexthop buckets in a group [ OK ]
TEST: Dump all nexthop buckets with a specific nexthop device [ OK ]
TEST: Dump all nexthop buckets with a specific nexthop identifier [ OK ]
TEST: Dump all nexthop buckets in a non-existent group [ OK ]
TEST: Dump all nexthop buckets in a non-resilient group [ OK ]
TEST: Dump all nexthop buckets using a non-existent device [ OK ]
TEST: Dump all nexthop buckets with invalid 'groups' keyword [ OK ]
TEST: Dump all nexthop buckets with invalid 'fdb' keyword [ OK ]
TEST: Get a valid nexthop bucket [ OK ]
TEST: Get a nexthop bucket with valid group, but invalid index [ OK ]
TEST: Get a nexthop bucket from a non-resilient group [ OK ]
TEST: Get a nexthop bucket from a non-existent group [ OK ]
Tests passed: 29
Tests failed: 0
# ./fib_nexthops.sh -t ipv4_large_res_grp
IPv4 large resilient group (128k buckets)
-----------------------------------------
TEST: Dump large (x131072) nexthop buckets [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_large_res_grp
IPv6 large resilient group (128k buckets)
-----------------------------------------
TEST: Dump large (x131072) nexthop buckets [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv4_res_torture
IPv4 runtime resilient nexthop group torture
--------------------------------------------
TEST: IPv4 resilient nexthop group torture test [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_res_torture
IPv6 runtime resilient nexthop group torture
--------------------------------------------
TEST: IPv6 resilient nexthop group torture test [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv4_res_grp_fcnal
IPv4 resilient groups functional
--------------------------------
TEST: Nexthop group updated when entry is deleted [ OK ]
TEST: Nexthop buckets updated when entry is deleted [ OK ]
TEST: Nexthop group updated after replace [ OK ]
TEST: Nexthop buckets updated after replace [ OK ]
TEST: Nexthop group updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop buckets updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop group updated after replace - nECMP [ OK ]
TEST: Nexthop buckets updated after replace - nECMP [ OK ]
Tests passed: 8
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_res_grp_fcnal
IPv6 resilient groups functional
--------------------------------
TEST: Nexthop group updated when entry is deleted [ OK ]
TEST: Nexthop buckets updated when entry is deleted [ OK ]
TEST: Nexthop group updated after replace [ OK ]
TEST: Nexthop buckets updated after replace [ OK ]
TEST: Nexthop group updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop buckets updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop group updated after replace - nECMP [ OK ]
TEST: Nexthop buckets updated after replace - nECMP [ OK ]
Tests passed: 8
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Co-developed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The lines with the IPv4 and IPv6 test cases are already very long and
more test cases will be added in subsequent patches.
List each test case in a different line to make it easier to extend the
test with more test cases.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A key component of the resilient hashing algorithm is the hash buckets'
activity. If a bucket is active, it will not be populated with a new
nexthop in order not to break existing flows. Therefore, in order to
easily and thoroughly test the algorithm, we need to be in full control
over the reported activity.
Add a debugfs interface that allows user space to have netdevsim report
a nexthop bucket within a resilient nexthop group as active. For
example:
# echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
Will mark bucket 23 in nexthop group 10 as active.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow resilient nexthop groups to be programmed and account their
occupancy according to their number of buckets. The nexthop group itself
as well as its buckets are marked with hardware flags (i.e.,
'RTNH_F_TRAP').
Replacement of a single nexthop bucket can fail using the following
debugfs knob:
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
N
# echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
Y
Replacement of a resilient nexthop group can fail using the following
debugfs knob:
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
N
# echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
Y
This enables testing of various error paths.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of calling nexthop_set_hw_flags(), call a helper. It will be
used to also set nexthop bucket flags in a subsequent patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently netdevsim relies on RTNL to maintain exclusivity in accessing the
nexthop hash table. However, bucket notification may be called without RTNL
having been held. Instead, introduce a custom lock to guard the table.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lee Jones says:
====================
Rid W=1 warnings from PTP
This set is part of a larger effort attempting to clean-up W=1
kernel builds, which are currently overwhelmingly riddled with
niggly little warnings.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'control' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'event' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'addend' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'accum' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'test' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_compare' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rsystime_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rsystime_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'systime_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'systime_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'trgt_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'trgt_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'asms_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'asms_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'amms_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'amms_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ch_control' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ch_event' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'tx_snap_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'tx_snap_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rx_snap_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rx_snap_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'src_uuid_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'src_uuid_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_status' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_snap_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_snap_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_sel' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_st' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'reserve1' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'stl_max_set_en' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'stl_max_set' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'reserve2' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'srst' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'regs' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'ptp_clock' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'caps' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'exts0_enabled' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'exts1_enabled' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'mem_base' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'mem_size' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'irq' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'pdev' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'register_lock' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:128: warning: Function parameter or member 'station' not described in 'pch_params'
drivers/ptp/ptp_pch.c:291: warning: Function parameter or member 'pdev' not described in 'pch_set_station_address'
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: LAPIS SEMICONDUCTOR <tshimizu818@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: IDT-support-1588@lm.renesas.com
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_pch.c:193:6: warning: no previous prototype for ‘pch_ch_control_write’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:201:5: warning: no previous prototype for ‘pch_ch_event_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:212:6: warning: no previous prototype for ‘pch_ch_event_write’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:220:5: warning: no previous prototype for ‘pch_src_uuid_lo_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:231:5: warning: no previous prototype for ‘pch_src_uuid_hi_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:242:5: warning: no previous prototype for ‘pch_rx_snap_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:259:5: warning: no previous prototype for ‘pch_tx_snap_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:300:5: warning: no previous prototype for ‘pch_set_station_address’ [-Wmissing-prototypes]
Cc: Richard Cochran <richardcochran@gmail.com> (maintainer:PTP HARDWARE CLOCK SUPPORT)
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Flavio Suligoi <f.suligoi@asem.it>
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_pch.c:182:5: warning: no previous prototype for ‘pch_ch_control_read’ [-Wmissing-prototypes]
Cc: Richard Cochran <richardcochran@gmail.com> (maintainer:PTP HARDWARE CLOCK SUPPORT)
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Flavio Suligoi <f.suligoi@asem.it>
Cc: netdev@vger.kernel.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On some SoCs (e.g. BCM4908, BCM631[345]8) SF2 has an integrated
crossbar. It allows connecting its selected external ports to internal
ports. It's used by vendors to handle custom Ethernet setups.
BCM4908 has following 3x2 crossbar. On Asus GT-AC5300 rgmii is used for
connecting external BCM53134S switch. GPHY4 is usually used for WAN
port. More fancy devices use SerDes for 2.5 Gbps Ethernet.
┌──────────┐
SerDes ─── 0 ─┤ │
│ 3x2 ├─ 0 ─── switch port 7
GPHY4 ─── 1 ─┤ │
│ crossbar ├─ 1 ─── runner (accelerator)
rgmii ─── 2 ─┤ │
└──────────┘
Use setup data based on DT info to configure BCM4908's switch port 7.
Right now only GPHY and rgmii variants are supported. Handling SerDes
can be implemented later.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's needed later for proper switch / crossbar setup.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All comment lines inside the comment block have been aligned.
Every line of comment starts with a * (uniformity in code).
Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It appears that each DMA channel has its own interrupt and both rings
can be configured (the same way) to handle interrupts.
1. Make ring interrupts code generic (make it operate on given ring)
2. Move napi to ring (so each has its own)
3. Make IRQ handler generic (match ring against received IRQ number)
4. Add (optional) support for TX interrupt
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
I discovered that hardware actually supports two interrupts, one per DMA
channel (RX and TX).
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock says:
====================
macb SGMII fixed-link fixes
Some fixes to the macb driver for use in SGMII mode with a fixed-link (such as
for chip-to-chip connectivity).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When using a fixed-link configuration in SGMII mode, it's not really
sensible to have auto-negotiation enabled since the link settings are
fixed by definition. In other configurations, such as an SGMII
connection to a PHY, it should generally be enabled.
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using a fixed-link configuration with GEM in SGMII mode, such as
for a chip-to-chip interconnect, the link state was always showing as
established regardless of the actual connectivity state. We can monitor
the pcs_link_state bit in the Network Status register to determine
whether the PCS link state is actually up.
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) TC support for ICMP parameters
2) TC connection tracking with mirroring
3) A round of trivial fixups and cleanups
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmBL+V4ACgkQSD+KveBX
+j68tgf+Lzq6QkpGl4dpg87pjuKIwyykGVJ0NW/Yig+bNe9V3pbFCv785xymD16N
ALcLKNtxUovE6p+Gy6JgzgBN7V9ZsM6bn564487ycYIqewz9c2KIQneJdGEUUDgi
S8NkuCaLQst36Wl8gdiygApZmw8TAafOQYYzKOZ7HdZ+aRH+fIwbVABYreAYwjGa
2iz5yu1AEzG5/10bTEx69rQHY+309EPy4N4na/jo/tXRL3fWQSQz6hzEnoi3EjAU
YsktBB72eBX/NogqLIHpSzzjPgm5IFRQbNSHyL71TH9KEr0KVpzJd9VCbaic6goL
gQn0TqoE74X7r4ZPWP3AHcGOL9L5Rg==
=L2PO
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2021-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-03-12
1) TC support for ICMP parameters
2) TC connection tracking with mirroring
3) A round of trivial fixups and cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Support matching on ICMPv4/6 type and code parameters using misc3
section of match parameters.
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add support for mirroring before the CT action by spliting the pre ct rule.
Mirror outputs are done first on the tc chain,prio table rule (the fwd
rule), which will then forward to a per port fwd table.
On this fwd table, we insert the original pre ct rule that forwards to
ct/ct nat table.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Multiple commands can be printed at the same time which can
lead to wrong order of their lines in dmesg output.
As a result, it's hard to match data dumps to the correct command
or which command was fully dumped at some point.
Fix this by displaying the corresponding command index, and also
indicate when a command was fully dumped.
Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Increasing the size of the indirection_rqt array from 128 to 256 bytes
pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function
over the warning limit when building with clang and CONFIG_KASAN:
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=]
Using dynamic allocation here is safe because the caller does the
same, and it reduces the stack usage of the function to just a few
bytes.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Dump the ICOSQ's WQE descriptor when a completion with error is received.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Commit e20f0dbf20 ("net/mlx5e: RX, Add a prefetch command for small
L1_CACHE_BYTES") switched to using net_prefetchw at all places in mlx5e.
In the same time frame, commit 5af75c747e ("net/mlx5e: Enhanced TX
MPWQE for SKBs") added one more usage of prefetchw. When these two
changes were merged, this new occurrence of prefetchw wasn't replaced
with net_prefetchw.
This commit fixes this last occurrence of prefetchw in
mlx5e_tx_mpwqe_session_start, making the same change that was done in
mlx5e_xdp_mpwqe_session_start.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Fix the following coccicheck warnings:
drivers/net/ethernet/mellanox/mlx5/core/devlink.c:145:29-66: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
drivers/net/ethernet/mellanox/mlx5/core/devlink.c:140:29-77: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Read congestion counters from all ports in any lag mode rather than
only in RoCE lag mode (e.g., VF lag).
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
It is allocated with kvzalloc(), the corresponding release function
should not be kfree(), use kvfree() instead.
Generated by: scripts/coccinelle/api/kfree_mismatch.cocci
Signed-off-by: Junlin Yang <yangjunlin@yulong.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The field source_eswitch_owner_vhca_id was not consumed
in the same way as in STEv0. Added the missing set.
Fixes: 10b6941864 ("net/mlx5: DR, Add HW STEv1 match logic")
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove the dr_ste_v1_set_rx_decap_l3 function that was
replaced by another function - fixing a rebase error.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
"either" is outside the parentheses, so the matching "or" should be too.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
tcp: better deal with delayed TX completions
Jakub and Neil reported an increase of RTO timers whenever
TX completions are delayed a bit more (by increasing
NIC TX coalescing parameters)
While problems have been there forever, second patch might
introduce some regressions so I prefer not backport
them to stable releases before things settle.
Many thanks to FB team for their help and tests.
Few packetdrill tests need to be changed to reflect
the improvements brought by this series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TSQ provides a nice way to avoid bufferbloat on individual socket,
including retransmit packets. We can get rid of the old
heuristic:
/* Do not sent more than we queued. 1/4 is reserved for possible
* copying overhead: fragmentation, tunneling, mangling etc.
*/
if (refcount_read(&sk->sk_wmem_alloc) >
min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
sk->sk_sndbuf))
return -EAGAIN;
This heuristic was giving false positives according to Jakub,
whenever TX completions are delayed above RTT. (Ack packets
are processed by TCP stack before clones are orphaned/freed)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub reported Data included in a Fastopen SYN that had to be
retransmit would have to wait for an RTO if TX completions are slow,
even with prior fix.
This is because tcp_rcv_fastopen_synack() does not use standard
rtx logic, meaning TSQ handler exits early in tcp_tsq_write()
because tp->lost_out == tp->retrans_out
Lets make tcp_rcv_fastopen_synack() use standard rtx logic,
by using tcp_mark_skb_lost() on the skb thats needs to be
sent again.
Not this raised a warning in tcp_fastretrans_alert() during my tests
since we consider the data not being aknowledged
by the receiver does not mean packet was lost on the network.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub and Neil reported an increase of RTO timers whenever
TX completions are delayed a bit more (by increasing
NIC TX coalescing parameters)
Main issue is that TCP stack has a logic preventing a packet
being retransmit if the prior clone has not yet been
orphaned or freed.
This logic came with commit 1f3279ae0c ("tcp: avoid
retransmits of TCP packets hanging in host queues")
Thankfully, in the case skb_still_in_host_queue() detects
the initial clone is still in flight, it can use TSQ logic
that will eventually retry later, at the moment the clone
is freed or orphaned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neil Spring <ntspring@fb.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following warning from sparse:
net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types)
net/tipc/monitor.c:263:35: expected unsigned int
net/tipc/monitor.c:263:35: got restricted __be32 [usertype]
[...]
net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit
net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock
net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit
net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock
net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock
[...]
net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces)
net/tipc/crypto.c:1201:9: expected struct tipc_aead [noderef] __rcu *__tmp
net/tipc/crypto.c:1201:9: got struct tipc_aead *
[...]
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
(struct tipc_link_info)->dest is in network order (__be32), so we must
convert the value to network order before assigning. The problem detected
by sparse:
net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types)
net/tipc/netlink_compat.c:699:24: expected restricted __be32 [usertype] dest
net/tipc/netlink_compat.c:699:24: got int
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel says:
====================
mlxsw: Implement sampling using mirroring
So far, sampling was implemented using a dedicated sampling mechanism
that is available on all Spectrum ASICs. Spectrum-2 and later ASICs
support sampling by mirroring packets to the CPU port with probability.
This method has a couple of advantages compared to the legacy method:
* Extra metadata per-packet: Egress port, egress traffic class, traffic
class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow as opposed to only
ingress
This series should not result in any user-visible changes and its aim is
to convert Spectrum-2 and later ASICs to perform sampling by mirroring
to the CPU port with probability. Future submissions will expose the
additional metadata and enable sampling using more triggers (e.g.,
egress).
Series overview:
Patches #1-#3 extend the SPAN (mirroring) module to accept new
parameters required for sampling. See individual commit messages for
detailed explanation.
Patch #4-#5 split sampling support between Spectrum-1 and later ASIC while
still using the legacy method for all ASIC generations.
Patch #6 converts Spectrum-2 and later ASICs to perform sampling by
mirroring to the CPU port with probability.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum-2 and later ASICs support sampling of packets by mirroring to
the CPU with probability. There are several advantages compared to the
legacy dedicated sampling mechanism:
* Extra metadata per-packet: Egress port, egress traffic class, traffic
class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow
Convert Spectrum-2 and later ASICs to perform sampling by mirroring to
the CPU with probability.
Subsequent patches will add support for egress / per-flow sampling and
expose the extra metadata.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sampling of ingress packets is supported using a dedicated sampling
mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
support more sophisticated sampling by mirroring packets to the CPU.
As a preparation for more advanced sampling configurations, split the trap
configuration used for sampled packets between Spectrum-1 and later ASICs.
This is needed since packets that are mirrored to the CPU are trapped
via a different trap identifier compared to packets that are sampled
using the dedicated sampling mechanism.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sampling of ingress packets is supported using a dedicated sampling
mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
support more sophisticated sampling by mirroring packets to the CPU.
As a preparation for more advanced sampling configurations, split the
sampling operations between Spectrum-1 and later ASICs.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>