Lawrence Brakmo says:
====================
This patchset adds support for propagating congestion notifications (cn)
to TCP from cgroup inet skb egress BPF programs.
Current cgroup skb BPF programs cannot trigger TCP congestion window
reductions, even when they drop a packet. This patch-set adds support
for cgroup skb BPF programs to send congestion notifications in the
return value when the packets are TCP packets. Rather than the
current 1 for keeping the packet and 0 for dropping it, they can
now return:
NET_XMIT_SUCCESS (0) - continue with packet output
NET_XMIT_DROP (1) - drop packet and do cn
NET_XMIT_CN (2) - continue with packet output and do cn
-EPERM - drop packet
Finally, HBM programs are modified to collect and return more
statistics.
There has been some discussion regarding the best place to manage
bandwidths. Some believe this should be done in the qdisc where it can
also be managed with a BPF program. We believe there are advantages
for doing it with a BPF program in the cgroup/skb callback. For example,
it reduces overheads in the cases where there is on primary workload and
one or more secondary workloads, where each workload is running on its
own cgroupv2. In this scenario, we only need to throttle the secondary
workloads and there is no overhead for the primary workload since there
will be no BPF program attached to its cgroup.
Regardless, we agree that this mechanism should not penalize those that
are not using it. We tested this by doing 1 byte req/reply RPCs over
loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs.
Each test was repeated 50 times with a 1 minute delay between each set
of 10. We then calculated the average RPCs/sec over the 50 tests. We
compare upstream with upstream + patchset and no BPF program as well
as upstream + patchset and a BPF program that just returns ALLOW_PKT.
Here are the results:
upstream 80937 RPCs/sec
upstream + patches, no BPF program 80894 RPCs/sec
upstream + patches, BPF program 80634 RPCs/sec
These numbers indicate that there is no penalty for these patches
The use of congestion notifications improves the performance of HBM when
using Cubic. Without congestion notifications, Cubic will not decrease its
cwnd and HBM will need to drop a large percentage of the packets.
The following results are obtained for rate limits of 1Gbps,
between two servers using netperf, and only one flow. We also show how
reducing the max delayed ACK timer can improve the performance when
using Cubic.
Command used was:
./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
-s=<server running netserver>
where:
<rate> is 1000
--no_cn specifies no cwr notifications
dctcp uses dctcp
Cubic DCTCP
Lim, DA Mbps cwnd cred drops Mbps cwnd cred drops
-------- ---- ---- ---- ----- ---- ---- ---- -----
1G, 40 35 462 -320 67% 995 1 -212 0.05%
1G, 40,cn 736 9 -78 0.07 995 1 -212 0.05
1G, 5,cn 941 2 -189 0.13 995 1 -212 0.05
Notes:
--no_cn has no effect with DCTCP
Lim = rate limit
DA = maximum delay ack timer
cred = credit in packets
drops = % packets dropped
v1->v2: Insures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3
New egress values apply to all protocols, not just TCP
Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers
Removed changes to __tcp_transmit_skb (patch 5), no longer needed
Removed sample use of EDT
v2->v3: Removed the probe timer related changes
v3->v4: Replaced preempt_enable_no_resched() by preempt_enable()
in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro
====================
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adds more stats to HBM, including average cwnd and rtt of all TCP
flows, percents of packets that are ecn ce marked and distribution
of return values.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Update hbm_out_kern.c to support returning cn notifications.
Also updates relevant files to allow disabling cn notifications.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications from the BPF programs.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
For egress packets, __cgroup_bpf_fun_filter_skb() will now call
BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY()
in order to propagate congestion notifications (cn) requests to TCP
callers.
For egress packets, this function can return:
NET_XMIT_SUCCESS (0) - continue with packet output
NET_XMIT_DROP (1) - drop packet and notify TCP to call cwr
NET_XMIT_CN (2) - continue with packet output and notify TCP
to call cwr
-EPERM - drop packet
For ingress packets, this function will return -EPERM if any attached
program was found and if it returned != 1 during execution. Otherwise 0
is returned.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allows cgroup inet skb programs to return values in the range [0, 3].
The second bit is used to deterine if congestion occurred and higher
level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr()
The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
at load time if it uses the new return values (i.e. 2 or 3).
The expected_attach_type is currently not enforced for
BPF_PROG_TYPE_CGROUP_SKB. e.g Meaning the current bpf_prog with
expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to
BPF_CGROUP_INET_INGRESS. Blindly enforcing expected_attach_type will
break backward compatibility.
This patch adds a enforce_expected_attach_type bit to only
enforce the expected_attach_type when it uses the new
return value.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
__cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
request cwr for TCP packets.
Current cgroup skb programs can only return 0 or 1 (0 to drop the
packet. This macro changes the behavior so the low order bit
indicates whether the packet should be dropped (0) or not (1)
and the next bit is used for congestion notification (cn).
Hence, new allowed return values of CGROUP EGRESS BPF programs are:
0: drop packet
1: keep packet
2: drop packet and call cwr
3: keep packet and call cwr
This macro then converts it to one of NET_XMIT values or -EPERM
that has the effect of dropping the packet with no cn.
0: NET_XMIT_SUCCESS skb should be transmitted (no cn)
1: NET_XMIT_DROP skb should be dropped and cwr called
2: NET_XMIT_CN skb should be transmitted and cwr called
3: -EPERM skb should be dropped (no cn)
Note that when more than one BPF program is called, the packet is
dropped if at least one of programs requests it be dropped, and
there is cn if at least one program returns cn.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Several bug fixes for the new XIVE-native code.
- Replace kvm->lock by other mutexes in several places where we hold a
vcpu mutex, to avoid lock order inversions.
- Fix a lockdep warning on guest entry for radix-mode guests.
- Fix a bug causing user-visible corruption of SPRG3 on the host.
-----BEGIN PGP SIGNATURE-----
iQFGBAABCAAwFiEEv0VLfXa2m9eKuaRpnZrqdyxjcZ8FAlzvZC8SHHBhdWx1c0Bv
emxhYnMub3JnAAoJEJ2a6ncsY3Gf5EwIAKBJJDLxvW9C3bEZJOTQllgeJXCraxnh
p6NGCHVmUy21tb42KevKX2y6DMQ/i6zTaLMbFUtR3f6QEjS9DUoFOnKpK4AtibZB
Cb/oRTG8d9wmzVNiEQruianOiCGBNPRH0Mf1/tHfc1jtVSVcsCOR88PRteUnLGPF
nS+NbS6X9QadL5Dp4pv3cRyksKgkPcA0fyloHqusnldXVQbcEdtYnOWP6clhz9uy
vsO+kPkGkJBk+fLQUluOdXiRh8HewXWHiJYf8qMRfHP/L9LaiMUBmjlY6QLtXEcD
vUEgXazflS5vmgCtwgAwOlQPINS+xTGL0IBVdingJ+dLQW8il6NrdWM=
=L3ZC
-----END PGP SIGNATURE-----
Merge tag 'kvm-ppc-fixes-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master
PPC KVM fixes for 5.2
- Several bug fixes for the new XIVE-native code.
- Replace kvm->lock by other mutexes in several places where we hold a
vcpu mutex, to avoid lock order inversions.
- Fix a lockdep warning on guest entry for radix-mode guests.
- Fix a bug causing user-visible corruption of SPRG3 on the host.
The variable err is assigned with the value -ENOMEM that is never
read and it is re-assigned a new value later on. The assignment is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable err is initialized with a value that is never read
and err is reassigned a few statements later. This initialization
is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to a confusion I thought that eth_type_trans() was called by the
network stack whereas it can actually be called by network drivers to
figure out the skb protocol and next packet_type handlers.
In light of the above, it is not safe to store the frame type from the
DSA tagger's .filter callback (first entry point on RX path), since GRO
is yet to be invoked on the received traffic. Hence it is very likely
that the skb->cb will actually get overwritten between eth_type_trans()
and the actual DSA packet_type handler.
Of course, what this patch fixes is the actual overwriting of the
SJA1105_SKB_CB(skb)->type field from the GRO layer, which made all
frames be seen as SJA1105_FRAME_TYPE_NORMAL (0).
Fixes: 227d07a07e ("net: dsa: sja1105: Add support for traffic through standalone ports")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While troubleshooting issues where cloned request limits have been
exceeded, it is often beneficial to know the actual values that
have been breached. Print these values, assisting in ease of
identification of root cause of the breach.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Document the meaning of the blk_mq_hw_queue_to_node() arguments.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change one occurrence of 'performace' into 'performance'.
Cc: Max Gurtovoy <maxg@mellanox.com>
Fixes: fe631457ff ("blk-mq: map all HWQ also in hyperthreaded system") # v4.13.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch avoids that the kernel-doc script complains about these
function headers when building with W=1.
Cc: Hannes Reinecke <hare@suse.com>
Cc: Keith Busch <keith.busch@intel.com>
Fixes: ed76e329d7 ("blk-mq: abstract out queue map") # v5.0.
Fixes: e42b3867de ("blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues") # v5.0.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch avoids that the kernel-doc tool warns about this function
header when building with W=1.
Reviewed-by: Chaitanya Kulkarni <chiatanya.kulkarni@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
locks as nonconflicting.
We have proper fix queued up for 5.3. In the meantime, a quick revert
seems best for 5.2 and stable.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAlzxj7AVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+z00QAKsZbew6iPT/NQrE45Rsu8gED9kz
NOslHTu49LLxsW6wbpbmKu8jjGpU0UeVbNhytM6O71gyI7Vi3NLxQKi/FLP9+DyE
FvMKTeeym9rSFLb+WjooBi2QuftXComsllCh0Xm1YisFBJX1LhfkkhuYZcuPVM/A
XwSughv5IjOC7K6gA4qYW4q07BCsuKhQMYWNtI3vkxoepAXcWIuj/gN6KxWzKvKZ
HbzUdVOx+s1aBlbJSu8T8G/qFvZ1a8fG2VP9caTmOhl7o1i29NjBmUP01mMlLR71
cgb4J6y0bXLxPGl8e0svU0k5byX+RSxaRHaxzB7rzjAEaPC1DivsbmSzicyygwxw
iS44egS02gJPORIgdEPlJYKAJqrov22JilyvVby05hDl9NAvFL2zdVbYtdrdkD5D
+9137X202jRg+3XoGs4rpISr725kVFcyU7d45/4CDf5xPIyBCuk8tD3kr/+X15lb
3k8ZXF1PL7ozZ5cSZR/HyvX2uRgQ3hj1Hfg6vGr2I8iZlU8drR3CUgMagApls1fl
2Yoa9dbRFpLbvdj0HYQxP5vMd1SakwQqyfsO3TnJ2bOONvHoGBVLjPop9uSYEKSZ
TCG6eq7UQOIr4FnWD2LUqjnzxSPaSPZeh887m2OWawAWcAE+ZLhkcDZZRxEs2Dxp
G+Gnxon0vC8AYYlS
=eWTb
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.2-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fix from Bruce Fields:
"This reverts a minor fix which could cause us to treat conflicting NLM
locks as nonconflicting.
We have proper fix queued up for 5.3. In the meantime, a quick revert
seems best for 5.2 and stable"
* tag 'nfsd-5.2-1' of git://linux-nfs.org/~bfields/linux:
Revert "lockd: Show pid of lockd for remote locks"
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlzwvNkACgkQiiy9cAdy
T1HfMgwAtbhWezNhi5/ETU8OGjcxjOAMb4VZlddV5GV+EknjGoUDQ55aG7xIe2R6
7CUJifoAtHXYkB8rYb5USGmAiE/RZKRt0vawAlVPK1UwOwIPHAY4O55pnufOcy06
yJxj6bhOgBTSb3zJ7/L1Cuf5vnRkeHcVIoFxWdt0Pk1J6qlbmZr6ZkNxcuTx3IKg
t1XDIaoVbiMHrdLCpBrAoFC+2tM7PYnBxg3W9cNDVV8ExLm1DuD+b9rJlaw1B9LF
birWhNmloDSkPqdKKMNcCTng2nIE79RmYnn6ZzTiYH63AAgDG+8PTpM8fSRb86YG
sFpzb7PjM67Q1wAXNP7PNB3yTkktrNZKmWNq9fWYNcsISfUGDh8GhQTG0YXlsIXY
URyFbTTQaOa7mFNNtkJcoZ6DruCsaxcC+g8ch2+4TuBmJXRrlD02LT2IIRmKr7/q
wm+uUvrIj6l39Q9RRsMPrbgtqur8jXOvs0kepbUTdEP2v2Nil9YoMC2JUJXn7429
C+fBHIay
=5i3G
-----END PGP SIGNATURE-----
Merge tag 'v5.2-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four small smb3 fixes, one for stable"
* tag 'v5.2-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: cifs_read_allocate_pages: don't iterate through whole page array on ENOMEM
dfs_cache: fix a wrong use of kfree in flush_cache_ent()
fs/cifs/smb2pdu.c: fix buffer free in SMB2_ioctl_free
cifs: fix memory leak of pneg_inbuf on -EOPNOTSUPP ioctl case
It turns out that various triggers use led_blink_setup() from atomic
context, so we can't do a flush_work there. Flush is still needed for
slow LEDs, but we can move it to sysfs code where it is safe.
WARNING: inconsistent lock state
5.2.0-rc1 #1 Tainted: G W
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
000000006e30541b
((work_completion)(&led_cdev->set_brightness_work)){+.?.}, at:
+__flush_work+0x3b/0x38a
{SOFTIRQ-ON-W} state was registered at:
lock_acquire+0x146/0x1a1
__flush_work+0x5b/0x38a
flush_work+0xb/0xd
led_blink_setup+0x1e/0xd3
led_blink_set+0x3f/0x44
tpt_trig_timer+0xdb/0x106
ieee80211_mod_tpt_led_trig+0xed/0x112
Fixes: 0db37915d9 ("leds: avoid races with workqueue")
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
__netdev_tx_sent_queue() was introduced by:
commit 3e59020abf ("net: bql: add __netdev_tx_sent_queue()")
BQL counters should be updated without flipping/caring about
BQL status, if the current skb has xmit_more set.
Using __netdev_tx_sent_queue() avoids messing with BQL stop
flag, increases performance on GSO workload by keeping
doorbells to the minimum required and also sparing atomic
operations.
Signed-off-by: Erez Alfasi <ereza@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
HW does not support push VLAN action in the RX direction (packets
arriving from the wire). The FW works around this limitation by haripining
the packet. The hairpin workaround applies only when the push VLAN action
is specified in a termination table, assuring that there are no actions
following the haripin.
Instantiate termination table for push VLAN actions. Re-use identical
terminating tables for increased HW cache efficiency.
Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Add HW offloading support for flows with Geneve encap/decap.
Notes about decap flows with Geneve TLV Options:
- Support offloading of 32-bit options data only
- At any given time, only one combination of class/type parameters
can be offloaded, but the same class/type combination can have
many different flows offloaded with different 32-bit option data
- Options with value of 0 can't be offloaded
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Rearrange tc tunnel code so that it would be easy to add future tunnels:
- Define tc tunnel object with the fields and callbacks that any
tunnel must implement.
- Define tc UDP tunnel object for UDP tunnels, such as VXLAN
- Move each tunnel code (GRE, VXLAN) to its own separate file
- Rewrite tc tunnel implementation in a general way - using only
the objects and their callbacks.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In mlx5e encap entry structure, IP tunnel info data structure is copied
by value. This approach worked till now, but it breaks when there are
encapsulation options, such as in case of Geneve.
These options are stored in the structure that is allocated adjacent to
the IP tunnel info struct, and not pointed at by any field in that struct.
Therefore, when copying the struct by value, we loose the address of the
original struct and can't get to the encapsulation options.
Fix the problem by storing the pointer to the tunnel info data instead.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Use Geneve TLV Options object to manage the flex parser matching
on the 32-bit options data.
When the first flow with a certain class/type values is requested to
be offloaded, create a FW object with FW command (Geneve TLV Options
general object) and start counting the number of flows using this object.
During this time, any request with a different class/type values will
fail to be offloaded.
Once the refcount reaches 0, destroy the TLV options general object,
and can now offload a flow with any class/type parameters.
Geneve TLV Options object is added to core device.
It is currently used to manage Geneve TLV options general
object allocation in FW and its reference counting only.
In the future it will also be used for managing geneve ports
by registering callbacks for ndo_udp_tunnel_add/del.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When filling in flow spec match criteria, to allow previous
modifications of the match criteria, use "|=" rather than "=".
Tunnel options are parsed before the match criteria of the offloaded
flow are being set. If the the flow that we're about to offload has
encapsulation options, the flow group might need to match on additional
criteria.
For Geneve, an additional flow group matching parameter should
be used - misc3. The appropriate bit in the match criteria is set
while parsing the tunnel options, so the criteria value shouldn't
be overwritten.
This is a pre-step for supporting Geneve TLV options offload.
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
In some case, we don't care the enc_src_ip and enc_dst_ip, and
if we don't match the field enc_src_ip and enc_dst_ip, we can use
fewer flows in hardware when revice the tunnel packets. For example,
the tunnel packets may be sent from different hosts, we must offload
one rule for each host.
$ tc filter add dev vxlan0 protocol ip parent ffff: prio 1 \
flower dst_mac 00:11:22:33:44:00 \
enc_src_ip Host0_IP enc_dst_ip 2.2.2.100 \
enc_dst_port 4789 enc_key_id 100 \
action tunnel_key unset action mirred egress redirect dev eth0_1
$ tc filter add dev vxlan0 protocol ip parent ffff: prio 1 \
flower dst_mac 00:11:22:33:44:00 \
enc_src_ip Host1_IP enc_dst_ip 2.2.2.100 \
enc_dst_port 4789 enc_key_id 100 \
action tunnel_key unset action mirred egress redirect dev eth0_1
If we support flows which only match the enc_key_id and enc_dst_port,
a flow can process the packets sent to VM which (mac 00:11:22:33:44:00).
$ tc filter add dev vxlan0 protocol ip parent ffff: prio 1 \
flower dst_mac 00:11:22:33:44:00 \
enc_dst_port 4789 enc_key_id 100 \
action tunnel_key unset action mirred egress redirect dev eth0_1
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Beside the special vports (PF/uplink/ecpf), the rest of the vports
are similar.
Remove vf_ prefix from function and variable names.
This patch does not change any functionality.
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
This series provides some low level updates for mlx5 driver needed for
both rdma and netdev trees.
1) Termination flow steering table bits and hardware definitions.
2) Introduce the core dump HW access registers definitions.
3) Refactor and cleans-up VF representors functions handlers.
4) Renames host_params bits to function_changed bits and add the
support for eswitch functions change event in the eswitch general case.
(for both legacy and switchdev modes).
5) Potential error pointer dereference in error handling
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Russell King says:
====================
phylink/sfp updates
This is a series of updates to phylink and sfp:
- Remove an unused net device argument from the phylink MII ioctl
emulation code.
- add support for using interrupts when using a GPIO for link status
tracking, rather than polling it at one second intervals. This
reduces the need to wakeup the CPU every second.
- add support to the MII ioctl API to read and write Clause 45 PHY
registers. I don't know how desirable this is for mainline, but I
have used this facility extensively to investigate the Marvell
88x3310 PHY. A recent illustration of use for this was debugging
the PHY-without-firmware problem recently reported.
- add mandatory attach/detach methods for the upstream side of sfp
bus code, which will allow us to remove the "netdev" structure from
the SFP layers.
- remove the "netdev" structure from the SFP upstream registration
calls, which simplifies PHY to SFP links.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The sfp-bus code now no longer has any use for the network device
structure, so remove its use.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add attach and detach methods for SFP buses, which will allow us to get
rid of the netdev storage in sfp-bus.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow userspace to generate Clause 45 MII access cycles via phylib.
This is useful for tools such as mii-diag to be able to inspect Clause
45 PHYs.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for using GPIO interrupts with a fixed-link GPIO rather than
polling the GPIO every second and invoking the phylink resolution. This
avoids unnecessary calls to mac_config().
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netdev used in the phylink ioctl emulation is never used, so let's
remove it.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently for every representor type and for every single vport,
representer function pointers copy is stored even though they don't
change from one to other vport.
Additionally priv data entry for the rep is not passed during
registration, but its copied. It is used (set and cleared) by the user
of the reps.
As we want to scale vports, to simplify and also to split constants
from data,
1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops as to match _ops
prefix with other standard netdev, ibdev ops.
2. Constify the IB and Ethernet rep ops structure.
3. Instead of storing copy of all rep function pointers, store copy
per eswitch rep type.
4. Split data and function pointers to mlx5_eswitch_rep_ops and
mlx5_eswitch_rep_data.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Avoid typecasting from void* to mlx5_ib_dev* or mlx5e_rep_priv*
as it is not needed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Whenever device supports eswitch functions changed event, honor
such device setting. Do not limit it to ECPF.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
To support sriov on a E-Switch manager, num_vfs are queried
to the firmware whenever E-Switch manager is notified by
esw_functions_changed event.
Replace host_params event with esw_functions_changed event that reflects
more appropriate naming.
While at it, also correct num_vfs type from int to u16 as expected by
the function mlx5_esw_query_functions().
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Termination table is a flow table with a termination flag. The flag
allows the firmware to assume that the the specified actions are the last
actions list. This assumption allows the FW to safely perform potential
looping logic (e.g. hairpin). Introduce the bits for this attribute.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Pull integrity subsystem fixes from Mimi Zohar:
"Four bug fixes, none 5.2-specific, all marked for stable.
The first two are related to the architecture specific IMA policy
support. The other two patches, one is related to EVM signatures,
based on additional hash algorithms, and the other is related to
displaying the IMA policy"
* 'next-fixes-for-5.2-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: show rules with IMA_INMASK correctly
evm: check hash algorithm passed to init_desc()
ima: fix wrong signed policy requirement when not appraising
x86/ima: Check EFI_RUNTIME_SERVICES before using
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCXPExqQAKCRCAXGG7T9hj
vjS5AP49PbfE6m8K3GUqdpAbFYOnlxCrNbiaY628Klj6s5ZpYwD8CtUVGKZGhxUE
SgAr1TgAt+YCDA3M0NyEa6gtvgM/fQo=
=3zJV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.2b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"One minor cleanup patch and a fix for handling of live migration when
running as Xen guest"
* tag 'for-linus-5.2b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xenbus: Avoid deadlock during suspend due to open transactions
xen/pvcalls: Remove set but not used variable
The phylink conflict was between a bug fix by Russell King
to make sure we have a consistent PHY interface mode, and
a change in net-next to pull some code in phylink_resolve()
into the helper functions phylink_mac_link_{up,down}()
On the dp83867 side it's mostly overlapping changes, with
the 'net' side removing a condition that was supposed to
trigger for RGMII but because of how it was coded never
actually could trigger.
Signed-off-by: David S. Miller <davem@davemloft.net>
- Farewell Martin Schwidefsky: add Martin to CREDITS and remove him
from MAINTAINERS
- Vasily Gorbik and Christian Borntraeger join as maintainers for s390
- Fix locking bug in ctr(aes) and ctr(des) s390 specific ciphers
- A rather large patch which fixes gcm-aes-s390 scatter gather handling
- Fix zcrypt wrong dispatching for control domain CPRBs
- Fix assignment of bus resources in PCI code
- Fix structure definition for set PCI function
- Fix one compile error and one compile warning seen when
CONFIG_OPTIMIZE_INLINING is enabled
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJc8QIzAAoJECIOw3kbKW7C8+cP/iEuFF/YFKq896Zmd50wV1fL
0ASkqKfrwWNzQz+Y9c7WuIoGNghptr4zPPANkaRLkUCIZ2SsmvSLrggtAD0Ls4ut
vAoCXLH10rDdGWmiXHuDHOsmeQ/1RdbqW6ZfZEDhLNY7vYtCFfpOyAN0QEhGa7/F
NcAemkD9Q3uerATCr37mVcK3GrzzhcbGd9mVN5uqdq0ZLRDrI4K+JxWVisdBvzve
bnWCwUsgDYOhc1C1pMDD8IsWd+F3a7V+caDNWFhMgbRCPA9adNzf9fYuEH5ftI7U
W7bS45ZR5BX3pAHtOIg6s/l2W+cu3vGKuCYIutA2JpRqGM8IASEoUJwMZ54Fu/nF
Eh+zcfizxwREX7VjXKHd9oXZ41cocYdBIt8kCMe+gbj9zKzD716wzhiBVh3WCG6v
uBEx0nHMkHlKaTaNY3oUU5HP6t+zARw+uApkpA9EdNiM1pV6T/n9ySTnJHrU0NNo
3nYlCHzm1W9RLknbp+vmc9SbbEzWhhUpCDUi/5Ny8YsFwsddMixWMmg9DI8zJpJB
Qr6OSD05dThPPH7bWNcBbbt/NU0p/BaxgbVYewcHM/cln1tw2lDuOY0jjiIB3SV/
twhMoFx7fEZCpSsP8t27f33NA3UTEpHhr2KSNZZ2w2DGu6QE/QawgUHDT+VlFs5x
WiUkvArmRqlWL3Aaz6W8
=e7cv
-----END PGP SIGNATURE-----
Merge tag 's390-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Farewell Martin Schwidefsky: add Martin to CREDITS and remove him
from MAINTAINERS
- Vasily Gorbik and Christian Borntraeger join as maintainers for s390
- Fix locking bug in ctr(aes) and ctr(des) s390 specific ciphers
- A rather large patch which fixes gcm-aes-s390 scatter gather handling
- Fix zcrypt wrong dispatching for control domain CPRBs
- Fix assignment of bus resources in PCI code
- Fix structure definition for set PCI function
- Fix one compile error and one compile warning seen when
CONFIG_OPTIMIZE_INLINING is enabled
* tag 's390-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
MAINTAINERS: add Vasily Gorbik and Christian Borntraeger for s390
MAINTAINERS: Farewell Martin Schwidefsky
s390/crypto: fix possible sleep during spinlock aquired
s390/crypto: fix gcm-aes-s390 selftest failures
s390/zcrypt: Fix wrong dispatching for control domain CPRBs
s390/pci: fix assignment of bus resources
s390/pci: fix struct definition for set PCI function
s390: mark __cpacf_check_opcode() and cpacf_query_func() as __always_inline
s390: add unreachable() to dump_fault_info() to fix -Wmaybe-uninitialized