after 'police' configuration parameters were converted to use RCU instead
of spinlock, the state variables used to compute the traffic rate (namely
'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in
the traffic path without any protection.
Use a dedicated spinlock to avoid race conditions on these variables, and
ensure proper cache-line alignment. In this way, 'police' is still faster
than what we observed when 'tcf_lock' was used in the traffic path _ i.e.
reverting commit 2d550dbad8 ("net/sched: act_police: don't use spinlock
in the data path"). Moreover, we preserve the throughput improvement that
was obtained after 'police' started using per-cpu counters, when 'avrate'
is used instead of 'rate'.
Changes since v1 (thanks to Eric Dumazet):
- call ktime_get_ns() before acquiring the lock in the traffic path
- use a dedicated spinlock instead of tcf_lock
- improve cache-line usage
Fixes: 2d550dbad8 ("net/sched: act_police: don't use spinlock in the data path")
Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
There are two cases were we can avoid calling ktime_get_ns() :
1) Queue is empty.
2) Internal queue is not empty.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of egress offloads the class/flowid assigned by the filter
may be very important for offloaded Qdisc selection. Provide this
info to drivers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers which offload GRED to report back statistics. Since
A lot of GRED stats is fairly ad hoc in nature pass to drivers the
standard struct gnet_stats_basic/gnet_stats_queue pairs, and
untangle the values in the core.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add basic offload for the GRED Qdisc. Inform the drivers any
time Qdisc or virtual queue configuration changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow users to set and dump RED flags (ECN enabled and harddrop)
on per-virtual queue basis. Validation of attributes is split
from changes to make sure we won't have to undo previous operations
when we find out configuration is invalid.
The objective is to allow changing per-Qdisc parameters without
overwriting the per-vq configured flags.
Old user space will not pass the TCA_GRED_VQ_FLAGS attribute and
per-Qdisc flags will always get propagated to the virtual queues.
New user space which wants to make use of per-vq flags should set
per-Qdisc flags to 0 and then configure per-vq flags as it
sees fit. Once per-vq flags are set per-Qdisc flags can't be
changed to non-zero. Vice versa - if the per-Qdisc flags are
non-zero the TCA_GRED_VQ_FLAGS attribute has to either be omitted
or set to the same value as per-Qdisc flags.
Update per-Qdisc parameters:
per-Qdisc | per-VQ | result
0 | 0 | all vq flags updated
0 | non-0 | error (vq flags in use)
non-0 | 0 | -- impossible --
non-0 | non-0 | all vq flags updated
Update per-VQ state (flags parameter not specified):
no change to flags
Update per-VQ state (flags parameter set):
per-Qdisc | per-VQ | result
0 | any | per-vq flags updated
non-0 | 0 | -- impossible --
non-0 | non-0 | error (per-Qdisc flags in use)
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now ECN marking and HARD drop (the common RED flags) can only
be configured for the entire Qdisc. In preparation for per-vq flags
store the values in the virtual queue structure. Setting per-vq
flags will only be allowed when no flags are set for the entire Qdisc.
For the new flags we will also make sure undefined bits are 0.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently all GRED's virtual queue data is dumped in a single
array in a single attribute. This makes it pretty much impossible
to add new fields. In order to expose more detailed stats add a
new set of attributes. We can now expose the 64 bit value of bytesin
and all the mark stats which were not part of the original design.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
32 bit counters for bytes are not really going to last long in modern
world. Make sch_gred count bytes on a 64 bit counter. It will still
get truncated during dump but follow up patch will add set of new
stat dump attributes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack messages to -EINVAL errors, to help users identify
their mistakes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case netlink wants to provide parsing error pass extack
to nla_parse_nested().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon want to add more code to the non-error path, separate
it from the error handling flow.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently on dequeue() ETF only drops the first expired packet, which
causes a problem if the next packet is already expired. When this
happens, the watchdog will be configured with a time in the past, fire
straight way and the packet will finally be dropped once the dequeue()
function of the qdisc is called again.
We can save quite a few cycles and improve the overall behavior of the
qdisc if we drop all expired packets if the next packet is expired.
This should allow ETF to recover faster from bad situations. But
packet drops are still a very serious warning that the requirements
imposed on the system aren't reasonable.
This was inspired by how the implementation of hrtimers use the
rb_tree inside the kernel.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is just a refactor that will simplify the implementation of the
next patch in this series which will drop all expired packets on the
dequeue flow.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ETF's peek() operation is heavily used so use an rb_root_cached instead
and leverage rb_first_cached() which will run in O(1) instead of
O(log n).
Even if on 'timesortedlist_clear()' we could be using rb_erase(), we
choose to use rb_erase_cached(), because if in the future we allow
runtime changes to ETF parameters, and need to do a '_clear()', this
might cause some hard to debug issues.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point in firing the qdisc watchdog if there are no future
skbs pending in the queue and the watchdog had been set previously.
Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_idr_check_alloc() can return a negative value, on allocation failures
(-ENOMEM) or IDR exhaustion (-ENOSPC): don't leak keys_ex in these cases.
Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When EDT conversion happened, fq lost the ability to enfore a maxrate
for all flows. It kept it for non EDT flows.
This commit restores the functionality.
Tested:
tc qd replace dev eth0 root fq maxrate 500Mbit
netperf -P0 -H host -- -O THROUGHPUT
489.75
Fixes: ab408b6dc7 ("tcp: switch tcp and sch_fq to new earliest departure time model")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added support in tc flower for filtering based on port ranges.
Example:
1. Match on a port range:
-------------------------
$ tc filter add dev enp4s0 protocol ip parent ffff:\
prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
action drop
$ tc -s filter show dev enp4s0 parent ffff:
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
eth_type ipv4
ip_proto tcp
dst_port range 20-30
skip_hw
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 1 ref 1 bind 1 installed 85 sec used 3 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
2. Match on IP address and port range:
--------------------------------------
$ tc filter add dev enp4s0 protocol ip parent ffff:\
prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
skip_hw action drop
$ tc -s filter show dev enp4s0 parent ffff:
filter protocol ip pref 1 flower chain 0 handle 0x2
eth_type ipv4
ip_proto tcp
dst_ip 192.168.1.1
dst_port range 100-200
skip_hw
not_in_hw
action order 1: gact action drop
random type none pass val 0
index 2 ref 1 bind 1 installed 58 sec used 2 sec
Action statistics:
Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
v4:
1. Added condition before setting port key.
2. Organized setting and dumping port range keys into functions
and added validation of input range.
v3:
1. Moved new fields in UAPI enum to the end of enum.
2. Removed couple of empty lines.
v2:
Addressed Jiri's comments:
1. Added separate functions for dst and src comparisons.
2. Removed endpoint enum.
3. Added new bit TCA_FLOWER_FLAGS_RANGE to decide normal/range
lookup.
4. Cleaned up fl_lookup function.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RED qdisc's limit parameter changes the behaviour of the qdisc,
for instance if it's set to 0 qdisc will drop all the packets.
When replace operation happens and parameter is set to non-0
a new fifo qdisc will be instantiated and replace the old child
qdisc which will be destroyed.
Drivers need to know the parameter, even if they don't impose
the actual limit to be able to reliably reconstruct the Qdisc
hierarchy.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path. This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused. Send a graft notification from MQ.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path. This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused. Send a graft notification from RED. The drivers
are expected to simply stop offloading the Qdisc, if a non-standard
child is ever grafted onto it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers are currently not notified when a Qdisc is grafted as root.
This requires special casing Qdiscs added with parent = TC_H_ROOT in
the driver. Also there is no notification sent to the driver when
an existing Qdisc is grafted as root.
Add this very simple notifications, drivers should now be able to
track their Qdisc tree fully.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to 80ba92fa1a ("codel: add ce_threshold attribute")
After EDT adoption, it became easier to implement DCTCP-like CE marking.
In many cases, queues are not building in the network fabric but on
the hosts themselves.
If packets leaving fq missed their Earliest Departure Time by XXX usec,
we mark them with ECN CE. This gives a feedback (after one RTT) to
the sender to slow down and find better operating mode.
Example :
tc qd replace dev eth0 root fq ce_threshold 2.5ms
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If sch_fq is used at ingress, skbs that might have been
timestamped by net_timestamp_set() if a packet capture
is requesting timestamps could be delayed by arbitrary
amount of time, since sch_fq time base is MONOTONIC.
Fix this problem by moving code from sch_netem.c to act_mirred.c.
Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently drivers can register to receive TC block bind/unbind callbacks
by implementing the setup_tc ndo in any of their given netdevs. However,
drivers may also be interested in binds to higher level devices (e.g.
tunnel drivers) to potentially offload filters applied to them.
Introduce indirect block devs which allows drivers to register callbacks
for block binds on other devices. The callback is triggered when the
device is bound to a block, allowing the driver to register for rules
applied to that block using already available functions.
Freeing an indirect block callback will trigger an unbind event (if
necessary) to direct the driver to remove any offloaded rules and unreg
any block rule callbacks. It is the responsibility of the implementing
driver to clean any registered indirect block callbacks before exiting,
if the block it still active at such a time.
Allow registering an indirect block dev callback for a device that is
already bound to a block. In this case (if it is an ingress block),
register and also trigger the callback meaning that any already installed
rules can be replayed to the calling driver.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCA_FLOWER_KEY_ENC_OPTS and TCA_FLOWER_KEY_ENC_OPTS_MASK can only
currently contain further nested attributes, which are parsed by
hand, so the policy is never actually used resulting in a W=1
build warning:
net/sched/cls_flower.c:492:1: warning: ‘enc_opts_policy’ defined but not used [-Wunused-const-variable=]
enc_opts_policy[TCA_FLOWER_KEY_ENC_OPTS_MAX + 1] = {
Add the validation anyway to avoid potential bugs when other
attributes are added and to make the attribute structure slightly
more clear. Validation will also set extact to point to bad
attribute on error.
Fixes: 0a6e77784f ("net/sched: allow flower to match tunnel options")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To mirror software behaviour on offload more precisely inform
the drivers about the state of the harddrop flag.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes assumptions about VLAN_TAG_PRESENT bit.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move destroying of the old child qdiscs outside of the sch_tree_lock()
section. This should improve the software qdisc replace but is even
more important for offloads. Calling offloads under a spin lock is
best avoided, and child's destroy would be called under sch_tree_lock().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move destroying of the old child qdisc outside of the sch_tree_lock()
section. This should improve the software qdisc replace but is even
more important for offloads. Firstly calling offloads under a spin
lock is best avoided. Secondly the destroy event of existing child
would have been sent to the offload device before the replace, causing
confusion.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code for grafting Qdiscs when there is a parent has two needless
indentation levels, and breaks the "keep the success path unindented"
guideline. Refactor.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc graft operation of offload-capable qdiscs performs a few
extra steps which are identical among all the qdiscs. Add
a helper to share this code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PRIO and RED mark the qdisc with TCQ_F_OFFLOADED upon successful offload,
make MQ do the same. The consistency will help with consistent
graft callback behaviour.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offload dump helper does not use opt parameter, remove it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc dump operation of offload-capable qdiscs performs a few
extra steps which are identical among all the qdiscs. Add
a helper to share this code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gred_change_table_def() takes a pointer to TCA_GRED_DPS attribute,
and expects it will be able to interpret its contents as
struct tc_gred_sopt. Pass the correct gred attribute, instead of
TCA_OPTIONS.
This bug meant the table definition could never be changed after
Qdisc was initialized (unless whatever TCA_OPTIONS contained both
passed netlink validation and was a valid struct tc_gred_sopt...).
Old behaviour:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
RTNETLINK answers: Invalid argument
Now:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
gred setup vqs 4 default 0
Fixes: f62d6b936d ("[PKT_SCHED]: GRED: Use central VQ change procedure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marco reported an error with hfsc:
root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
Error: Attribute failed policy validation.
Apparently a few implementations pass TCA_OPTIONS as a binary instead
of nested attribute, so drop TCA_OPTIONS from the policy.
Fixes: 8b4c3cdd9d ("net: sched: Add policy validation for tc attributes")
Reported-by: Marco Berizzi <pupilla@libero.it>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in the following command:
# tc action add action police rate <r> burst <b> conform-exceed <c1>/<c2>
'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
crash with NULL pointer dereference, since TC core doesn't initialize the
chain handle.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in the following command:
# tc action add action <c1> random <rand_type> <c2> <rand_param>
'goto chain x' is allowed only for c1: setting it for c2 makes the kernel
crash with NULL pointer dereference, since TC core doesn't initialize the
chain handle.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.
net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'. Thanks to David
Ahern for the heads up.
Signed-off-by: David S. Miller <davem@davemloft.net>
When dumping classes by parent, kernel would return classes twice:
| # tc qdisc add dev lo root prio
| # tc class show dev lo
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| # tc class show dev lo parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
This comes from qdisc_match_from_root() potentially returning the root
qdisc itself if its handle matched. Though in that case, root's classes
were already dumped a few lines above.
Fixes: cb395b2010 ("net: sched: optimize class dumps")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the new EDT model, sch_fq no longer has to special
case TCP pure acks, since their skb->tstamp will allow them
being sent without pacing delay.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
segs_in:3916318 data_segs_out:177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly to what has been done in 8b4c3cdd9d ("net: sched: Add policy
validation for tc attributes"), fix classifier code to add validation of
TCA_CHAIN and TCA_KIND netlink attributes.
tested with:
# ./tdc.py -c filter
v2: Let sch_api and cls_api share nla_policy they have in common, thanks
to David Ahern.
v3: Avoid EXPORT_SYMBOL(), as validation of those attributes is not done
by TC modules, thanks to Cong Wang.
While at it, restore the 'Delete / get qdisc' comment to its orginal
position, just above tc_get_qdisc() function prototype.
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
David writes:
"Networking
1) RXRPC receive path fixes from David Howells.
2) Re-export __skb_recv_udp(), from Jiri Kosina.
3) Fix refcounting in u32 classificer, from Al Viro.
4) Userspace netlink ABI fixes from Eugene Syromiatnikov.
5) Don't double iounmap on rmmod in ena driver, from Arthur
Kiyanovski.
6) Fix devlink string attribute handling, we must pull a copy into a
kernel buffer if the lifetime extends past the netlink request.
From Moshe Shemesh.
7) Fix hangs in RDS, from Ka-Cheong Poon.
8) Fix recursive locking lockdep warnings in tipc, from Ying Xue.
9) Clear RX irq correctly in socionext, from Ilias Apalodimas.
10) bcm_sf2 fixes from Florian Fainelli."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: dsa: bcm_sf2: Call setup during switch resume
net: dsa: bcm_sf2: Fix unbind ordering
net: phy: sfp: remove sfp_mutex's definition
r8169: set RX_MULTI_EN bit in RxConfig for 8168F-family chips
net: socionext: clear rx irq correctly
net/mlx4_core: Fix warnings during boot on driverinit param set failures
tipc: eliminate possible recursive locking detected by LOCKDEP
selftests: udpgso_bench.sh explicitly requires bash
selftests: rtnetlink.sh explicitly requires bash.
qmi_wwan: Added support for Gemalto's Cinterion ALASxx WWAN interface
tipc: queue socket protocol error messages into socket receive buffer
tipc: set link tolerance correctly in broadcast link
net: ipv4: don't let PMTU updates increase route MTU
net: ipv4: update fnhe_pmtu when first hop's MTU changes
net/ipv6: stop leaking percpu memory in fib6 info
rds: RDS (tcp) hangs on sendto() to unresponding address
net: make skb_partial_csum_set() more robust against overflows
devlink: Add helper function for safely copy string param
devlink: Fix param cmode driverinit for string type
devlink: Fix param set handling for string type
...
While noop_qdisc.gso_skb and noop_qdisc.skb_bad_txq are not used
in other places, it seems not correct to overwrite their fields
in dev_init_scheduler_queue().
noop_qdisc is essentially a shared and read-only object, even if
it is not marked as const because of some implementation detail.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure extack is passed to nlmsg_parse where easy to do so.
Most of these are dump handlers and leveraging the extack in
the netlink_callback.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have the knode count, we can instantly check if
any hnodes are non-empty. And that kills the check for extra
references to root hnode - those could happen only if there was
a knode to carry such a link.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
allows to simplify u32_delete() considerably
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both hnode ->tp_c and tp_c argument of u32_set_parms()
the latter is redundant, the former - never read...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It must be tc_u_common associated with that tp (i.e. tp->data).
Proof:
* both ->ht_up and ->tp_c are assign-once
* ->tp_c of anything inserted into tp_c->hlist is tp_c
* hnodes never get reinserted into the lists or moved
between those, so anything found by u32_lookup_ht(tp->data, ...)
will have ->tp_c equal to tp->data.
* tp->root->tp_c == tp->data.
* ->ht_up of anything inserted into hnode->ht[...] is
equal to hnode.
* knodes never get reinserted into hash chains or moved
between those, so anything returned by u32_lookup_key(ht, ...)
will have ->ht_up equal to ht.
* any knode returned by u32_get(tp, ...) will have ->ht_up->tp_c
point to tp->data
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the only thing we used ht for was ht->tp_c and callers can get that
without going through ->tp_c at all; start with lifting that into
the callers, next commits will massage those, eventually removing
->tp_c altogether.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* calculate key *once*, not for each hash chain element
* let tc_u_hash() return the pointer to chain head rather than index -
callers are cleaner that way.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
unused
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
not used anymore
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tested by modifying iproute2 to allow sending a divisor > 255
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Operation makes no sense. Nothing will actually break if we do so
(depth limit in u32_classify() will prevent infinite loops), but
according to maintainers it's best prohibited outright.
NOTE: doing so guarantees that u32_destroy() will trigger the call
of u32_destroy_hnode(); we might want to make that unconditional.
Test:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 u32 \
link 800: offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff
should fail with
Error: cls_u32: Not linking to root node
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
... and produce consistent error on attempt to delete such.
Existing check in u32_delete() is inconsistent - after
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 \
divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 \
divisor 1
both
tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32
and
tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 800: u32
will fail (at least with refcounting fixes), but the former will complain
about an attempt to remove a busy table, while the latter will recognize
it as root and yield "Not allowed to delete root node" instead.
The problem with the existing check is that several tcf_proto instances
might share the same tp->data and handle-to-hnode lookup will be the same
for all of them. So comparing an hnode to be deleted with tp->root won't
catch the case when one tp is used to try deleting the root of another.
Solution is trivial - mark the root hnodes explicitly upon allocation and
check for that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references
via ->hlist and via ->tp_root together. u32_destroy() drops the former
and, in case when there had been links, leaves the sucker on the list.
As the result, there's nothing to protect it from getting freed once links
are dropped.
That also makes the "is it busy" check incapable of catching the root
hnode - it *is* busy (there's a reference from tp), but we don't see it as
something separate. "Is it our root?" check partially covers that, but
the problem exists for others' roots as well.
AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
include oopsen, that is) would be this:
* count tp->root and tp_c->hlist as separate references. I.e.
have u32_init() set refcount to 2, not 1.
* in u32_destroy() we always drop the former;
in u32_destroy_hnode() - the latter.
That way we have *all* references contributing to refcount. List
removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
an in u32_destroy() in case of tc_u_common going away, along with
everything reachable from it. IOW, that way we know that
u32_destroy_key() won't free something still on the list (or pointed to by
someone's ->root).
Reproducer:
tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \
u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \
u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 100 \
handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \
plus 0 eat match ip protocol 6 ff
tc filter delete dev eth0 parent ffff: protocol ip prio 200
tc filter change dev eth0 parent ffff: protocol ip prio 100 \
handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \
eat match ip protocol 6 ff
tc filter delete dev eth0 parent ffff: protocol ip prio 100
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix 5 warnings and 14 checks issued by checkpatch.pl:
CHECK: Logical continuations should be on the previous line
+ if ((q->vars.qdelay < q->params.target / 2)
+ && (q->vars.prob < MAX_PROB / 5))
WARNING: line over 80 characters
+ q->params.tupdate = usecs_to_jiffies(nla_get_u32(tb[TCA_PIE_TUPDATE]));
CHECK: Blank lines aren't necessary after an open brace '{'
+{
+
CHECK: braces {} should be used on all arms of this statement
+ if (qlen < QUEUE_THRESHOLD)
[...]
+ else {
[...]
CHECK: Unbalanced braces around else statement
+ else {
CHECK: No space is necessary after a cast
+ if (delta > (s32) (MAX_PROB / (100 / 2)) &&
CHECK: Unnecessary parentheses around 'qdelay == 0'
+ if ((qdelay == 0) && (qdelay_old == 0) && update_prob)
CHECK: Unnecessary parentheses around 'qdelay_old == 0'
+ if ((qdelay == 0) && (qdelay_old == 0) && update_prob)
CHECK: Unnecessary parentheses around 'q->vars.prob == 0'
+ if ((q->vars.qdelay < q->params.target / 2) &&
+ (q->vars.qdelay_old < q->params.target / 2) &&
+ (q->vars.prob == 0) &&
+ (q->vars.avg_dq_rate > 0))
CHECK: Unnecessary parentheses around 'q->vars.avg_dq_rate > 0'
+ if ((q->vars.qdelay < q->params.target / 2) &&
+ (q->vars.qdelay_old < q->params.target / 2) &&
+ (q->vars.prob == 0) &&
+ (q->vars.avg_dq_rate > 0))
CHECK: Blank lines aren't necessary before a close brace '}'
+
+}
CHECK: Comparison to NULL could be written "!opts"
+ if (opts == NULL)
CHECK: No space is necessary after a cast
+ ((u32) PSCHED_TICKS2NS(q->params.target)) /
WARNING: line over 80 characters
+ nla_put_u32(skb, TCA_PIE_TUPDATE, jiffies_to_usecs(q->params.tupdate)) ||
CHECK: Blank lines aren't necessary before a close brace '}'
+
+}
CHECK: No space is necessary after a cast
+ .delay = ((u32) PSCHED_TICKS2NS(q->vars.qdelay)) /
WARNING: Missing a blank line after declarations
+ struct sk_buff *skb;
+ skb = qdisc_dequeue_head(sch);
WARNING: Missing a blank line after declarations
+ struct pie_sched_data *q = qdisc_priv(sch);
+ qdisc_reset_queue(sch);
WARNING: Missing a blank line after declarations
+ struct pie_sched_data *q = qdisc_priv(sch);
+ q->params.tupdate = 0;
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As done treewide earlier, this catches several more open-coded
allocation size calculations that were added to the kernel during the
merge window. This performs the following mechanical transformations
using Coccinelle:
kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...)
kvzalloc(a * b, ...) -> kvcalloc(a, b, ...)
devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...)
Signed-off-by: Kees Cook <keescook@chromium.org>
A number of TC attributes are processed without proper validation
(e.g., length checks). Add a tca policy for all input attributes and use
when invoking nlmsg_parse.
The 2 Fixes tags below cover the latest additions. The other attributes
are a string (KIND), nested attribute (OPTIONS which does seem to have
validation in most cases), for dumps only or a flag.
Fixes: 5bc1701881 ("net: sched: introduce multichain support for filters")
Fixes: d47a6b0e7c ("net: sched: introduce ingress/egress block index attributes for qdisc")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit ec3ed293e7 ("net_sched: change tcf_del_walker() to take idrinfo->lock")
we move fl_hw_destroy_tmplt() to a workqueue to avoid blocking
with the spinlock held. Unfortunately, this causes a lot of
troubles here:
1. tcf_chain_destroy() could be called right after we queue the work
but before the work runs. This is a use-after-free.
2. The chain refcnt is already 0, we can't even just hold it again.
We can check refcnt==1 but it is ugly.
3. The chain with refcnt 0 is still visible in its block, which means
it could be still found and used!
4. The block has a refcnt too, we can't hold it without introducing a
proper API either.
We can make it working but the end result is ugly. Instead of wasting
time on reviewing it, let's just convert the troubling spinlock to
a mutex, which allows us to use non-atomic allocations too.
Fixes: ec3ed293e7 ("net_sched: change tcf_del_walker() to take idrinfo->lock")
Reported-by: Ido Schimmel <idosch@idosch.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This traffic scheduler allows traffic classes states (transmission
allowed/not allowed, in the simplest case) to be scheduled, according
to a pre-generated time sequence. This is the basis of the IEEE
802.1Qbv specification.
Example configuration:
tc qdisc replace dev enp3s0 parent root handle 100 taprio \
num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 2@2 \
base-time 1528743495910289987 \
sched-entry S 01 300000 \
sched-entry S 02 300000 \
sched-entry S 04 300000 \
clockid CLOCK_TAI
The configuration format is similar to mqprio. The main difference is
the presence of a schedule, built by multiple "sched-entry"
definitions, each entry has the following format:
sched-entry <CMD> <GATE MASK> <INTERVAL>
The only supported <CMD> is "S", which means "SetGateStates",
following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK>
is a bitmask where each bit is a associated with a traffic class, so
bit 0 (the least significant bit) being "on" means that traffic class
0 is "active" for that schedule entry. <INTERVAL> is a time duration
in nanoseconds that specifies for how long that state defined by <CMD>
and <GATE MASK> should be held before moving to the next entry.
This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.
The other parameters can be defined as follows:
- base-time: specifies the instant when the schedule starts, if
'base-time' is a time in the past, the schedule will start at
base-time + (N * cycle-time)
where N is the smallest integer so the resulting time is greater
than "now", and "cycle-time" is the sum of all the intervals of the
entries in the schedule;
- clockid: specifies the reference clock to be used;
The parameters should be similar to what the IEEE 802.1Q family of
specification defines.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
In the recent TCP/EDT patch series, I switched TCP and sch_fq
clocks from MONOTONIC to TAI, in order to meet the choice done
earlier for sch_etf packet scheduler.
But sure enough, this broke some setups were the TAI clock
jumps forward (by almost 50 year...), as reported
by Leonard Crestez.
If we want to converge later, we'll probably need to add
an skb field to differentiate the clock bases, or a socket option.
In the meantime, an UDP application will need to use CLOCK_MONOTONIC
base for its SCM_TXTIME timestamps if using fq packet scheduler.
Fixes: 72b0094f91 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
Fixes: 142537e419 ("net_sched: sch_fq: switch to CLOCK_TAI")
Fixes: fd2bca2aa7 ("tcp: switch internal pacing timer to CLOCK_TAI")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Leonard Crestez <leonard.crestez@nxp.com>
Tested-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcf_block_find() fails, it already rollbacks the qdisc refcnt,
so its caller doesn't need to clean up this again. Avoid calling
qdisc_put() again by resetting qdisc to NULL for callers.
Reported-by: syzbot+37b8770e6d5a8220a039@syzkaller.appspotmail.com
Fixes: e368fdb61d ("net: sched: use Qdisc rcu API instead of relying on rtnl lock")
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If "td->u.target_size" is larger than sizeof(struct xt_entry_target) we
return -EINVAL. But we don't check whether it's smaller than
sizeof(struct xt_entry_target) and that could lead to an out of bounds
read.
Fixes: 7ba699c604 ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warning:
net/sched/sch_generic.c:944:6: warning:
symbol 'qdisc_free_cb' was not declared. Should it be static?
Fixes: 3a7d0d07a3 ("net: sched: extend Qdisc with rcu")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to remove dependency on rtnl lock on rules update path, always
take reference to block while using it on rules update path. Change
tcf_block_get() error handling to properly release block with reference
counting, instead of just destroying it, in order to accommodate potential
concurrent users.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement get/put function for blocks that only take/release the reference
and perform deallocation. These functions are intended to be used by
unlocked rules update path to always hold reference to block while working
with it. They use on new fine-grained locking mechanisms introduced in
previous patches in this set, instead of relying on global protection
provided by rtnl lock.
Extract code that is common with tcf_block_detach_ext() into common
function __tcf_block_put().
Extend tcf_block with rcu to allow safe deallocation when it is accessed
concurrently.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protect block idr access with spinlock, instead of relying on rtnl lock.
Take tn->idr_lock spinlock during block insertion and removal.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract code that flushes and puts all chains on tcf block to two
standalone function to be shared with functions that locklessly get/put
reference to block.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation for removing rtnl lock dependency from rules update path,
change tcf block reference counter type to refcount_t to allow modification
by concurrent users.
In block put function perform decrement and check reference counter once to
accommodate concurrent modification by unlocked users. After this change
tcf_chain_put at the end of block put function is called with
block->refcnt==0 and will deallocate block after the last chain is
released, so there is no need to manually deallocate block in this case.
However, if block reference counter reached 0 and there are no chains to
release, block must still be deallocated manually.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation from removing rtnl lock dependency from rules update path,
use Qdisc rcu and reference counting capabilities instead of relying on
rtnl lock while working with Qdiscs. Create new tcf_block_release()
function, and use it to free resources taken by tcf_block_find().
Currently, this function only releases Qdisc and it is extended in next
patches in this series.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, Qdisc API functions assume that users have rtnl lock taken. To
implement rtnl unlocked classifiers update interface, Qdisc API must be
extended with functions that do not require rtnl lock.
Extend Qdisc structure with rcu. Implement special version of put function
qdisc_put_unlocked() that is called without rtnl lock taken. This function
only takes rtnl lock if Qdisc reference counter reached zero and is
intended to be used as optimization.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current implementation of qdisc_destroy() decrements Qdisc reference
counter and only actually destroy Qdisc if reference counter value reached
zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
describe the way in which this function currently implemented and used.
Extract code that deallocates Qdisc into new private qdisc_destroy()
function. It is intended to be shared between regular qdisc_put() and its
unlocked version that is introduced in next patch in this series.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add additional counters that will store the bytes/packets processed by
hardware. These will be exported through the netlink interface for
displaying by the iproute2 tc tool
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the earliest departure time model, we no longer plan
special casing TCP retransmits. We therefore remove dead
code (since most compilers understood skb_is_retransmit()
was false)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.
Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.
This has the nice effect of removing some delays caused by FQ
quantum mechanism, causing inflated max/P99 latencies.
Also we might relax TCP Small Queue tight limits in the future,
since this new model allow TCP to build bigger batches, since
sch_fq (or a device with earliest departure time offload) ensure
these packets will be delivered on time.
Note that other protocols are not converted (they will probably
never be) so sch_fq has still support for SO_MAX_PACING_RATE
Tested:
Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.
The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwin after few losses, or/and rtt above few ms)
MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02
After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP will soon provide per skb->tstamp with earliest departure time,
so that sch_fq does not have to determine departure time by looking
at socket sk_pacing_rate.
We chose in linux-4.19 CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Action API was changed to work with actions and action_idr in concurrency
safe manner, however tcf_del_walker() still uses actions without taking a
reference or idrinfo->lock first, and deletes them directly, disregarding
possible concurrent delete.
Change tcf_del_walker() to take idrinfo->lock while iterating over actions
and use new tcf_idr_release_unsafe() to release them while holding the
lock.
And the blocking function fl_hw_destroy_tmplt() could be called when we
put a filter chain, so defer it to a work queue.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
[xiyou.wangcong@gmail.com: heavily modify the code and changelog]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore,
We prefer to use the macro rather than calculating its value.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two new tls tests added in parallel in both net and net-next.
Used Stephen Rothwell's linux-next resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
use RCU instead of spinlocks, to protect concurrent read/write on
act_police configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
use per-CPU counters, instead of sharing a single set of stats with all
cores. This removes the need of using spinlock when statistics are read
or updated.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we delete a chain of filters, we need to notify
user-space we are deleting each filters in this chain
too.
Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cl->leaf.q is slightly more readable than cl->un.leaf.q.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We no longer take any spinlock on RX path for ingress qdisc,
so this lockdep annotation is no longer needed.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change flower in_hw_count type to fixed-size u32 and dump it as
TCA_FLOWER_IN_HW_COUNT. This change is necessary to properly test shared
blocks and re-offload functionality.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead, adjust __qdisc_enqueue_tail() such that HTB can use it
instead.
The only other caller of __qdisc_enqueue_tail() is
qdisc_enqueue_tail() so we can move the backlog and return value
handling (which HTB doesn't need/want) to the latter.
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the new locking rule, we have to take tcf_lock for both
->init() and ->dump(), as RTNL will be removed.
Use tcf spinlock to protect private nat action data from concurrent
modification during dump. (nat init already uses tcf spinlock when changing
action state)
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the new locking rule, we have to take tcf_lock for both
->init() and ->dump(), as RTNL will be removed.
Use tcf lock to protect skbedit action struct private data from concurrent
modification in init and dump. Use rcu swap operation to reassign params
pointer under protection of tcf lock. (old params value is not used by
init, so there is no need of standalone rcu dereference step)
Remove rtnl lock assertion that is no longer required.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When nla_put*() fails after nla_nest_start(), we need
to call nla_nest_cancel() to cancel the message, otherwise
we end up calling nla_nest_end() like a success.
Fixes: 0ed5269f9e ("net/sched: add tunnel option support to act_tunnel_key")
Cc: Davide Caratti <dcaratti@redhat.com>
Cc: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If users try to install act_tunnel_key 'set' rules with duplicate values
of 'index', the tunnel metadata are allocated, but never released. Then,
kmemleak complains as follows:
# tc a a a tunnel_key set src_ip 1.1.1.1 dst_ip 2.2.2.2 id 42 index 111
# echo clear > /sys/kernel/debug/kmemleak
# tc a a a tunnel_key set src_ip 1.1.1.1 dst_ip 2.2.2.2 id 42 index 111
Error: TC IDR already exists.
We have an error talking to the kernel
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800574e6c80 (size 256):
comm "tc", pid 5617, jiffies 4298118009 (age 57.990s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 1c e8 b0 ff ff ff ff ................
81 24 c2 ad ff ff ff ff 00 00 00 00 00 00 00 00 .$..............
backtrace:
[<00000000b7afbf4e>] tunnel_key_init+0x8a5/0x1800 [act_tunnel_key]
[<000000007d98fccd>] tcf_action_init_1+0x698/0xac0
[<0000000099b8f7cc>] tcf_action_init+0x15c/0x590
[<00000000dc60eebe>] tc_ctl_action+0x336/0x5c2
[<000000002f5a2f7d>] rtnetlink_rcv_msg+0x357/0x8e0
[<000000000bfe7575>] netlink_rcv_skb+0x124/0x350
[<00000000edab656f>] netlink_unicast+0x40f/0x5d0
[<00000000b322cdcb>] netlink_sendmsg+0x6e8/0xba0
[<0000000063d9d490>] sock_sendmsg+0xb3/0xf0
[<00000000f0d3315a>] ___sys_sendmsg+0x654/0x960
[<00000000c06cbd42>] __sys_sendmsg+0xd3/0x170
[<00000000ce72e4b0>] do_syscall_64+0xa5/0x470
[<000000005caa2d97>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<00000000fac1b476>] 0xffffffffffffffff
This problem theoretically happens also in case users attempt to setup a
geneve rule having wrong configuration data, or when the kernel fails to
allocate 'params_new'. Ensure that tunnel_key_init() releases the tunnel
metadata also in the above conditions.
Addresses-Coverity-ID: 1373974 ("Resource leak")
Fixes: d0f6dd8a91 ("net/sched: Introduce act_tunnel_key")
Fixes: 0ed5269f9e ("net/sched: add tunnel option support to act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Immediately after module_put(), user could delete this
module, so e->ops could be already freed before we call
e->ops->release().
Fix this by moving module_put() after ops->release().
Fixes: ef6980b6be ("introduce IFE action")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the new locking rule, we have to take tcf_lock
for both ->init() and ->dump(), as RTNL will be removed.
However, it is missing for act_connmark.
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 331a9295de ("net: sched: act: add extack for lookup callback").
This extack is never used after 6 months... In fact, it can be just
set in the caller, right after ->lookup().
Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the commit 802bfb1915 ("net/sched: user-space can't set
unknown tcfa_action values"), unknown tcfa_action values are
converted to TC_ACT_UNSPEC, but the common agreement is instead
rejecting such configurations.
This change also introduces a helper to simplify the destruction
of a single action, avoiding code duplication.
v1 -> v2:
- helper is now static and renamed according to act_* convention
- updated extack message, according to the new behavior
Fixes: 802bfb1915 ("net/sched: user-space can't set unknown tcfa_action values")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When chain 0 was implicitly created, removal of non-existent filter from
chain 0 gave -ENOENT. Once chain 0 became non-implicit, the same call is
giving -EINVAL. Fix this by returning -ENOENT in that case.
Reported-by: Roman Mashak <mrv@mojatatu.com>
Fixes: f71e0ca4db ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Via u32_change(), TCA_U32_SEL has an unspecified type in the netlink
policy, so max length isn't enforced, only minimum. This means nkeys
(from userspace) was being trusted without checking the actual size of
nla_len(), which could lead to a memory over-read, and ultimately an
exposure via a call to u32_dump(). Reachability is CAP_NET_ADMIN within
a namespace.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TC filter flow mapping override completely skipped the call to
cake_hash(); however that meant that the internal state was not being
updated, which ultimately leads to deadlocks in some configurations. Fix
that by passing the overridden flow ID into cake_hash() instead so it can
react appropriately.
In addition, the major number of the class ID can now be set to override
the host mapping in host isolation mode. If both host and flow are
overridden (or if the respective modes are disabled), flow dissection and
hashing will be skipped entirely; otherwise, the hashing will be kept for
the portions that are not set by the filter.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
use_all_metadata() acquires read_lock(&ife_mod_lock), then calls
add_metainfo() which calls find_ife_oplist() which acquires the same
lock again. Deadlock!
Introduce __add_metainfo() which accepts struct tcf_meta_ops *ops
as an additional parameter and let its callers to decide how
to find it. For use_all_metadata(), it already has ops, no
need to find it again, just call __add_metainfo() directly.
And, as ife_mod_lock is only needed for find_ife_oplist(),
this means we can make non-atomic allocation for populate_metalist()
now.
Fixes: 817e9f2c5c ("act_ife: acquire ife_mod_lock before reading ifeoplist")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only time we need to take tcfa_lock is when adding
a new metainfo to an existing ife->metalist. We don't need
to take tcfa_lock so early and so broadly in tcf_ife_init().
This means we can always take ife_mod_lock first, avoid the
reverse locking ordering warning as reported by Vlad.
Reported-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Vlad Buslov <vladbu@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 42c625a486 ("net: sched: act_ife: disable bh
when taking ife_mod_lock"), because what ife_mod_lock protects
is absolutely not touched in rate est timer BH context, they have
no race.
A better fix is following up.
Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 90b73b77d0, list_head is no longer needed.
Now we just need to convert the list iteration to array
iteration for drivers.
Fixes: 90b73b77d0 ("net: sched: change action API to use array of pointers to actions")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_idr_check() is replaced by tcf_idr_check_alloc(),
and __tcf_idr_check() now can be folded into tcf_idr_search().
Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All ops->delete() wants is getting the tn->idrinfo, but we already
have tc_action before calling ops->delete(), and tc_action has
a pointer ->idrinfo.
More importantly, each type of action does the same thing, that is,
just calling tcf_idr_delete_index().
So it can be just removed.
Fixes: b409074e66 ("net: sched: add 'delete' function to action ops")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_action_put_many() is mostly called to clean up actions on
failure path, but tcf_action_put_many(&actions[acts_deleted]) is
used in the ugliest way: it passes a slice of the array and
uses an additional NULL at the end to avoid out-of-bound
access.
acts_deleted is completely unnecessary since we can teach
tcf_action_put_many() scan the whole array and checks against
NULL pointer. Which also means tcf_action_delete() should
set deleted action pointers to NULL to avoid double free.
Fixes: 90b73b77d0 ("net: sched: change action API to use array of pointers to actions")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove including <linux/version.h> that don't need it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Action init API was changed to always take reference to action, even when
overwriting existing action. Substitute conditional action release, which
was executed only if action is newly created, with unconditional release in
tcf_ife_init() error handling code to prevent double free or memory leak in
case of overwrite.
Fixes: 4e8ddd7f17 ("net: sched: don't release reference on action overwrite")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix tcf_unbind_filter missing in cls_matchall as this will trigger
WARN_ON() in cbq_destroy_class().
Fixes: fd62d9f5c5 ("net/sched: matchall: Fix configuration race")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>