We never use RCU protection for it, just a lot of cargo-cult
rcu_deference_protects calls.
Note that we do keep the kfree_rcu call for it, as the references through
struct sock are RCU protected and thus might require a grace period before
freeing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The seg6local LWT provides the End.DT6 action, which allows to
decapsulate an outer IPv6 header containing a Segment Routing Header
(SRH), full specification is available here:
https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-05
This patch adds this action now to the seg6local BPF
interface. Since it is not mandatory that the inner IPv6 header also
contains a SRH, seg6_bpf_srh_state has been extended with a pointer to
a possible SRH of the outermost IPv6 header. This helps assessing if the
validation must be triggered or not, and avoids some calls to
ipv6_find_hdr.
v3: s/1/true, s/0/false for boolean values
v2: - changed true/false -> 1/0
- preempt_enable no longer called in first conditional block
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This is similar TC_ACT_REDIRECT, but with a slightly different
semantic:
- on ingress the mirred skbs are passed to the target device
network stack without any additional check not scrubbing.
- the rcu-protected stats provided via the tcf_result struct
are updated on error conditions.
This new tcfa_action value is not exposed to the user-space
and can be used only internally by clsact.
v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce
a new action type instead
v2 -> v3:
- rename the new action value TC_ACT_REINJECT, update the
helper accordingly
- take care of uncloned reinjected packets in XDP generic
hook
v3 -> v4:
- renamed again the new action value (JiriP)
v4 -> v5:
- fix build error with !NET_CLS_ACT (kbuild bot)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each lockless action currently does its own RCU locking in ->act().
This allows using plain RCU accessor, even if the context
is really RCU BH.
This change drops the per action RCU lock, replace the accessors
with the _bh variant, cleans up a bit the surrounding code and
documents the RCU status in the relevant header.
No functional nor performance change is intended.
The goal of this patch is clarifying that the RCU critical section
used by the tc actions extends up to the classifier's caller.
v1 -> v2:
- preserve rcu lock in act_bpf: it's needed by eBPF helpers,
as pointed out by Daniel
v3 -> v4:
- fixed some typos in the commit message (JiriP)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fold it into the only caller to make the code simpler and easier to read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no point in hiding this logic in a helper. Also remove the
useless events != 0 check and only busy loop once we know we actually
have a poll method.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
For any open socket file descriptor sock->sk->sk_wq->wait will always
point to sock->wq->wait. That means we can do the shorter dereference
and removal a NULL check and don't have to not worry about any RCU
protection.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The wait_address argument is always directly derived from the filp
argument, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new quirk HCI_QUIRK_NON_PERSISTENT_SETUP allowing that a quirk that
runs setup() after every open() and not just after the first open().
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds support for advertising in primary and secondary
channel on different PHYs. User can add the phy preference in
the flag based on which phy type will be added in extended
advertising parameter would be set.
@ MGMT Command: Add Advertising (0x003e) plen 11
Instance: 1
Flags: 0x00000200
Advertise in CODED on Secondary channel
Duration: 0
Timeout: 0
Advertising data length: 0
Scan response length: 0
< HCI Command: LE Set Extended Advertising Enable (0x08|0x0039) plen 2
Extended advertising: Disabled (0x00)
Number of sets: Disable all sets (0x00)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Advertising Enable (0x08|0x0039) ncmd 2
Status: Success (0x00)
< HCI Command: LE Set Extended Advertising Parameters (0x08|0x0036) plen 25
Handle: 0x00
Properties: 0x0000
Min advertising interval: 1280.000 msec (0x0800)
Max advertising interval: 1280.000 msec (0x0800)
Channel map: 37, 38, 39 (0x07)
Own address type: Random (0x01)
Peer address type: Public (0x00)
Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00)
TX power: 127 dbm (0x7f)
Primary PHY: LE Coded (0x03)
Secondary max skip: 0x00
Secondary PHY: LE Coded (0x03)
SID: 0x00
Scan request notifications: Disabled (0x00)
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This event comes after connection complete event for incoming
connections. Since we now have different random address for
each instance, conn resp address is assigned from this event.
As of now only connection part is handled as we are not
enabling duration or max num of events while starting ext adv.
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This basically sets the random address for the adv instance
Random address can be set only if the instance is created which
is done in Set ext adv param.
Random address and rpa expire timer and flags have been added
to adv instance which will be used when the respective
instance is scheduled.
This introduces a hci_get_random_address() which returns the
own address type and random address (rpa or nrpa) based
on the instance flags and hdev flags. New function is required
since own address type should be known before setting adv params
but address can be set only after setting params.
< HCI Command: LE Set Advertising Set Random Address (0x08|0x0035) plen 7
Advertising handle: 0x00
Advertising random address: 3C:8E:56:9B:77:84 (OUI 3C-8E-56)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Advertising Set Random Address (0x08|0x0035) ncmd 1
Status: Success (0x00)
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
If ext adv is enabled then use ext adv to disable as well.
Also remove the adv set during LE disable.
< HCI Command: LE Set Extended Advertising Enable (0x08|0x0039) plen 2
Extended advertising: Disabled (0x00)
Number of sets: Disable all sets (0x00)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Advertising Enable (0x08|0x0039) ncmd 2
Status: Success (0x00)
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch implements Set Ext Adv data and Set Ext Scan rsp data
if controller support extended advertising.
Currently the operation is set as Complete data and fragment
preference is set as no fragment
< HCI Command: LE Set Extended Advertising Data (0x08|0x0037) plen 35
Handle: 0x00
Operation: Complete extended advertising data (0x03)
Fragment preference: Minimize fragmentation (0x01)
Data length: 0x15
16-bit Service UUIDs (complete): 2 entries
Heart Rate (0x180d)
Battery Service (0x180f)
Name (complete): Test LE
Company: Google (224)
Data: 0102
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Advertising Data (0x08|0x0037) ncmd 1
Status: Success (0x00)
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch basically replaces legacy adv with extended adv
based on the controller support. Currently there is no
design change. ie only one adv set will be enabled at a time.
This also adds tx_power in instance and store whatever returns
from Set_ext_parameter, use the same in adv data as well.
For instance 0 tx_power is stored in hdev only.
< HCI Command: LE Set Extended Advertising Parameters (0x08|0x0036) plen 25
Handle: 0x00
Properties: 0x0010
Use legacy advertising PDUs: ADV_NONCONN_IND
Min advertising interval: 1280.000 msec (0x0800)
Max advertising interval: 1280.000 msec (0x0800)
Channel map: 37, 38, 39 (0x07)
Own address type: Random (0x01)
Peer address type: Public (0x00)
Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00)
TX power: 127 dbm (0x7f)
Primary PHY: LE 1M (0x01)
Secondary max skip: 0x00
Secondary PHY: LE 1M (0x01)
SID: 0x00
Scan request notifications: Disabled (0x00)
> HCI Event: Command Complete (0x0e) plen 5
LE Set Extended Advertising Parameters (0x08|0x0036) ncmd 1
Status: Success (0x00)
TX power (selected): 7 dbm (0x07)
< HCI Command: LE Set Extended Advertising Enable (0x08|0x0039) plen 6
Extended advertising: Enabled (0x01)
Number of sets: 1 (0x01)
Entry 0
Handle: 0x00
Duration: 0 ms (0x00)
Max ext adv events: 0
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Advertising Enable (0x08|0x0039) ncmd 2
Status: Success (0x00)
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch reads the number of advertising sets in the controller
during init and save it in hdev.
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This patch defines the extended ADV types and handle it in ADV report.
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This also add macros for checking LMP support for different
pkt_types
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
1M is mandatory to be supported by LE controllers and the same
would be set in power on. This patch defines hdev flags for
LE PHYs and set 1M to default.
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently nft uses inlined variants for common operations
such as 'ip saddr 1.2.3.4' instead of an indirect call.
Also handle meta get operations and lookups without indirect call,
both are builtin.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On ingress, a network device such as a switch assigns to packets
priority based on various criteria. Common options include interpreting
PCP and DSCP fields according to user configuration. When a packet
egresses the switch, a reverse process may rewrite PCP and/or DSCP
values according to packet priority.
The following three functions support a) obtaining a DSCP-to-priority
map or vice versa, and b) finding default-priority entries in APP
database.
The DCB subsystem supports for APP entries a very generous M:N mapping
between priorities and protocol identifiers. Understandably,
several (say) DSCP values can map to the same priority. But this
asymmetry holds the other way around as well--one priority can map to
several DSCP values. For this reason, the following functions operate in
terms of bitmaps, with ones in positions that match some APP entry.
- dcb_ieee_getapp_dscp_prio_mask_map() to compute for a given netdevice
a map of DSCP-to-priority-mask, which gives for each DSCP value a
bitmap of priorities related to that DSCP value by APP, along the
lines of dcb_ieee_getapp_mask().
- dcb_ieee_getapp_prio_dscp_mask_map() similarly to compute for a given
netdevice a map from priorities to a bitmap of DSCPs.
- dcb_ieee_getapp_default_prio_mask() which finds all default-priority
rules for a given port in APP database, and returns a mask of
priorities allowed by these default-priority rules.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case a chain is empty and not explicitly created by a user,
such chain should not exist. The only exception is if there is
an action "goto chain" pointing to it. In that case, don't show the
chain in the dump. Track the chain references held by actions and
use them to find out if a chain should or should not be shown
in chain dump.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-07-27
1) Extend the output_mark to also support the input direction
and masking the mark values before applying to the skb.
2) Add a new lookup key for the upcomming xfrm interfaces.
3) Extend the xfrm lookups to match xfrm interface IDs.
4) Add virtual xfrm interfaces. The purpose of these interfaces
is to overcome the design limitations that the existing
VTI devices have.
The main limitations that we see with the current VTI are the
following:
VTI interfaces are L3 tunnels with configurable endpoints.
For xfrm, the tunnel endpoint are already determined by the SA.
So the VTI tunnel endpoints must be either the same as on the
SA or wildcards. In case VTI tunnel endpoints are same as on
the SA, we get a one to one correlation between the SA and
the tunnel. So each SA needs its own tunnel interface.
On the other hand, we can have only one VTI tunnel with
wildcard src/dst tunnel endpoints in the system because the
lookup is based on the tunnel endpoints. The existing tunnel
lookup won't work with multiple tunnels with wildcard
tunnel endpoints. Some usecases require more than on
VTI tunnel of this type, for example if somebody has multiple
namespaces and every namespace requires such a VTI.
VTI needs separate interfaces for IPv4 and IPv6 tunnels.
So when routing to a VTI, we have to know to which address
family this traffic class is going to be encapsulated.
This is a lmitation because it makes routing more complex
and it is not always possible to know what happens behind the
VTI, e.g. when the VTI is move to some namespace.
VTI works just with tunnel mode SAs. We need generic interfaces
that ensures transfomation, regardless of the xfrm mode and
the encapsulated address family.
VTI is configured with a combination GRE keys and xfrm marks.
With this we have to deal with some extra cases in the generic
tunnel lookup because the GRE keys on the VTI are actually
not GRE keys, the GRE keys were just reused for something else.
All extensions to the VTI interfaces would require to add
even more complexity to the generic tunnel lookup.
So to overcome this, we developed xfrm interfaces with the
following design goal:
It should be possible to tunnel IPv4 and IPv6 through the same
interface.
No limitation on xfrm mode (tunnel, transport and beet).
Should be a generic virtual interface that ensures IPsec
transformation, no need to know what happens behind the
interface.
Interfaces should be configured with a new key that must match a
new policy/SA lookup key.
The lookup logic should stay in the xfrm codebase, no need to
change or extend generic routing and tunnel lookups.
Should be possible to use IPsec hardware offloads of the underlying
interface.
5) Remove xfrm pcpu policy cache. This was added after the flowcache
removal, but it turned out to make things even worse.
From Florian Westphal.
6) Allow to update the set mark on SA updates.
From Nathan Harold.
7) Convert some timestamps to time64_t.
From Arnd Bergmann.
8) Don't check the offload_handle in xfrm code,
it is an opaque data cookie for the driver.
From Shannon Nelson.
9) Remove xfrmi interface ID from flowi. After this pach
no generic code is touched anymore to do xfrm interface
lookups. From Benedict Wong.
10) Allow to update the xfrm interface ID on SA updates.
From Nathan Harold.
11) Don't pass zero to ERR_PTR() in xfrm_resolve_and_create_bundle.
From YueHaibing.
12) Return more detailed errors on xfrm interface creation.
From Benedict Wong.
13) Use PTR_ERR_OR_ZERO instead of IS_ERR + PTR_ERR.
From the kbuild test robot.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Make sure we don't go over the maximum jump stack boundary,
from Taehee Yoo.
2) Missing rcu_barrier() in hash and rbtree sets, also from Taehee.
3) Missing check to nul-node in rbtree timeout routine, from Taehee.
4) Use dev->name from flowtable to fix a memleak, from Florian.
5) Oneliner to free flowtable object on removal, from Florian.
6) Memleak in chain rename transaction, again from Florian.
7) Don't allow two chains to use the same name in the same
transaction, from Florian.
8) handle DCCP SYNC/SYNCACK as invalid, this triggers an
uninitialized timer in conntrack reported by syzbot, from Florian.
9) Fix leak in case netlink_dump_start() fails, from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
* always keep regulatory user hint
* add missing break statement in station flags parsing
* fix non-linear SKBs in port-control-over-nl80211
* reconfigure VLAN stations during HW restart
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAltW0eYACgkQB8qZga/f
l8SXvQ/5AYXK9qZuwhxIdJ5zpgcidDiWMl+aoaWeAdQF+giDTyB+tiZ7GyIYeJca
+4OJoC90DEH9Yq835rNIYXqGkcT12vSwJgaBp79hx0ch3ufwZqBepdX8DcCd0+18
NAP8IlRoSiGZ/GlpNxYCkXFMUEoRSxvbk4BXowlDwiN/h4ziCRnUkD1OaL6mqHsI
XTRjVRTQCekurc1o74s4kGX3Nj2ir3JJEGaD9Ha5Jsc18OQxx2fs5fgaGl3ffJaR
vJTiqvPPAvdymeV/S+J9BoCe8xAgO7Z0EDC3tbB7rlZ629XEJp9L91s2EN1Od8m3
k1pxq+nT4pci5UVnRtlAyHrr06VxhIYCRBRc+qvjXOTDIkEVF1UEQH9RExi4XNp5
OmcdXQ7fc1UhQfzy4vkM5EQamaxgJzAWZ9SGRIf4RmpUcux4ZTRDjoclwEkATXqr
C2lxDuIkHtClqY9BL+0f2TEio1drEawf99iWD4bSmRL+gng5LK8yxm0rAKFrnG81
kNfxb9Qg4OyJhl0764u2pDyK9YErFNPnxHEsNBCD3DT4Q/9yKIZzmgi4GJKZyWYe
x09+1zFsS/xtg2n0SOaYlLtPNPAKC7AkLqZ+FtHreIjVbKTgW+dEgniscQdUAfYT
nY09uPzbabEyNaF/fuZCOofu3YignVLWlr6Z+PG1sIKqp6qaPQY=
=qtIc
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2018-07-24' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Only a few fixes:
* always keep regulatory user hint
* add missing break statement in station flags parsing
* fix non-linear SKBs in port-control-over-nl80211
* reconfigure VLAN stations during HW restart
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a couple of flower offload commands in order to propagate
template creation/destruction events down to device drivers.
Drivers may use this information to prepare HW in an optimal way
for future filter insertions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow user to set a template for newly created chains. Template lock
down the chain for particular classifier type/options combinations.
The classifier needs to support templates, otherwise kernel would
reply with error.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow user to create, destroy, get and dump chain objects. Do that by
extending rtnl commands by the chain-specific ones. User will now be
able to explicitly create or destroy chains (so far this was done only
automatically according the filter/act needs and refcounting). Also, the
user will receive notification about any chain creation or destuction.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, chain 0 is implicitly created during block creation. However
that does not align with chain object exposure, creation and destruction
api introduced later on. So make the chain 0 behave the same way as any
other chain and only create it when it is needed. Since chain 0 is
somehow special as the qdiscs need to hold pointer to the first chain
tp, this requires to move the chain head change callback infra to the
block structure.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have offload replay infrastructure added by
commit 326367427c ("net: sched: call reoffload op on block callback reg")
and flows are guaranteed to be removed correctly, we can revert
commit 951a8ee6de ("nfp: reject binding to shared blocks").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Example setup:
host: ip -6 addr add dev eth1 2001:db8:104::4
where eth1 is enslaved to a VRF
switch: ip -6 ro add 2001:db8:104::4/128 dev br1
where br1 only has an LLA
ping6 2001:db8:104::4
ssh 2001:db8:104::4
(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and ifindex is set to the index of eth1 with a destination an
LLA).
For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
L3 master. If it is then return the ifindex from rt6i_idev similar
to what is done for loopback.
For TCP, restore the original tcp_v6_iif definition which is needed in
most places and add a new tcp_v6_iif_l3_slave that considers the
l3_slave variability. This latter check is only needed for socket
lookups.
Fixes: 9ff7438460 ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.
__task_pid_nr_ns has been updated to take advantage of this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Make net_ns_get_ownership() reusable by networking code outside of core.
This is useful, for example, to allow bridge related sysfs files to be
owned by container root.
Add a function comment since this is a potentially dangerous function to
use given the way that kobject_get_ownership() works by initializing uid
and gid before calling .get_ownership().
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next
tree:
1) No need to set ttl from reject action for the bridge family, from
Taehee Yoo.
2) Use a fixed timeout for flow that are passed up from the flowtable
to conntrack, from Florian Westphal.
3) More preparation patches for tproxy support for nf_tables, from Mate
Eckl.
4) Remove unnecessary indirection in core IPv6 checksum function, from
Florian Westphal.
5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it.
From Florian Westphal.
6) socket match now selects socket infrastructure, instead of depending
on it. From Mate Eckl.
7) Patch series to simplify conntrack tuple building/parsing from packet
path and ctnetlink, from Florian Westphal.
8) Fetch timeout policy from protocol helpers, instead of doing it from
core, from Florian Westphal.
9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from
Florian Westphal.
10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES
respectively, instead of IPV6. Patch from Mate Eckl.
11) Add specific function for garbage collection in conncount,
from Yi-Hung Wei.
12) Catch number of elements in the connlimit list, from Yi-Hung Wei.
13) Move locking to nf_conncount, from Yi-Hung Wei.
14) Series of patches to add lockless tree traversal in nf_conncount,
from Yi-Hung Wei.
15) Resolve clash in matching conntracks when race happens, from
Martynas Pumputis.
16) If connection entry times out, remove template entry from the
ip_vs_conn_tab table to improve behaviour under flood, from
Julian Anastasov.
17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng.
18) Call abort from 2-phase commit protocol before requesting modules,
make sure this is done under the mutex, from Florian Westphal.
19) Grab module reference when starting transaction, also from Florian.
20) Dynamically allocate expression info array for pre-parsing, from
Florian.
21) Add per netns mutex for nf_tables, from Florian Westphal.
22) A couple of patches to simplify and refactor nf_osf code to prepare
for nft_osf support.
23) Break evaluation on missing socket, from Mate Eckl.
24) Allow to match socket mark from nft_socket, from Mate Eckl.
25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is
built-in into nf_conntrack. From Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:
""" When receiving packets, the CE codepoint MUST be processed as follows:
1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
true and send an immediate ACK.
2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
to false and send an immediate ACK.
"""
Previously DCTCP implementation may continue to delay the ACK. This
patch fixes that to implement the RFC by forcing an immediate ACK.
Tested with this packetdrill script provided by Larry Brakmo
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001
0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257
+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001
+0.500 < F. 9501:9501(0) ack 4 win 257
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).
Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.
The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
no need to store the name in separate area.
Furthermore, it uses kmalloc but not kfree and most accesses seem to treat
it as char[IFNAMSIZ] not char *.
Remove this and use dev->name instead.
In case event zeroed dev, just omit the name in the dump.
Fixes: d92191aa84 ("netfilter: nf_tables: cache device name in flowtable object")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In order to remove performance impact of having the extra u32 in every
single flowi, this change removes the flowi_xfrm struct, prefering to
take the if_id as a method parameter where needed.
In the inbound direction, if_id is only needed during the
__xfrm_check_policy() function, and the if_id can be determined at that
point based on the skb. As such, xfrmi_decode_session() is only called
with the skb in __xfrm_check_policy().
In the outbound direction, the only place where if_id is needed is the
xfrm_lookup() call in xfrmi_xmit2(). With this change, the if_id is
directly passed into the xfrm_lookup_with_ifid() call. All existing
callers can still call xfrm_lookup(), which uses a default if_id of 0.
This change does not change any behavior of XFRMIs except for improving
overall system performance via flowi size reduction.
This change has been tested against the Android Kernel Networking Tests:
https://android.googlesource.com/kernel/tests/+/master/net/test
Signed-off-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Add dissection of the tos and ttl from the ip tunnel headers
fields in case a match is needed on them.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rol32 call is currently rotating hash but the rol'd value is
being discarded. I believe the current code is incorrect and hash
should be assigned the rotated value returned from rol32.
Thanks to David Lebrun for spotting this.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We avoid 2 VLAs by using a pre-allocated field in dsa_switch. We also
try to avoid dynamic allocation whenever possible (when using fewer than
bits-per-long ports, which is the common case).
Link: http://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Link: http://lkml.kernel.org/r/20180505185145.GB32630@lunn.ch
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
[kees: tweak commit subject and message slightly]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPV6=m
DEFRAG_IPV6=m
CONNTRACK=y yields:
net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable'
net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6'
Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params
ip6_frag_init and ip6_expire_frag_queue so it would be needed to force
IPV6=y too.
This patch gets rid of the 'followup linker error' by removing
the dependency of ipv6.ko symbols from netfilter ipv6 defrag.
Shared code is placed into a header, then used from both.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Continue to use nftnl subsys mutex to protect (un)registration of hook types,
expressions and so on, but force batch operations to do their own
locking.
This allows distinct net namespaces to perform transactions in parallel.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The param helper of nf_ct_helper_ext_add is useless now, then remove
it now.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
cp->state was not used for templates. Add support for state bits
and for the first "assured" bit which indicates that some
connection controlled by this template was established or assured
by the real server. In a followup patch we will use it to drop
templates under SYN attack.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In preparation for followup patches, provide just the cp
ptr to ip_vs_state_name.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch is originally from Florian Westphal.
This patch does the following 3 main tasks.
1) Add list lock to 'struct nf_conncount_list' so that we can
alter the lists containing the individual connections without holding the
main tree lock. It would be useful when we only need to add/remove to/from
a list without allocate/remove a node in the tree. With this change, we
update nft_connlimit accordingly since we longer need to maintain
a list lock in nft_connlimit now.
2) Use RCU for the initial tree search to improve tree look up performance.
3) Add a garbage collection worker. This worker is schedule when there
are excessive tree node that needed to be recycled.
Moreover,the rbnode reclaim logic is moved from search tree to insert tree
to avoid race condition.
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch is originally from Florian Westphal.
This patch does the following three tasks.
It applies the same early exit technique for nf_conncount_lookup().
Since now we keep the number of connections in 'struct nf_conncount_list',
we no longer need to return the count in nf_conncount_lookup().
Moreover, we expose the garbage collection function nf_conncount_gc_list()
for nft_connlimit.
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Original patch is from Florian Westphal.
This patch switches from hlist to plain list to store the list of
connections with the same filtering key in nf_conncount. With the
plain list, we can insert new connections at the tail, so over time
the beginning of list holds long-running connections and those are
expired, while the newly creates ones are at the end.
Later on, we could probably move checked ones to the end of the list,
so the next run has higher chance to reclaim stale entries in the front.
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This unifies ipv4 and ipv6 protocol trackers and removes the l3proto
abstraction.
This gets rid of all l3proto indirect calls and the need to do
a lookup on the function to call for l3 demux.
It increases module size by only a small amount (12kbyte), so this reduces
size because nf_conntrack.ko is useless without either nf_conntrack_ipv4
or nf_conntrack_ipv6 module.
before:
text data bss dec hex filename
7357 1088 0 8445 20fd nf_conntrack_ipv4.ko
7405 1084 4 8493 212d nf_conntrack_ipv6.ko
72614 13689 236 86539 1520b nf_conntrack.ko
19K nf_conntrack_ipv4.ko
19K nf_conntrack_ipv6.ko
179K nf_conntrack.ko
after:
text data bss dec hex filename
79277 13937 236 93450 16d0a nf_conntrack.ko
191K nf_conntrack.ko
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This an IPv6 version patch of "ipv4/igmp: init group mode as INCLUDE when
join source group". From RFC3810, part 6.1:
If no per-interface state existed for that
multicast address before the change (i.e., the change consisted of
creating a new per-interface record), or if no state exists after the
change (i.e., the change consisted of deleting a per-interface
record), then the "non-existent" state is considered to have an
INCLUDE filter mode and an empty source list.
Which means a new multicast group should start with state IN(). Currently,
for MLDv2 SSM JOIN_SOURCE_GROUP mode, we first call ipv6_sock_mc_join(),
then ip6_mc_source(), which will trigger a TO_IN() message instead of
ALLOW().
The issue was exposed by commit a052517a8f ("net/multicast: should not
send source list records when have filter mode change"). Before this change,
we sent both ALLOW(A) and TO_IN(A). Now, we only send TO_IN(A).
Fix it by adding a new parameter to init group mode. Also add some wrapper
functions to avoid changing too much code.
v1 -> v2:
In the first version I only cleared the group change record. But this is not
enough. Because when a new group join, it will init as EXCLUDE and trigger
a filter mode change in ip/ip6_mc_add_src(), which will clear all source
addresses sf_crcount. This will prevent early joined address sending state
change records if multi source addressed joined at the same time.
In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
for IPv4 and IPv6.
There is also a difference between v4 and v6 version. For IPv6, when the
interface goes down and up, we will send correct state change record with
unspecified IPv6 address (::) with function ipv6_mc_up(). But after DAD is
completed, we resend the change record TO_IN() in mld_send_initial_cr().
Fix it by sending ALLOW() for INCLUDE mode in mld_send_initial_cr().
Fixes: a052517a8f ("net/multicast: should not send source list records when have filter mode change")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not needed, we can have the l4trackers fetch it themselvs.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Handle the common cases (tcp, udp, etc). in the core and only
do the indirect call for the protocols that need it (GRE for instance).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Handle it in the core instead.
ipv6_skip_exthdr() is built-in even if ipv6 is a module, i.e. this
doesn't create an ipv6 dependency.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its simpler to just handle it directly in nf_ct_invert_tuple().
Also gets rid of need to pass l3proto pointer to resolve_conntrack().
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
handle everything from ctnetlink directly.
After all these years we still only support ipv4 and ipv6, so it
seems reasonable to remove l3 protocol tracker support and instead
handle ipv4/ipv6 from a common, always builtin inet tracker.
Step 1: Get rid of all the l3proto->func() calls.
Start with ctnetlink, then move on to packet-path ones.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These versions deal with the l3proto/l4proto details internally.
It removes only caller of nf_ct_get_tuple, so make it static.
After this, l3proto->get_l4proto() can be removed in a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This function is also necessary to implement nft tproxy support
Fixes: 45ca4e0cf2 ("netfilter: Libify xt_TPROXY")
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch completes the generic infrastructure to offload TLS crypto to a
network device. It enables the kernel to skip decryption and
authentication of some skbs marked as decrypted by the NIC. In the fast
path, all packets received are decrypted by the NIC and the performance
is comparable to plain TCP.
This infrastructure doesn't require a TCP offload engine. Instead, the
NIC only decrypts packets that contain the expected TCP sequence number.
Out-Of-Order TCP packets are provided unmodified. As a result, at the
worst case a received TLS record consists of both plaintext and ciphertext
packets. These partially decrypted records must be reencrypted,
only to be decrypted.
The notable differences between SW KTLS Rx and this offload are as
follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW after HW lost track of TLS record framing in
the TCP stream.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits tls_sw_release_resources_rx into two functions one
which releases all inner software tls structures and another that also
frees the containing structure.
In TLS_DEVICE we will need to release the software structures without
freeeing the containing structure, which contains other information.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, decrypt_skb also updated the TLS context.
Now, decrypt_skb only decrypts the payload using the current context,
while decrypt_skb_update also updates the state.
Later, in the tls_device Rx flow, we will use decrypt_skb directly.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For symmetry, we rename tls_offload_context to
tls_offload_context_tx before we add tls_offload_context_rx.
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-15
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various different arm32 JIT improvements in order to optimize code emission
and make the JIT code itself more robust, from Russell.
2) Support simultaneous driver and offloaded XDP in order to allow for advanced
use-cases where some work is offloaded to the NIC and some to the host. Also
add ability for bpftool to load programs and maps beyond just the cgroup case,
from Jakub.
3) Add BPF JIT support in nfp for multiplication as well as division. For the
latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.
4) Add BTF pretty print functionality to bpftool in plain and JSON output
format, from Okash.
5) Add build and installation to the BPF helper man page into bpftool, from Quentin.
6) Add a TCP BPF callback for listening sockets which is triggered right after
the socket transitions to TCP_LISTEN state, from Andrey.
7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
tree and prints all attached programs, from Roman.
8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
packets, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer needed
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend struct tcf_walker with additional 'cookie' field. It is intended to
be used by classifier walk implementations to continue iteration directly
from particular filter, instead of iterating 'skip' number of times.
Change flower walk implementation to save filter handle in 'cookie'. Each
time flower walk is called, it looks up filter with saved handle directly
with idr, instead of iterating over filter linked list 'skip' number of
times. This change improves complexity of dumping flower classifier from
quadratic to linearithmic. (assuming idr lookup has logarithmic complexity)
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reported-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Basic operations drivers perform during xdp setup and query can
be moved to helpers in the core. Encapsulate program and flags
into a structure and add helpers. Note that the structure is
intended as the "main" program information source in the driver.
Most drivers will additionally place the program pointer in their
fast path or ring structures.
The helpers don't have a huge impact now, but they will
decrease the code duplication when programs can be installed
in HW and driver at the same time. Encapsulating the basic
operations in helpers will hopefully also reduce the number
of changes to drivers which adopt them.
Helpers could really be static inline, but they depend on
definition of struct netdev_bpf which means they'd have
to be placed in netdevice.h, an already 4500 line header.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The Link Quality Indication data exposed by drivers could not be accessed from
userspace. Since this data is per-datagram received, it makes sense to make it
available to userspace application through the ancillary data mechanism in
recvmsg rather than through ioctls. This can be activated using the socket
option WPAN_WANTLQI under SOL_IEEE802154 protocol.
This LQI data is available in the ancillary data buffer under the SOL_IEEE802154
level as the type WPAN_LQI. The value is an unsigned byte indicating the link
quality with values ranging 0-255.
Signed-off-by: Romuald Cari <romuald.cari@devialet.com>
Signed-off-by: Clément Peron <clement.peron@devialet.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
region_snapshot - When set enables capturing region snapshots
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each device address region can store multiple snapshots,
each snapshot is identified using a different numerical ID.
This ID is used when deleting a snapshot or showing an address
region specific snapshot. This patch exposes a callback to add
a new snapshot to an address region.
The snapshot will be deleted using the destructor function
when destroying a region or when a snapshot delete command
from devlink user tool.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To restrict the driver with the snapshot ID selection a new callback
is introduced for the driver to get the snapshot ID before creating
a new snapshot. This will also allow giving the same ID for multiple
snapshots taken of different regions on the same time.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows a device to register its supported address regions.
Each address region can be accessed directly for example reading
the snapshots taken of this address space.
Drivers are not limited in the name selection for different regions.
An example of a region-name can be: pci cr-space, register-space.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
use RCU instead of spin_{,un}lock_bh, to protect concurrent read/write on
act_skbedit configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using get_seconds() for timestamps is deprecated since it can lead
to overflows on 32-bit systems. While the interface generally doesn't
overflow until year 2106, the specific implementation of the TCP PAWS
algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
overflow.
A related problem is that the local timestamps in CLOCK_REALTIME form
lead to unexpected behavior when settimeofday is called to set the system
clock backwards or forwards by more than 24 days.
While the first problem could be solved by using an overflow-safe method
of comparing the timestamps, a nicer solution is to use a monotonic
clocksource with ktime_get_seconds() that simply doesn't overflow (at
least not until 136 years after boot) and that doesn't change during
settimeofday().
To make 32-bit and 64-bit architectures behave the same way here, and
also save a few bytes in the tcp_options_received structure, I'm changing
the type to a 32-bit integer, which is now safe on all architectures.
Finally, the ts_recent_stamp field also (confusingly) gets used to store
a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
This is currently safe, but changing the type to 32-bit requires
some small changes there to keep it working.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LAG devices (team or bond) recognize for each one of their slave devices
whether LAG traffic is going to be sent through that device. Bond calls
such devices "active", team calls them "txable". When this state
changes, a NETDEV_CHANGELOWERSTATE notification is distributed, together
with a netdev_notifier_changelowerstate_info structure that for LAG
devices includes a tx_enabled flag that refers to the new state. The
notification thus makes it possible to react to the changes in txability
in drivers.
However there's no way to query txability from the outside on demand.
That is problematic namely for mlxsw, which when resolving ERSPAN packet
path, may encounter a LAG device, and needs to determine which of the
slaves it should choose.
To that end, introduce a new function, net_lag_port_dev_txable(), which
determines whether a given slave device is "active" or
"txable" (depending on the flavor of the LAG device). That function then
dispatches to per-LAG-flavor helpers, bond_is_active_slave_dev() resp.
team_port_dev_txable().
Because there currently is no good place where net_lag_port_dev_txable()
should be added, introduce a new header file, lag.h, which should from
now on hold any logic common to both team and bond. (But keep
netif_is_lag_master() together with the rest of netif_is_*_master()
functions).
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Congestion control algorithms, which access the rate sample
through the tcp_cong_control function, only have access to the maximum
of the send and receive interval, for cases where the acknowledgment
rate may be inaccurate due to ACK compression or decimation. Algorithms
may want to use send rates and receive rates as separate signals.
Signed-off-by: Deepti Raghavan <deeptir@mit.edu>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
get_seconds() is deprecated because it can overflow on 32-bit
architectures. For the xfrm_state->lastused member, we treat the data
as a 64-bit number already, so we just need to use the right accessor
that works on both 32-bit and 64-bit machines.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree:
1) Missing module autoloadfor icmp and icmpv6 x_tables matches,
from Florian Westphal.
2) Possible non-linear access to TCP header from tproxy, from
Mate Eckl.
3) Do not allow rbtree to be used for single elements, this patch
moves all set backend into one single module since such thing
can only happen if hashtable module is explicitly blacklisted,
which should not ever be done.
4) Reject error and standard targets from nft_compat for sanity
reasons, they are never used from there.
5) Don't crash on double hashsize module parameter, from Andrey
Ryabinin.
6) Drop dst on skb before placing it in the fragmentation
reassembly queue, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf 2018-07-07
The following pull-request contains BPF updates for your *net* tree.
Plenty of fixes for different components:
1) A set of critical fixes for sockmap and sockhash, from John Fastabend.
2) fixes for several race conditions in af_xdp, from Magnus Karlsson.
3) hash map refcnt fix, from Mauricio Vasquez.
4) samples/bpf fixes, from Taeung Song.
5) ifup+mtu check for xdp_redirect, from Toshiaki Makita.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Act API used linked list to pass set of actions to functions. It is
intrusive data structure that stores list nodes inside action structure
itself, which means it is not safe to modify such list concurrently.
However, action API doesn't use any linked list specific operations on this
set of actions, so it can be safely refactored into plain pointer array.
Refactor action API to use array of pointers to tc_actions instead of
linked list. Change argument 'actions' type of exported action init,
destroy and dump functions.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement function that atomically checks if action exists and either takes
reference to it, or allocates idr slot for action index to prevent
concurrent allocations of actions with same index. Use EBUSY error pointer
to indicate that idr slot is reserved.
Implement cleanup helper function that removes temporary error pointer from
idr. (in case of error between idr allocation and insertion of newly
created action to specified index)
Refactor all action init functions to insert new action to idr using this
API.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend action ops with 'delete' function. Each action type to implements
its own delete function that doesn't depend on rtnl lock.
Implement delete function that is required to delete actions without
holding rtnl lock. Use action API function that atomically deletes action
only if it is still in action idr. This implementation prevents concurrent
threads from deleting same action twice.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement new action API function that atomically finds and deletes action
from idr by index. Intended to be used by lockless actions that do not rely
on rtnl lock.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add additional 'rtnl_held' argument to act API init functions. It is
required to implement actions that need to release rtnl lock before loading
kernel module and reacquire if afterwards.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change type of action reference counter to refcount_t.
Change type of action bind counter to atomic_t.
This type is used to allow decrementing bind counter without testing
for 0 result.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement functions to atomically update and free action cookie
using rcu mechanism.
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit
'bpf: bpf_compute_data uses incorrect cb structure' (8108a77515)
we added the routine bpf_compute_data_end_sk_skb() to compute the
correct data_end values, but this has since been lost. In kernel
v4.14 this was correct and the above patch was applied in it
entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
we lost the piece that renamed bpf_compute_data_pointers to the
new function bpf_compute_data_end_sk_skb. This was done here,
e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
When it conflicted with the following rename patch,
6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Finally, after a refactor I thought even the function
bpf_compute_data_end_sk_skb() was no longer needed and it was
erroneously removed.
However, we never reverted the sk_skb_convert_ctx_access() usage of
tcp_skb_cb which had been committed and survived the merge conflict.
Here we fix this by adding back the helper and *_data_end_sk_skb()
usage. Using the bpf_skc_data_end mapping is not correct because it
expects a qdisc_skb_cb object but at the sock layer this is not the
case. Even though it happens to work here because we don't overwrite
any data in-use at the socket layer and the cb structure is cleared
later this has potential to create some subtle issues. But, even
more concretely the filter.c access check uses tcp_skb_cb.
And by some act of chance though,
struct bpf_skb_data_end {
struct qdisc_skb_cb qdisc_cb; /* 0 28 */
/* XXX 4 bytes hole, try to pack */
void * data_meta; /* 32 8 */
void * data_end; /* 40 8 */
/* size: 48, cachelines: 1, members: 3 */
/* sum members: 44, holes: 1, sum holes: 4 */
/* last cacheline: 48 bytes */
};
and then tcp_skb_cb,
struct tcp_skb_cb {
[...]
struct {
__u32 flags; /* 24 4 */
struct sock * sk_redir; /* 32 8 */
void * data_end; /* 40 8 */
} bpf; /* 24 */
};
So when we use offset_of() to track down the byte offset we get 40 in
either case and everything continues to work. Fix this mess and use
correct structures its unclear how long this might actually work for
until someone moves the structs around.
Reported-by: Martin KaFai Lau <kafai@fb.com>
Fixes: e1ea2f9856 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Fixes: 6aaae2b6c4 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dissect the QinQ packets to get both outer and inner vlan information,
then store to the extended flow keys.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change vlan dissector key to save vlan tpid to support both 802.1Q
and 802.1AD ethertype.
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A newly added dummy helper function tries to return a failure from a "void"
function:
In file included from include/net/dsa.h:24,
from arch/arm/plat-orion/common.c:21:
include/net/devlink.h: In function 'devlink_param_value_changed':
include/net/devlink.h:771:9: error: 'return' with a value, in function returning void [-Werror]
return -EOPNOTSUPP;
This fixes it by removing the bogus statement.
Fixes: ea601e1709 ("devlink: Add devlink notifications support for params")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>