Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Support for destination MAC in ipset, from Stefano Brivio.
2) Disallow all-zeroes MAC address in ipset, also from Stefano.
3) Add IPSET_CMD_GET_BYNAME and IPSET_CMD_GET_BYINDEX commands,
introduce protocol version number 7, from Jozsef Kadlecsik.
A follow up patch to fix ip_set_byindex() is also included
in this batch.
4) Honor CTA_MARK_MASK from ctnetlink, from Andreas Jaggi.
5) Statify nf_flow_table_iterate(), from Taehee Yoo.
6) Use nf_flow_table_iterate() to simplify garbage collection in
nf_flow_table logic, also from Taehee Yoo.
7) Don't use _bh variants of call_rcu(), rcu_barrier() and
synchronize_rcu_bh() in Netfilter, from Paul E. McKenney.
8) Remove NFC_* cache definition from the old caching
infrastructure.
9) Remove layer 4 port rover in NAT helpers, use random port
instead, from Florian Westphal.
10) Use strscpy() in ipset, from Qian Cai.
11) Remove NF_NAT_RANGE_PROTO_RANDOM_FULLY branch now that
random port is allocated by default, from Xiaozhou Liu.
12) Ignore NF_NAT_RANGE_PROTO_RANDOM too, from Florian Westphal.
13) Limit port allocation selection routine in NAT to avoid
softlockup splats when most ports are in use, from Florian.
14) Remove unused parameters in nf_ct_l4proto_unregister_sysctl()
from Yafang Shao.
15) Direct call to nf_nat_l4proto_unique_tuple() instead of
indirection, from Florian Westphal.
16) Several patches to remove all layer 4 NAT indirections,
remove nf_nat_l4proto struct, from Florian Westphal.
17) Fix RTP/RTCP source port translation when SNAT is in place,
from Alin Nastac.
18) Selective rule dump per chain, from Phil Sutter.
19) Revisit CLUSTERIP target, this includes a deadlock fix from
netns path, sleep in atomic, remove bogus WARN_ON_ONCE()
and disallow mismatching IP address and MAC address.
Patchset from Taehee Yoo.
20) Update UDP timeout to stream after 2 seconds, from Florian.
21) Shrink UDP established timeout to 120 seconds like TCP timewait.
22) Sysctl knobs to set GRE timeouts, from Yafang Shao.
23) Move seq_print_acct() to conntrack core file, from Florian.
24) Add enum for conntrack sysctl knobs, also from Florian.
25) Place nf_conntrack_acct, nf_conntrack_helper, nf_conntrack_events
and nf_conntrack_timestamp knobs in the core, from Florian Westphal.
As a side effect, shrink netns_ct structure by removing obsolete
sysctl anchors, also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-12-21
The following pull-request contains BPF updates for your *net-next* tree.
There is a merge conflict in test_verifier.c. Result looks as follows:
[...]
},
{
"calls: cross frame pruning",
.insns = {
[...]
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
{
"jset: functional",
.insns = {
[...]
{
"jset: unknown const compare not taken",
.insns = {
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32),
BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "!read_ok",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
[...]
{
"jset: range",
.insns = {
[...]
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.result_unpriv = ACCEPT,
.result = ACCEPT,
},
The main changes are:
1) Various BTF related improvements in order to get line info
working. Meaning, verifier will now annotate the corresponding
BPF C code to the error log, from Martin and Yonghong.
2) Implement support for raw BPF tracepoints in modules, from Matt.
3) Add several improvements to verifier state logic, namely speeding
up stacksafe check, optimizations for stack state equivalence
test and safety checks for liveness analysis, from Alexei.
4) Teach verifier to make use of BPF_JSET instruction, add several
test cases to kselftests and remove nfp specific JSET optimization
now that verifier has awareness, from Jakub.
5) Improve BPF verifier's slot_type marking logic in order to
allow more stack slot sharing, from Jiong.
6) Add sk_msg->size member for context access and add set of fixes
and improvements to make sock_map with kTLS usable with openssl
based applications, from John.
7) Several cleanups and documentation updates in bpftool as well as
auto-mount of tracefs for "bpftool prog tracelog" command,
from Quentin.
8) Include sub-program tags from now on in bpf_prog_info in order to
have a reliable way for user space to get all tags of the program
e.g. needed for kallsyms correlation, from Song.
9) Add BTF annotations for cgroup_local_storage BPF maps and
implement bpf fs pretty print support, from Roman.
10) Fix bpftool in order to allow for cross-compilation, from Ivan.
11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
to be compatible with libbfd and allow for Debian packaging,
from Jakub.
12) Remove an obsolete prog->aux sanitation in dump and get rid of
version check for prog load, from Daniel.
13) Fix a memory leak in libbpf's line info handling, from Prashant.
14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
does not get unaligned, from Jesper.
15) Fix test_progs kselftest to work with older compilers which are less
smart in optimizing (and thus throwing build error), from Stanislav.
16) Cleanup and simplify AF_XDP socket teardown, from Björn.
17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
to netns_id argument, from Andrey.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract "Protocol" field decompression code from transport protocols to
PPP generic layer, where it actually belongs. As a consequence, this
patch fixes incorrect place of PFC decompression in L2TP driver (when
it's not PPPOX_BOUND) and also enables this decompression for other
protocols, like PPPoE.
Protocol field decompression also happens in PPP Multilink Protocol
code and in PPP compression protocols implementations (bsd, deflate,
mppe). It looks like there is no easy way to get rid of that, so it was
decided to leave it as is, but provide those cases with appropriate
comments instead.
Changes in v2:
- Fix the order of checking skb data room and proto decompression
- Remove "inline" keyword from ppp_decompress_proto()
- Don't split line before function name
- Prefix ppp_decompress_proto() function with "__"
- Add ppp_decompress_proto() function with skb data room checks
- Add description for introduced functions
- Fix comments (as per review on mailing list)
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
When user requests to resolve an output route, the kernel synthesizes
an skb where the relevant parameters (e.g., source address) are set. The
skb is then passed to ip_route_output_key_hash_rcu() which might call
into the flow dissector in case a multipath route was hit and a nexthop
needs to be selected based on the multipath hash.
Since both 'skb->dev' and 'skb->sk' are not set, a warning is triggered
in the flow dissector [1]. The warning is there to prevent codepaths
from silently falling back to the standard flow dissector instead of the
BPF one.
Therefore, instead of removing the warning, set 'skb->dev' to the
loopback device, as its not used for anything but resolving the correct
namespace.
[1]
WARNING: CPU: 1 PID: 24819 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x314/0x16b0
...
RSP: 0018:ffffa0df41fdf650 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8bcded232000 RCX: 0000000000000000
RDX: ffffa0df41fdf7e0 RSI: ffffffff98e415a0 RDI: ffff8bcded232000
RBP: ffffa0df41fdf760 R08: 0000000000000000 R09: 0000000000000000
R10: ffffa0df41fdf7e8 R11: ffff8bcdf27a3000 R12: ffffffff98e415a0
R13: ffffa0df41fdf7e0 R14: ffffffff98dd2980 R15: ffffa0df41fdf7e0
FS: 00007f46f6897680(0000) GS:ffff8bcdf7a80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055933e95f9a0 CR3: 000000021e636000 CR4: 00000000001006e0
Call Trace:
fib_multipath_hash+0x28c/0x2d0
? fib_multipath_hash+0x28c/0x2d0
fib_select_path+0x241/0x32f
? __fib_lookup+0x6a/0xb0
ip_route_output_key_hash_rcu+0x650/0xa30
? __alloc_skb+0x9b/0x1d0
inet_rtm_getroute+0x3f7/0xb80
? __alloc_pages_nodemask+0x11c/0x2c0
rtnetlink_rcv_msg+0x1d9/0x2f0
? rtnl_calcit.isra.24+0x120/0x120
netlink_rcv_skb+0x54/0x130
rtnetlink_rcv+0x15/0x20
netlink_unicast+0x20a/0x2c0
netlink_sendmsg+0x2d1/0x3d0
sock_sendmsg+0x39/0x50
___sys_sendmsg+0x2a0/0x2f0
? filemap_map_pages+0x16b/0x360
? __handle_mm_fault+0x108e/0x13d0
__sys_sendmsg+0x63/0xa0
? __sys_sendmsg+0x63/0xa0
__x64_sys_sendmsg+0x1f/0x30
do_syscall_64+0x5a/0x120
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: d0e13a1488 ("flow_dissector: lookup netns by skb->sk if skb->dev is NULL")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the stray semicolon means that the final term in the addition
is being missed. Fix this by removing it. Cleans up clang warning:
net/core/neighbour.c:2821:9: warning: expression result unused [-Wunused-value]
Fixes: 82cbb5c631 ("neighbour: register rtnl doit handler")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-By: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
after moving sysctl handling into single place, the init functions
can't fail anymore and some of the fini functions are empty.
Remove them and change return type to void.
This also simplifies error unwinding in conntrack module init path.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Similar to previous change, this time for eache and timestamp.
Unlike helper and acct, these can be disabled at build time, so they
need ifdef guards.
Next patch will remove a few (now obsolete) functions.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Needless copy&paste, just handle all in one. Next patch will handle
acct and timestamp, which have similar functions.
Intentionally leaves cruft behind, will be cleaned up in a followup
patch.
The obsolete sysctl pointers in netns_ct struct are left in place and
removed in a single change, as changes to netns trigger rebuild of
almost all files.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its a bit hard to see what table[3] really lines up with, so add
human-readable mnemonics and use them for initialisation.
This makes it easier to see e.g. which sysctls are not exported to
unprivileged userns.
objdiff shows no changes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds two sysctl knobs for GRE:
net.netfilter.nf_conntrack_gre_timeout = 30
net.netfilter.nf_conntrack_gre_timeout_stream = 180
Update the Documentation as well.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We have no explicit signal when a UDP stream has terminated, peers just
stop sending.
For suspected stream connections a timeout of two minutes is sane to keep
NAT mapping alive a while longer.
It matches tcp conntracks 'timewait' default timeout value.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently DNS resolvers that send both A and AAAA queries from same source port
can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
for three minutes, even though DNS requests are done in a few milliseconds.
Add a two second grace period where we continue to use the ordinary
30-second default timeout. Its enough for DNS request/response traffic,
even if two request/reply packets are involved.
ASSURED is still set, else conntrack (and thus a possible
NAT mapping ...) gets zapped too in case conntrack table runs full.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The existing code did not expect users would initialize the TLS ULP
without subsequently calling the TLS TX enabling socket option.
If the application tries to send data after the TLS ULP enable op
but before the TLS TX enable op the BPF sk_msg verdict program is
skipped. This patch resolves this by converting the ipv4 sock ops
to be calculated at init time the same way ipv6 ops are done. This
pulls in any changes to the sock ops structure that have been made
after the socket was created including the changes from adding the
socket to a sock{map|hash}.
This was discovered by running OpenSSL master branch which calls
the TLS ULP setsockopt early in TLS handshake but only enables
the TLS TX path once the handshake has completed. As a result the
datapath missed the initial handshake messages.
Fixes: 02c558b2d5 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.
To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.
Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.
In the sendfile case the function do_sendfile() zero's flags,
./fs/read_write.c:
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
...
fl = 0;
#if 0
/*
* We need to debate whether we can enable this or not. The
* man page documents EAGAIN return for the output at least,
* and the application is arguably buggy if it doesn't expect
* EAGAIN on a non-blocking file descriptor.
*/
if (in.file->f_flags & O_NONBLOCK)
fl = SPLICE_F_NONBLOCK;
#endif
file_start_write(out.file);
retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
}
In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.
./fs/splice.c:
static int pipe_to_sendpage(struct pipe_inode_info *pipe,
struct pipe_buffer *buf, struct splice_desc *sd)
{
...
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
...
}
Confirming what we expect that internal flags are in fact internal
to socket side.
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In addition to releasing any cork'ed data on a psock when the psock
is removed we should also release any skb's in the ingress work queue.
Otherwise the skb's eventually get free'd but late in the tear
down process so we see the WARNING due to non-zero sk_forward_alloc.
void sk_stream_kill_queues(struct sock *sk)
{
...
WARN_ON(sk->sk_forward_alloc);
...
}
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When a skb verdict program is in-use and either another BPF program
redirects to that socket or the new SK_PASS support is used the
data_ready callback does not wake up application. Instead because
the stream parser/verdict is using the sk data_ready callback we wake
up the stream parser/verdict block.
Fix this by adding a helper to check if the stream parser block is
enabled on the sk and if so call the saved pointer which is the
upper layers wake up function.
This fixes application stalls observed when an application is waiting
for data in a blocking read().
Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add SK_PASS verdict support to SK_SKB_VERDICT programs. Now that
support for redirects exists we can implement SK_PASS as a redirect
to the same socket. This simplifies the BPF programs and avoids an
extra map lookup on RX path for simple visibility cases.
Further, reduces user (BPF programmer in this context) confusion
when their program drops skb due to lack of support.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Enforce comment on structure layout dependency with a BUILD_BUG_ON
to ensure the condition is maintained.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The check for max offset in sk_msg_is_valid_access uses sizeof()
which is incorrect because it would allow accessing possibly
past the end of the struct in the padded case. Further, it doesn't
preclude accessing any padding that may be added in the middle of
a struct. All told this makes it fragile to rely on.
To fix this explicitly check offsets with fields using the
bpf_ctx_range() and bpf_ctx_range_till() macros.
For reference the current structure layout looks as follows (reported
by pahole)
struct sk_msg_md {
union {
void * data; /* 8 */
}; /* 0 8 */
union {
void * data_end; /* 8 */
}; /* 8 8 */
__u32 family; /* 16 4 */
__u32 remote_ip4; /* 20 4 */
__u32 local_ip4; /* 24 4 */
__u32 remote_ip6[4]; /* 28 16 */
__u32 local_ip6[4]; /* 44 16 */
__u32 remote_port; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
__u32 local_port; /* 64 4 */
__u32 size; /* 68 4 */
/* size: 72, cachelines: 2, members: 10 */
/* last cacheline: 8 bytes */
};
So there should be no padding at the moment but fixing this now
prevents future errors.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-12-20
Two last patches for this release cycle:
1) Remove an unused variable in xfrm_policy_lookup_bytype().
From YueHaibing.
2) Fix possible infinite loop in __xfrm6_tunnel_alloc_spi().
Also from YueHaibing.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add NDA_PROTOCOL to nda_policy and use the policy for attribute parsing and
validation for adding neighbors and in dump requests. Remove the now duplicate
checks on nla_len.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Off by one in netlink parsing of mac802154_hwsim, from Alexander
Aring.
2) nf_tables RCU usage fix from Taehee Yoo.
3) Flow dissector needs nhoff and thoff clamping, from Stanislav
Fomichev.
4) Missing sin6_flowinfo initialization in SCTP, from Xin Long.
5) Spectrev1 in ipmr and ip6mr, from Gustavo A. R. Silva.
6) Fix r8169 crash when DEBUG_SHIRQ is enabled, from Heiner Kallweit.
7) Fix SKB leak in rtlwifi, from Larry Finger.
8) Fix state pruning in bpf verifier, from Jakub Kicinski.
9) Don't handle completely duplicate fragments as overlapping, from
Michal Kubecek.
10) Fix memory corruption with macb and 64-bit DMA, from Anssi Hannula.
11) Fix TCP fallback socket release in smc, from Myungho Jung.
12) gro_cells_destroy needs to napi_disable, from Lorenzo Bianconi.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (130 commits)
rds: Fix warning.
neighbor: NTF_PROXY is a valid ndm_flag for a dump request
net: mvpp2: fix the phylink mode validation
net/sched: cls_flower: Remove old entries from rhashtable
net/tls: allocate tls context using GFP_ATOMIC
iptunnel: make TUNNEL_FLAGS available in uapi
gro_cell: add napi_disable in gro_cells_destroy
lan743x: Remove MAC Reset from initialization
net/mlx5e: Remove the false indication of software timestamping support
net/mlx5: Typo fix in del_sw_hw_rule
net/mlx5e: RX, Fix wrong early return in receive queue poll
ipv6: explicitly initialize udp6_addr in udp_sock_create6()
bnxt_en: Fix ethtool self-test loopback.
net/rds: remove user triggered WARN_ON in rds_sendmsg
net/rds: fix warn in rds_message_alloc_sgs
ath10k: skip sending quiet mode cmd for WCN3990
mac80211: free skb fraglist before freeing the skb
nl80211: fix memory leak if validate_pae_over_nl80211() fails
net/smc: fix TCP fallback socket release
vxge: ensure data0 is initialized in when fetching firmware version information
...
>> net/rds/send.c:1109:42: warning: Using plain integer as NULL pointer
Fixes: ea010070d0 ("net/rds: fix warn in rds_message_alloc_sgs")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dumping proxy entries the dump request has NTF_PROXY set in
ndm_flags. strict mode checking needs to be updated to allow this
flag.
Fixes: 51183d233b ("net/neighbor: Update neigh_dump_info for strict data checking")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pneigh_lookup uses kmalloc versus kzalloc when new entries are allocated.
Given that the newly added protocol field needs to be initialized.
Fixes: df9b0e30d4 ("neighbor: Add protocol attribute")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When replacing a rule we add the new rule to the rhashtable
but only remove the old if not in skip_sw.
This commit fix this and remove the old rule anyway.
Fixes: 35cc3cefc4 ("net/sched: cls_flower: Reject duplicated rules also under skip_sw")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch registers neigh doit handler. The doit handler
returns a neigh entry given dst and dev. This is similar
to route and fdb doit (get) handlers. Also moves nda_policy
declaration from rtnetlink.c to neighbour.c
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior this commit, when the struct socket object was being released,
the UMEM did not have its reference count decreased. Instead, this was
done in the struct sock sk_destruct function.
There is no reason to keep the UMEM reference around when the socket
is being orphaned, so in this patch the xdp_put_mem is called in the
xsk_release function. This results in that the xsk_destruct function
can be removed!
Note that, it still holds that a struct xsk_sock reference might still
linger in the XSKMAP after the UMEM is released, e.g. if a user does
not clear the XSKMAP prior to closing the process. This sock will be
in a "released" zombie like state, until the XSKMAP is removed.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
syzbot reported the use of uninitialized udp6_addr::sin6_scope_id.
We can just set ::sin6_scope_id to zero, as tunnels are unlikely
to use an IPv6 address that needs a scope id and there is no
interface to bind in this context.
For net-next, it looks different as we have cfg->bind_ifindex there
so we can probably call ipv6_iface_scope_id().
Same for ::sin6_flowinfo, tunnels don't use it.
Fixes: 8024e02879 ("udp: Add udp_sock_create for UDP tunnels to open listener socket")
Reported-by: syzbot+c56449ed3652e6720f30@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending broadcast message on high load system, there are a lot of
unnecessary packets restranmission. That issue was caused by missing in
initial criteria for retransmission.
To prevent this happen, just initialize this criteria for retransmission
in next 10 milliseconds.
Fixes: 31c4f4cc32 ("tipc: improve broadcast retransmission algorithm")
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_event for TIPC bearer, L2 device event:
trace_tipc_l2_device_event()
Also, it puts the trace at the tipc_l2_device_event() function, then
the device/bearer events and related info can be traced out during
runtime when needed.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_events for TIPC node object:
trace_tipc_node_create()
trace_tipc_node_delete()
trace_tipc_node_lost_contact()
trace_tipc_node_timeout()
trace_tipc_node_link_up()
trace_tipc_node_link_down()
trace_tipc_node_reset_links()
trace_tipc_node_fsm_evt()
trace_tipc_node_check_state()
Also, enables the traces for the following cases:
- When a node is created/deleted;
- When a node contact is lost;
- When a node timer is timed out;
- When a node link is up/down;
- When all node links are reset;
- When node state is changed;
- When a skb comes and node state needs to be checked/updated.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_events for TIPC socket object:
trace_tipc_sk_create()
trace_tipc_sk_poll()
trace_tipc_sk_sendmsg()
trace_tipc_sk_sendmcast()
trace_tipc_sk_sendstream()
trace_tipc_sk_filter_rcv()
trace_tipc_sk_advance_rx()
trace_tipc_sk_rej_msg()
trace_tipc_sk_drop_msg()
trace_tipc_sk_release()
trace_tipc_sk_shutdown()
trace_tipc_sk_overlimit1()
trace_tipc_sk_overlimit2()
Also, enables the traces for the following cases:
- When user creates a TIPC socket;
- When user calls poll() on TIPC socket;
- When user sends a dgram/mcast/stream message.
- When a message is put into the socket 'sk_receive_queue';
- When a message is released from the socket 'sk_receive_queue';
- When a message is rejected (e.g. due to no port, invalid, etc.);
- When a message is dropped (e.g. due to wrong message type);
- When socket is released;
- When socket is shutdown;
- When socket rcvq's allocation is overlimit (> 90%);
- When socket rcvq + bklq's allocation is overlimit (> 90%);
- When the 'TIPC_ERR_OVERLOAD/2' issue happens;
Note:
a) All the socket traces are designed to be able to trace on a specific
socket by either using the 'event filtering' feature on a known socket
'portid' value or the sysctl file:
/proc/sys/net/tipc/sk_filter
The file determines a 'tuple' for what socket should be traced:
(portid, sock type, name type, name lower, name upper)
where:
+ 'portid' is the socket portid generated at socket creating, can be
found in the trace outputs or the 'tipc socket list' command printouts;
+ 'sock type' is the socket type (1 = SOCK_TREAM, ...);
+ 'name type', 'name lower' and 'name upper' are the service name being
connected to or published by the socket.
Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e.
the traces happen for every sockets with no filter.
b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event
which happens when the socket receive queue (and backlog queue) is
about to be overloaded, when the queue allocation is > 90%. Then, when
the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2
issue can be traced.
The trace event is designed as an 'upper watermark' notification that
the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or
actions can be triggerred in the meanwhile to see what is going on with
the socket queue.
In addition, the 'trace_tipc_sk_dump()' is also placed at the
'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped
for post-analysis.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit adds the new trace_events for TIPC link object:
trace_tipc_link_timeout()
trace_tipc_link_fsm()
trace_tipc_link_reset()
trace_tipc_link_too_silent()
trace_tipc_link_retrans()
trace_tipc_link_bc_ack()
trace_tipc_link_conges()
And the traces for PROTOCOL messages at building and receiving:
trace_tipc_proto_build()
trace_tipc_proto_rcv()
Note:
a) The 'tipc_link_too_silent' event will only happen when the
'silent_intv_cnt' is about to reach the 'abort_limit' value (and the
event is enabled). The benefit for this kind of event is that we can
get an early indication about TIPC link loss issue due to timeout, then
can do some necessary actions for troubleshooting.
For example: To trigger the 'tipc_proto_rcv' when the 'too_silent'
event occurs:
echo 'enable_event:tipc:tipc_proto_rcv' > \
events/tipc/tipc_link_too_silent/trigger
And disable it when TIPC link is reset:
echo 'disable_event:tipc:tipc_proto_rcv' > \
events/tipc/tipc_link_reset/trigger
b) The 'tipc_link_retrans' or 'tipc_link_bc_ack' event is useful to
trace TIPC retransmission issues.
In addition, the commit adds the 'trace_tipc_list/link_dump()' at the
'retransmission failure' case. Then, if the issue occurs, the link
'transmq' along with the link data can be dumped for post-analysis.
These dump events should be enabled by default since it will only take
effect when the failure happens.
The same approach is also applied for the faulty case that the
validation of protocol message is failed.
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove skb->sp and allocate secpath storage via extension
infrastructure. This also reduces sk_buff by 8 bytes on x86_64.
Total size of allyesconfig kernel is reduced slightly, as there is
less inlined code (one conditional atomic op instead of two on
skb_clone).
No differences in throughput in following ipsec performance tests:
- transport mode with aes on 10GB link
- tunnel mode between two network namespaces with aes and null cipher
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
secpath_set is a wrapper for secpath_dup that will not perform
an allocation if the secpath attached to the skb has a reference count
of one, i.e., it doesn't need to be COW'ed.
Also, secpath_dup doesn't attach the secpath to the skb, it leaves
this to the caller.
Use secpath_set in places that immediately assign the return value to
skb.
This allows to remove skb->sp without touching these spots again.
secpath_dup can eventually be removed in followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will reduce noise when skb->sp is removed later in this series.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Future patch will remove skb->sp pointer.
To reduce noise in those patches, move existing helper to
sk_buff and use it in more places to ease skb->sp replacement later.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can only return 0 (success) or -ENOMEM.
Change return value to a pointer to secpath struct.
This avoids direct access to skb->sp:
err = secpath_set(skb);
if (!err) ..
skb->sp-> ...
Becomes:
sp = secpath_set(skb)
if (!sp) ..
sp-> ..
This reduces noise in followup patch which is going to remove skb->sp.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure.
The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.
objdiff shows no changes if kernel is built without xfrm and br_netfilter
support.
The third (planned future) user is Multipath TCP which is still
out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.
Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.
mptcp has same requirements as secpath/bridge netfilter:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.
The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
mapping for tx and rx processing.
Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
are available for this skb.
This has two purposes.
a) avoids the need to initialize the pointer.
b) allows to "delete" an extension by clearing its bit
value in ->active_extensions.
While it would be possible to store the active_extensions byte
in the extension struct instead of sk_buff, there is one problem
with this:
When an extension has to be disabled, we can always clear the
bit in skb->active_extensions. But in case it would be stored in the
extension buffer itself, we might have to COW it first, if
we are dealing with a cloned skb. On kmalloc failure we would
be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
If the active_extensions byte is 0, the pointer is undefined,
it is not initialized on skb allocation.
This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.
It is possible to add support for extensions that are not preseved on
clones/copies.
To do this, it would be needed to define a bitmask of all extensions that
need copy/cow semantics, and change __skb_ext_copy() to check
->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
->active_extensions to 0 on the new clone.
This isn't done here because all extensions that get added here
need the copy/cow semantics.
v2:
Allocate entire extension space using kmem_cache.
Upside is that this allows better tracking of used memory,
downside is that we will allocate more space than strictly needed in
most cases (its unlikely that all extensions are active/needed at same
time for same skb).
The allocated memory (except the small extension header) is not cleared,
so no additonal overhead aside from memory usage.
Avoid atomic_dec_and_test operation on skb_ext_put()
by using similar trick as kfree_skbmem() does with fclone_ref:
If recount is 1, there is no concurrent user and we can free right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
per comment from Leon in rdma mailing list
https://lkml.org/lkml/2018/10/31/312 :
Please don't forget to remove user triggered WARN_ON.
https://lwn.net/Articles/769365/
"Greg Kroah-Hartman raised the problem of core kernel API code that will
use WARN_ON_ONCE() to complain about bad usage; that will not generate
the desired result if WARN_ON_ONCE() is configured to crash the machine.
He was told that the code should just call pr_warn() instead, and that
the called function should return an error in such situations. It was
generally agreed that any WARN_ON() or WARN_ON_ONCE() calls that can be
triggered from user space need to be fixed."
in addition harden rds_sendmsg to detect and overcome issues with
invalid sg count and fail the sendmsg.
Suggested-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>