Commit Graph

827372 Commits

Author SHA1 Message Date
Matt Mullins 9df1c28bb7 bpf: add writable context for raw tracepoints
This is an opt-in interface that allows a tracepoint to provide a safe
buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
The size of the buffer must be a compile-time constant, and is checked
before allowing a BPF program to attach to a tracepoint that uses this
feature.

The pointer to this buffer will be the first argument of tracepoints
that opt in; the pointer is valid and can be bpf_probe_read() by both
BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
programs that attach to such a tracepoint, but the buffer to which it
points may only be written by the latter.

Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26 19:04:19 -07:00
Daniel Borkmann 34b8ab091f bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd
Since ARMv8.1 supplement introduced LSE atomic instructions back in 2016,
lets add support for STADD and use that in favor of LDXR / STXR loop for
the XADD mapping if available. STADD is encoded as an alias for LDADD with
XZR as the destination register, therefore add LDADD to the instruction
encoder along with STADD as special case and use it in the JIT for CPUs
that advertise LSE atomics in CPUID register. If immediate offset in the
BPF XADD insn is 0, then use dst register directly instead of temporary
one.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26 18:53:40 -07:00
Daniel Borkmann 8968c67a82 bpf, arm64: remove prefetch insn in xadd mapping
Prefetch-with-intent-to-write is currently part of the XADD mapping in
the AArch64 JIT and follows the kernel's implementation of atomic_add.
This may interfere with other threads executing the LDXR/STXR loop,
leading to potential starvation and fairness issues. Drop the optional
prefetch instruction.

Fixes: 85f68fe898 ("bpf, arm64: implement jiting of BPF_XADD")
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26 18:53:15 -07:00
Alexei Starovoitov 0c0cad2c28 Merge branch 'btf-dump'
Andrii Nakryiko says:

====================
This patch set adds a new `bpftool btf dump` sub-command, which allows to dump
BTF contents (only types for now). Currently it only outputs low-level
content, almost 1:1 with binary BTF format, but follow up patches will add
ability to dump BTF types as a compilable C header file. JSON output is
supported as well.

Patch #1 adds `btf` sub-command, dumping BTF types in human-readable format.
It also implements reading .BTF data from ELF file.
Patch #2 adds minimal documentation with output format examples and different
ways to specify source of BTF data.
Patch #3 adds support for btf command in bash-completion/bpftool script.
Patch #4 fixes minor indentation issue in bash-completion script.

Output format is mostly following existing format of BPF verifier log, but
deviates from it in few places. More details are in commit message for patch 1.

Example of output for all supported BTF kinds are in patch #2 as part of
documentation. Some field names are quite verbose and I'd rather shorten them,
if we don't feel like being very close to BPF verifier names is a necessity,
but in this patch I left them exactly the same as in verifier log.

v3->v4:
  - reverse Christmas tree (Quentin)
  - better docs (Quentin)

v2->v3:
  - make map's key|value|kv|all suggestion more precise (Quentin)
  - fix default case indentations (Quentin)

v1->v2:
  - fix unnecessary trailing whitespaces in bpftool-btf.rst (Yonghong)
  - add btf in main.c for a list of possible OBJECTs
  - handle unknown keyword under `bpftool btf dump` (Yonghong)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 21:45:15 -07:00
Andrii Nakryiko 8ed1875bf3 bpftool: fix indendation in bash-completion/bpftool
Fix misaligned default case branch for `prog dump` sub-command.

Reported-by: Quentin Monnet <quentin.monnet@netronome.com>
Cc: Yonghong Song <yhs@fb.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 21:45:14 -07:00
Andrii Nakryiko 4a714feefd bpftool: add bash completions for btf command
Add full support for btf command in bash-completion script.

Cc: Quentin Monnet <quentin.monnet@netronome.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 21:45:14 -07:00
Andrii Nakryiko ca253339af bpftool/docs: add btf sub-command documentation
Document usage and sample output format for `btf dump` sub-command.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 21:45:14 -07:00
Andrii Nakryiko c93cc69004 bpftool: add ability to dump BTF types
Add new `btf dump` sub-command to bpftool. It allows to dump
human-readable low-level BTF types representation of BTF types. BTF can
be retrieved from few different sources:
  - from BTF object by ID;
  - from PROG, if it has associated BTF;
  - from MAP, if it has associated BTF data; it's possible to narrow
    down types to either key type, value type, both, or all BTF types;
  - from ELF file (.BTF section).

Output format mostly follows BPF verifier log format with few notable
exceptions:
  - all the type/field/param/etc names are enclosed in single quotes to
    allow easier grepping and to stand out a little bit more;
  - FUNC_PROTO output follows STRUCT/UNION/ENUM format of having one
    line per each argument; this is more uniform and allows easy
    grepping, as opposed to succinct, but inconvenient format that BPF
    verifier log is using.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 21:45:14 -07:00
Benjamin Poirier 77d764263d bpftool: Fix errno variable usage
The test meant to use the saved value of errno. Given the current code, it
makes no practical difference however.

Fixes: bf598a8f0f ("bpftool: Improve handling of ENOENT on map dumps")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 17:30:14 -07:00
Stanislav Fomichev 7f0c57fec8 bpftool: show flow_dissector attachment status
Right now there is no way to query whether BPF flow_dissector program
is attached to a network namespace or not. In previous commit, I added
support for querying that info, show it when doing `bpftool net`:

$ bpftool prog loadall ./bpf_flow.o \
	/sys/fs/bpf/flow type flow_dissector \
	pinmaps /sys/fs/bpf/flow
$ bpftool prog
3: flow_dissector  name _dissect  tag 8c9e917b513dd5cc  gpl
        loaded_at 2019-04-23T16:14:48-0700  uid 0
        xlated 656B  jited 461B  memlock 4096B  map_ids 1,2
        btf_id 1
...

$ bpftool net -j
[{"xdp":[],"tc":[],"flow_dissector":[]}]

$ bpftool prog attach pinned \
	/sys/fs/bpf/flow/flow_dissector flow_dissector
$ bpftool net -j
[{"xdp":[],"tc":[],"flow_dissector":["id":3]}]

Doesn't show up in a different net namespace:
$ ip netns add test
$ ip netns exec test bpftool net -j
[{"xdp":[],"tc":[],"flow_dissector":[]}]

Non-json output:
$ bpftool net
xdp:

tc:

flow_dissector:
id 3

v2:
* initialization order (Jakub Kicinski)
* clear errno for batch mode (Quentin Monnet)

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-25 23:49:06 +02:00
Stanislav Fomichev 118c8e9ae6 bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_type
target_fd is target namespace. If there is a flow dissector BPF program
attached to that namespace, its (single) id is returned.

v5:
* drop net ref right after rcu unlock (Daniel Borkmann)

v4:
* add missing put_net (Jann Horn)

v3:
* add missing inline to skb_flow_dissector_prog_query static def
  (kbuild test robot <lkp@intel.com>)

v2:
* don't sleep in rcu critical section (Jakub Kicinski)
* check input prog_cnt (exit early)

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-25 23:49:06 +02:00
Daniel T. Lee ead442a0f9 samples: bpf: add hbm sample to .gitignore
This commit adds hbm to .gitignore which is
currently ommited from the ignore file.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-25 23:38:00 +02:00
Daniel T. Lee 32e621e554 libbpf: fix samples/bpf build failure due to undefined UINT32_MAX
Currently, building bpf samples will cause the following error.

    ./tools/lib/bpf/bpf.h:132:27: error: 'UINT32_MAX' undeclared here (not in a function) ..
     #define BPF_LOG_BUF_SIZE (UINT32_MAX >> 8) /* verifier maximum in kernels <= 5.1 */
                               ^
    ./samples/bpf/bpf_load.h:31:25: note: in expansion of macro 'BPF_LOG_BUF_SIZE'
     extern char bpf_log_buf[BPF_LOG_BUF_SIZE];
                             ^~~~~~~~~~~~~~~~

Due to commit 4519efa6f8 ("libbpf: fix BPF_LOG_BUF_SIZE off-by-one error")
hard-coded size of BPF_LOG_BUF_SIZE has been replaced with UINT32_MAX which is
defined in <stdint.h> header.

Even with this change, bpf selftests are running fine since these are built
with clang and it includes header(-idirafter) from clang/6.0.0/include.
(it has <stdint.h>)

    clang -I. -I./include/uapi -I../../../include/uapi -idirafter /usr/local/include -idirafter /usr/include \
    -idirafter /usr/lib/llvm-6.0/lib/clang/6.0.0/include -idirafter /usr/include/x86_64-linux-gnu \
    -Wno-compare-distinct-pointer-types -O2 -target bpf -emit-llvm -c progs/test_sysctl_prog.c -o - | \
    llc -march=bpf -mcpu=generic  -filetype=obj -o /linux/tools/testing/selftests/bpf/test_sysctl_prog.o

But bpf samples are compiled with GCC, and it only searches and includes
headers declared at the target file. As '#include <stdint.h>' hasn't been
declared in tools/lib/bpf/bpf.h, it causes build failure of bpf samples.

    gcc -Wp,-MD,./samples/bpf/.sockex3_user.o.d -Wall -Wmissing-prototypes -Wstrict-prototypes \
    -O2 -fomit-frame-pointer -std=gnu89 -I./usr/include -I./tools/lib/ -I./tools/testing/selftests/bpf/ \
    -I./tools/  lib/ -I./tools/include -I./tools/perf -c -o ./samples/bpf/sockex3_user.o ./samples/bpf/sockex3_user.c;

This commit add declaration of '#include <stdint.h>' to tools/lib/bpf/bpf.h
to fix this problem.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-25 23:34:10 +02:00
Alexei Starovoitov 0e33d334df Merge branch 'libbpf-fixes'
Daniel Borkmann says:

====================
Two small fixes in relation to global data handling. Thanks!
====================

Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 13:47:30 -07:00
Daniel Borkmann 4f8827d2b6 bpf, libbpf: fix segfault in bpf_object__init_maps' pr_debug statement
Ran into it while testing; in bpf_object__init_maps() data can be NULL
in the case where no map section is present. Therefore we simply cannot
access data->d_size before NULL test. Move the pr_debug() where it's
safe to access.

Fixes: d859900c4c ("bpf, libbpf: support global data/bss/rodata sections")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 13:47:29 -07:00
Daniel Borkmann 8837fe5dd0 bpf, libbpf: handle old kernels more graceful wrt global data sections
Andrii reported a corner case where e.g. global static data is present
in the BPF ELF file in form of .data/.bss/.rodata section, but without
any relocations to it. Such programs could be loaded before commit
d859900c4c ("bpf, libbpf: support global data/bss/rodata sections"),
whereas afterwards if kernel lacks support then loading would fail.

Add a probing mechanism which skips setting up libbpf internal maps
in case of missing kernel support. In presence of relocation entries,
we abort the load attempt.

Fixes: d859900c4c ("bpf, libbpf: support global data/bss/rodata sections")
Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 13:47:29 -07:00
Daniel Borkmann a21b48a2f2 Merge branch 'bpf-proto-fixes'
Willem de Bruijn says:

====================
Expand the tc tunnel encap support with protocols that convert the
network layer protocol, such as 6in4. This is analogous to existing
support in bpf_skb_proto_6_to_4.

Patch 1 implements the straightforward logic
Patch 2 tests it with a 6in4 tunnel

Changes v1->v2
  - improve documentation in test
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-24 01:32:28 +02:00
Willem de Bruijn f6ad6accaa selftests/bpf: expand test_tc_tunnel with SIT encap
So far, all BPF tc tunnel testcases encapsulate in the same network
protocol. Add an encap testcase that requires updating skb->protocol.

The 6in4 tunnel encapsulates an IPv6 packet inside an IPv4 tunnel.
Verify that bpf_skb_net_grow correctly updates skb->protocol to
select the right protocol handler in __netif_receive_skb_core.

The BPF program should also manually update the link layer header to
encode the right network protocol.

Changes v1->v2
  - improve documentation of non-obvious logic

Signed-off-by: Willem de Bruijn <willemb@google.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-24 01:32:26 +02:00
Willem de Bruijn 1b00e0dfe7 bpf: update skb->protocol in bpf_skb_net_grow
Some tunnels, like sit, change the network protocol of packet.
If so, update skb->protocol to match the new type.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-24 01:32:26 +02:00
Daniel Borkmann 2aad32613c Merge branch 'bpf-eth-get-headlen'
Stanislav Fomichev says:

====================
Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
skb. Because we use passed skb to lookup associated networking namespace
to find whether we have a BPF program attached or not, we always use
C-based flow dissector in this case.

The goal of this patch series is to add new networking namespace argument
to the eth_get_headlen and make BPF flow dissector programs be able to
work in the skb-less case.

The series goes like this:
* use new kernel context (struct bpf_flow_dissector) for flow dissector
  programs; this makes it easy to distinguish between skb and no-skb
  case and supports calling BPF flow dissector on a chunk of raw data
* convert BPF_PROG_TEST_RUN to use raw data
* plumb network namespace into __skb_flow_dissect from all callers
* handle no-skb case in __skb_flow_dissect
* update eth_get_headlen to include net namespace argument and
  convert all existing users
* add selftest to make sure bpf_skb_load_bytes is not allowed in
  the no-skb mode
* extend test_progs to exercise skb-less flow dissection as well
* stop adjusting nhoff/thoff by ETH_HLEN in BPF_PROG_TEST_RUN

v6:
* more suggestions by Alexei:
  * eth_get_headlen now takes net dev, not net namespace
  * test skb-less case via tun eth_get_headlen
* fix return errors in bpf_flow_load
* don't adjust nhoff/thoff by ETH_HLEN

v5:
* API changes have been submitted via bpf/stable tree

v4:
* prohibit access to vlan fields as well (otherwise, inconsistent
  between skb/skb-less cases)
* drop extra unneeded check for skb->vlan_present in bpf_flow.c

v3:
* new kernel xdp_buff-like context per Alexei suggestion
* drop skb_net helper
* properly clamp flow_keys->nhoff

v2:
* moved temporary skb from stack into percpu (avoids memset of ~200 bytes
  per packet)
* tightened down access to __sk_buff fields from flow dissector programs to
  avoid touching shinfo (whitelist only relevant fields)
* addressed suggestions from Willem
====================

Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:36 +02:00
Stanislav Fomichev 02ee065836 bpf/flow_dissector: don't adjust nhoff by ETH_HLEN in BPF_PROG_TEST_RUN
Now that we use skb-less flow dissector let's return true nhoff and
thoff. We used to adjust them by ETH_HLEN because that's how it was
done in the skb case. For VLAN tests that looks confusing: nhoff is
pointing to vlan parts :-\

Warning, this is an API change for BPF_PROG_TEST_RUN! Feel free to drop
if you think that it's too late at this point to fix it.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:35 +02:00
Stanislav Fomichev fe993c6468 selftests/bpf: properly return error from bpf_flow_load
Right now we incorrectly return 'ret' which is always zero at that
point.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev 0905beec9f selftests/bpf: run flow dissector tests in skb-less mode
Export last_dissection map from flow dissector and use a known place in
tun driver to trigger BPF flow dissection.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev c9cb2c1e11 selftests/bpf: add flow dissector bpf_skb_load_bytes helper test
When flow dissector is called without skb, we want to make sure
bpf_skb_load_bytes invocations return error. Add small test which tries
to read single byte from a packet.

bpf_skb_load_bytes should always fail under BPF_PROG_TEST_RUN because
it was converted to the skb-less mode.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev c43f1255b8 net: pass net_device argument to the eth_get_headlen
Update all users of eth_get_headlen to pass network device, fetch
network namespace from it and pass it down to the flow dissector.
This commit is a noop until administrator inserts BPF flow dissector
program.

Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev 9b52e3f267 flow_dissector: handle no-skb use case
When called without skb, gather all required data from the
__skb_flow_dissect's arguments and use recently introduces
no-skb mode of bpf flow dissector.

Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev 3cbf4ffba5 net: plumb network namespace into __skb_flow_dissect
This new argument will be used in the next patches for the
eth_get_headlen use case. eth_get_headlen calls flow dissector
with only data (without skb) so there is currently no way to
pull attached BPF flow dissector program. With this new argument,
we can amend the callers to explicitly pass network namespace
so we can use attached BPF program.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev 7b8a130432 bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
Now that we have bpf_flow_dissect which can work on raw data,
use it when doing BPF_PROG_TEST_RUN for flow dissector.

Simplifies bpf_prog_test_run_flow_dissector and allows us to
test no-skb mode.

Note, that previously, with bpf_flow_dissect_skb we used to call
eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called
skb_reset_network_header. That means flow_keys->nhoff would be
initialized to 0 (skb_network_offset) in init_flow_keys.
Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need
to undo it once the dissection is done to preserve the existing behavior.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:34 +02:00
Stanislav Fomichev 089b19a920 flow_dissector: switch kernel context to struct bpf_flow_dissector
struct bpf_flow_dissector has a small subset of sk_buff fields that
flow dissector BPF program is allowed to access and an optional
pointer to real skb. Real skb is used only in bpf_skb_load_bytes
helper to read non-linear data.

The real motivation for this is to be able to call flow dissector
from eth_get_headlen context where we don't have an skb and need
to dissect raw bytes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:33 +02:00
Florian Fainelli 7e6e185c74 net: systemport: Remove need for DMA descriptor
All we do is write the length/status and address bits to a DMA
descriptor only to write its contents into on-chip registers right
after, eliminate this unnecessary step.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:20:15 -07:00
Ido Schimmel 697cd36cda bridge: Fix possible use-after-free when deleting bridge port
When a bridge port is being deleted, do not dereference it later in
br_vlan_port_event() as it can result in a use-after-free [1] if the RCU
callback was executed before invoking the function.

[1]
[  129.638551] ==================================================================
[  129.646904] BUG: KASAN: use-after-free in br_vlan_port_event+0x53c/0x5fd
[  129.654406] Read of size 8 at addr ffff8881e4aa1ae8 by task ip/483
[  129.663008] CPU: 0 PID: 483 Comm: ip Not tainted 5.1.0-rc5-custom-02265-ga946bd73daac #1383
[  129.672359] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[  129.682484] Call Trace:
[  129.685242]  dump_stack+0xa9/0x10e
[  129.689068]  print_address_description.cold.2+0x9/0x25e
[  129.694930]  kasan_report.cold.3+0x78/0x9d
[  129.704420]  br_vlan_port_event+0x53c/0x5fd
[  129.728300]  br_device_event+0x2c7/0x7a0
[  129.741505]  notifier_call_chain+0xb5/0x1c0
[  129.746202]  rollback_registered_many+0x895/0xe90
[  129.793119]  unregister_netdevice_many+0x48/0x210
[  129.803384]  rtnl_delete_link+0xe1/0x140
[  129.815906]  rtnl_dellink+0x2a3/0x820
[  129.844166]  rtnetlink_rcv_msg+0x397/0x910
[  129.868517]  netlink_rcv_skb+0x137/0x3a0
[  129.882013]  netlink_unicast+0x49b/0x660
[  129.900019]  netlink_sendmsg+0x755/0xc90
[  129.915758]  ___sys_sendmsg+0x761/0x8e0
[  129.966315]  __sys_sendmsg+0xf0/0x1c0
[  129.988918]  do_syscall_64+0xa4/0x470
[  129.993032]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  129.998696] RIP: 0033:0x7ff578104b58
...
[  130.073811] Allocated by task 479:
[  130.077633]  __kasan_kmalloc.constprop.5+0xc1/0xd0
[  130.083008]  kmem_cache_alloc_trace+0x152/0x320
[  130.088090]  br_add_if+0x39c/0x1580
[  130.092005]  do_set_master+0x1aa/0x210
[  130.096211]  do_setlink+0x985/0x3100
[  130.100224]  __rtnl_newlink+0xc52/0x1380
[  130.104625]  rtnl_newlink+0x6b/0xa0
[  130.108541]  rtnetlink_rcv_msg+0x397/0x910
[  130.113136]  netlink_rcv_skb+0x137/0x3a0
[  130.117538]  netlink_unicast+0x49b/0x660
[  130.121939]  netlink_sendmsg+0x755/0xc90
[  130.126340]  ___sys_sendmsg+0x761/0x8e0
[  130.130645]  __sys_sendmsg+0xf0/0x1c0
[  130.134753]  do_syscall_64+0xa4/0x470
[  130.138864]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[  130.146195] Freed by task 0:
[  130.149421]  __kasan_slab_free+0x125/0x170
[  130.154016]  kfree+0xf3/0x310
[  130.157349]  kobject_put+0x1a8/0x4c0
[  130.161363]  rcu_core+0x859/0x19b0
[  130.165175]  __do_softirq+0x250/0xa26
[  130.170956] The buggy address belongs to the object at ffff8881e4aa1ae8
                which belongs to the cache kmalloc-1k of size 1024
[  130.184972] The buggy address is located 0 bytes inside of
                1024-byte region [ffff8881e4aa1ae8, ffff8881e4aa1ee8)

Fixes: 9c0ec2e718 ("bridge: support binding vlan dev link state to vlan member bridge ports")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Cc: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Mike Manning <mmanning@vyatta.att-mail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:17:47 -07:00
Crag.Wang a6cbcb7793 r8152: sync sa_family with the media type of network device
Without this patch the socket address family sporadically gets wrong
value ends up the dev_set_mac_address() fails to set the desired MAC
address.

Fixes: 25766271e4 ("r8152: Refresh MAC address during USBDEVFS_RESET")
Signed-off-by: Crag.Wang <crag.wang@dell.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-By: Mario Limonciello <mario.limonciello@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:14:43 -07:00
David S. Miller 6f97955fd2 Merge branch 'mlxsw-Shared-buffer-improvements'
Ido Schimmel says:

====================
mlxsw: Shared buffer improvements

This patchset includes two improvements with regards to shared buffer
configuration in mlxsw.

The first part of this patchset forbids the user from performing illegal
shared buffer configuration that can result in unnecessary packet loss.
In order to better communicate these configuration failures to the user,
extack is propagated from devlink towards drivers. This is done in
patches #1-#8.

The second part of the patchset deals with the shared buffer
configuration of the CPU port. When a packet is trapped by the device,
it is sent across the PCI bus to the attached host CPU. From the
device's perspective, it is as if the packet is transmitted through the
CPU port.

While testing traffic directed at the CPU it became apparent that for
certain packet sizes and certain burst sizes, the current shared buffer
configuration of the CPU port is inadequate and results in packet drops.
The configuration is adjusted by patches #9-#14 that create two new pools
- ingress & egress - which are dedicated for CPU traffic.
====================

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:33 -07:00
Ido Schimmel 7a1ff9f45b mlxsw: spectrum_buffers: Adjust CPU port shared buffer egress quotas
Switch the CPU port to use the new dedicated egress pool instead the
previously used egress pool which was shared with normal front panel
ports.

Add per-port quotas for the amount of traffic that can be buffered for
the CPU port and also adjust the per-{port, TC} quotas.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:33 -07:00
Ido Schimmel 6d28725c4d mlxsw: spectrum_buffers: Allow skipping ingress port quota configuration
The CPU port is used to transmit traffic that is trapped to the host
CPU. It is therefore irrelevant to define ingress quota for it.

Add a 'skip_ingress' argument to the function tasked with configuring
per-port quotas, so that ingress quotas could be skipped in case the
passed local port is the CPU port.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:33 -07:00
Ido Schimmel 24a7cc1ef6 mlxsw: spectrum_buffers: Split business logic from mlxsw_sp_port_sb_pms_init()
The function is used to set the per-port shared buffer quotas.
Currently, these quotas are only set for front panel ports, but a
subsequent patch will configure these quotas for the CPU port as well.

The configuration required for the CPU port is a bit different than that
of the front panel ports, so split the business logic into a separate
function which will be called with different parameters for the CPU
port.

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:33 -07:00
Ido Schimmel 50b5b90514 mlxsw: spectrum_buffers: Use new CPU ingress pool for control packets
Use the new ingress pool that was added in the previous patch for
control packets (e.g., STP, LACP) that are trapped to the CPU.

The previous management pool is no longer necessary and therefore its
size is set to 0.

The maximum quota for traffic towards the CPU is increased to 50% of the
free space in the new ingress pool and therefore the reserved space is
reduced by half, to 10KB - in both the shared and headroom buffer. This
allows for more efficient utilization of the shared buffer as reserved
space cannot be used for other purposes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:33 -07:00
Ido Schimmel 265c49b4b9 mlxsw: spectrum_buffers: Add pools for CPU traffic
Packets that are trapped to the CPU are transmitted through the CPU port
to the attached host. The CPU port is therefore like any other port and
needs to have shared buffer configuration.

The maximum quotas configured for the CPU are provided using dynamic
threshold and cannot be changed by the user. In order to make sure that
these thresholds are always valid, the configuration of the threshold
type of these pools is forbidden.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel 857f138f04 mlxsw: spectrum_buffers: Remove assumption about pool order
The code currently assumes that ingress pools have lower indices than
egress pools. This makes it impossible to add more ingress pools
without breaking user configuration that relies on a certain pool index
to correspond to an egress pool.

Remove such assumptions from the code, so that more ingress pools could
be added by subsequent patches.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel f1aaeacdae mlxsw: spectrum_buffers: Forbid changing multicast TCs' attributes
Commit e83c045e53 ("mlxsw: spectrum_buffers: Configure MC pool")
configured the threshold of the multicast TCs as infinite so that the
admission of multicast packets is only depended on per-switch priority
threshold.

Forbid the user from changing the thresholds of these multicast TCs and
their binding to a different pool.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel 51e15a4978 mlxsw: spectrum_buffers: Forbid changing threshold type of first egress pool
Multicast packets have three egress quotas:
* Per egress port
* Per egress port and traffic class
* Per switch priority

The limits on the switch priority are not exposed to the user and
specified as dynamic threshold on the first egress pool.

Forbid changing the threshold type of the first egress pool so that
these limits are always valid.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel cce7acca8a mlxsw: spectrum_buffers: Forbid configuration of multicast pool
Commit e83c045e53 ("mlxsw: spectrum_buffers: Configure MC pool") added
a dedicated pool for multicast traffic. The pool is visible to the user
so that it would be possible to monitor its occupancy, but its
configuration should be forbidden in order to maintain its intended
operation.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel f7936d0bcf mlxsw: spectrum_buffers: Add ability to veto TC's configuration
Subsequent patches are going to need to veto changes in certain TCs'
binding and threshold configurations.

Add fields to the TC's struct that indicate if the TC can be bound to a
different pool and whether its threshold can change and enforce that.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel 0636f4de79 mlxsw: spectrum_buffers: Add ability to veto pool's configuration
Subsequent patches are going to need to veto changes in certain pools'
size and / or threshold type (mode).

Add two fields to the pool's struct that indicate if either of these
attributes is allowed to change and enforce that.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel 93d3668c02 mlxsw: spectrum_buffers: Use defines for pool indices
The pool indices are currently hard coded throughout the code, which
makes the code hard to follow and extend.

Overcome this by using defines for the pool indices.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel 8f6862065d mlxsw: spectrum_buffers: Add extack messages for invalid configurations
Add extack messages to better communicate invalid configuration to the
user.

Example:

# devlink sb pool set pci/0000:01:00.0 pool 0 size 104857600 thtype dynamic
Error: mlxsw_spectrum: Exceeded shared buffer size.
devlink answers: Invalid argument

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Ido Schimmel f2ad1a522e net: devlink: Add extack to shared buffer operations
Add extack to shared buffer set operations, so that meaningful error
messages could be propagated to the user.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
David S. Miller 7e5ebd0b78 Merge branch 'net-clean-up-needless-use-of-module-infrastructure'
Paul Gortmaker says:

====================
clean up needless use of module infrastructure

People can embed modular includes and modular exit functions into code
that never use any of it, and they won't get any errors or warnings.

Using modular infrastructure in non-modules might seem harmless, but some
of the downfalls this leads to are:

 (1) it is easy to accidentally write unused module_exit removal code
 (2) it can be misleading when reading the source, thinking a driver can
     be modular when the Makefile and/or Kconfig prohibit it
 (3) an unused include of the module.h header file will in turn
     include nearly everything else; adding a lot to CPP overhead.
 (4) it gets copied/replicated into other drivers and spreads quickly.

As a data point for #3 above, an empty C file that just includes the
module.h header generates over 750kB of CPP output.  Repeating the same
experiment with init.h and the result is less than 12kB; with export.h
it is only about 1/2kB; with both it still is less than 12kB.  One driver
in this series gets the module.h ---> init.h+export.h conversion.

Worse, are headers in include/linux that in turn include <linux/module.h>
as they can impact a whole fleet of drivers, or a whole subsystem, so
special care should be used in order to avoid that.  Such headers should
only include what they need to be stand-alone; they should not be trying
to anticipate the various header needs of their possible end users.

In this series, four include/linux headers have module.h removed from
them because they don't strictly need it.  Then three chunks of net
related code have modular infrastructure that isn't used, removed.

There are no runtime changes, so the biggest risk is a genuine consumer
of module.h content relying on implicitly getting it from one of the
include/linux instances removed here - thus resulting in a build fail.

With that in mind, allmodconfig build testing was done on x86-64, arm64,
x86-32, arm. powerpc, and mips on linux-next (and hence net-next).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:51:00 -07:00
Paul Gortmaker 15253b4a71 net: strparser: make it explicitly non-modular
The Kconfig currently controlling compilation of this code is:

net/strparser/Kconfig:config STREAM_PARSER
net/strparser/Kconfig:  def_bool n

...meaning that it currently is not being built as a module by anyone.

Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.  For
clarity, we change the fcn name mod_init to dev_init at the same time.

We replace module.h with init.h and export.h ; the latter since this
file exports some syms.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:54 -07:00
Paul Gortmaker 3557b3fdee net: bpfilter: dont use module_init in non-modular code
The Kconfig controlling this code is:

bpfilter/Kconfig:menuconfig BPFILTER
bpfilter/Kconfig:   bool "BPF based packet filtering framework (BPFILTER)"

Since it isn't a module, we shouldn't use module_init().  Instead we
use device_initcall() - which is exactly what module_init() defaults
to for non-modular code/builds.

We don't remove <linux/module.h> from the includes since this file does
a request_module() and hence is a valid user of that header file, even
though it is not modular itself.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:54 -07:00