This work implements direct packet access for helpers and direct packet
write in a similar fashion as already available for XDP types via commits
4acf6c0b84 ("bpf: enable direct packet data write for xdp progs") and
6841de8b0d ("bpf: allow helpers access the packet directly"), and as a
complementary feature to the already available direct packet read for tc
(cls/act) programs.
For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
and bpf_csum_update(). The first is generally needed for both, read and
write, because they would otherwise only be limited to the current linear
skb head. Usually, when the data_end test fails, programs just bail out,
or, in the direct read case, use bpf_skb_load_bytes() as an alternative
to overcome this limitation. If such data sits in non-linear parts, we
can just pull them in once with the new helper, retest and eventually
access them.
At the same time, this also makes sure the skb is uncloned, which is, of
course, a necessary condition for direct write. As this needs to be an
invariant for the write part only, the verifier detects writes and adds
a prologue that is calling bpf_skb_pull_data() to effectively unclone the
skb from the very beginning in case it is indeed cloned. The heuristic
makes use of a similar trick that was done in 233577a220 ("net: filter:
constify detection of pkt_type_offset"). This comes at zero cost for other
programs that do not use the direct write feature. Should a program use
this feature only sparsely and has read access for the most parts with,
for example, drop return codes, then such write action can be delegated
to a tail called program for mitigating this cost of potential uncloning
to a late point in time where it would have been paid similarly with the
bpf_skb_store_bytes() as well. Advantage of direct write is that the
writes are inlined whereas the helper cannot make any length assumptions
and thus needs to generate a call to memcpy() also for small sizes, as well
as cost of helper call itself with sanity checks are avoided. Plus, when
direct read is already used, we don't need to cache or perform rechecks
on the data boundaries (due to verifier invalidating previous checks for
helpers that change skb->data), so more complex programs using rewrites
can benefit from switching to direct read plus write.
For direct packet access to helpers, we save the otherwise needed copy into
a temp struct sitting on stack memory when use-case allows. Both facilities
are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
this to map helpers and csum_diff, and can successively enable other helpers
where we find it makes sense. Helpers that definitely cannot be allowed for
this are those part of bpf_helper_changes_skb_data() since they can change
underlying data, and those that write into memory as this could happen for
packet typed args when still cloned. bpf_csum_update() helper accommodates
for the fact that we need to fixup checksum_complete when using direct write
instead of bpf_skb_store_bytes(), meaning the programs can use available
helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
csum_block_add(), csum_block_sub() equivalents in eBPF together with the
new helper. A usage example will be provided for iproute2's examples/bpf/
directory.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current contract for the following two helper argument types is:
* ARG_CONST_STACK_SIZE: passed argument pair must be (ptr, >0).
* ARG_CONST_STACK_SIZE_OR_ZERO: passed argument pair can be either
(NULL, 0) or (ptr, >0).
With 6841de8b0d ("bpf: allow helpers access the packet directly"), we can
pass also raw packet data to helpers, so depending on the argument type
being PTR_TO_PACKET, we now either assert memory via check_packet_access()
or check_stack_boundary(). As a result, the tests in check_packet_access()
currently allow more than intended with regards to reg->imm.
Back in 969bf05eb3 ("bpf: direct packet access"), check_packet_access()
was fine to ignore size argument since in check_mem_access() size was
bpf_size_to_bytes() derived and prior to the call to check_packet_access()
guaranteed to be larger than zero.
However, for the above two argument types, it currently means, we can have
a <= 0 size and thus breaking current guarantees for helpers. Enforce a
check for size <= 0 and bail out if so.
check_stack_boundary() doesn't have such an issue since it already tests
for access_size <= 0 and bails out, resp. access_size == 0 in case of NULL
pointer passed when allowed.
Fixes: 6841de8b0d ("bpf: allow helpers access the packet directly")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.
This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps detecting more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.
BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.
I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
anyway, so we can drop the unneeded test for dev; also few more other
misc bits addressed here.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
LLVM can generate code that tests for direct packet access via
skb->data/data_end in a way that currently gets rejected by the
verifier, example:
[...]
7: (61) r3 = *(u32 *)(r6 +80)
8: (61) r9 = *(u32 *)(r6 +76)
9: (bf) r2 = r9
10: (07) r2 += 54
11: (3d) if r3 >= r2 goto pc+12
R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
R9=pkt(id=0,off=0,r=0) R10=fp
12: (18) r4 = 0xffffff7a
14: (05) goto pc+430
[...]
from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
24: (7b) *(u64 *)(r10 -40) = r1
25: (b7) r1 = 0
26: (63) *(u32 *)(r6 +56) = r1
27: (b7) r2 = 40
28: (71) r8 = *(u8 *)(r9 +20)
invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)
The reason why this gets rejected despite a proper test is that we
currently call find_good_pkt_pointers() only in case where we detect
tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
derived, for example, from a register of type pkt(id=Y,off=0,r=0)
pointing to skb->data. find_good_pkt_pointers() then fills the range
in the current branch to pkt(id=Y,off=0,r=Z) on success.
For above case, we need to extend that to recognize pkt_end >= rX
pattern and mark the other branch that is taken on success with the
appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
only two practical options to test for from what LLVM could have
generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=)
that we would need to take into account as well.
After the fix:
[...]
7: (61) r3 = *(u32 *)(r6 +80)
8: (61) r9 = *(u32 *)(r6 +76)
9: (bf) r2 = r9
10: (07) r2 += 54
11: (3d) if r3 >= r2 goto pc+12
R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
R9=pkt(id=0,off=0,r=0) R10=fp
12: (18) r4 = 0xffffff7a
14: (05) goto pc+430
[...]
from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
24: (7b) *(u64 *)(r10 -40) = r1
25: (b7) r1 = 0
26: (63) *(u32 *)(r6 +56) = r1
27: (b7) r2 = 40
28: (71) r8 = *(u8 *)(r9 +20)
29: (bf) r1 = r8
30: (25) if r8 > 0x3c goto pc+47
R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
R9=pkt(id=0,off=0,r=54) R10=fp
31: (b7) r1 = 1
[...]
Verifier test cases are also added in this work, one that demonstrates
the mentioned example here and one that tries a bad packet access for
the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
with both test variants (>, >=).
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes for both merge conflicts.
Resolution work done by Stephen Rothwell was used
as a reference.
Signed-off-by: David S. Miller <davem@davemloft.net>
The helper functions like bpf_map_lookup_elem(map, key) were only
allowing 'key' to point to the initialized stack area.
That is causing performance degradation when programs need to process
millions of packets per second and need to copy contents of the packet
into the stack just to pass the stack pointer into the lookup() function.
Allow such helpers read from the packet directly.
All helpers that expect ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE,
ARG_PTR_TO_STACK assume byte aligned pointer, so no alignment concerns,
only need to check that helper will not be accessing beyond
the packet range verified by the prior 'if (ptr < data_end)' condition.
For now allow this feature for XDP programs only. Later it can be
relaxed for the clsact programs as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While hashing out BPF's current_task_under_cgroup helper bits, it came
to discussion that the skb_in_cgroup helper name was suboptimally chosen.
Tejun says:
So, I think in_cgroup should mean that the object is in that
particular cgroup while under_cgroup in the subhierarchy of that
cgroup. Let's rename the other subhierarchy test to under too. I
think that'd be a lot less confusing going forward.
[...]
It's more intuitive and gives us the room to implement the real
"in" test if ever necessary in the future.
Since this touches uapi bits, we need to change this as long as v4.8
is not yet officially released. Thus, change the helper enum and rename
related bits.
Fixes: 4a482f34af ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
Reference: http://patchwork.ozlabs.org/patch/658500/
Suggested-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
This adds a bpf helper that's similar to the skb_in_cgroup helper to check
whether the probe is currently executing in the context of a specific
subset of the cgroupsv2 hierarchy. It does this based on membership test
for a cgroup arraymap. It is invalid to call this in an interrupt, and
it'll return an error. The helper is primarily to be used in debugging
activities for containers, where you may have multiple programs running in
a given top-level "container".
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The introduction of pre-allocated hash elements inadvertently broke
the behavior of bpf hash maps where users expected to call
bpf_map_update_elem() without considering that the map can be full.
Some programs do:
old_value = bpf_map_lookup_elem(map, key);
if (old_value) {
... prepare new_value on stack ...
bpf_map_update_elem(map, key, new_value);
}
Before pre-alloc the update() for existing element would work even
in 'map full' condition. Restore this behavior.
The above program could have updated old_value in place instead of
update() which would be faster and most programs use that approach,
but sometimes the values are large and the programs use update()
helper to do atomic replacement of the element.
Note we cannot simply update element's value in-place like percpu
hash map does and have to allocate extra num_possible_cpu elements
and use this extra reserve when the map is full.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using per-register incrementing ID can lead to
find_good_pkt_pointers() confusing registers which
have completely different values. Consider example:
0: (bf) r6 = r1
1: (61) r8 = *(u32 *)(r6 +76)
2: (61) r0 = *(u32 *)(r6 +80)
3: (bf) r7 = r8
4: (07) r8 += 32
5: (2d) if r8 > r0 goto pc+9
R0=pkt_end R1=ctx R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=0,off=32,r=32) R10=fp
6: (bf) r8 = r7
7: (bf) r9 = r7
8: (71) r1 = *(u8 *)(r7 +0)
9: (0f) r8 += r1
10: (71) r1 = *(u8 *)(r7 +1)
11: (0f) r9 += r1
12: (07) r8 += 32
13: (2d) if r8 > r0 goto pc+1
R0=pkt_end R1=inv56 R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=1,off=32,r=32) R9=pkt(id=1,off=0,r=32) R10=fp
14: (71) r1 = *(u8 *)(r9 +16)
15: (b7) r7 = 0
16: (bf) r0 = r7
17: (95) exit
We need to get a UNKNOWN_VALUE with imm to force id
generation so lines 0-5 make r7 a valid packet pointer.
We then read two different bytes from the packet and
add them to copies of the constructed packet pointer.
r8 (line 9) and r9 (line 11) will get the same id of 1,
independently. When either of them is validated (line
13) - find_good_pkt_pointers() will also mark the other
as safe. This leads to access on line 14 being mistakenly
considered safe.
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Unified UDP encapsulation offload methods for drivers, from
Alexander Duyck.
2) Make DSA binding more sane, from Andrew Lunn.
3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
packets as soon as the device sees them, with the option to mirror
the packet on TX via the same interface. From Brenden Blanco and
others.
6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
8) Simplify netlink conntrack entry layout, from Florian Westphal.
9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
Schimmel, Yotam Gigi, and Jiri Pirko.
10) Add SKB array infrastructure and convert tun and macvtap over to it.
From Michael S Tsirkin and Jason Wang.
11) Support qdisc packet injection in pktgen, from John Fastabend.
12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
13) Add NV congestion control support to TCP, from Lawrence Brakmo.
14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
16) Support MPLS over IPV4, from Simon Horman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
xgene: Fix build warning with ACPI disabled.
be2net: perform temperature query in adapter regardless of its interface state
l2tp: Correctly return -EBADF from pppol2tp_getname.
net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
net: ipmr/ip6mr: update lastuse on entry change
macsec: ensure rx_sa is set when validation is disabled
tipc: dump monitor attributes
tipc: add a function to get the bearer name
tipc: get monitor threshold for the cluster
tipc: make cluster size threshold for monitoring configurable
tipc: introduce constants for tipc address validation
net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
MAINTAINERS: xgene: Add driver and documentation path
Documentation: dtb: xgene: Add MDIO node
dtb: xgene: Add MDIO node
drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
drivers: net: xgene: Use exported functions
drivers: net: xgene: Enable MDIO driver
drivers: net: xgene: Add backward compatibility
drivers: net: phy: xgene: Add MDIO driver
...
For forwarding to be effective, XDP programs should be allowed to
rewrite packet data.
This requires that the drivers supporting XDP must all map the packet
memory as TODEVICE or BIDIRECTIONAL before invoking the program.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.
An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. Out
of bounds return codes should alias to XDP_ABORTED.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsystem may need to store many copies of a bpf program, each
deserving its own reference. Rather than requiring the caller to loop
one by one (with possible mid-loop failure), add a bulk bpf_prog_add
api.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Should have been obvious, only called from bpf() syscall via map_update_elem()
that calls bpf_fd_array_map_update_elem() under RCU read lock and thus this
must also be in GFP_ATOMIC, of course.
Fixes: 3b1efb196e ("bpf, maps: flush own entries on perf map release")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work addresses a couple of issues bpf_skb_event_output()
helper currently has: i) We need two copies instead of just a
single one for the skb data when it should be part of a sample.
The data can be non-linear and thus needs to be extracted via
bpf_skb_load_bytes() helper first, and then copied once again
into the ring buffer slot. ii) Since bpf_skb_load_bytes()
currently needs to be used first, the helper needs to see a
constant size on the passed stack buffer to make sure BPF
verifier can do sanity checks on it during verification time.
Thus, just passing skb->len (or any other non-constant value)
wouldn't work, but changing bpf_skb_load_bytes() is also not
the proper solution, since the two copies are generally still
needed. iii) bpf_skb_load_bytes() is just for rather small
buffers like headers, since they need to sit on the limited
BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
this work improves the bpf_skb_event_output() helper to address
all 3 at once.
We can make use of the passed in skb context that we have in
the helper anyway, and use some of the reserved flag bits as
a length argument. The helper will use the new __output_custom()
facility from perf side with bpf_skb_copy() as callback helper
to walk and extract the data. It will pass the data for setup
to bpf_event_output(), which generates and pushes the raw record
with an additional frag part. The linear data used in the first
frag of the record serves as programmatically defined meta data
passed along with the appended sample.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Kconfig currently controlling compilation of this code is:
init/Kconfig:config BPF_SYSCALL
init/Kconfig: bool "Enable bpf() system call"
...meaning that it currently is not being built as a module by anyone.
Lets remove the couple traces of modular infrastructure use, so that
when reading the driver there is no doubt it is builtin-only.
Note that MODULE_ALIAS is a no-op for non-modular code.
We replace module.h with init.h since the file does use __init.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
belongs to a descendant of a cgroup2. It is similar to the
feature added in netfilter:
commit c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")
The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
which will be used by the bpf_skb_in_cgroup.
Modifications to the bpf verifier is to ensure BPF_MAP_TYPE_CGROUP_ARRAY
and bpf_skb_in_cgroup() are always used together.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
To update an element, the caller is expected to obtain a cgroup2 backed
fd by open(cgroup2_dir) and then update the array with that fd.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jann Horn reported following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:
- Set up a process with threads T1, T2 and T3
- Let T1 set up a socket filter F1 that invokes another filter F2
through a BPF map [tail call]
- Let T1 trigger the socket filter via a unix domain socket write,
don't wait for completion
- Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
- Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
- Let T3 close the file descriptor for F2, dropping the reference
count of F2 to 2
- At this point, T1 should have looked up F2 from the map, but not
finished executing it
- Let T3 remove F2 from the BPF map, dropping the reference count of
F2 to 1
- Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
via schedule_work()
- At this point, the BPF program could be freed
- BPF execution is still running in a freed BPF program
While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additionally delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().
That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb5607035 ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() look good to me.
Since we don't know whether at the time of attaching the program, we're
already part of a tail call map, we need to use RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
instead of the raw variant. This allows for preemption checks when we
have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
need to keep the raw variant for socket filters, but we can reuse the
helper that is already there from cBPF side.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some minor cleanups: i) Remove the unlikely() from fd array map lookups
and let the CPU branch predictor do its job, scenarios where there is not
always a map entry are very well valid. ii) Move the attribute type check
in the bpf_perf_event_read() helper a bit earlier so it's consistent wrt
checks with bpf_perf_event_output() helper as well. iii) remove some
comments that are self-documenting in kprobe_prog_is_valid_access() and
therefore make it consistent to tp_prog_is_valid_access() as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34f ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c65 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends map_fd_get_ptr() callback that is used by fd array
maps, so that struct file pointer from the related map can be passed
in. It's safe to remove map_update_elem() callback for the two maps since
this is only allowed from syscall side, but not from eBPF programs for these
two map types. Like in per-cpu map case, bpf_fd_array_map_update_elem()
needs to be called directly here due to the extra argument.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a release callback for maps that is invoked when the last
reference to its struct file is gone and the struct file about
to be released by vfs. The handler will be used by fd array maps.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ctx structure passed into bpf programs is different depending on bpf
program type. The verifier incorrectly marked ctx->data and ctx->data_end
access based on ctx offset only. That caused loads in tracing programs
int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
to be incorrectly marked as PTR_TO_PACKET which later caused verifier
to reject the program that was actually valid in tracing context.
Fix this by doing program type specific matching of ctx offsets.
Fixes: 969bf05eb3 ("bpf: direct packet access")
Reported-by: Sasha Goldshtein <goldshtn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix negative error code usage in ATM layer, from Stefan Hajnoczi.
2) If CONFIG_SYSCTL is disabled, the default TTL is not initialized
properly. From Ezequiel Garcia.
3) Missing spinlock init in mvneta driver, from Gregory CLEMENT.
4) Missing unlocks in hwmb error paths, also from Gregory CLEMENT.
5) Fix deadlock on team->lock when propagating features, from Ivan
Vecera.
6) Work around buffer offset hw bug in alx chips, from Feng Tang.
7) Fix double listing of SCTP entries in sctp_diag dumps, from Xin
Long.
8) Various statistics bug fixes in mlx4 from Eric Dumazet.
9) Fix some randconfig build errors wrt fou ipv6 from Arnd Bergmann.
10) All of l2tp was namespace aware, but the ipv6 support code was not
doing so. From Shmulik Ladkani.
11) Handle on-stack hrtimers properly in pktgen, from Guenter Roeck.
12) Propagate MAC changes properly through VLAN devices, from Mike
Manning.
13) Fix memory leak in bnx2x_init_one(), from Vitaly Kuznetsov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
sfc: Track RPS flow IDs per channel instead of per function
usbnet: smsc95xx: fix link detection for disabled autonegotiation
virtio_net: fix virtnet_open and virtnet_probe competing for try_fill_recv
bnx2x: avoid leaking memory on bnx2x_init_one() failures
fou: fix IPv6 Kconfig options
openvswitch: update checksum in {push,pop}_mpls
sctp: sctp_diag should dump sctp socket type
net: fec: update dirty_tx even if no skb
vlan: Propagate MAC address to VLANs
atm: iphase: off by one in rx_pkt()
atm: firestream: add more reserved strings
vxlan: Accept user specified MTU value when create new vxlan link
net: pktgen: Call destroy_hrtimer_on_stack()
timer: Export destroy_hrtimer_on_stack()
net: l2tp: Make l2tp_ip6 namespace aware
Documentation: ip-sysctl.txt: clarify secure_redirects
sfc: use flow dissector helpers for aRFS
ieee802154: fix logic error in ieee802154_llsec_parse_dev_addr
net: nps_enet: Disable interrupts before napi reschedule
net/lapb: tuse %*ph to dump buffers
...
Additionally to being able to control the system wide maximum depth via
/proc/sys/kernel/perf_event_max_stack, now we are able to ask for
different depths per event, using perf_event_attr.sample_max_stack for
that.
This uses an u16 hole at the end of perf_event_attr, that, when
perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if
sample_max_stack is zero, means use perf_event_max_stack, otherwise
it'll be bounds checked under callchain_mutex.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull perf updates from Ingo Molnar:
"Mostly tooling and PMU driver fixes, but also a number of late updates
such as the reworking of the call-chain size limiting logic to make
call-graph recording more robust, plus tooling side changes for the
new 'backwards ring-buffer' extension to the perf ring-buffer"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
perf record: Read from backward ring buffer
perf record: Rename variable to make code clear
perf record: Prevent reading invalid data in record__mmap_read
perf evlist: Add API to pause/resume
perf trace: Use the ptr->name beautifier as default for "filename" args
perf trace: Use the fd->name beautifier as default for "fd" args
perf report: Add srcline_from/to branch sort keys
perf evsel: Record fd into perf_mmap
perf evsel: Add overwrite attribute and check write_backward
perf tools: Set buildid dir under symfs when --symfs is provided
perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
perf annotate: Sort list of recognised instructions
perf annotate: Fix identification of ARM blt and bls instructions
perf tools: Fix usage of max_stack sysctl
perf callchain: Stop validating callchains by the max_stack sysctl
perf trace: Fix exit_group() formatting
perf top: Use machine->kptr_restrict_warned
perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
perf machine: Do not bail out if not managing to read ref reloc symbol
perf/x86/intel/p4: Trival indentation fix, remove space
...
Follow-up to commit e27f4a942a ("bpf: Use mount_nodev not mount_ns
to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.
The original idea was to have a per mountns instance instead of a
single global fs instance, but that didn't work out and we had to
switch to mount_nodev() model. The intent of that middle ground was
that we avoid users who don't play nice to create endless instances
of bpf fs which are difficult to control and discover from an admin
point of view, but at the same time it would have allowed us to be
more flexible with regard to namespaces.
Therefore, since we now did the switch to mount_nodev() as a fix
where individual instances are created, we also need to remove userns
mount flag along with it to avoid running into mentioned situation.
I don't expect any breakage at this early point in time with removing
the flag and we can revisit this later should the requirement for
this come up with future users. This and commit e27f4a942a have
been split to facilitate tracking should any of them run into the
unlikely case of causing a regression.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Humans don't write C code like:
u8 *ptr = skb->data;
int imm = 4;
imm += ptr;
but from llvm backend point of view 'imm' and 'ptr' are registers and
imm += ptr may be preferred vs ptr += imm depending which register value
will be used further in the code, while verifier can only recognize ptr += imm.
That caused small unrelated changes in the C code of the bpf program to
trigger rejection by the verifier. Therefore teach the verifier to recognize
both ptr += imm and imm += ptr.
For example:
when R6=pkt(id=0,off=0,r=62) R7=imm22
after r7 += r6 instruction
will be R6=pkt(id=0,off=0,r=62) R7=pkt(id=0,off=22,r=62)
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
when packet headers are accessed in 'decreasing' order (like TCP port
may be fetched before the program reads IP src) the llvm may generate
the following code:
[...] // R7=pkt(id=0,off=22,r=70)
r2 = *(u32 *)(r7 +0) // good access
[...]
r7 += 40 // R7=pkt(id=0,off=62,r=70)
r8 = *(u32 *)(r7 +0) // good access
[...]
r1 = *(u32 *)(r7 -20) // this one will fail though it's within a safe range
// it's doing *(u32*)(skb->data + 42)
Fix verifier to recognize such code pattern
Alos turned out that 'off > range' condition is not a verifier bug.
It's a buggy program that may do something like:
if (ptr + 50 > data_end)
return 0;
ptr += 60;
*(u32*)ptr;
in such case emit
"invalid access to packet, off=0 size=4, R1(id=0,off=60,r=50)" error message,
so all information is available for the program author to fix the program.
Fixes: 969bf05eb3 ("bpf: direct packet access")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
bpf filesystem. Looking at the code I saw a broken usage of mount_ns
with current->nsproxy->mnt_ns. As the code does not acquire a
reference to the mount namespace it can not possibly be correct to
store the mount namespace on the superblock as it does.
Replace mount_ns with mount_nodev so that each mount of the bpf
filesystem returns a distinct instance, and the code is not buggy.
In discussion with Hannes Frederic Sowa it was reported that the use
of mount_ns was an attempt to have one bpf instance per mount
namespace, in an attempt to keep resources that pin resources from
hiding. That intent simply does not work, the vfs is not built to
allow that kind of behavior. Which means that the bpf filesystem
really is buggy both semantically and in it's implemenation as it does
not nor can it implement the original intent.
This change is userspace visible, but my experience with similar
filesystems leads me to believe nothing will break with a model of each
mount of the bpf filesystem is distinct from all others.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Start address randomization and blinding in BPF currently use
prandom_u32(). prandom_u32() values are not exposed to unpriviledged
user space to my knowledge, but given other kernel facilities such as
ASLR, stack canaries, etc make use of stronger get_random_int(), we
better make use of it here as well given blinding requests successively
new random values. get_random_int() has minimal entropy pool depletion,
is not cryptographically secure, but doesn't need to be for our use
cases here.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull misc vfs cleanups from Al Viro:
"Assorted cleanups and fixes all over the place"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coredump: only charge written data against RLIMIT_CORE
coredump: get rid of coredump_params->written
ecryptfs_lookup(): try either only encrypted or plaintext name
ecryptfs: avoid multiple aliases for directories
bpf: reject invalid names right in ->lookup()
__d_alloc(): treat NULL name as QSTR("/", 1)
mtd: switch ubi_open_volume_path() to vfs_stat()
mtd: switch open_mtd_by_chdev() to use of vfs_stat()
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
This makes perf_callchain_{user,kernel}() receive the max stack
as context for the perf_callchain_entry, instead of accessing
the global sysctl_perf_event_max_stack.
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.
We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.
For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.
This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.
The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.
Small bpf_jit_disasm extract from [2] when applied to x86 JIT:
echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.
Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).
BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.
[1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
[2] https://github.com/01org/jit-spray-poc-for-ksp/
[3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the blinding is strictly only called from inside eBPF JITs,
we need to change signatures for bpf_int_jit_compile() and
bpf_prog_select_runtime() first in order to prepare that the
eBPF program we're dealing with can change underneath. Hence,
for call sites, we need to return the latest prog. No functional
change in this patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the functionality to patch instructions out of the verifier
code and into the core as the new bpf_patch_insn_single() helper
will be needed later on for blinding as well. No changes in
functionality.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Besides others, remove redundant comments where the code is self
documenting enough, and properly indent various bpf_verifier_ops
and bpf_prog_type_list declarations. Moreover, remove two exports
that actually have no module user.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
since UNKNOWN_VALUE type is weaker than CONST_IMM we can un-teach
verifier its recognition of constants in conditional branches
without affecting safety.
Ex:
if (reg == 123) {
.. here verifier was marking reg->type as CONST_IMM
instead keep reg as UNKNOWN_VALUE
}
Two verifier states with UNKNOWN_VALUE are equivalent, whereas
CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range
verification and other cases.
So help search pruning by marking registers as UNKNOWN_VALUE
where possible instead of CONST_IMM.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extended BPF carried over two instructions from classic to access
packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
but due to their design they have to do length check for every access.
When BPF is processing 20M packets per second single LD_ABS after JIT
is consuming 3% cpu. Hence the need to optimize it further by amortizing
the cost of 'off < skb_headlen' over multiple packet accesses.
One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
with similar usage as skb_header_pointer().
The kernel part for interpreter and x64 JIT was implemented in [1], but such
new insns behave like old ld_abs and abort the program with 'return 0' if
access is beyond linear data. Such hidden control flow is hard to workaround
plus changing JITs and rolling out new llvm is incovenient.
Therefore allow cls_bpf/act_bpf program access skb->data directly:
int bpf_prog(struct __sk_buff *skb)
{
struct iphdr *ip;
if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
/* packet too small */
return 0;
ip = skb->data + ETH_HLEN;
/* access IP header fields with direct loads */
if (ip->version != 4 || ip->saddr == 0x7f000001)
return 1;
[...]
}
This solution avoids introduction of new instructions. llvm stays
the same and all JITs stay the same, but verifier has to work extra hard
to prove safety of the above program.
For XDP the direct store instructions can be allowed as well.
The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
the alignment. The complex packet parsers where packet pointer is adjusted
incrementally cannot be tracked for alignment, so allow byte access in such cases
and misaligned access on architectures that define efficient_unaligned_access
[1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
cleanup verifier code and prepare it for addition of "pointer to packet" logic
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/ip_gre.c
Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 35578d7984 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
introduced clever way to check bpf_helper<->map_type compatibility.
Later on commit a43eec3042 ("bpf: introduce bpf_perf_event_output() helper") adjusted
the logic and inadvertently broke it.
Get rid of the clever bool compare and go back to two-way check
from map and from helper perspective.
Fixes: a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK,
the malicious application may overflow 32-bit bpf program refcnt.
It's also possible to overflow map refcnt on 1Tb system.
Impose 32k hard limit which means that the same bpf program or
map cannot be shared by more than 32k processes.
Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in the conflicts.
In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
The default remains 127, which is good for most cases, and not even hit
most of the time, but then for some cases, as reported by Brendan, 1024+
deep frames are appearing on the radar for things like groovy, ruby.
And in some workloads putting a _lower_ cap on this may make sense. One
that is per event still needs to be put in place tho.
The new file is:
# cat /proc/sys/kernel/perf_event_max_stack
127
Chaging it:
# echo 256 > /proc/sys/kernel/perf_event_max_stack
# cat /proc/sys/kernel/perf_event_max_stack
256
But as soon as there is some event using callchains we get:
# echo 512 > /proc/sys/kernel/perf_event_max_stack
-bash: echo: write error: Device or resource busy
#
Because we only allocate the callchain percpu data structures when there
is a user, which allows for changing the max easily, its just a matter
of having no callchain users at that point.
Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
references a non-map file descriptor as a map file descriptor, the error
handling code called fdput() twice instead of once (in __bpf_map_get() and
in replace_map_fd_with_map_ptr()). If the file descriptor table of the
current task is shared, this causes f_count to be decremented too much,
allowing the struct file to be freed while it is still in use
(use-after-free). This can be exploited to gain root privileges by an
unprivileged user.
This bug was introduced in
commit 0246e64d9a ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
exploitable since
commit 1be7f75d16 ("bpf: enable non-root eBPF programs") because
previously, CAP_SYS_ADMIN was required to reach the vulnerable code.
(posted publicly according to request by maintainer)
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec3042 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When passing buffers from eBPF stack space into a helper function, we have
ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure
that such buffers are initialized, within boundaries, etc.
However, the downside with this is that we have a couple of helper functions
such as bpf_skb_load_bytes() that fill out the passed buffer in the expected
success case anyway, so zero initializing them prior to the helper call is
unneeded/wasted instructions in the eBPF program that can be avoided.
Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
The idea is to skip the STACK_MISC check in check_stack_boundary() and color
the related stack slots as STACK_MISC after we checked all call arguments.
Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
the helper function will fill the provided buffer area, so that we cannot leak
any uninitialized stack memory. This f.e. means that error paths need to
memset() the buffers, but the expected fast-path doesn't have to do this
anymore.
Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK
argument, we can keep it simple and don't need to check for multiple areas.
Should in future such a use-case really appear, we have check_raw_mode() that
will make sure we implement support for it first.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when the verifier checks calls in check_call() function, we
call check_func_arg() for all 5 arguments e.g. to make sure expected types
are correct. In some cases, we collect meta data (here: map pointer) to
perform additional checks such as checking stack boundary on key/value
sizes for subsequent arguments. As we're going to extend the meta data,
add a generic struct bpf_call_arg_meta that we can use for passing into
check_func_arg().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
verifier must check for reserved size bits in instruction opcode and
reject BPF_LD | BPF_ABS | BPF_DW and BPF_LD | BPF_IND | BPF_DW instructions,
otherwise interpreter will WARN_RATELIMIT on them during execution.
Fixes: ddd872bc30 ("bpf: verifier: add checks for BPF_ABS | BPF_IND instructions")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
verifier is using the following structure to track the state of registers:
struct reg_state {
enum bpf_reg_type type;
union {
int imm;
struct bpf_map *map_ptr;
};
};
and later on in states_equal() does memcmp(&old->regs[i], &cur->regs[i],..)
to find equivalent states.
Throughout the code of verifier there are assignements to 'imm' and 'map_ptr'
fields and it's not obvious that most of the assignments into 'imm' don't
need to clear extra 4 bytes (like mark_reg_unknown_value() does) to make sure
that memcmp doesn't go over junk left from 'map_ptr' assignment.
Simplify the code by converting 'int' into 'long'
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The verifier needs to go through every path of the program in
order to check that it terminates safely, which can be quite a
lot of instructions that need to be processed f.e. in cases with
more branchy programs. With search pruning from f1bca824da ("bpf:
add search pruning optimization to verifier") the search space can
already be reduced significantly when the verifier detects that
a previously walked path with same register and stack contents
terminated already (see verifier's states_equal()), so the search
can skip walking those states.
When working with larger programs of > ~2000 (out of max 4096)
insns, we found that the current limit of 32k instructions is easily
hit. For example, a case we ran into is that the search space cannot
be pruned due to branches at the beginning of the program that make
use of certain stack space slots (STACK_MISC), which are never used
in the remaining program (STACK_INVALID). Therefore, the verifier
needs to walk paths for the slots in STACK_INVALID state, but also
all remaining paths with a stack structure, where the slots are in
STACK_MISC, which can nearly double the search space needed. After
various experiments, we find that a limit of 64k processed insns is
a more reasonable choice when dealing with larger programs in practice.
This still allows to reject extreme crafted cases that can have a
much higher complexity (f.e. > ~300k) within the 4096 insns limit
due to search pruning not being able to take effect.
Furthermore, we found that a lot of states can be pruned after a
call instruction, f.e. we were able to reduce the search state by
~35% in some cases with this heuristic, trade-off is to keep a bit
more states in env->explored_states. Usually, call instructions
have a number of preceding register assignments and/or stack stores,
where search pruning has a better chance to suceed in states_equal()
test. The current code marks the branch targets with STATE_LIST_MARK
in case of conditional jumps, and the next (t + 1) instruction in
case of unconditional jump so that f.e. a backjump will walk it. We
also did experiments with using t + insns[t].off + 1 as a marker in
the unconditionally jump case instead of t + 1 with the rationale
that these two branches of execution that converge after the label
might have more potential of pruning. We found that it was a bit
better, but not necessarily significantly better than the current
state, perhaps also due to clang not generating back jumps often.
Hence, we left that as is for now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
during bpf program loading remember the last byte of ctx access
and at the time of attaching the program to tracepoint check that
the program doesn't access bytes beyond defined in tracepoint fields
This also disallows access to __dynamic_array fields, but can be
relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
needs two wrapper functions to fetch 'struct pt_regs *' to convert
tracepoint bpf context into kprobe bpf context to reuse existing
helper functions
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add map_flags attribute to bpf_map_show_fdinfo(), so that tools like
tc can check for them when loading objects from a pinned entry, e.g.
if user intent wrt allocation (BPF_F_NO_PREALLOC) is different to the
pinned object, it can bail out. Follow-up to 6c90598174 ("bpf:
pre-allocate hash map elements"), so that tc can still support this
with v4.6.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull 'objtool' stack frame validation from Ingo Molnar:
"This tree adds a new kernel build-time object file validation feature
(ONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
It was written by and is maintained by Josh Poimboeuf.
The motivation: there's a category of hard to find kernel bugs, most
of them in assembly code (but also occasionally in C code), that
degrades the quality of kernel stack dumps/backtraces. These bugs are
hard to detect at the source code level. Such bugs result in
incorrect/incomplete backtraces most of time - but can also in some
rare cases result in crashes or other undefined behavior.
The build time correctness checking is done via the new 'objtool'
user-space utility that was written for this purpose and which is
hosted in the kernel repository in tools/objtool/. The tool's (very
simple) UI and source code design is shaped after Git and perf and
shares quite a bit of infrastructure with tools/perf (which tooling
infrastructure sharing effort got merged via perf and is already
upstream). Objtool follows the well-known kernel coding style.
Objtool does not try to check .c or .S files, it instead analyzes the
resulting .o generated machine code from first principles: it decodes
the instruction stream and interprets it. (Right now objtool supports
the x86-64 architecture.)
From tools/objtool/Documentation/stack-validation.txt:
"The kernel CONFIG_STACK_VALIDATION option enables a host tool named
objtool which runs at compile time. It has a "check" subcommand
which analyzes every .o file and ensures the validity of its stack
metadata. It enforces a set of rules on asm code and C inline
assembly code so that stack traces can be reliable.
Currently it only checks frame pointer usage, but there are plans to
add CFI validation for C files and CFI generation for asm files.
For each function, it recursively follows all possible code paths
and validates the correct frame pointer state at each instruction.
It also follows code paths involving special sections, like
.altinstructions, __jump_table, and __ex_table, which can add
alternative execution paths to a given instruction (or set of
instructions). Similarly, it knows how to follow switch statements,
for which gcc sometimes uses jump tables."
When this new kernel option is enabled (it's disabled by default), the
tool, if it finds any suspicious assembly code pattern, outputs
warnings in compiler warning format:
warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
warning: objtool:__schedule()+0x3c0: duplicate frame pointer save
warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer
... so that scripts that pick up compiler warnings will notice them.
All known warnings triggered by the tool are fixed by the tree, most
of the commits in fact prepare the kernel to be warning-free. Most of
them are bugfixes or cleanups that stand on their own, but there are
also some annotations of 'special' stack frames for justified cases
such entries to JIT-ed code (BPF) or really special boot time code.
There are two other long-term motivations behind this tool as well:
- To improve the quality and reliability of kernel stack frames, so
that they can be used for optimized live patching.
- To create independent infrastructure to check the correctness of
CFI stack frames at build time. CFI debuginfo is notoriously
unreliable and we cannot use it in the kernel as-is without extra
checking done both on the kernel side and on the build side.
The quality of kernel stack frames matters to debuggability as well,
so IMO we can merge this without having to consider the live patching
or CFI debuginfo angle"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
objtool: Only print one warning per function
objtool: Add several performance improvements
tools: Copy hashtable.h into tools directory
objtool: Fix false positive warnings for functions with multiple switch statements
objtool: Rename some variables and functions
objtool: Remove superflous INIT_LIST_HEAD
objtool: Add helper macros for traversing instructions
objtool: Fix false positive warnings related to sibling calls
objtool: Compile with debugging symbols
objtool: Detect infinite recursion
objtool: Prevent infinite recursion in noreturn detection
objtool: Detect and warn if libelf is missing and don't break the build
tools: Support relative directory path for 'O='
objtool: Support CROSS_COMPILE
x86/asm/decoder: Use explicitly signed chars
objtool: Enable stack metadata validation on 64-bit x86
objtool: Add CONFIG_STACK_VALIDATION option
objtool: Add tool to perform compile-time stack metadata validation
x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
sched: Always inline context_switch()
...
Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,
so even for short task names the output buffer don't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.
Fixes: ffeedafbf0 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0-day bot reported build error:
kernel/built-in.o: In function `map_lookup_elem':
>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'
when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not.
Add weak definition to resolve it.
This code path in map_lookup_elem() is never taken
when CONFIG_PERF_EVENTS is not set.
Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.
The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
cpu0 cpu1
---- ----
user does lookup(stackidX)
starts copying ips into buffer
delete(stackidX)
calls bpf_get_stackid()
which recyles the element and
overwrites with new stack trace
To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.
Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.
Fixes: d5a3b1f691 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.
Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
pcpu_ida 2.1M
pcpu_ida nolock 2.3M
bt 2.4M
kmalloc 1.8M
hlist+spinlock 2.3M
pcpu_freelist 2.6M
4 cpu:
pcpu_ida 1.5M
pcpu_ida nolock 1.8M
bt w/smp_align 1.7M
bt no/smp_align 1.1M
kmalloc 0.7M
hlist+spinlock 0.2M
pcpu_freelist 2.0M
8 cpu:
pcpu_ida 0.7M
bt w/smp_align 0.8M
kmalloc 0.4M
pcpu_freelist 1.5M
32 cpu:
kmalloc 0.13M
pcpu_freelist 0.49M
pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist
hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.
kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce simple percpu_freelist to keep single list of elements
spread across per-cpu singly linked lists.
/* push element into the list */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);
/* pop element from the list */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);
The object is pushed to the current cpu list.
Pop first trying to get the object from the current cpu list,
if it's empty goes to the neigbour cpu list.
For bpf program usage pattern the collision rate is very low,
since programs push and pop the objects typically on the same cpu.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.
Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
objtool reports the following false positive warnings:
kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x5c: sibling call from callable instruction with changed frame pointer
kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x60: function has unreachable instruction
kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x64: function has unreachable instruction
[...]
It's confused by the following dynamic jump instruction in
__bpf_prog_run()::
jmp *(%r12,%rax,8)
which corresponds to the following line in the C code:
goto *jumptable[insn->code];
There's no way for objtool to deterministically find all possible
branch targets for a dynamic jump, so it can't verify this code.
In this case the jumps all stay within the function, and there's nothing
unusual going on related to the stack, so we can whitelist the function.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/b90e6bf3fdbfb5c4cc1b164b965502e53cf48935.1456719558.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when we pass a buffer from the eBPF stack into a helper
function, the function proto indicates argument types as ARG_PTR_TO_STACK
and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1>
must be of the latter type. Then, verifier checks whether the buffer
points into eBPF stack, is initialized, etc. The verifier also guarantees
that the constant value passed in R<X+1> is greater than 0, so helper
functions don't need to test for it and can always assume a non-NULL
initialized buffer as well as non-0 buffer size.
This patch adds a new argument types ARG_CONST_STACK_SIZE_OR_ZERO that
allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function.
Such helper functions, of course, need to be able to handle these cases
internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0
or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any
other combinations are not possible to load.
I went through various options of extending the verifier, and introducing
the type ARG_CONST_STACK_SIZE_OR_ZERO seems to have most minimal changes
needed to the verifier.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type to store stack traces and corresponding helper
bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - numer of stack frames to skip
bit 8 - collect user stack instead of kernel
bit 9 - compare stacks by hash only
bit 10 - if two different stacks hash into the same stackid
discard old
other bits - reserved
Return: >= 0 stackid on success or negative error
stackid is a 32-bit integer handle that can be further combined with
other data (including other stackid) and used as a key into maps.
Userspace will access stackmap using standard lookup/delete syscall commands to
retrieve full stack trace for given stackid.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not,
which pointed out a missing rcu read lock.
Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array maps")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ctx access is used, the kernel often needs to expand/rewrite
instructions, so after that patching, branch offsets have to be
adjusted for both forward and backward jumps in the new eBPF program,
but for backward jumps it fails to account the delta. Meaning, for
example, if the expansion happens exactly on the insn that sits at
the jump target, it doesn't fix up the back jump offset.
Analysis on what the check in adjust_branches() is currently doing:
/* adjust offset of jmps if necessary */
if (i < pos && i + insn->off + 1 > pos)
insn->off += delta;
else if (i > pos && i + insn->off + 1 < pos)
insn->off -= delta;
First condition (forward jumps):
Before: After:
insns[0] insns[0]
insns[1] <--- i/insn insns[1] <--- i/insn
insns[2] <--- pos insns[P] <--- pos
insns[3] insns[P] `------| delta
insns[4] <--- target_X insns[P] `-----|
insns[5] insns[3]
insns[4] <--- target_X
insns[5]
First case is if we cross pos-boundary and the jump instruction was
before pos. This is handeled correctly. I.e. if i == pos, then this
would mean our jump that we currently check was the patchlet itself
that we just injected. Since such patchlets are self-contained and
have no awareness of any insns before or after the patched one, the
delta is correctly not adjusted. Also, for the second condition in
case of i + insn->off + 1 == pos, means we jump to that newly patched
instruction, so no offset adjustment are needed. That part is correct.
Second condition (backward jumps):
Before: After:
insns[0] insns[0]
insns[1] <--- target_X insns[1] <--- target_X
insns[2] <--- pos <-- target_Y insns[P] <--- pos <-- target_Y
insns[3] insns[P] `------| delta
insns[4] <--- i/insn insns[P] `-----|
insns[5] insns[3]
insns[4] <--- i/insn
insns[5]
Second interesting case is where we cross pos-boundary and the jump
instruction was after pos. Backward jump with i == pos would be
impossible and pose a bug somewhere in the patchlet, so the first
condition checking i > pos is okay only by itself. However, i +
insn->off + 1 < pos does not always work as intended to trigger the
adjustment. It works when jump targets would be far off where the
delta wouldn't matter. But, for example, where the fixed insn->off
before pointed to pos (target_Y), it now points to pos + delta, so
that additional room needs to be taken into account for the check.
This means that i) both tests here need to be adjusted into pos + delta,
and ii) for the second condition, the test needs to be <= as pos
itself can be a target in the backjump, too.
Fixes: 9bac3d6d54 ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.
Example of single counter aggregation in user space:
unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
long values[nr_cpus];
long value = 0;
bpf_lookup_elem(fd, key, values);
for (i = 0; i < nr_cpus; i++)
value += values[i];
The user space must provide round_up(value_size, 8) * nr_cpus
array to get/set values, since kernel will use 'long' copy
of per-cpu values to try to copy good counters atomically.
It's a best-effort, since bpf programs and user space are racing
to access the same memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Primary use case is a histogram array of latency
where bpf program computes the latency of block requests or other
events and stores histogram of latency into array of 64 elements.
All cpus are constantly running, so normal increment is not accurate,
bpf_xadd causes cache ping-pong and this per-cpu approach allows
fastest collision-free counters.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce BPF_MAP_TYPE_PERCPU_HASH map type which is used to do
accurate counters without need to use BPF_XADD instruction which turned
out to be too costly for high-performance network monitoring.
In the typical use case the 'key' is the flow tuple or other long
living object that sees a lot of events per second.
bpf_map_lookup_elem() returns per-cpu area.
Example:
struct {
u32 packets;
u32 bytes;
} * ptr = bpf_map_lookup_elem(&map, &key);
/* ptr points to this_cpu area of the value, so the following
* increments will not collide with other cpus
*/
ptr->packets ++;
ptr->bytes += skb->len;
bpf_update_elem() atomically creates a new element where all per-cpu
values are zero initialized and this_cpu value is populated with
given 'value'.
Note that non-per-cpu hash map always allocates new element
and then deletes old after rcu grace period to maintain atomicity
of update. Per-cpu hash map updates element values in-place.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT. Since these shifts
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both htab_map_update_elem() and htab_map_delete_elem() can be
called from eBPF program, and they may be in kernel hot path,
so it isn't efficient to use a per-hashtable lock in this two
helpers.
The per-hashtable spinlock is used for protecting bucket's
hlist, and per-bucket lock is just enough. This patch converts
the per-hashtable lock into per-bucket spinlock, so that
contention can be decreased a lot.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock is just used for protecting the per-bucket
hlist, so it isn't needed for selecting bucket.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preparing for removing global per-hashtable lock, so
the counter need to be defined as aotmic_t first.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Back in the days where eBPF (or back then "internal BPF" ;->) was not
exposed to user space, and only the classic BPF programs internally
translated into eBPF programs, we missed the fact that for classic BPF
A and X needed to be cleared. It was fixed back then via 83d5b7ef99
("net: filter: initialize A and X registers"), and thus classic BPF
specifics were added to the eBPF interpreter core to work around it.
This added some confusion for JIT developers later on that take the
eBPF interpreter code as an example for deriving their JIT. F.e. in
f75298f5c3 ("s390/bpf: clear correct BPF accumulator register"), at
least X could leak stack memory. Furthermore, since this is only needed
for classic BPF translations and not for eBPF (verifier takes care
that read access to regs cannot be done uninitialized), more complexity
is added to JITs as they need to determine whether they deal with
migrations or native eBPF where they can just omit clearing A/X in
their prologue and thus reduce image size a bit, see f.e. cde66c2d88
("s390/bpf: Only clear A and X for converted BPF programs"). In other
cases (x86, arm64), A and X is being cleared in the prologue also for
eBPF case, which is unnecessary.
Lets move this into the BPF migration in bpf_convert_filter() where it
actually belongs as long as the number of eBPF JITs are still few. It
can thus be done generically; allowing us to remove the quirk from
__bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
while reducing code duplication on this matter in current(/future) eBPF
JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: Yang Shi <yang.shi@linaro.org>
Acked-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for renaming and hard links to the fs. Most of this can be
implemented by using simple library operations under the same constraints
that we don't use a reserved name like elsewhere. Linking can be useful
to share/manage things like maps across subsystem users. It works within
the file system boundary, but is not allowed for directories.
Symbolic links are explicitly not implemented here, as it can be better
done already by doing bind mounts inside bpf fs to set up shared directories
f.e. useful when using volumes in docker containers that map a private
working directory into /sys/fs/bpf/ which contains itself a bind mounted
path from the host's /sys/fs/bpf/ mount that is shared among multiple
containers. For single maps instead of whole directory, hard links can
be easily used to do the same.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/renesas/ravb_main.c
kernel/bpf/syscall.c
net/ipv4/ipmr.c
All three conflicts were cases of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
For large map->value_size the user space can trigger memory allocation warnings like:
WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
__alloc_pages_nodemask+0x695/0x14e0()
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
[<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
[<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
[< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989
[<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
[<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
[< inline >] alloc_pages include/linux/gfp.h:451
[<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
[<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
[<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
[< inline >] kmalloc_large include/linux/slab.h:390
[<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
[< inline >] kmalloc include/linux/slab.h:463
[< inline >] map_update_elem kernel/bpf/syscall.c:288
[< inline >] SYSC_bpf kernel/bpf/syscall.c:744
To avoid never succeeding kmalloc with order >= MAX_ORDER check that
elem->value_size and computed elem_size are within limits for both hash and
array type maps.
Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
so keep those kmalloc-s as-is.
Large value_size can cause integer overflows in elem_size and map.pages
formulas, so check for that as well.
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
During own review but also reported by Dmitry's syzkaller [1] it has been
noticed that we trigger a heap out-of-bounds access on eBPF array maps
when updating elements. This happens with each map whose map->value_size
(specified during map creation time) is not multiple of 8 bytes.
In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
used to align array map slots for faster access. However, in function
array_map_update_elem(), we update the element as ...
memcpy(array->value + array->elem_size * index, value, array->elem_size);
... where we access 'value' out-of-bounds, since it was allocated from
map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
and later on copied through copy_from_user(value, uvalue, map->value_size).
Thus, up to 7 bytes, we can access out-of-bounds.
Same could happen from within an eBPF program, where in worst case we
access beyond an eBPF program's designated stack.
Since 1be7f75d16 ("bpf: enable non-root eBPF programs") didn't hit an
official release yet, it only affects priviledged users.
In case of array_map_lookup_elem(), the verifier prevents eBPF programs
from accessing beyond map->value_size through check_map_access(). Also
from syscall side map_lookup_elem() only copies map->value_size back to
user, so nothing could leak.
[1] http://github.com/google/syzkaller
Fixes: 28fbcfa08d ("bpf: add array type of eBPF maps")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.
Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.
Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.
Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.
Joint work with Alexei.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a handler for show_fdinfo() to be used by the anon-inodes
backend for eBPF maps, and dump the map specification there. Not
only useful for admins, but also it provides a minimal way to
compare specs from ELF vs pinned object.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>