Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop failing
to deliver SIGHUP but always trying.
When the stack overflows delivery of SIGHUP fails and force_sigsegv is
called. Unfortunately because SIGSEGV is numerically higher than
SIGHUP next_signal tries again to deliver a SIGHUP.
From a quality of implementation standpoint attempting to deliver the
timer SIGHUP signal is wrong. We should attempt to deliver the
synchronous SIGSEGV signal we just forced.
We can make that happening in a fairly straight forward manner by
instead of just looking at the signal number we also look at the
si_code. In particular for exceptions (aka synchronous signals) the
si_code is always greater than 0.
That still has the potential to pick up a number of asynchronous
signals as in a few cases the same si_codes that are used
for synchronous signals are also used for asynchronous signals,
and SI_KERNEL is also included in the list of possible si_codes.
Still the heuristic is much better and timer signals are definitely
excluded. Which is enough to prevent all known ways for someone
sending a process signals fast enough to cause unexpected and
arguably incorrect behavior.
Cc: stable@vger.kernel.org
Fixes: a27341cd5f ("Prioritize synchronous signals over 'normal' signals")
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop
failing to deliver SIGHUP but always trying.
Upon examination it turns out part of the problem is actually most of
the solution. Since 2.5 signal delivery has found all fatal signals,
marked the signal group for death, and queued SIGKILL in every threads
thread queue relying on signal->group_exit_code to preserve the
information of which was the actual fatal signal.
The conversion of all fatal signals to SIGKILL results in the
synchronous signal heuristic in next_signal kicking in and preferring
SIGHUP to SIGKILL. Which is especially problematic as all
fatal signals have already been transformed into SIGKILL.
Instead of dequeueing signals and depending upon SIGKILL to
be the first signal dequeued, first test if the signal group
has already been marked for death. This guarantees that
nothing in the signal queue can prevent a process that needs
to exit from exiting.
Cc: stable@vger.kernel.org
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- Cut and paste fix to have uprobe printks say "uprobe" and not "kprobe"
- Add terminating '\0' byte when copying of function arguments
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXFsNwBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlSLAP9iup0t8BrDYcJQCNIJK+7hwBdB642c
11qbWE7aXfsyUwEAu78tJfQfmBeZz7mHxKeMkTHQHE2IqV5qU311twOFiAE=
=zkXr
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This has two fixes for uprobe code.
- Cut and paste fix to have uprobe printks say "uprobe" and not
"kprobe"
- Add terminating '\0' byte when copying function arguments"
* tag 'trace-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/uprobes: Fix output for multiple string arguments
tracing: uprobes: Fix typo in pr_fmt string
Since 2.5.34 the code has had the potential to not allocate siginfo
for SIGSTOP signals. Except for ptrace this is perfectly fine as only
ptrace can use PTRACE_PEEK_SIGINFO and see what the contents of
the delivered siginfo are.
Users of PTRACE_PEEK_SIGINFO that care about the contents siginfo
for SIGSTOP are rare, but they do exist. A seccomp self test
has cared and lldb cares.
Jack Andersen <jackoalan@gmail.com> writes:
> The patch titled
> `signal: Never allocate siginfo for SIGKILL or SIGSTOP`
> created a regression for users of PTRACE_GETSIGINFO needing to
> discern signals that were raised via the tgkill syscall.
>
> A notable user of this tgkill+ptrace combination is lldb while
> debugging a multithreaded program. Without the ability to detect a
> SIGSTOP originating from tgkill, lldb does not have a way to
> synchronize on a per-thread basis and falls back to SIGSTOP-ing the
> entire process.
Everyone affected by this please note. The kernel can still fail to
allocate a siginfo structure. The allocation is with GFP_KERNEL and
is best effort only. If memory is tight when the signal allocation
comes in this will fail to allocate a siginfo.
So I strongly recommend looking at more robust solutions for
synchronizing with a single thread such as PTRACE_INTERRUPT. Or if
that does not work persuading your friendly local kernel developer to
build the interface you need.
Reported-by: Tycho Andersen <tycho@tycho.ws>
Reported-by: Kees Cook <keescook@chromium.org>
Reported-by: Jack Andersen <jackoalan@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Christian Brauner <christian@brauner.io>
Cc: stable@vger.kernel.org
Fixes: f149b31557 ("signal: Never allocate siginfo for SIGKILL or SIGSTOP")
Fixes: 6dfc88977e42 ("[PATCH] shared thread signals")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The perf tool uses /proc/sys/kernel/perf_event_mlock_kb to determine how
large its ringbuffer mmap should be. This can be configured to arbitrary
values, which can be larger than the maximum possible allocation from
kmalloc.
When this is configured to a suitably large value (e.g. thanks to the
perf fuzzer), attempting to use perf record triggers a WARN_ON_ONCE() in
__alloc_pages_nodemask():
WARNING: CPU: 2 PID: 5666 at mm/page_alloc.c:4511 __alloc_pages_nodemask+0x3f8/0xbc8
Let's avoid this by checking that the requested allocation is possible
before calling kzalloc.
Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190110142745.25495-1-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cpu hotplug fixes from Thomas Gleixner:
"Two fixes for the cpu hotplug machinery:
- Replace the overly clever 'SMT disabled by BIOS' detection logic as
it breaks KVM scenarios and prevents speculation control updates
when the Hyperthreads are brought online late after boot.
- Remove a redundant invocation of the speculation control update
function"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
x86/speculation: Remove redundant arch_smt_update() invocation
Pull perf fixes from Thomas Gleixner:
"A pile of perf updates:
- Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent
write handler
- Cure a perf script crash which caused by an unitinialized data
structure
- Highlight the hottest instruction in perf top and not a random one
- Cure yet another clang issue when building perf python
- Handle topology entries with no CPU correctly in the tools
- Handle perf data which contains both tracepoints and performance
counter entries correctly.
- Add a missing NULL pointer check in perf ordered_events_free()"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf script: Fix crash when processing recorded stat data
perf top: Fix wrong hottest instruction highlighted
perf tools: Handle TOPOLOGY headers with no CPU
perf python: Remove -fstack-clash-protection when building with some clang versions
perf core: Fix perf_proc_update_handler() bug
perf script: Fix crash with printing mixed trace point and other events
perf ordered_events: Fix crash in ordered_events__free
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then
zap_pid_ns_processes() waits on all tasks in a current pidns, and only
then are tasks from the dead list released.
zap_pid_ns_processes() can get stuck on waiting tasks from the dead
list. In this case, we will have one unkillable process with one or
more dead children.
Thanks to Oleg for the advice to release tasks in find_child_reaper().
Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
Fixes: 7c8bd2322c ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexei Starovoitov says:
====================
pull-request: bpf 2019-01-31
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) disable preemption in sender side of socket filters, from Alexei.
2) fix two potential deadlocks in syscall bpf lookup and prog_register,
from Martin and Alexei.
3) fix BTF to allow typedef on func_proto, from Yonghong.
4) two bpftool fixes, from Jiri and Paolo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands
and for map_update() helper function.
In all these cases take a lock of existing element (which was provided
in BTF description) before copying (in or out) the rest of map value.
Implementation details that are part of uapi:
Array:
The array map takes the element lock for lookup/update.
Hash:
hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
If old element exists it takes the element lock and updates the element in place.
If element doesn't exist it allocates new one and inserts into hash table
while holding the bucket lock.
In rare case the hashmap has to take both the bucket lock and the element lock
to update old value in place.
Cgroup local storage:
It is similar to array. update in place and lookup are done with lock taken.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.
Example:
struct hash_elem {
int cnt;
struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
bpf_spin_lock(&val->lock);
val->cnt++;
bpf_spin_unlock(&val->lock);
}
Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
cause dead locks.
- only one 'struct bpf_spin_lock' is allowed per map element.
It drastically simplifies implementation yet allows bpf program to use
any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
insufficient preemption checks
Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
Other solutions to avoid nested bpf_spin_lock are possible.
Like making sure that all networking progs run with softirq disabled.
spin_lock_irqsave is the simplest and doesn't add overhead to the
programs that don't use it.
- arch_spinlock_t is used when its implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
extra flag during map_create, but specifying it via BTF is cleaner.
It provides introspection for map key/value and reduces user mistakes.
Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
to request kernel to grab bpf_spin_lock before rewriting the value.
That will serialize access to map elements.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The map_lookup_elem used to not acquiring spinlock
in order to optimize the reader.
It was true until commit 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
The syscall's map_lookup_elem(stackmap) calls bpf_stackmap_copy().
bpf_stackmap_copy() may find the elem no longer needed after the copy is done.
If that is the case, pcpu_freelist_push() saves this elem for reuse later.
This push requires a spinlock.
If a tracing bpf_prog got run in the middle of the syscall's
map_lookup_elem(stackmap) and this tracing bpf_prog is calling
bpf_get_stackid(stackmap) which also requires the same pcpu_freelist's
spinlock, it may end up with a dead lock situation as reported by
Eric Dumazet in https://patchwork.ozlabs.org/patch/1030266/
The situation is the same as the syscall's map_update_elem() which
needs to acquire the pcpu_freelist's spinlock and could race
with tracing bpf_prog. Hence, this patch fixes it by protecting
bpf_stackmap_copy() with this_cpu_inc(bpf_prog_active)
to prevent tracing bpf_prog from running.
A later syscall's map_lookup_elem commit f1a2e44a3a ("bpf: add queue and stack maps")
also acquires a spinlock and races with tracing bpf_prog similarly.
Hence, this patch is forward looking and protects the majority
of the map lookups. bpf_map_offload_lookup_elem() is the exception
since it is for network bpf_prog only (i.e. never called by tracing
bpf_prog).
Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Disabled preemption is necessary for proper access to per-cpu maps
from BPF programs.
But the sender side of socket filters didn't have preemption disabled:
unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN
and a combination of af_packet with tun device didn't disable either:
tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue->
tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN
Disable preemption before executing BPF programs (both classic and extended).
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If create_buf_file() returns an error, don't try to reference it later
as a valid dentry pointer.
This problem was exposed when debugfs started to return errors instead
of just NULL for some calls when they do not succeed properly.
Also, the check for WARN_ON(dentry) was just wrong :)
Reported-by: Kees Cook <keescook@chromium.org>
Reported-and-tested-by: syzbot+16c3a70e1e9b29346c43@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Fixes: ff9fb72bc0 ("debugfs: return error values, not NULL")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Building with W=1 reveals some bitrot:
CC kernel/bpf/cgroup.o
kernel/bpf/cgroup.c:238: warning: Function parameter or member 'flags' not described in '__cgroup_bpf_attach'
kernel/bpf/cgroup.c:367: warning: Function parameter or member 'unused_flags' not described in '__cgroup_bpf_detach'
Add a kerneldoc line for 'flags'.
Fixing the warning for 'unused_flags' is best approached by
removing the unused parameter on the function call.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Over the years, the function signature has changed, but the
kerneldoc block hasn't.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
With the following commit:
73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
... the hotplug code attempted to detect when SMT was disabled by BIOS,
in which case it reported SMT as permanently disabled. However, that
code broke a virt hotplug scenario, where the guest is booted with only
primary CPU threads, and a sibling is brought online later.
The problem is that there doesn't seem to be a way to reliably
distinguish between the HW "SMT disabled by BIOS" case and the virt
"sibling not yet brought online" case. So the above-mentioned commit
was a bit misguided, as it permanently disabled SMT for both cases,
preventing future virt sibling hotplugs.
Going back and reviewing the original problems which were attempted to
be solved by that commit, when SMT was disabled in BIOS:
1) /sys/devices/system/cpu/smt/control showed "on" instead of
"notsupported"; and
2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
I'd propose that we instead consider #1 above to not actually be a
problem. Because, at least in the virt case, it's possible that SMT
wasn't disabled by BIOS and a sibling thread could be brought online
later. So it makes sense to just always default the smt control to "on"
to allow for that possibility (assuming cpuid indicates that the CPU
supports SMT).
The real problem is #2, which has a simple fix: change vmx_vm_init() to
query the actual current SMT state -- i.e., whether any siblings are
currently online -- instead of looking at the SMT "control" sysfs value.
So fix it by:
a) reverting the original "fix" and its followup fix:
73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
bc2d8d262c ("cpu/hotplug: Fix SMT supported evaluation")
and
b) changing vmx_vm_init() to query the actual current SMT state --
instead of the sysfs control value -- to determine whether the L1TF
warning is needed. This also requires the 'sched_smt_present'
variable to exported, instead of 'cpu_smt_control'.
Fixes: 73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Mario <jmario@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
Current implementation does not allow typedef func_proto.
But it is actually allowed.
-bash-4.4$ cat t.c
typedef int (f) (int);
f *g;
-bash-4.4$ clang -O2 -g -c -target bpf t.c -Xclang -target-feature -Xclang +dwarfris
-bash-4.4$ pahole -JV t.o
File t.o:
[1] PTR (anon) type_id=2
[2] TYPEDEF f type_id=3
[3] FUNC_PROTO (anon) return=4 args=(4 (anon))
[4] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED
-bash-4.4$
This patch related btf verifier to allow such (typedef func_proto)
patterns.
Fixes: 2667a2626f ("bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With commit a74cfffb03 ("x86/speculation: Rework SMT state change"),
arch_smt_update() is invoked from each individual CPU hotplug function.
Therefore the extra arch_smt_update() call in the sysfs SMT control is
redundant.
Fixes: a74cfffb03 ("x86/speculation: Rework SMT state change")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <konrad.wilk@oracle.com>
Cc: <dwmw@amazon.co.uk>
Cc: <bp@suse.de>
Cc: <srinivas.eeda@oracle.com>
Cc: <peterz@infradead.org>
Cc: <hpa@zytor.com>
Link: https://lkml.kernel.org/r/e2e064f2-e8ef-42ca-bf4f-76b612964752@default
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-01-29
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Teach verifier dead code removal, this also allows for optimizing /
removing conditional branches around dead code and to shrink the
resulting image. Code store constrained architectures like nfp would
have hard time doing this at JIT level, from Jakub.
2) Add JMP32 instructions to BPF ISA in order to allow for optimizing
code generation for 32-bit sub-registers. Evaluation shows that this
can result in code reduction of ~5-20% compared to 64 bit-only code
generation. Also add implementation for most JITs, from Jiong.
3) Add support for __int128 types in BTF which is also needed for
vmlinux's BTF conversion to work, from Yonghong.
4) Add a new command to bpftool in order to dump a list of BPF-related
parameters from the system or for a specific network device e.g. in
terms of available prog/map types or helper functions, from Quentin.
5) Add AF_XDP sock_diag interface for querying sockets from user
space which provides information about the RX/TX/fill/completion
rings, umem, memory usage etc, from Björn.
6) Add skb context access for skb_shared_info->gso_segs field, from Eric.
7) Add support for testing flow dissector BPF programs by extending
existing BPF_PROG_TEST_RUN infrastructure, from Stanislav.
8) Split BPF kselftest's test_verifier into various subgroups of tests
in order better deal with merge conflicts in this area, from Jakub.
9) Add support for queue/stack manipulations in bpftool, from Stanislav.
10) Document BTF, from Yonghong.
11) Dump supported ELF section names in libbpf on program load
failure, from Taeung.
12) Silence a false positive compiler warning in verifier's BTF
handling, from Peter.
13) Fix help string in bpftool's feature probing, from Prashant.
14) Remove duplicate includes in BPF kselftests, from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull timer fix from Thomas Glexiner:
"A single regression fix to address the unintended breakage of posix
cpu timers.
This is caused by a new sanity check in the common code, which fails
for posix cpu timers under certain conditions because the posix cpu
timer code never updates the variable which is checked"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Unbreak timer rearming
Pull locking fixes from Thomas Gleixner:
"A small series of fixes which all address possible missed wakeups:
- Document and fix the wakeup ordering of wake_q
- Add the missing barrier in rcuwait_wake_up(), which was documented
in the comment but missing in the code
- Fix the possible missed wakeups in the rwsem and futex code"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Fix (possible) missed wakeup
futex: Fix (possible) missed wakeup
sched/wake_q: Fix wakeup ordering for wake_q
sched/wake_q: Document wake_q_add()
sched/wait: Fix rcuwait_wake_up() ordering
Pull irq fixes from Thomas Gleixner:
"A small set of fixes for the interrupt subsystem:
- Fix a double increment in the irq descriptor allocator which
resulted in a sanity check only being done for every second
affinity mask
- Add a missing device tree translation in the stm32-exti driver.
Without that the interrupt association is completely wrong.
- Initialize the mutex in the GIC-V3 MBI driver
- Fix the alignment for aliasing devices in the GIC-V3-ITS driver so
multi MSI allocations work correctly
- Ensure that the initial affinity of a interrupt is not empty at
startup time.
- Drop bogus include in the madera irq chip driver
- Fix KernelDoc regression"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
genirq/irqdesc: Fix double increment in alloc_descs()
genirq: Fix the kerneldoc comment for struct irq_affinity_desc
irqchip/madera: Drop GPIO includes
irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
irqchip/stm32-exti: Add domain translate function
genirq: Make sure the initial affinity is not empty
This patch adds JIT blinds support for JMP32.
Like BPF_JMP_REG/IMM, JMP32 version are needed for building raw bpf insn.
They are added to both include/linux/filter.h and
tools/include/linux/filter.h.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch implements interpreting new JMP32 instructions.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch teaches disassembler about JMP32. There are two places to
update:
- Class 0x6 now used by BPF_JMP32, not "unused".
- BPF_JMP32 need to show comparison operands properly.
The disassemble format is to add an extra "(32)" before the operands if
it is a sub-register. A better disassemble format for both JMP32 and
ALU32 just show the register prefix as "w" instead of "r", this is the
format using by LLVM assembler.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch teach verifier about the new BPF_JMP32 instruction class.
Verifier need to treat it similar as the existing BPF_JMP class.
A BPF_JMP32 insn needs to go through all checks that have been done on
BPF_JMP.
Also, verifier is doing runtime optimizations based on the extra info
conditional jump instruction could offer, especially when the comparison is
between constant and register that the value range of the register could be
improved based on the comparison results. These code are updated
accordingly.
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current min/max code does both signed and unsigned comparisons against
the input argument "val" which is "u64" and there is explicit type casting
when the comparison is signed.
As we will need slightly more complexer type casting when JMP32 introduced,
it is better to host the signed type casting. This makes the code more
clean with ignorable runtime overhead.
Also, code for J*GE/GT/LT/LE and JEQ/JNE are very similar, this patch
combine them.
The main purpose for this refactor is to make sure the min/max code will
still be readable and with minimum code duplication after JMP32 introduced.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Let offload JITs know when instructions are replaced and optimized
out, so they can update their state appropriately. The optimizations
are best effort, if JIT returns an error from any callback verifier
will stop notifying it as state may now be out of sync, but the
verifier continues making progress.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The communication between the verifier and advanced JITs is based
on instruction indexes. We have to keep them stable throughout
the optimizations otherwise referring to a particular instruction
gets messy quickly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Unconditional branches by 0 instructions are basically noops
but they can result from earlier optimizations, e.g. a conditional
jumps which would never be taken or a conditional jump around
dead code.
Remove those branches.
v0.2:
- s/opt_remove_dead_branches/opt_remove_nops/ (Jiong).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of overwriting dead code with jmp -1 instructions
remove it completely for root. Adjust verifier state and
line info appropriately.
v2:
- adjust func_info (Alexei);
- make sure first instruction retains line info (Alexei).
v4: (Yonghong)
- remove unnecessary if (!insn to remove) checks;
- always keep last line info if first live instruction lacks one.
v5: (Martin Lau)
- improve and clarify comments.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Loading programs with dead code becomes more and more
common, as people begin to patch constants at load time.
Turn conditional jumps to unconditional ones, to avoid
potential branch misprediction penalty.
This optimization is enabled for privileged users only.
For branches which just fall through we could just mark
them as not seen and have dead code removal take care of
them, but that seems less clean.
v0.2:
- don't call capable(CAP_SYS_ADMIN) twice (Jiong).
v3:
- fix GCC warning;
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In preparation for code removal change parameters to branch
and call adjustment functions to be more universal. The
current parameters assume we are patching a single instruction
with a longer set.
A diagram may help reading the change, this is for the patch
single case, patching instruction 1 with a replacement of 4:
____
0 |____|
1 |____| <-- pos ^
2 | | <-- end old ^ |
3 | | | delta | len
4 |____| | | (patch region)
5 | | <-- end new v v
6 |____|
end_old = pos + 1
end_new = pos + delta + 1
If we are before the patch region - curr variable and the target
are fully in old coordinates (hence comparing against end_old).
If we are after the region curr is in new coordinates (hence
the comparison to end_new) but target is in mixed coordinates,
so we just check if it falls before end_new, and if so it needs
the adjustment.
Note that we will not fix up branches which land in removed region
in case of removal, which should be okay, as we are only going to
remove dead code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Kernel:
Stephane Eranian:
- Fix perf_proc_update_handler() bug.
perf script:
Andi Kleen:
- Fix crash with printing mixed trace point and other events.
Tony Jones:
- Fix crash when processing recorded stat data.
perf top:
He Kuang:
- Fix wrong hottest instruction highlighted.
perf python:
Arnaldo Carvalho de Melo:
- Remove -fstack-clash-protection when building with some clang versions.
perf ordered_events:
Jiri Olsa:
- Fix out of buffers crash in ordered_events__free().
perf cpu_map:
Stephane Eranian:
- Handle TOPOLOGY headers with no CPU.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXEYQMgAKCRCyPKLppCJ+
J/pYAP0c+6frwxCAll72bigi+/+5t+1kc/zpM5jNgt97moGh2AD+KKrN5h4E0Z/J
g5T2FOpiwB4cxpVjYTVRchDlx9JohgA=
=X2zj
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-for-mingo-5.0-20190121' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
Kernel:
Stephane Eranian:
- Fix perf_proc_update_handler() bug.
perf script:
Andi Kleen:
- Fix crash with printing mixed trace point and other events.
Tony Jones:
- Fix crash when processing recorded stat data.
perf top:
He Kuang:
- Fix wrong hottest instruction highlighted.
perf python:
Arnaldo Carvalho de Melo:
- Remove -fstack-clash-protection when building with some clang versions.
perf ordered_events:
Jiri Olsa:
- Fix out of buffers crash in ordered_events__free().
perf cpu_map:
Stephane Eranian:
- Handle TOPOLOGY headers with no CPU.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because wake_q_add() can imply an immediate wakeup (cmpxchg failure
case), we must not rely on the wakeup being delayed. However, commit:
e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")
relies on exactly that behaviour in that the wakeup must not happen
until after we clear waiter->task.
[ peterz: Added changelog. ]
Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e38513905e ("locking/rwsem: Rework zeroing reader waiter->task")
Link: https://lkml.kernel.org/r/1543495830-2644-1-git-send-email-xieyongji@baidu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We must not rely on wake_q_add() to delay the wakeup; in particular
commit:
1d0dcb3ad9 ("futex: Implement lockless wakeups")
moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
could result in futex_wait() waking before observing ->lock_ptr ==
NULL and going back to sleep again.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1d0dcb3ad9 ("futex: Implement lockless wakeups")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Notable cmpxchg() does not provide ordering when it fails, however
wake_q_add() requires ordering in this specific case too. Without this
it would be possible for the concurrent wakeup to not observe our
prior state.
Andrea Parri provided:
C wake_up_q-wake_q_add
{
int next = 0;
int y = 0;
}
P0(int *next, int *y)
{
int r0;
/* in wake_up_q() */
WRITE_ONCE(*next, 1); /* node->next = NULL */
smp_mb(); /* implied by wake_up_process() */
r0 = READ_ONCE(*y);
}
P1(int *next, int *y)
{
int r1;
/* in wake_q_add() */
WRITE_ONCE(*y, 1); /* wake_cond = true */
smp_mb__before_atomic();
r1 = cmpxchg_relaxed(next, 1, 2);
}
exists (0:r0=0 /\ 1:r1=0)
This "exists" clause cannot be satisfied according to the LKMM:
Test wake_up_q-wake_q_add Allowed
States 3
0:r0=0; 1:r1=1;
0:r0=1; 1:r1=0;
0:r0=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r0=0 /\ 1:r1=0)
Observation wake_up_q-wake_q_add Never 0 3
Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The only guarantee provided by wake_q_add() is that a wakeup will
happen after it, it does _NOT_ guarantee the wakeup will be delayed
until the matching wake_up_q().
If wake_q_add() fails the cmpxchg() a concurrent wakeup is pending and
that can happen at any time after the cmpxchg(). This means we should
not rely on the wakeup happening at wake_q_up(), but should be ready
for wake_q_add() to issue the wakeup.
The delay; if provided (most likely); should only result in more efficient
behaviour.
Reported-by: Yongji Xie <elohimes@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For some peculiar reason rcuwait_wake_up() has the right barrier in
the comment, but not in the code.
This mistake has been observed to cause a deadlock in the following
situation:
P1 P2
percpu_up_read() percpu_down_write()
rcu_sync_is_idle() // false
rcu_sync_enter()
...
__percpu_up_read()
[S] ,- __this_cpu_dec(*sem->read_count)
| smp_rmb();
[L] | task = rcu_dereference(w->task) // NULL
|
| [S] w->task = current
| smp_mb();
| [L] readers_active_check() // fail
`-> <store happens here>
Where the smp_rmb() (obviously) fails to constrain the store.
[ peterz: Added changelog. ]
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Fix endless loop in nf_tables, from Phil Sutter.
2) Fix cross namespace ip6_gre tunnel hash list corruption, from
Olivier Matz.
3) Don't be too strict in phy_start_aneg() otherwise we might not allow
restarting auto negotiation. From Heiner Kallweit.
4) Fix various KMSAN uninitialized value cases in tipc, from Ying Xue.
5) Memory leak in act_tunnel_key, from Davide Caratti.
6) Handle chip errata of mv88e6390 PHY, from Andrew Lunn.
7) Remove linear SKB assumption in fou/fou6, from Eric Dumazet.
8) Missing udplite rehash callbacks, from Alexey Kodanev.
9) Log dirty pages properly in vhost, from Jason Wang.
10) Use consume_skb() in neigh_probe() as this is a normal free not a
drop, from Yang Wei. Likewise in macvlan_process_broadcast().
11) Missing device_del() in mdiobus_register() error paths, from Thomas
Petazzoni.
12) Fix checksum handling of short packets in mlx5, from Cong Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (96 commits)
bpf: in __bpf_redirect_no_mac pull mac only if present
virtio_net: bulk free tx skbs
net: phy: phy driver features are mandatory
isdn: avm: Fix string plus integer warning from Clang
net/mlx5e: Fix cb_ident duplicate in indirect block register
net/mlx5e: Fix wrong (zero) TX drop counter indication for representor
net/mlx5e: Fix wrong error code return on FEC query failure
net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
tools: bpftool: Cleanup license mess
bpf: fix inner map masking to prevent oob under speculation
bpf: pull in pkt_sched.h header for tooling to fix bpftool build
selftests: forwarding: Add a test case for externally learned FDB entries
selftests: mlxsw: Test FDB offload indication
mlxsw: spectrum_switchdev: Do not treat static FDB entries as sticky
net: bridge: Mark FDB entries that were added by user as such
mlxsw: spectrum_fid: Update dummy FID index
mlxsw: pci: Return error on PCI reset timeout
mlxsw: pci: Increase PCI SW reset timeout
mlxsw: pci: Ring CQ's doorbell before RDQ's
MAINTAINERS: update email addresses of liquidio driver maintainers
...
The perf_proc_update_handler() handles /proc/sys/kernel/perf_event_max_sample_rate
syctl variable. When the PMU IRQ handler timing monitoring is disabled, i.e,
when /proc/sys/kernel/perf_cpu_time_max_percent is equal to 0 or 100,
then no modification to sysctl_perf_event_sample_rate is allowed to prevent
possible hang from wrong values.
The problem is that the test to prevent modification is made after the
sysctl variable is modified in perf_proc_update_handler().
You get an error:
$ echo 10001 >/proc/sys/kernel/perf_event_max_sample_rate
echo: write error: invalid argument
But the value is still modified causing all sorts of inconsistencies:
$ cat /proc/sys/kernel/perf_event_max_sample_rate
10001
This patch fixes the problem by moving the parsing of the value after
the test.
Committer testing:
# echo 100 > /proc/sys/kernel/perf_cpu_time_max_percent
# echo 10001 > /proc/sys/kernel/perf_event_max_sample_rate
-bash: echo: write error: Invalid argument
# cat /proc/sys/kernel/perf_event_max_sample_rate
10001
#
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1547169436-6266-1-git-send-email-eranian@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The sys_ipc() and compat_ksys_ipc() functions are meant to only
be used from the system call table, not called by another function.
Introduce ksys_*() interfaces for this purpose, as we have done
for many other system calls.
Link: https://lore.kernel.org/lkml/20190116131527.2071570-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
[heiko.carstens@de.ibm.com: compile fix for !CONFIG_COMPAT]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The recent rework of alloc_descs() introduced a double increment of the
loop counter. As a consequence only every second affinity mask is
validated.
Remove it.
[ tglx: Massaged changelog ]
Fixes: c410abbbac ("genirq/affinity: Add is_managed to struct irq_affinity_desc")
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Fuxin Zhang <zhangfx@lemote.com>
Cc: Zhangjin Wu <wuzhangjin@gmail.com>
Cc: Huacai Chen <chenhuacai@gmail.com>
Cc: Dou Liyang <douliyangs@gmail.com>
Link: https://lkml.kernel.org/r/1547694009-16261-1-git-send-email-chenhc@lemote.com
Pull swiotlb fix from Konrad Rzeszutek Wilk:
"A tiny fix for v5.0-rc2:
This fixes an issue with GPU cards not working anymore with the DMA
mapping work Christopher did - as the SWIOTLB is initialized first and
then free'd (as IOMMU is available) but we forgot to clear our start
and end entries which are used and BOOM"
* 'stable/for-linus-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
swiotlb: clear io_tlb_start and io_tlb_end in swiotlb_exit
* make the reference from superblock to cgroup_root counting -
do cgroup_put() in cgroup_kill_sb() whether we'd done
percpu_ref_kill() or not; matching grab is done when we allocate
a new root. That gives the same refcounting rules for all callers
of cgroup_do_mount() - a reference to cgroup_root has been grabbed
by caller and it either is transferred to new superblock or dropped.
* have cgroup_kill_sb() treat an already killed refcount as "just
don't bother killing it, then".
* after successful cgroup_do_mount() have cgroup1_mount() recheck
if we'd raced with mount/umount from somebody else and cgroup_root
got killed. In that case we drop the superblock and bugger off
with -ERESTARTSYS, same as if we'd found it in the list already
dying.
* don't bother with delayed initialization of refcount - it's
unreliable and not needed. No need to prevent attempts to bump
the refcount if we find cgroup_root of another mount in progress -
sget will reuse an existing superblock just fine and if the
other sb manages to die before we get there, we'll catch
that immediately after cgroup_do_mount().
* don't bother with kernfs_pin_sb() - no need for doing that
either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
same story as with last May fixes in sysfs (7b745a4e40
"unfuck sysfs_mount()"); new_sb is left uninitialized
in case of early errors in kernfs_mount_ns() and papering
over it by treating any error from kernfs_mount_ns() as
equivalent to !new_ns ends up conflating the cases when
objects had never been transferred to a superblock with
ones when that has happened and resulting new superblock
had been dropped. Easily fixed (same way as in sysfs
case). Additionally, there's a superblock leak on
kernfs_node_dentry() failure *and* a dentry leak inside
kernfs_node_dentry() itself - the latter on probably
impossible errors, but the former not impossible to trigger
(as the matter of fact, injecting allocation failures
at that point *does* trigger it).
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When printing multiple uprobe arguments as strings the output for the
earlier arguments would also include all later string arguments.
This is best explained in an example:
Consider adding a uprobe to a function receiving two strings as
parameters which is at offset 0xa0 in strlib.so and we want to print
both parameters when the uprobe is hit (on x86_64):
$ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
/sys/kernel/debug/tracing/uprobe_events
When the function is called as func("foo", "bar") and we hit the probe,
the trace file shows a line like the following:
[...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
Note the extra "bar" printed as part of arg1. This behaviour stacks up
for additional string arguments.
The strings are stored in a dynamically growing part of the uprobe
buffer by fetch_store_string() after copying them from userspace via
strncpy_from_user(). The return value of strncpy_from_user() is then
directly used as the required size for the string. However, this does
not take the terminating null byte into account as the documentation
for strncpy_from_user() cleary states that it "[...] returns the
length of the string (not including the trailing NUL)" even though the
null byte will be copied to the destination.
Therefore, subsequent calls to fetch_store_string() will overwrite
the terminating null byte of the most recently fetched string with
the first character of the current string, leading to the
"accumulation" of strings in earlier arguments in the output.
Fix this by incrementing the return value of strncpy_from_user() by
one if we did not hit the maximum buffer size.
Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 5baaa59ef0 ("tracing/probes: Implement 'memory' fetch method for uprobes")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is a plan to build the kernel with -Wimplicit-fallthrough
and this place in the code produced a warnings (W=1).
This commit removes the following warning:
kernel/bpf/cgroup.c:719:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Initially in commit 69b693f0ae ("bpf: btf: Introduce BPF Type Format
(BTF)") the function 'btf_name_offset_valid' was introduced as static
function it was later on changed to a non-static one, and then finally
in commit 23127b33ec ("bpf: Create a new btf_name_by_offset() for
non type name use case") the function prototype was removed.
Revert back to original implementation and make the function static.
Remove warning triggered with W=1:
kernel/bpf/btf.c:470:6: warning: no previous prototype for 'btf_name_offset_valid' [-Wmissing-prototypes]
Fixes: 23127b33ec ("bpf: Create a new btf_name_by_offset() for non type name use case")
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset,
make sure that build_id field is empty. Since we are using percpu
free list, there is a possibility that we might reuse some previous
bpf_stack_build_id with non-zero build_id.
Fixes: 615755a77b ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Build-id length is not fixed to 20, it can be (`man ld` /--build-id):
* 128-bit (uuid)
* 160-bit (sha1)
* any length specified in ld --build-id=0xhexstring
To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids,
assume that build-id is somewhere in the range of 1 .. 20.
Set the remaining bytes to zero.
v2:
* don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)",
we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter
this 'if' condition
Fixes: 615755a77b ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The subsystem-specific message prefix for uprobes was also
"trace_kprobe: " instead of "trace_uprobe: " as described in
the original commit message.
Link: http://lkml.kernel.org/r/20190117133023.19292-1-andreas.ziegler@fau.de
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 7257634135 ("tracing/probe: Show subsystem name in messages")
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
An older GCC compiler complains:
kernel/bpf/verifier.c: In function 'bpf_check':
kernel/bpf/verifier.c:4***:13: error: 'prev_offset' may be used uninitialized
in this function [-Werror=maybe-uninitialized]
} else if (krecord[i].insn_offset <= prev_offset) {
^
kernel/bpf/verifier.c:4***:38: note: 'prev_offset' was declared here
u32 i, nfuncs, urec_size, min_size, prev_offset;
Although the compiler is wrong here, the patch makes sure
that prev_offset is always initialized, just to silence the warning.
v2: fix a spelling error in the commit message.
Signed-off-by: Peter Oskolkov <posk@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, btf only supports up to 64-bit integer.
On the other hand, 128bit support for gcc and clang
has existed for a long time. For example, both gcc 4.8
and llvm 3.7 supports types "__int128" and
"unsigned __int128" for virtually all 64bit architectures
including bpf.
The requirement for __int128 support comes from two areas:
. bpf program may use __int128. For example, some bcc tools
(https://github.com/iovisor/bcc/tree/master/tools),
mostly tcp v6 related, tcpstates.py, tcpaccept.py, etc.,
are using __int128 to represent the ipv6 addresses.
. linux itself is using __int128 types. Hence supporting
__int128 type in BTF is required for vmlinux BTF,
which will be used by "compile once and run everywhere"
and other projects.
For 128bit integer, instead of base-10, hex numbers are pretty
printed out as large decimal number is hard to decipher, e.g.,
for ipv6 addresses.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Otherwise is_swiotlb_buffer will return false positives when
we first initialize a swiotlb buffer, but then free it because
we have an IOMMU available.
Fixes: 55897af630 ("dma-direct: merge swiotlb_dma_ops into the dma_direct code")
Reported-by: Sibren Vasse <sibren@sibrenvasse.nl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sibren Vasse <sibren@sibrenvasse.nl>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
On the failure path, we do an fput() of the listener fd if the filter fails
to install (e.g. because of a TSYNC race that's lost, or if the thread is
killed, etc.). fput() doesn't actually release the fd, it just ads it to a
work queue. Then the thread proceeds to free the filter, even though the
listener struct file has a reference to it.
To fix this, on the failure path let's set the private data to null, so we
know in ->release() to ignore the filter.
Reported-by: syzbot+981c26489b2d1c6316ba@syzkaller.appspotmail.com
Fixes: 6a21cc50f0 ("seccomp: add a return code to trap to userspace")
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
It is possible to trigger a NULL pointer dereference by writing an
incorrectly formatted string to the krpobe_events file.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXD4L1RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsv7AP9nzekt94AQC2t5OQ38ph/nYGBjLc3T
yLqFMshqUSgyVAEAgFB88fvniwLOMFyAqbfRb0+4mq1SDeThBY7TtJBzSQI=
=+Pyh
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Andrea Righi fixed a NULL pointer dereference in trace_kprobe_create()
It is possible to trigger a NULL pointer dereference by writing an
incorrectly formatted string to the krpobe_events file"
* tag 'trace-v5.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create()
Pull networking fixes from David Miller:
1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur
Gautier.
2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei.
3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano
Brivio.
4) Use after free in hns driver, from Yonglong Liu.
5) icmp6_send() needs to handle the case of NULL skb, from Eric
Dumazet.
6) Missing rcu read lock in __inet6_bind() when operating on mapped
addresses, from David Ahern.
7) Memory leak in tipc-nl_compat_publ_dump(), from Gustavo A. R. Silva.
8) Fix PHY vs r8169 module loading ordering issues, from Heiner
Kallweit.
9) Fix bridge vlan memory leak, from Ido Schimmel.
10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe.
11) Infoleak in ipv6_local_error(), flow label isn't completely
initialized. From Eric Dumazet.
12) Handle mv88e6390 errata, from Andrew Lunn.
13) Making vhost/vsock CID hashing consistent, from Zha Bin.
14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo.
15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
bnxt_en: Fix context memory allocation.
bnxt_en: Fix ring checking logic on 57500 chips.
mISDN: hfcsusb: Use struct_size() in kzalloc()
net: clear skb->tstamp in bridge forwarding path
net: bpfilter: disallow to remove bpfilter module while being used
net: bpfilter: restart bpfilter_umh when error occurred
net: bpfilter: use cleanup callback to release umh_info
umh: add exit routine for UMH process
isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
vhost/vsock: fix vhost vsock cid hashing inconsistent
net: stmmac: Prevent RX starvation in stmmac_napi_poll()
net: stmmac: Fix the logic of checking if RX Watchdog must be enabled
net: stmmac: Check if CBS is supported before configuring
net: stmmac: dwxgmac2: Only clear interrupts that are active
net: stmmac: Fix PCI module removal leak
tools/bpf: fix bpftool map dump with bitfields
tools/bpf: test btf bitfield with >=256 struct member offset
bpf: fix bpffs bitfield pretty print
net: ethernet: mediatek: fix warning in phy_start_aneg
tcp: change txhash on SYN-data timeout
...
The recent commit which prevented a division by 0 issue in the alarm timer
code broke posix CPU timers as an unwanted side effect.
The reason is that the common rearm code checks for timer->it_interval
being 0 now. What went unnoticed is that the posix cpu timer setup does not
initialize timer->it_interval as it stores the interval in CPU timer
specific storage. The reason for the separate storage is historical as the
posix CPU timers always had a 64bit nanoseconds representation internally
while timer->it_interval is type ktime_t which used to be a modified
timespec representation on 32bit machines.
Instead of reverting the offending commit and fixing the alarmtimer issue
in the alarmtimer code, store the interval in timer->it_interval at CPU
timer setup time so the common code check works. This also repairs the
existing inconistency of the posix CPU timer code which kept a single shot
timer armed despite of the interval being 0.
The separate storage can be removed in mainline, but that needs to be a
separate commit as the current one has to be backported to stable kernels.
Fixes: 0e334db6bb ("posix-timers: Fix division by zero bug")
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de
If all CPUs in the irq_default_affinity mask are offline when an interrupt
is initialized then irq_setup_affinity() can set an empty affinity mask for
a newly allocated interrupt.
Fix this by falling back to cpu_online_mask in case the resulting affinity
mask is zero.
Signed-off-by: Srinivas Ramana <sramana@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1545312957-8504-1-git-send-email-sramana@codeaurora.org
UNAME26 is a mechanism to report Linux's version as 2.6.x, for
compatibility with old/broken software. Due to the way it is
implemented, it would have to be updated after 5.0, to keep the
resulting versions unique. Linus Torvalds argued:
"Do we actually need this?
I'd rather let it bitrot, and just let it return random versions. It
will just start again at 2.4.60, won't it?
Anybody who uses UNAME26 for a 5.x kernel might as well think it's
still 4.x. The user space is so old that it can't possibly care about
differences between 4.x and 5.x, can it?
The only thing that matters is that it shows "2.4.<largeenough>",
which it will do regardless"
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A UMH process which is created by the fork_usermode_blob() such as
bpfilter needs to release members of the umh_info when process is
terminated.
But the do_exit() does not release members of the umh_info. hence module
which uses UMH needs own code to detect whether UMH process is
terminated or not.
But this implementation needs extra code for checking the status of
UMH process. it eventually makes the code more complex.
The new PF_UMH flag is added and it is used to identify UMH processes.
The exit_umh() does not release members of the umh_info.
Hence umh_info->cleanup callback should release both members of the
umh_info and the private data.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9d5f9f701b ("bpf: btf: fix struct/union/fwd types
with kind_flag") introduced kind_flag and used bitfield_size
in the btf_member to directly pretty print member values.
The commit contained a bug where the incorrect parameters could be
passed to function btf_bitfield_seq_show(). The bits_offset
parameter in the function expects a value less than 8.
Instead, the member offset in the structure is passed.
The below is btf_bitfield_seq_show() func signature:
void btf_bitfield_seq_show(void *data, u8 bits_offset,
u8 nr_bits, struct seq_file *m)
both bits_offset and nr_bits are u8 type. If the bitfield
member offset is greater than 256, incorrect value will
be printed.
This patch fixed the issue by calculating correct proper
data offset and bits_offset similar to non kind_flag case.
Fixes: 9d5f9f701b ("bpf: btf: fix struct/union/fwd types with kind_flag")
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
As Naresh reported, test_stacktrace_build_id() causes panic on i386 and
arm32 systems. This is caused by page_address() returns NULL in certain
cases.
This patch fixes this error by using kmap_atomic/kunmap_atomic instead
of page_address.
Fixes: 615755a77b (" bpf: extend stackmap to save binary_build_id+offset instead of address")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge misc fixes from Andrew Morton:
"14 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, page_alloc: do not wake kswapd with zone lock held
hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race"
mm: page_mapped: don't assume compound page is huge or THP
mm/memory.c: initialise mmu_notifier_range correctly
tools/vm/page_owner: use page_owner_sort in the use example
kasan: fix krealloc handling for tag-based mode
kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY
kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning
mm, memcg: fix reclaim deadlock with writeback
mm/usercopy.c: no check page span for stack objects
slab: alien caches must not be initialized if the allocation of the alien cache failed
fork, memcg: fix cached_stacks case
zram: idle writeback fixes and cleanup
Commit 5eed6f1dff ("fork,memcg: fix crash in free_thread_stack on
memcg charge fail") fixes a crash caused due to failed memcg charge of
the kernel stack. However the fix misses the cached_stacks case which
this patch fixes. So, the same crash can happen if the memcg charge of
a cached stack is failed.
Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com
Fixes: 5eed6f1dff ("fork,memcg: fix crash in free_thread_stack on memcg charge fail")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.
Technically, we could record the start_time anytime during fork(2). But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space. For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).
By recording the start_time late, it much closer reflects the point in
time where the process becomes live and can be observed by other
processes.
Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned. Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded. This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.
Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- improve boolinit.cocci and use_after_iter.cocci semantic patches
- fix alignment for kallsyms
- move 'asm goto' compiler test to Kconfig and clean up jump_label
CONFIG option
- generate asm-generic wrappers automatically if arch does not implement
mandatory UAPI headers
- remove redundant generic-y defines
- misc cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJcMV5GAAoJED2LAQed4NsGs9gQAI/oGg8wJgk9a7+dJCX245W5
F4ReftnQd4AFptFCi9geJkr+sfViXNgwPLqlJxiXz8Qe8XP7z3LcArDw3FUzwvGn
bMSBiN9ggwWkOFgF523XesYgUVtcLpkNch/Migzf1Ac0FHk0G9o7gjcdsvAWHkUu
qFwtNcUB6PElRbhsHsh5qCY1/6HaAXgf/7O7wztnaKRe9myN6f2HzT4wANS9HHde
1e1r0LcIQeGWfG+3va3fZl6SDxSI/ybl244OcDmDyYl6RA1skSDlHbIBIFgUPoS0
cLyzoVj+GkfI1fRFEIfou+dj7lpukoAXHsggHo0M+ofqtbMF+VB2T3jvg4txanCP
TXzDc+04QUguK5yVnBfcnyC64Htrhnbq0eGy43kd1VZWAEGApl+680P8CRsWU3ZV
kOiFvZQ6RP/Ssw+a42yU3SHr31WD7feuQqHU65osQt4rdyL5wnrfU1vaUvJSkltF
cyPr9Kz/Ism0kPodhpFkuKxwtlKOw6/uwdCQoQHtxAPkvkcydhYx93x3iE0nxObS
CRMximiRyE12DOcv/3uv69n0JOPn6AsITcMNp8XryASYrR2/52txhGKGhvo3+Zoq
5pwc063JsuxJ/5/dcOw/erQar5d1eBRaBJyEWnXroxUjbsLPAznE+UIN8tmvyVly
SunlxNOXBdYeWN6t6S3H
=I+r6
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v4.21-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- improve boolinit.cocci and use_after_iter.cocci semantic patches
- fix alignment for kallsyms
- move 'asm goto' compiler test to Kconfig and clean up jump_label
CONFIG option
- generate asm-generic wrappers automatically if arch does not
implement mandatory UAPI headers
- remove redundant generic-y defines
- misc cleanups
* tag 'kbuild-v4.21-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: rename generated .*conf-cfg to *conf-cfg
kbuild: remove unnecessary stubs for archheader and archscripts
kbuild: use assignment instead of define ... endef for filechk_* rules
arch: remove redundant UAPI generic-y defines
kbuild: generate asm-generic wrappers if mandatory headers are missing
arch: remove stale comments "UAPI Header export list"
riscv: remove redundant kernel-space generic-y
kbuild: change filechk to surround the given command with { }
kbuild: remove redundant target cleaning on failure
kbuild: clean up rule_dtc_dt_yaml
kbuild: remove UIMAGE_IN and UIMAGE_OUT
jump_label: move 'asm goto' support test to Kconfig
kallsyms: lower alignment on ARM
scripts: coccinelle: boolinit: drop warnings on named constants
scripts: coccinelle: check for redeclaration
kconfig: remove unused "file" field of yylval union
nds32: remove redundant kernel-space generic-y
nios2: remove unneeded HAS_DMA define
Fix various regressions introduced in this cycles:
- fix dma-debug tracking for the map_page / map_single consolidatation
- properly stub out DMA mapping symbols for !HAS_DMA builds to avoid
link failures
- fix AMD Gart direct mappings
- setup the dma address for no kernel mappings using the remap
allocator
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlwyR9ULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPvOA/+L+32p2pm8o6NTgvtRvqsKNrbOm02fORLrhBqAiok
AcirFDxTfMuUWU2isr7E7WNqwEmUQ1nVUa+I0IJ/IJFfKdTggXcaTX1M19+62KWa
1LHpZLg1t2rl2yFQHgTrFKr5sz1PwUKZO8UbrYaYYgLgQkWDRzJs4E/tFNju8pMm
0Usexo/bkI5mreJBImMsFwAnuk0k3NT058XIeD+eNttKjcuz5kEH+bE/999vySW3
sOj9Peic/EFelOGb4ODxUIPjhiGFMv5dVusSAsFBH26iwQfX/tFSmXhrI5cnDewg
NlREennfyM+6uTH/DO+BlX7eGCRYbFc1GU5H9q4rRMXhEam6oc2AzVKuElJOVstZ
XVjP6zTwmuOh/5ff0NG6EPjA/OFcmlBEsmeWu4xSS8KsNILOkpUaPed/uWnA7O+2
mvU104NA5cHgVMgiGNM/4ilirkEZEFEHYhafH42bQxjMigm7ZHN14NtwM7StLTu6
QgyfPUcW/LmHj2scgvB1AZ+iQX0z7yJJMGifUxtz+eMCWCC7neOJ7JLvNnS9WI5w
9RwYaCOcDAZyAmCpbSADWxeG9cfsCDp8wmaGs3YVyhkDU8tCSqbxWJutvyDQnC17
GtZ0vYLTaJXBCq1L/FC0y8NCCGgvySPXYU7/ZYuOCzS4q2jvjwTWD3dKodvnS+mb
B0s=
=H9J6
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.21-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"Fix various regressions introduced in this cycles:
- fix dma-debug tracking for the map_page / map_single
consolidatation
- properly stub out DMA mapping symbols for !HAS_DMA builds to avoid
link failures
- fix AMD Gart direct mappings
- setup the dma address for no kernel mappings using the remap
allocator"
* tag 'dma-mapping-4.21-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING for remapped allocations
x86/amd_gart: fix unmapping of non-GART mappings
dma-mapping: remove a few unused exports
dma-mapping: properly stub out the DMA API for !CONFIG_HAS_DMA
dma-mapping: remove dmam_{declare,release}_coherent_memory
dma-mapping: implement dmam_alloc_coherent using dmam_alloc_attrs
dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs
While 979d63d50c ("bpf: prevent out of bounds speculation on pointer
arithmetic") took care of rejecting alu op on pointer when e.g. pointer
came from two different map values with different map properties such as
value size, Jann reported that a case was not covered yet when a given
alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
different branches where we would incorrectly try to sanitize based
on the pointer's limit. Catch this corner case and reject the program
instead.
Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
filechk_* rules often consist of multiple 'echo' lines. They must be
surrounded with { } or ( ) to work correctly. Otherwise, only the
string from the last 'echo' would be written into the target.
Let's take care of that in the 'filechk' in scripts/Kbuild.include
to clean up filechk_* rules.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
The jump label is controlled by HAVE_JUMP_LABEL, which is defined
like this:
#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
# define HAVE_JUMP_LABEL
#endif
We can improve this by testing 'asm goto' support in Kconfig, then
make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
match to the real kernel capability.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Merge more updates from Andrew Morton:
- procfs updates
- various misc bits
- lib/ updates
- epoll updates
- autofs
- fatfs
- a few more MM bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits)
mm/page_io.c: fix polled swap page in
checkpatch: add Co-developed-by to signature tags
docs: fix Co-Developed-by docs
drivers/base/platform.c: kmemleak ignore a known leak
fs: don't open code lru_to_page()
fs/: remove caller signal_pending branch predictions
mm/: remove caller signal_pending branch predictions
arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
kernel/sched/: remove caller signal_pending branch predictions
kernel/locking/mutex.c: remove caller signal_pending branch predictions
mm: select HAVE_MOVE_PMD on x86 for faster mremap
mm: speed up mremap by 20x on large regions
mm: treewide: remove unused address argument from pte_alloc functions
initramfs: cleanup incomplete rootfs
scripts/gdb: fix lx-version string output
kernel/kcov.c: mark write_comp_data() as notrace
kernel/sysctl: add panic_print into sysctl
panic: add options to print system info when panic happens
bfs: extra sanity checking and static inode bitmap
exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
...
We need to return a dma_addr_t even if we don't have a kernel mapping.
Do so by consolidating the phys_to_dma call in a single place and jump
to it from all the branches that return successfully.
Fixes: bfd56cd605 ("dma-mapping: support highmem in the generic remap allocator")
Reported-by: Liviu Dudau <liviu@dudau.co.uk
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Liviu Dudau <liviu@dudau.co.uk>
This is already done for us internally by the signal machinery.
Link: http://lkml.kernel.org/r/20181116002713.8474-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is already done for us internally by the signal machinery.
Link: http://lkml.kernel.org/r/20181116002713.8474-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since __sanitizer_cov_trace_const_cmp4 is marked as notrace, the
function called from __sanitizer_cov_trace_const_cmp4 shouldn't be
traceable either. ftrace_graph_caller() gets called every time func
write_comp_data() gets called if it isn't marked 'notrace'. This is the
backtrace from gdb:
#0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
#1 0xffffff8010201920 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
#2 0xffffff8010439714 in write_comp_data (type=5, arg1=0, arg2=0, ip=18446743524224276596) at ../kernel/kcov.c:116
#3 0xffffff8010439894 in __sanitizer_cov_trace_const_cmp4 (arg1=<optimized out>, arg2=<optimized out>) at ../kernel/kcov.c:188
#4 0xffffff8010201874 in prepare_ftrace_return (self_addr=18446743524226602768, parent=0xffffff801014b918, frame_pointer=18446743524223531344) at ./include/generated/atomic-instrumented.h:27
#5 0xffffff801020194c in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182
Rework so that write_comp_data() that are called from
__sanitizer_cov_trace_*_cmp*() are marked as 'notrace'.
Commit 903e8ff867 ("kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace")
missed to mark write_comp_data() as 'notrace'. When that patch was
created gcc-7 was used. In lib/Kconfig.debug
config KCOV_ENABLE_COMPARISONS
depends on $(cc-option,-fsanitize-coverage=trace-cmp)
That code path isn't hit with gcc-7. However, it were that with gcc-8.
Link: http://lkml.kernel.org/r/20181206143011.23719-1-anders.roxell@linaro.org
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
So that we can also runtime chose to print out the needed system info
for panic, other than setting the kernel cmdline.
Link: http://lkml.kernel.org/r/1543398842-19295-3-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel panic issues are always painful to debug, partially because it's
not easy to get enough information of the context when panic happens.
And we have ramoops and kdump for that, while this commit tries to
provide a easier way to show the system info by adding a cmdline
parameter, referring some idea from sysrq handler.
Link: http://lkml.kernel.org/r/1543398842-19295-2-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We get a warning when building kernel with W=1:
kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes]
kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes]
Add the missing declaration in head file to fix this.
Also, remove arch_release_thread_stack() completely because no arch
seems to implement it since bb9d81264 (arch: remove tile port).
Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
check_hung_uninterruptible_tasks() is currently calling rcu_lock_break()
for every 1024 threads. But check_hung_task() is very slow if printk()
was called, and is very fast otherwise.
If many threads within some 1024 threads called printk(), the RCU grace
period might be extended enough to trigger RCU stall warnings.
Therefore, calling rcu_lock_break() for every some fixed jiffies will be
safer.
Link: http://lkml.kernel.org/r/1544800658-11423-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on commit 401c636a0e ("kernel/hung_task.c: show all hung tasks
before panic"), we could get the call stack of hung task.
However, if the console loglevel is not high, we still can not see the
useful panic information in practice, and in most cases users don't set
console loglevel to high level.
This patch is to force console verbose before system panic, so that the
real useful information can be seen in the console, instead of being
like the following, which doesn't have hung task information.
INFO: task init:1 blocked for more than 120 seconds.
Tainted: G U W 4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Kernel panic - not syncing: hung_task: blocked tasks
CPU: 2 PID: 479 Comm: khungtaskd Tainted: G U W 4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
Call Trace:
dump_stack+0x4f/0x65
panic+0xde/0x231
watchdog+0x290/0x410
kthread+0x12c/0x150
ret_from_fork+0x35/0x40
reboot: panic mode set: p,w
Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A6015B675@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the number of input parameters is less than the total parameters, an
EINVAL error will be returned.
For example, we use proc_doulongvec_minmax to pass up to two parameters
with kern_table:
{
.procname = "monitor_signals",
.data = &monitor_sigs,
.maxlen = 2*sizeof(unsigned long),
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
Reproduce:
When passing two parameters, it's work normal. But passing only one
parameter, an error "Invalid argument"(EINVAL) is returned.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
-bash: echo: write error: Invalid argument
[root@cl150 ~]# echo $?
1
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
The following is the result after apply this patch. No error is
returned when the number of input parameters is less than the total
parameters.
[root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
1 2
[root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
[root@cl150 ~]# echo $?
0
[root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
3 2
[root@cl150 ~]#
There are three processing functions dealing with digital parameters,
__do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.
This patch deals with __do_proc_doulongvec_minmax, just as
__do_proc_dointvec does, adding a check for parameters 'left'. In
__do_proc_douintvec, its code implementation explicitly does not support
multiple inputs.
static int __do_proc_douintvec(...){
...
/*
* Arrays are not supported, keep this simple. *Do not* add
* support for them.
*/
if (vleft != 1) {
*lenp = 0;
return -EINVAL;
}
...
}
So, just __do_proc_doulongvec_minmax has the problem. And most use of
proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one
parameter.
Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Originally, the rule used to be that you'd have to do access_ok()
separately, and then user_access_begin() before actually doing the
direct (optimized) user access.
But experience has shown that people then decide not to do access_ok()
at all, and instead rely on it being implied by other operations or
similar. Which makes it very hard to verify that the access has
actually been range-checked.
If you use the unsafe direct user accesses, hardware features (either
SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
Access Never - on ARM) do force you to use user_access_begin(). But
nothing really forces the range check.
By putting the range check into user_access_begin(), we actually force
people to do the right thing (tm), and the range check vill be visible
near the actual accesses. We have way too long a history of people
trying to avoid them.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the slow path DMA API calls are implemented out of line a few
helpers only used by them don't need to be exported anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
dmam_alloc_coherent is just the default no-flags case of
dmam_alloc_attrs, so take advantage of this similar to the non-managed
version.
Signed-off-by: Christoph Hellwig <hch@lst.de>
And also switch the way we implement the unmap side around to stay
consistent. This ensures dma-debug works again because it records which
function we used for mapping to ensure it is also used for unmapping,
and also reduces further code duplication. Last but not least this
also officially allows calling dma_sync_single_* for mappings created
using dma_map_page, which is perfectly fine given that the sync calls
only take a dma_addr_t, but not a virtual address or struct page.
Fixes: 7f0fee242e ("dma-mapping: merge dma_unmap_page_attrs and dma_unmap_single_attrs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: LABBE Corentin <clabbe.montjoie@gmail.com>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
"Several fixes here. Basically split down the line between newly
introduced regressions and long existing problems:
1) Double free in tipc_enable_bearer(), from Cong Wang.
2) Many fixes to nf_conncount, from Florian Westphal.
3) op->get_regs_len() can throw an error, check it, from Yunsheng
Lin.
4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman
driver, from Scott Wood.
5) Inifnite loop in fib_empty_table(), from Yue Haibing.
6) Use after free in ax25_fillin_cb(), from Cong Wang.
7) Fix socket locking in nr_find_socket(), also from Cong Wang.
8) Fix WoL wakeup enable in r8169, from Heiner Kallweit.
9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani.
10) Fix ptr_ring wrap during queue swap, from Cong Wang.
11) Missing shutdown callback in hinic driver, from Xue Chaojing.
12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano
Brivio.
13) BPF out of bounds speculation fixes from Daniel Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
ipv6: Consider sk_bound_dev_if when binding a socket to an address
ipv6: Fix dump of specific table with strict checking
bpf: add various test cases to selftests
bpf: prevent out of bounds speculation on pointer arithmetic
bpf: fix check_map_access smin_value test when pointer contains offset
bpf: restrict unknown scalars of mixed signed bounds for unprivileged
bpf: restrict stack pointer arithmetic for unprivileged
bpf: restrict map value pointer arithmetic for unprivileged
bpf: enable access to ax register also from verifier rewrite
bpf: move tmp variable into ax register in interpreter
bpf: move {prev_,}insn_idx into verifier env
isdn: fix kernel-infoleak in capi_unlocked_ioctl
ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
net/hamradio/6pack: use mod_timer() to rearm timers
net-next/hinic:add shutdown callback
net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT
ip: validate header length on virtual device xmit
tap: call skb_probe_transport_header after setting skb->dev
ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
net: rds: remove unnecessary NULL check
...
Note that there is a conflict with the rdma tree in this pull request, since
we delete a file that has been changed in the rdma tree. Hopefully that's
easy enough to resolve!
We also were unable to track down a maintainer for Neil Brown's changes to
the generic cred code that are prerequisites to his RPC cred cleanup patches.
We've been asking around for several months without any response, so
hopefully it's okay to include those patches in this pull request.
Stable bugfixes:
- xprtrdma: Yet another double DMA-unmap # v4.20
Features:
- Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
- Per-xprt rdma receive workqueues
- Drop support for FMR memory registration
- Make port= mount option optional for RDMA mounts
Other bugfixes and cleanups:
- Remove unused nfs4_xdev_fs_type declaration
- Fix comments for behavior that has changed
- Remove generic RPC credentials by switching to 'struct cred'
- Fix crossing mountpoints with different auth flavors
- Various xprtrdma fixes from testing and auditing the close code
- Fixes for disconnect issues when using xprtrdma with krb5
- Clean up and improve xprtrdma trace points
- Fix NFS v4.2 async copy reboot recovery
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlwtO50ACgkQ18tUv7Cl
QOtZWQ//e5Hhp2TnQZ6U+99YKedjwBHP6psH3GKSEdeHSNdlSpZ5ckgHxvMb9TBa
6t4ecgv5P/uYLIePQ0u2ubUFc9+TlyGi7Iacx13/YhK7kihGHDPnZhfl0QbYixV7
rwa9bFcKmOrXs8ld+Hw3P2UL22G1gMf/LHDhPNshbW7LFZmcshKz+mKTk70kwkq9
v7tFC59p6GwV8Sr2YI2NXn2fOWsUS00sQfgj2jceJYJ8PsNa+wHYF4wPj2IY5NsE
D5Oq2kLPbytBhCllOHgopNZaf4qb5BfqhVETyc1O+kDF3BZKUhQ1PoDi2FPinaHM
5/d8hS+5fr3eMBsQrPWQLXYjWQFUXnkQQJvU3Bo52AIgomsk/8uBq3FvH7XmFcBd
C8sgnuUAkAS8feMes8GCS50BTxclnGuYGdyFJyCRXoG9Kn9rMrw9EKitky6EVq0v
NmXhW79jK84a3yDXVlAIpZ8Y9BU/HQ3GviGX8lQEdZU9YiYRzDIHvpMFwzMgqaBi
XvLbr8PlLOm8GZokThS8QYT/G2Wu6IwfUq/AufVjVD4+HiL3duKKfWSGAvcm6aAa
GoRF6UG+OmjWlzKojtRc1dI+sy22Fzh+DW+Mx6tuf/b/66wkmYnW7eKcV4rt6Tm5
/JEhvTMo9q7elL/4FgCoMCcdoc5eXqQyXRXrQiOU7YHLzn2aWU0=
=DvVW
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- xprtrdma: Yet another double DMA-unmap # v4.20
Features:
- Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
- Per-xprt rdma receive workqueues
- Drop support for FMR memory registration
- Make port= mount option optional for RDMA mounts
Other bugfixes and cleanups:
- Remove unused nfs4_xdev_fs_type declaration
- Fix comments for behavior that has changed
- Remove generic RPC credentials by switching to 'struct cred'
- Fix crossing mountpoints with different auth flavors
- Various xprtrdma fixes from testing and auditing the close code
- Fixes for disconnect issues when using xprtrdma with krb5
- Clean up and improve xprtrdma trace points
- Fix NFS v4.2 async copy reboot recovery"
* tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits)
sunrpc: convert to DEFINE_SHOW_ATTRIBUTE
sunrpc: Add xprt after nfs4_test_session_trunk()
sunrpc: convert unnecessary GFP_ATOMIC to GFP_NOFS
sunrpc: handle ENOMEM in rpcb_getport_async
NFS: remove unnecessary test for IS_ERR(cred)
xprtrdma: Prevent leak of rpcrdma_rep objects
NFSv4.2 fix async copy reboot recovery
xprtrdma: Don't leak freed MRs
xprtrdma: Add documenting comment for rpcrdma_buffer_destroy
xprtrdma: Replace outdated comment for rpcrdma_ep_post
xprtrdma: Update comments in frwr_op_send
SUNRPC: Fix some kernel doc complaints
SUNRPC: Simplify defining common RPC trace events
NFS: Fix NFSv4 symbolic trace point output
xprtrdma: Trace mapping, alloc, and dereg failures
xprtrdma: Add trace points for calls to transport switch methods
xprtrdma: Relocate the xprtrdma_mr_map trace points
xprtrdma: Clean up of xprtrdma chunk trace points
xprtrdma: Remove unused fields from rpcrdma_ia
xprtrdma: Cull dprintk() call sites
...
Jann reported that the original commit back in b2157399cc
("bpf: prevent out-of-bounds speculation") was not sufficient
to stop CPU from speculating out of bounds memory access:
While b2157399cc only focussed on masking array map access
for unprivileged users for tail calls and data access such
that the user provided index gets sanitized from BPF program
and syscall side, there is still a more generic form affected
from BPF programs that applies to most maps that hold user
data in relation to dynamic map access when dealing with
unknown scalars or "slow" known scalars as access offset, for
example:
- Load a map value pointer into R6
- Load an index into R7
- Do a slow computation (e.g. with a memory dependency) that
loads a limit into R8 (e.g. load the limit from a map for
high latency, then mask it to make the verifier happy)
- Exit if R7 >= R8 (mispredicted branch)
- Load R0 = R6[R7]
- Load R0 = R6[R0]
For unknown scalars there are two options in the BPF verifier
where we could derive knowledge from in order to guarantee
safe access to the memory: i) While </>/<=/>= variants won't
allow to derive any lower or upper bounds from the unknown
scalar where it would be safe to add it to the map value
pointer, it is possible through ==/!= test however. ii) another
option is to transform the unknown scalar into a known scalar,
for example, through ALU ops combination such as R &= <imm>
followed by R |= <imm> or any similar combination where the
original information from the unknown scalar would be destroyed
entirely leaving R with a constant. The initial slow load still
precedes the latter ALU ops on that register, so the CPU
executes speculatively from that point. Once we have the known
scalar, any compare operation would work then. A third option
only involving registers with known scalars could be crafted
as described in [0] where a CPU port (e.g. Slow Int unit)
would be filled with many dependent computations such that
the subsequent condition depending on its outcome has to wait
for evaluation on its execution port and thereby executing
speculatively if the speculated code can be scheduled on a
different execution port, or any other form of mistraining
as described in [1], for example. Given this is not limited
to only unknown scalars, not only map but also stack access
is affected since both is accessible for unprivileged users
and could potentially be used for out of bounds access under
speculation.
In order to prevent any of these cases, the verifier is now
sanitizing pointer arithmetic on the offset such that any
out of bounds speculation would be masked in a way where the
pointer arithmetic result in the destination register will
stay unchanged, meaning offset masked into zero similar as
in array_index_nospec() case. With regards to implementation,
there are three options that were considered: i) new insn
for sanitation, ii) push/pop insn and sanitation as inlined
BPF, iii) reuse of ax register and sanitation as inlined BPF.
Option i) has the downside that we end up using from reserved
bits in the opcode space, but also that we would require
each JIT to emit masking as native arch opcodes meaning
mitigation would have slow adoption till everyone implements
it eventually which is counter-productive. Option ii) and iii)
have both in common that a temporary register is needed in
order to implement the sanitation as inlined BPF since we
are not allowed to modify the source register. While a push /
pop insn in ii) would be useful to have in any case, it
requires once again that every JIT needs to implement it
first. While possible, amount of changes needed would also
be unsuitable for a -stable patch. Therefore, the path which
has fewer changes, less BPF instructions for the mitigation
and does not require anything to be changed in the JITs is
option iii) which this work is pursuing. The ax register is
already mapped to a register in all JITs (modulo arm32 where
it's mapped to stack as various other BPF registers there)
and used in constant blinding for JITs-only so far. It can
be reused for verifier rewrites under certain constraints.
The interpreter's tmp "register" has therefore been remapped
into extending the register set with hidden ax register and
reusing that for a number of instructions that needed the
prior temporary variable internally (e.g. div, mod). This
allows for zero increase in stack space usage in the interpreter,
and enables (restricted) generic use in rewrites otherwise as
long as such a patchlet does not make use of these instructions.
The sanitation mask is dynamic and relative to the offset the
map value or stack pointer currently holds.
There are various cases that need to be taken under consideration
for the masking, e.g. such operation could look as follows:
ptr += val or val += ptr or ptr -= val. Thus, the value to be
sanitized could reside either in source or in destination
register, and the limit is different depending on whether
the ALU op is addition or subtraction and depending on the
current known and bounded offset. The limit is derived as
follows: limit := max_value_size - (smin_value + off). For
subtraction: limit := umax_value + off. This holds because
we do not allow any pointer arithmetic that would
temporarily go out of bounds or would have an unknown
value with mixed signed bounds where it is unclear at
verification time whether the actual runtime value would
be either negative or positive. For example, we have a
derived map pointer value with constant offset and bounded
one, so limit based on smin_value works because the verifier
requires that statically analyzed arithmetic on the pointer
must be in bounds, and thus it checks if resulting
smin_value + off and umax_value + off is still within map
value bounds at time of arithmetic in addition to time of
access. Similarly, for the case of stack access we derive
the limit as follows: MAX_BPF_STACK + off for subtraction
and -off for the case of addition where off := ptr_reg->off +
ptr_reg->var_off.value. Subtraction is a special case for
the masking which can be in form of ptr += -val, ptr -= -val,
or ptr -= val. In the first two cases where we know that
the value is negative, we need to temporarily negate the
value in order to do the sanitation on a positive value
where we later swap the ALU op, and restore original source
register if the value was in source.
The sanitation of pointer arithmetic alone is still not fully
sufficient as is, since a scenario like the following could
happen ...
PTR += 0x1000 (e.g. K-based imm)
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
PTR += 0x1000
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
[...]
... which under speculation could end up as ...
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
[...]
... and therefore still access out of bounds. To prevent such
case, the verifier is also analyzing safety for potential out
of bounds access under speculative execution. Meaning, it is
also simulating pointer access under truncation. We therefore
"branch off" and push the current verification state after the
ALU operation with known 0 to the verification stack for later
analysis. Given the current path analysis succeeded it is
likely that the one under speculation can be pruned. In any
case, it is also subject to existing complexity limits and
therefore anything beyond this point will be rejected. In
terms of pruning, it needs to be ensured that the verification
state from speculative execution simulation must never prune
a non-speculative execution path, therefore, we mark verifier
state accordingly at the time of push_stack(). If verifier
detects out of bounds access under speculative execution from
one of the possible paths that includes a truncation, it will
reject such program.
Given we mask every reg-based pointer arithmetic for
unprivileged programs, we've been looking into how it could
affect real-world programs in terms of size increase. As the
majority of programs are targeted for privileged-only use
case, we've unconditionally enabled masking (with its alu
restrictions on top of it) for privileged programs for the
sake of testing in order to check i) whether they get rejected
in its current form, and ii) by how much the number of
instructions and size will increase. We've tested this by
using Katran, Cilium and test_l4lb from the kernel selftests.
For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
we've used test_l4lb.o as well as test_l4lb_noinline.o. We
found that none of the programs got rejected by the verifier
with this change, and that impact is rather minimal to none.
balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
7,797 bytes JITed before and after the change. Most complex
program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
and 18,538 bytes JITed before and after and none of the other
tail call programs in bpf_lxc.o had any changes either. For
the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
after the change. Other programs from that object file had
similar small increase. Both test_l4lb.o had no change and
remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
JITed and for test_l4lb_noinline.o constant at 5,080 bytes
(634 insns) xlated and 3,313 bytes JITed. This can be explained
in that LLVM typically optimizes stack based pointer arithmetic
by using K-based operations and that use of dynamic map access
is not overly frequent. However, in future we may decide to
optimize the algorithm further under known guarantees from
branch and value speculation. Latter seems also unclear in
terms of prediction heuristics that today's CPUs apply as well
as whether there could be collisions in e.g. the predictor's
Value History/Pattern Table for triggering out of bounds access,
thus masking is performed unconditionally at this point but could
be subject to relaxation later on. We were generally also
brainstorming various other approaches for mitigation, but the
blocker was always lack of available registers at runtime and/or
overhead for runtime tracking of limits belonging to a specific
pointer. Thus, we found this to be minimally intrusive under
given constraints.
With that in place, a simple example with sanitized access on
unprivileged load at post-verification time looks as follows:
# bpftool prog dump xlated id 282
[...]
28: (79) r1 = *(u64 *)(r7 +0)
29: (79) r2 = *(u64 *)(r7 +8)
30: (57) r1 &= 15
31: (79) r3 = *(u64 *)(r0 +4608)
32: (57) r3 &= 1
33: (47) r3 |= 1
34: (2d) if r2 > r3 goto pc+19
35: (b4) (u32) r11 = (u32) 20479 |
36: (1f) r11 -= r2 | Dynamic sanitation for pointer
37: (4f) r11 |= r2 | arithmetic with registers
38: (87) r11 = -r11 | containing bounded or known
39: (c7) r11 s>>= 63 | scalars in order to prevent
40: (5f) r11 &= r2 | out of bounds speculation.
41: (0f) r4 += r11 |
42: (71) r4 = *(u8 *)(r4 +0)
43: (6f) r4 <<= r1
[...]
For the case where the scalar sits in the destination register
as opposed to the source register, the following code is emitted
for the above example:
[...]
16: (b4) (u32) r11 = (u32) 20479
17: (1f) r11 -= r2
18: (4f) r11 |= r2
19: (87) r11 = -r11
20: (c7) r11 s>>= 63
21: (5f) r2 &= r11
22: (0f) r2 += r0
23: (61) r0 = *(u32 *)(r2 +0)
[...]
JIT blinding example with non-conflicting use of r10:
[...]
d5: je 0x0000000000000106 _
d7: mov 0x0(%rax),%edi |
da: mov $0xf153246,%r10d | Index load from map value and
e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f.
e7: and %r10,%rdi |_
ea: mov $0x2f,%r10d |
f0: sub %rdi,%r10 | Sanitized addition. Both use r10
f3: or %rdi,%r10 | but do not interfere with each
f6: neg %r10 | other. (Neither do these instructions
f9: sar $0x3f,%r10 | interfere with the use of ax as temp
fd: and %r10,%rdi | in interpreter.)
100: add %rax,%rdi |_
103: mov 0x0(%rdi),%eax
[...]
Tested that it fixes Jann's reproducer, and also checked that test_verifier
and test_progs suite with interpreter, JIT and JIT with hardening enabled
on x86-64 and arm64 runs successfully.
[0] Speculose: Analyzing the Security Implications of Speculative
Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
https://arxiv.org/pdf/1801.04084.pdf
[1] A Systematic Evaluation of Transient Execution Attacks and
Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
Dmitry Evtyushkin, Daniel Gruss,
https://arxiv.org/pdf/1811.05441.pdf
Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>