In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but
without checking the return value. bpf_prog_add() can fail since 92117d8443
("bpf: fix refcnt overflow"), so we really must check it. Take the reference
right when we assign it to the rq from priv->xdp_prog, and just drop the
reference on error path. Destruction in mlx5e_destroy_rq() looks good, though.
Fixes: 86994156c7 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The token table passed into match_token() must be null-terminated, which
it currently is not in the perf's address filter string parser, as caught
by Vince's perf_fuzzer and KASAN.
It doesn't blow up otherwise because of the alignment padding of the table
to the next element in the .rodata, which is luck.
Fixing by adding a null-terminator to the token table.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dvyukov@google.com
Cc: stable@vger.kernel.org # v4.7+
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the wake_q data structure is defined by the WAKE_Q() macro.
This macro, however, looks like a function doing something as "wake" is
a verb. Even checkpatch.pl was confused as it reported warnings like
WARNING: Missing a blank line after declarations
#548: FILE: kernel/futex.c:3665:
+ int ret;
+ WAKE_Q(wake_q);
This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
what the macro is doing and eliminates the checkpatch.pl warnings.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
[ Resolved conflict and added missing rename. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reduce the size of data structure for lockdep entries by half if
PROVE_LOCKING_SMALL if defined. This is used only for sparc.
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent conversion of the console hotplug notifier to the state machine
missed the fact, that the notifier only operated on the non frozen
transitions. As a consequence the console_lock/unlock() pair is also
invoked during suspend, which results in a lockdep warning.
Restore the previous state by making the lock/unlock conditional on
!tasks_frozen.
Fixes: 90b14889d2 ("kernel/printk: Convert to hotplug state machine")
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611171729320.3645@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Execute the irq-work specific initialization/exit code only when the
fast path isn't available.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If slow path frequency changes are conducted in a SCHED_OTHER context
then they may be delayed for some amount of time, including
indefinitely, when real time or deadline activity is taking place.
Move the slow path to a real time kernel thread. In the future the
thread should be made SCHED_DEADLINE. The RT priority is arbitrarily set
to 50 for now.
Hackbench results on ARM Exynos, dual core A15 platform for 10
iterations:
$ hackbench -s 100 -l 100 -g 10 -f 20
Before After
---------------------------------
1.808 1.603
1.847 1.251
2.229 1.590
1.952 1.600
1.947 1.257
1.925 1.627
2.694 1.620
1.258 1.621
1.919 1.632
1.250 1.240
Average:
1.8829 1.5041
Based on initial work by Steve Muckle.
Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The fast_switch_enabled flag will be used by both sugov_policy_alloc()
and sugov_policy_free() with a later patch.
Prepare for that by moving the calls to enable and disable it to the
beginning of sugov_init() and end of sugov_exit().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Switch to the more common practice of writing labels.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
invalid accesses to bpf map entries. Fix this up by doing a few things
1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real
life and just adds extra complexity.
2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
minimum value to 0 for positive AND's.
3) Don't do operations on the ranges if they are set to the limits, as they are
by definition undefined, and allowing arithmetic operations on those values
could make them appear valid when they really aren't.
This fixes the testcase provided by Jann as well as a few other theoretical
problems.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reserved vectors at the beginning and the end of the vector space get
cpu_possible_mask assigned as their affinity mask.
All other non-auto affine interrupts get the default irq affinity mask
assigned. Using cpu_possible_mask breaks that rule.
Treat them like any other interrupt and use irq_default_affinity as target
mask.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
The recent addition of reserved vectors at the beginning or the end of the
vector space did not take the reserved vectors at the beginning into
account for the various loop exit conditions. As a consequence the last
vectors of the spread area are not included into the spread algorithm and
are treated like the reserved vectors at the end of the vector space and
get the default affinity mask assigned.
Sum up the affinity vectors and the reserved vectors at the beginning and
use the sum as exit condition.
[ tglx: Fixed all conditions instead of only one and massaged changelog ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/1479201178-29604-2-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
gcc-6.2.1 gives the following warning:
kernel/bpf/bpf_lru_list.c: In function ‘__bpf_lru_list_rotate_inactive.isra.3’:
kernel/bpf/bpf_lru_list.c:201:28: warning: ‘next’ may be used uninitialized in this function [-Wmaybe-uninitialized]
The "next" is currently initialized in the while() loop which must have >=1
iterations.
This patch initializes next to get rid of the compiler warning.
Fixes: 3a08c2fd76 ("bpf: LRU List")
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
propagation flag.
During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.
The propagation relies on patch:
"sched: Fix hierarchical order in rq->leaf_cfs_rq_list"
... which orders children and parents, to ensure that it's done in one pass.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.
For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.
For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Every time we modify load/utilization of sched_entity, we start to
sync it with its cfs_rq. This update is done in different ways:
- when attaching/detaching a sched_entity, we update cfs_rq and then
we sync the entity with the cfs_rq.
- when enqueueing/dequeuing the sched_entity, we update both
sched_entity and cfs_rq metrics to now.
Use update_load_avg() everytime we have to update and sync cfs_rq and
sched_entity before changing the state of a sched_enity.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
commit:
67e86250f8 ("sched: Introduce hierarchal order on shares update list")
With the current implementation a child can be still put after its
parent.
Lets take the example of:
root
\
b
/\
c d*
|
e*
with root -> b -> c already enqueued but not d -> e so the
leaf_cfs_rq_list looks like: head -> c -> b -> root -> tail
The branch d -> e will be added the first time that they are enqueued,
starting with e then d.
When e is added, its parents is not already on the list so e is put at
the tail : head -> c -> b -> root -> e -> tail
Then, d is added at the head because its parent is already on the
list: head -> d -> c -> b -> root -> e -> tail
e is not placed at the right position and will be called the last
whereas it should be called at the beginning.
Because it follows the bottom-up enqueue sequence, we are sure that we
will finished to add either a cfs_rq without parent or a cfs_rq with a
parent that is already on the list. We can use this event to detect
when we have finished to add a new branch. For the others, whose
parents are not already added, we have to ensure that they will be
added after their children that have just been inserted the steps
before, and after any potential parents that are already in the list.
The easiest way is to put the cfs_rq just after the last inserted one
and to keep track of it untl the branch is fully added.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.
This patch rejects higher CPU capacity groups with one or less task per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.
Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).
But even the average may not be sufficient if the group covers CPUs of
different capacities.
Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.
In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).
To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For s390 kernel builds I keep getting this warning:
kernel/sched/cpuacct.c: In function 'cpuacct_stats_show':
kernel/sched/cpuacct.c:298:25: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'clock_t {aka long int}' [-Wformat=]
seq_printf(sf, "%s %lld\n",
Silence the warning by adding an explicit cast.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161111142749.6545-1-schwidefsky@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the s390 special case of a yielding cpu_relax() implementation gone,
we can now remove all users of cpu_relax_lowlatency() and replace them
with cpu_relax().
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1477386195-32736-5-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some time ago the following commit:
57f2ffe14f ("s390: remove diag 44 calls from cpu_relax()")
... stopped cpu_relax() on s390 yielding to the hypervisor.
As it turns out this made stop_machine() run really slow on virtualized
overcommited systems. For example the kprobes test during bootup took
several seconds instead of just running unnoticed with large guests.
Therefore, yielding was reintroduced with commit:
4d92f50249 ("s390: reintroduce diag 44 calls for cpu_relax()")
... but in fact the stop machine code seems to be the only place where
this yielding was really necessary. This place is probably the most
important one as it makes all but one guest CPUs wait for one guest CPU.
As we now have cpu_relax_yield(), we can use this in multi_cpu_stop().
For now lets only add it here. We can add it later in other places
when necessary.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1477386195-32736-3-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some embedded systems have no use for them. This removes about
25KB from the kernel binary size when configured out.
Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.
The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no logical relation between add_device_randomness() and
posix_cpu_timers_exit(). Let's move the former to where the later
is called. This way, when posix-cpu-timers.c is compiled out, there
is no need to worry about not losing a call to add_device_randomness().
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-6-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the only user of alarm_setitimer to itimer.c where it is defined.
This allows for making alarm_setitimer static, and dropping it from the
build when __ARCH_WANT_SYS_ALARM is not defined.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-5-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently the global_ops filtering hash is not available to outside users
registering for function tracing. Provide an API for those users to be
able to choose global filtering.
This is in preparation for pstore's ftrace feature to be able to
use the global filters.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Provide a LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a LRU version of the existing BPF_MAP_TYPE_HASH.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the codes that populate the value
of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
typed bpf_map.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having a common LRU list, this patch allows a
percpu LRU list which can be selected by specifying a map
attribute. The map attribute will be added in the later
patch.
While the common use case for LRU is #reads >> #updates,
percpu LRU list allows bpf prog to absorb unusual #updates
under pathological case (e.g. external traffic facing machine which
could be under attack).
Each percpu LRU is isolated from each other. The LRU nodes (including
free nodes) cannot be moved across different LRU Lists.
Here are the update performance comparison between
common LRU list and percpu LRU list (the test code is
at the last patch):
[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
1 cpus: 2934082 updates
4 cpus: 7391434 updates
8 cpus: 6500576 updates
[root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
./map_perf_test 32 $i | awk '{r += $3}END{printr " updates"}'; done
1 cpus: 2896553 updates
4 cpus: 9766395 updates
8 cpus: 17460553 updates
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce bpf_lru_list which will provide LRU capability to
the bpf_htab in the later patch.
* General Thoughts:
1. Target use case. Read is more often than update.
(i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
If bpf_prog does a bpf_lookup_elem() first and then an in-place
update, it still counts as a read operation to the LRU list concern.
2. It may be useful to think of it as a LRU cache
3. Optimize the read case
3.1 No lock in read case
3.2 The LRU maintenance is only done during bpf_update_elem()
4. If there is a percpu LRU list, it will lose the system-wise LRU
property. A completely isolated percpu LRU list has the best
performance but the memory utilization is not ideal considering
the work load may be imbalance.
5. Hence, this patch starts the LRU implementation with a global LRU
list with batched operations before accessing the global LRU list.
As a LRU cache, #read >> #update/#insert operations, it will work well.
6. There is a local list (for each cpu) which is named
'struct bpf_lru_locallist'. This local list is not used to sort
the LRU property. Instead, the local list is to batch enough
operations before acquiring the lock of the global LRU list. More
details on this later.
7. In the later patch, it allows a percpu LRU list by specifying a
map-attribute for scalability reason and for use cases that need to
prepare for the worst (and pathological) case like DoS attack.
The percpu LRU list is completely isolated from each other and the
LRU nodes (including free nodes) cannot be moved across the list. The
following description is for the global LRU list but mostly applicable
to the percpu LRU list also.
* Global LRU List:
1. It has three sub-lists: active-list, inactive-list and free-list.
2. The two list idea, active and inactive, is borrowed from the
page cache.
3. All nodes are pre-allocated and all sit at the free-list (of the
global LRU list) at the beginning. The pre-allocation reasoning
is similar to the existing BPF_MAP_TYPE_HASH. However,
opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
the LRU map.
* Active/Inactive List (of the global LRU list):
1. The active list, as its name says it, maintains the active set of
the nodes. We can think of it as the working set or more frequently
accessed nodes. The access frequency is approximated by a ref-bit.
The ref-bit is set during the bpf_lookup_elem().
2. The inactive list, as its name also says it, maintains a less
active set of nodes. They are the candidates to be removed
from the bpf_htab when we are running out of free nodes.
3. The ordering of these two lists is acting as a rough clock.
The tail of the inactive list is the older nodes and
should be released first if the bpf_htab needs free element.
* Rotating the Active/Inactive List (of the global LRU list):
1. It is the basic operation to maintain the LRU property of
the global list.
2. The active list is only rotated when the inactive list is running
low. This idea is similar to the current page cache.
Inactive running low is currently defined as
"# of inactive < # of active".
3. The active list rotation always starts from the tail. It moves
node without ref-bit set to the head of the inactive list.
It moves node with ref-bit set back to the head of the active
list and then clears its ref-bit.
4. The inactive rotation is pretty simply.
It walks the inactive list and moves the nodes back to the head of
active list if its ref-bit is set. The ref-bit is cleared after moving
to the active list.
If the node does not have ref-bit set, it just leave it as it is
because it is already in the inactive list.
* Shrinking the Inactive List (of the global LRU list):
1. Shrinking is the operation to get free nodes when the bpf_htab is
full.
2. It usually only shrinks the inactive list to get free nodes.
3. During shrinking, it will walk the inactive list from the tail,
delete the nodes without ref-bit set from bpf_htab.
4. If no free node found after step (3), it will forcefully get
one node from the tail of inactive or active list. Forcefully is
in the sense that it ignores the ref-bit.
* Local List:
1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to
batch enough operations before acquiring the lock of the
global LRU.
2. A local list has two sub-lists, free-list and pending-list.
3. During bpf_update_elem(), it will try to get from the free-list
of (the current CPU local list).
4. If the local free-list is empty, it will acquire from the
global LRU list. The global LRU list can either satisfy it
by its global free-list or by shrinking the global inactive
list. Since we have acquired the global LRU list lock,
it will try to get at most LOCAL_FREE_TARGET elements
to the local free list.
5. When a new element is added to the bpf_htab, it will
first sit at the pending-list (of the local list) first.
The pending-list will be flushed to the global LRU list
when it needs to acquire free nodes from the global list
next time.
* Lock Consideration:
The LRU list has a lock (lru_lock). Each bucket of htab has a
lock (buck_lock). If both locks need to be acquired together,
the lock order is always lru_lock -> buck_lock and this only
happens in the bpf_lru_list.c logic.
In hashtab.c, both locks are not acquired together (i.e. one
lock is always released first before acquiring another lock).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
can cause a ftrace check to trigger and disable ftrace. This is because
of the way modules are registered to ftrace. Their functions are
loaded in the ftrace function tables but set to "disabled" since
they are still in the process of being loaded by the module. After
the module is finished, it calls back into the ftrace infrastructure
to enable it. Looking deeper into the locations that access all the
functions in the table, I found more locations that should ignore
the disabled ones.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJYKxq1FBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
k24IAIvZoDRiFvwdF8ZNrngD8ZT1IfEg771yzouKgTXT+mHhty8IpO6YRIYxJ5vS
bLmmg1wGhSjBUGdAsm1e16r8bbxuZo7SbYx0CUz/XEQimbKJZU56M6oBrp22qtEv
f4TBN0RPHhghG5IpHTm1Cvh4FZ9NgDkoWVuSKf7/vhxJ1GNpu/cpaTS60x+X6Fdj
mFVIdwDRcupqXJLtqoB4tDC+iekqD0Zwj9yxpBL12Em/PvcgtXsBO4oaP0vEMOES
ylGwEY0jCSpRJ2EsX1QnZDMP3DM0m9JLGmkzGuTwekGsHxe9+ODyQeYZyZ105rbD
hYPfo3xS0diXnOyosxetgjefUwQ=
=VoP8
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Alexei discovered a race condition in modules failing to load that can
cause a ftrace check to trigger and disable ftrace.
This is because of the way modules are registered to ftrace. Their
functions are loaded in the ftrace function tables but set to
"disabled" since they are still in the process of being loaded by the
module. After the module is finished, it calls back into the ftrace
infrastructure to enable it.
Looking deeper into the locations that access all the functions in the
table, I found more locations that should ignore the disabled ones"
* tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records
Commit:
db4a835601 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
failed to verify that event->cgrp is actually the scheduled cgroup
in a CPU before setting cpuctx->cgrp. This patch fixes that.
Now that there is a different path for scheduled and unscheduled
cgroup, add a warning to catch when cpuctx->cgrp is still set after
the last cgroup event has been unsheduled.
To verify the bug:
# Create 2 cgroups.
mkdir /dev/cgroups/devices/g1
mkdir /dev/cgroups/devices/g2
# launch a task, bind it to a cpu and move it to g1
CPU=2
while :; do : ; done &
P=$!
taskset -pc $CPU $P
echo $P > /dev/cgroups/devices/g1/tasks
# monitor g2 (it runs no tasks) and observe output
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000091408 7,579,527 cycles g2
2.000350111 <not counted> cycles g2
3.000589181 <not counted> cycles g2
4.000771428 <not counted> cycles g2
# note first line that displays that a task run in g2, despite
# g2 having no tasks. This is because cpuctx->cgrp was wrongly
# set when context of new event was installed.
# After applying the fix we obtain the right output:
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000119615 <not counted> cycles g2
2.000389430 <not counted> cycles g2
3.000590962 <not counted> cycles g2
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now since fetch_task_cputime() has no other users than task_cputime(),
its code could be used directly in task_cputime().
Moreover since only 2 task_cputime() calls of 17 use a NULL argument,
we can add dummy variables to those calls and remove NULL checks from
task_cputimes().
Also remove NULL checks from task_cputimes_scaled().
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479175612-14718-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Only s390 and powerpc have hardware facilities allowing to measure
cputimes scaled by frequency. On all other architectures
utimescaled/stimescaled are equal to utime/stime (however they are
accounted separately).
Remove {u,s}timescaled accounting on all architectures except
powerpc and s390, where those values are explicitly accounted
in the proper places.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161031162143.GB12646@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently cputime_to_scaled() just return it's argument on
all implementations, we don't need to call this function.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479175612-14718-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
1) Fix off by one wrt. indexing when dumping /proc/net/route entries,
from Alexander Duyck.
2) Fix lockdep splats in iwlwifi, from Johannes Berg.
3) Cure panic when inserting certain netfilter rules when NFT_SET_HASH
is disabled, from Liping Zhang.
4) Memory leak when nft_expr_clone() fails, also from Liping Zhang.
5) Disable UFO when path will apply IPSEC tranformations, from Jakub
Sitnicki.
6) Don't bogusly double cwnd in dctcp module, from Florian Westphal.
7) skb_checksum_help() should never actually use the value "0" for the
resulting checksum, that has a special meaning, use CSUM_MANGLED_0
instead. From Eric Dumazet.
8) Per-tx/rx queue statistic strings are wrong in qed driver, fix from
Yuval MIntz.
9) Fix SCTP reference counting of associations and transports in
sctp_diag. From Xin Long.
10) When we hit ip6tunnel_xmit() we could have come from an ipv4 path in
a previous layer or similar, so explicitly clear the ipv6 control
block in the skb. From Eli Cooper.
11) Fix bogus sleeping inside of inet_wait_for_connect(), from WANG
Cong.
12) Correct deivce ID of T6 adapter in cxgb4 driver, from Hariprasad
Shenai.
13) Fix potential access past the end of the skb page frag array in
tcp_sendmsg(). From Eric Dumazet.
14) 'skb' can legitimately be NULL in inet{,6}_exact_dif_match(). Fix
from David Ahern.
15) Don't return an error in tcp_sendmsg() if we wronte any bytes
successfully, from Eric Dumazet.
16) Extraneous unlocks in netlink_diag_dump(), we removed the locking
but forgot to purge these unlock calls. From Eric Dumazet.
17) Fix memory leak in error path of __genl_register_family(). We leak
the attrbuf, from WANG Cong.
18) cgroupstats netlink policy table is mis-sized, from WANG Cong.
19) Several XDP bug fixes in mlx5, from Saeed Mahameed.
20) Fix several device refcount leaks in network drivers, from Johan
Hovold.
21) icmp6_send() should use skb dst device not skb->dev to determine L3
routing domain. From David Ahern.
22) ip_vs_genl_family sets maxattr incorrectly, from WANG Cong.
23) We leak new macvlan port in some cases of maclan_common_netlink()
errors. Fix from Gao Feng.
24) Similar to the icmp6_send() fix, icmp_route_lookup() should
determine L3 routing domain using skb_dst(skb)->dev not skb->dev.
Also from David Ahern.
25) Several fixes for route offloading and FIB notification handling in
mlxsw driver, from Jiri Pirko.
26) Properly cap __skb_flow_dissect()'s return value, from Eric Dumazet.
27) Fix long standing regression in ipv4 redirect handling, wrt.
validating the new neighbour's reachability. From Stephen Suryaputra
Lin.
28) If sk_filter() trims the packet excessively, handle it reasonably in
tcp input instead of exploding. From Eric Dumazet.
29) Fix handling of napi hash state when copying channels in sfc driver,
from Bert Kenward.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (121 commits)
mlxsw: spectrum_router: Flush FIB tables during fini
net: stmmac: Fix lack of link transition for fixed PHYs
sctp: change sk state only when it has assocs in sctp_shutdown
bnx2: Wait for in-flight DMA to complete at probe stage
Revert "bnx2: Reset device during driver initialization"
ps3_gelic: fix spelling mistake in debug message
net: ethernet: ixp4xx_eth: fix spelling mistake in debug message
ibmvnic: Fix size of debugfs name buffer
ibmvnic: Unmap ibmvnic_statistics structure
sfc: clear napi_hash state when copying channels
mlxsw: spectrum_router: Correctly dump neighbour activity
mlxsw: spectrum: Fix refcount bug on span entries
bnxt_en: Fix VF virtual link state.
bnxt_en: Fix ring arithmetic in bnxt_setup_tc().
Revert "include/uapi/linux/atm_zatm.h: include linux/time.h"
tcp: take care of truncations done by sk_filter()
ipv4: use new_gw for redirect neigh lookup
r8152: Fix error path in open function
net: bpqether.h: remove if_ether.h guard
net: __skb_flow_dissect() must cap its return value
...
When a module is first loaded and its function ip records are added to the
ftrace list of functions to modify, they are set to DISABLED, as their text
is still in a read only state. When the module is fully loaded, and can be
updated, the flag is cleared, and if their's any functions that should be
tracing them, it is updated at that moment.
But there's several locations that do record accounting and should ignore
records that are marked as disabled, or they can cause issues.
Alexei already fixed one location, but others need to be addressed.
Cc: stable@vger.kernel.org
Fixes: b7ffffbb46 "ftrace: Add infrastructure for delayed enabling of module functions"
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_shutdown() checks for sanity of ftrace records
and if dyn_ftrace->flags is not zero, it will warn.
It can happen that 'flags' are set to FTRACE_FL_DISABLED at this point,
since some module was loaded, but before ftrace_module_enable()
cleared the flags for this module.
In other words the module.c is doing:
ftrace_module_init(mod); // calls ftrace_update_code() that sets flags=FTRACE_FL_DISABLED
... // here ftrace_shutdown() is called that warns, since
err = prepare_coming_module(mod); // didn't have a chance to clear FTRACE_FL_DISABLED
Fix it by ignoring disabled records.
It's similar to what __ftrace_hash_rec_update() is already doing.
Link: http://lkml.kernel.org/r/1478560460-3818619-1-git-send-email-ast@fb.com
Cc: stable@vger.kernel.org
Fixes: b7ffffbb46 "ftrace: Add infrastructure for delayed enabling of module functions"
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Replace the custom u64_to_ptr() function with the u64_to_user_ptr()
macro.
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code can result in spurious kicks when there are no grace
periods in progress and no grace-period-related requests. This is
sort of OK for a diagnostic aid, but the resulting ftrace-dump messages
in dmesg are annoying. This commit therefore avoids spurious kicks
in the common case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Expedited grace periods check dyntick-idle state, and avoid sending
IPIs to idle CPUs, including those running guest OSes, and, on NOHZ_FULL
kernels, nohz_full CPUs. However, the kernel has been observed checking
a CPU while it was non-idle, but sending the IPI after it has gone
idle. This commit therefore rechecks idle state immediately before
sending the IPI, refraining from IPIing CPUs that have since gone idle.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although rcutorture will occasionally do a 50-millisecond grace-period
delay, these delays are quite rare. And rightly so, because otherwise
the read rate would be quite low. Thie means that it can be important
to identify whether or not a given run contained a long-delay read.
This commit therefore inserts a trace_rcu_torture_read() event to flag
runs containing long delays.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The __call_rcu() comment about opportunistically noting grace period
beginnings and endings is obsolete. RCU still does such opportunistic
noting, but in __call_rcu_core() rather than __call_rcu(), and there
already is an appropriate comment in __call_rcu_core(). This commit
therefore removes the obsolete comment.
Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In the deep past, rcu_check_callbacks() was only invoked if rcu_pending()
returned true. Which was fine, but these days rcu_check_callbacks()
is invoked unconditionally. This commit therefore removes the obsolete
sentence from the header comment.
Reported-by: Michalis Kokologiannakis <mixaskok@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Commit 720abae3d6 ("rcu: force alignment on struct
callback_head/rcu_head") forced the rcu_head (AKA callback_head)
structure's alignment to pointer size, that is, to 4-byte boundaries on
32-bit systems and to 8-byte boundaries on 64-bit systems. This
commit therefore checks for this same alignment in __call_rcu(),
which used to contain a looser check for two-byte alignment.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This reverts commit bfd8d3f23b.
It turns out that this flushes things much too aggressiverly, and causes
lines to break up when the system logger races with new continuation
lines being printed.
There's a pending patch to make printk() flushing much more
straightforward, but it's too invasive for 4.9, so in the meantime let's
just not make the system message logging flush continuation lines.
They'll be flushed by the final newline anyway.
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull irq fix from Ingo Molnar:
"This fixes a genirq regression that resulted in the Intel/Broxton
pinctrl/GPIO driver (and possibly others) spewing warnings"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Use irq type from irqdata instead of irqdesc
Commit 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings
scheme") added a bug in that the prog's reference count is not dropped
in the error path when mlx4_en_try_alloc_resources() is failing from
mlx4_xdp_set().
We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), that we
need to release again. Earlier in the call path, dev_change_xdp_fd()
itself holds a reference to the prog as well (hence the '- 1' in the
bpf_prog_add()), so a simple atomic_sub() is safe to use here. When
an error is propagated, then bpf_prog_put() is called eventually from
dev_change_xdp_fd()
Fixes: 67f8b1dcb9 ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Prevent the PM core from attempting to suspend parent devices
if any of their children, whose suspend callbacks were invoked
asynchronously, have failed to suspend during the "late" and
"noirq" phases of system-wide suspend of devices (Brian Norris).
- Prevent the boot-time system suspend test code from leaking a
reference to the RTC device used by it (Johan Hovold).
- Fix cpupower to use the return value of one of its library
functions correctly and restore the correct behavior of it
when used for setting cpufreq tunables broken during the 4.7
development cycle (Laura Abbott).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJYJloUAAoJEILEb/54YlRxkYgP/RvTRLO+6Hlk0Re/Lm5SS7O4
YjOrcemWkGkA+YJ5Qe4vki28nMzGITIAWuHJK7Nex0A1zFMX3GrNPxxMPHBejfog
7VjeEp77x0h9F8EzWX2YLZaTCSpvQjpHH5/vUQaTIE8a/WtZVbSzfCT4jqAYYHzV
rHn52aNwiAKd4kjMTSBJdjTOdhqebRFYRHpBHovoTARcGH4DgEJciQzaEh1O6qAq
barOGrDDkB32mULjsBNRX4Wyx1wuJiOpVjiD7sIM75ZomlIDC/RyYc9O37t6GDBC
koTTlilV2/i2VoqcU5nQgAmgUOoDz1skGJW6Rsp/d/S2GnZiLBJwma31e187otBL
NFl1egL1BJaz5J1sow7YPYBQILzlYGwNhXLiKHoPEshWp5WSn7mIhxxzAou3Llsp
oPCXDwuRrMhqYrOdclxiL+Pj6dUapvut3SerZB5C9rGg2JL9Y2gDxrhf/jFqXxNb
mBw3kRx94bDb6DNHJPuplRHWrL0SlM2ZXnUa7XwUSt+jIiwMK2la0cAL1P/7Qva9
zJ6SG01BnGEd6OI+y9ZSjITdpTVK10JIuCrTmR9cvWNA+A1vOGCkg6Nj+qunmzFh
j6W2O0KQ/moR0ICCPCg96Hjf888mohX7vhGQqDQwpXYlsvWcWvjNpURsfY47yBH9
p70+ziBkqyyIAprJKjLB
=ykEg
-----END PGP SIGNATURE-----
Merge tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two bugs in error code paths in the PM core (system-wide
suspend of devices), a device reference leak in the boot-time suspend
test code and a cpupower utility regression from the 4.7 cycle.
Specifics:
- Prevent the PM core from attempting to suspend parent devices if
any of their children, whose suspend callbacks were invoked
asynchronously, have failed to suspend during the "late" and
"noirq" phases of system-wide suspend of devices (Brian Norris).
- Prevent the boot-time system suspend test code from leaking a
reference to the RTC device used by it (Johan Hovold).
- Fix cpupower to use the return value of one of its library
functions correctly and restore the correct behavior of it when
used for setting cpufreq tunables broken during the 4.7 development
cycle (Laura Abbott)"
* tag 'pm-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails
PM / sleep: fix device reference leak in test_suspend
cpupower: Correct return type of cpu_power_is_cpu_online() in cpufreq-set
This reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path").
The reverted commit changes existing behavior on which many ARM boards
rely. Many ARM small-board-computers, like e.g. the Raspberry Pi have
both a video output and a serial console. Depending on whether the user
is using the device as a more regular computer; or as a headless device
we need to have the console on either one or the other.
Many users rely on the kernel behavior of the console being present on
both outputs, before the reverted commit the console setup with no
console= kernel arguments on an ARM board which sets stdout-path in dt
would look like this:
[root@localhost ~]# cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
tty0 -WU (E p ) 4:1
Where as after the reverted commit, it looks like this:
[root@localhost ~]# cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
This commit reverts commit 05fd007e46 ("console: don't prefer first
registered if DT specifies stdout-path") restoring the original
behavior.
Fixes: 05fd007e46 ("console: don't prefer first registered if DT specifies stdout-path")
Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the comment:
/*
* The task might have changed its scheduling policy to something
* different than SCHED_DEADLINE (through switched_fromd_dl()).
*/
s/fromd/from/
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5408b3b3f9ee197a7b7f10fb834341100a4f2c88.1478599881.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'class' parameter is not used, remove it.
n
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1478592127-4376-1-git-send-email-tahsin@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the unused but set variables min_set and max_set in
adjust_reg_min_max_vals to fix the following warning when building with
'W=1':
kernel/bpf/verifier.c:1483:7: warning: variable ‘min_set’ set but not used [-Wunused-but-set-variable]
There is no warning about max_set being unused, but since it is only
used in the assignment of min_set it can be removed as well.
They were introduced in commit 484611357c ("bpf: allow access into map
value arrays") but seem to have never been used.
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Install the callbacks via the state machine.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161103145021.28528-3-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Only calculate the affinity for the main I/O vectors, and skip the
pre or post vectors specified by struct irq_affinity.
Also remove the irq_affinity cpumask argument that has never been used.
If we ever need it in the future we can pass it through struct
irq_affinity.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Link: http://lkml.kernel.org/r/1478654107-7384-4-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Only calculate the affinity for the main I/O vectors, and skip the pre or
post vectors specified by struct irq_affinity.
Also remove the irq_affinity cpumask argument that has never been used. If
we ever need it in the future we can pass it through struct irq_affinity.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Link: http://lkml.kernel.org/r/1478654107-7384-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The type flags in the irq descriptor are there for historical reasons and
only updated via irq_modify_status() or irq_set_type(). Both functions also
update the type flags in irqdata. __setup_irq() is the only left over user
of the type flags in the irq descriptor.
If __setup_irq() is called with empty irq type flags, then the type flags
are retrieved from irqdata. If an interrupt is shared, then the type flags
are compared with the type flags stored in the irq descriptor.
On x86 the ioapic does not have a irq_set_type() callback because the type
is defined in the BIOS tables and cannot be changed. The type is stored in
irqdata at setup time without updating the type data in the irq
descriptor. As a result the comparison described above fails.
There is no point in updating the irq descriptor flags because the only
relevant storage is irqdata. Use the type flags from irqdata for both
retrieval and comparison in __setup_irq() instead.
Aside of that the print out in case of non matching type flags has the old
and new type flags arguments flipped. Fix that as well.
For correctness sake the flags stored in the irq descriptor should be
removed, but this is beyond the scope of this bugfix and will be done in a
later patch.
Fixes: 4b357daed6 ("genirq: Look-up trigger type if not specified by caller")
Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jon Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In map_create(), we first find and create the map, then once that
suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore old
behaviour from pre prealloc times. When non-prealloc is in use for a
map, then problem is that once a hash table extra element has been
linked into the hash-table, and the hash table is destroyed due to
refcount dropping to zero, then htab_map_free() -> delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The problem is that the element from the extra reserve is first fed
to the wrong backend allocator and eventually freed twice.
Fixes: a6ed3ea65d ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull stack vmap fixups from Thomas Gleixner:
"Two small patches related to sched_show_task():
- make sure to hold a reference on the task stack while accessing it
- remove the thread_saved_pc printout
.. and add a sanity check into release_task_stack() to catch problems
with task stack references"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Remove pointless printout in sched_show_task()
sched/core: Fix oops in sched_show_task()
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fork: Add task stack refcounting sanity check and prevent premature task stack freeing
cgroupstats_cmd_get_policy is [CGROUPSTATS_CMD_ATTR_MAX+1],
taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1],
but their family.maxattr is TASKSTATS_CMD_ATTR_MAX.
CGROUPSTATS_CMD_ATTR_MAX is less than TASKSTATS_CMD_ATTR_MAX,
so we could end up accessing out-of-bound.
Change cgroupstats_cmd_get_policy to TASKSTATS_CMD_ATTR_MAX+1,
this is safe because the rest are initialized to 0's.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_THREAD_INFO_IN_TASK=y, it is possible that an exited thread
remains in the task list after its stack pointer was already set to NULL.
Therefore, thread_saved_pc() and stack_not_used() in sched_show_task()
will trigger NULL pointer dereference if an attempt to dump such thread's
traces (e.g. SysRq-t, khungtaskd) is made.
Since show_stack() in sched_show_task() calls try_get_task_stack() and
sched_show_task() is called from interrupt context, calling
try_get_task_stack() from sched_show_task() will be safe as well.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: brgerst@gmail.com
Cc: jann@thejh.net
Cc: keescook@chromium.org
Cc: linux-api@vger.kernel.org
Cc: tycho.andersen@canonical.com
Link: http://lkml.kernel.org/r/201611021950.FEJ34368.HFFJOOMLtQOVSF@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make sure to drop the reference taken by class_find_device() after
opening the RTC device.
Fixes: 77437fd4e6 (pm: boot time suspend selftest)
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
If something goes wrong with task stack refcounting and a stack
refcount hits zero too early, warn and leak it rather than
potentially freeing it early (and silently).
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While commit bb35a6ef7d ("bpf, inode: allow for rename and link ops")
added support for hard links that can be used for prog and map nodes,
this work adds simple symlink support, which can be used f.e. for
directories also when unpriviledged and works with cmdline tooling that
understands S_IFLNK anyway. Since the switch in e27f4a942a ("bpf: Use
mount_nodev not mount_ns to mount the bpf filesystem"), there can be
various mount instances with mount_nodev() and thus hierarchy can be
flattened to facilitate object sharing. Thus, we can keep bpf tooling
also working by repointing paths.
Most of the functionality can be used from vfs library operations. The
symlink is stored in the inode itself, that is in i_link, which is
sufficient in our case as opposed to storing it in the page cache.
While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
the directories mtime and ctime, so add a common helper for it called
bpf_dentry_finalize() that takes care of it for all cases now.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple overlapping changes.
For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
The verifier currently prints raw function ids when printing CALL
instructions or when complaining:
5: (85) call 23
unknown func 23
print a meaningful function name instead:
5: (85) call bpf_redirect#23
unknown func bpf_redirect#23
Moves the function documentation to a single comment and renames all
helpers names in the list to conform to the bpf_ prefix notation so
they can be greped in the kernel source.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Specifics:
- Fix a missing KERN_CONT in a system suspend message by converting
the affected code to using pr_info() and pr_cont() instead of the
"raw" printk() (Jon Hunter).
- Make intel_pstate set the CPU P-state from its .set_policy()
callback when the scaling_governor sysfs attribute is set to
"performance" so that it interacts with NOHZ_FULL more
predictably which was the case before 4.7 (Rafael Wysocki).
- Make intel_pstate always request the maximum allowed P-state when
the scaling_governor sysfs attribute is set to "performance" to
prevent it from effectively ingoring that setting is some
situations (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJYE+vbAAoJEILEb/54YlRxsqQQAIZLfm5Isp4L0JBt2svtLYuO
5mLgGOfcpsHIpqq5+saKtK+JOcCpm/MJ0PMQdv3WhKkFGXG9++Dz6F2gNNFlE+Sy
era/B9NFry1BsQ14xsX8XonYRkQ2tS+CgOw+akSZl9BTVeX57VASqPIzlKI4Auz/
Ru5M8n4QWAD7dPWnqHTFHYbmTnMtr18wn0mNjiC/cLz7TPxmGM9Pe730/bpGKMDC
DQpflnpT/ktqV1imDuaiMeuP2re2DSZWMrirZo326UsDtcKN1A4sCO+PPUUt/TTb
k6pKsZ9OB9FgMmaRqx5G6wCD3NvuhKa1JOqg3RvAVZ0SmMcl2AFhCaoEJtX+TKvz
3365o2tSQeGc3bmXTb7YqP2SaYm107rOG/M0I2JuA7qNUKxk8MFfbNdaYxTY2ezT
D5NF0SCycOMeEXIDTTuZ2oN1NHR6osO7bNFOIacKlcIL8PmEl8fDqX2KdYQze5xh
AGsa1Oee7FPrDd2kQIL6HNGgVjd6gtHU9cVjyzQeUv9SU6zwOlu4/BbZI+wIterN
L2OfW7w6/KSSyScDZ3TbifeJwpP918bUv578dEP6U/2VKthPBUPY/MYMMQzSuEIn
RAQoDShqacIjdyLZtStii28KctMOk0L/edEDbdFmyu2ttNwpB+hk2ieCYWfI8wAa
2SAaYyRGpC2nafsSTv+E
=RtkB
-----END PGP SIGNATURE-----
Merge tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two intel_pstate issues related to the way it works when the
scaling_governor sysfs attribute is set to "performance" and fix up
messages in the system suspend core code.
Specifics:
- Fix a missing KERN_CONT in a system suspend message by converting
the affected code to using pr_info() and pr_cont() instead of the
"raw" printk() (Jon Hunter).
- Make intel_pstate set the CPU P-state from its .set_policy()
callback when the scaling_governor sysfs attribute is set to
"performance" so that it interacts with NOHZ_FULL more predictably
which was the case before 4.7 (Rafael Wysocki).
- Make intel_pstate always request the maximum allowed P-state when
the scaling_governor sysfs attribute is set to "performance" to
prevent it from effectively ingoring that setting is some
situations (Rafael Wysocki)"
* tag 'pm-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Always set max P-state in performance mode
PM / suspend: Fix missing KERN_CONT for suspend message
cpufreq: intel_pstate: Set P-state upfront in performance mode
Pull perf fixes from Ingo Molnar:
"Misc kernel fixes: a virtualization environment related fix, an uncore
PMU driver removal handling fix, a PowerPC fix and new events for
Knights Landing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
perf/powerpc: Don't call perf_event_disable() from atomic context
perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
perf/x86/intel/cstate: Add C-state residency events for Knights Landing
Pull timer fixes from Ingo Molnar:
"Fix four timer locking races: two were noticed by Linus while
reviewing the code while chasing for a corruption bug, and two
from fixing spurious USB timeouts"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Prevent base clock corruption when forwarding
timers: Prevent base clock rewind when forwarding clock
timers: Lock base for same bucket optimization
timers: Plug locking race vs. timer migration
Pull objtool, irq and scheduler fixes from Ingo Molnar:
"One more objtool fixlet for GCC6 code generation patterns, an irq
DocBook fix and an unused variable warning fix in the scheduler"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix rare switch jump table pattern detection
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
doc: Add missing parameter for msi_setup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Remove unused but set variable 'rq'
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.
In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.
Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The trinity syscall fuzzer triggered following WARN() on powerpc:
WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
...
NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
Call Trace:
[c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
Followed by a lockdep warning:
===============================
[ INFO: suspicious RCU usage. ]
4.8.0-rc5+ #7 Tainted: G W
-------------------------------
./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by ls/2998:
#0: (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
#1: (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
stack backtrace:
CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7
Call Trace:
[c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
[c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
[c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
[c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
[c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
[c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
[c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
While it looks like the first WARN() is probably valid, the other one is
triggered by disabling event via perf_event_disable() from atomic context.
The event is disabled here in case we were not able to emulate
the instruction that hit the breakpoint. By disabling the event
we unschedule the event and make sure it's not scheduled back.
But we can't call perf_event_disable() from atomic context, instead
we need to use the event's pending_disable irq_work method to disable it.
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CAI Qian reported a crash in the PMU uncore device removal code,
enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:
https://marc.info/?l=linux-kernel&m=147688837328451
The reason for the crash is that perf_pmu_unregister() tries to remove
a PMU device which is not added at this point. We add PMU devices
only after pmu_bus is registered, which happens in the
perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.
The fix is to get the 'pmu_bus_running' flag state at the point
the PMU is taken out of the PMU list and remove the device
later only if it's set.
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
in_interrupt() returns a nonzero value when we are either in an
interrupt or have bh disabled via local_bh_disable(). Since we are
interested in only ignoring coverage from actual interrupts, do a proper
check instead of just calling in_interrupt().
As a result of this change, kcov will start to collect coverage from
within local_bh_disable()/local_bh_enable() sections.
Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Nicolai Stange <nicstange@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morse <james.morse@arm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.
In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.
This protects the data structure from accidental corruption.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.
This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.
Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)
Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block fixes from Jens Axboe:
"A set of fixes for this series, most notably the fix for the blk-mq
software queue regression in from this merge window.
Apart from that, a fix for an unlikely hang if a queue is flooded with
FUA requests from Ming, and a few small fixes for nbd and badblocks.
Lastly, a rename update for the proc softirq output, since the block
polling code was made generic"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: update hardware and software queues for sleeping alloc
block: flush: fix IO hang in case of flood fua req
nbd: fix incorrect unlock of nbd->sock_lock in sock_shutdown
badblocks: badblocks_set/clear update unacked_exist
softirq: Display IRQ_POLL for irq-poll statistics
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:
wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().
The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).
It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.
As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.
Peter Zijlstra already has a patch for that, but let's see if anybody
even notices. In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit:
8663e24d56 ("sched/fair: Reorder cgroup creation code")
... the variable 'rq' in alloc_fair_sched_group() is set but no longer used.
Remove it to fix the following GCC warning when building with 'W=1':
kernel/sched/fair.c:8842:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161026113704.8981-1-tklauser@distanz.ch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The documentation for schedule_timeout(), schedule_hrtimeout(), and
schedule_hrtimeout_range() all claim that the routines couldn't possibly
return early if the task state was TASK_UNINTERRUPTIBLE. This is simply
not true since wake_up_process() will cause those routines to exit early.
We cannot make schedule_[hr]timeout() loop until the timeout expires if the
task state is uninterruptible because we have users which rely on the
existing and designed behaviour.
Make the documentation match the (correct) implementation.
schedule_hrtimeout() returns -EINTR even when a uninterruptible task was
woken up. This might look strange, but making the return code depend on the
state is too much of an effort as it would affect all the call sites. There
is no value in doing so, but we spell it out clearly in the documentation.
Suggested-by: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: huangtao@rock-chips.com
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-2-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter. However, nothing in the code ensures
this, when the sleeping task is woken by wake_up_process() or any other
mechanism which can wake a task from uninterruptible state.
Neither usleep_range() nor schedule_hrtimeout_range*() have any protection
against wakeups. schedule_hrtimeout_range*() is designed this way despite
the fact that the API documentation does not mention it.
msleep() already has code to handle this case since it will loop as long
as there was still time left. usleep_range() has no such loop, add it.
Presumably this problem was not detected before because usleep_range() is
only used in a few places and the function is mostly used in contexts which
are not exposed to wakeups of any form.
An effort was made to look for users relying on the old behavior by
looking for usleep_range() in the same file as wake_up_process().
No problems were found by this search, though it is conceivable that
someone could have put the sleep and wakeup in two different files.
An effort was made to ask several upstream maintainers if they were aware
of people relying on wake_up_process() to wake up usleep_range(). No
maintainers were aware of that but they were aware of many people relying
on usleep_range() never returning before the minimum.
Reported-by: Tao Huang <huangtao@rock-chips.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: djkurtz@chromium.org
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-1-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently after bringing up secondary CPUs all arches print "Brought up
%d CPUs". On x86 they also print the number of nodes that were brought
online.
It would be nice to also print the number of nodes on other arches.
Although we could override smp_announce() on the other ~10 NUMA aware
arches, it seems simpler to just always print the number of nodes. On
non-NUMA arches there is just always 1 node.
Having done that, smp_announce() is no longer weak, and seems small
enough to just pull directly into smp_init().
Also update the printing of "%d CPUs" to be smart when an SMP kernel is
booted on a single CPU system, or when only one CPU is available, eg:
smp: Brought up 2 nodes, 1 CPU
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: akpm@osdl.org
Cc: jgross@suse.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: len.brown@intel.com
Cc: peterz@infradead.org
Cc: richard@nod.at
Cc: jolsa@redhat.com
Cc: boris.ostrovsky@oracle.com
Cc: mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1477460275-8266-2-git-send-email-mpe@ellerman.id.au
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For mostly historical reasons, the x86 oops dump shows the raw stack
values:
...
[registers]
Stack:
ffff880079af7350 ffff880079905400 0000000000000000 ffffc900008f3ae0
ffffffffa0196610 0000000000000001 00010000ffffffff 0000000087654321
0000000000000002 0000000000000000 0000000000000000 0000000000000000
Call Trace:
...
This seems to be an artifact from long ago, and probably isn't needed
anymore. It generally just adds noise to the dump, and it can be
actively harmful because it leaks kernel addresses.
Linus says:
"The stack dump actually goes back to forever, and it used to be
useful back in 1992 or so. But it used to be useful mainly because
stacks were simpler and we didn't have very good call traces anyway. I
definitely remember having used them - I just do not remember having
used them in the last ten+ years.
Of course, it's still true that if you can trigger an oops, you've
likely already lost the security game, but since the stack dump is so
useless, let's aim to just remove it and make games like the above
harder."
This also removes the related 'kstack=' cmdline option and the
'kstack_depth_to_print' sysctl.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e83bd50df52d8fe88e94d2566426ae40d813bf8f.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a timer is enqueued we try to forward the timer base clock. This
mechanism has two issues:
1) Forwarding a remote base unlocked
The forwarding function is called from get_target_base() with the current
timer base lock held. But if the new target base is a different base than
the current base (can happen with NOHZ, sigh!) then the forwarding is done
on an unlocked base. This can lead to corruption of base->clk.
Solution is simple: Invoke the forwarding after the target base is locked.
2) Possible corruption due to jiffies advancing
This is similar to the issue in get_net_timer_interrupt() which was fixed
in the previous patch. jiffies can advance between check and assignement
and therefore advancing base->clk beyond the next expiry value.
So we need to read jiffies into a local variable once and do the checks and
assignment with the local copy.
Fixes: a683f390b93f("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Ashton and Michael reported, that kernel versions 4.8 and later suffer from
USB timeouts which are caused by the timer wheel rework.
This is caused by a bug in the base clock forwarding mechanism, which leads
to timers expiring early. The scenario which leads to this is:
run_timers()
while (jiffies >= base->clk) {
collect_expired_timers();
base->clk++;
expire_timers();
}
So base->clk = jiffies + 1. Now the cpu goes idle:
idle()
get_next_timer_interrupt()
nextevt = __next_time_interrupt();
if (time_after(nextevt, base->clk))
base->clk = jiffies;
jiffies has not advanced since run_timers(), so this assignment effectively
decrements base->clk by one.
base->clk is the index into the timer wheel arrays. So let's assume the
following state after the base->clk increment in run_timers():
jiffies = 0
base->clk = 1
A timer gets enqueued with an expiry delta of 63 ticks (which is the case
with the USB timeout and HZ=250) so the resulting bucket index is:
base->clk + delta = 1 + 63 = 64
The timer goes into the first wheel level. The array size is 64 so it ends
up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
to index into bucket 0 again.
If the cpu goes idle before jiffies advance, then the bug in the forwarding
mechanism sets base->clk back to 0, so the next invocation of run_timers()
at the next tick will index into bucket 0 and therefore expire the timer 62
ticks too early.
Instead of blindly setting base->clk to jiffies we must make the forwarding
conditional on jiffies > base->clk, but we cannot use jiffies for this as
we might run into the following issue:
if (time_after(jiffies, base->clk) {
if (time_after(nextevt, base->clk))
base->clk = jiffies;
jiffies can increment between the check and the assigment far enough to
advance beyond nextevt. So we need to use a stable value for checking.
get_next_timer_interrupt() has the basej argument which is the jiffies
value snapshot taken in the calling code. So we can just that.
Thanks to Ashton for bisecting and providing trace data!
Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Linus stumbled over the unlocked modification of the timer expiry value in
mod_timer() which is an optimization for timers which stay in the same
bucket - due to the bucket granularity - despite their expiry time getting
updated.
The optimization itself still makes sense even if we take the lock, because
in case that the bucket stays the same, we avoid the pointless
queue/enqueue dance.
Make the check and the modification of timer->expires protected by the base
lock and shuffle the remaining code around so we can keep the lock held
when we actually have to requeue the timer to a different bucket.
Fixes: f00c0afdfa ("timers: Implement optimization for same expiry time in mod_timer()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
timer flags. As a consequence the compiler is allowed to reload the flags
between the initial check for TIMER_MIGRATION and the following timer base
computation and the spin lock of the base.
While this has not been observed (yet), we need to make sure that it never
happens.
Fixes: 0eeda71bc3 ("timer: Replace timer base by a cpu index")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
This patch makes the waiter that sets the HANDOFF flag start spinning
instead of sleeping until the handoff is complete or the owner
sleeps. Otherwise, the handoff will cause the optimistic spinners to
abort spinning as the handed-off owner may not be running.
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <Will.Deacon@arm.com>
Link: http://lkml.kernel.org/r/1472254509-27508-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch removes some of the redundant ww_mutex code in
__mutex_lock_common().
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Imre Deak <imre.deak@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <Will.Deacon@arm.com>
Link: http://lkml.kernel.org/r/1472254509-27508-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Doesn't really matter yet, but pull the HANDOFF and trylock out from
under the wait_lock.
The intention is to add an optimistic spin loop here, which requires
we do not hold the wait_lock, so shuffle code around in preparation.
Also clarify the purpose of taking the wait_lock in the wait loop, its
tempting to want to avoid it altogether, but the cancellation cases
need to to avoid losing wakeups.
Suggested-by: Waiman Long <waiman.long@hpe.com>
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Implement lock handoff to avoid lock starvation.
Lock starvation is possible because mutex_lock() allows lock stealing,
where a running (or optimistic spinning) task beats the woken waiter
to the acquire.
Lock stealing is an important performance optimization because waiting
for a waiter to wake up and get runtime can take a significant time,
during which everyboy would stall on the lock.
The down-side is of course that it allows for starvation.
This patch has the waiter requesting a handoff if it fails to acquire
the lock upon waking. This re-introduces some of the wait time,
because once we do a handoff we have to wait for the waiter to wake up
again.
A future patch will add a round of optimistic spinning to attempt to
alleviate this penalty, but if that turns out to not be enough, we can
add a counter and only request handoff after multiple failed wakeups.
There are a few tricky implementation details:
- accepting a handoff must only be done in the wait-loop. Since the
handoff condition is owner == current, it can easily cause
recursive locking trouble.
- accepting the handoff must be careful to provide the ACQUIRE
semantics.
- having the HANDOFF bit set on unlock requires care, we must not
clear the owner.
- we must be careful to not leave HANDOFF set after we've acquired
the lock. The tricky scenario is setting the HANDOFF bit on an
unlocked mutex.
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that mutex::count and mutex::owner are the same field, we can
allow SPIN_ON_OWNER while DEBUG_MUTEX.
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current mutex implementation has an atomic lock word and a
non-atomic owner field.
This disparity leads to a number of issues with the current mutex code
as it means that we can have a locked mutex without an explicit owner
(because the owner field has not been set, or already cleared).
This leads to a number of weird corner cases, esp. between the
optimistic spinning and debug code. Where the optimistic spinning
code needs the owner field updated inside the lock region, the debug
code is more relaxed because the whole lock is serialized by the
wait_lock.
Also, the spinning code itself has a few corner cases where we need to
deal with a held lock without an owner field.
Furthermore, it becomes even more of a problem when trying to fix
starvation cases in the current code. We end up stacking special case
on special case.
To solve this rework the basic mutex implementation to be a single
atomic word that contains the owner and uses the low bits for extra
state.
This matches how PI futexes and rt_mutex already work. By having the
owner an integral part of the lock state a lot of the problems
dissapear and we get a better option to deal with starvation cases,
direct owner handoff.
Changing the basic mutex does however invalidate all the arch specific
mutex code; this patch leaves that unused in-place, a later patch will
remove that.
Tested-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There were a few questions wrt. how sleep-wakeup works. Try and explain
it more.
Requested-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 4bcc595ccd (printk: reinstate KERN_CONT for printing
continuation lines) exposed a missing KERN_CONT from one of the
messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
after syncing the filesystems no longer appears as a continuation but
a new message with its own timestamp.
[ 9.259566] PM: Syncing filesystems ... [ 9.264119] done.
Fix this by adding the KERN_CONT log level for the 'done.' part of the
message seen after syncing filesystems. While we are at it, convert
these suspend printks to pr_info and pr_cont, respectively.
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use case is mainly for soreuseport to select sockets for the local
numa node, but since generic, lets also add this for other networking
and tracing program types.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull irq fixes from Ingo Molnar:
"Mostly irqchip driver fixes, plus a symbol export"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/irq: Export irq_set_parent()
irqchip/gic: Add missing \n to CPU IF adjustment message
irqchip/jcore: Don't show Kconfig menu item for driver
irqchip/eznps: Drop pointless static qualifier in nps400_of_init()
irqchip/gic-v3-its: Fix entry size mask for GITS_BASER
irqchip/gic-v3-its: Fix 64bit GIC{R,ITS}_TYPER accesses
This library was moved to the generic area and was
renamed to irq-poll. Hence, update proc/softirqs output accordingly.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The TPS65217 driver grew interrupt support which uses
irq_set_parent(). While it's not yet clear why this is used in the first
place, building the driver as a module fails with:
ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined!
The correctness of the driver change is still investigated, but for now
it's less trouble to export irq_set_parent() than dealing with the build
wreckage.
[ tglx: Rewrote changelog and made the export GPL ]
Fixes: 6556bdacf6 ("mfd: tps65217: Add support for IRQs")
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Marcin Niestroj <m.niestroj@grinn-global.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Lee Jones <lee.jones@linaro.org>
Link: http://lkml.kernel.org/r/1475775403-27207-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The last user of this tunable was removed in 2012 in commit:
82958366cf ("sched: Replace update_shares weight distribution with per-entity computation")
Delete it since its very existence confuses people.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161019141059.26408-1-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fix from Ingo Molnar:
"This fixes a group scheduling related performance/interactivity
regression introduced in v4.8, which affects certain hardware
environments where cpu_possible_mask != cpu_present_mask"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix incorrect task group ->load_avg
While splitting up workqueue initialization into two parts,
ac8f73400782 ("workqueue: make workqueue available early during boot")
put wq_numa_init() into workqueue_init_early(). Unfortunately, on
some archs including power and arm64, cpu to node mapping isn't yet
established by the time the early init is called leading to incorrect
NUMA initialization and subsequently the following oops due to zero
cpumask on node-specific unbound pools.
Unable to handle kernel paging request for data at address 0x00000038
Faulting instruction address: 0xc0000000000fc0cc
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
task: c0000007f5400000 task.stack: c000001ffc084000
NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 48000424 XER: 00000000
CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
LR [c0000000000ed928] activate_task+0x78/0xe0
Call Trace:
[c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
[c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
[c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
[c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
[c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
[c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
[c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
[c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
[c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
---[ end trace 0000000000000000 ]---
Fix it by moving wq_numa_init() to workqueue_init(). As this means
that the early intialization may not have full NUMA info for per-cpu
pools and ignores NUMA affinity for unbound pools, fix them up from
workqueue_init() after wq_numa_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Link: http://lkml.kernel.org/r/87twck5wqo.fsf@concordia.ellerman.id.au
Fixes: ac8f73400782 ("workqueue: make workqueue available early during boot")
Signed-off-by: Tejun Heo <tj@kernel.org>
We have a fairly common pattern where you print several things as
continuations on one single line in a loop, and then at the end you do
printk(KERN_CONT "\n");
to flush the buffered output.
But if the output was flushed by something else (concurrent printk
activity, or just system logging), we don't want that final flushing to
just print an empty line.
So just suppress empty continuation lines when they couldn't be merged
into the line they are a continuation of.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge the gup_flags cleanups from Lorenzo Stoakes:
"This patch series adjusts functions in the get_user_pages* family such
that desired FOLL_* flags are passed as an argument rather than
implied by flags.
The purpose of this change is to make the use of FOLL_FORCE explicit
so it is easier to grep for and clearer to callers that this flag is
being used. The use of FOLL_FORCE is an issue as it overrides missing
VM_READ/VM_WRITE flags for the VMA whose pages we are reading
from/writing to, which can result in surprising behaviour.
The patch series came out of the discussion around commit 38e0885465
("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
which addressed a BUG_ON() being triggered when a page was faulted in
with PROT_NONE set but having been overridden by FOLL_FORCE.
do_numa_page() was run on the assumption the page _must_ be one marked
for NUMA node migration as an actual PROT_NONE page would have been
dealt with prior to this code path, however FOLL_FORCE introduced a
situation where this assumption did not hold.
See
https://marc.info/?l=linux-mm&m=147585445805166
for the patch proposal"
Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.
[ This branch was rebased recently to add a few more acked-by's and
reviewed-by's ]
* gup_flag-cleanups:
mm: replace access_process_vm() write parameter with gup_flags
mm: replace access_remote_vm() write parameter with gup_flags
mm: replace __access_remote_vm() write parameter with gup_flags
mm: replace get_user_pages_remote() write/force parameters with gup_flags
mm: replace get_user_pages() write/force parameters with gup_flags
mm: replace get_vaddr_frames() write/force parameters with gup_flags
mm: replace get_user_pages_locked() write/force parameters with gup_flags
mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
mm: remove write/force parameters from __get_user_pages_unlocked()
mm: remove write/force parameters from __get_user_pages_locked()
mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
This removes the 'write' argument from access_process_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the 'write' and 'force' from get_user_pages_remote() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A BPF program is required to check the return register of a
map_elem_lookup() call before accessing memory. The verifier keeps
track of this by converting the type of the result register from
PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional
jump ensures safety. This check is currently exclusively performed
for the result register 0.
In the event the compiler reorders instructions, BPF_MOV64_REG
instructions may be moved before the conditional jump which causes
them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the
verifier objects when the register is accessed:
0: (b7) r1 = 10
1: (7b) *(u64 *)(r10 -8) = r1
2: (bf) r2 = r10
3: (07) r2 += -8
4: (18) r1 = 0x59c00000
6: (85) call 1
7: (bf) r4 = r0
8: (15) if r0 == 0x0 goto pc+1
R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp
9: (7a) *(u64 *)(r4 +0) = 0
R4 invalid mem access 'map_value_or_null'
This commit extends the verifier to keep track of all identical
PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by
assigning them an ID and then marking them all when the conditional
jump is observed.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A scheduler performance regression has been reported by Joseph Salisbury,
which he bisected back to:
3d30544f02 ("sched/fair: Apply more PELT fixes)
The regression triggers when several levels of task groups are involved
(read: SystemD) and cpu_possible_mask != cpu_present_mask.
The root cause is that group entity's load (tg_child->se[i]->avg.load_avg)
is initialized to scale_load_down(se->load.weight). During the creation of
a child task group, its group entities on possible CPUs are attached to
parent's cfs_rq (tg_parent) and their loads are added to the parent's load
(tg_parent->load_avg) with update_tg_load_avg().
But only the load on online CPUs will then be updated to reflect real load,
whereas load on other CPUs will stay at the initial value.
The result is a tg_parent->load_avg that is higher than the real load, the
weight of group entities (tg_parent->se[i]->load.weight) on online CPUs is
smaller than it should be, and the task group gets a less running time than
what it could expect.
( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
of the task group will be much higher than sum of ".tg_load_avg_contrib"
of online cfs_rqs of the task group. )
The load of group entities don't have to be intialized to something else
than 0 because their load will increase when an entity is attached.
Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org> # 4.8.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: joonwoop@codeaurora.org
Fixes: 3d30544f02 ("sched/fair: Apply more PELT fixes)
Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull timer fixlet from Ingo Molnar:
"Remove an unused variable"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Remove unused but set variable
Pull scheduler fix from Ingo Molnar:
"Fix a crash that can trigger when racing with CPU hotplug: we didn't
use sched-domains data structures carefully enough in select_idle_cpu()"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix sched domains NULL dereference in select_idle_sibling()
Pull misc fixes from Ingo Molnar:
"A CPU hotplug debuggability fix and three objtool false positive
warnings fixes for new GCC6 code generation patterns"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Use distinct name for cpu_hotplug.dep_map
objtool: Skip all "unreachable instruction" warnings for gcov kernels
objtool: Improve rare switch jump table pattern detection
objtool: Support '-mtune=atom' stack frame setup instruction
Remove the set but unused variable base in alarm_clock_get to fix the
following warning when building with 'W=1':
kernel/time/alarmtimer.c: In function ‘alarm_timer_create’:
kernel/time/alarmtimer.c:545:21: warning: variable ‘base’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20161017094702.10873-1-tklauser@distanz.ch
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use distinctive name for cpu_hotplug.dep_map to avoid the actual
cpu_hotplug.lock appearing as cpu_hotplug.lock#2 in lockdep splats.
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Gautham R . Shenoy <ego@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
pa/A7CNQwibIV6YD8+/p
=1dUK
-----END PGP SIGNATURE-----
Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugins update from Kees Cook:
"This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot
time as possible, hoping to capitalize on any possible variation in
CPU operation (due to runtime data differences, hardware differences,
SMP ordering, thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example
for how to manipulate kernel code using the gcc plugin internals"
* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
latent_entropy: Mark functions with __latent_entropy
gcc-plugins: Add latent_entropy plugin
Pull cgroup updates from Tejun Heo:
- tracepoints for basic cgroup management operations added
- kernfs and cgroup path formatting functions updated to behave in the
style of strlcpy()
- non-critical bug fixes
* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
cpuset: fix error handling regression in proc_cpuset_show()
cgroup: add tracepoints for basic operations
cgroup: make cgroup_path() and friends behave in the style of strlcpy()
kernfs: remove kernfs_path_len()
kernfs: make kernfs_path*() behave in the style of strlcpy()
kernfs: add dummy implementation of kernfs_path_from_node()
This affectively reverts commit 377ccbb483 ("Makefile: Mute warning
for __builtin_return_address(>0) for tracing only") because it turns out
that it really isn't tracing only - it's all over the tree.
We already also had the warning disabled separately for mm/usercopy.c
(which this commit also removes), and it turns out that we will also
want to disable it for get_lock_parent_ip(), that is used for at least
TRACE_IRQFLAGS. Which (when enabled) ends up being all over the tree.
Steven Rostedt had a patch that tried to limit it to just the config
options that actually triggered this, but quite frankly, the extra
complexity and abstraction just isn't worth it. We have never actually
had a case where the warning is actually useful, so let's just disable
it globally and not worry about it.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously hung_task_panic would not be respected if enabled after
hung_task_warnings had already been decremented to 0.
Permit the kernel to panic if hung_task_panic is enabled after
hung_task_warnings has already been decremented to 0 and another task
hangs for hung_task_timeout_secs seconds.
Check if hung_task_panic is enabled so we don't return prematurely, and
check if hung_task_warnings is non-zero so we don't print the warning
unnecessarily.
[akpm@linux-foundation.org: fix off-by-one]
Link: http://lkml.kernel.org/r/1473450214-4049-1-git-send-email-jsiddle@redhat.com
Signed-off-by: John Siddle <jsiddle@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch allows to make kthread worker freezable via a new @flags
parameter. It will allow to avoid an init work in some kthreads.
It currently does not affect the function of kthread_worker_fn()
but it might help to do some optimization or fixes eventually.
I currently do not know about any other use for the @flags
parameter but I believe that we will want more flags
in the future.
Finally, I hope that it will not cause confusion with @flags member
in struct kthread. Well, I guess that we will want to rework the
basic kthreads implementation once all kthreads are converted into
kthread workers or workqueues. It is possible that we will merge
the two structures.
Link: http://lkml.kernel.org/r/1470754545-17632-12-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are situations when we need to modify the delay of a delayed kthread
work. For example, when the work depends on an event and the initial delay
means a timeout. Then we want to queue the work immediately when the event
happens.
This patch implements kthread_mod_delayed_work() as inspired workqueues.
It cancels the timer, removes the work from any worker list and queues it
again with the given timeout.
A very special case is when the work is being canceled at the same time.
It might happen because of the regular kthread_cancel_delayed_work_sync()
or by another kthread_mod_delayed_work(). In this case, we do nothing and
let the other operation win. This should not normally happen as the caller
is supposed to synchronize these operations a reasonable way.
Link: http://lkml.kernel.org/r/1470754545-17632-11-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to use kthread workers more widely and sometimes we will need
to make sure that the work is neither pending nor running.
This patch implements cancel_*_sync() operations as inspired by
workqueues. Well, we are synchronized against the other operations via
the worker lock, we use del_timer_sync() and a counter to count parallel
cancel operations. Therefore the implementation might be easier.
First, we check if a worker is assigned. If not, the work has newer been
queued after it was initialized.
Second, we take the worker lock. It must be the right one. The work must
not be assigned to another worker unless it is initialized in between.
Third, we try to cancel the timer when it exists. The timer is deleted
synchronously to make sure that the timer call back is not running. We
need to temporary release the worker->lock to avoid a possible deadlock
with the callback. In the meantime, we set work->canceling counter to
avoid any queuing.
Fourth, we try to remove the work from a worker list. It might be
the list of either normal or delayed works.
Fifth, if the work is running, we call kthread_flush_work(). It might
take an arbitrary time. We need to release the worker-lock again. In the
meantime, we again block any queuing by the canceling counter.
As already mentioned, the check for a pending kthread work is done under a
lock. In compare with workqueues, we do not need to fight for a single
PENDING bit to block other operations. Therefore we do not suffer from
the thundering storm problem and all parallel canceling jobs might use
kthread_flush_work(). Any queuing is blocked until the counter gets zero.
Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to use kthread_worker more widely and delayed works
will be pretty useful.
The implementation is inspired by workqueues. It uses a timer to queue
the work after the requested delay. If the delay is zero, the work is
queued immediately.
In compare with workqueues, each work is associated with a single worker
(kthread). Therefore the implementation could be much easier. In
particular, we use the worker->lock to synchronize all the operations with
the work. We do not need any atomic operation with a flags variable.
In fact, we do not need any state variable at all. Instead, we add a list
of delayed works into the worker. Then the pending work is listed either
in the list of queued or delayed works. And the existing check of pending
works is the same even for the delayed ones.
A work must not be assigned to another worker unless reinitialized.
Therefore the timer handler might expect that dwork->work->worker is valid
and it could simply take the lock. We just add some sanity checks to help
with debugging a potential misuse.
Link: http://lkml.kernel.org/r/1470754545-17632-9-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nothing currently prevents a work from queuing for a kthread worker when
it is already running on another one. This means that the work might run
in parallel on more than one worker. Also some operations are not
reliable, e.g. flush.
This problem will be even more visible after we add kthread_cancel_work()
function. It will only have "work" as the parameter and will use
worker->lock to synchronize with others.
Well, normally this is not a problem because the API users are sane.
But bugs might happen and users also might be crazy.
This patch adds a warning when we try to insert the work for another
worker. It does not fully prevent the misuse because it would make the
code much more complicated without a big benefit.
It adds the same warning also into kthread_flush_work() instead of the
repeated attempts to get the right lock.
A side effect is that one needs to explicitly reinitialize the work if it
must be queued into another worker. This is needed, for example, when the
worker is stopped and started again. It is a bit inconvenient. But it
looks like a good compromise between the stability and complexity.
I have double checked all existing users of the kthread worker API and
they all seems to initialize the work after the worker gets started.
Just for completeness, the patch adds a check that the work is not already
in a queue.
The patch also puts all the checks into a separate function. It will be
reused when implementing delayed works.
Link: http://lkml.kernel.org/r/1470754545-17632-8-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current kthread worker users call flush() and stop() explicitly.
This function does the same plus it frees the kthread_worker struct
in one call.
It is supposed to be used together with kthread_create_worker*() that
allocates struct kthread_worker.
Link: http://lkml.kernel.org/r/1470754545-17632-7-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kthread workers are currently created using the classic kthread API,
namely kthread_run(). kthread_worker_fn() is passed as the @threadfn
parameter.
This patch defines kthread_create_worker() and
kthread_create_worker_on_cpu() functions that hide implementation details.
They enforce using kthread_worker_fn() for the main thread. But I doubt
that there are any plans to create any alternative. In fact, I think that
we do not want any alternative main thread because it would be hard to
support consistency with the rest of the kthread worker API.
The naming and function of kthread_create_worker() is inspired by the
workqueues API like the rest of the kthread worker API.
The kthread_create_worker_on_cpu() variant is motivated by the original
kthread_create_on_cpu(). Note that we need to bind per-CPU kthread
workers already when they are created. It makes the life easier.
kthread_bind() could not be used later for an already running worker.
This patch does _not_ convert existing kthread workers. The kthread
worker API need more improvements first, e.g. a function to destroy the
worker.
IMPORTANT:
kthread_create_worker_on_cpu() allows to use any format of the worker
name, in compare with kthread_create_on_cpu(). The good thing is that it
is more generic. The bad thing is that most users will need to pass the
cpu number in two parameters, e.g. kthread_create_worker_on_cpu(cpu,
"helper/%d", cpu).
To be honest, the main motivation was to avoid the need for an empty
va_list. The only legal way was to create a helper function that would be
called with an empty list. Other attempts caused compilation warnings or
even errors on different architectures.
There were also other alternatives, for example, using #define or
splitting __kthread_create_worker(). The used solution looked like the
least ugly.
Link: http://lkml.kernel.org/r/1470754545-17632-6-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kthread_create_on_node() implements a bunch of logic to create the
kthread. It is already called by kthread_create_on_cpu().
We are going to extend the kthread worker API and will need to call
kthread_create_on_node() with va_list args there.
This patch does only a refactoring and does not modify the existing
behavior.
Link: http://lkml.kernel.org/r/1470754545-17632-5-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kthread_create_on_cpu() was added by the commit 2a1d446019
("kthread: Implement park/unpark facility"). It is currently used only
when enabling new CPU. For this purpose, the newly created kthread has to
be parked.
The CPU binding is a bit tricky. The kthread is parked when the CPU has
not been allowed yet. And the CPU is bound when the kthread is unparked.
The function would be useful for more per-CPU kthreads, e.g.
bnx2fc_thread, fcoethread. For this purpose, the newly created kthread
should stay in the uninterruptible state.
This patch moves the parking into smpboot. It binds the thread already
when created. Then the function might be used universally. Also the
behavior is consistent with kthread_create() and kthread_create_on_node().
Link: http://lkml.kernel.org/r/1470754545-17632-4-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kthread: Kthread worker API improvements"
The intention of this patchset is to make it easier to manipulate and
maintain kthreads. Especially, I want to replace all the custom main
cycles with a generic one. Also I want to make the kthreads sleep in a
consistent state in a common place when there is no work.
This patch (of 11):
A good practice is to prefix the names of functions by the name of the
subsystem.
This patch fixes the name of probe_kthread_data(). The other wrong
functions names are part of the kthread worker API and will be fixed
separately.
Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As of Android N, SECCOMP is required. Without it, we will get
mediaextractor error:
E /system/bin/mediaextractor: libminijail: prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER): Invalid argument
Link: http://lkml.kernel.org/r/20160908185934.18098-3-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>