mirror of https://gitee.com/openkylin/linux.git
333 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Xunlei Pang | 512ac999d2 |
sched/fair: Fix bandwidth timer clock drift condition
I noticed that cgroup task groups constantly get throttled even if they have low CPU usage, this causes some jitters on the response time to some of our business containers when enabling CPU quotas. It's very simple to reproduce: mkdir /sys/fs/cgroup/cpu/test cd /sys/fs/cgroup/cpu/test echo 100000 > cpu.cfs_quota_us echo $$ > tasks then repeat: cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily After some analysis, we found that cfs_rq::runtime_remaining will be cleared by expire_cfs_rq_runtime() due to two equal but stale "cfs_{b|q}->runtime_expires" after period timer is re-armed. The current condition to judge clock drift in expire_cfs_rq_runtime() is wrong, the two runtime_expires are actually the same when clock drift happens, so this condtion can never hit. The orginal design was correctly done by this commit: |
|
Vincent Guittot | 296b2ffe7f |
sched/rt: Fix call to cpufreq_update_util()
With commit: |
|
Linus Torvalds | f7f4e7fc6c |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar: - power-aware scheduling improvements (Patrick Bellasi) - NUMA balancing improvements (Mel Gorman) - vCPU scheduling fixes (Rohit Jain) * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Update util_est before updating schedutil sched/cpufreq: Modify aggregate utilization to always include blocked FAIR utilization sched/deadline/Documentation: Add overrun signal and GRUB-PA documentation sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu() sched/wait: Include <linux/wait.h> in <linux/swait.h> sched/numa: Stagger NUMA balancing scan periods for new threads sched/core: Don't schedule threads on pre-empted vCPUs sched/fair: Avoid calling sync_entity_load_avg() unnecessarily sched/fair: Rearrange select_task_rq_fair() to optimize it |
|
Davidlohr Bueso | 595058b667 |
sched/headers: Fix typo
I cannot spell 'throttling'. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Ingo Molnar | 0548dc5cde |
Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Mathieu Malaterre | f6a3463063 |
sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h
In the following commit:
|
|
Mel Gorman | 1378447598 |
sched/numa: Stagger NUMA balancing scan periods for new threads
Threads share an address space and each can change the protections of the same address space to trap NUMA faults. This is redundant and potentially counter-productive as any thread doing the update will suffice. Potentially only one thread is required but that thread may be idle or it may not have any locality concerns and pick an unsuitable scan rate. This patch uses independent scan period but they are staggered based on the number of address space users when the thread is created. The intent is that threads will avoid scanning at the same time and have a chance to adapt their scan rate later if necessary. This reduces the total scan activity early in the lifetime of the threads. The different in headline performance across a range of machines and workloads is marginal but the system CPU usage is reduced as well as overall scan activity. The following is the time reported by NAS Parallel Benchmark using unbound openmp threads and a D size class: 4.17.0-rc1 4.17.0-rc1 vanilla stagger-v1r1 Time bt.D 442.77 ( 0.00%) 419.70 ( 5.21%) Time cg.D 171.90 ( 0.00%) 180.85 ( -5.21%) Time ep.D 33.10 ( 0.00%) 32.90 ( 0.60%) Time is.D 9.59 ( 0.00%) 9.42 ( 1.77%) Time lu.D 306.75 ( 0.00%) 304.65 ( 0.68%) Time mg.D 54.56 ( 0.00%) 52.38 ( 4.00%) Time sp.D 1020.03 ( 0.00%) 903.77 ( 11.40%) Time ua.D 400.58 ( 0.00%) 386.49 ( 3.52%) Note it's not a universal win but we have no prior knowledge of which thread matters but the number of threads created often exceeds the size of the node when the threads are not bound. However, there is a reducation of overall system CPU usage: 4.17.0-rc1 4.17.0-rc1 vanilla stagger-v1r1 sys-time-bt.D 48.78 ( 0.00%) 48.22 ( 1.15%) sys-time-cg.D 25.31 ( 0.00%) 26.63 ( -5.22%) sys-time-ep.D 1.65 ( 0.00%) 0.62 ( 62.42%) sys-time-is.D 40.05 ( 0.00%) 24.45 ( 38.95%) sys-time-lu.D 37.55 ( 0.00%) 29.02 ( 22.72%) sys-time-mg.D 47.52 ( 0.00%) 34.92 ( 26.52%) sys-time-sp.D 119.01 ( 0.00%) 109.05 ( 8.37%) sys-time-ua.D 51.52 ( 0.00%) 45.13 ( 12.40%) NUMA scan activity is also reduced: NUMA alloc local 1042828 1342670 NUMA base PTE updates 140481138 93577468 NUMA huge PMD updates 272171 180766 NUMA page range updates 279832690 186129660 NUMA hint faults 1395972 1193897 NUMA hint local faults 877925 855053 NUMA hint local percent 62 71 NUMA pages migrated 12057909 9158023 Similar observations are made for other thread-intensive workloads. System CPU usage is lower even though the headline gains in performance tend to be small. For example, specjbb 2005 shows almost no difference in performance but scan activity is reduced by a third on a 4-socket box. I didn't find a workload (thread intensive or otherwise) that suffered badly. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Davidlohr Bueso | adcc8da885 |
sched/core: Simplify helpers for rq clock update skip requests
By renaming the functions we can get rid of the skip parameter and have better code redability. It makes zero sense to have things such as: rq_clock_skip_update(rq, false) When the skip request is in fact not going to happen. Ever. Rename things such that we end up with: rq_clock_skip_update(rq) rq_clock_cancel_skipupdate(rq) Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: matt@codeblueprint.co.uk Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805 Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Patrick Bellasi | a07630b8b2 |
sched/cpufreq/schedutil: Use util_est for OPP selection
When schedutil looks at the CPU utilization, the current PELT value for that CPU is returned straight away. In certain scenarios this can have undesired side effects and delays on frequency selection. For example, since the task utilization is decayed at wakeup time, a long sleeping big task newly enqueued does not add immediately a significant contribution to the target CPU. This introduces some latency before schedutil will be able to detect the best frequency required by that task. Moreover, the PELT signal build-up time is a function of the current frequency, because of the scale invariant load tracking support. Thus, starting from a lower frequency, the utilization build-up time will increase even more and further delays the selection of the actual frequency which better serves the task requirements. In order to reduce these kind of latencies, we integrate the usage of the CPU's estimated utilization in the sugov_get_util function. This allows to properly consider the expected utilization of a CPU which, for example, has just got a big task running after a long sleep period. Ultimately this allows to select the best frequency to run a task right after its wake-up. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steve Muckle <smuckle@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@android.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20180309095245.11071-4-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Vincent Guittot | f643ea2207 |
sched/nohz: Stop NOHZ stats when decayed
Stopped the periodic update of blocked load when all idle CPUs have fully decayed. We introduce a new nohz.has_blocked that reflect if some idle CPUs has blocked load that have to be periodiccally updated. nohz.has_blocked is set everytime that a Idle CPU can have blocked load and it is then clear when no more blocked load has been detected during an update. We don't need atomic operation but only to make cure of the right ordering when updating nohz.idle_cpus_mask and nohz.has_blocked. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: brendan.jackman@arm.com Cc: dietmar.eggemann@arm.com Cc: morten.rasmussen@foss.arm.com Cc: valentin.schneider@arm.com Link: http://lkml.kernel.org/r/1518517879-2280-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | 00357f5ec5 |
sched/nohz: Clean up nohz enter/exit
The primary observation is that nohz enter/exit is always from the current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be an atomic. Secondary is that we appear to have 2 nearly identical hooks in the nohz enter code, set_cpu_sd_state_idle() and nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into nohz_balance_{enter,exit}_idle. Removes an atomic op from both enter and exit paths. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | e022e0d38a |
sched/fair: Update blocked load from NEWIDLE
Since we already iterate CPUs looking for work on NEWIDLE, use this iteration to age the blocked load. If the domain for which this is done completely spand the idle set, we can push the ILB based aging forward. Suggested-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | b7031a02ec |
sched/fair: Add NOHZ_STATS_KICK
Split the NOHZ idle balancer into doing two separate actions: - update blocked load statistic - actually load-balance Since the latter requires the former, ensure this happens. For now always tag both bits at the same time. Prepares for a future where we can toggle only the STATS bit. Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | a22e47a4e3 |
sched/core: Convert nohz_flags to atomic_t
Using atomic_t allows us to use the more flexible bitops provided there. Also its smaller. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Ingo Molnar | 325ea10c08 |
sched/headers: Simplify and clean up header usage in the scheduler
Do the following cleanups and simplifications: - sched/sched.h already includes <asm/paravirt.h>, so no need to include it in sched/core.c again. - order the <linux/sched/*.h> headers alphabetically - add all <linux/sched/*.h> headers to kernel/sched/sched.h - remove all unnecessary includes from the .c files that are already included in kernel/sched/sched.h. Finally, make all scheduler .c files use a single common header: #include "sched.h" ... which now contains a union of the relied upon headers. This makes the various .c files easier to read and easier to handle. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Ingo Molnar | 97fb7a0a89 |
sched: Clean up and harmonize the coding style of the scheduler code base
A good number of small style inconsistencies have accumulated in the scheduler core, so do a pass over them to harmonize all these details: - fix speling in comments, - use curly braces for multi-line statements, - remove unnecessary parentheses from integer literals, - capitalize consistently, - remove stray newlines, - add comments where necessary, - remove invalid/unnecessary comments, - align structure definitions and other data types vertically, - add missing newlines for increased readability, - fix vertical tabulation where it's misaligned, - harmonize preprocessor conditional block labeling and vertical alignment, - remove line-breaks where they uglify the code, - add newline after local variable definitions, No change in functionality: md5: 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Frederic Weisbecker | dcdedb2415 |
sched/nohz: Remove the 1 Hz tick code
Now that the 1Hz tick is offloaded to workqueues, we can safely remove the residual code that used to handle it locally. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-7-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Frederic Weisbecker | d84b31313e |
sched/isolation: Offload residual 1Hz scheduler tick
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to keep the scheduler stats alive. However this residual tick is a burden for bare metal tasks that can't stand any interruption at all, or want to minimize them. The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now outsource these scheduler ticks to the global workqueue so that a housekeeping CPU handles those remotely. The sched_class::task_tick() implementations have been audited and look safe to be called remotely as the target runqueue and its current task are passed in parameter and don't seem to be accessed locally. Note that in the case of using isolcpus, it's still up to the user to affine the global workqueues to the housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or domains isolation "isolcpus=nohz,domain". Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Steven Rostedt (VMware) | 364f566537 |
sched/rt: Up the root domain ref count when passing it around via IPIs
When issuing an IPI RT push, where an IPI is sent to each CPU that has more
than one RT task scheduled on it, it references the root domain's rto_mask,
that contains all the CPUs within the root domain that has more than one RT
task in the runable state. The problem is, after the IPIs are initiated, the
rq->lock is released. This means that the root domain that is associated to
the run queue could be freed while the IPIs are going around.
Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
|
Juri Lelli | 07881166a8 |
sched/deadline: Make bandwidth enforcement scale-invariant
Apply frequency and CPU scale-invariance correction factor to bandwidth enforcement (similar to what we already do to fair utilization tracking). Each delta_exec gets scaled considering current frequency and maximum CPU capacity; which means that the reservation runtime parameter (that need to be specified profiling the task execution at max frequency on biggest capacity core) gets thus scaled accordingly. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Juri Lelli | 7e1a9208f6 |
sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
Currently, frequency and cpu capacity scaling is only performed on CONFIG_SMP systems (as CFS PELT signals are only present for such systems). However, other scheduling classes want to do freq/cpu scaling, and for !CONFIG_SMP configurations as well. arch_scale_freq_capacity() is useful to implement frequency scaling even on !CONFIG_SMP platforms, so we simply move it outside CONFIG_SMP ifdeffery. Even if arch_scale_cpu_capacity() is not useful on !CONFIG_SMP platforms, we make a default implementation available for such configurations anyway to simplify scheduler code doing CPU scale invariance. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-8-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Juri Lelli | 7673c8a4c7 |
sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to see where information coming from scheduling domains might help doing frequency invariance scaling). Remove it; also in anticipation of moving arch_scale_freq_capacity() outside CONFIG_SMP. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Juri Lelli | 794a56ebd9 |
sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
Worker kthread needs to be able to change frequency for all other threads. Make it special, just under STOP class. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-4-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Juri Lelli | e0367b1267 |
sched/deadline: Move CPU frequency selection triggering points
Since SCHED_DEADLINE doesn't track utilization signal (but reserves a fraction of CPU bandwidth to tasks admitted to the system), there is no point in evaluating frequency changes during each tick event. Move frequency selection triggering points to where running_bw changes. Co-authored-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-3-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Juri Lelli | d4edd662ac |
sched/cpufreq: Use the DEADLINE utilization signal
SCHED_DEADLINE tracks active utilization signal with a per dl_rq variable named running_bw. Make use of that to drive CPU frequency selection: add up FAIR and DEADLINE contribution to get the required CPU capacity to handle both requirements (while RT still selects max frequency). Co-authored-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-2-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
rodrigosiqueira | 31cb1bc0dc |
sched/core: Rework and clarify prepare_lock_switch()
The prepare_lock_switch() function has an unused parameter, and also the function name was not descriptive. To improve readability and remove the extra parameter, do the following changes: * Move prepare_lock_switch() from kernel/sched/sched.h to kernel/sched/core.c, rename it to prepare_task(), and remove the unused parameter. * Split the smp_store_release() out from finish_lock_switch() to a function named finish_task. * Comments ajdustments. Signed-off-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171215140603.gxe5i2y6fg5ojfpp@smtp.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Linus Torvalds | 22714a2ba4 |
Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "Cgroup2 cpu controller support is finally merged. - Basic cpu statistics support to allow monitoring by default without the CPU controller enabled. - cgroup2 cpu controller support. - /sys/kernel/cgroup files to help dealing with new / optional features" * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: export list of cgroups v2 features using sysfs cgroup: export list of delegatable control files using sysfs cgroup: mark @cgrp __maybe_unused in cpu_stat_show() MAINTAINERS: relocate cpuset.c cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat sched: Implement interface for cgroup unified hierarchy sched: Misc preps for cgroup unified hierarchy interface sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE cgroup: statically initialize init_css_set->dfl_cgrp cgroup: Implement cgroup2 basic CPU usage accounting cpuacct: Introduce cgroup_account_cputime[_field]() sched/cputime: Expose cputime_adjust() |
|
Patrick Bellasi | 765cc3a4b2 |
sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds
When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect that
all SCHED_FEAT are turned into compile time constants being propagated
to support compiler optimizations.
Specifically, we expect that code blocks like this:
if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
/* FEATURE CODE */
}
are turned into dead-code in case FEATURE_NAME defaults to FALSE, and thus
being removed by the compiler from the finale image.
For this mechanism to properly work it's required for the compiler to
have full access, from each translation unit, to whatever is the value
defined by the sched_feat macro. This macro is defined as:
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
and thus, the compiler can optimize that code only if the value of
sysctl_sched_features is visible within each translation unit.
Since:
|
|
Ingo Molnar | 8a103df440 |
Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Greg Kroah-Hartman | b24413180f |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
|
Steven Rostedt (Red Hat) | 4bdced5c9a |
sched/rt: Simplify the IPI based RT balancing logic
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
|
|
Peter Zijlstra | 8c0944cee7 |
sched/deadline: Rename __dl_clear() to __dl_sub()
__dl_sub() is more meaningful as a name, and is more consistent with the naming of the dual function (__dl_add()). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1504778971-13573-4-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
luca abeni | e964d3501b |
sched/headers: Remove duplicate prototype of __dl_clear_params()
Signed-off-by: luca abeni <luca.abeni@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1504778971-13573-2-git-send-email-luca.abeni@santannapisa.it Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | 1ea6c46a23 |
sched/fair: Propagate an effective runnable_load_avg
The load balancer uses runnable_load_avg as load indicator. For !cgroup this is: runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq That is, a direct sum of all runnable tasks on that runqueue. As opposed to load_avg, which is a sum of all tasks on the runqueue, which includes a blocked component. However, in the cgroup case, this comes apart since the group entities are always runnable, even if most of their constituent entities are blocked. Therefore introduce a runnable_weight which for task entities is the same as the regular weight, but for group entities is a fraction of the entity weight and represents the runnable part of the group runqueue. Then propagate this load through the PELT hierarchy to arrive at an effective runnable load avgerage -- which we should not confuse with the canonical runnable load average. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | 0e2d2aaaae |
sched/fair: Rewrite PELT migration propagation
When an entity migrates in (or out) of a runqueue, we need to add (or remove) its contribution from the entire PELT hierarchy, because even non-runnable entities are included in the load average sums. In order to do this we have some propagation logic that updates the PELT tree, however the way it 'propagates' the runnable (or load) change is (more or less): tg->weight * grq->avg.load_avg ge->avg.load_avg = ------------------------------ tg->load_avg But that is the expression for ge->weight, and per the definition of load_avg: ge->avg.load_avg := ge->weight * ge->avg.runnable_avg That destroys the runnable_avg (by setting it to 1) we wanted to propagate. Instead directly propagate runnable_sum. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | 2a2f5d4e44 |
sched/fair: Rewrite cfs_rq->removed_*avg
Since on wakeup migration we don't hold the rq->lock for the old CPU we cannot update its state. Instead we add the removed 'load' to an atomic variable and have the next update on that CPU collect and process it. Currently we have 2 atomic variables; which already have the issue that they can be read out-of-sync. Also, two atomic ops on a single cacheline is already more expensive than an uncontended lock. Since we want to add more, convert the thing over to an explicit cacheline with a lock in. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Vincent Guittot | 9059393e4e |
sched/fair: Use reweight_entity() for set_user_nice()
Now that we directly change load_avg and propagate that change into the sums, sys_nice() and co should do the same, otherwise its possible to confuse load accounting when we migrate near the weight change. Fixes-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [ Added changelog, fixed the call condition. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Tejun Heo | d2cc5ed694 |
cpuacct: Introduce cgroup_account_cputime[_field]()
Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge() and cgroup_account_field(). This doesn't introduce any functional changes and will be used to add cgroup basic resource accounting. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> |
|
Linus Torvalds | ec846ecd63 |
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar: "Three CPU hotplug related fixes and a debugging improvement" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Add debugfs knob for "sched_debug" sched/core: WARN() when migrating to an offline CPU sched/fair: Plug hole between hotplug and active_load_balance() sched/fair: Avoid newidle balance for !active CPUs |
|
Peter Zijlstra | 9469eb01db |
sched/debug: Add debugfs knob for "sched_debug"
I'm forever late for editing my kernel cmdline, add a runtime knob to disable the "sched_debug" thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170907150614.142924283@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Davidlohr Bueso | 2161573ecd |
sched/deadline: replace earliest dl and rq leftmost caching
... with the generic rbtree flavor instead. No changes in semantics whatsoever. Link: http://lkml.kernel.org/r/20170719014603.19029-9-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Davidlohr Bueso | bfb068892d |
sched/fair: replace cfs_rq->rb_leftmost
... with the generic rbtree flavor instead. No changes in semantics whatsoever. Link: http://lkml.kernel.org/r/20170719014603.19029-8-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 439644096c |
Power management updates for v4.14-rc1
- Drop the P-state selection algorithm based on a PID controller from intel_pstate and make it use the same P-state selection method (based on the CPU load) for all types of systems in the active mode (Rafael Wysocki, Srinivas Pandruvada). - Rework the cpufreq core and governors to make it possible to take cross-CPU utilization updates into account and modify the schedutil governor to actually do so (Viresh Kumar). - Clean up the handling of transition latency information in the cpufreq core and untangle it from the information on which drivers cannot do dynamic frequency switching (Viresh Kumar). - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek cpufreq driver and update its DT bindings (Sean Wang). - Modify the cpufreq dt-platdev driver to autimatically create cpufreq devices for the new (v2) Operating Performance Points (OPP) DT bindings and update its whitelist of supported systems (Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley Xiao). - Add support for Ux500 to the cpufreq-dt driver and drop the obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann). - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem Nguyen). - Fix and clean up assorted issues in the cpufreq drivers and core (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva, Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla). - Update the IO-wait boost handling in the schedutil governor to make it less aggressive (Joel Fernandes). - Rework system suspend diagnostics to make it print fewer messages to the kernel log by default, add a sysfs knob to allow more suspend-related messages to be printed and add Low Power S0 Idle constraints checks to the ACPI suspend-to-idle code (Rafael Wysocki, Srinivas Pandruvada). - Prefer suspend-to-idle over S3 on ACPI-based systems with the ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM interface present in the ACPI tables (Rafael Wysocki). - Update documentation related to system sleep and rename a number of items in the code to make it cleare that they are related to suspend-to-idle (Rafael Wysocki). - Export a variable allowing device drivers to check the target system sleep state from the core system suspend code (Florian Fainelli). - Clean up the cpuidle subsystem to handle the polling state on x86 in a more straightforward way and to use %pOF instead of full_name (Rafael Wysocki, Rob Herring). - Update the devfreq framework to fix and clean up a few minor issues (Chanwoo Choi, Rob Herring). - Extend diagnostics in the generic power domains (genpd) framework and clean it up slightly (Thara Gopinath, Rob Herring). - Fix and clean up a couple of issues in the operating performance points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz). - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling (AVS) driver (David Wu). - Fix the usage of notifiers in CPU power management on some platforms (Alex Shi). - Update the pm-graph system suspend/hibernation and boot profiling utility (Todd Brandt). - Make it possible to run the cpupower utility without CPU0 (Prarit Bhargava). -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJZrcDJAAoJEILEb/54YlRx9FUQAIUKvWBAARc61ZIZXjbqZF1v aEMOBuksFns0CMekdptSic6n4wc81E/XYMS8yDhOOMpyDzfAZsTWjmu+gKwN7w3l E/yf/NVlhob9JZ7MqGgqD4EUFfFIaKBXPlWFdDi2rdCUXE2L8xJ7rla8i7zyZlc5 pYHfAppBbF4qUcEY4OoOVOOGRZCfMdiLXj0iZOhMX8Y6yLBRk/AjnVADYsF33hoj gBEfomU+H0K5V8nQEp0ZFKDArPwL+oElHQj6i+nxBpGfPM5evvLXhHOyR6AsldJ5 J4YI1kMuQNSCmvHMqOTxTYyJf8Jcf3Fj4wcjwaVMVGceY1lz6McAKknnFnCqCvz+ mskn84gFCBCM8EoJDqRf0b9MQHcuRyQKM+yw4tjnR9r8yd32erb85ZWFHcPWYhCT fZatNOwFFv2MU+2vo5J3yeUNSWIKT+uBjy+tKPbrDkUwpKZVRj3Oj+hP3Mq9NE8U YBqltsj7tmrdA634zI8C7jfS6wF221S0fId/iPszwmPJaVn/lq8Ror7pWL5YI8U7 SCJFjiqDiGmAcQEkuWwFAQnscZkyHpO+Y3A+jfXl/izoaZETaI5+ceIHBaocm3+5 XrOOpHS3ik8EHf9ji0KFCKZ/pYDwllday3cBQPWo3sMIzpQ2lrjbqdnE1cVnBrld OtHZAeD/jLUXuY6XW2jN =mAiV -----END PGP SIGNATURE----- Merge tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "This time (again) cpufreq gets the majority of changes which mostly are driver updates (including a major consolidation of intel_pstate), some schedutil governor modifications and core cleanups. There also are some changes in the system suspend area, mostly related to diagnostics and debug messages plus some renames of things related to suspend-to-idle. One major change here is that suspend-to-idle is now going to be preferred over S3 on systems where the ACPI tables indicate to do so and provide requsite support (the Low Power Idle S0 _DSM in particular). The system sleep documentation and the tools related to it are updated too. The rest is a few cpuidle changes (nothing major), devfreq updates, generic power domains (genpd) framework updates and a few assorted modifications elsewhere. Specifics: - Drop the P-state selection algorithm based on a PID controller from intel_pstate and make it use the same P-state selection method (based on the CPU load) for all types of systems in the active mode (Rafael Wysocki, Srinivas Pandruvada). - Rework the cpufreq core and governors to make it possible to take cross-CPU utilization updates into account and modify the schedutil governor to actually do so (Viresh Kumar). - Clean up the handling of transition latency information in the cpufreq core and untangle it from the information on which drivers cannot do dynamic frequency switching (Viresh Kumar). - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek cpufreq driver and update its DT bindings (Sean Wang). - Modify the cpufreq dt-platdev driver to autimatically create cpufreq devices for the new (v2) Operating Performance Points (OPP) DT bindings and update its whitelist of supported systems (Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley Xiao). - Add support for Ux500 to the cpufreq-dt driver and drop the obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann). - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem Nguyen). - Fix and clean up assorted issues in the cpufreq drivers and core (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva, Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla). - Update the IO-wait boost handling in the schedutil governor to make it less aggressive (Joel Fernandes). - Rework system suspend diagnostics to make it print fewer messages to the kernel log by default, add a sysfs knob to allow more suspend-related messages to be printed and add Low Power S0 Idle constraints checks to the ACPI suspend-to-idle code (Rafael Wysocki, Srinivas Pandruvada). - Prefer suspend-to-idle over S3 on ACPI-based systems with the ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM interface present in the ACPI tables (Rafael Wysocki). - Update documentation related to system sleep and rename a number of items in the code to make it cleare that they are related to suspend-to-idle (Rafael Wysocki). - Export a variable allowing device drivers to check the target system sleep state from the core system suspend code (Florian Fainelli). - Clean up the cpuidle subsystem to handle the polling state on x86 in a more straightforward way and to use %pOF instead of full_name (Rafael Wysocki, Rob Herring). - Update the devfreq framework to fix and clean up a few minor issues (Chanwoo Choi, Rob Herring). - Extend diagnostics in the generic power domains (genpd) framework and clean it up slightly (Thara Gopinath, Rob Herring). - Fix and clean up a couple of issues in the operating performance points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz). - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling (AVS) driver (David Wu). - Fix the usage of notifiers in CPU power management on some platforms (Alex Shi). - Update the pm-graph system suspend/hibernation and boot profiling utility (Todd Brandt). - Make it possible to run the cpupower utility without CPU0 (Prarit Bhargava)" * tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits) cpuidle: Make drivers initialize polling state cpuidle: Move polling state initialization code to separate file cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol cpufreq: imx6q: Fix imx6sx low frequency support cpufreq: speedstep-lib: make several arrays static, makes code smaller PM: docs: Delete the obsolete states.txt document PM: docs: Describe high-level PM strategies and sleep states PM / devfreq: Fix memory leak when fail to register device PM / devfreq: Add dependency on PM_OPP PM / devfreq: Move private devfreq_update_stats() into devfreq PM / devfreq: Convert to using %pOF instead of full_name PM / AVS: rockchip-io: add io selectors and supplies for RV1108 cpufreq: ti: Fix 'of_node_put' being called twice in error handling path cpufreq: dt-platdev: Drop few entries from whitelist cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2 ARM: ux500: don't select CPUFREQ_DT cpuidle: Convert to using %pOF instead of full_name cpufreq: Convert to using %pOF instead of full_name PM / Domains: Convert to using %pOF instead of full_name cpufreq: Cap the default transition delay value to 10 ms ... |
|
Linus Torvalds | 5f82e71a00 |
Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar: - Add 'cross-release' support to lockdep, which allows APIs like completions, where it's not the 'owner' who releases the lock, to be tracked. It's all activated automatically under CONFIG_PROVE_LOCKING=y. - Clean up (restructure) the x86 atomics op implementation to be more readable, in preparation of KASAN annotations. (Dmitry Vyukov) - Fix static keys (Paolo Bonzini) - Add killable versions of down_read() et al (Kirill Tkhai) - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini) - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra) - Remove smp_mb__before_spinlock() and convert its usages, introduce smp_mb__after_spinlock() (Peter Zijlstra) * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits) locking/lockdep/selftests: Fix mixed read-write ABBA tests sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK() acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures smp: Avoid using two cache lines for struct call_single_data locking/lockdep: Untangle xhlock history save/restore from task independence locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being futex: Remove duplicated code and fix undefined behaviour Documentation/locking/atomic: Finish the document... locking/lockdep: Fix workqueue crossrelease annotation workqueue/lockdep: 'Fix' flush_work() annotation locking/lockdep/selftests: Add mixed read-write ABBA tests mm, locking/barriers: Clarify tlb_flush_pending() barriers locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive locking/lockdep: Explicitly initialize wq_barrier::done::map locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING locking/refcounts, x86/asm: Implement fast refcount overflow protection locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease ... |
|
Ying Huang | 966a967116 |
smp: Avoid using two cache lines for struct call_single_data
struct call_single_data is used in IPIs to transfer information between CPUs. Its size is bigger than sizeof(unsigned long) and less than cache line size. Currently it is not allocated with any explicit alignment requirements. This makes it possible for allocated call_single_data to cross two cache lines, which results in double the number of the cache lines that need to be transferred among CPUs. This can be fixed by requiring call_single_data to be aligned with the size of call_single_data. Currently the size of call_single_data is the power of 2. If we add new fields to call_single_data, we may need to add padding to make sure the size of new definition is the power of 2 as well. Fortunately, this is enforced by GCC, which will report bad sizes. To set alignment requirements of call_single_data to the size of call_single_data, a struct definition and a typedef is used. To test the effect of the patch, I used the vm-scalability multiple thread swap test case (swap-w-seq-mt). The test will create multiple threads and each thread will eat memory until all RAM and part of swap is used, so that huge number of IPIs are triggered when unmapping memory. In the test, the throughput of memory writing improves ~5% compared with misaligned call_single_data, because of faster IPIs. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Huang, Ying <ying.huang@intel.com> [ Add call_single_data_t and align with size of call_single_data. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Peter Zijlstra | bbdacdfed2 |
sched/debug: Optimize sched_domain sysctl generation
Currently we unconditionally destroy all sysctl bits and regenerate them after we've rebuild the domains (even if that rebuild is a no-op). And since we unconditionally (re)build the sysctl for all possible CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to only rebuild the bits for CPUs we've actually installed new domains on. Reported-by: Ofer Levi(SW) <oferle@mellanox.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Viresh Kumar | 674e75411f |
sched: cpufreq: Allow remote cpufreq callbacks
With Android UI and benchmarks the latency of cpufreq response to certain scheduling events can become very critical. Currently, callbacks into cpufreq governors are only made from the scheduler if the target CPU of the event is the same as the current CPU. This means there are certain situations where a target CPU may not run the cpufreq governor for some time. One testcase to show this behavior is where a task starts running on CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the system is configured such that the new tasks should receive maximum demand initially, this should result in CPU0 increasing frequency immediately. But because of the above mentioned limitation though, this does not occur. This patch updates the scheduler core to call the cpufreq callbacks for remote CPUs as well. The schedutil, ondemand and conservative governors are updated to process cpufreq utilization update hooks called for remote CPUs where the remote CPU is managed by the cpufreq policy of the local CPU. The intel_pstate driver is updated to always reject remote callbacks. This is tested with couple of usecases (Android: hackbench, recentfling, galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit octa-core, single policy). Only galleryfling showed minor improvements, while others didn't had much deviation. The reason being that this patch only targets a corner case, where following are required to be true to improve performance and that doesn't happen too often with these tests: - Task is migrated to another CPU. - The task has high demand, and should take the target CPU to higher OPPs. - And the target CPU doesn't call into the cpufreq governor until the next tick. Based on initial work from Steve Muckle. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Saravana Kannan <skannan@codeaurora.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
|
Nicolas Pitre | 8887cd9903 |
sched/rt: Move RT related code from sched/core.c to sched/rt.c
This helps making sched/core.c smaller and hopefully easier to understand and maintain. Signed-off-by: Nicolas Pitre <nico@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170621182203.30626-3-nicolas.pitre@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Nicolas Pitre | 06a76fe08d |
sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
This helps making sched/core.c smaller and hopefully easier to understand and maintain. Signed-off-by: Nicolas Pitre <nico@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170621182203.30626-2-nicolas.pitre@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Ingo Molnar | 6d3aed3d8a |
sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
This definition of SCHED_WARN_ON(): #define SCHED_WARN_ON(x) ((void)(x)) is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it has no return value, so it cannot be used in conditionals. Fix it. Cc: Daniel Axtens <dja@axtens.net> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |