Commit Graph

123 Commits

Author SHA1 Message Date
Rafael J. Wysocki 56026645e2 cpufreq: governor: Ensure sufficiently large sampling intervals
After commit aa7519af45 (cpufreq: Use transition_delay_us for legacy
governors as well) the sampling_rate field of struct dbs_data may be
less than the tick period which causes dbs_update() to produce
incorrect results, so make the code ensure that the value of that
field will always be sufficiently large.

Fixes: aa7519af45 (cpufreq: Use transition_delay_us for legacy governors as well)
Reported-by: Andy Tang <andy.tang@nxp.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Andy Tang <andy.tang@nxp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-12-18 12:09:39 +01:00
Rafael J. Wysocki 08a10002be Merge branch 'pm-cpufreq-sched'
* pm-cpufreq-sched:
  cpufreq: schedutil: Always process remote callback with slow switching
  cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily
  cpufreq: Return 0 from ->fast_switch() on errors
  cpufreq: Simplify cpufreq_can_do_remote_dvfs()
  cpufreq: Process remote callbacks from any CPU if the platform permits
  sched: cpufreq: Allow remote cpufreq callbacks
  cpufreq: schedutil: Use unsigned int for iowait boost
  cpufreq: schedutil: Make iowait boost more energy efficient
2017-09-04 00:05:22 +02:00
Viresh Kumar 674e75411f sched: cpufreq: Allow remote cpufreq callbacks
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.

One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above mentioned limitation though, this
does not occur.

This patch updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well.

The schedutil, ondemand and conservative governors are updated to
process cpufreq utilization update hooks called for remote CPUs where
the remote CPU is managed by the cpufreq policy of the local CPU.

The intel_pstate driver is updated to always reject remote callbacks.

This is tested with couple of usecases (Android: hackbench, recentfling,
galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
octa-core, single policy). Only galleryfling showed minor improvements,
while others didn't had much deviation.

The reason being that this patch only targets a corner case, where
following are required to be true to improve performance and that
doesn't happen too often with these tests:

- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
  OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
  next tick.

Based on initial work from Steve Muckle.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-01 14:24:53 +02:00
Viresh Kumar aa7519af45 cpufreq: Use transition_delay_us for legacy governors as well
The policy->transition_delay_us field is used only by the schedutil
governor currently, and this field describes how fast the driver wants
the cpufreq governor to change CPUs frequency. It should rather be a
common thing across all governors, as it doesn't have any schedutil
dependency here.

Create a new helper cpufreq_policy_transition_delay_us() to get the
transition delay across all governors.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-22 02:25:20 +02:00
Viresh Kumar 2d04503632 cpufreq: governor: Drop min_sampling_rate
The cpufreq core and governors aren't supposed to set a limit on how
fast we want to try changing the frequency. This is currently done for
the legacy governors with help of min_sampling_rate.

At worst, we may end up setting the sampling rate to a value lower than
the rate at which frequency can be changed and then one of the CPUs in
the policy will be only changing frequency for ever.

But that is something for the user to decide and there is no need to
have special handling for such cases in the core. Leave it for the user
to figure out.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-07-22 02:25:20 +02:00
Ingo Molnar 55687da166 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/cpufreq.h>
We are going to split <linux/sched/cpufreq.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/cpufreq.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Frederic Weisbecker 7fb1327ee9 sched/cputime: Convert kcpustat to nsecs
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.

Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:47 +01:00
Stratos Karafotis 00bfe05889 cpufreq: conservative: Decrease frequency faster for deferred updates
Conservative governor changes the CPU frequency in steps.
That means that if a CPU runs at max frequency, it will need several
sampling periods to return to min frequency when the workload
is finished.

If the update function that calculates the load and target frequency
is deferred, the governor might need even more time to decrease the
frequency.

This may have impact to power consumption and after all conservative
should decrease the frequency if there is no workload at every sampling
rate.

To resolve the above issue calculate the number of sampling periods
that the update is deferred. Considering that for each sampling period
conservative should drop the frequency by a freq_step because the
CPU was idle apply the proper subtraction to requested frequency.

Below, the kernel trace with and without this patch. First an
intensive workload is applied on a specific CPU. Then the workload
is removed and the CPU goes to idle.

WITHOUT

     <idle>-0     [007] dN..   620.329153: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   620.350857: cpu_frequency: state=1700000 cpu_id=7
kworker/7:2-556   [007] ....   620.370856: cpu_frequency: state=1900000 cpu_id=7
kworker/7:2-556   [007] ....   620.390854: cpu_frequency: state=2100000 cpu_id=7
kworker/7:2-556   [007] ....   620.411853: cpu_frequency: state=2200000 cpu_id=7
kworker/7:2-556   [007] ....   620.432854: cpu_frequency: state=2400000 cpu_id=7
kworker/7:2-556   [007] ....   620.453854: cpu_frequency: state=2600000 cpu_id=7
kworker/7:2-556   [007] ....   620.494856: cpu_frequency: state=2900000 cpu_id=7
kworker/7:2-556   [007] ....   620.515856: cpu_frequency: state=3100000 cpu_id=7
kworker/7:2-556   [007] ....   620.536858: cpu_frequency: state=3300000 cpu_id=7
kworker/7:2-556   [007] ....   620.557857: cpu_frequency: state=3401000 cpu_id=7
     <idle>-0     [007] d...   669.591363: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   669.591939: cpu_idle: state=4294967295 cpu_id=7
     <idle>-0     [007] d...   669.591980: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] dN..   669.591989: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   670.201224: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   670.221975: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   670.222016: cpu_frequency: state=3300000 cpu_id=7
     <idle>-0     [007] d...   670.222026: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   670.234964: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   670.801251: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   671.236046: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   671.236073: cpu_frequency: state=3100000 cpu_id=7
     <idle>-0     [007] d...   671.236112: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   671.393437: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   671.401277: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   671.404083: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   671.404111: cpu_frequency: state=2900000 cpu_id=7
     <idle>-0     [007] d...   671.404125: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   671.404974: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   671.501180: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   671.995414: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   671.995459: cpu_frequency: state=2800000 cpu_id=7
     <idle>-0     [007] d...   671.995469: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   671.996287: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   672.001305: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.078374: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   672.078410: cpu_frequency: state=2600000 cpu_id=7
     <idle>-0     [007] d...   672.078419: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.158020: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   672.158040: cpu_frequency: state=2400000 cpu_id=7
     <idle>-0     [007] d...   672.158044: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.160038: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   672.234557: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.237121: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   672.237174: cpu_frequency: state=2100000 cpu_id=7
     <idle>-0     [007] d...   672.237186: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.237778: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   672.267902: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.269860: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   672.269906: cpu_frequency: state=1900000 cpu_id=7
     <idle>-0     [007] d...   672.269914: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.271902: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...   672.751342: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...   672.823056: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-556   [007] ....   672.823095: cpu_frequency: state=1600000 cpu_id=7

WITH

     <idle>-0     [007] dN..  4380.928009: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399   [007] ....  4380.949767: cpu_frequency: state=2000000 cpu_id=7
kworker/7:2-399   [007] ....  4380.969765: cpu_frequency: state=2200000 cpu_id=7
kworker/7:2-399   [007] ....  4381.009766: cpu_frequency: state=2500000 cpu_id=7
kworker/7:2-399   [007] ....  4381.029767: cpu_frequency: state=2600000 cpu_id=7
kworker/7:2-399   [007] ....  4381.049769: cpu_frequency: state=2800000 cpu_id=7
kworker/7:2-399   [007] ....  4381.069769: cpu_frequency: state=3000000 cpu_id=7
kworker/7:2-399   [007] ....  4381.089771: cpu_frequency: state=3100000 cpu_id=7
kworker/7:2-399   [007] ....  4381.109772: cpu_frequency: state=3400000 cpu_id=7
kworker/7:2-399   [007] ....  4381.129773: cpu_frequency: state=3401000 cpu_id=7
     <idle>-0     [007] d...  4428.226159: cpu_idle: state=1 cpu_id=7
     <idle>-0     [007] d...  4428.226176: cpu_idle: state=4294967295 cpu_id=7
     <idle>-0     [007] d...  4428.226181: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...  4428.227177: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...  4428.551640: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...  4428.649239: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399   [007] ....  4428.649268: cpu_frequency: state=2800000 cpu_id=7
     <idle>-0     [007] d...  4428.649278: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...  4428.689856: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...  4428.799542: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...  4428.801683: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399   [007] ....  4428.801748: cpu_frequency: state=1700000 cpu_id=7
     <idle>-0     [007] d...  4428.801761: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...  4428.806545: cpu_idle: state=4294967295 cpu_id=7
...
     <idle>-0     [007] d...  4429.051880: cpu_idle: state=4 cpu_id=7
     <idle>-0     [007] d...  4429.086240: cpu_idle: state=4294967295 cpu_id=7
kworker/7:2-399   [007] ....  4429.086293: cpu_frequency: state=1600000 cpu_id=7

Without the patch the CPU dropped to min frequency after 3.2s
With the patch applied the CPU dropped to min frequency after 0.86s

Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-16 23:15:56 +01:00
Viresh Kumar 26f0dbc9ab cpufreq: governor: Don't use 'timer' keyword
The earlier implementation of governors used background timers and so
functions, mutex, etc had 'timer' keyword in their names.

But that's not true anymore. Replace 'timer' with 'update', as those
functions, variables are based around updates to frequency.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-11 01:48:33 +01:00
Rafael J. Wysocki 58919e83c8 cpufreq / sched: Pass flags to cpufreq_update_util()
It is useful to know the reason why cpufreq_update_util() has just
been called and that can be passed as flags to cpufreq_update_util()
and to the ->func() callback in struct update_util_data.  However,
doing that in addition to passing the util and max arguments they
already take would be clumsy, so avoid it.

Instead, use the observation that the schedutil governor is part
of the scheduler proper, so it can access scheduler data directly.
This allows the util and max arguments of cpufreq_update_util()
and the ->func() callback in struct update_util_data to be replaced
with a flags one, but schedutil has to be modified to follow.

Thus make the schedutil governor obtain the CFS utilization
information from the scheduler and use the "RT" and "DL" flags
instead of the special utilization value of ULONG_MAX to track
updates from the RT and DL sched classes.  Make it non-modular
too to avoid having to export scheduler variables to modules at
large.

Next, update all of the other users of cpufreq_update_util()
and the ->func() callback in struct update_util_data accordingly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16 22:14:55 +02:00
Rafael J. Wysocki f6709b8aa7 cpufreq: governor: Drop gov_cancel_work()
There's no reason for gov_cancel_work() to exist at all, as it only
has one caller and the only thing done by that caller is to invoke
gov_cancel_work().

Accordingly, drop gov_cancel_work() and move its contents to the
caller.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-06-09 23:39:29 +02:00
Rafael J. Wysocki 9a15fb2c79 cpufreq: Drop the 'initialized' field from struct cpufreq_governor
The 'initialized' field in struct cpufreq_governor is only used by
the conservative governor (as a usage counter) and the way that
happens is far from straightforward and arguably incorrect.

Namely, the value of 'initialized' is checked by
cpufreq_dbs_governor_init() and cpufreq_dbs_governor_exit() and
the results of those checks are passed (as the second argument) to
the ->init() and ->exit() callbacks in struct dbs_governor.  Those
callbacks are only implemented by the ondemand and conservative
governors and ondemand doesn't use their second argument at all.
In turn, the conservative governor uses it to decide whether or not
to either register or unregister a transition notifier.

That whole mechanism is not only unnecessarily convoluted, but also
racy, because the 'initialized' field of struct cpufreq_governor is
updated in cpufreq_init_governor() and cpufreq_exit_governor() under
policy->rwsem which doesn't help if one of these functions is run
twice in parallel for different policies (which isn't impossible in
principle), for example.

Instead of it, add a proper usage counter to the conservative
governor and update it from cs_init() and cs_exit() which is
guaranteed to be non-racy, as those functions are only called
under gov_dbs_data_mutex which is global.

With that in place, drop the 'initialized' field from struct
cpufreq_governor as it is not used any more.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-06-02 23:24:39 +02:00
Viresh Kumar bf2be2de84 cpufreq: governor: Create cpufreq_policy_apply_limits()
Create a new helper to avoid code duplication across governors.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-06-02 23:24:39 +02:00
Viresh Kumar 666f4ccc5d cpufreq: governor: Remove unnecessary bits from print message
pr_*() helpers already prefix the print messages with
"cpufreq_governor:" and similar details aren't required in the actual
message.

For example, the print message getting fixed looks like this before this
patch:

cpufreq_governor: cpufreq: Governor initialization failed (dbs_data kobject init error 0)

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-06-02 23:24:38 +02:00
Rafael J. Wysocki e788892ba3 cpufreq: governor: Get rid of governor events
The design of the cpufreq governor API is not very straightforward,
as struct cpufreq_governor provides only one callback to be invoked
from different code paths for different purposes.  The purpose it is
invoked for is determined by its second "event" argument, causing it
to act as a "callback multiplexer" of sorts.

Unfortunately, that leads to extra complexity in governors, some of
which implement the ->governor() callback as a switch statement
that simply checks the event argument and invokes a separate function
to handle that specific event.

That extra complexity can be eliminated by replacing the all-purpose
->governor() callback with a family of callbacks to carry out specific
governor operations: initialization and exit, start and stop and policy
limits updates.  That also turns out to reduce the code size too, so
do it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-06-02 23:24:15 +02:00
Rafael J. Wysocki 9485e4ca0b cpufreq: governor: Fix handling of special cases in dbs_update()
As reported in KBZ 69821:

"With CONFIG_HZ_PERIODIC=y cpu stays at the lowest frequcency 800MHz
 even if usage goes to 100%, frequency does not scale up, the governor
 in use is ondemand. Neither works conservative. Performance and
 userspace governors work as expected.

 With CONFIG_NO_HZ_IDLE or CONFIG_NO_HZ_FULL cpu scales up with ondemand
 as expected."

Analysis carried out by Chen Yu leads to the conclusion that the
observed issue is due to idle_time in dbs_update() representing a
negative number in which case the function will return 0 as the load
(unless load is greater than 0 for another CPU sharing the policy),
although that need not be the right choice.

Indeed, idle_time representing a negative number means that during
the last sampling interval the CPU was almost 100% busy on the rough
average, so 100 should be returned as the load in that case.

Modify the code accordingly and rearrange it to clarify the handling
of all of the special cases in it.  While at it, also avoid returning
zero as the load if time_elapsed is 0 (it doesn't really make sense
to return 0 then).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Timo Valtoaho <timo.valtoaho@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-05-06 14:24:23 +02:00
Rafael J. Wysocki b4f4b4b371 cpufreq: governor: Change confusing struct field and variable names
The name of the prev_cpu_wall field in struct cpu_dbs_info is
confusing, because it doesn't represent wall time, but the previous
update time as returned by get_cpu_idle_time() (that may be the
current value of jiffies_64 in some cases, for example).

Moreover, the names of some related variables in dbs_update() take
that confusion further.

Rename all of those things to make their names reflect the purpose
more accurately.  While at it, drop unnecessary parens from one of
the updated expressions.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Chen Yu <yu.c.chen@intel.com>
2016-04-28 15:10:08 +02:00
Rafael J. Wysocki ba1ca654f3 cpufreq: governor: Fix prev_load initialization in cpufreq_governor_start()
The way cpufreq_governor_start() initializes j_cdbs->prev_load is
questionable.

First off, j_cdbs->prev_cpu_wall used as a denominator in the
computation may be zero.  The case this happens is when
get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy()
used to return that number is called exactly at the jiffies_64
wrap time.  It is rather hard to trigger that error, but it is not
impossible and it will just crash the kernel then.

Second, j_cdbs->prev_load is computed as the average load during
the entire time since the system started and it may not reflect the
load in the previous sampling period (as it is expected to).
That doesn't play well with the way dbs_update() uses that value.
Namely, if the update time delta (wall_time) happens do be greater
than twice the sampling rate on the first invocation of it, the
initial value of j_cdbs->prev_load (which may be completely off) will
be returned to the caller as the current load (unless it is equal to
zero and unless another CPU sharing the same policy object has a
greater load value).

For this reason, notice that the prev_load field of struct cpu_dbs_info
is only used by dbs_update() and only in that one place, so if
cpufreq_governor_start() is modified to always initialize it to 0,
it will make dbs_update() always compute the actual load first time
it checks the update time delta against the doubled sampling rate
(after initialization) and there won't be any side effects of it.

Consequently, modify cpufreq_governor_start() as described.

Fixes: 18b46abd00 (cpufreq: governor: Be friendly towards latency-sensitive bursty workloads)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-04-25 16:21:34 +02:00
Rafael J. Wysocki 1cbc99dfe5 Merge back cpufreq changes for v4.7. 2016-04-25 15:44:01 +02:00
Rafael J. Wysocki 94862a62df Revert "cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC"
Revert commit 0df35026c6 (cpufreq: governor: Fix negative idle_time
when configured with CONFIG_HZ_PERIODIC) that introduced a regression
by causing the ondemand cpufreq governor to misbehave for
CONFIG_TICK_CPU_ACCOUNTING unset (the frequency goes up to the max at
one point and stays there indefinitely).

The revert takes subsequent modifications of the code in question into
account.

Fixes: 0df35026c6 (cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=115261
Reported-and-tested-by: Timo Valtoaho <timo.valtoaho@gmail.com>
Cc: 4.5+ <stable@vger.kernel.org> # 4.5+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-04-25 02:39:20 +02:00
Rafael J. Wysocki 2d0c58ad60 cpufreq: governor: Move abstract gov_attr_set code to seperate file
Move abstract code related to struct gov_attr_set to a separate (new)
file so it can be shared with (future) goverernors that won't share
more code with "ondemand" and "conservative".

No intentional functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-04-02 01:09:01 +02:00
Rafael J. Wysocki 0dd3c1d678 cpufreq: governor: New data type for management part of dbs_data
In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type.  As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data to follow.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-04-02 01:09:00 +02:00
Rafael J. Wysocki 0bed612be6 cpufreq: sched: Helpers to add and remove update_util hooks
Replace the single helper for adding and removing cpufreq utilization
update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
and modify the users of cpufreq_set_update_util_data() accordingly.

With the new helpers, the code using them doesn't need to worry
about the internals of struct update_util_data and in particular
it doesn't need to worry about populating the func field in it
properly upfront.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2016-04-02 01:08:43 +02:00
Rafael J. Wysocki 539a4c4247 cpufreq: governor: Always schedule work on the CPU running update
Modify dbs_irq_work() to always schedule the process-context work
on the current CPU which also ran the dbs_update_util_handler()
that the irq_work being handled came from.

This causes the entire frequency update handling (involving the
"ondemand" or "conservative" governors) to be carried out by the
CPU whose frequency is to be updated and reduces the overall amount
of inter-CPU noise related to cpufreq.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-22 23:13:36 +01:00
Rafael J. Wysocki adaf9fcd13 cpufreq: Move scheduler-related code to the sched directory
Create cpufreq.c under kernel/sched/ and move the cpufreq code
related to the scheduler to that file and to sched.h.

Redefine cpufreq_update_util() as a static inline function to avoid
function calls at its call sites in the scheduler code (as suggested
by Peter Zijlstra).

Also move the definition of struct update_util_data and declaration
of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
include/linux/sched.h.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2016-03-10 20:44:47 +01:00
Rafael J. Wysocki 08f511fd41 cpufreq: Reduce cpufreq_update_util() overhead a bit
Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete.  Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.

In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.

Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks less
heavy-weight somewhat.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2016-03-09 15:07:58 +01:00
Viresh Kumar f737236b12 cpufreq: governor: Drop unnecessary checks from show() and store()
The show() and store() routines in the cpufreq-governor core don't need
to check if the struct governor_attr they want to use really provides
the callbacks they need as expected (if that's not the case, it means a
bug in the code anyway), so change them to avoid doing that.

Also change the error value to -EBUSY, if the governor is getting
removed and we aren't allowed to store any more changes.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 14:41:11 +01:00
Rafael J. Wysocki 27de348239 cpufreq: governor: Fix race in dbs_update_util_handler()
There is a scenario that may lead to undesired results in
dbs_update_util_handler().  Namely, if two CPUs sharing a policy
enter the funtion at the same time, pass the sample delay check
and then one of them is stalled until dbs_work_handler() (queued
up by the other CPU) clears the work counter, it may update the
work counter and queue up another work item prematurely.

To prevent that from happening, use the observation that the CPU
queuing up a work item in dbs_update_util_handler() updates the
last sample time.  This means that if another CPU was stalling after
passing the sample delay check and now successfully updated the work
counter as a result of the race described above, it will see the new
value of the last sample time which is different from what it used in
the sample delay check before.  If that happens, the sample delay
check passed previously is not valid any more, so the CPU should not
continue.

Fixes: f17cbb53783c (cpufreq: governor: Avoid atomic operations in hot paths)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:10 +01:00
Rafael J. Wysocki 94ab5e030f cpufreq: governor: Make gov_set_update_util() static
The gov_set_update_util() routine is only used internally by the
common governor code and it doesn't need to be exported, so make
it static.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:10 +01:00
Rafael J. Wysocki 1112e9d83e cpufreq: governor: Narrow down the dbs_data_mutex coverage
Since cpufreq_governor_dbs() is now always called with policy->rwsem
held, it cannot be executed twice in parallel for the same policy.
Thus it is not necessary to hold dbs_data_mutex around the invocations
of cpufreq_governor_start/stop/limits() from it as those functions
never modify any data that can be shared between different policies.

However, cpufreq_governor_dbs() may be executed twice in parallal
for different policies using the same gov->gdbs_data object and
dbs_data_mutex is still necessary to protect that object against
concurrent updates.

For this reason, narrow down the dbs_data_mutex locking to
cpufreq_governor_init/exit() where it is needed and rename the
mutex to gov_dbs_data_mutex to reflect its purpose.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:10 +01:00
Rafael J. Wysocki e3f5ed9393 cpufreq: governor: Make dbs_data_mutex static
That mutex is only used by cpufreq_governor_dbs() and it doesn't
need to be exported to modules, so make it static and drop the
export incantation.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:09 +01:00
Rafael J. Wysocki 8c8f77fd07 cpufreq: governor: Move per-CPU data to the common code
After previous changes there is only one piece of code in the
ondemand governor making references to per-CPU data structures,
but it can be easily modified to avoid doing that, so modify it
accordingly and move the definition of per-CPU data used by the
ondemand and conservative governors to the common code.  Next,
change that code to access the per-CPU data structures directly
rather than via a governor callback.

This causes the ->get_cpu_cdbs governor callback to become
unnecessary, so drop it along with the macro and function
definitions related to it.

Finally, drop the definitions of struct od_cpu_dbs_info_s and
struct cs_cpu_dbs_info_s that aren't necessary any more.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:09 +01:00
Rafael J. Wysocki 7d5a9956af cpufreq: governor: Make governor private data per-policy
Some fields in struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s
are only used for a limited set of CPUs.  Namely, if a policy is
shared between multiple CPUs, those fields will only be used for one
of them (policy->cpu).  This means that they really are per-policy
rather than per-CPU and holding room for them in per-CPU data
structures is generally wasteful.  Also moving those fields into
per-policy data structures will allow some significant simplifications
to be made going forward.

For this reason, introduce struct cs_policy_dbs_info and
struct od_policy_dbs_info to hold those fields.  Define each of the
new structures as an extension of struct policy_dbs_info (such that
struct policy_dbs_info is embedded in each of them) and introduce
new ->alloc and ->free governor callbacks to allocate and free
those structures, respectively, such that ->alloc() will return
a pointer to the struct policy_dbs_info embedded in the allocated
data structure and ->free() will take that pointer as its argument.

With that, modify the code accessing the data fields in question
in per-CPU data objects to look for them in the new structures
via the struct policy_dbs_info pointer available to it and drop
them from struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:08 +01:00
Rafael J. Wysocki a33cce1c6c cpufreq: governor: Fix CPU load information updates via ->store
The ->store() callbacks of some tunable sysfs attributes of the
ondemand and conservative governors trigger immediate updates of
the CPU load information for all CPUs "governed" by the given
dbs_data by walking the cpu_dbs_info structures for all online
CPUs in the system and updating them.

This is questionable for two reasons.  First, it may lead to a lot of
extra overhead on a system with many CPUs if the given dbs_data is
only associated with a few of them.  Second, if governor tunables are
per-policy, the CPUs associated with the other sets of governor
tunables should not be updated.

To address this issue, use the observation that in all of the places
in question the update operation may be carried out in the same way
(because all of the tunables involved are now located in struct
dbs_data and readily available to the common code) and make the
code in those places invoke the same (new) helper function that
will carry out the update correctly.

That new function always checks the ignore_nice_load tunable value
and updates the CPUs' prev_cpu_nice data fields if that's set, which
wasn't done by the original code in store_io_is_busy(), but it
should have been done in there too.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:08 +01:00
Rafael J. Wysocki 702c9e542a cpufreq: governor: Add a ->start callback for governors
To avoid having to check the governor type explicitly in the common
code in order to initialize data structures specific to the governor
type properly, add a ->start callback to struct dbs_governor and
use it to initialize those data structures for the ondemand and
conservative governors.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:07 +01:00
Rafael J. Wysocki 8847e038c1 cpufreq: governor: Move io_is_busy to struct dbs_data
The io_is_busy governor tunable is only used by the ondemand governor
and is located in the ondemand-specific data structure, but it is
looked at by the common governor code that has to do ugly things to
get to that value, so move it to struct dbs_data and modify ondemand
accordingly.

Since the conservative governor never touches that field, it will
be always 0 for that governor and it won't have any effect on the
results of computations in that case.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:06 +01:00
Rafael J. Wysocki 574ef14d5d cpufreq: governor: Close dbs_data update race condition
It is possible for a dbs_data object to be updated after its
usage counter has become 0.  That may happen if governor_store()
runs (via a govenor tunable sysfs attribute write) in parallel
with cpufreq_governor_exit() called for the last cpufreq policy
associated with the dbs_data in question.  In that case, if
governor_store() acquires dbs_data->mutex right after
cpufreq_governor_exit() has released it, the ->store() callback
invoked by it may operate on dbs_data with no users.  Although
sysfs will cause the kobject_put() in cpufreq_governor_exit() to
block until governor_store() has returned, that situation may
lead to some unexpected results, depending on the implementation
of the ->store callback, and therefore it should be avoided.

To that end, modify governor_store() to check the dbs_data's
usage count before invoking the ->store() callback and return
an error if it is 0 at that point.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:06 +01:00
Rafael J. Wysocki 07aa4402a0 cpufreq: governor: Use microseconds in sample delay computations
Do not convert microseconds to jiffies and the other way around
in governor computations related to the sampling rate and sample
delay and drop delay_for_sampling_rate() which isn't of any use
then.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:05 +01:00
Rafael J. Wysocki 57dc3bcd45 cpufreq: governor: Move rate_mult to struct policy_dbs
The rate_mult field in struct od_cpu_dbs_info_s is used by the code
shared with the conservative governor and to access it that code
has to do an ugly governor type check.  However, first of all it
is ever only used for policy->cpu, so it is per-policy rather than
per-CPU and second, it is initialized to 1 by cpufreq_governor_start(),
so if the conservative governor never modifies it, it will have no
effect on the results of any computations.

For these reasons, move rate_mult to struct policy_dbs_info (as a
common field).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:04 +01:00
Rafael J. Wysocki 78347cdb89 cpufreq: governor: Reset sample delay in store_sampling_rate()
If store_sampling_rate() updates the sample delay when the ondemand
governor is in the middle of its high/low dance (OD_SUB_SAMPLE sample
type is set), the governor will still do the bottom half of the
previous sample which may take too much time.

To prevent that from happening, change store_sampling_rate() to always
reset the sample delay to 0 which also is consistent with the new
behavior of cpufreq_governor_limits().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:04 +01:00
Rafael J. Wysocki 4cccf75557 cpufreq: governor: Get rid of the ->gov_check_cpu callback
The way the ->gov_check_cpu governor callback is used by the ondemand
and conservative governors is not really straightforward.  Namely, the
governor calls dbs_check_cpu() that updates the load information for
the policy and the invokes ->gov_check_cpu() for the governor.

To get rid of that entanglement, notice that cpufreq_governor_limits()
doesn't need to call dbs_check_cpu() directly.  Instead, it can simply
reset the sample delay to 0 which will cause a sample to be taken
immediately.  The result of that is practically equivalent to calling
dbs_check_cpu() except that it will trigger a full update of governor
internal state and not just the ->gov_check_cpu() part.

Following that observation, make cpufreq_governor_limits() reset
the sample delay and turn dbs_check_cpu() into a function that will
simply evaluate the load and return the result called dbs_update().

That function can now be called by governors from the routines that
previously were pointed to by ->gov_check_cpu and those routines
can be called directly by each governor instead of dbs_check_cpu().
This way ->gov_check_cpu becomes unnecessary, so drop it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:04 +01:00
Rafael J. Wysocki 57eb832f90 cpufreq: governor: Clean up load-related computations
Clean up some load-related computations in dbs_check_cpu() and
cpufreq_governor_start() to get rid of unnecessary operations and
type casts and make the code easier to read.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:03 +01:00
Rafael J. Wysocki 679b8fe43a cpufreq: governor: Fix nice contribution computation in dbs_check_cpu()
The contribution of the CPU nice time to the idle time in dbs_check_cpu()
is computed in a bogus way, as the code may subtract current and previous
nice values for different CPUs.

That doesn't matter for cases when cpufreq policies are not shared,
but may lead to problems otherwise.

Fix the computation and simplify it to avoid taking unnecessary steps.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:03 +01:00
Rafael J. Wysocki e4db2813d2 cpufreq: governor: Avoid atomic operations in hot paths
Rework the handling of work items by dbs_update_util_handler() and
dbs_work_handler() so the former (which is executed in scheduler
paths) only uses atomic operations when absolutely necessary.  That
is, when the policy is shared and dbs_update_util_handler() has
already decided that this is the time to queue up a work item.

In particular, this avoids the atomic ops entirely on platforms where
policy objects are never shared.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:03 +01:00
Rafael J. Wysocki f62b93740c cpufreq: governor: Simplify gov_cancel_work() slightly
The atomic work counter incrementation in gov_cancel_work() is not
necessary any more, because work items won't be queued up after
gov_clear_update_util() anyway, so drop it along with the comment
about how it may be missed by the gov_clear_update_util().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-03-09 14:41:02 +01:00
Rafael J. Wysocki b9db42730a cpufreq: governor: Avoid irq_work_queue_on() crash on non-SMP ARM
As it turns out, irq_work_queue_on() will crash if invoked on
non-SMP ARM platforms, but in fact it is not necessary to use that
function in the cpufreq governor code (as it doesn't matter to that
code which CPU will handle the irq_work), so change it to always use
irq_work_queue().

Fixes: 8fb47ff100af (cpufreq: governor: Replace timers with utilization update callbacks)
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-and-tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 14:41:02 +01:00
Viresh Kumar aded387b94 cpufreq: conservative: Update sample_delay_ns immediately
The ondemand governor already updates sample_delay_ns immediately on
updates to the sampling rate, but conservative doesn't do that.

It was left out earlier as the code was really too complex to get
that done easily.  Things are sorted out very well now, however, and
the conservative governor can be modified to follow ondemand in that
respect.

Moreover, since the code needed to implement that in the
conservative governor would be identical to the corresponding
ondemand governor's code, make that code common and change both
governors to use it.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 14:41:01 +01:00
Viresh Kumar 581c214b21 cpufreq: governor: No need to manage state machine now
The cpufreq core now guarantees that policy->rwsem won't be dropped
while running the ->governor callback for the CPUFREQ_GOV_POLICY_EXIT
event and will be held acquired until the complete sequence of governor
state changes has finished.

This allows governor state machine checks to be dropped from multiple
functions in cpufreq_governor.c.

This also means that policy_dbs->policy can be initialized upfront, so
the entire initialization of struct policy_dbs can be carried out in
one place.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 14:41:01 +01:00
Viresh Kumar c54df07184 cpufreq: governor: Create and traverse list of policy_dbs to avoid deadlock
The dbs_data_mutex lock is currently used in two places.  First,
cpufreq_governor_dbs() uses it to guarantee mutual exclusion between
invocations of governor operations from the core.  Second, it is used by
ondemand governor's update_sampling_rate() to ensure the stability of
data structures walked by it.

The second usage is quite problematic, because update_sampling_rate() is
called from a governor sysfs attribute's ->store callback and that leads
to a deadlock scenario involving cpufreq_governor_exit() which runs
under dbs_data_mutex.  Thus it is better to rework the code so
update_sampling_rate() doesn't need to acquire dbs_data_mutex.

To that end, rework update_sampling_rate() to walk a list of policy_dbs
objects supported by the dbs_data one it has been called for (instead of
walking cpu_dbs_info object for all CPUs).  The list manipulation is
protected with dbs_data->mutex which also is held around the execution
of update_sampling_rate(), it is not necessary to hold dbs_data_mutex in
that function any more.

Reported-by: Juri Lelli <juri.lelli@arm.com>
Reported-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 14:40:59 +01:00
Viresh Kumar c443563036 cpufreq: governor: New sysfs show/store callbacks for governor tunables
The ondemand and conservative governors use the global-attr or freq-attr
structures to represent sysfs attributes corresponding to their tunables
(which of them is actually used depends on whether or not different
policy objects can use the same governor with different tunables at the
same time and, consequently, on where those attributes are located in
sysfs).

Unfortunately, in the freq-attr case, the standard cpufreq show/store
sysfs attribute callbacks are applied to the governor tunable attributes
and they always acquire the policy->rwsem lock before carrying out the
operation.  That may lead to an ABBA deadlock if governor tunable
attributes are removed under policy->rwsem while one of them is being
accessed concurrently (if sysfs attributes removal wins the race, it
will wait for the access to complete with policy->rwsem held while the
attribute callback will block on policy->rwsem indefinitely).

We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time.  Therefore
policy->rwsem cannot be dropped in cpufreq_set_policy() at any point,
but the deadlock situation described above must be avoided too.

To that end, use the observation that in principle governor tunables may
be represented by the same data type regardless of whether the governor
is system-wide or per-policy and introduce a new structure, struct
governor_attr, for representing them and new corresponding macros for
creating show/store sysfs callbacks for them.  Also make their parent
kobject use a new kobject type whose default show/store callbacks are
not related to the standard core cpufreq ones in any way (and they don't
acquire policy->rwsem in particular).

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Subject & changelog + rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 14:40:58 +01:00