After commit 8fa520af50 "intel_pstate: Remove freq calculation from
intel_pstate_calc_busy()" intel_pstate_get() calls get_avg_frequency()
to compute the average frequency, which is problematic for two reasons.
First, intel_pstate_get() may be invoked before the driver reads the
CPU feedback registers for the first time and if that happens,
get_avg_frequency() will attempt to divide by zero.
Second, the get_avg_frequency() call in intel_pstate_get() is racy
with respect to intel_pstate_sample() and it may end up returning
completely meaningless values for this reason.
Moreover, after commit 7349ec0470 "intel_pstate: Move
intel_pstate_calc_busy() into get_target_pstate_use_performance()"
sample.core_pct_busy is never computed on Atom, but it is used in
intel_pstate_adjust_busy_pstate() in that case too.
To address those problems notice that if sample.core_pct_busy
was used in the average frequency computation carried out by
get_avg_frequency(), both the divide by zero problem and the
race with respect to intel_pstate_sample() would be avoided.
Accordingly, move the invocation of intel_pstate_calc_busy() from
get_target_pstate_use_performance() to intel_pstate_update_util(),
which also will take care of the uninitialized sample.core_pct_busy
on Atom, and modify get_avg_frequency() to use sample.core_pct_busy
as per the above.
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Link: http://marc.info/?l=linux-kernel&m=146226437623173&w=4
Fixes: 8fa520af50 "intel_pstate: Remove freq calculation from intel_pstate_calc_busy()"
Fixes: 7349ec0470 "intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()"
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 41cfd64cf4 "Update frequencies of policy->cpus only from
->set_policy()" changed the way the intel_pstate driver's ->set_policy
callback updates the HWP (hardware-managed P-states) settings.
A side effect of it is that if those settings are modified on the
boot CPU during system suspend and wakeup, they will never be
restored during subsequent system resume.
To address this problem, allow cpufreq drivers that don't provide
->target or ->target_index callbacks to use ->suspend and ->resume
callbacks and add a ->resume callback to intel_pstate to restore
the HWP settings on the CPUs that belong to the given policy.
Fixes: 41cfd64cf4 "Update frequencies of policy->cpus only from ->set_policy()"
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
How we switch MMU context differs between hash and radix. For hash we
need to switch the SLB details and for radix we need to switch the PID
SPR.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* pm-cpufreq-fixes:
cpufreq: intel_pstate: Fix processing for turbo activation ratio
Revert "cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC"
The sti-cpufreq does unconditional registration of the cpufreq-dt driver
which causes issue on an multi-platform build. For example, on Vexpress
TC2 platform, we get the following error on boot:
cpu cpu0: OPP-v2 not supported
cpu cpu0: Not doing voltage scaling
cpu: dev_pm_opp_of_cpumask_add_table: couldn't find opp table
for cpu:0, -19
cpu cpu0: dev_pm_opp_get_max_volt_latency: Invalid regulator (-6)
...
arm_big_little: bL_cpufreq_register: Failed registering platform driver:
vexpress-spc, err: -17
The actual driver fails to initialise as cpufreq-dt is probed
successfully, which is incorrect. This issue can happen to any platform
not using cpufreq-dt in a multi-platform build.
This patch adds a check to do selective initialization of the driver.
Fixes: ab0ea257fc (cpufreq: st: Provide runtime initialised driver for ST's platforms)
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Lee Jones <lee.jones@linaro.org>
Cc: 4.5+ <stable@vger.kernel.org> # 4.5+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move cpufreq bits for mvebu into drivers/cpufreq/ directory, that's
where they really belong to.
Compiled tested only.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are no more users of platform-data for cpufreq-dt driver, get rid
of it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Existing platforms, which do not support operating-points-v2, can
explicitly tell the opp core that some of the CPUs share opp tables,
with help of dev_pm_opp_set_sharing_cpus().
For such platforms, explicitly ask the opp core to provide list of CPUs
sharing the opp table with current cpu device, before falling back to
platform data.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The name of the prev_cpu_wall field in struct cpu_dbs_info is
confusing, because it doesn't represent wall time, but the previous
update time as returned by get_cpu_idle_time() (that may be the
current value of jiffies_64 in some cases, for example).
Moreover, the names of some related variables in dbs_update() take
that confusion further.
Rename all of those things to make their names reflect the purpose
more accurately. While at it, drop unnecessary parens from one of
the updated expressions.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Chen Yu <yu.c.chen@intel.com>
For platforms which are controlled via remove node manager, enable _PPC by
default. These platforms are mostly categorized as enterprise server or
performance servers. These platforms needs to go through some
certifications tests, which tests control via _PPC.
The relative risk of enabling by default is low as this is is less likely
that these systems have broken _PSS table.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When policy->max is changed via _PPC or sysfs and is more than the max non
turbo frequency, it does not really change resulting performance in some
processors. When policy->max results in a P-State ratio more than the
turbo activation ratio, then processor can choose any P-State up to max
turbo. So the user or _PPC setting has no value, but this can cause
undesirable side effects like:
- Showing reduced max percentage in Intel P-State sysfs
- It can cause reduced max performance under certain boundary conditions:
The requested max scaling frequency either via _PPC or via cpufreq-sysfs,
will be converted into a fixed floating point max percent scale. In
majority of the cases this will result in correct max. But not 100% of the
time. If the _PPC is requested at a point where the calculation lead to a
lower max, this can result in a lower P-State then expected and it will
impact performance.
Example of this condition using a Broadwell laptop with config TDP.
ACPI _PSS table from a Broadwell laptop
2301000 2300000 2200000 2000000 1900000 1800000 1700000 1500000 1400000
1300000 1100000 1000000 900000 800000 600000 500000
The actual results by disabling config TDP so that we can get what is
requested on or below 2300000Khz.
scaling_max_freq Max Requested P-State Resultant scaling
max
---------------------------------------- ----------------------
2400000 18 2900000 (max
turbo)
2300000 17 2300000 (max
physical non turbo)
2200000 15 2100000
2100000 15 2100000
2000000 13 1900000
1900000 13 1900000
1800000 12 1800000
1700000 11 1700000
1600000 10 1600000
1500000 f 1500000
1400000 e 1400000
1300000 d 1300000
1200000 c 1200000
1100000 a 1000000
1000000 a 1000000
900000 9 900000
800000 8 800000
700000 7 700000
600000 6 600000
500000 5 500000
------------------------------------------------------------------
Now set the config TDP level 1 ratio as 0x0b (equivalent to 1100000KHz)
in BIOS (not every system will let you adjust this).
The turbo activation ratio will be set to one less than that, which will
be 0x0a (So any request above 1000000KHz should result in turbo region
assuming no thermal limits).
Here _PPC will request max to 1100000KHz (which basically should still
result in turbo as this is more than the turbo activation ratio up to
max allowable turbo frequency), but actual calculation resulted in a max
ceiling P-State which is 0x0a. So under any load condition, this driver
will not request turbo P-States. This will be a huge performance hit.
When config TDP feature is ON, if the _PPC points to a frequency above
turbo activation ratio, the performance can still reach max turbo. In this
case we don't need to treat this as the reduced frequency in set_policy
callback.
In this change when config TDP is active (by checking if the physical max
non turbo ratio is more than the current max non turbo ratio), any request
above current max non turbo is treated as full performance.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use ACPI _PPC notification to limit max P state driver will request.
ACPI _PPC change notification is sent by BIOS to limit max P state
in several cases:
- Reduce impact of platform thermal condition
- When Config TDP feature is used, a changed _PPC is sent to
follow TDP change
- Remote node managers in server want to control platform power
via baseboard management controller (BMC)
This change registers with ACPI processor performance lib so that
_PPC changes are notified to cpufreq core, which in turns will
result in call to .setpolicy() callback. Also the way _PSS
table identifies a turbo frequency is not compatible to max turbo
frequency in intel_pstate, so the very first entry in _PSS needs
to be adjusted.
This feature can be turned on by using kernel parameters:
intel_pstate=support_acpi_ppc
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Minor cleanups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The frequency transition latency from pmin to pmax is observed to be in
few millisecond granurality. And it usually happens to take a performance
penalty during sudden frequency rampup requests.
This patch set solves this problem by using an entity called "global
pstates". The global pstate is a Chip-level entity, so the global entitiy
(Voltage) is managed across the cores. The local pstate is a Core-level
entity, so the local entity (frequency) is managed across threads.
This patch brings down global pstate at a slower rate than the local
pstate. Hence by holding global pstates higher than local pstate makes
the subsequent rampups faster.
A per policy structure is maintained to keep track of the global and
local pstate changes. The global pstate is brought down using a parabolic
equation. The ramp down time to pmin is set to ~5 seconds. To make sure
that the global pstates are dropped at regular interval , a timer is
queued for every 2 seconds during ramp-down phase, which eventually brings
the pstate down to local pstate.
Iozone results show fairly consistent performance boost.
YCSB on redis shows improved Max latencies in most cases.
Iozone write/rewite test were made with filesizes 200704Kb and 401408Kb
with different record sizes . The following table shows IOoperations/sec
with and without patch.
Iozone Results ( in op/sec) ( mean over 3 iterations )
---------------------------------------------------------------------
file size- with without %
recordsize-IOtype patch patch change
----------------------------------------------------------------------
200704-1-SeqWrite 1616532 1615425 0.06
200704-1-Rewrite 2423195 2303130 5.21
200704-2-SeqWrite 1628577 1602620 1.61
200704-2-Rewrite 2428264 2312154 5.02
200704-4-SeqWrite 1617605 1617182 0.02
200704-4-Rewrite 2430524 2351238 3.37
200704-8-SeqWrite 1629478 1600436 1.81
200704-8-Rewrite 2415308 2298136 5.09
200704-16-SeqWrite 1619632 1618250 0.08
200704-16-Rewrite 2396650 2352591 1.87
200704-32-SeqWrite 1632544 1598083 2.15
200704-32-Rewrite 2425119 2329743 4.09
200704-64-SeqWrite 1617812 1617235 0.03
200704-64-Rewrite 2402021 2321080 3.48
200704-128-SeqWrite 1631998 1600256 1.98
200704-128-Rewrite 2422389 2304954 5.09
200704-256 SeqWrite 1617065 1616962 0.00
200704-256-Rewrite 2432539 2301980 5.67
200704-512-SeqWrite 1632599 1598656 2.12
200704-512-Rewrite 2429270 2323676 4.54
200704-1024-SeqWrite 1618758 1616156 0.16
200704-1024-Rewrite 2431631 2315889 4.99
401408-1-SeqWrite 1631479 1608132 1.45
401408-1-Rewrite 2501550 2459409 1.71
401408-2-SeqWrite 1617095 1626069 -0.55
401408-2-Rewrite 2507557 2443621 2.61
401408-4-SeqWrite 1629601 1611869 1.10
401408-4-Rewrite 2505909 2462098 1.77
401408-8-SeqWrite 1617110 1626968 -0.60
401408-8-Rewrite 2512244 2456827 2.25
401408-16-SeqWrite 1632609 1609603 1.42
401408-16-Rewrite 2500792 2451405 2.01
401408-32-SeqWrite 1619294 1628167 -0.54
401408-32-Rewrite 2510115 2451292 2.39
401408-64-SeqWrite 1632709 1603746 1.80
401408-64-Rewrite 2506692 2433186 3.02
401408-128-SeqWrite 1619284 1627461 -0.50
401408-128-Rewrite 2518698 2453361 2.66
401408-256-SeqWrite 1634022 1610681 1.44
401408-256-Rewrite 2509987 2446328 2.60
401408-512-SeqWrite 1617524 1628016 -0.64
401408-512-Rewrite 2504409 2442899 2.51
401408-1024-SeqWrite 1629812 1611566 1.13
401408-1024-Rewrite 2507620 2442968 2.64
Tested with YCSB workload (50% update + 50% read) over redis for 1 million
records and 1 million operation. Each test was carried out with target
operations per second and persistence disabled.
Max-latency (in us)( mean over 5 iterations )
---------------------------------------------------------------
op/s Operation with patch without patch %change
---------------------------------------------------------------
15000 Read 61480.6 50261.4 22.32
15000 cleanup 215.2 293.6 -26.70
15000 update 25666.2 25163.8 2.00
25000 Read 32626.2 89525.4 -63.56
25000 cleanup 292.2 263.0 11.10
25000 update 32293.4 90255.0 -64.22
35000 Read 34783.0 33119.0 5.02
35000 cleanup 321.2 395.8 -18.8
35000 update 36047.0 38747.8 -6.97
40000 Read 38562.2 42357.4 -8.96
40000 cleanup 371.8 384.6 -3.33
40000 update 27861.4 41547.8 -32.94
45000 Read 42271.0 88120.6 -52.03
45000 cleanup 263.6 383.0 -31.17
45000 update 29755.8 81359.0 -63.43
(test without target op/s)
47659 Read 83061.4 136440.6 -39.12
47659 cleanup 195.8 193.8 1.03
47659 update 73429.4 124971.8 -41.24
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
commit 1b0289848d ("cpufreq: powernv: Add sysfs attributes to show
throttle stats") used policy->driver_data as a flag for one-time creation
of throttle sysfs files. Instead of this use 'kernfs_find_and_get()' to
check if the attribute already exists. This is required as
policy->driver_data is used for other purposes in the later patch.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The IS_ENABLED() macro checks if a Kconfig symbol has been enabled either
built-in or as a module, use that macro instead of open coding the same.
Signed-off-by: Javier Martinez Canillas <javier@osg.samsung.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When the config TDP level is not nominal (level = 0), the MSR values for
reading level 1 and level 2 ratios contain power in low 14 bits and actual
ratio bits are at bits [23:16]. The current processing for level 1 and
level 2 is wrong as there is no shift done to get actual ratio.
Fixes: 6a35fc2d6c (cpufreq: intel_pstate: get P1 from TAR when available)
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 4.4+ <stable@vger.kernel.org> # 4.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The way cpufreq_governor_start() initializes j_cdbs->prev_load is
questionable.
First off, j_cdbs->prev_cpu_wall used as a denominator in the
computation may be zero. The case this happens is when
get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy()
used to return that number is called exactly at the jiffies_64
wrap time. It is rather hard to trigger that error, but it is not
impossible and it will just crash the kernel then.
Second, j_cdbs->prev_load is computed as the average load during
the entire time since the system started and it may not reflect the
load in the previous sampling period (as it is expected to).
That doesn't play well with the way dbs_update() uses that value.
Namely, if the update time delta (wall_time) happens do be greater
than twice the sampling rate on the first invocation of it, the
initial value of j_cdbs->prev_load (which may be completely off) will
be returned to the caller as the current load (unless it is equal to
zero and unless another CPU sharing the same policy object has a
greater load value).
For this reason, notice that the prev_load field of struct cpu_dbs_info
is only used by dbs_update() and only in that one place, so if
cpufreq_governor_start() is modified to always initialize it to 0,
it will make dbs_update() always compute the actual load first time
it checks the update time delta against the doubled sampling rate
(after initialization) and there won't be any side effects of it.
Consequently, modify cpufreq_governor_start() as described.
Fixes: 18b46abd00 (cpufreq: governor: Be friendly towards latency-sensitive bursty workloads)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch add rockchip's compatible string to the compat list and
remove similar code from platform code for supporting generic platdev
driver.
Signed-off-by: Finley Xiao <finley.xiao@rock-chips.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Note that the complete routine imx27_dt_init() is removed as
of_platform_populate(NULL, of_default_bus_match_table, NULL, NULL);
has same effect as a NULL .init_machine machine callback pointer.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Antoine Tenart <antoine.tenart@free-electrons.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The machines array in cpufreq-dt-platdev is used only once at boot time
and so should be marked with __initconst, so that kernel can free up
memory used for it, if required.
Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cooling device is registered by ready callback. It's also invoked while
system resuming from sleep (Enabling non-boot cpus). Thus cooling device
may be multiple registered. Matchable unregistration is added to exit
callback to fix this issue.
Signed-off-by: Jia Hongtao <hongtao.jia@nxp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
.exit callback (qoriq_cpufreq_cpu_exit()) is also used during suspend.
So __exit macro should be removed or the function will be discarded.
Signed-off-by: Jia Hongtao <hongtao.jia@nxp.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When THERMAL_OF is undefined the cooling device messages should not be
shown. -ENOSYS is returned from of_cpufreq_cooling_register() when
THERMAL_OF is undefined.
Signed-off-by: Jia Hongtao <hongtao.jia@nxp.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a function to cleanup at module exit and export
appropriate GPL string to enable moduler support
for the cppc_cpufreq driver.
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@intel.com>
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The result returned by pid_calc() is subtracted from current_pstate
(which is the P-State requested during the last period) in order to
obtain the target P-State for the current iteration.
However, current_pstate may not reflect the real current P-State of
the CPU. In particular, that P-State may be higher because of the
frequency sharing per module.
The theory is:
- The load is the percentage of time spent in C0 and is related to
the average P-State during the same period.
- The last requested P-State can be completely different than the
average P-State (because of frequency sharing or throttling).
- The P-State shift computed by the pid_calc is based on the load
computed at average P-State, so the shift must be relative to
this average P-State.
Using the average P-State instead of current P-State improves power
without significant performance penalty in cases when a task migrates
from one core to other core sharing frequency and voltage.
Performance and power comparison with this patch on Cherry Trail
platform using Android:
Benchmark ?Perf ?Power
FishTank 10.45% 3.1%
SmartBench-Gaming -0.1% -10.4%
SmartBench-Productivity -0.8% -10.4%
CandyCrush n/a -17.4%
AngryBirds n/a -5.9%
videoPlayback n/a -13.9%
audioPlayback n/a -4.9%
IcyRocks-20-50 0.0% -38.4%
iozone RR -0.16% -1.3%
iozone RW 0.74% -1.3%
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit 0df35026c6 (cpufreq: governor: Fix negative idle_time
when configured with CONFIG_HZ_PERIODIC) that introduced a regression
by causing the ondemand cpufreq governor to misbehave for
CONFIG_TICK_CPU_ACCOUNTING unset (the frequency goes up to the max at
one point and stays there indefinitely).
The revert takes subsequent modifications of the code in question into
account.
Fixes: 0df35026c6 (cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=115261
Reported-and-tested-by: Timo Valtoaho <timo.valtoaho@gmail.com>
Cc: 4.5+ <stable@vger.kernel.org> # 4.5+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
* pm-cpufreq-fixes:
cpufreq: Abort cpufreq_update_current_freq() for cpufreq_suspended set
intel_pstate: Avoid getting stuck in high P-states when idle
Since governor operations are generally skipped if cpufreq_suspended
is set, cpufreq_start_governor() should do nothing in that case.
That function is called in the cpufreq_online() path, and may also
be called from cpufreq_offline() in some cases, which are invoked
by the nonboot CPUs disabing/enabling code during system suspend
to RAM and resume. That happens when all devices have been
suspended, so if the cpufreq driver relies on things like I2C to
get the current frequency, it may not be ready to do that then.
To prevent problems from happening for this reason, make
cpufreq_update_current_freq(), which is the only function invoked
by cpufreq_start_governor() that doesn't check cpufreq_suspended
already, return 0 upfront if cpufreq_suspended is set.
Fixes: 3bbf8fe3ae (cpufreq: Always update current frequency before startig governor)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The Kconfig for this driver is currently:
config CPU_FREQ_CBE_PMI
bool "CBE frequency scaling using PMI interface"
...meaning that it currently is not being built as a module by
anyone. Lets remove the modular and unused code here, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.
We also delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Christian Krafft <krafft@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-pm@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Jörg Otte reports that commit a4675fbc4a (cpufreq: intel_pstate:
Replace timers with utilization update callbacks) caused the CPUs in
his Haswell-based system to stay in the very high frequency region
even if the system is completely idle.
That turns out to be an existing problem in the intel_pstate driver's
P-state selection algorithm for Core processors. Namely, all
decisions made by that algorithm are based on the average frequency
of the CPU between sampling events and on the P-state requested on
the last invocation, so it may get stuck at a very hight frequency
even if the utilization of the CPU is very low (in fact, it may get
stuck in a inadequate P-state regardless of the CPU utilization).
The only way to kick it out of that limbo is a sufficiently long idle
period (3 times longer than the prescribed sampling interval), but if
that doesn't happen often enough (eg. due to a timing change like
after the above commit), the P-state of the CPU may be inadequate
pretty much all the time.
To address the most egregious manifestations of that issue, reset the
core_busy value used to determine the next P-state to request if the
utilization of the CPU, determined with the help of the MPERF
feedback register and the TSC, is below 1%.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=115771
Reported-and-tested-by: Jörg Otte <jrg.otte@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The freq-table is stored in struct cpufreq_policy also and there is
absolutely no need of keeping a copy of its reference in struct
acpi_cpufreq_data. Drop it.
Also policy->freq_table can't be NULL in the target() callback, remove
the useless check as well.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Its always set by ->init() and so it will always be there in ->exit().
There is no need to have a special check for just that.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reorganize the code in cpufreq_add_dev() to avoid using the ret
variable and reduce the indentation level in it.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Merge two switch entries that do the same thing in
cpufreq_cpu_callback().
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Use the more common kernel style adding a define for pr_fmt.
Miscellanea:
o Remove now unused PFX defines
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use the more common logging style.
Miscellanea:
o Coalesce formats
o Realign arguments
o Add a missing space between a coalesced format
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Prefix the output using the more common kernel style.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are multiple places in intel_pstate where int_tofp() is applied
to both arguments of div_fp(), but this is pointless, because int_tofp()
simply shifts its argument to the left by FRAC_BITS which mathematically
is equivalent to multuplication by 2^FRAC_BITS, so if this is done
to both arguments of a division, the extra factors will cancel each
other during that operation anyway.
Drop the pointless int_tofp() applied to div_fp() arguments throughout
the driver.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq-dt-platdev driver supports creation of cpufreq-dt platform
device now, reuse that and remove similar code from platform code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Multiple platforms are using the generic cpufreq-dt driver now, and all
of them are required to create a platform device with name "cpufreq-dt",
in order to get the cpufreq-dt probed.
Many of them do it from platform code, others have special drivers just
to do that.
It would be more sensible to do this at a generic place, where all such
platform can mark their entries.
This patch adds a separate file to get this device created. Currently
the compat list of platforms that we support is empty, and will be
filled in as and when we move platforms to use it.
It always compiles as part of the kernel and so doesn't need a
module-exit operation.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
All CPUs on Tegra platform share clock/voltage lines and there is
absolutely no need of setting platform data for 'cpufreq-dt' platform
device, as that's the default case.
Stop setting platform data for cpufreq-dt device.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Thierry Reding <treding@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Kconfig for this driver is currently:
config CPU_FREQ_CBE_PMI
bool "CBE frequency scaling using PMI interface"
...meaning that it currently is not being built as a module by
anyone. Lets remove the modular and unused code here, so that
when reading the driver there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.
We also delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Due to differences in the cpufreq core's handling of runtime CPU
offline and nonboot CPUs disabling during system suspend-to-RAM,
fast frequency switching gets disabled after a suspend-to-RAM and
resume cycle on all of the nonboot CPUs.
To prevent that from happening, move the invocation of
cpufreq_disable_fast_switch() from cpufreq_exit_governor() to
sugov_exit(), as the schedutil governor is the only user of fast
frequency switching today anyway.
That simply prevents cpufreq_disable_fast_switch() from being called
without invoking the ->governor callback for the CPUFREQ_GOV_POLICY_EXIT
event (which happens during system suspend now).
Fixes: b7898fda5b (cpufreq: Support for fast frequency switching)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
* pm-cpufreq:
cpufreq: dt: Drop stale comment
cpufreq: intel_pstate: Documenation for structures
cpufreq: intel_pstate: fix inconsistency in setting policy limits
intel_pstate: Avoid extra invocation of intel_pstate_sample()
intel_pstate: Do not set utilization update hook too early
* pm-cpuidle:
intel_idle: Add KBL support
intel_idle: Add SKX support
intel_idle: Clean up all registered devices on exit.
intel_idle: Propagate hot plug errors.
intel_idle: Don't overreact to a cpuidle registration failure.
intel_idle: Setup the timer broadcast only on successful driver load.
intel_idle: Avoid a double free of the per-CPU data.
intel_idle: Fix dangling registration on error path.
intel_idle: Fix deallocation order on the driver exit path.
intel_idle: Remove redundant initialization calls.
intel_idle: Fix a helper function's return value.
intel_idle: remove useless return from void function.
* acpi-cppc:
mailbox: pcc: Don't access an unmapped memory address space
The comment in file header doesn't hold true anymore, drop it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
No code change. Only added kernel doc style comments for structures.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When user sets performance policy using cpufreq interface, it is possible
that because of policy->max limits, the actual performance is still
limited. But the current implementation will silently switch the
policy to powersave and start using powersave limits. If user modifies
any limits using intel_pstate sysfs, this is actually changing powersave
limits.
The current implementation tracks limits under powersave and performance
policy using two different variables. When policy->max is less than
policy->cpuinfo.max_freq, only powersave limit variable is used.
This fix causes the performance limits variable to be used always when
the policy is performance.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit 34e2c555f3 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by
next_freq = 1.25 * max_freq * util / max
where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:
next_freq = 1.25 * curr_freq * util / max
The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Modify the ACPI cpufreq driver to provide a method for switching
CPU frequencies from interrupt context and update the cpufreq core
to support that method if available.
Introduce a new cpufreq driver callback, ->fast_switch, to be
invoked for frequency switching from interrupt context by (future)
governors supporting that feature via (new) helper function
cpufreq_driver_fast_switch().
Add two new policy flags, fast_switch_possible, to be set by the
cpufreq driver if fast frequency switching can be used for the
given policy and fast_switch_enabled, to be set by the governor
if it is going to use fast frequency switching for the given
policy. Also add a helper for setting the latter.
Since fast frequency switching is inherently incompatible with
cpufreq transition notifiers, make it possible to set the
fast_switch_enabled only if there are no transition notifiers
already registered and make the registration of new transition
notifiers fail if fast_switch_enabled is set for at least one
policy.
Implement the ->fast_switch callback in the ACPI cpufreq driver
and make it set fast_switch_possible during policy initialization
as appropriate.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Move definitions of symbols related to transition latency and
sampling rate to include/linux/cpufreq.h so they can be used by
(future) goverernors located outside of drivers/cpufreq/.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Move definitions and function headers related to struct gov_attr_set
to include/linux/cpufreq.h so they can be used by (future) goverernors
located outside of drivers/cpufreq/.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Move abstract code related to struct gov_attr_set to a separate (new)
file so it can be shared with (future) goverernors that won't share
more code with "ondemand" and "conservative".
No intentional functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
In addition to fields representing governor tunables, struct dbs_data
contains some fields needed for the management of objects of that
type. As it turns out, that part of struct dbs_data may be shared
with (future) governors that won't use the common code used by
"ondemand" and "conservative", so move it to a separate struct type
and modify the code using struct dbs_data to follow.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Replace the single helper for adding and removing cpufreq utilization
update hooks, cpufreq_set_update_util_data(), with a pair of helpers,
cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(),
and modify the users of cpufreq_set_update_util_data() accordingly.
With the new helpers, the code using them doesn't need to worry
about the internals of struct update_util_data and in particular
it doesn't need to worry about populating the func field in it
properly upfront.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The initialization of intel_pstate for a given CPU involves populating
the fields of its struct cpudata that represent the previous sample,
but currently that is done in a problematic way.
Namely, intel_pstate_init_cpu() makes an extra call to
intel_pstate_sample() so it reads the current register values that
will be used to populate the "previous sample" record during the
next invocation of intel_pstate_sample(). However, after commit
a4675fbc4a (cpufreq: intel_pstate: Replace timers with utilization
update callbacks) that doesn't work for last_sample_time, because
the time value is passed to intel_pstate_sample() as an argument now.
Passing 0 to it from intel_pstate_init_cpu() is problematic, because
that causes cpu->last_sample_time == 0 to be visible in
get_target_pstate_use_performance() (and hence the extra
cpu->last_sample_time > 0 check in there) and effectively allows
the first invocation of intel_pstate_sample() from
intel_pstate_update_util() to happen immediately after the
initialization which may lead to a significant "turn on"
effect in the governor algorithm.
To mitigate that issue, rework the initialization to avoid the
extra intel_pstate_sample() call from intel_pstate_init_cpu().
Instead, make intel_pstate_sample() return false if it has been
called with cpu->sample.time equal to zero, which will make
intel_pstate_update_util() skip the sample in that case, and
reset cpu->sample.time from intel_pstate_set_update_util_hook()
to make the algorithm start properly every time the hook is set.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The utilization update hook in the intel_pstate driver is set too
early, as it only should be set after the policy has been fully
initialized by the core. That may cause intel_pstate_update_util()
to use incorrect data and put the CPUs into incorrect P-states as
a result.
To prevent that from happening, make intel_pstate_set_policy() set
the utilization update hook instead of intel_pstate_init_cpu() so
intel_pstate_update_util() only runs when all things have been
initialized as appropriate.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Fix for an intel_pstate driver issue related to the handling of
MSR updates uncovered by the recent cpufreq rework (Rafael Wysocki).
- cpufreq core cleanups related to starting governors and frequency
synchronization during resume from system suspend and a locking
fix for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).
- acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
Michael Neuling, Richard Cochran, Shilpasri Bhat).
- intel_idle driver update preventing some Skylake-H systems
from hanging during initialization by disabling deep C-states
mishandled by the platform in the problematic configurations (Len
Brown).
- Intel Xeon Phi Processor x200 support for intel_idle (Dasaratharaman
Chandramouli).
- cpuidle menu governor updates to make it always honor PM QoS
latency constraints (and prevent C1 from being used as the
fallback C-state on x86 when they are set below its exit latency)
and to restore the previous behavior to fall back to C1 if the next
timer event is set far enough in the future that was changed in 4.4
which led to an energy consumption regression (Rik van Riel, Rafael
Wysocki).
- New device ID for a future AMD UART controller in the ACPI driver
for AMD SoCs (Wang Hongcheng).
- Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
scaling (AVS) driver (David Wu).
- ACPI PCI resources management fix for the handling of IO space
resources on architectures where the IO space is memory mapped
(IA64 and ARM64) broken by the introduction of common ACPI
resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).
- Fix for the ACPI backend of the generic device properties API
to make it parse non-device (data node only) children of an
ACPI device correctly (Irina Tirdea).
- Fixes for the handling of global suspend flags (introduced in 4.4)
during hibernation and resume from it (Lukas Wunner).
- Support for obtaining configuration information from Device Trees
in the PM clocks framework (Jon Hunter).
- ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
King, Geert Uytterhoeven).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJW9JaRAAoJEILEb/54YlRx/GAQAJujANWilWHZYm24a9JDcIE9
rsNZIC/FdeBVilPtRTZQnig/Pj32Z4Jm7IZ/DLOq0Deu1YK/9uv3y59M3BcX6WyL
H5VR80L8geUJZ7RRk0WfM5D4X82ovzwpE/kWt2Z7HDuvJSCBmFBZOvNrXbaRncKD
jIvat/p6uCuxt5c08+ebnBLQ6tOs8wLTWiCx3fO128GIrGRGN2xFV6hzRWVGnJ4g
WXGAR+AdLxRMZz4PPmqdTfRj4TNSR071GjKyaeKfZUjQGAsf5O9A77JFjeNVomDx
g1K37Byid2bTByzVavlEXPJZ7eKb5dAhlo7IJ9HAcOAXChLqH2Czjrpd+1XjR9MF
SV/78rCnF8eet83QYLbGV/Mzf7gbJP2Xp6wiaM22VAPpGe+sYfphJoQka9XRTfId
OgAjyYMYdWAKo5DhxVNI8WyN0W5dsoBFPxnaUFhHSGDCIJH7Ksy20m6y3plG2Bxf
ahoiQhmd9ohjtB5JbRnf4MY0hjekp8Srdf+DoNKsk/+JscIyROpYY3msQ3smUKo+
f628MC/wAosMpSV+l+KOYkbjCbtB49IabWtZ//NVD9hYB3E1f6aTN59yFbWB+1rp
L7Y8iaxzSkyJy/yYVuBal3rSk356+BvvoXBlLXmBsyu1TMlcDjALIYztSiTVT5MB
RZBhgNwdkxNCYJfU3ex+
=hUVj
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"The second batch of power management and ACPI updates for v4.6.
Included are fixups on top of the previous PM/ACPI pull request and
other material that didn't make into it but still should go into 4.6.
Among other things, there's a fix for an intel_pstate driver issue
uncovered by recent cpufreq changes, a workaround for a boot hang on
Skylake-H related to the handling of deep C-states by the platform and
a PCI/ACPI fix for the handling of IO port resources on non-x86
architectures plus some new device IDs and similar.
Specifics:
- Fix for an intel_pstate driver issue related to the handling of MSR
updates uncovered by the recent cpufreq rework (Rafael Wysocki).
- cpufreq core cleanups related to starting governors and frequency
synchronization during resume from system suspend and a locking fix
for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).
- acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
Michael Neuling, Richard Cochran, Shilpasri Bhat).
- intel_idle driver update preventing some Skylake-H systems from
hanging during initialization by disabling deep C-states mishandled
by the platform in the problematic configurations (Len Brown).
- Intel Xeon Phi Processor x200 support for intel_idle
(Dasaratharaman Chandramouli).
- cpuidle menu governor updates to make it always honor PM QoS
latency constraints (and prevent C1 from being used as the fallback
C-state on x86 when they are set below its exit latency) and to
restore the previous behavior to fall back to C1 if the next timer
event is set far enough in the future that was changed in 4.4 which
led to an energy consumption regression (Rik van Riel, Rafael
Wysocki).
- New device ID for a future AMD UART controller in the ACPI driver
for AMD SoCs (Wang Hongcheng).
- Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
scaling (AVS) driver (David Wu).
- ACPI PCI resources management fix for the handling of IO space
resources on architectures where the IO space is memory mapped
(IA64 and ARM64) broken by the introduction of common ACPI
resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).
- Fix for the ACPI backend of the generic device properties API to
make it parse non-device (data node only) children of an ACPI
device correctly (Irina Tirdea).
- Fixes for the handling of global suspend flags (introduced in 4.4)
during hibernation and resume from it (Lukas Wunner).
- Support for obtaining configuration information from Device Trees
in the PM clocks framework (Jon Hunter).
- ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
King, Geert Uytterhoeven)"
* tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
PM / AVS: rockchip-io: add io selectors and supplies for rk3399
intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
ACPI / PM: Runtime resume devices when waking from hibernate
PM / sleep: Clear pm_suspend_global_flags upon hibernate
cpufreq: governor: Always schedule work on the CPU running update
cpufreq: Always update current frequency before startig governor
cpufreq: Introduce cpufreq_update_current_freq()
cpufreq: Introduce cpufreq_start_governor()
cpufreq: powernv: Add sysfs attributes to show throttle stats
cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
PCI: ACPI: IA64: fix IO port generic range check
ACPI / util: cast data to u64 before shifting to fix sign extension
cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
cpuidle: menu: Fall back to polling if next timer event is near
cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
cpufreq: Make cpufreq_quick_get() safe to call
ACPI / property: fix data node parsing in acpi_get_next_subnode()
ACPI / APD: Add device HID for future AMD UART controller
...
* pm-cpufreq:
cpufreq: governor: Always schedule work on the CPU running update
cpufreq: Always update current frequency before startig governor
cpufreq: Introduce cpufreq_update_current_freq()
cpufreq: Introduce cpufreq_start_governor()
cpufreq: powernv: Add sysfs attributes to show throttle stats
cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
cpufreq: Make cpufreq_quick_get() safe to call
* pm-cpuidle:
intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
cpuidle: menu: Fall back to polling if next timer event is near
cpuidle: menu: use high confidence factors only when considering polling
Modify dbs_irq_work() to always schedule the process-context work
on the current CPU which also ran the dbs_update_util_handler()
that the irq_work being handled came from.
This causes the entire frequency update handling (involving the
"ondemand" or "conservative" governors) to be carried out by the
CPU whose frequency is to be updated and reduces the overall amount
of inter-CPU noise related to cpufreq.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make policy->cur match the current frequency returned by the driver's
->get() callback before starting the governor in case they went out of
sync in the meantime and drop the piece of code attempting to
resync policy->cur with the real frequency of the boot CPU from
cpufreq_resume() as it serves no purpose any more (and it's racy and
super-ugly anyway).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Move the part of cpufreq_update_policy() that obtains the current
frequency from the driver and updates policy->cur if necessary to
a separate function, cpufreq_get_current_freq().
That should not introduce functional changes and subsequent change
set will need it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Starting a governor in cpufreq always follows the same pattern
involving two calls to cpufreq_governor(), one with the event
argument set to CPUFREQ_GOV_START and one with that argument set to
CPUFREQ_GOV_LIMITS.
Introduce cpufreq_start_governor() that will carry out those two
operations and make all places where governors are started use it.
That slightly modifies the behavior of cpufreq_set_policy() which
now also will go back to the old governor if the second call to
cpufreq_governor() (the one with event equal to CPUFREQ_GOV_LIMITS)
fails, but that really is how it should work in the first place.
Also cpufreq_resume() will now pring an error message if the
CPUFREQ_GOV_LIMITS call to cpufreq_governor() fails, but that
makes it follow cpufreq_add_policy_cpu() and cpufreq_offline()
in that respect.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Create sysfs attributes to export throttle information in
/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats directory. The
newly added sysfs files are as follows:
1)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/turbo_stat
2)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/sub-turbo_stat
3)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/unthrottle
4)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/powercap
5)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overtemp
6)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/supply_fault
7)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/overcurrent
8)/sys/devices/system/cpu/cpuX/cpufreq/throttle_stats/occ_reset
Detailed explanation of each attribute is added to
Documentation/ABI/testing/sysfs-devices-system-cpu
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
These frequency register read/write operations' implementations for the
given processor (Intel/AMD MSR access or I/O port access) are only used
internally in acpi-cpufreq, so make them static.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 96c4726f01 "cpufreq: powernv: Remove cpu_to_chip_id() from
hot-path" introduced a 'core_to_chip_map' array to cache the chip-ids
of all cores.
Replace this with a per-CPU variable that stores the pointer to the
chip-array. This removes the linear lookup and provides a neater and
simpler solution.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Driver updates for ARM SoCs, these contain various things that touch
the drivers/ directory but got merged through arm-soc for practical
reasons:
- Rockchip rk3368 gains power domain support
- Small updates for the ARM spmi driver
- The Atmel PMC driver saw a larger rework, touching both
arch/arm/mach-at91 and drivers/clk/at91
- All reset controller driver changes alway get merged through
arm-soc, though this time the largest change is the addition
of a MIPS pistachio reset driver
- One bugfix for the NXP (formerly Freescale) i.MX weim bus driver
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAVu67OmCrR//JCVInAQJ64hAAqNemdAMloJhh8mk4O74egd/XNE8GLK3v
gGefpZNi0TC8u/GWMhU1aFCElaCmbNlL0IlqaRrU/vydOmQcZYht7Fg3bAm4r3ck
TlKijGTJap4sdHhxSeui+7bhaBToxcklQTdcrKFgOwsype7CAWJCl5otIC/GHO5L
fn4QSjQbqr5kqH1XfuVIphj/fJjDKRRze5D7zn0nExq46OyoYyjc2lm/QkLgeeS2
vDpzOULYXcjf5GfsPknCJGGjenISD7cIAwZukGvJXFh8WrXkEPZZ7B7bBI/8ZeBU
MkdWvOm9fHEWpIPnuTcLeQNlfdzQ0Z0zijgJqnXjwSYXK2Es1UKEoIFvZUyGA9zG
uyLtddFcKbP4QBDUKVMbyYM6x4Cj7LO96dB2pe8iH5rvnoLS32EjJ/4glnbPQFB7
75JKb7eU1pijoy9c3x/G10vINHzbPjyUN3sYTFKMomPFzEF4OVQ3GDclSuD7jjDr
GnqmAqlj29+qGU6iQBBHp9TfLTxwrs/4MKPEZ+tTGvtINnzOpLGA3TUnji7nVFQc
BYy3qaEvg9MfHI3uXhAl2L4CGCVvHfqFs5B7giZfAkbbcTNAHs9PkZ6gMYH+GG3p
tEbTf/dMHmkkqttSz4f7LZS7D56cSfm3cD8kFCRJPLKifmGAk3w1HZ7JoCXdjr1K
22HSKRMxlhU=
=HS4G
-----END PGP SIGNATURE-----
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC driver updates from Arnd Bergmann:
"Driver updates for ARM SoCs, these contain various things that touch
the drivers/ directory but got merged through arm-soc for practical
reasons:
- Rockchip rk3368 gains power domain support
- Small updates for the ARM spmi driver
- The Atmel PMC driver saw a larger rework, touching both
arch/arm/mach-at91 and drivers/clk/at91
- All reset controller driver changes alway get merged through
arm-soc, though this time the largest change is the addition of a
MIPS pistachio reset driver
- One bugfix for the NXP (formerly Freescale) i.MX weim bus driver"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (43 commits)
bus: imx-weim: Take the 'status' property value into account
clk: at91: remove useless includes
clk: at91: pmc: remove useless capacities handling
clk: at91: pmc: drop at91_pmc_base
usb: gadget: atmel: access the PMC using regmap
ARM: at91: remove useless includes and function prototypes
ARM: at91: pm: move idle functions to pm.c
ARM: at91: pm: find and remap the pmc
ARM: at91: pm: simply call at91_pm_init
clk: at91: pmc: move pmc structures to C file
clk: at91: pmc: merge at91_pmc_init in atmel_pmc_probe
clk: at91: remove IRQ handling and use polling
clk: at91: make use of syscon/regmap internally
clk: at91: make use of syscon to share PMC registers in several drivers
hwmon: (scpi) add energy meter support
firmware: arm_scpi: add support for 64-bit sensor values
firmware: arm_scpi: decrease Tx timeout to 20ms
firmware: arm_scpi: fix send_message and sensor_get_value for big-endian
reset: sti: Make reset_control_ops const
reset: zynq: Make reset_control_ops const
...
Newly added support for additional SoCs:
- Axis Artpec-6 SoC family
- Allwinner A83T SoC
- Mediatek MT7623
- NXP i.MX6QP SoC
- ST Microelectronics stm32f469 microcontroller
New features:
- SMP support for Mediatek mt2701
- Big-endian support for NXP i.MX
- DaVinci now uses the new DMA engine dma_slave_map
- OMAP now uses the new DMA engine dma_slave_map
- earlyprintk support for palmchip uart on mach-tango
- delay timer support for orion
Other:
- Exynos PMU driver moved out to drivers/soc/
- Various smaller updates for Renesas, Xilinx, PXA, AT91, OMAP, uniphier
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAVu68DGCrR//JCVInAQIHVQ//Wblms+NKj3aKh6m2Sscs/YkSbFaQ4sY2
rNyfxLIYsLXkth1kbdHRFSMyL68Ym+xutErgw/3HQPB2D1YtYJE3VJ/y8AU92SU3
oHyQIty+atB8d8zBbtlkWmat94NIfYf0I8PQETreGb1LMaJqAf0mDEDAyorTLZcZ
UtQ817Ihn7urqwdTJpTO58V41RmY/vflbHI5T6bIjUJn6fF1e/7+VqtMIfq5sjJ6
0EPEQdu8s5AJ7gcGlGi9I5gAtSnWSA/9phAxul9P8/HrMpUWIxreSEAy8FY7W14F
4TON3sQrnw7nyA72U80KGIXhgLy7SbEmHcSqyy4YJK3ycdk6VYk0CBO7nWVYAiD1
knLisOH6jwe0LIj9WXiRR+Y2Q53pXN8SF77pLDahSnvuShnYEjEH5uELHtxe7Vxh
gn+NH1rDkRTgdYgt4RWlVyUoLkddQWzLb1m4QyQlvxtTR25cJJayXdVX2MRrNPF5
c1zRa9HH+b8LJQIMdWfo/NoHhHtftkkGGsqHAAaypZqdpyk0j2HpJYk5ecPR4f5C
/8o/h/5xOI9gEzp/DVYSZ1VAvRqBQGIDfKBXWq6GuoZaF0aN8ISe5IxFn5Yx2F46
fNaxqiNpWmyywl8D+tSWPFK6aE21AXKGi5zIzexZZqy283aDjlUPI+tgF2GKIuKP
3ayYTDeBpLI=
=ynNj
-----END PGP SIGNATURE-----
Merge tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC platform updates from Arnd Bergmann:
"Newly added support for additional SoCs:
- Axis Artpec-6 SoC family
- Allwinner A83T SoC
- Mediatek MT7623
- NXP i.MX6QP SoC
- ST Microelectronics stm32f469 microcontroller
New features:
- SMP support for Mediatek mt2701
- Big-endian support for NXP i.MX
- DaVinci now uses the new DMA engine dma_slave_map
- OMAP now uses the new DMA engine dma_slave_map
- earlyprintk support for palmchip uart on mach-tango
- delay timer support for orion
Other:
- Exynos PMU driver moved out to drivers/soc/
- Various smaller updates for Renesas, Xilinx, PXA, AT91, OMAP,
uniphier"
* tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (83 commits)
ARM: uniphier: rework SMP code to support new System Bus binding
ARM: uniphier: add missing of_node_put()
ARM: at91: avoid defining CONFIG_* symbols in source code
ARM: DRA7: hwmod: Add data for eDMA tpcc, tptc0, tptc1
ARM: imx: Make reset_control_ops const
ARM: imx: Do L2 errata only if the L2 cache isn't enabled
ARM: imx: select ARM_CPU_SUSPEND only for imx6
dmaengine: pxa_dma: fix the maximum requestor line
ARM: alpine: select the Alpine MSI controller driver
ARM: pxa: add the number of DMA requestor lines
dmaengine: mmp-pdma: add number of requestors
dma: mmp_pdma: Add the #dma-requests DT property documentation
ARM: OMAP2+: Add rtc hwmod configuration for ti81xx
ARM: s3c24xx: Avoid warning for inb/outb
ARM: zynq: Move early printk virtual address to vmalloc area
ARM: DRA7: hwmod: Add custom reset handler for PCIeSS
ARM: SAMSUNG: Remove unused register offset definition
ARM: EXYNOS: Cleanup header files inclusion
drivers: soc: samsung: Enable COMPILE_TEST
MAINTAINERS: Add maintainers entry for drivers/soc/samsung
...
This driver has two issues. First, it tries to fiddle with the hot
plugged CPU's MSR on the UP_PREPARE event, at a time when the CPU is
not yet online. Second, the driver sets the "boost-disable" bit for a
CPU when going down, but does not clear the bit again if the CPU comes
up again due to DOWN_FAILED.
This patch fixes the issues by changing the driver to react to the
ONLINE/DOWN_FAILED events instead of UP_PREPARE. As an added benefit,
the driver also becomes symmetric with respect to the hot plug
mechanism.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit a4675fbc4a (cpufreq: intel_pstate: Replace timers with
utilization update callbacks) wrmsrl_on_cpu() cannot be called in the
intel_pstate_adjust_busy_pstate() path as that is executed with
disabled interrupts. However, atom_set_pstate() called from there
via intel_pstate_set_pstate() uses wrmsrl_on_cpu() to update the
IA32_PERF_CTL MSR which triggers the WARN_ON_ONCE() in
smp_call_function_single().
The reason why wrmsrl_on_cpu() is used by atom_set_pstate() is
because intel_pstate_set_pstate() calling it is also invoked during
the initialization and cleanup of the driver and in those cases it is
not guaranteed to be run on the CPU that is being updated. However,
in the case when intel_pstate_set_pstate() is called by
intel_pstate_adjust_busy_pstate(), wrmsrl() can be used to update
the register safely. Moreover, intel_pstate_set_pstate() already
contains code that only is executed if the function is called by
intel_pstate_adjust_busy_pstate() and there is a special argument
passed to it because of that.
To fix the problem at hand, rearrange the code taking the above
observations into account.
First, replace the ->set() callback in struct pstate_funcs with a
->get_val() one that will return the value to be written to the
IA32_PERF_CTL MSR without updating the register.
Second, split intel_pstate_set_pstate() into two functions,
intel_pstate_update_pstate() to be called by
intel_pstate_adjust_busy_pstate() that will contain all of the
intel_pstate_set_pstate() code which only needs to be executed in
that case and will use wrmsrl() to update the MSR (after obtaining
the value to write to it from the ->get_val() callback), and
intel_pstate_set_min_pstate() to be invoked during the
initialization and cleanup that will set the P-state to the
minimum one and will update the MSR using wrmsrl_on_cpu().
Finally, move the code shared between intel_pstate_update_pstate()
and intel_pstate_set_min_pstate() to a new static inline function
intel_pstate_record_pstate() and make them both call it.
Of course, that unifies the handling of the IA32_PERF_CTL MSR writes
between Atom and Core.
Fixes: a4675fbc4a (cpufreq: intel_pstate: Replace timers with utilization update callbacks)
Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function, cpufreq_quick_get, accesses the global 'cpufreq_driver' and
its fields without taking the associated lock, cpufreq_driver_lock.
Without the locking, nothing guarantees that 'cpufreq_driver' remains
consistent during the call. This patch fixes the issue by taking the lock
before accessing the data structure.
Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq: (94 commits)
intel_pstate: Do not skip samples partially
intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
intel_pstate: Optimize calculation for max/min_perf_adj
intel_pstate: Remove extra conversions in pid calculation
cpufreq: Move scheduler-related code to the sched directory
Revert "cpufreq: postfix policy directory with the first CPU in related_cpus"
cpufreq: Reduce cpufreq_update_util() overhead a bit
cpufreq: Select IRQ_WORK if CPU_FREQ_GOV_COMMON is set
cpufreq: Remove 'policy->governor_enabled'
cpufreq: Rename __cpufreq_governor() to cpufreq_governor()
cpufreq: Relocate handle_update() to kill its declaration
cpufreq: governor: Drop unnecessary checks from show() and store()
cpufreq: governor: Fix race in dbs_update_util_handler()
cpufreq: governor: Make gov_set_update_util() static
cpufreq: governor: Narrow down the dbs_data_mutex coverage
cpufreq: governor: Make dbs_data_mutex static
cpufreq: governor: Relocate definitions of tuners structures
cpufreq: governor: Move per-CPU data to the common code
cpufreq: governor: Make governor private data per-policy
...
If the current value of MPERF or the current value of TSC is the
same as the previous one, respectively, intel_pstate_sample() bails
out early and skips the sample.
However, intel_pstate_adjust_busy_pstate() is still called in that
case which is not correct, so modify intel_pstate_sample() to
return a bool value indicating whether or not the sample has been
taken and use it to decide whether or not to call
intel_pstate_adjust_busy_pstate().
While at it, remove redundant parentheses from the MPERF/TSC
check in intel_pstate_sample().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Use a helper function to compute the average pstate and call it only
where it is needed (only when tracing or in intel_pstate_get).
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpu_load algorithm doesn't need to invoke intel_pstate_calc_busy(),
so move that call from intel_pstate_sample() to
get_target_pstate_use_performance().
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
mul_fp(int_tofp(A), B) expands to:
((A << FRAC_BITS) * B) >> FRAC_BITS, so the same result can be obtained
via simple multiplication A * B. Apply this observation to
max_perf * limits->max_perf and max_perf * limits->min_perf in
intel_pstate_get_min_max()."
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
pid->setpoint and pid->deadband can be initialized in fixed point, so we
can avoid the int_tofp in pid_calc.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Create cpufreq.c under kernel/sched/ and move the cpufreq code
related to the scheduler to that file and to sched.h.
Redefine cpufreq_update_util() as a static inline function to avoid
function calls at its call sites in the scheduler code (as suggested
by Peter Zijlstra).
Also move the definition of struct update_util_data and declaration
of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
include/linux/sched.h.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Revert commit 3510fac454 (cpufreq: postfix policy directory with the
first CPU in related_cpus).
Earlier, the policy->kobj was added to the kobject core, before ->init()
callback was called for the cpufreq drivers. Which allowed those drivers
to add or remove, driver dependent, sysfs files/directories to the same
kobj from their ->init() and ->exit() callbacks.
That isn't possible anymore after commit 3510fac454.
Now, there is no other clean alternative that people can adopt.
Its better to revert the earlier commit to allow cpufreq drivers to
create/remove sysfs files from ->init() and ->exit() callbacks.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use the observation that cpufreq_update_util() is only called
by the scheduler with rq->lock held, so the callers of
cpufreq_set_update_util_data() can use synchronize_sched()
instead of synchronize_rcu() to wait for cpufreq_update_util()
to complete. Moreover, if they are updated to do that,
rcu_read_(un)lock() calls in cpufreq_update_util() might be
replaced with rcu_read_(un)lock_sched(), respectively, but
those aren't really necessary, because the scheduler calls
that function from RCU-sched read-side critical sections
already.
In addition to that, if cpufreq_set_update_util_data() checks
the func field in the struct update_util_data before setting
the per-CPU pointer to it, the data->func check may be dropped
from cpufreq_update_util() as well.
Make the above changes to reduce the overhead from
cpufreq_update_util() in the scheduler paths invoking it
and to make the cleanup after removing its callbacks less
heavy-weight somewhat.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Commit 0eb463be3436 (cpufreq: governor: Replace timers with utilization
update callbacks) made CPU_FREQ select IRQ_WORK, but that's not
necessary, as it is sufficient for IRQ_WORK to be selected by
CPU_FREQ_GOV_COMMON, so modify the cpufreq Kconfig to that effect.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The entire sequence of events (like INIT/START or STOP/EXIT) for which
cpufreq_governor() is called, is guaranteed to be protected by
policy->rwsem now.
The additional checks that were added earlier (as we were forced to drop
policy->rwsem before calling cpufreq_governor() for EXIT event), aren't
required anymore.
Over that, they weren't sufficient really. They just take care of
START/STOP events, but not INIT/EXIT and the state machine was never
maintained properly by them.
Kill the unnecessary checks and policy->governor_enabled field.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The __ at the beginning of the routine aren't really necessary at all.
Rename it to cpufreq_governor() instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
handle_update() is declared at the top of the file as its user appear
before its definition. Relocate the routine to get rid of this.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The show() and store() routines in the cpufreq-governor core don't need
to check if the struct governor_attr they want to use really provides
the callbacks they need as expected (if that's not the case, it means a
bug in the code anyway), so change them to avoid doing that.
Also change the error value to -EBUSY, if the governor is getting
removed and we aren't allowed to store any more changes.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is a scenario that may lead to undesired results in
dbs_update_util_handler(). Namely, if two CPUs sharing a policy
enter the funtion at the same time, pass the sample delay check
and then one of them is stalled until dbs_work_handler() (queued
up by the other CPU) clears the work counter, it may update the
work counter and queue up another work item prematurely.
To prevent that from happening, use the observation that the CPU
queuing up a work item in dbs_update_util_handler() updates the
last sample time. This means that if another CPU was stalling after
passing the sample delay check and now successfully updated the work
counter as a result of the race described above, it will see the new
value of the last sample time which is different from what it used in
the sample delay check before. If that happens, the sample delay
check passed previously is not valid any more, so the CPU should not
continue.
Fixes: f17cbb53783c (cpufreq: governor: Avoid atomic operations in hot paths)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The gov_set_update_util() routine is only used internally by the
common governor code and it doesn't need to be exported, so make
it static.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Since cpufreq_governor_dbs() is now always called with policy->rwsem
held, it cannot be executed twice in parallel for the same policy.
Thus it is not necessary to hold dbs_data_mutex around the invocations
of cpufreq_governor_start/stop/limits() from it as those functions
never modify any data that can be shared between different policies.
However, cpufreq_governor_dbs() may be executed twice in parallal
for different policies using the same gov->gdbs_data object and
dbs_data_mutex is still necessary to protect that object against
concurrent updates.
For this reason, narrow down the dbs_data_mutex locking to
cpufreq_governor_init/exit() where it is needed and rename the
mutex to gov_dbs_data_mutex to reflect its purpose.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
That mutex is only used by cpufreq_governor_dbs() and it doesn't
need to be exported to modules, so make it static and drop the
export incantation.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Move the definitions of struct od_dbs_tuners and struct cs_dbs_tuners
from the common governor header to the ondemand and conservative
governor code, respectively, as they don't need to be in the common
header any more.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
After previous changes there is only one piece of code in the
ondemand governor making references to per-CPU data structures,
but it can be easily modified to avoid doing that, so modify it
accordingly and move the definition of per-CPU data used by the
ondemand and conservative governors to the common code. Next,
change that code to access the per-CPU data structures directly
rather than via a governor callback.
This causes the ->get_cpu_cdbs governor callback to become
unnecessary, so drop it along with the macro and function
definitions related to it.
Finally, drop the definitions of struct od_cpu_dbs_info_s and
struct cs_cpu_dbs_info_s that aren't necessary any more.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Some fields in struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s
are only used for a limited set of CPUs. Namely, if a policy is
shared between multiple CPUs, those fields will only be used for one
of them (policy->cpu). This means that they really are per-policy
rather than per-CPU and holding room for them in per-CPU data
structures is generally wasteful. Also moving those fields into
per-policy data structures will allow some significant simplifications
to be made going forward.
For this reason, introduce struct cs_policy_dbs_info and
struct od_policy_dbs_info to hold those fields. Define each of the
new structures as an extension of struct policy_dbs_info (such that
struct policy_dbs_info is embedded in each of them) and introduce
new ->alloc and ->free governor callbacks to allocate and free
those structures, respectively, such that ->alloc() will return
a pointer to the struct policy_dbs_info embedded in the allocated
data structure and ->free() will take that pointer as its argument.
With that, modify the code accessing the data fields in question
in per-CPU data objects to look for them in the new structures
via the struct policy_dbs_info pointer available to it and drop
them from struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The ondemand_powersave_bias_init() function used for resetting data
fields related to the powersave bias tunable of the ondemand governor
works by walking all of the online CPUs in the system and updating the
od_cpu_dbs_info_s structures for all of them.
However, if governor tunables are per policy, the update should not
touch the CPUs that are not associated with the given dbs_data.
Moreover, since the data fields in question are only ever used for
policy->cpu in each policy governed by ondemand, the update can be
limited to those specific CPUs.
Rework the code to take the above observations into account.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The ->store() callbacks of some tunable sysfs attributes of the
ondemand and conservative governors trigger immediate updates of
the CPU load information for all CPUs "governed" by the given
dbs_data by walking the cpu_dbs_info structures for all online
CPUs in the system and updating them.
This is questionable for two reasons. First, it may lead to a lot of
extra overhead on a system with many CPUs if the given dbs_data is
only associated with a few of them. Second, if governor tunables are
per-policy, the CPUs associated with the other sets of governor
tunables should not be updated.
To address this issue, use the observation that in all of the places
in question the update operation may be carried out in the same way
(because all of the tunables involved are now located in struct
dbs_data and readily available to the common code) and make the
code in those places invoke the same (new) helper function that
will carry out the update correctly.
That new function always checks the ignore_nice_load tunable value
and updates the CPUs' prev_cpu_nice data fields if that's set, which
wasn't done by the original code in store_io_is_busy(), but it
should have been done in there too.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The ->powersave_bias_init_cpu callback in struct od_ops is only used
in one place and that invocation may be replaced with a direct call
to the function pointed to by that callback, so change the code
accordingly and drop the callback.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
After some previous changes, the ->get_cpu_dbs_info_s governor
callback and the "governor" field in struct dbs_governor (whose
value represents the governor type) are not used any more, so
drop them.
Also drop the unused gov_ops field from struct dbs_governor.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
To avoid having to check the governor type explicitly in the common
code in order to initialize data structures specific to the governor
type properly, add a ->start callback to struct dbs_governor and
use it to initialize those data structures for the ondemand and
conservative governors.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The io_is_busy governor tunable is only used by the ondemand governor
and is located in the ondemand-specific data structure, but it is
looked at by the common governor code that has to do ugly things to
get to that value, so move it to struct dbs_data and modify ondemand
accordingly.
Since the conservative governor never touches that field, it will
be always 0 for that governor and it won't have any effect on the
results of computations in that case.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
It is possible for a dbs_data object to be updated after its
usage counter has become 0. That may happen if governor_store()
runs (via a govenor tunable sysfs attribute write) in parallel
with cpufreq_governor_exit() called for the last cpufreq policy
associated with the dbs_data in question. In that case, if
governor_store() acquires dbs_data->mutex right after
cpufreq_governor_exit() has released it, the ->store() callback
invoked by it may operate on dbs_data with no users. Although
sysfs will cause the kobject_put() in cpufreq_governor_exit() to
block until governor_store() has returned, that situation may
lead to some unexpected results, depending on the implementation
of the ->store callback, and therefore it should be avoided.
To that end, modify governor_store() to check the dbs_data's
usage count before invoking the ->store() callback and return
an error if it is 0 at that point.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The ->freq_increase callback in struct od_ops is never invoked,
so drop it.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Drop some lines of code from od_update() by arranging the statements
in there in a more logical way.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Do not convert microseconds to jiffies and the other way around
in governor computations related to the sampling rate and sample
delay and drop delay_for_sampling_rate() which isn't of any use
then.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reduce the indentation level in the conditionals in od_dbs_timer()
and drop the delay variable from it.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
The rate_mult field in struct od_cpu_dbs_info_s is used by the code
shared with the conservative governor and to access it that code
has to do an ugly governor type check. However, first of all it
is ever only used for policy->cpu, so it is per-policy rather than
per-CPU and second, it is initialized to 1 by cpufreq_governor_start(),
so if the conservative governor never modifies it, it will have no
effect on the results of any computations.
For these reasons, move rate_mult to struct policy_dbs_info (as a
common field).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
If store_sampling_rate() updates the sample delay when the ondemand
governor is in the middle of its high/low dance (OD_SUB_SAMPLE sample
type is set), the governor will still do the bottom half of the
previous sample which may take too much time.
To prevent that from happening, change store_sampling_rate() to always
reset the sample delay to 0 which also is consistent with the new
behavior of cpufreq_governor_limits().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The way the ->gov_check_cpu governor callback is used by the ondemand
and conservative governors is not really straightforward. Namely, the
governor calls dbs_check_cpu() that updates the load information for
the policy and the invokes ->gov_check_cpu() for the governor.
To get rid of that entanglement, notice that cpufreq_governor_limits()
doesn't need to call dbs_check_cpu() directly. Instead, it can simply
reset the sample delay to 0 which will cause a sample to be taken
immediately. The result of that is practically equivalent to calling
dbs_check_cpu() except that it will trigger a full update of governor
internal state and not just the ->gov_check_cpu() part.
Following that observation, make cpufreq_governor_limits() reset
the sample delay and turn dbs_check_cpu() into a function that will
simply evaluate the load and return the result called dbs_update().
That function can now be called by governors from the routines that
previously were pointed to by ->gov_check_cpu and those routines
can be called directly by each governor instead of dbs_check_cpu().
This way ->gov_check_cpu becomes unnecessary, so drop it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Clean up some load-related computations in dbs_check_cpu() and
cpufreq_governor_start() to get rid of unnecessary operations and
type casts and make the code easier to read.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The contribution of the CPU nice time to the idle time in dbs_check_cpu()
is computed in a bogus way, as the code may subtract current and previous
nice values for different CPUs.
That doesn't matter for cases when cpufreq policies are not shared,
but may lead to problems otherwise.
Fix the computation and simplify it to avoid taking unnecessary steps.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Rework the handling of work items by dbs_update_util_handler() and
dbs_work_handler() so the former (which is executed in scheduler
paths) only uses atomic operations when absolutely necessary. That
is, when the policy is shared and dbs_update_util_handler() has
already decided that this is the time to queue up a work item.
In particular, this avoids the atomic ops entirely on platforms where
policy objects are never shared.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The atomic work counter incrementation in gov_cancel_work() is not
necessary any more, because work items won't be queued up after
gov_clear_update_util() anyway, so drop it along with the comment
about how it may be missed by the gov_clear_update_util().
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
As it turns out, irq_work_queue_on() will crash if invoked on
non-SMP ARM platforms, but in fact it is not necessary to use that
function in the cpufreq governor code (as it doesn't matter to that
code which CPU will handle the irq_work), so change it to always use
irq_work_queue().
Fixes: 8fb47ff100af (cpufreq: governor: Replace timers with utilization update callbacks)
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Reported-and-tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid extra checks in od_dbs_timer() by rearranging updates to the
local delay variable in it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The ondemand governor already updates sample_delay_ns immediately on
updates to the sampling rate, but conservative doesn't do that.
It was left out earlier as the code was really too complex to get
that done easily. Things are sorted out very well now, however, and
the conservative governor can be modified to follow ondemand in that
respect.
Moreover, since the code needed to implement that in the
conservative governor would be identical to the corresponding
ondemand governor's code, make that code common and change both
governors to use it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq core now guarantees that policy->rwsem won't be dropped
while running the ->governor callback for the CPUFREQ_GOV_POLICY_EXIT
event and will be held acquired until the complete sequence of governor
state changes has finished.
This allows governor state machine checks to be dropped from multiple
functions in cpufreq_governor.c.
This also means that policy_dbs->policy can be initialized upfront, so
the entire initialization of struct policy_dbs can be carried out in
one place.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We used to drop policy->rwsem just before calling __cpufreq_governor()
in some cases earlier and so it was possible that __cpufreq_governor()
ran concurrently via separate threads for the same policy.
In order to guarantee valid state transitions for governors,
'governor_enabled' was required to be protected using some locking
and cpufreq_governor_lock was added for that.
But now __cpufreq_governor() is always called under policy->rwsem,
and 'governor_enabled' is protected against races even without
cpufreq_governor_lock.
Get rid of the extra lock now.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw : Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq core code is not consistent with respect to invoking
__cpufreq_governor() under policy->rwsem.
Changing all code to always hold policy->rwsem around
__cpufreq_governor() invocations will allow us to remove
cpufreq_governor_lock that is used today because we can't
guarantee that __cpufreq_governor() isn't executed twice in
parallel for the same policy.
We should also ensure that policy->rwsem is held across governor
state changes.
For example, while adding a CPU to the policy in the CPU online path,
we need to stop the governor, change policy->cpus, start the governor
and then refresh its limits. The complete sequence must be guaranteed
to complete without interruptions by concurrent governor state
updates. That can be achieved by holding policy->rwsem around those
sequences of operations.
Also note that after this patch cpufreq_driver->stop_cpu() and
->exit() will get called under policy->rwsem which wasn't the case
earlier. That shouldn't have any side effects, though.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 1aee40ac9c (cpufreq: Invoke __cpufreq_remove_dev_finish()
after releasing cpu_hotplug.lock) split the cpufreq's CPU offline
routine in two pieces, one of them to be run with CPU offline/online
locked and the other to be called later. The reason for that split
was a possible deadlock scenario involving cpufreq sysfs attributes
and CPU offline.
However, the handling of CPU offline in cpufreq has changed since
then. Policy sysfs attributes are never removed during CPU offline,
so there's no need to worry about accessing them during CPU offline,
because that can't lead to any deadlocks now. Governor sysfs
attributes are still removed in __cpufreq_governor(_EXIT), but
there is a new kobject type for them now and its show/store
callbacks don't lock CPU offline/online (they don't need to do
that).
This means that the CPU offline code in cpufreq doesn't need to
be split any more, so combine cpufreq_offline_prepare() with
cpufreq_offline_finish().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The dbs_data_mutex lock is currently used in two places. First,
cpufreq_governor_dbs() uses it to guarantee mutual exclusion between
invocations of governor operations from the core. Second, it is used by
ondemand governor's update_sampling_rate() to ensure the stability of
data structures walked by it.
The second usage is quite problematic, because update_sampling_rate() is
called from a governor sysfs attribute's ->store callback and that leads
to a deadlock scenario involving cpufreq_governor_exit() which runs
under dbs_data_mutex. Thus it is better to rework the code so
update_sampling_rate() doesn't need to acquire dbs_data_mutex.
To that end, rework update_sampling_rate() to walk a list of policy_dbs
objects supported by the dbs_data one it has been called for (instead of
walking cpu_dbs_info object for all CPUs). The list manipulation is
protected with dbs_data->mutex which also is held around the execution
of update_sampling_rate(), it is not necessary to hold dbs_data_mutex in
that function any more.
Reported-by: Juri Lelli <juri.lelli@arm.com>
Reported-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Earlier, when the struct freq-attr was used to represent governor
attributes, the standard cpufreq show/store sysfs attribute callbacks
were applied to the governor tunable attributes and they always acquire
the policy->rwsem lock before carrying out the operation. That could
have resulted in an ABBA deadlock if governor tunable attributes are
removed under policy->rwsem while one of them is being accessed
concurrently (if sysfs attributes removal wins the race, it will wait
for the access to complete with policy->rwsem held while the attribute
callback will block on policy->rwsem indefinitely).
We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time.
The previous commit, "cpufreq: governor: New sysfs show/store callbacks
for governor tunables", fixed the original ABBA deadlock by adding new
governor specific show/store callbacks.
We don't have to drop rwsem around invocations of governor event
CPUFREQ_GOV_POLICY_EXIT anymore, and original fix can be reverted now.
Fixes: 955ef48335 (cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reported-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The previous commit introduced a new set of macros for creating sysfs
attributes that represent governor tunables and the old macros used for
this purpose are not needed any more, so drop them.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The ondemand and conservative governors use the global-attr or freq-attr
structures to represent sysfs attributes corresponding to their tunables
(which of them is actually used depends on whether or not different
policy objects can use the same governor with different tunables at the
same time and, consequently, on where those attributes are located in
sysfs).
Unfortunately, in the freq-attr case, the standard cpufreq show/store
sysfs attribute callbacks are applied to the governor tunable attributes
and they always acquire the policy->rwsem lock before carrying out the
operation. That may lead to an ABBA deadlock if governor tunable
attributes are removed under policy->rwsem while one of them is being
accessed concurrently (if sysfs attributes removal wins the race, it
will wait for the access to complete with policy->rwsem held while the
attribute callback will block on policy->rwsem indefinitely).
We attempted to address this issue by dropping policy->rwsem around
governor tunable attributes removal (that is, around invocations of the
->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT)
in cpufreq_set_policy(), but that opened up race conditions that had not
been possible with policy->rwsem held all the time. Therefore
policy->rwsem cannot be dropped in cpufreq_set_policy() at any point,
but the deadlock situation described above must be avoided too.
To that end, use the observation that in principle governor tunables may
be represented by the same data type regardless of whether the governor
is system-wide or per-policy and introduce a new structure, struct
governor_attr, for representing them and new corresponding macros for
creating show/store sysfs callbacks for them. Also make their parent
kobject use a new kobject type whose default show/store callbacks are
not related to the standard core cpufreq ones in any way (and they don't
acquire policy->rwsem in particular).
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Subject & changelog + rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are a few common tunables shared between the ondemand and
conservative governors. Move them to struct dbs_data to simplify
code.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some tunables are present in governor-specific structures, whereas one
(min_sampling_rate) is located directly in struct dbs_data.
There is a special macro for creating its sysfs attribute and the
show/store callbacks, but since more tunables are going to be moved
to struct dbs_data, a new generic macro for such cases will be useful,
so add it and use it for min_sampling_rate.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Juri Lelli <juri.lelli@arm.com>
Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is silly to jump around "return 0", so don't do that.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The skip_work field in struct policy_dbs_info technically is a
counter, so give it a new name to reflect that.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Make the initialization of struct cpu_dbs_info objects in
alloc_policy_dbs_info() and the code that cleans them up in
free_policy_dbs_info() more symmetrical. In particular,
set/clear the update_util.func field in those functions along
with the policy_dbs field.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The struct policy_dbs_info objects representing per-policy governor
data are not accessible directly from the corresponding policy
objects. To access them, one has to get a pointer to the
struct cpu_dbs_info of policy->cpu and use the policy_dbs field of
that which isn't really straightforward.
To address that rearrange the governor data structures so the
governor_data pointer in struct cpufreq_policy will point to
struct policy_dbs_info (instead of struct dbs_data) and that will
contain a pointer to struct dbs_data.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Use the observation that cpufreq_governor_limits() doesn't have to
get to the policy object it wants to manipulate by walking the
reference chain cdbs->policy_dbs->policy, as the final pointer is
actually equal to its argument, and make it access the policy
object directy via its argument.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Since policy->cpu is always passed as the second argument to
dbs_check_cpu(), it is not really necessary to pass it, because
the function can obtain that value via its first argument just fine.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The struct cpu_common_dbs_info structure represents the per-policy
part of the governor data (for the ondemand and conservative
governors), but its name doesn't reflect its purpose.
Rename it to struct policy_dbs_info and rename variables related to
it accordingly.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Since it is possible to obtain a pointer to struct dbs_governor
from a pointer to the struct governor embedded in it with the help
of container_of(), the additional gov pointer in struct dbs_data
isn't really necessary.
Drop that pointer and make the code using it reach the dbs_governor
object via policy->governor.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Since it is possible to obtain a pointer to struct dbs_governor
from a pointer to the struct governor embedded in it via
container_of(), the second argument of cpufreq_governor_init()
is not necessary. Accordingly, cpufreq_governor_dbs() doesn't
need its second argument either and the ->governor callbacks
for both the ondemand and conservative governors may be set
to cpufreq_governor_dbs() directly. Make that happen.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The ondemand and conservative governors are represented by
struct common_dbs_data whose name doesn't reflect the purpose it
is used for, so rename it to struct dbs_governor and rename
variables of that type accordingly.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
For the ondemand and conservative governors (generally, governors
that use the common code in cpufreq_governor.c), there are two static
data structures representing the governor, the struct governor
structure (the interface to the cpufreq core) and the struct
common_dbs_data one (the interface to the cpufreq_governor.c code).
There's no fundamental reason why those two structures have to be
separate. Moreover, if the struct governor one is included into
struct common_dbs_data, it will be possible to reach the latter from
the policy via its policy->governor pointer, so it won't be necessary
to pass a separate pointer to it around. For this reason, embed
struct governor in struct common_dbs_data.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Do not pass struct dbs_data pointers to the family of functions
implementing governor operations in cpufreq_governor.c as they can
take that pointer from policy->governor by themselves.
The cpufreq_governor_init() case is slightly more complicated, since
policy->governor may be NULL when it is invoked, but then it can reach
the pointer in question via its cdata argument just fine.
While at it, rework cpufreq_governor_dbs() to avoid a pointless
policy_governor check in the CPUFREQ_GOV_POLICY_INIT case.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Every governor relying on the common code in cpufreq_governor.c
has to provide its own mutex in struct common_dbs_data. However,
there actually is no need to have a separate mutex per governor
for this purpose, they may be using the same global mutex just
fine. Accordingly, introduce a single common mutex for that and
drop the mutex field from struct common_dbs_data.
That at least will ensure that the mutex is always present and
initialized regardless of what the particular governors do.
Another benefit is that the common code does not need a pointer to
a governor-related structure to get to the mutex which sometimes
helps.
Finally, it makes the code generally easier to follow.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Instead of using a per-CPU deferrable timer for queuing up governor
work items, register a utilization update callback that will be
invoked from the scheduler on utilization changes.
The sampling rate is still the same as what was used for the
deferrable timers and the added irq_work overhead should be offset by
the eliminated timers overhead, so in theory the functional impact of
this patch should not be significant.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Instead of using a per-CPU deferrable timer for utilization sampling
and P-states adjustments, register a utilization update callback that
will be invoked from the scheduler on utilization changes.
The sampling rate is still the same as what was used for the deferrable
timers, so the functional impact of this patch should not be significant.
Based on an earlier patch from Srinivas Pandruvada.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.
This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.
The update_load_avg() changes were suggested by Peter Zijlstra.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Setting a new CPU frequency and reading the current request value
in the ACPI cpufreq driver involves each at least two switch
instructions (there's more if the policy is shared). One of
them is present in drv_read/write() that prepares a command
structure and the other happens in subsequent do_drv_read/write()
when that structure is interpreted. However, all of those switches
may be avoided by using function pointers.
To that end, add two function pointers to struct acpi_cpufreq_data
to represent read and write operations on the frequency register
and set them up during policy intitialization to point to the pair
of routines suitable for the given processor (Intel/AMD MSR access
or I/O port access). Then, use those pointers in do_drv_read/write()
and modify drv_read/write() to prepare the command structure for
them without any checks.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The MT8173 cpufreq driver can currently only be built-in, but
it has a Kconfig dependency on the thermal core. THERMAL
can be a loadable module, which in turn makes this driver
impossible to build.
It is nicer to make the cpufreq driver a module as well, so
this patch turns the option in to a 'tristate' and adapts
the dependency accordingly.
The driver has no module_exit() function, so it will continue
to not support unloading, but it can be built as a module
and loaded at runtime now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5269e7067c (cpufreq: Add ARM_MT8173_CPUFREQ dependency on THERMAL)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
My previous patch to avoid link errors with the qoriq cpufreq
driver disallowed all of the broken cases, but also prevented
the driver from being built when CONFIG_THERMAL is a module.
This changes the dependency to allow the cpufreq driver to
also be a module in this case, just not built-in.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 8ae1702a0d (cpufreq: qoriq: Register cooling device based on device tree)
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This is a minor cycle with :
- cleanup fixes from Arnd, mainly build oriented and sparse type ones
- dma fixes for requestors above 32 (impacting mainly camera driver)
- some minor cleanup on pxa3xx device-tree side
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW0Y3SAAoJEAP2et0duMsSfesQAIGKccj/WRWtlCxje+8Aj3UW
eu6vV9xGVBZu0RJxCchwCBP5ZO8WSvXBYvX5MIaBH2TFdJqJEjie6ES4mRlGVuup
XMUZybcFOu9N5j0WMddj2OHhEeYGgMsSZwSOgK4nsQ/eyhrcI5fjykdafmPKywjW
+/USV90ucVp+38C+yg4yXSuI8FOEABIn6VoX//+YpDZetvIoJCUQne1g+uPE2YoF
dcPqVbY2OeAvuABkI3Wqdrc3Ico9i8Ns8erg8EuDe5xv2TvJXhn/mIeoVNZ2s4So
5aCD87RQu3rwDfqAv6FzW06k9AYAE/p/VKh0smI12D8MCxhSD0EZP+jBfDvuwx/n
IICMH7YuunRSRe7VDFgWyIz7wduHQu3xctF9scSYZD+kBpvAD274sZs9WYs4fGd6
hXxxV4iXrP8af6A+sddDeo0Gq25jC7JoL43YlQTUzBMqbrzK4W3c0dXCib6hh/eN
W/YVdPERGs9cTG/IZFbn3cn1QIYdA4exNoE38txWkKpeiVxu0tZKSsg6m9xCPu7+
vMQj1m8E8mkgiMq0BJnei02QC8Xw5ekf4wNUsLOOou29c7CTt3zicM8YtnDmANV2
fIIT4BK0izIZj4N0RZp9KT6h/IkF1VHRz3pcw3vYXJLbaKqHk/6doDUW30E/Ol1f
PzJedKaWujhOV1DG1SLb
=lBYC
-----END PGP SIGNATURE-----
Merge tag 'pxa-for-4.6' of https://github.com/rjarzmik/linux into next/soc
Merge "pxa changes for v4.6 cycle" from Robert Jarzmik:
This is a minor cycle with :
- cleanup fixes from Arnd, mainly build oriented and sparse type ones
- dma fixes for requestors above 32 (impacting mainly camera driver)
- some minor cleanup on pxa3xx device-tree side
* tag 'pxa-for-4.6' of https://github.com/rjarzmik/linux:
dmaengine: pxa_dma: fix the maximum requestor line
ARM: pxa: add the number of DMA requestor lines
dmaengine: mmp-pdma: add number of requestors
dma: mmp_pdma: Add the #dma-requests DT property documentation
ARM: pxa: pxa3xx device-tree support cleanup
ARM: pxa: don't select RFKILL if CONFIG_NET is disabled
ARM: pxa: fix building without IWMMXT
ARM: pxa: move extern declarations to pm.h
ARM: pxa: always select one of the two CPU types
ARM: pxa: don't select GPIO_SYSFS for MIOA701
ARM: pxa: mark unused eseries code as __maybe_unused
ARM: pxa: mark spitz_card_pwr_ctrl as __maybe_unused
ARM: pxa: define clock registers as __iomem
Unregister the notifiers if cpufreq_driver_register() fails in
powernv_cpufreq_init(). Re-arrange the unregistration and cleanup routines
in powernv_cpufreq_exit() to free all the resources after the driver
has unregistered.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Disable HWP Interrupt notification before enabling HWP. Since we don't
have HWP interrupt handling for possible performance interrupts, there
is not much use of enabling HWP interrupts.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the processor supports HWP, enable it by default without checking
for the cpu model. This will allow to enable HWP in all supported
processors without driver change.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The "out" label at the final return statement in acpi_cpufreq_target()
is totally pointless, so drop them and modify the code to return the
right values immediately instead of jumping to it.
No functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Drop a pointless label at a return statement from
__cpufreq_driver_target() and rearrange that function
to reduce the indentation level.
No intentional functional changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The intel-pstate driver is using intel_pstate_hwp_set() from two
separate paths, i.e. ->set_policy() callback and sysfs update path for
the files present in /sys/devices/system/cpu/intel_pstate/ directory.
While an update to the sysfs path applies to all the CPUs being managed
by the driver (which essentially means all the online CPUs), the update
via the ->set_policy() callback applies to a smaller group of CPUs
managed by the policy for which ->set_policy() is called.
And so, intel_pstate_hwp_set() should update frequencies of only the
CPUs that are part of policy->cpus mask, while it is called from
->set_policy() callback.
In order to do that, add a parameter (cpumask) to intel_pstate_hwp_set()
and apply the frequency changes only to the concerned CPUs.
For ->set_policy() path, we are only concerned about policy->cpus, and
so policy->rwsem lock taken by the core prior to calling ->set_policy()
is enough to take care of any races. The larger lock acquired by
get_online_cpus() is required only for the updates to sysfs files.
Add another routine, intel_pstate_hwp_set_online_cpus(), and call it
from the sysfs update paths.
This also fixes a lockdep reported recently, where policy->rwsem and
get_online_cpus() could have been acquired in any order causing an ABBA
deadlock. The sequence of events leading to that was:
intel_pstate_init(...)
...cpufreq_online(...)
down_write(&policy->rwsem); // Locks policy->rwsem
...
cpufreq_init_policy(policy);
...intel_pstate_hwp_set();
get_online_cpus(); // Temporarily locks cpu_hotplug.lock
...
up_write(&policy->rwsem);
pm_suspend(...)
...disable_nonboot_cpus()
_cpu_down()
cpu_hotplug_begin(); // Locks cpu_hotplug.lock
__cpu_notify(CPU_DOWN_PREPARE, ...);
...cpufreq_offline_prepare();
down_write(&policy->rwsem); // Locks policy->rwsem
Reported-and-tested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The comment has been incorrect since commit 4dea5806d3
("cpufreq: return EEXIST instead of EBUSY for second registering").
Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The show() and store() routines in the cpufreq core don't need to
check if the struct freq_attr they want to use really provides the
callbacks they need as expected (if that's not the case, it means
a bug in the code anyway), so change them to avoid doing that.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
OPP layer manages it now and cpufreq-dt driver doesn't need it. But, we
still need to check for availability of resources for deferred probing.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Its already done by core and we don't need to get it anymore. And so,
we don't need to get of node in cpufreq_init() anymore, move that to
find_supply_name() instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
OPP core supports frequency/voltage changes based on the target
frequency now, use that instead of open coding the same in cpufreq-dt
driver.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
OPP layer has all the information now to calculate transition latency
(clock_latency + voltage_latency). Lets reuse the OPP layer helper
dev_pm_opp_get_max_transition_latency() instead of open coding the same
in cpufreq-dt driver.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The core already have a valid regulator set for the device opp and the
unsupported OPPs are already disabled by the core. There is no need to
repeat that in the user drivers, get rid of it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
OPP core can handle the regulators by itself, and but it needs to know
the name of the regulator to fetch. Add support for that.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
"clock-latency" is handled by OPP layer for all bindings and so there is
no need to make special calls for V1 bindings. Use
dev_pm_opp_get_max_clock_latency() for both the cases.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
That's the real purpose of this field, i.e. to take special care of old
OPP V1 bindings. Lets name it accordingly, so that it can be used
elsewhere.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We have the device structure available now, lets use it for better print
messages.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently we use printk message to notify the throttle event. But this
can flood the console if the cpu is throttled frequently. So replace the
printk with the tracepoint to notify the throttle event. And also events
like throttle below nominal frequency and OCC_RESET are reduced to
pr_warn/pr_warn_once as pointed by MFG to not mark them as critical
messages. This patch adds 'throttle_reason' to struct chip to store the
throttle reason.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpu_to_chip_id() does a DT walk through to find out the chip id by
taking a contended device tree lock. This adds an unnecessary overhead
in a hot path. So instead of calling cpu_to_chip_id() everytime cache
the chip ids for all cores in the array 'core_to_chip_map' and use it
in the hotpath.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In the kworker_thread powernv_cpufreq_work_fn(), we can end up
sending an IPI to a cpu going offline. This is a rare corner case
which is fixed using {get/put}_online_cpus(). Along with this fix,
this patch adds changes to do oneshot cpumask_{clear/and} operation.
Suggested-by: Shreyas B Prabhu <shreyas@linux.vnet.ibm.com>
Suggested-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This will free the dynamically allocated memory of 'chips' on
module exit.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The preprocessor magic used for setting the default cpufreq governor
(and for using the performance governor as a fallback one for that
matter) is really nasty, so replace it with __weak functions and
overrides.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Saravana Kannan <skannan@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
We should not dereference registers as pointers, so use readl/writel
instead for these registers.
The clock registers are accessed in multiple files, so we have to
change them all at once.
I stumbled over these registers while looking at something unrelated.
There are in fact other registers with the same problem, but I did
not try to address those at this point.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
* pm-cpuidle:
cpuidle: coupled: remove unused define cpuidle_coupled_lock
cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freeze
* pm-cpufreq:
cpufreq: cpufreq-dt: avoid uninitialized variable warnings:
cpufreq: pxa2xx: fix pxa_cpufreq_change_voltage prototype
cpufreq: Use list_is_last() to check last entry of the policy list
cpufreq: Fix NULL reference crash while accessing policy->governor_data
* pm-domains:
PM / Domains: Fix typo in comment
PM / Domains: Fix potential deadlock while adding/removing subdomains
PM / domains: fix lockdep issue for all subdomains
* pm-sleep:
PM: APM_EMULATION does not depend on PM
gcc warns quite a bit about values returned from allocate_resources()
in cpufreq-dt.c:
cpufreq-dt.c: In function 'cpufreq_init':
cpufreq-dt.c:327:6: error: 'cpu_dev' may be used uninitialized in this function [-Werror=maybe-uninitialized]
cpufreq-dt.c:197:17: note: 'cpu_dev' was declared here
cpufreq-dt.c:376:2: error: 'cpu_clk' may be used uninitialized in this function [-Werror=maybe-uninitialized]
cpufreq-dt.c:199:14: note: 'cpu_clk' was declared here
cpufreq-dt.c: In function 'dt_cpufreq_probe':
cpufreq-dt.c:461:2: error: 'cpu_clk' may be used uninitialized in this function [-Werror=maybe-uninitialized]
cpufreq-dt.c:447:14: note: 'cpu_clk' was declared here
The problem is that it's slightly hard for gcc to follow return
codes across PTR_ERR() calls.
This patch uses explicit assignments to the "ret" variable to make
it easier for gcc to verify that the code is actually correct,
without the need to add a bogus initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are two definitions of pxa_cpufreq_change_voltage, with slightly
different prototypes after one of them had its argument marked 'const'.
Now the other one (for !CONFIG_REGULATOR) produces a harmless warning:
drivers/cpufreq/pxa2xx-cpufreq.c: In function 'pxa_set_target':
drivers/cpufreq/pxa2xx-cpufreq.c:291:36: warning: passing argument 1 of 'pxa_cpufreq_change_voltage' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
ret = pxa_cpufreq_change_voltage(&pxa_freq_settings[idx]);
^
drivers/cpufreq/pxa2xx-cpufreq.c:205:12: note: expected 'struct pxa_freqs *' but argument is of type 'const struct pxa_freqs *'
static int pxa_cpufreq_change_voltage(struct pxa_freqs *pxa_freq)
^
This changes the prototype in the same way as the other, which
avoids the warning.
Fixes: 03c2299063 (cpufreq: pxa: make pxa_freqs arrays const)
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.2+ <stable@vger.kernel.org> # 4.2+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently next_policy() explicitly checks if a policy is the last
policy in the cpufreq_policy_list. Use the standard list_is_last
primitive instead.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
CONFIG_PM ifdefs are superfluous and can be removed.
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Kukjin Kim <kgene@kernel.org>
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica Gupta,
Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling, Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
- cxl: Fix possible idr warning when contexts are released from Vaibhav Jain
- cxl: use correct operator when writing pcie config space values from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma Krishnan
- Freescale updates from Scott: Highlights include moving QE code out of
arch/powerpc (to be shared with arm), device tree updates, and minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmIxeAAoJEFHr6jzI4aWAA+cQAIXAw4WfVWJ2V4ZK+1eKfB57
fdXG71PuXG+WYIWy71ly8keLHdzzD1NQ2OUB64bUVRq202nRgVc15ZYKRJ/FE/sP
SkxaQ2AG/2kI2EflWshOi0Lu9qaZ+LMHJnszIqE/9lnGSB2kUI/cwsSXgziiMKXR
XNci9v14SdDd40YV/6BSZXoxApwyq9cUbZ7rnzFLmz4hrFuKmB/L3LABDF8QcpH7
sGt/YaHGOtqP0UX7h5KQTFLGe1OPvK6NWixSXeZKQ71ED6cho1iKUEOtBA9EZeIN
QM5JdHFWgX8MMRA0OHAgidkSiqO38BXjmjkVYWoIbYz7Zax3ThmrDHB4IpFwWnk3
l7WBykEXY7KEqpZzbh0GFGehZWzVZvLnNgDdvpmpk/GkPzeYKomBj7ZZfm3H1yGD
BTHPwuWCTX+/K75yEVNO8aJO12wBg7DRl4IEwBgqhwU8ga4FvUOCJkm+SCxA1Dnn
qlpS7qPwTXNIEfKMJcxp5X0KiwDY1EoOotd4glTN0jbeY5GEYcxe+7RQ302GrYxP
zcc8EGLn8h6BtQvV3ypNHF5l6QeTW/0ZlO9c236tIuUQ5gQU39SQci7jQKsYjSzv
BB1XdLHkbtIvYDkmbnr1elbeJCDbrWL9rAXRUTRyfuCzaFWTfZmfVNe8c8qwDMLk
TUxMR/38aI7bLcIQjwj9
=R5bX
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Core:
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
Misc:
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica
Gupta, Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling,
Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de
Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions
fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica
Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from
Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from
Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from
Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing
from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc
from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
cxl:
- cxl: Fix possible idr warning when contexts are released from
Vaibhav Jain
- cxl: use correct operator when writing pcie config space values
from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav
Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma
Krishnan
Freescale:
- Freescale updates from Scott: Highlights include moving QE code out
of arch/powerpc (to be shared with arm), device tree updates, and
minor fixes"
* tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (149 commits)
powerpc/module: Handle R_PPC64_ENTRY relocations
scripts/recordmcount.pl: support data in text section on powerpc
powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff
powerpc/mm: Fix _PAGE_PTE breaking swapoff
cxl: Enable PCI device ID for future IBM CXL adapter
cxl: use -Werror only with CONFIG_PPC_WERROR
cxl: fix build for GCC 4.6.x
powerpc: Add HWCAP bits for Power9
powerpc/powernv: Reserve PE#0 on NPU
powerpc/powernv: Change NPU PE# assignment
powerpc/powernv: Fix update of NVLink DMA mask
powerpc/powernv: Remove misleading comment in pci.c
powerpc: Implement save_stack_trace_regs() to enable kprobe stack tracing
powerpc: Fix build break due to paca mm_context_t changes
cxl: Fix DSI misses when the context owning task exits
MAINTAINERS: Update Scott Wood's e-mail address
powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery()
powerpc: Fix style of self-test config prompts
powerpc/powernv: Only delay opal_rtc_read() retry when necessary
...
The function can return negative values so it should be assigned
to signed type.
The problem has been detected using proposed semantic patch
scripts/coccinelle/tests/unsigned_lesser_than_zero.cocci.
Link: http://permalink.gmane.org/gmane.linux.kernel/2038576
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is reported that, with CONFIG_HZ_PERIODIC=y cpu stays at the
lowest frequency even if the usage goes to 100%, neither ondemand
nor conservative governor works, however performance and
userspace work as expected. If set with CONFIG_NO_HZ_FULL=y,
everything goes well.
This problem is caused by improper calculation of the idle_time
when the load is extremely high(near 100%). Firstly, cpufreq_governor
uses get_cpu_idle_time to get the total idle time for specific cpu, then:
1.If the system is configured with CONFIG_NO_HZ_FULL, the idle time is
returned by ktime_get, which is always increasing, it's OK.
2.However, if the system is configured with CONFIG_HZ_PERIODIC,
get_cpu_idle_time might not guarantee to be always increasing,
because it will leverage get_cpu_idle_time_jiffy to calculate the
idle_time, consider the following scenario:
At T1:
idle_tick_1 = total_tick_1 - user_tick_1
sample period(80ms)...
At T2: ( T2 = T1 + 80ms):
idle_tick_2 = total_tick_2 - user_tick_2
Currently the algorithm is using (idle_tick_2 - idle_tick_1) to
get the delta idle_time during the past sample period, however
it CAN NOT guarantee that idle_tick_2 >= idle_tick_1, especially
when cpu load is high.
(Yes, total_tick_2 >= total_tick_1, and user_tick_2 >= user_tick_1,
but how about idle_tick_2 and idle_tick_1? No guarantee.)
So governor might get a negative value of idle_time during the past
sample period, which might mislead the system that the idle time is
very big(converted to unsigned int), and the busy time is nearly zero,
which causes the governor to always choose the lowest cpufreq,
then cause this problem.
In theory there are two solutions:
1.The logic should not rely on the idle tick during every sample period,
but be based on the busy tick directly, as this is how 'top' is
implemented.
2.Or the logic must make sure that the idle_time is strictly increasing
during each sample period, then there would be no negative idle_time
anymore. This solution requires minimum modification to current code
and this patch uses method 2.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821
Reported-by: Jan Fikar <j.fikar@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Modify mt8173-cpufreq driver to get OPP-sharing information and set up
OPP table provided by operating-points-v2 bindings.
Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notice that the boost_supported field in struct cpufreq_driver is
redundant, because the driver's ->set_boost callback may be left
unset if "boost" is not supported. Moreover, the only driver
populating the ->set_boost callback is acpi_cpufreq, so make it
avoid populating that callback if "boost" is not supported, rework
the core to check ->set_boost instead of boost_supported to
verify "boost" support and drop boost_supported which isn't
used any more.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The store_boost() routine is only used by store_cpb(), so move
the code from it directly to that function and rename _store_boost()
to set_boost() to make its name reflect the name of the driver
callback pointing to it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
cpufreq_boost_supported() is not used outside of cpufreq.c, so make
it static.
While at it, refactor it as a one-liner (which it really is).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
The cpu_set_cclk() function was only used in a single source file so far.
Indicate this setting also by the corresponding linkage specifier.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The return type "unsigned long" was used by the cpu_set_cclk() function
while the type "int" is provided by the clk_set_rate() function.
Let us make this usage consistent.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The "domain" variable needs to be signed for the error handling to work.
Fixes: 8def31034d (cpufreq: arm_big_little: add SCPI interface driver)
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Long ago, only in the lab, there was OPALv1 and OPALv2. Now there is
just OPALv3, with nobody ever expecting anything on pre-OPALv3 to
be cared about or supported by mainline kernels.
So, let's remove FW_FEATURE_OPALv3 and instead use FW_FEATURE_OPAL
exclusively.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Here are a bunch of small bug fixes for various ARM platforms, nothing
really sticks out this week, most of either fixes bugs in code that was
just added in 4.4, or that has been broken for many years without anyone
noticing.
at91/sama5d2
- fix sama5de hardware setup of sd/mmc interface
- proper selection of pinctrl drivers. PIO4 is necessary for sama5d2
berlin
- fix incorrect clock input for SDIO
exynos
- Fix potential NULL pointer dereference in Exynos PMU driver.
imx
- Fix vf610 SAI clock configuration bug which is discovered by
the newly added master mode support in SAI audio driver.
- Fix buggy L2 cache latency values in vf610 device trees, which may
cause system hang when cpu runs at a higher frequency.
ixp4xx
- fix prototypes for readl/writel functions
ls2080a
- use little-endian register access for GPIO and SDHCI
omap
- Fix clock source for ARM TWD and global timers on am437x
- Always select REGULATOR_FIXED_VOLTAGE for omap2+ instead of
when MACH_OMAP3_PANDORA is selected
- Fix SPI DMA handles for dm816x as only some were mapped
- Fix up mbox cells for dm816x to make mailbox usable
pxa
- use PWM lookup table for all ezx machines
s3c24xx
- Remove incorrect __init annotation from s3c24xx cpufreq driver structures.
versatile
- fix PCI IRQ mapping on Versatile PB
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAVmyQMWCrR//JCVInAQIIDA//VyJ2UoTJ2JC3thVP56P/ZXh7Pz8VDqnq
cgoFUio27IeHPSgs+W9qWliOrb+LaXkuOl8CKgepm+Bv7j8Y+uryP4X2rKQ3ZRmy
2f5+uUqAIZ0Co2aJdtG395lY9TKNHl6cPEskcbgL7cjdgj7QBqfIyj22QZbj6yRp
kp8pj+cKXBFRLa5PvePon2w03MA/bLaP30VzKCSL1zchcs52rxekU694V3ISNa63
eshyyKf354Sl9hP4Y8xCdl/mboymKzQxEGDQS/Fcb8h/OQ3djoh+7EKdVbdyZ2A7
phgfazd2aE7wQ5GVIkMNV/MzGHj9xpiD4Z1Hi/2E8WdzuXJTRicS4bJihRAIualt
H1FOEdgqT+xS4JUYxAvl46fwwqcFJfixtGgKka27sJTtk+Y1kHjASWvueZKlHMIK
ln9CF7PoecF0InQaY2N8Vy05Qcp5MuoB/0v+XlftI0sAtIXNeo142H2NQZCsO+1U
bJDyb5E4z06jzqk7IOK4/AKyEAV9KZPDws+ZxcNH/faPT10epK7MeZdetbD7b8q3
pkY7s5iXV8uBox7FtHoamrlMFgAzN9Qh0E4bcw70aKaJZZ02ozTXCvJIKjoIPMne
FsvidQToznqbA2RSXpxRQrcXrMxvURaPCRBe7CxrCoynmhIxd4UHND2HJ4OG645z
4SAGOzOlZKM=
=fgEd
-----END PGP SIGNATURE-----
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"Here are a bunch of small bug fixes for various ARM platforms, nothing
really sticks out this week, most of either fixes bugs in code that
was just added in 4.4, or that has been broken for many years without
anyone noticing.
at91/sama5d2:
- fix sama5de hardware setup of sd/mmc interface
- proper selection of pinctrl drivers. PIO4 is necessary for sama5d2
berlin:
- fix incorrect clock input for SDIO
exynos:
- Fix potential NULL pointer dereference in Exynos PMU driver.
imx:
- Fix vf610 SAI clock configuration bug which is discovered by the
newly added master mode support in SAI audio driver.
- Fix buggy L2 cache latency values in vf610 device trees, which may
cause system hang when cpu runs at a higher frequency.
ixp4xx:
- fix prototypes for readl/writel functions
ls2080a:
- use little-endian register access for GPIO and SDHCI
omap:
- Fix clock source for ARM TWD and global timers on am437x
- Always select REGULATOR_FIXED_VOLTAGE for omap2+ instead of when
MACH_OMAP3_PANDORA is selected
- Fix SPI DMA handles for dm816x as only some were mapped
- Fix up mbox cells for dm816x to make mailbox usable
pxa:
- use PWM lookup table for all ezx machines
s3c24xx:
- Remove incorrect __init annotation from s3c24xx cpufreq driver
structures.
versatile:
- fix PCI IRQ mapping on Versatile PB"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ls2080a/dts: Add little endian property for GPIO IP block
dt-bindings: define little-endian property for QorIQ GPIO
ARM64: dts: ls2080a: fix eSDHC endianness
ARM: dts: vf610: use reset values for L2 cache latencies
ARM: pxa: use PWM lookup table for all machines
ARM: dts: berlin: add 2nd clock for BG2Q sdhci0 and sdhci1
ARM: dts: berlin: correct BG2Q's sdhci2 2nd clock
ARM: dts: am4372: fix clock source for arm twd and global timers
ARM: at91: fix pinctrl driver selection
ARM: at91/dt: add always-on to 1.8V regulator
ARM: dts: vf610: fix clock definition for SAI2
ARM: imx: clk-vf610: fix SAI clock tree
ARM: ixp4xx: fix read{b,w,l} return types
irqchip/versatile-fpga: Fix PCI IRQ mapping on Versatile PB
ARM: OMAP2+: enable REGULATOR_FIXED_VOLTAGE
ARM: dts: add dm816x missing spi DT dma handles
ARM: dts: add dm816x missing #mbox-cells
cpufreq: s3c24xx: Do not mark s3c2410_plls_add as __init
ARM: EXYNOS: Fix potential NULL pointer access in exynos_sys_powerdown_conf
The bootloader is charged with the responsibility to provide platform
specific Dynamic Voltage and Frequency Scaling (DVFS) information via
Device Tree. This driver takes the supplied configuration and
registers it with the new generic OPP framework, to then be used with
CPUFreq.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
785ee27 ("cpufreq: intel_pstate: Fix limits->max_perf rounding error")
hardcodes the value of FRAC_BITS. This patch fixes that minor issue.
Fixes: 785ee27881 (cpufreq: intel_pstate: Fix limits->max_perf rounding error)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since the return value of ->init() of cpufreq driver is not propagated
to the device driver model now, move resources allocation into
->probe() to handle -EPROBE_DEFER properly.
Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This driver is the only one that calls regulator_sync_voltage(), but it
can currently be built with CONFIG_REGULATOR disabled, producing
this build error:
drivers/cpufreq/tegra124-cpufreq.c: In function 'tegra124_cpu_switch_to_pllx':
drivers/cpufreq/tegra124-cpufreq.c:68:2: error: implicit declaration of function 'regulator_sync_voltage' [-Werror=implicit-function-declaration]
regulator_sync_voltage(priv->vdd_cpu_reg);
My first attempt was to implement a helper for this function
for regulator_sync_voltage, but Mark Brown explained:
We don't do this for *all* regulator API functions - there's some where
using them strongly suggests that there is actually a dependency on
the regulator API. This does seem like it might be falling into the
specialist category [...]
Looking at the code I'm pretty unclear on what the authors think the
use of _sync_voltage() is doing in the first place so it may be even
better to just remove the call. It seems to have been included in the
first commit so there's not changelog explaining things and there's
no comment either. I'd *expect* it to be a noop as far as I can see.
This adds the dependency to make the driver always build successfully
or not be enabled at all. Alternatively, we could investigate if the
driver should stop calling regulator_sync_voltage instead.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In cases where we have many IOs, the global load becomes low and the
load algorithm will decrease the requested P-State. Because of that,
the IOs overheads will increase and impact the IO performances.
To improve IO bound work, we can count the io-wait time as busy time
in calculating CPU busy.
This change uses get_cpu_iowait_time_us() to obtain the IO wait time value
and converts time into number of cycles spent waiting on IO at the TSC
rate. At the moment, this trick is only used for Atom.
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current function to calculate cpu utilization uses the average P-state
ratio (APerf/Mperf) scaled by the ratio of the current P-state to the
max available non-turbo one. This leads to an overestimation of
utilization which causes higher-performance P-states to be selected more
often and that leads to increased energy consumption.
This is a problem for low-power systems, so it is better to use a
different utilization calculation algorithm for them.
Namely, the Percent Busy value (or load) can be estimated as the ratio of the
MPERF counter that runs at a constant rate only during active periods (C0) to
the time stamp counter (TSC) that also runs (at the same rate) during idle.
That is:
Percent Busy = 100 * (delta_mperf / delta_tsc)
Use this algorithm for platforms with SoCs based on the Airmont and Silvermont
Atom cores.
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Target systems using different cpus have different power and performance
requirements. They may use different algorithms to get the next P-state
based on their power or performance preference.
For example, power-constrained systems may not want to use
high-performance P-states as aggressively as a full-size desktop or a
server platform. A server platform may want to run close to the max to
achieve better performance, while laptop-like systems may prefer
sacrificing performance for longer battery lifes.
For the above reasons, modify intel_pstate to allow the target P-state
selection algorithm to be depend on the CPU ID.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Philippe Longepe <philippe.longepe@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Sometimes regulator_get_voltage() call returns negative values for
reasons(e.g. underlying I2C bus timeout). Add check for the return
values and fail out early.
Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Remove redundant regulator_get_voltage() call to get Vsram value
since it will be obtained later at the beginning of voltage tracking
loop.
Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add CPUFREQ_HAVE_GOVERNOR_PER_POLICY to have individual set of tunables
for each cluster of MT8173.
Signed-off-by: Pi-Cheng Chen <pi-cheng.chen@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Register the qoriq cpufreq driver as a cooling device, based on the
thermal device tree framework. When temperature crosses the passive trip
point cpufreq is used to throttle CPUs.
Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpufreq documentation specifies
policy->cpuinfo.transition_latency the time it takes on this CPU to
switch between two frequencies in
nanoseconds (if appropriate, else
specify CPUFREQ_ETERNAL)
currently pcc-cpufreq does not expose the value and sets it to zero. I
changed the pcc-cpufreq driver and it's documentation to conform to the
default value specified in Documentation/cpu-freq/cpu-drivers.txt
Signed-off-by: Jacob Tanenbaum <jtanenba@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Register passive cooling devices when initialising cpufreq on
big.LITTLE systems. If the device tree provides a dynamic power
coefficient for the CPUs then the bound cooling device will support
the extensions that allow it to be used with all the existing thermal
governors including the power allocator governor.
A cooling device will be created per individual frequency domain and
can be bound to thermal zones via the thermal DT bindings.
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Support registering cooling devices with dynamic power coefficient
where provided by the device tree. This allows OF registered cooling
devices driver to be used with the power_allocator thermal governor.
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is possible to get rid of the timer_lock spinlock used by the
governor timer function for synchronization, but a couple of races
need to be avoided.
The first race is between multiple dbs_timer_handler() instances
that may be running in parallel with each other on different
CPUs. Namely, one of them has to queue up the work item, but it
cannot be queued up more than once. To achieve that,
atomic_inc_return() can be used on the skip_work field of
struct cpu_common_dbs_info.
The second race is between an already running dbs_timer_handler()
and gov_cancel_work(). In that case the dbs_timer_handler() might
not notice the skip_work incrementation in gov_cancel_work() and
it might queue up its work item after gov_cancel_work() had
returned (and that work item would corrupt skip_work going
forward). To prevent that from happening, gov_cancel_work()
can be made wait for the timer function to complete (on all CPUs)
right after skip_work has been incremented.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Currently update_sampling_rate() runs over each online CPU and
cancels/queues timers on all policy->cpus every time. This should be
done just once for any cpu belonging to a policy.
Create a cpumask and keep on clearing it as and when we process
policies, so that we don't have to traverse through all CPUs of the same
policy.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq governors evaluate load at sampling rate and based on that they
update frequency for a group of CPUs belonging to the same cpufreq
policy.
This is required to be done in a single thread for all policy->cpus, but
because we don't want to wakeup idle CPUs to do just that, we use
deferrable work for this. If we would have used a single delayed
deferrable work for the entire policy, there were chances that the CPU
required to run the handler can be in idle and we might end up not
changing the frequency for the entire group with load variations.
And so we were forced to keep per-cpu works, and only the one that
expires first need to do the real work and others are rescheduled for
next sampling time.
We have been using the more complex solution until now, where we used a
delayed deferrable work for this, which is a combination of a timer and
a work.
This could be made lightweight by keeping per-cpu deferred timers with a
single work item, which is scheduled by the first timer that expires.
This patch does just that and here are important changes:
- The timer handler will run in irq context and so we need to use a
spin_lock instead of the timer_mutex. And so a separate timer_lock is
created. This also makes the use of the mutex and lock quite clear, as
we know what exactly they are protecting.
- A new field 'skip_work' is added to track when the timer handlers can
queue a work. More comments present in code.
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
timer_mutex is required to be initialized only while memory for 'shared'
is allocated and in a similar way it is required to be destroyed only
when memory for 'shared' is freed.
There is no need to do the same every time we start/stop the governor.
Move code to initialize/destroy timer_mutex to the relevant places.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pass 'policy' as argument to ->gov_dbs_timer() instead of cdbs and
dbs_data.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We are guaranteed to have works scheduled for policy->cpus, as the
policy isn't stopped yet. And so there is no need to check that again.
Drop it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We are comparing policy->governor against cpufreq_gov_ondemand to make
sure that we update sampling rate only for the concerned CPUs. But that
isn't enough.
In case of governor_per_policy, there can be multiple instances of
ondemand governor and we will always end up updating all of them with
current code. What we rather need to do, is to compare dbs_data with
poilcy->governor_data, which will match only for the policies governed
by dbs_data.
This code is also racy as the governor might be getting stopped at that
time and we may end up scheduling work for a policy, which we have just
disabled.
Fix that by protecting the entire function with &od_dbs_cdata.mutex,
which will prevent against races with policy START/STOP/etc.
After these locks are in place, we can safely get the policy via per-cpu
dbs_info.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-domains:
PM / Domains: Fix bad of_node_put() in failure paths of genpd_dev_pm_attach()
PM / Domains: Validate cases of a non-bound driver in genpd governor
* pm-cpufreq:
cpufreq: use last policy after online for drivers with ->setpolicy
For cpufreq drivers which use setpolicy interface, after offline->online
the policy is set to default. This can be reproduced by setting the
default policy of intel_pstate or longrun to ondemand and then change to
"performance". After offline and online, the setpolicy will be called with
the policy=ondemand.
For drivers using governors this condition is handled by storing
last_governor, during offline and restoring during online. The same should
be done for drivers using setpolicy interface. Storing last_policy during
offline and restoring during online.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
s3c2410_plls_add is a device notifier that may be called at runtime and
is correctly not marked __init. However it calls s3c_plltab_register()
which is marked __init, and that triggers a build error when we are
checking for section mismatches:
WARNING: vmlinux.o(.text+0x195e0): Section mismatch in reference from the function s3c2410_plls_add() to the function .init.text:s3c_plltab_register()
The function s3c2410_plls_add() references
the function __init s3c_plltab_register().
This is often because s3c2410_plls_add lacks a __init
annotation or the annotation of s3c_plltab_register is wrong.
This removes the __init annotation from s3c2410_plls_add as well as the
__initdata section annotations from s3c2440_plls_12 and s3c2440_plls_169344,
which in turn are referenced from s3c2410_plls_add.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
If hardware-driven P-state selection (HWP) is enabled, the
"performance" mode of intel_pstate should only allow the processor
to use the highest-performance P-state available. That is not
the case currently, so make it actually happen.
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Alexandra Yates <alexandra.yates@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The SCPI clk driver registers the virtual cpufreq device that kicks off
initialisation of the SCPI cpufreq driver. Also, clk_get() will fail for
the cpufreq driver if the SCPI clk driver is missing.
Fix this by making the SCPI cpufreq driver explicitly depend on the SCPI
clk driver.
Fixes: 8def31034d (cpufreq: arm_big_little: add SCPI interface driver)
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A rounding error was found in the calculation of limits->max_perf
in intel_pstate_set_policy(), which is used to calculate the max and min
pstate values in intel_pstate_get_min_max(). In that code,
limits->max_perf is truncated to 2 hex digits such that, for example,
0x169 was incorrectly calculated to 0x16 instead of 0x17. This resulted in
the pstate being set one level too low. This patch rounds the value of
limits->max_perf up instead of down so that the correct max pstate can
be reached.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
I have a Intel (6,63) processor with a "marketing" frequency (from
/proc/cpuinfo) of 2100MHz, and a max turbo frequency of 2600MHz. I
can execute
cpupower frequency-set -g powersave --min 1200MHz --max 2100MHz
and the max_freq_pct is set to 80. When adding load to the system I noticed
that the cpu frequency only reached 2000MHZ and not 2100MHz as expected.
This is because limits->max_policy_pct is calculated as 2100 * 100 /2600 = 80.7
and is rounded down to 80 when it should be rounded up to 81. This patch
adds a DIV_ROUND_UP() which will return the correct value.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subsys interface's ->remove_dev() is called when the cpufreq driver is
unregistering or the CPU is getting physically removed. We keep removing
the cpuX/cpufreq link for all CPUs except the last one, which is a
mistake as all CPUs contain a link now.
Because of this, one CPU from each policy will still contain a link (to
an already removed policyX directory), after the cpufreq driver is
unregistered.
Fix that by removing the link first and then only see if the policy is
required to be freed. That will make sure that no links are left out.
Fixes: 96bdda61f5 ("cpufreq: create cpu/cpufreq/policyX directories")
Reported-and-tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The CPU policy struct indicates the co-ordination type
for all CPUs of a common freq domain. Initialize it
correctly using the CPU specific data gathered from
CPPC ACPI lib via acpi_get_psd_map().
The PSD object is optional, so the cpu->shared_type
can also be 0. So instead of assuming any value other
than SW_ANY(0xFD) is unsupported, explictly check
if shared_type is SW_ALL and then bail.
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq:
Revert "Documentation: kernel_parameters for Intel P state driver"
cpufreq: mediatek: fix build error
cpufreq: intel_pstate: Add separate support for Airmont cores
cpufreq: intel_pstate: Replace BYT with ATOM
Revert "cpufreq: intel_pstate: Use ACPI perf configuration"
Revert "cpufreq: intel_pstate: Avoid calculation for max/min"
* acpi-cppc:
ACPI / CPPC: Use h/w reduced version of the PCCT structure
The recently added mt8173 cpufreq driver relies on the cpu topology
that is always present on ARM64 but optional on ARM32:
drivers/cpufreq/mt8173-cpufreq.c: In function 'mtk_cpufreq_init':
drivers/cpufreq/mt8173-cpufreq.c:441:30: error: 'cpu_topology' undeclared (first use in this function)
cpumask_copy(policy->cpus, &cpu_topology[policy->cpu].core_sibling);
This refines the Kconfig dependencies so that we can still build on
ARM32, but only if COMPILE_TEST is selected and the CPU topology
code is present.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are two flavors of Atom cores to be supported by intel_pstate,
Silvermont and Airmont, so make the driver distinguish between them by
adding separate frequency tables.
Separate the CPU defaults params for each of them and match the CPU IDs
against them as appropriate.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Rename symbol and function names starting with "BYT" or "byt" to
start with "ATOM" or "atom", respectively, so as to make it clear
that they may apply to Atom in general and not just to Baytrail
(the goal is to support several Atoms architectures eventually).
This should not lead to any functional changes.
Signed-off-by: Philippe Longepe <philippe.longepe@linux.intel.com>
Signed-off-by: Stephane Gasparini <stephane.gasparini@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw : Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit 37afb00032 (cpufreq: intel_pstate: Use ACPI perf
configuration) that is reported to cause a regression to happen
on a system where invalid data are returned by the ACPI _PSS object.
Since that commit makes assumptions regarding the _PSS output
correctness that may turn out to be overly optimistic in general,
there is a concern that it may introduce regression on more
systems, so it's better to revert it now and we'll revisit the
underlying issue in the next cycle with a more robust solution.
Conflicts:
drivers/cpufreq/intel_pstate.c
Fixes: 37afb00032 (cpufreq: intel_pstate: Use ACPI perf configuration)
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Revert commit 4ef4514870 (cpufreq: intel_pstate: Avoid calculation for
max/min) as it depends on commit 37afb00032 (cpufreq: intel_pstate: Use
ACPI perf configuration) that causes problems to happen and needs to be
reverted.
Conflicts:
drivers/cpufreq/intel_pstate.c
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Support for the ACPI _CCA configuration object intended to tell
the OS whether or not a bus master device supports hardware
managed cache coherency and a new set of functions to allow
drivers to check the cache coherency support for devices in a
platform firmware interface agnostic way (Suravee Suthikulpanit,
Jeremy Linton).
- ACPI backlight quirks for ESPRIMO Mobile M9410 and Dell XPS L421X
(Aaron Lu, Hans de Goede).
- Fixes for the arm_big_little and s5pv210-cpufreq cpufreq drivers
(Jon Medhurst, Nicolas Pitre).
- kfree()-related fixup for the recently introduced CPPC cpufreq
frontend (Markus Elfring).
- intel_pstate fix reducing kernel log noise on systems where
P-states are managed by hardware (Prarit Bhargava).
- intel_pstate maintainers information update (Srinivas Pandruvada).
- cpufreq core optimization related to the handling of delayed work
items used by governors (Viresh Kumar).
- Locking fixes and cleanups of the Operating Performance Points
(OPP) framework (Viresh Kumar).
- Generic power domains framework cleanups (Lina Iyer).
- cpupower tool updates (Jacob Tanenbaum, Sriram Raghunathan,
Thomas Renninger).
- turbostat tool updates (Len Brown).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJWQ96OAAoJEILEb/54YlRxyYYQALJ1HXu76SvYX1re2aawOw6Y
WgzF3Ly7JX034E1VvA2xP6wgkWpBRBDcpnRDeltNA4dYXPBDei/eTcRZTLX12N3g
AfFRGjGWTtLJfpNPecNMmUyF5xHjgDgMIQRabY+Is5NfP5STkPHJeqULnEpvTtx8
bd0lnC5jc4vuZiPEh1xVb+ClYDqWS8YQPyFJVjV/BaIf8Qwe5+oRX36byMBaKc9D
ZgmvmCk5n/HLQQ1uQsqe4xnhFLHN2rypt2BLvFrOtlnSz9VNNpQyB+OIW1mgCD4f
LhpKIwjP8NhZNQUq8HFu7nDlm8ciQtWmeMPB5NdGQ+OESu7yfKAOzQ+3U6Gl2Gaf
66zVGyV6SOJJwfDVJ3qKTtroWps9QV7ZClOJ+zJGgiujwU+tJ3pDQyZM9pa7CL3C
s7ZAUsI6IigSBjD3nJVOyG4DO0a8KQFCIE1mDmyqId45Qz8xJoOrYP33/ZnDuOdo
2OtL/emyfWsz9ixbHVfwIhb7EC6aoaUxQrhSWmNraaQS43YfioZR7h4we8gwenph
X4E1KY4SdML+uFf2VKIcd45NM3IBprCxx5UgFAJdrqe8+otqPNF2AVosG4iqhg/b
k4nxwuIvw2a8Fm77U9ytyXDYMItU/wIlAHMbnmgx+oTwRv6AbZ07MHkyfuQLYuhD
tq5Y14qSiTS7prNacx98
=XZiP
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.4-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"The only new feature in this batch is support for the ACPI _CCA device
configuration object, which it a pre-requisite for future ACPI PCI
support on ARM64, but should not affect the other architectures.
The rest is fixes and cleanups, mostly in cpufreq (including
intel_pstate), the Operating Performace Points (OPP) framework and
tools (cpupower and turbostat).
Specifics:
- Support for the ACPI _CCA configuration object intended to tell the
OS whether or not a bus master device supports hardware managed
cache coherency and a new set of functions to allow drivers to
check the cache coherency support for devices in a platform
firmware interface agnostic way (Suravee Suthikulpanit, Jeremy
Linton).
- ACPI backlight quirks for ESPRIMO Mobile M9410 and Dell XPS L421X
(Aaron Lu, Hans de Goede).
- Fixes for the arm_big_little and s5pv210-cpufreq cpufreq drivers
(Jon Medhurst, Nicolas Pitre).
- kfree()-related fixup for the recently introduced CPPC cpufreq
frontend (Markus Elfring).
- intel_pstate fix reducing kernel log noise on systems where
P-states are managed by hardware (Prarit Bhargava).
- intel_pstate maintainers information update (Srinivas Pandruvada).
- cpufreq core optimization related to the handling of delayed work
items used by governors (Viresh Kumar).
- Locking fixes and cleanups of the Operating Performance Points
(OPP) framework (Viresh Kumar).
- Generic power domains framework cleanups (Lina Iyer).
- cpupower tool updates (Jacob Tanenbaum, Sriram Raghunathan, Thomas
Renninger).
- turbostat tool updates (Len Brown)"
* tag 'pm+acpi-4.4-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
PCI: ACPI: Add support for PCI device DMA coherency
PCI: OF: Move of_pci_dma_configure() to pci_dma_configure()
of/pci: Fix pci_get_host_bridge_device leak
device property: ACPI: Remove unused DMA APIs
device property: ACPI: Make use of the new DMA Attribute APIs
device property: Adding DMA Attribute APIs for Generic Devices
ACPI: Adding DMA Attribute APIs for ACPI Device
device property: Introducing enum dev_dma_attr
ACPI: Honor ACPI _CCA attribute setting
cpufreq: CPPC: Delete an unnecessary check before the function call kfree()
PM / OPP: Add opp_rcu_lockdep_assert() to _find_device_opp()
PM / OPP: Hold dev_opp_list_lock for writers
PM / OPP: Protect updates to list_dev with mutex
PM / OPP: Propagate error properly from dev_pm_opp_set_sharing_cpus()
cpufreq: s5pv210-cpufreq: fix wrong do_div() usage
MAINTAINERS: update for intel P-state driver
Creating a common structure initialization pattern for struct option
cpupower: Enable disabled Cstates if they are below max latency
cpupower: Remove debug message when using cpupower idle-set -D switch
cpupower: cpupower monitor reports uninitialized values for offline cpus
...
As we've enabled multiplatform kernels on ARM, and greatly done away with
the contents under arch/arm/mach-*, there's still need for SoC-related
drivers to go somewhere.
Many of them go in through other driver trees, but we still have
drivers/soc to hold some of the "doesn't fit anywhere" lowlevel code
that might be shared between ARM and ARM64 (or just in general makes
sense to not have under the architecture directory).
This branch contains mostly such code:
- Drivers for qualcomm SoCs for SMEM, SMD and SMD-RPM, used to communicate
with power management blocks on these SoCs for use by clock, regulator and
bus frequency drivers.
- Allwinner Reduced Serial Bus driver, again used to communicate with PMICs.
- Drivers for ARM's SCPI (System Control Processor). Not to be confused with
PSCI (Power State Coordination Interface). SCPI is used to communicate with
the assistant embedded cores doing power management, and we have yet to see
how many of them will implement this for their hardware vs abstracting in
other ways (or not at all like in the past).
- To make confusion between SCPI and PSCI more likely, this release also
includes an update of PSCI to interface version 1.0.
- Rockchip support for power domains.
- A driver to talk to the firmware on Raspberry Pi.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWQC+cAAoJEIwa5zzehBx3jEUP/0GpxfDVanEUkudVLLe7J0RH
CNlRan107Cw6hXRUJo7elEsuCALjccXjc1CAH4+RnNpOAeBKW97n+WU7trTv+wUZ
sQX4SkBPKFBlgwGF2qhsi5q74gms/BrgtCa4kNb9joOYso039tlfIOPzK80DMkOm
TkyIJdUCgFJMjCQLhX6kGT0PDcrbIjb6aA2cF3FAVeaJA7uz8lNe/eHJr3oHxIEY
CvC651yJ2mIHQUU4BJx/AJo+wXg3dRUXNCAtBjwLRPEAzduYZXYm1ZTVIby/1q9r
dR2KDFEuibODXmXrDBzKNJwCu/TLJEwo/1oPaEIVfY91XLKfiWUhgVqa1o1I+d9U
XoGPibCW461qFahjQW87MfInALpCOA7/RbTNjFp+MVyipCYvkaYq7KFiYEldgFDx
z4Qx/J4hYc2TlDWrpNiUCZMfmhwi7y+Ib+tnenYTO1eyMuw0e9mfnVdjk5iU3Pvk
Ye4qPqpYclJruyHbYi164878+1lLaW2NCUgC3rkBO/GWPAzp7d9iLWoZ3PuyD5i5
PEjs668UcRdZYbI4rdrhGHL8Eq9Gnuc4Rthu7HxPOK+DG0XgP8r97PhM8aYGYVDO
+yikBtjWRsA9fPj3rMKA3UsQ61DAeR9LmZ0XPGjWFMCjCG0JlUoIMaA+Uu0i8fr8
95qxBVxbO7rhL39r1rhV
=dm+I
-----END PGP SIGNATURE-----
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC driver updates from Olof Johansson:
"As we've enabled multiplatform kernels on ARM, and greatly done away
with the contents under arch/arm/mach-*, there's still need for
SoC-related drivers to go somewhere.
Many of them go in through other driver trees, but we still have
drivers/soc to hold some of the "doesn't fit anywhere" lowlevel code
that might be shared between ARM and ARM64 (or just in general makes
sense to not have under the architecture directory).
This branch contains mostly such code:
- Drivers for qualcomm SoCs for SMEM, SMD and SMD-RPM, used to
communicate with power management blocks on these SoCs for use by
clock, regulator and bus frequency drivers.
- Allwinner Reduced Serial Bus driver, again used to communicate with
PMICs.
- Drivers for ARM's SCPI (System Control Processor). Not to be
confused with PSCI (Power State Coordination Interface). SCPI is
used to communicate with the assistant embedded cores doing power
management, and we have yet to see how many of them will implement
this for their hardware vs abstracting in other ways (or not at all
like in the past).
- To make confusion between SCPI and PSCI more likely, this release
also includes an update of PSCI to interface version 1.0.
- Rockchip support for power domains.
- A driver to talk to the firmware on Raspberry Pi"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (57 commits)
soc: qcom: smd-rpm: Correct size of outgoing message
bus: sunxi-rsb: Add driver for Allwinner Reduced Serial Bus
bus: sunxi-rsb: Add Allwinner Reduced Serial Bus (RSB) controller bindings
ARM: bcm2835: add mutual inclusion protection
drivers: psci: make PSCI 1.0 functions initialization version dependent
dt-bindings: Correct paths in Rockchip power domains binding document
soc: rockchip: power-domain: don't try to print the clock name in error case
soc: qcom/smem: add HWSPINLOCK dependency
clk: berlin: add cpuclk
ARM: berlin: dts: add CLKID_CPU for BG2Q
ARM: bcm2835: Add the Raspberry Pi firmware driver
soc: qcom: smem: Move RPM message ram out of smem DT node
soc: qcom: smd-rpm: Correct the active vs sleep state flagging
soc: qcom: smd: delete unneeded of_node_put
firmware: qcom-scm: build for correct architecture level
soc: qcom: smd: Correct SMEM items for upper channels
qcom-scm: add missing prototype for qcom_scm_is_available()
qcom-scm: fix endianess issue in __qcom_scm_is_call_available
soc: qcom: smd: Reject send of too big packets
soc: qcom: smd: Handle big endian CPUs
...
* pm-cpufreq:
cpufreq: s5pv210-cpufreq: fix wrong do_div() usage
MAINTAINERS: update for intel P-state driver
cpufreq: governor: Quit work-handlers early if governor is stopped
intel_pstate: decrease number of "HWP enabled" messages
cpufreq: arm_big_little: fix frequency check when bL switcher is active
* acpi-video:
ACPI / video: only register backlight for LCD device
ACPI / video: Add a quirk to force acpi-video backlight on Dell XPS L421X
* acpi-cppc:
cpufreq: CPPC: Delete an unnecessary check before the function call kfree()
The kfree() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is wrong to use do_div() with 32-bit dividends (unsigned long is
32 bits on 32-bit architectures).
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
gov_queue_work() acquires cpufreq_governor_lock to allow
cpufreq_governor_stop() to drain delayed work items possibly scheduled
on CPUs that share the policy with a CPU being taken offline.
However, the same goal may be achieved in a more straightforward way if
the policy pointer in the struct cpu_dbs_info matching the policy CPU is
reset upfront by cpufreq_governor_stop() under the timer_mutex belonging
to it and checked against NULL, under the same lock, at the beginning of
dbs_timer().
In that case every instance of dbs_timer() run for a struct cpu_dbs_info
sharing the policy pointer in question after cpufreq_governor_stop() has
started will notice that that pointer is NULL and bail out immediately
without queuing up any new work items. In turn, gov_cancel_work()
called by cpufreq_governor_stop() before destroying timer_mutex will
wait for all of the delayed work items currently running on the CPUs
sharing the policy to drop the mutex, so it may be destroyed safely.
Make cpufreq_governor_stop() and dbs_timer() work as described and
modify gov_queue_work() so it does not acquire cpufreq_governor_lock any
more.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When booting an HWP enabled system the kernel displays one "HWP enabled"
message for each cpu. The messages are superfluous since HWP is globally
enabled across all CPUs. This patch also adds an informational message
when HWP is disabled via intel_pstate=no_hwp.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The check for correct frequency being set in bL_cpufreq_set_rate is
broken when the big.LITTLE switcher is active, for two reasons.
1. The 'new_rate' variable gets overwritten before the test by the
code calculating the frequency of the old cluster.
2. The frequency returned by bL_cpufreq_get_rate will be the virtual
frequency, not the actual one the intended version of new_rate contains.
This means the function always returns an error causing an endless
stream of: "cpufreq: __target_index: Failed to change cpu frequency: -5"
As the intent is to check for errors that clk_set_rate doesn't report
lets move the check to immediately after that and directly use
clk_get_rate, rather than the arm_big_little helpers which only confuse
matters. Also, update the comment to be hopefully clearer about the
purpose of the code.
Fixes: 0a95e630b4 (cpufreq: arm_big_little: check if the frequency is set correctly)
Signed-off-by: Jon Medhurst <tixy@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-opp:
PM / OPP: passing NULL to PTR_ERR()
PM / OPP: Move cpu specific code to opp/cpu.c
PM / OPP: Move opp core to its own directory
PM / OPP: Prefix exported opp routines with dev_pm_opp_
PM / OPP: Rename opp init/free table routines
PM / OPP: reuse of_parse_phandle()