License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
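The distinction between header and source files mentioned above can be sketched as follows. This is only an illustration, not the script that Thomas and Greg used: the helper names `has_suffix` and `spdx_line` are hypothetical, and the rule assumed here is that .c files take a `//` comment, .h files a `/* */` comment, and everything else a `#` comment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return nonzero if `path` ends with `suffix`. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path), slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/*
 * Format the SPDX tag line for `path` into `buf` (hypothetical helper).
 * Assumption: .c files take a C++-style comment, .h files a C-style
 * comment, and anything else (Makefile, Kconfig, scripts) a '#' comment.
 * Returns what snprintf() returns.
 */
static int spdx_line(const char *path, const char *license,
		     char *buf, size_t len)
{
	assert(path && license && buf);

	if (has_suffix(path, ".c"))
		return snprintf(buf, len, "// SPDX-License-Identifier: %s",
				license);
	if (has_suffix(path, ".h"))
		return snprintf(buf, len, "/* SPDX-License-Identifier: %s */",
				license);
	return snprintf(buf, len, "# SPDX-License-Identifier: %s", license);
}
```

Called with a .c path and "GPL-2.0", the helper produces the same first line that this patch added to cpuidle-pseries.c.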
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:07:57 +08:00
// SPDX-License-Identifier: GPL-2.0
2011-11-30 10:46:42 +08:00
/*
2014-01-14 18:56:02 +08:00
 * cpuidle-pseries - idle state cpuidle driver.
2011-11-30 10:46:42 +08:00
 * Adapted from drivers/idle/intel_idle.c and
 * drivers/acpi/processor_idle.c
 *
 */

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/cpuidle.h>
#include <linux/cpu.h>
2012-05-21 02:34:27 +08:00
#include <linux/notifier.h>
2011-11-30 10:46:42 +08:00

#include <asm/paca.h>
#include <asm/reg.h>
#include <asm/machdep.h>
#include <asm/firmware.h>
2014-02-12 12:48:45 +08:00
#include <asm/runlatch.h>
2020-04-07 16:47:39 +08:00
#include <asm/idle.h>
2013-08-22 17:53:52 +08:00
#include <asm/plpar_wrappers.h>
cpuidle: pseries: Add function to parse extended CEDE records
Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.
The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.
This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.
dmesg output on POWER8 and POWER9 LPARs, showing the parsed extended
CEDE latency parameters, is as follows:
POWER8
[ 10.093279] xcede : xcede_record_size = 10
[ 10.093285] xcede : Record 0 : hint = 1, latency = 0x3c00 tb ticks, Wake-on-irq = 1
[ 10.093291] xcede : Record 1 : hint = 2, latency = 0x4e2000 tb ticks, Wake-on-irq = 0
[ 10.093297] cpuidle : Skipping the 2 Extended CEDE idle states
POWER9
[ 5.913180] xcede : xcede_record_size = 10
[ 5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb ticks, Wake-on-irq = 1
[ 5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb ticks, Wake-on-irq = 0
[ 5.913193] cpuidle : Skipping the 2 Extended CEDE idle states
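The record parsing described above can be sketched in plain C. This is a user-space illustration under stated assumptions, not the function added by the patch: it assumes each 10-byte record (matching xcede_record_size = 10 in the logs) holds a 1-byte hint, an 8-byte big-endian latency in timebase ticks, and a 1-byte wake-on-irq flag. The names `struct xcede_record`, `read_be64` and `parse_xcede_records` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XCEDE_RECORD_SIZE 10	/* matches xcede_record_size in the logs */

/* Hypothetical in-memory form of one extended CEDE record. */
struct xcede_record {
	uint8_t  hint;
	uint64_t latency_tb_ticks;
	uint8_t  wake_on_irq;
};

/* Read an 8-byte big-endian value regardless of host endianness. */
static uint64_t read_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Parse up to `max` records from `buf`. Assumed layout per record:
 * byte 0 = hint, bytes 1..8 = latency (big-endian), byte 9 = wake-on-irq.
 * Returns the number of complete records parsed.
 */
static size_t parse_xcede_records(const uint8_t *buf, size_t len,
				  struct xcede_record *out, size_t max)
{
	size_t n = 0;

	assert(buf && out);
	while (n < max && len >= XCEDE_RECORD_SIZE) {
		out[n].hint = buf[0];
		out[n].latency_tb_ticks = read_be64(buf + 1);
		out[n].wake_on_irq = buf[9];
		buf += XCEDE_RECORD_SIZE;
		len -= XCEDE_RECORD_SIZE;
		n++;
	}
	return n;
}
```

Passing `max` as 16 mirrors the note below about making space for 16 records; a trailing partial record is ignored rather than parsed.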
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[mpe: Make space for 16 records, drop memset, minor cleanup & formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596087177-30329-3-git-send-email-ego@linux.vnet.ibm.com
2020-07-30 13:32:56 +08:00
#include <asm/rtas.h>
2011-11-30 10:46:42 +08:00

2020-07-14 22:24:24 +08:00
static struct cpuidle_driver pseries_idle_driver = {
2013-04-03 20:15:22 +08:00
	.name = "pseries_idle",
	.owner = THIS_MODULE,
2011-11-30 10:46:42 +08:00
};

2017-06-14 21:02:40 +08:00
static int max_idle_state __read_mostly;
static struct cpuidle_state *cpuidle_state_table __read_mostly;
static u64 snooze_timeout __read_mostly;
static bool snooze_timeout_en __read_mostly;
2011-11-30 10:46:42 +08:00

static int snooze_loop(struct cpuidle_device *dev,
			struct cpuidle_driver *drv,
			int index)
{
2015-06-18 19:23:11 +08:00
	u64 snooze_exit_time;
2011-11-30 10:46:42 +08:00

2017-06-14 21:02:39 +08:00
	set_thread_flag(TIF_POLLING_NRFLAG);

powerpc/idle: Store PURR snapshot in a per-cpu global variable
Currently when a CPU goes idle, we take a snapshot of PURR via
pseries_idle_prolog(), which is used at the CPU idle exit to compute
the idle PURR cycles via the function pseries_idle_epilog(). Thus,
the value of the idle PURR cycles read before pseries_idle_prolog() and
after pseries_idle_epilog() is always correct.
However, if we were to read the idle PURR cycles from an interrupt
context between pseries_idle_prolog() and pseries_idle_epilog() (this
will be done in a future patch), then, the value of the idle PURR thus
read will not include the cycles spent in the most recent idle period.
Thus, in that interrupt context, we will need access to the snapshot
of the PURR before going idle, in order to compute the idle PURR
cycles for the latest idle duration.
In this patch, we save the snapshot of PURR in pseries_idle_prolog()
in a per-cpu variable, instead of on the stack, so that it can be
accessed from an interrupt context.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-3-git-send-email-ego@linux.vnet.ibm.com
2020-04-07 16:47:40 +08:00
	pseries_idle_prolog();
cpuidle/powerpc: Fix snooze state problem in the cpuidle design on pseries.
Earlier, without the cpuidle framework on pseries, the native arch
idle routine comprised both snooze and nap
states. The smt_snooze_delay variable was used to delay
the idle process entry to a deeper idle state like nap.
With the coming of cpuidle, this arch-specific idle routine was replaced
by two different idle routines, one supporting snooze and the other
nap. This enabled the addition of more
low-level idle states on pseries in the future.
On adopting the generic cpuidle framework for POWER systems,
the decision of which idle state to choose, given a predicted
idle time, is taken by the menu governor based on the
target_residency and exit_latency of the idle states.
target_residency is the minimum time to be resident in that idle state.
exit_latency is the time taken to exit the idle state.
The deeper the idle state, the higher both the target residency and
the exit latency would be.
In the current design, smt_snooze_delay is used as target_residency
for the snooze state, which is incorrect, as it is not the
minimum but the maximum duration to be in the snooze state.
This would result in the governor taking bad decisions,
as presently target_residency of nap < target_residency of snooze
in spite of nap being the deeper idle state.
This patch aims to fix this problem by replacing the smt_snooze_delay loop
in the snooze state with a need_resched() check, as the governor is aware of
the entry and exit of the various idle transitions on which the
next idle time prediction is based.
The governor is intelligent enough to determine the idle state that needs to
be transitioned to, and maintains a whole set of heuristics including
I/O load, previous idle state predictions etc. for the same, based on
which the idle state entry decision is taken.
With this fix, which sets target_residency of snooze to 0 and
that of nap to smt_snooze_delay,
if the predicted idle time is less
than smt_snooze_delay (target_residency of nap),
the governor would pick the snooze state, else nap. This adheres to the
previous native idle design.
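The selection rule described above can be sketched as a small stand-alone function. This is a deliberate simplification and not the menu governor itself, which also weighs latency limits and its own heuristics. The type `struct idle_state`, the function `pick_state` and the microsecond values used with it are hypothetical, and the table is assumed to be ordered from shallowest to deepest state.

```c
#include <assert.h>

/* Hypothetical, simplified view of an idle state. */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us;		/* time to leave the state */
	unsigned int target_residency_us;	/* minimum worthwhile stay */
};

/*
 * Pick the deepest state whose target residency does not exceed the
 * predicted idle time. Assumes `states` is ordered shallow to deep and
 * that state 0 (snooze) has a target residency of 0, so it always fits.
 */
static int pick_state(const struct idle_state *states, int n,
		      unsigned int predicted_us)
{
	int best = 0;

	assert(states && n > 0);
	for (int i = 1; i < n; i++) {
		if (states[i].target_residency_us <= predicted_us)
			best = i;
	}
	return best;
}
```

With snooze at a target residency of 0 and nap at a hypothetical 100 us, a predicted idle time of 50 us selects snooze and 100 us or more selects nap, which is the behaviour the fix aims for.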
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-10-04 02:42:26 +08:00
	local_irq_enable();
2015-06-18 19:23:11 +08:00
	snooze_exit_time = get_tb() + snooze_timeout;
2011-11-30 10:46:42 +08:00

2014-01-14 18:56:09 +08:00
	while (!need_resched()) {
2012-10-04 02:42:26 +08:00
		HMT_low();
		HMT_very_low();
2017-06-14 21:02:41 +08:00
		if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) {
			/*
			 * Task has not woken up but we are exiting the polling
			 * loop anyway. Require a barrier after polling is
			 * cleared to order subsequent test of need_resched().
			 */
			clear_thread_flag(TIF_POLLING_NRFLAG);
			smp_mb();
2015-06-18 19:23:11 +08:00
			break;
2017-06-14 21:02:41 +08:00
		}
2011-11-30 10:46:42 +08:00
	}

	HMT_medium();
2012-10-04 02:42:26 +08:00
	clear_thread_flag(TIF_POLLING_NRFLAG);

2017-11-17 00:00:52 +08:00
	local_irq_disable();

2020-04-07 16:47:40 +08:00
	pseries_idle_epilog();
2013-04-03 20:15:22 +08:00

2011-11-30 10:46:42 +08:00
	return index;
}

powerpc: Rework lazy-interrupt handling
The current implementation of lazy interrupts handling has some
issues that this tries to address.
We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.
The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.
Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.
This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.
The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.
When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.
We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).
This removes the need to play with the decrementer to try to create
fake interrupts, among others.
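The bookkeeping described above can be illustrated with a toy user-space model. It is not the kernel implementation: there is no exception frame or assembly trampoline here, and every name in it (`struct toy_cpu`, `deliver_irq`, `irq_restore`, the `HAPPENED_*` bits) is hypothetical. It only shows the idea of recording interrupt sources in a bit mask while soft-disabled and replaying them on re-enable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit names for sources recorded while soft-disabled. */
#define HAPPENED_EXTERNAL	0x01u
#define HAPPENED_DECREMENTER	0x02u
#define HAPPENED_DOORBELL	0x04u

struct toy_cpu {
	int soft_enabled;	/* 1 = interrupts logically enabled */
	uint8_t irq_happened;	/* sources that fired while soft-disabled */
	unsigned int handled;	/* number of handler invocations */
};

/* Stand-in for running the real handler of `source`. */
static void handle_irq(struct toy_cpu *cpu, uint8_t source)
{
	(void)source;
	cpu->handled++;
}

/* An interrupt arrives: handle it now, or just note it in the mask. */
static void deliver_irq(struct toy_cpu *cpu, uint8_t source)
{
	assert(cpu);
	if (!cpu->soft_enabled) {
		cpu->irq_happened |= source;
		return;
	}
	handle_irq(cpu, source);
}

/* Set the soft-enable state; on enable, replay whatever was missed. */
static void irq_restore(struct toy_cpu *cpu, int enable)
{
	assert(cpu);
	cpu->soft_enabled = enable;
	if (!enable)
		return;
	for (uint8_t bit = 1; bit <= HAPPENED_DOORBELL; bit <<= 1) {
		if (cpu->irq_happened & bit) {
			cpu->irq_happened &= (uint8_t)~bit;
			handle_irq(cpu, bit);
		}
	}
}
```

Because the mask holds one bit per source, two decrementer interrupts arriving while soft-disabled are replayed as a single one in this model.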
In addition, this adds a few refinements:
- We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.
- Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.
- On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appear to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)
- We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.
Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2:
- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
to retrigger an interrupt without preventing hard-enable
v3:
- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E
v4:
- Fix resend of doorbells on return from interrupt on Book3E
v5:
- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.
v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq
v7:
- Fix a bug with hard irq state tracking on native power7
2012-03-06 15:27:59 +08:00
static void check_and_cede_processor(void)
{
	/*
2012-07-10 16:36:40 +08:00
	 * Ensure our interrupt state is properly tracked,
	 * also checks if no interrupt has occurred while we
	 * were soft-disabled
2012-03-06 15:27:59 +08:00
	 */
2012-07-10 16:36:40 +08:00
	if (prep_irq_for_idle()) {
2012-03-06 15:27:59 +08:00
		cede_processor();
2012-07-10 16:36:40 +08:00
#ifdef CONFIG_TRACE_IRQFLAGS
		/* Ensure that H_CEDE returns with IRQs on */
		if (WARN_ON(!(mfmsr() & MSR_EE)))
			__hard_irq_enable();
#endif
	}
v4:
- Fix resend of doorbells on return from interrupt on Book3E
v5:
- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.
v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq
v7:
- Fix a bug with hard irq state tracking on native power7
2012-03-06 15:27:59 +08:00
}

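The soft-disable scheme described in the "Rework lazy-interrupt handling" message can be modelled in user space. The following is a minimal sketch under stated assumptions, not the kernel's implementation: struct cpu_model, the MODEL_* indices and the model_* functions are hypothetical names, and real replay happens through exception frames rather than plain function calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interrupt sources for this model (not the kernel's flag values). */
enum { MODEL_DEC, MODEL_DBELL, MODEL_EE, MODEL_NR_IRQS };

struct cpu_model {
	bool soft_enabled;			/* toggled by the irq disable/restore calls */
	uint8_t irq_happened;			/* bit mask of sources seen while soft-disabled */
	unsigned int handled[MODEL_NR_IRQS];	/* how often each handler ran */
};

static void model_handle(struct cpu_model *cpu, unsigned int idx)
{
	assert(idx < MODEL_NR_IRQS);
	cpu->handled[idx]++;
}

/* An interrupt arrives: run it now, or only record that it happened. */
static void model_interrupt(struct cpu_model *cpu, unsigned int idx)
{
	assert(idx < MODEL_NR_IRQS);
	if (cpu->soft_enabled)
		model_handle(cpu, idx);
	else
		cpu->irq_happened |= (uint8_t)(1u << idx);
}

/* Restore the soft-enable state; on enable, replay whatever was recorded. */
static void model_irq_restore(struct cpu_model *cpu, bool enable)
{
	cpu->soft_enabled = enable;
	if (!enable)
		return;

	for (unsigned int idx = 0; idx < MODEL_NR_IRQS; idx++) {
		uint8_t bit = (uint8_t)(1u << idx);

		if (cpu->irq_happened & bit) {
			cpu->irq_happened &= (uint8_t)~bit;
			model_handle(cpu, idx);
		}
	}
}
```

In this model, two decrementer events that arrive while soft-disabled set the same bit and are replayed once, because the mask records which sources fired and not how often.
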
cpuidle: pseries: Add function to parse extended CEDE records
Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.
The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.
This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.
The dmesg output on a POWER8 and a POWER9 LPAR, showing the result of parsing
the extended CEDE latency parameters, is as follows:
POWER8
[ 10.093279] xcede : xcede_record_size = 10
[ 10.093285] xcede : Record 0 : hint = 1, latency = 0x3c00 tb ticks, Wake-on-irq = 1
[ 10.093291] xcede : Record 1 : hint = 2, latency = 0x4e2000 tb ticks, Wake-on-irq = 0
[ 10.093297] cpuidle : Skipping the 2 Extended CEDE idle states
POWER9
[ 5.913180] xcede : xcede_record_size = 10
[ 5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb ticks, Wake-on-irq = 1
[ 5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb ticks, Wake-on-irq = 0
[ 5.913193] cpuidle : Skipping the 2 Extended CEDE idle states
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[mpe: Make space for 16 records, drop memset, minor cleanup & formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596087177-30329-3-git-send-email-ego@linux.vnet.ibm.com
2020-07-30 13:32:56 +08:00
/*
 * XCEDE: Extended CEDE states discovered through the
 *        "ibm,get-system-parameter" RTAS call with the token
 *        CEDE_LATENCY_TOKEN
 */

/*
 * Section 7.3.16 System Parameters Option of PAPR version 2.8.1 has a
 * table with all the parameters to ibm,get-system-parameter.
 * CEDE_LATENCY_TOKEN corresponds to the token value for Cede Latency
 * Settings Information.
 */
#define CEDE_LATENCY_TOKEN	45

/*
 * If the platform supports the cede latency settings information system
 * parameter it must provide the following information in the NULL terminated
 * parameter string:
 *
 * a. The first byte is the length “N” of each cede latency setting record minus
 *    one (zero indicates a length of 1 byte).
 *
 * b. For each supported cede latency setting a cede latency setting record
 *    consisting of the first “N” bytes as per the following table.
 *
 *    -----------------------------
 *    | Field           | Field   |
 *    | Name            | Length  |
 *    -----------------------------
 *    | Cede Latency    | 1 Byte  |
 *    | Specifier Value |         |
 *    -----------------------------
 *    | Maximum wakeup  |         |
 *    | latency in      | 8 Bytes |
 *    | tb-ticks        |         |
 *    -----------------------------
 *    | Responsive to   |         |
 *    | external        | 1 Byte  |
 *    | interrupts      |         |
 *    -----------------------------
 *
 * This version has cede latency record size = 10.
 *
 * The structure xcede_latency_payload represents a) and b) with
 * xcede_latency_record representing the table in b).
 *
 * xcede_latency_parameter is what gets returned by the
 * ibm,get-system-parameter RTAS call when made with
 * CEDE_LATENCY_TOKEN.
 *
 * These structures are only used to represent the data obtained by the RTAS
 * call. The data is in big-endian.
 */
struct xcede_latency_record {
	u8	hint;
	__be64	latency_ticks;
	u8	wake_on_irqs;
} __packed;

// Make space for 16 records, which "should be enough".
struct xcede_latency_payload {
	u8     record_size;
	struct xcede_latency_record records[16];
} __packed;

struct xcede_latency_parameter {
	__be16  payload_size;
	struct xcede_latency_payload payload;
	u8 null_char;
} __packed;

static unsigned int nr_xcede_records;
static struct xcede_latency_parameter xcede_latency_parameter __initdata;

static int __init parse_cede_parameters(void)
{
	struct xcede_latency_payload *payload;
	u32 total_xcede_records_size;
	u8 xcede_record_size;
	u16 payload_size;
	int ret, i;

	ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
			NULL, CEDE_LATENCY_TOKEN, __pa(&xcede_latency_parameter),
			sizeof(xcede_latency_parameter));
	if (ret) {
		pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
		return ret;
	}

	payload_size = be16_to_cpu(xcede_latency_parameter.payload_size);
	payload = &xcede_latency_parameter.payload;
2020-07-30 13:32:55 +08:00

2020-07-30 13:32:56 +08:00
	xcede_record_size = payload->record_size + 1;

	if (xcede_record_size != sizeof(struct xcede_latency_record)) {
		pr_err("xcede: Expected record-size %lu. Observed size %u.\n",
		       sizeof(struct xcede_latency_record), xcede_record_size);
		return -EINVAL;
	}

	pr_info("xcede: xcede_record_size = %d\n", xcede_record_size);

	/*
	 * Since the payload_size includes the last NULL byte and the
	 * xcede_record_size, the remaining bytes correspond to the array of
	 * all cede_latency settings.
	 */
	total_xcede_records_size = payload_size - 2;
	nr_xcede_records = total_xcede_records_size / xcede_record_size;

	for (i = 0; i < nr_xcede_records; i++) {
		struct xcede_latency_record *record = &payload->records[i];
		u64 latency_ticks = be64_to_cpu(record->latency_ticks);
		u8 wake_on_irqs = record->wake_on_irqs;
		u8 hint = record->hint;

		pr_info("xcede: Record %d : hint = %u, latency = 0x%llx tb ticks, Wake-on-irq = %u\n",
			i, hint, latency_ticks, wake_on_irqs);
	}

	return 0;
}

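The record layout and the size arithmetic used by parse_cede_parameters() can be exercised outside the kernel. What follows is a minimal user-space sketch, not the driver's code: decode_xcede(), read_be64() and struct xcede_record are hypothetical names, the buffer is decoded byte by byte instead of through packed structs, and the bounds checks are additions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XCEDE_RECORD_BYTES 10	/* 1 (hint) + 8 (latency) + 1 (wake-on-irq) */

/* Host-endian view of one record (hypothetical type for this sketch). */
struct xcede_record {
	uint8_t hint;
	uint64_t latency_ticks;
	uint8_t wake_on_irqs;
};

/* Assemble a 64-bit value from 8 big-endian bytes. */
static uint64_t read_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Decode a raw parameter buffer: a 2-byte big-endian payload size, then the
 * payload itself (record-size byte, records, terminating NUL).
 * Returns the number of records decoded, or -1 on malformed input.
 */
static int decode_xcede(const uint8_t *buf, size_t len,
			struct xcede_record *out, size_t max_out)
{
	assert(buf != NULL && out != NULL);

	if (len < 4)
		return -1;

	uint16_t payload_size = (uint16_t)((buf[0] << 8) | buf[1]);
	if (payload_size < 2 || (size_t)payload_size + 2 > len)
		return -1;

	/* The first payload byte stores the record length minus one. */
	unsigned int record_size = buf[2] + 1u;
	if (record_size != XCEDE_RECORD_BYTES)
		return -1;

	/* payload_size counts the record-size byte and the trailing NUL. */
	size_t nr = (payload_size - 2u) / record_size;
	if (nr > max_out)
		return -1;

	const uint8_t *p = buf + 3;
	for (size_t i = 0; i < nr; i++, p += record_size) {
		out[i].hint = p[0];
		out[i].latency_ticks = read_be64(p + 1);
		out[i].wake_on_irqs = p[9];
	}
	return (int)nr;
}
```

A 24-byte buffer built from the POWER9 values in the dmesg excerpt (payload size 22, record-size byte 9, two records, NUL) should decode to two records with latencies 0x400 and 0x3e8000.
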
#define NR_DEDICATED_STATES	2 /* snooze, CEDE */
2020-07-30 13:32:55 +08:00
static u8 cede_latency_hint[NR_DEDICATED_STATES];

2011-11-30 10:46:42 +08:00
static int dedicated_cede_loop(struct cpuidle_device *dev,
				struct cpuidle_driver *drv,
				int index)
{
2020-07-30 13:32:55 +08:00
	u8 old_latency_hint;
2011-11-30 10:46:42 +08:00

powerpc/idle: Store PURR snapshot in a per-cpu global variable
Currently, when a CPU goes idle, we take a snapshot of PURR via
pseries_idle_prolog(), which is used at the CPU idle exit to compute
the idle PURR cycles via the function pseries_idle_epilog(). Thus,
the value of the idle PURR cycles read before pseries_idle_prolog() and
after pseries_idle_epilog() is always correct.
However, if we were to read the idle PURR cycles from an interrupt
context between pseries_idle_prolog() and pseries_idle_epilog() (this
will be done in a future patch), then, the value of the idle PURR thus
read will not include the cycles spent in the most recent idle period.
Thus, in that interrupt context, we will need access to the snapshot
of the PURR before going idle, in order to compute the idle PURR
cycles for the latest idle duration.
In this patch, we save the snapshot of PURR in pseries_idle_prolog()
in a per-cpu variable, instead of on the stack, so that it can be
accessed from an interrupt context.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1586249263-14048-3-git-send-email-ego@linux.vnet.ibm.com
2020-04-07 16:47:40 +08:00
	pseries_idle_prolog();
2011-11-30 10:46:42 +08:00
	get_lppaca()->donate_dedicated_cpu = 1;
2020-07-30 13:32:55 +08:00
	old_latency_hint = get_lppaca()->cede_latency_hint;
	get_lppaca()->cede_latency_hint = cede_latency_hint[index];
2011-11-30 10:46:42 +08:00

	HMT_medium();
2012-03-06 15:27:59 +08:00
	check_and_cede_processor();
2011-11-30 10:46:42 +08:00

2017-11-17 00:00:52 +08:00
	local_irq_disable();
2011-11-30 10:46:42 +08:00
	get_lppaca()->donate_dedicated_cpu = 0;
2020-07-30 13:32:55 +08:00
	get_lppaca()->cede_latency_hint = old_latency_hint;
2013-04-03 20:15:22 +08:00

2020-04-07 16:47:40 +08:00
	pseries_idle_epilog();
2013-04-03 20:15:22 +08:00

2011-11-30 10:46:42 +08:00
	return index;
}

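The reason the PURR snapshot has to live outside the idle function's stack frame, as the "Store PURR snapshot" message explains, can be shown with a small single-CPU model. This is a sketch under stated assumptions rather than the kernel's code: fake_purr stands in for reading the PURR register, and every name below (including the in_idle flag) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for per-CPU state; only one CPU is modelled here. */
static uint64_t fake_purr;		/* models the value of the PURR register */
static uint64_t idle_purr_snapshot;	/* PURR at idle entry, visible to interrupts */
static uint64_t idle_purr_total;	/* idle cycles accumulated at idle exit */
static bool in_idle;

/* Called on idle entry: publish the snapshot instead of keeping it on the stack. */
static void model_idle_prolog(void)
{
	idle_purr_snapshot = fake_purr;
	in_idle = true;
}

/* Called on idle exit: fold the finished idle period into the total. */
static void model_idle_epilog(void)
{
	assert(in_idle);
	idle_purr_total += fake_purr - idle_purr_snapshot;
	in_idle = false;
}

/*
 * Reader that may run from "interrupt context" between prolog and epilog.
 * Thanks to the published snapshot it can include the idle period that is
 * still in progress.
 */
static uint64_t model_read_idle_purr(void)
{
	uint64_t total = idle_purr_total;

	if (in_idle)
		total += fake_purr - idle_purr_snapshot;
	return total;
}
```

If the snapshot were a local variable of the idle loop, the reader would only ever see idle_purr_total and would miss the cycles of the current idle period.
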
static int shared_cede_loop(struct cpuidle_device *dev,
			struct cpuidle_driver *drv,
			int index)
{

2020-04-07 16:47:40 +08:00
	pseries_idle_prolog();
2011-11-30 10:46:42 +08:00

	/*
	 * Yield the processor to the hypervisor.  We return if
	 * an external interrupt occurs (which are driven prior
	 * to returning here) or if a prod occurs from another
	 * processor. When returning here, external interrupts
	 * are enabled.
	 */
2012-03-06 15:27:59 +08:00
	check_and_cede_processor();
2011-11-30 10:46:42 +08:00

2017-11-17 00:00:52 +08:00
	local_irq_disable();
2020-04-07 16:47:40 +08:00
	pseries_idle_epilog();
2013-04-03 20:15:22 +08:00

2011-11-30 10:46:42 +08:00
	return index;
}

/*
 * States for dedicated partition case.
 */
2020-07-30 13:32:55 +08:00
static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
2011-11-30 10:46:42 +08:00
	{ /* Snooze */
		.name = "snooze",
		.desc = "snooze",
		.exit_latency = 0,
		.target_residency = 0,
		.enter = &snooze_loop },
	{ /* CEDE */
		.name = "CEDE",
		.desc = "CEDE",
cpuidle/powerpc: Fix snooze state problem in the cpuidle design on pseries.
Earlier, without the cpuidle framework on pseries, the native arch
idle routine comprised both the snooze and nap
states. The smt_snooze_delay variable was used to delay
the idle process entry to a deeper idle state like nap.
With the coming of cpuidle, this arch-specific idle was replaced
by two different idle routines, one for supporting snooze and the other
for nap. This enabled the addition of more
low-level idle states on pseries in the future.
On adopting the generic cpuidle framework for POWER systems,
the decision of which idle state to choose, given a predicted
idle time, is taken by the menu governor based on the
target_residency and exit_latency of the idle states.
target_residency is the minimum time to be resident in that idle state.
exit_latency is the time taken to exit the idle state.
The deeper the idle state, the higher both the target residency and
the exit latency.
In the current design, smt_snooze_delay is used as target_residency
for the snooze state, which is incorrect, as it is not the
minimum but the maximum duration to be in the snooze state.
This would result in the governor taking bad decisions,
as presently target_residency of nap < target_residency of snooze
in spite of nap being the deeper idle state.
This patch aims to fix this problem by replacing the smt_snooze_delay loop
in the snooze state with need_resched(), as the governor is aware of the
entry and exit of the various idle transitions, on which it bases its
next idle time prediction.
The governor is intelligent enough to determine the idle state that needs to
be transitioned to, and maintains a whole set of heuristics, including
I/O load, previous idle state predictions etc., on
which the idle state entry decision is based.
With this fix, which sets the target_residency of snooze to 0 and that of
nap to smt_snooze_delay, the governor picks the snooze state
if the predicted idle time is less
than smt_snooze_delay (target_residency of nap),
and nap otherwise. This adheres to the
previous native idle design.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-10-04 02:42:26 +08:00
		.exit_latency = 10,
		.target_residency = 100,
2011-11-30 10:46:42 +08:00
		.enter = &dedicated_cede_loop },
};

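How the exit_latency and target_residency values in the table above steer state selection can be illustrated with a deliberately simplified chooser. This is a sketch, not the menu governor: it ignores the governor's heuristics and only applies the two thresholds. struct idle_state_model, example_states and pick_state() are hypothetical names, and the states are assumed to be ordered from shallowest to deepest.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced view of an idle state: just the two thresholds, in microseconds. */
struct idle_state_model {
	const char *name;
	unsigned int exit_latency;
	unsigned int target_residency;
};

/* Same numbers as the dedicated-partition table. */
static const struct idle_state_model example_states[] = {
	{ "snooze", 0, 0 },
	{ "CEDE", 10, 100 },
};

/*
 * Pick the deepest state whose target residency fits into the predicted
 * idle time and whose exit latency respects the latency limit.
 * State 0 is the fallback when nothing deeper qualifies.
 */
static size_t pick_state(const struct idle_state_model *states, size_t n,
			 unsigned int predicted_us, unsigned int latency_limit_us)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++) {
		if (states[i].target_residency > predicted_us)
			continue;	/* would not stay idle long enough */
		if (states[i].exit_latency > latency_limit_us)
			continue;	/* would wake up too slowly */
		best = i;
	}
	return best;
}
```

With these numbers, a predicted idle time of 50us selects snooze and 200us selects CEDE, unless the latency limit is below CEDE's 10us exit latency.
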
/*
 * States for shared partition case.
 */
2014-01-14 18:56:28 +08:00
static struct cpuidle_state shared_states[] = {
2017-10-10 15:11:09 +08:00
	{ /* Snooze */
		.name = "snooze",
		.desc = "snooze",
		.exit_latency = 0,
		.target_residency = 0,
		.enter = &snooze_loop },
2011-11-30 10:46:42 +08:00
	{ /* Shared Cede */
		.name = "Shared Cede",
		.desc = "Shared Cede",
2017-10-10 15:11:09 +08:00
		.exit_latency = 10,
		.target_residency = 100,
2011-11-30 10:46:42 +08:00
		.enter = &shared_cede_loop },
};

2016-08-18 20:57:25 +08:00
static int pseries_cpuidle_cpu_online(unsigned int cpu)
2011-11-30 10:46:42 +08:00
{
2016-08-18 20:57:25 +08:00
	struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);
2012-05-21 02:34:27 +08:00

2012-07-04 04:07:22 +08:00
	if (dev && cpuidle_get_driver()) {
2016-08-18 20:57:25 +08:00
		cpuidle_pause_and_lock();
		cpuidle_enable_device(dev);
		cpuidle_resume_and_unlock();
	}
	return 0;
}
2012-07-04 04:07:22 +08:00

2016-08-18 20:57:25 +08:00
static int pseries_cpuidle_cpu_dead(unsigned int cpu)
{
	struct cpuidle_device *dev = per_cpu(cpuidle_devices, cpu);
2012-07-04 04:07:22 +08:00

2016-08-18 20:57:25 +08:00
	if (dev && cpuidle_get_driver()) {
		cpuidle_pause_and_lock();
		cpuidle_disable_device(dev);
		cpuidle_resume_and_unlock();
2011-11-30 10:46:42 +08:00
	}
2016-08-18 20:57:25 +08:00
	return 0;
2011-11-30 10:46:42 +08:00
}

/*
 * pseries_cpuidle_driver_init()
 */
static int pseries_cpuidle_driver_init(void)
{
	int idle_state;
	struct cpuidle_driver *drv = &pseries_idle_driver;

	drv->state_count = 0;

2014-01-14 18:56:28 +08:00
	for (idle_state = 0; idle_state < max_idle_state; ++idle_state) {
		/* Is the state not enabled? */
2011-11-30 10:46:42 +08:00
		if (cpuidle_state_table[idle_state].enter == NULL)
			continue;

		drv->states[drv->state_count] =	/* structure copy */
			cpuidle_state_table[idle_state];

		drv->state_count += 1;
	}

	return 0;
}

cpuidle: pseries: Fixup exit latency for CEDE(0)
We are currently assuming that CEDE(0) has exit latency 10us, since
there is no way for us to query from the platform. However, if the
wakeup latency of an Extended CEDE state is smaller than 10us, then we
can be sure that the exit latency of CEDE(0) cannot be more than that.
In this patch, we fix the exit latency of CEDE(0) if we discover an
Extended CEDE state with wakeup latency smaller than 10us.
Benchmark results:
On POWER8, this patch does not have any impact since the advertized
latency of Extended CEDE (1) is 30us which is higher than the default
latency of CEDE (0) which is 10us.
On POWER9 we see an improvement in the single-threaded performance of
ebizzy, and no regression in the wakeup latency or the number of
context-switches.
ebizzy:
2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s with patch.
x without_patch
* with_patch
N Min Max Median Avg Stddev
x 10 2491089 5834307 5398375 4244335 1596244.9
* 10 2893813 5834474 5832448 5327281.3 1055941.4
context_switch2:
There is no major regression observed with this patch as seen from the
context_switch2 benchmark.
context_switch2 across CPU0 CPU1 (Both belong to same big-core, but
different small cores). We observe a minor 0.14% regression in the
number of context-switches (higher is better).
x without_patch
* with_patch
N Min Max Median Avg Stddev
x 500 348872 362236 354712 354745.69 2711.827
* 500 349422 361452 353942 354215.4 2576.9258
Difference at 99.0% confidence
-530.288 +/- 430.963
-0.149484% +/- 0.121485%
(Student's t, pooled s = 2645.24)
context_switch2 across CPU0 CPU8 (Different big-cores). We observe a
0.37% improvement in the number of context-switches (higher is
better).
x without_patch
* with_patch
N Min Max Median Avg Stddev
x 500 287956 294940 288896 288977.23 646.59295
* 500 288300 294646 289582 290064.76 1161.9992
Difference at 99.0% confidence
1087.53 +/- 153.194
0.376337% +/- 0.0530125%
(Student's t, pooled s = 940.299)
schbench:
No major difference could be seen until the 99.9th percentile.
Without-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993
With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[mpe: Minor formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596087177-30329-4-git-send-email-ego@linux.vnet.ibm.com
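The latency fix-up described in this message amounts to converting each Extended CEDE latency from timebase ticks to microseconds and capping the CEDE(0) default at the smallest result. The sketch below is illustrative only: it assumes a 512 MHz timebase (an assumption, chosen because it maps the POWER8 value 0x3c00 to the 30us quoted in the message), tb_ticks_to_us() and cede0_exit_latency_us() are hypothetical names, and the value the real function finally stores may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 512 MHz timebase, i.e. 512 ticks per microsecond. */
#define TB_TICKS_PER_US 512u

/* Convert timebase ticks to microseconds, rounding up. */
static uint64_t tb_ticks_to_us(uint64_t ticks)
{
	return (ticks + TB_TICKS_PER_US - 1) / TB_TICKS_PER_US;
}

/*
 * Return the exit latency to assume for CEDE(0): the default, unless some
 * Extended CEDE state advertises a smaller wakeup latency.
 */
static uint64_t cede0_exit_latency_us(uint64_t default_us,
				      const uint64_t *xcede_latency_ticks,
				      size_t nr)
{
	uint64_t min_us = default_us;

	assert(nr == 0 || xcede_latency_ticks != NULL);
	for (size_t i = 0; i < nr; i++) {
		uint64_t us = tb_ticks_to_us(xcede_latency_ticks[i]);

		if (us < min_us)
			min_us = us;
	}
	return min_us;
}
```

Under the 512 MHz assumption, the POWER8 records from the dmesg excerpt leave the 10us default untouched, while the POWER9 record with 0x400 ticks lowers it to 2us.
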
2020-07-30 13:32:57 +08:00
static void __init fixup_cede0_latency(void)
cpuidle: pseries: Add function to parse extended CEDE records
Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.
The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.
This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.
dmesg output on POWER8 and POWER9 LPARs, showing the parsed extended
CEDE latency parameters, is as follows:
POWER8
[ 10.093279] xcede : xcede_record_size = 10
[ 10.093285] xcede : Record 0 : hint = 1, latency = 0x3c00 tb ticks, Wake-on-irq = 1
[ 10.093291] xcede : Record 1 : hint = 2, latency = 0x4e2000 tb ticks, Wake-on-irq = 0
[ 10.093297] cpuidle : Skipping the 2 Extended CEDE idle states
POWER9
[ 5.913180] xcede : xcede_record_size = 10
[ 5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb ticks, Wake-on-irq = 1
[ 5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb ticks, Wake-on-irq = 0
[ 5.913193] cpuidle : Skipping the 2 Extended CEDE idle states
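The latencies in the records above are reported in timebase (tb) ticks. As a rough sketch of how they map to microseconds, the helper below assumes a 512 MHz timebase, i.e. 512 ticks per microsecond; that rate and the names tb_ticks_to_us() and ASSUMED_TB_TICKS_PER_USEC are illustrative assumptions and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the timebase runs at 512 MHz, so 512 ticks make up 1 us. */
#define ASSUMED_TB_TICKS_PER_USEC 512ULL

/*
 * Hypothetical helper: convert a latency given in timebase ticks to whole
 * microseconds, rounding down.
 */
uint64_t tb_ticks_to_us(uint64_t ticks)
{
	assert(ASSUMED_TB_TICKS_PER_USEC != 0);
	return ticks / ASSUMED_TB_TICKS_PER_USEC;
}
```

Under that assumption, the POWER8 record with latency 0x3c00 comes out at 30us and the POWER9 record with latency 0x400 at 2us.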
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[mpe: Make space for 16 records, drop memset, minor cleanup & formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596087177-30329-3-git-send-email-ego@linux.vnet.ibm.com
2020-07-30 13:32:56 +08:00
{
cpuidle: pseries: Fixup exit latency for CEDE(0)
We currently assume that CEDE(0) has an exit latency of 10us, since
there is no way for us to query it from the platform. However, if the
wakeup latency of an Extended CEDE state is smaller than 10us, then we
can be sure that the exit latency of CEDE(0) cannot be more than that.
In this patch, we fix the exit latency of CEDE(0) if we discover an
Extended CEDE state with wakeup latency smaller than 10us.
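The fix-up described above can be sketched in plain C as follows. This is a minimal userspace illustration, not the driver code: the struct cede0_params type and the fixup_cede0() name are hypothetical, latencies are assumed to be already converted to microseconds, and the factor of 10 between exit latency and target residency is an assumption of this sketch. The clamp is written so that an advertised latency of 0 or 1us cannot underflow the unsigned subtraction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two CEDE(0) parameters, in microseconds. */
struct cede0_params {
	uint64_t exit_latency_us;
	uint64_t target_residency_us;
};

/*
 * Return the CEDE(0) parameters after the fix-up: if any Extended CEDE
 * state advertises a wakeup latency below the current CEDE(0) exit
 * latency, lower CEDE(0) to just under that value.
 */
struct cede0_params fixup_cede0(const uint64_t *xcede_latency_us, size_t n,
				struct cede0_params cur)
{
	uint64_t min_us = cur.exit_latency_us;
	size_t i;

	/* Find the smallest advertised Extended CEDE wakeup latency. */
	for (i = 0; i < n; i++) {
		if (xcede_latency_us[i] < min_us)
			min_us = xcede_latency_us[i];
	}

	if (min_us < cur.exit_latency_us) {
		/* Stay one step below the Extended CEDE state when possible. */
		uint64_t cede0_us = (min_us > 1) ? min_us - 1 : min_us;

		cur.exit_latency_us = cede0_us;
		cur.target_residency_us = 10 * cede0_us; /* assumed ratio */
	}

	return cur;
}
```

With a 10us default, an Extended CEDE state advertising 2us would lower CEDE(0) to 1us, while one advertising 30us would leave the default untouched.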
Benchmark results:
On POWER8, this patch has no impact, since the advertised latency of
Extended CEDE (1) is 30us, which is higher than the default latency of
CEDE (0), which is 10us.
On POWER9 we see an improvement in the single-threaded performance of
ebizzy, and no regression in the wakeup latency or the number of
context-switches.
ebizzy:
2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s with patch.
x without_patch
* with_patch
    N           Min           Max        Median           Avg        Stddev
x  10       2491089       5834307       5398375       4244335     1596244.9
*  10       2893813       5834474       5832448     5327281.3     1055941.4
context_switch2:
There is no major regression observed with this patch as seen from the
context_switch2 benchmark.
context_switch2 across CPU0 CPU1 (Both belong to same big-core, but
different small cores). We observe a minor 0.14% regression in the
number of context-switches (higher is better).
x without_patch
* with_patch
    N           Min           Max        Median           Avg        Stddev
x 500        348872        362236        354712     354745.69      2711.827
* 500        349422        361452        353942      354215.4     2576.9258
Difference at 99.0% confidence
-530.288 +/- 430.963
-0.149484% +/- 0.121485%
(Student's t, pooled s = 2645.24)
context_switch2 across CPU0 CPU8 (Different big-cores). We observe a
0.37% improvement in the number of context-switches (higher is
better).
x without_patch
* with_patch
    N           Min           Max        Median           Avg        Stddev
x 500        287956        294940        288896     288977.23     646.59295
* 500        288300        294646        289582     290064.76     1161.9992
Difference at 99.0% confidence
1087.53 +/- 153.194
0.376337% +/- 0.0530125%
(Student's t, pooled s = 940.299)
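The percentages quoted for ebizzy and context_switch2 can be cross-checked from the Avg columns alone. The sketch below is only illustrative arithmetic; the function name percent_change() is hypothetical and not part of any benchmark tool.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative change of the with-patch average over the
 * without-patch average, expressed in percent.
 */
double percent_change(double without_patch, double with_patch)
{
	assert(without_patch != 0.0);
	return (with_patch - without_patch) / without_patch * 100.0;
}
```

Feeding in the averages from the tables gives roughly +25.5% for ebizzy (4244335 to 5327281.3), about -0.149% for context_switch2 across CPU0 and CPU1, and about +0.376% across CPU0 and CPU8, in line with the figures stated in the text.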
schbench:
No major difference could be seen until the 99.9th percentile.
Without-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993
With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[mpe: Minor formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596087177-30329-4-git-send-email-ego@linux.vnet.ibm.com
2020-07-30 13:32:57 +08:00
	struct xcede_latency_payload *payload;
	u64 min_latency_us;
	int i;

	min_latency_us = dedicated_states[1].exit_latency; // CEDE latency
2020-07-30 13:32:56 +08:00
	if (parse_cede_parameters())
		return;
2020-07-30 13:32:57 +08:00
	pr_info("cpuidle: Skipping the %d Extended CEDE idle states\n",
2020-07-30 13:32:56 +08:00
		nr_xcede_records);
2020-07-30 13:32:57 +08:00
	payload = &xcede_latency_parameter.payload;
	for (i = 0; i < nr_xcede_records; i++) {
		struct xcede_latency_record *record = &payload->records[i];
		u64 latency_tb = be64_to_cpu(record->latency_ticks);
		u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;

		if (latency_us < min_latency_us)
			min_latency_us = latency_us;
	}

	/*
	 * By default, we assume that CEDE(0) has exit latency 10us,
	 * since there is no way for us to query from the platform.
	 *
	 * However, if the wakeup latency of an Extended CEDE state is
	 * smaller than 10us, then we can be sure that CEDE(0)
	 * requires no more than that.
	 *
	 * Perform the fix-up.
	 */
	if (min_latency_us < dedicated_states[1].exit_latency) {
		u64 cede0_latency = min_latency_us - 1;

		if (cede0_latency <= 0)
			cede0_latency = min_latency_us;

		dedicated_states[1].exit_latency = cede0_latency;
		dedicated_states[1].target_residency = 10 * (cede0_latency);
		pr_info("cpuidle: Fixed up CEDE exit latency to %llu us\n",
			cede0_latency);
	}
2020-07-30 13:32:56 +08:00
}
2011-11-30 10:46:42 +08:00
/*
 * pseries_idle_probe()
 * Choose state table for shared versus dedicated partition
 */
static int pseries_idle_probe(void)
{
2011-11-30 10:47:03 +08:00
	if (cpuidle_disable != IDLE_NO_OVERRIDE)
		return -ENODEV;
2014-01-14 18:56:09 +08:00
	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
2018-11-24 00:30:11 +08:00
		/*
		 * Use local_paca instead of get_lppaca() since
		 * preemption is not disabled, and it is not required in
		 * fact, since lppaca_ptr does not need to be the value
		 * associated to the current CPU, it can be from any CPU.
		 */
		if (lppaca_shared_proc(local_paca->lppaca_ptr)) {
2014-01-14 18:56:09 +08:00
			cpuidle_state_table = shared_states;
2014-01-14 18:56:28 +08:00
			max_idle_state = ARRAY_SIZE(shared_states);
		} else {
2020-07-30 13:32:57 +08:00
			fixup_cede0_latency();
2014-01-14 18:56:09 +08:00
			cpuidle_state_table = dedicated_states;
2020-07-30 13:32:55 +08:00
			max_idle_state = NR_DEDICATED_STATES;
2014-01-14 18:56:28 +08:00
		}
2014-01-14 18:56:09 +08:00
	} else
		return -ENODEV;
2011-11-30 10:46:42 +08:00
2015-06-18 19:23:11 +08:00
	if (max_idle_state > 1) {
		snooze_timeout_en = true;
		snooze_timeout = cpuidle_state_table[1].target_residency *
				 tb_ticks_per_usec;
	}
2011-11-30 10:46:42 +08:00
	return 0;
}

static int __init pseries_processor_idle_init(void)
{
	int retval;

	retval = pseries_idle_probe();
	if (retval)
		return retval;

	pseries_cpuidle_driver_init();
2014-01-14 18:56:09 +08:00
	retval = cpuidle_register(&pseries_idle_driver, NULL);
2011-11-30 10:46:42 +08:00
	if (retval) {
		printk(KERN_DEBUG "Registration of pseries driver failed.\n");
		return retval;
	}
2016-08-18 20:57:25 +08:00
	retval = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
					   "cpuidle/pseries:online",
					   pseries_cpuidle_cpu_online, NULL);
	WARN_ON(retval < 0);
	retval = cpuhp_setup_state_nocalls(CPUHP_CPUIDLE_DEAD,
					   "cpuidle/pseries:DEAD", NULL,
					   pseries_cpuidle_cpu_dead);
	WARN_ON(retval < 0);
2011-11-30 10:46:42 +08:00
	printk(KERN_DEBUG "pseries_idle_driver registered\n");
	return 0;
}
2014-01-14 18:56:18 +08:00
device_initcall(pseries_processor_idle_init);