Merge branches 'pm-avs', 'pm-docs' and 'pm-tools'
* pm-avs:
  ARM: OMAP2+: SmartReflex: add omap_sr_pdata definition
  power: avs: smartreflex: Remove superfluous cast in debugfs_create_file() call

* pm-docs:
  PM: Wrap documentation to fit in 80 columns

* pm-tools:
  cpupower: ToDo: Update ToDo with ideas for per_cpu_schedule handling
  cpupower: mperf_monitor: Update cpupower to use the RDPRU instruction
  cpupower: mperf_monitor: Introduce per_cpu_schedule flag
  cpupower: Move needs_root variable into a sub-struct
  cpupower: Handle set and info subcommands correctly
  pm-graph info added to MAINTAINERS
  tools/power/cpupower: Fix initializer override in hsw_ext_cstates
commit e350b60f4e
@@ -39,9 +39,10 @@ c) Compile the driver directly into the kernel and try the test modes of
 d) Attempt to hibernate with the driver compiled directly into the kernel
    in the "reboot", "shutdown" and "platform" modes.

-e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.rst,
-   2). [As far as the STR tests are concerned, it should not matter whether or
-   not the driver is built as a module.]
+e) Try the test modes of suspend (see:
+   Documentation/power/basic-pm-debugging.rst, 2). [As far as the STR tests are
+   concerned, it should not matter whether or not the driver is built as a
+   module.]

 f) Attempt to suspend to RAM using the s2ram tool with the driver loaded
    (see: Documentation/power/basic-pm-debugging.rst, 2).
@@ -215,30 +215,31 @@ VI. Are there any precautions to be taken to prevent freezing failures?

 Yes, there are.

-First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a piece of code
-from system-wide sleep such as suspend/hibernation is not encouraged.
-If possible, that piece of code must instead hook onto the suspend/hibernation
-notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code
-(kernel/cpu.c) for an example.
+First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a
+piece of code from system-wide sleep such as suspend/hibernation is not
+encouraged. If possible, that piece of code must instead hook onto the
+suspend/hibernation notifiers to achieve mutual exclusion. Look at the
+CPU-Hotplug code (kernel/cpu.c) for an example.

-However, if that is not feasible, and grabbing 'system_transition_mutex' is deemed necessary,
-it is strongly discouraged to directly call mutex_[un]lock(&system_transition_mutex) since
-that could lead to freezing failures, because if the suspend/hibernate code
-successfully acquired the 'system_transition_mutex' lock, and hence that other entity failed
-to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE
-state. As a consequence, the freezer would not be able to freeze that task,
-leading to freezing failure.
+However, if that is not feasible, and grabbing 'system_transition_mutex' is
+deemed necessary, it is strongly discouraged to directly call
+mutex_[un]lock(&system_transition_mutex) since that could lead to freezing
+failures, because if the suspend/hibernate code successfully acquired the
+'system_transition_mutex' lock, and hence that other entity failed to acquire
+the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE state. As a
+consequence, the freezer would not be able to freeze that task, leading to
+freezing failure.

 However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
 since they ask the freezer to skip freezing this task, since it is anyway
-"frozen enough" as it is blocked on 'system_transition_mutex', which will be released
-only after the entire suspend/hibernation sequence is complete.
-So, to summarize, use [un]lock_system_sleep() instead of directly using
+"frozen enough" as it is blocked on 'system_transition_mutex', which will be
+released only after the entire suspend/hibernation sequence is complete. So, to
+summarize, use [un]lock_system_sleep() instead of directly using
 mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures.

 V. Miscellaneous
 ================

 /sys/power/pm_freeze_timeout controls how long it will cost at most to freeze
-all user space processes or all freezable kernel threads, in unit of millisecond.
-The default value is 20000, with range of unsigned integer.
+all user space processes or all freezable kernel threads, in unit of
+millisecond. The default value is 20000, with range of unsigned integer.
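
For illustration, a minimal sketch of the safe pattern recommended above
(hypothetical driver code, not part of this merge; it assumes only the
kernel-internal lock_system_sleep()/unlock_system_sleep() helpers):

    #include <linux/suspend.h>

    /* Hypothetical driver routine that must not race with system-wide
     * suspend/hibernation.
     */
    static void my_driver_sync_op(void)
    {
            /*
             * Safe: lock_system_sleep() marks the caller as skippable for
             * the freezer, so blocking on system_transition_mutex here
             * cannot cause a freezing failure.
             */
            lock_system_sleep();

            /* ... work that must be mutually exclusive with suspend ... */

            unlock_system_sleep();

            /*
             * Discouraged alternative (see above):
             * mutex_lock(&system_transition_mutex); ...
             * mutex_unlock(&system_transition_mutex);
             */
    }
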
@@ -73,19 +73,21 @@ factors. Example usage: Thermal management or other exceptional situations where
 SoC framework might choose to disable a higher frequency OPP to safely continue
 operations until that OPP could be re-enabled if possible.

-OPP library facilitates this concept in it's implementation. The following
+OPP library facilitates this concept in its implementation. The following
 operational functions operate only on available opps:
-opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count
+opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq,
+dev_pm_opp_get_opp_count

-dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then
-be used for dev_pm_opp_enable/disable functions to make an opp available as required.
+dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer
+which can then be used for dev_pm_opp_enable/disable functions to make an
+opp available as required.

 WARNING: Users of OPP library should refresh their availability count using
-get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the
-exact mechanism to trigger these or the notification mechanism to other
-dependent subsystems such as cpufreq are left to the discretion of the SoC
-specific framework which uses the OPP library. Similar care needs to be taken
-care to refresh the cpufreq table in cases of these operations.
+get_opp_count if dev_pm_opp_enable/disable functions are invoked for a
+device, the exact mechanism to trigger these or the notification mechanism
+to other dependent subsystems such as cpufreq are left to the discretion of
+the SoC specific framework which uses the OPP library. Similar care needs
+to be taken care to refresh the cpufreq table in cases of these operations.

 2. Initial OPP List Registration
 ================================
@@ -99,11 +101,11 @@ OPPs dynamically using the dev_pm_opp_enable / disable functions.
 dev_pm_opp_add
 	Add a new OPP for a specific domain represented by the device pointer.
 	The OPP is defined using the frequency and voltage. Once added, the OPP
-	is assumed to be available and control of it's availability can be done
-	with the dev_pm_opp_enable/disable functions. OPP library internally stores
-	and manages this information in the opp struct. This function may be
-	used by SoC framework to define a optimal list as per the demands of
-	SoC usage environment.
+	is assumed to be available and control of its availability can be done
+	with the dev_pm_opp_enable/disable functions. OPP library
+	internally stores and manages this information in the opp struct.
+	This function may be used by SoC framework to define a optimal list
+	as per the demands of SoC usage environment.

 WARNING:
 	Do not use this function in interrupt context.
@@ -354,7 +356,7 @@ struct dev_pm_opp

 struct device
 	This is used to identify a domain to the OPP layer. The
-	nature of the device and it's implementation is left to the user of
+	nature of the device and its implementation is left to the user of
 	OPP library such as the SoC framework.

 Overall, in a simplistic view, the data structure operations is represented as
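
For illustration, a sketch of the registration flow these functions describe
(hypothetical device and values, not part of this merge):

    #include <linux/pm_opp.h>

    /* Register an initial OPP list for @dev; frequencies are in Hz and
     * voltages in uV. The values below are purely illustrative.
     */
    static int soc_register_opps(struct device *dev)
    {
            int ret;

            ret = dev_pm_opp_add(dev, 600000000, 1000000);  /* 600 MHz @ 1.00 V */
            if (ret)
                    return ret;

            ret = dev_pm_opp_add(dev, 800000000, 1100000);  /* 800 MHz @ 1.10 V */
            if (ret)
                    return ret;

            /* Added OPPs start out available; a thermal framework could
             * later call dev_pm_opp_disable() on the top one.
             */
            return dev_pm_opp_add(dev, 1000000000, 1200000); /* 1 GHz @ 1.20 V */
    }
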
@@ -426,12 +426,12 @@ pm->runtime_idle() callback.
 2.4. System-Wide Power Transitions
 ----------------------------------
 There are a few different types of system-wide power transitions, described in
-Documentation/driver-api/pm/devices.rst. Each of them requires devices to be handled
-in a specific way and the PM core executes subsystem-level power management
-callbacks for this purpose. They are executed in phases such that each phase
-involves executing the same subsystem-level callback for every device belonging
-to the given subsystem before the next phase begins. These phases always run
-after tasks have been frozen.
+Documentation/driver-api/pm/devices.rst. Each of them requires devices to be
+handled in a specific way and the PM core executes subsystem-level power
+management callbacks for this purpose. They are executed in phases such that
+each phase involves executing the same subsystem-level callback for every device
+belonging to the given subsystem before the next phase begins. These phases
+always run after tasks have been frozen.

 2.4.1. System Suspend
 ^^^^^^^^^^^^^^^^^^^^^
@@ -636,12 +636,12 @@ System restore requires a hibernation image to be loaded into memory and the
 pre-hibernation memory contents to be restored before the pre-hibernation system
 activity can be resumed.

-As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded
-into memory by a fresh instance of the kernel, called the boot kernel, which in
-turn is loaded and run by a boot loader in the usual way. After the boot kernel
-has loaded the image, it needs to replace its own code and data with the code
-and data of the "hibernated" kernel stored within the image, called the image
-kernel. For this purpose all devices are frozen just like before creating
+As described in Documentation/driver-api/pm/devices.rst, the hibernation image
+is loaded into memory by a fresh instance of the kernel, called the boot kernel,
+which in turn is loaded and run by a boot loader in the usual way. After the
+boot kernel has loaded the image, it needs to replace its own code and data with
+the code and data of the "hibernated" kernel stored within the image, called the
+image kernel. For this purpose all devices are frozen just like before creating
 the image during hibernation, in the

 prepare, freeze, freeze_noirq
@@ -691,8 +691,8 @@ controlling the runtime power management of their devices.

 At the time of this writing there are two ways to define power management
 callbacks for a PCI device driver, the recommended one, based on using a
-dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the
-"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and
+dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and
+the "legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and
 .resume() callbacks from struct pci_driver are used. The legacy approach,
 however, doesn't allow one to define runtime power management callbacks and is
 not really suitable for any new drivers. Therefore it is not covered by this
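
For illustration, the recommended dev_pm_ops arrangement mentioned above, as a
sketch (hypothetical driver and callback names, not part of this merge):

    #include <linux/pci.h>
    #include <linux/pm.h>

    /* Hypothetical system sleep callbacks for a PCI driver. */
    static int my_pci_suspend(struct device *dev) { return 0; }
    static int my_pci_resume(struct device *dev)  { return 0; }

    static const struct dev_pm_ops my_pci_pm_ops = {
            SET_SYSTEM_SLEEP_PM_OPS(my_pci_suspend, my_pci_resume)
    };

    static struct pci_driver my_pci_driver = {
            .name = "my_pci",
            /* .id_table, .probe and .remove omitted for brevity */
            .driver.pm = &my_pci_pm_ops,  /* recommended: dev_pm_ops */
            /* the legacy .suspend()/.resume() fields are left unused */
    };
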
@@ -8,8 +8,8 @@ one of the parameters.

 Two different PM QoS frameworks are available:
 1. PM QoS classes for cpu_dma_latency
-2. the per-device PM QoS framework provides the API to manage the per-device latency
-constraints and PM QoS flags.
+2. The per-device PM QoS framework provides the API to manage the
+   per-device latency constraints and PM QoS flags.

 Each parameters have defined units:
@@ -47,14 +47,14 @@ void pm_qos_add_request(handle, param_class, target_value):
 pm_qos API functions.

 void pm_qos_update_request(handle, new_target_value):
-Will update the list element pointed to by the handle with the new target value
-and recompute the new aggregated target, calling the notification tree if the
-target is changed.
+Will update the list element pointed to by the handle with the new target
+value and recompute the new aggregated target, calling the notification tree
+if the target is changed.

 void pm_qos_remove_request(handle):
-Will remove the element. After removal it will update the aggregate target and
-call the notification tree if the target was changed as a result of removing
-the request.
+Will remove the element. After removal it will update the aggregate target
+and call the notification tree if the target was changed as a result of
+removing the request.

 int pm_qos_request(param_class):
 Returns the aggregated value for a given PM QoS class.
@@ -167,9 +167,9 @@ int dev_pm_qos_expose_flags(device, value)
 change the value of the PM_QOS_FLAG_NO_POWER_OFF flag.

 void dev_pm_qos_hide_flags(device)
-Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list
-of flags and remove sysfs attribute pm_qos_no_power_off from the device's power
-directory.
+Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS
+list of flags and remove sysfs attribute pm_qos_no_power_off from the device's
+power directory.

 Notification mechanisms:
@@ -179,8 +179,8 @@ int dev_pm_qos_add_notifier(device, notifier, type):
 Adds a notification callback function for the device for a particular request
 type.

-The callback is called when the aggregated value of the device constraints list
-is changed.
+The callback is called when the aggregated value of the device constraints
+list is changed.

 int dev_pm_qos_remove_notifier(device, notifier, type):
 Removes the notification callback function for the device.
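
For illustration, the request lifecycle these functions implement, as a sketch
(hypothetical driver fragment, not part of this merge; the handle is a
struct pm_qos_request in this era of the API):

    #include <linux/pm_qos.h>

    static struct pm_qos_request my_latency_req;

    static void my_device_start(void)
    {
            /* Ask that CPU/DMA latency stay at or below 20 usec. */
            pm_qos_add_request(&my_latency_req, PM_QOS_CPU_DMA_LATENCY, 20);
    }

    static void my_device_relax(void)
    {
            /* Raise the bound to 100 usec; notifiers run if the
             * aggregated target changes.
             */
            pm_qos_update_request(&my_latency_req, 100);
    }

    static void my_device_stop(void)
    {
            /* Drop the request and recompute the aggregate target. */
            pm_qos_remove_request(&my_latency_req);
    }
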
@@ -268,8 +268,8 @@ defined in include/linux/pm.h:
   `unsigned int runtime_auto;`
     - if set, indicates that the user space has allowed the device driver to
       power manage the device at run time via the /sys/devices/.../power/control
-      `interface;` it may only be modified with the help of the pm_runtime_allow()
-      and pm_runtime_forbid() helper functions
+      `interface;` it may only be modified with the help of the
+      pm_runtime_allow() and pm_runtime_forbid() helper functions

   `unsigned int no_callbacks;`
     - indicates that the device does not use the runtime PM callbacks (see
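
For illustration, how the two helpers named above are typically used
(hypothetical probe fragment, not part of this merge):

    #include <linux/pm_runtime.h>

    /* Hypothetical probe fragment: opt the device into runtime power
     * management by default, setting the runtime_auto flag just as
     * writing "auto" to /sys/devices/.../power/control would.
     */
    static void my_probe_pm_setup(struct device *dev)
    {
            pm_runtime_allow(dev);
            /* pm_runtime_forbid(dev) would revert to "on". */
    }
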
@@ -106,8 +106,8 @@ execution during resume):
 * Release system_transition_mutex lock.


-It is to be noted here that the system_transition_mutex lock is acquired at the very
-beginning, when we are just starting out to suspend, and then released only
+It is to be noted here that the system_transition_mutex lock is acquired at the
+very beginning, when we are just starting out to suspend, and then released only
 after the entire cycle is complete (i.e., suspend + resume).

 ::
@@ -165,7 +165,8 @@ Important files and functions/entry points:

 - kernel/power/process.c : freeze_processes(), thaw_processes()
 - kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
-- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()
+- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](),
+  [disable|enable]_nonboot_cpus()
@@ -118,7 +118,8 @@ In a really perfect world::

     echo 1 > /proc/acpi/sleep       # for standby
     echo 2 > /proc/acpi/sleep       # for suspend to ram
-    echo 3 > /proc/acpi/sleep       # for suspend to ram, but with more power conservative
+    echo 3 > /proc/acpi/sleep       # for suspend to ram, but with more power
+                                    # conservative
     echo 4 > /proc/acpi/sleep       # for suspend to disk
     echo 5 > /proc/acpi/sleep       # for shutdown unfriendly the system
@@ -192,8 +193,8 @@ Q:

 A:
     The freezing of tasks is a mechanism by which user space processes and some
-    kernel threads are controlled during hibernation or system-wide suspend (on some
-    architectures). See freezing-of-tasks.txt for details.
+    kernel threads are controlled during hibernation or system-wide suspend (on
+    some architectures). See freezing-of-tasks.txt for details.

 Q:
     What is the difference between "platform" and "shutdown"?
@@ -282,7 +283,8 @@ A:
     suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
                       with state snapshot

-    state snapshot: copy of whole used memory is taken with interrupts disabled
+    state snapshot: copy of whole used memory is taken with interrupts
+                    disabled

     resume(): devices are woken up so that we can write image to swap
@@ -353,8 +355,8 @@ Q:

 A:
     Generally, yes, you can. However, it requires you to use the "resume=" and
-    "resume_offset=" kernel command line parameters, so the resume from a swap file
-    cannot be initiated from an initrd or initramfs image. See
+    "resume_offset=" kernel command line parameters, so the resume from a swap
+    file cannot be initiated from an initrd or initramfs image. See
     swsusp-and-swap-files.txt for details.

 Q:
@@ -13000,6 +13000,15 @@ L:	linux-scsi@vger.kernel.org
 S:	Supported
 F:	drivers/scsi/pm8001/

+PM-GRAPH UTILITY
+M:	"Todd E Brandt" <todd.e.brandt@linux.intel.com>
+L:	linux-pm@vger.kernel.org
+W:	https://01.org/pm-graph
+B:	https://bugzilla.kernel.org/buglist.cgi?component=pm-graph&product=Tools
+T:	git git://github.com/intel/pm-graph
+S:	Supported
+F:	tools/power/pm-graph
+
 PNP SUPPORT
 M:	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
 S:	Maintained
@@ -905,7 +905,7 @@ static int omap_sr_probe(struct platform_device *pdev)
 	sr_info->dbg_dir = debugfs_create_dir(sr_info->name, sr_dbg_dir);

 	debugfs_create_file("autocomp", S_IRUGO | S_IWUSR, sr_info->dbg_dir,
-			    (void *)sr_info, &pm_sr_fops);
+			    sr_info, &pm_sr_fops);
 	debugfs_create_x32("errweight", S_IRUGO, sr_info->dbg_dir,
 			   &sr_info->err_weight);
 	debugfs_create_x32("errmaxlimit", S_IRUGO, sr_info->dbg_dir,
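
For context on why the cast was superfluous: in C, any object pointer converts
implicitly to void *, so passing sr_info directly is equivalent. A standalone
sketch:

    #include <stdio.h>

    static void takes_void_ptr(void *p)
    {
            (void)p;
    }

    int main(void)
    {
            int x = 0;

            takes_void_ptr(&x);          /* fine: implicit conversion */
            takes_void_ptr((void *)&x);  /* legal, but the cast is redundant */
            return 0;
    }
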
@@ -293,6 +293,9 @@ struct omap_sr_data {
 	struct voltagedomain		*voltdm;
 };

+
+extern struct omap_sr_data omap_sr_pdata[OMAP_SR_NR];
+
 #ifdef CONFIG_POWER_AVS_OMAP

 /* Smartreflex module enable/disable interface */
@@ -8,3 +8,17 @@ ToDos sorted by priority:
 - Add another c1e debug idle monitor
   -> Is by design racy with BIOS, but could be added
      with a --force option and some "be careful" messages
+- Add cpu_start()/cpu_stop() callbacks for monitor
+  -> This is to move the per_cpu logic from inside the
+     monitor to outside it. This can be given higher
+     priority in fork_it.
+- Fork as many processes as there are CPUs in case the
+  per_cpu_schedule flag is set.
+  -> Bind forked process to each cpu.
+  -> Execute start measures via the forked processes on
+     each cpu.
+  -> Run test executable in a forked process.
+  -> Execute stop measures via the forked processes on
+     each cpu.
+  This would be ideal as it will not introduce noise in the
+  tested executable.
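
For illustration, a rough user-space sketch of the fork-and-bind scheme
proposed above (illustrative only, not cpupower code):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Fork one worker per CPU and pin each worker to its CPU. */
    static void fork_per_cpu(int cpu_count)
    {
            for (int cpu = 0; cpu < cpu_count; cpu++) {
                    if (fork() == 0) {
                            cpu_set_t set;

                            CPU_ZERO(&set);
                            CPU_SET(cpu, &set);
                            sched_setaffinity(0, sizeof(set), &set);
                            /* ... execute start/stop measures here ... */
                            _exit(0);
                    }
            }
            while (wait(NULL) > 0)
                    ;       /* reap all workers */
    }
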
@@ -10,6 +10,7 @@
 #include <errno.h>
 #include <string.h>
 #include <getopt.h>
+#include <sys/utsname.h>

 #include "helpers/helpers.h"
 #include "helpers/sysfs.h"
@@ -30,6 +31,7 @@ int cmd_info(int argc, char **argv)
 	extern char *optarg;
 	extern int optind, opterr, optopt;
 	unsigned int cpu;
+	struct utsname uts;

 	union {
 		struct {
@@ -39,6 +41,13 @@ int cmd_info(int argc, char **argv)
 	} params = {};
 	int ret = 0;

+	ret = uname(&uts);
+	if (!ret && (!strcmp(uts.machine, "ppc64le") ||
+		     !strcmp(uts.machine, "ppc64"))) {
+		fprintf(stderr, _("Subcommand not supported on POWER.\n"));
+		return ret;
+	}
+
 	setlocale(LC_ALL, "");
 	textdomain(PACKAGE);
@@ -10,6 +10,7 @@
 #include <errno.h>
 #include <string.h>
 #include <getopt.h>
+#include <sys/utsname.h>

 #include "helpers/helpers.h"
 #include "helpers/sysfs.h"
@@ -31,6 +32,7 @@ int cmd_set(int argc, char **argv)
 	extern char *optarg;
 	extern int optind, opterr, optopt;
 	unsigned int cpu;
+	struct utsname uts;

 	union {
 		struct {
@@ -41,6 +43,13 @@ int cmd_set(int argc, char **argv)
 	int perf_bias = 0;
 	int ret = 0;

+	ret = uname(&uts);
+	if (!ret && (!strcmp(uts.machine, "ppc64le") ||
+		     !strcmp(uts.machine, "ppc64"))) {
+		fprintf(stderr, _("Subcommand not supported on POWER.\n"));
+		return ret;
+	}
+
 	setlocale(LC_ALL, "");
 	textdomain(PACKAGE);
@@ -131,6 +131,10 @@ int get_cpu_info(struct cpupower_cpu_info *cpu_info)
 		if (ext_cpuid_level >= 0x80000007 &&
 		    (cpuid_edx(0x80000007) & (1 << 9)))
 			cpu_info->caps |= CPUPOWER_CAP_AMD_CBP;
+
+		if (ext_cpuid_level >= 0x80000008 &&
+		    cpuid_ebx(0x80000008) & (1 << 4))
+			cpu_info->caps |= CPUPOWER_CAP_AMD_RDPRU;
 	}

 	if (cpu_info->vendor == X86_VENDOR_INTEL) {
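
The detection above keys off CPUID leaf 0x80000008, EBX bit 4. A standalone
check, as a sketch using GCC's cpuid.h (not part of this merge):

    #include <cpuid.h>
    #include <stdio.h>

    /* Report whether the CPU advertises RDPRU (CPUID 0x80000008, EBX[4]). */
    int main(void)
    {
            unsigned int eax, ebx, ecx, edx;

            if (__get_cpuid(0x80000000, &eax, &ebx, &ecx, &edx) &&
                eax >= 0x80000008 &&
                __get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
                    printf("RDPRU %ssupported\n",
                           (ebx & (1 << 4)) ? "" : "not ");
            return 0;
    }
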
@@ -69,6 +69,7 @@ enum cpupower_cpu_vendor {X86_VENDOR_UNKNOWN = 0, X86_VENDOR_INTEL,
 #define CPUPOWER_CAP_HAS_TURBO_RATIO	0x00000010
 #define CPUPOWER_CAP_IS_SNB		0x00000020
 #define CPUPOWER_CAP_INTEL_IDA		0x00000040
+#define CPUPOWER_CAP_AMD_RDPRU		0x00000080

 #define CPUPOWER_AMD_CPBDIS		0x02000000
@@ -328,7 +328,7 @@ struct cpuidle_monitor amd_fam14h_monitor = {
 	.stop = amd_fam14h_stop,
 	.do_register = amd_fam14h_register,
 	.unregister = amd_fam14h_unregister,
-	.needs_root = 1,
+	.flags.needs_root = 1,
 	.overflow_s = OVERFLOW_MS / 1000,
 };
 #endif /* #if defined(__i386__) || defined(__x86_64__) */
@@ -207,6 +207,6 @@ struct cpuidle_monitor cpuidle_sysfs_monitor = {
 	.stop = cpuidle_stop,
 	.do_register = cpuidle_register,
 	.unregister = cpuidle_unregister,
-	.needs_root = 0,
+	.flags.needs_root = 0,
 	.overflow_s = UINT_MAX,
 };
@@ -408,7 +408,7 @@ int cmd_monitor(int argc, char **argv)
 		dprint("Try to register: %s\n", all_monitors[num]->name);
 		test_mon = all_monitors[num]->do_register();
 		if (test_mon) {
-			if (test_mon->needs_root && !run_as_root) {
+			if (test_mon->flags.needs_root && !run_as_root) {
 				fprintf(stderr, _("Available monitor %s needs "
 					"root access\n"), test_mon->name);
 				continue;
@@ -60,7 +60,10 @@ struct cpuidle_monitor {
 	struct cpuidle_monitor* (*do_register) (void);
 	void (*unregister)(void);
 	unsigned int overflow_s;
-	int needs_root;
+	struct {
+		unsigned int needs_root:1;
+		unsigned int per_cpu_schedule:1;
+	} flags;
 };

 extern long long timespec_diff_us(struct timespec start, struct timespec end);
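
The .flags.needs_root = 1 initializers used throughout this merge rely on C99
nested designated initialization; a standalone sketch (hypothetical struct,
not cpupower code):

    #include <stdio.h>

    /* Mirrors the flags sub-struct pattern introduced above. */
    struct monitor {
            unsigned int overflow_s;
            struct {
                    unsigned int needs_root:1;
                    unsigned int per_cpu_schedule:1;
            } flags;
    };

    static struct monitor example_monitor = {
            .overflow_s = 3600,
            .flags.needs_root = 1,  /* nested designated initializer */
    };

    int main(void)
    {
            printf("needs_root=%u per_cpu_schedule=%u\n",
                   example_monitor.flags.needs_root,
                   example_monitor.flags.per_cpu_schedule);
            return 0;
    }
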
@@ -39,7 +39,6 @@ static cstate_t hsw_ext_cstates[HSW_EXT_CSTATE_COUNT] = {
 	{
 		.name			= "PC9",
 		.desc			= N_("Processor Package C9"),
-		.desc			= N_("Processor Package C2"),
 		.id			= PC9,
 		.range			= RANGE_PACKAGE,
 		.get_count_percent	= hsw_ext_get_count_percent,
@@ -188,7 +187,7 @@ struct cpuidle_monitor intel_hsw_ext_monitor = {
 	.stop = hsw_ext_stop,
 	.do_register = hsw_ext_register,
 	.unregister = hsw_ext_unregister,
-	.needs_root = 1,
+	.flags.needs_root = 1,
 	.overflow_s = 922000000 /* 922337203 seconds TSC overflow
				   at 20GHz */
 };
@@ -19,6 +19,10 @@
 #define MSR_APERF	0xE8
 #define MSR_MPERF	0xE7

+#define RDPRU ".byte 0x0f, 0x01, 0xfd"
+#define RDPRU_ECX_MPERF	0
+#define RDPRU_ECX_APERF	1
+
 #define MSR_TSC	0x10

 #define MSR_AMD_HWCR 0xc0010015
@@ -86,15 +90,51 @@ static int mperf_get_tsc(unsigned long long *tsc)
 	return ret;
 }

-static int mperf_init_stats(unsigned int cpu)
+static int get_aperf_mperf(int cpu, unsigned long long *aval,
+			   unsigned long long *mval)
 {
-	unsigned long long val;
+	unsigned long low_a, high_a;
+	unsigned long low_m, high_m;
 	int ret;

-	ret = read_msr(cpu, MSR_APERF, &val);
-	aperf_previous_count[cpu] = val;
-	ret |= read_msr(cpu, MSR_MPERF, &val);
-	mperf_previous_count[cpu] = val;
+	/*
+	 * Running on the cpu from which we read the registers will
+	 * prevent APERF/MPERF from going out of sync because of IPI
+	 * latency introduced by read_msr()s.
+	 */
+	if (mperf_monitor.flags.per_cpu_schedule) {
+		if (bind_cpu(cpu))
+			return 1;
+	}
+
+	if (cpupower_cpu_info.caps & CPUPOWER_CAP_AMD_RDPRU) {
+		asm volatile(RDPRU
+			     : "=a" (low_a), "=d" (high_a)
+			     : "c" (RDPRU_ECX_APERF));
+		asm volatile(RDPRU
+			     : "=a" (low_m), "=d" (high_m)
+			     : "c" (RDPRU_ECX_MPERF));
+
+		*aval = ((low_a) | (high_a) << 32);
+		*mval = ((low_m) | (high_m) << 32);
+
+		return 0;
+	}
+
+	ret = read_msr(cpu, MSR_APERF, aval);
+	ret |= read_msr(cpu, MSR_MPERF, mval);
+
+	return ret;
+}
+
+static int mperf_init_stats(unsigned int cpu)
+{
+	unsigned long long aval, mval;
+	int ret;
+
+	ret = get_aperf_mperf(cpu, &aval, &mval);
+	aperf_previous_count[cpu] = aval;
+	mperf_previous_count[cpu] = mval;
 	is_valid[cpu] = !ret;

 	return 0;
@@ -102,13 +142,12 @@ static int mperf_init_stats(unsigned int cpu)

 static int mperf_measure_stats(unsigned int cpu)
 {
-	unsigned long long val;
+	unsigned long long aval, mval;
 	int ret;

-	ret = read_msr(cpu, MSR_APERF, &val);
-	aperf_current_count[cpu] = val;
-	ret |= read_msr(cpu, MSR_MPERF, &val);
-	mperf_current_count[cpu] = val;
+	ret = get_aperf_mperf(cpu, &aval, &mval);
+	aperf_current_count[cpu] = aval;
+	mperf_current_count[cpu] = mval;
 	is_valid[cpu] = !ret;

 	return 0;
@@ -305,6 +344,9 @@ struct cpuidle_monitor *mperf_register(void)
 	if (init_maxfreq_mode())
 		return NULL;

+	if (cpupower_cpu_info.vendor == X86_VENDOR_AMD)
+		mperf_monitor.flags.per_cpu_schedule = 1;
+
 	/* Free this at program termination */
 	is_valid = calloc(cpu_count, sizeof(int));
 	mperf_previous_count = calloc(cpu_count, sizeof(unsigned long long));
@@ -333,7 +375,7 @@ struct cpuidle_monitor mperf_monitor = {
 	.stop = mperf_stop,
 	.do_register = mperf_register,
 	.unregister = mperf_unregister,
-	.needs_root = 1,
+	.flags.needs_root = 1,
 	.overflow_s = 922000000 /* 922337203 seconds TSC overflow
				   at 20GHz */
 };
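
The bind_cpu() helper called in get_aperf_mperf() is not shown in this merge;
presumably it pins the calling thread to the target CPU. A user-space sketch
of that idea (an assumption, using sched_setaffinity):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Sketch of a bind_cpu()-style helper: pin the calling thread to @cpu
     * so that subsequent APERF/MPERF reads execute locally, avoiding the
     * IPI latency the comment above describes. Returns 0 on success.
     */
    static int bind_cpu_sketch(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set)) {
                    perror("sched_setaffinity");
                    return 1;
            }
            return 0;
    }
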
@@ -208,7 +208,7 @@ struct cpuidle_monitor intel_nhm_monitor = {
 	.stop = nhm_stop,
 	.do_register = intel_nhm_register,
 	.unregister = intel_nhm_unregister,
-	.needs_root = 1,
+	.flags.needs_root = 1,
 	.overflow_s = 922000000 /* 922337203 seconds TSC overflow
				   at 20GHz */
 };
@@ -192,7 +192,7 @@ struct cpuidle_monitor intel_snb_monitor = {
 	.stop = snb_stop,
 	.do_register = snb_register,
 	.unregister = snb_unregister,
-	.needs_root = 1,
+	.flags.needs_root = 1,
 	.overflow_s = 922000000 /* 922337203 seconds TSC overflow
				   at 20GHz */
 };