Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cache quality monitoring update from Thomas Gleixner:
 "This update provides a complete rewrite of the Cache Quality
  Monitoring (CQM) facility.

  The existing CQM support was duct taped into perf with a lot of issues
  and the attempts to fix those turned out to be incomplete and
  horrible.

  After lengthy discussions it was decided to integrate the CQM support
  into the Resource Director Technology (RDT) facility, which is the
  obvious choise as in hardware CQM is part of RDT. This allowed to add
  Memory Bandwidth Monitoring support on top.

  As a result the mechanisms for allocating cache/memory bandwidth and
  the corresponding monitoring mechanisms are integrated into a single
  management facility with a consistent user interface"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  x86/intel_rdt: Turn off most RDT features on Skylake
  x86/intel_rdt: Add command line options for resource director technology
  x86/intel_rdt: Move special case code for Haswell to a quirk function
  x86/intel_rdt: Remove redundant ternary operator on return
  x86/intel_rdt/cqm: Improve limbo list processing
  x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
  x86/intel_rdt: Modify the intel_pqr_state for better performance
  x86/intel_rdt/cqm: Clear the default RMID during hotcpu
  x86/intel_rdt: Show bitmask of shareable resource with other executing units
  x86/intel_rdt/mbm: Handle counter overflow
  x86/intel_rdt/mbm: Add mbm counter initialization
  x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
  x86/intel_rdt/cqm: Add CPU hotplug support
  x86/intel_rdt/cqm: Add sched_in support
  x86/intel_rdt: Introduce rdt_enable_key for scheduling
  x86/intel_rdt/cqm: Add mount,umount support
  x86/intel_rdt/cqm: Add rmdir support
  x86/intel_rdt: Separate the ctrl bits from rmdir
  x86/intel_rdt/cqm: Add mon_data
  x86/intel_rdt: Prepare for RDT monitor data support
  ...
This commit is contained in:
Linus Torvalds 2017-09-04 13:56:37 -07:00
commit f57091767a
21 changed files with 2643 additions and 2439 deletions

View File

@ -138,6 +138,7 @@ parameter is applicable::
PPT Parallel port support is enabled.
PS2 Appropriate PS/2 support is enabled.
RAM RAM disk support is enabled.
RDT Intel Resource Director Technology.
S390 S390 architecture is enabled.
SCSI Appropriate SCSI support is enabled.
A lot of drivers have their options described inside

View File

@ -3612,6 +3612,12 @@
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, mba.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
reboot= [KNL]
Format (x86 or x86_64):
[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \

View File

@ -6,8 +6,8 @@ Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>
This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".
To use the feature mount the file system:
@ -17,6 +17,13 @@ mount options are:
"cdp": Enable code/data prioritization in L3 cache allocations.
RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control.
The mount succeeds if either of allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.
Info directory
--------------
@ -24,7 +31,12 @@ Info directory
The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
Cache resource(L3/L2) subdirectory contains the following files:
Each subdirectory contains the following files with respect to
allocation:
Cache resource(L3/L2) subdirectory contains the following files
related to allocation:
"num_closids": The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
@ -36,7 +48,15 @@ Cache resource(L3/L2) subdirectory contains the following files:
"min_cbm_bits": The minimum number of consecutive bits which
must be set when writing a mask.
Memory bandwitdh(MB) subdirectory contains the following files:
"shareable_bits": Bitmask of shareable resource with other executing
entities (e.g. I/O). User can use this when
setting up exclusive cache partitions. Note that
some platforms support devices that have their
own settings for cache use which can over-ride
these bits.
Memory bandwitdh(MB) subdirectory contains the following files
with respect to allocation:
"min_bandwidth": The minimum memory bandwidth percentage which
user can request.
@ -52,48 +72,152 @@ Memory bandwitdh(MB) subdirectory contains the following files:
non-linear. This field is purely informational
only.
Resource groups
---------------
If RDT monitoring is available there will be an "L3_MON" directory
with the following files:
"num_rmids": The number of RMIDs available. This is the
upper bound for how many "CTRL_MON" + "MON"
groups can be created.
"mon_features": Lists the monitoring events if
monitoring is enabled for the resource.
"max_threshold_occupancy":
Read/write file provides the largest value (in
bytes) at which a previously used LLC_occupancy
counter can be considered for re-use.
Resource alloc and monitor groups
---------------------------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.
There are three files associated with each group:
On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.
"tasks": A list of tasks that belongs to this group. Tasks can be
added to a group by writing the task ID to the "tasks" file
(which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2)
and clone(2) are added to the same group as their parent.
If a pid is not in any sub partition, it is in root partition
(i.e. default partition).
On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.
"cpus": A bitmask of logical CPUs assigned to this group. Writing
a new mask can add/remove CPUs from this group. Added CPUs
are removed from their previous group. Removed ones are
given to the default (root) group. You cannot remove CPUs
from the default group.
Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.
"cpus_list": One or more CPU ranges of logical CPUs assigned to this
group. Same rules apply like for the "cpus" file.
All groups contain the following files:
"schemata": A list of all the resources available to this group.
Each resource has its own line and format - see below for
details.
"tasks":
Reading this file shows the list of all tasks that belong to
this group. Writing a task id to the file will add a task to the
group. If the group is a CTRL_MON group the task is removed from
whichever previous CTRL_MON group owned the task and also from
any MON group that owned the task. If the group is a MON group,
then the task must already belong to the CTRL_MON parent of this
group. The task is removed from any previous MON group.
When a task is running the following rules define which resources
are available to it:
"cpus":
Reading this file shows a bitmask of the logical CPUs owned by
this group. Writing a mask to this file will add and remove
CPUs to/from this group. As with the tasks file a hierarchy is
maintained where MON groups may only include CPUs owned by the
parent CTRL_MON group.
"cpus_list":
Just like "cpus", only using ranges of CPUs instead of bitmasks.
When control is enabled all CTRL_MON groups will also contain:
"schemata":
A list of all the resources available to this group.
Each resource has its own line and format - see below for details.
When monitoring is enabled all MON groups will also contain:
"mon_data":
This contains a set of files organized by L3 domain and by
RDT event. E.g. on a system with two L3 domains there will
be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
directories have one file per event (e.g. "llc_occupancy",
"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
files provide a read out of the current value of the event for
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:
1) If the task is a member of a non-default group, then the schemata
for that group is used.
for that group is used.
2) Else if the task belongs to the default group, but is running on a
CPU that is assigned to some specific group, then the schemata for
the CPU's group is used.
CPU that is assigned to some specific group, then the schemata for the
CPU's group is used.
3) Otherwise the schemata for the default group is used.
Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
then RDT events for the task will be reported in that group.
2) If a task is a member of the default CTRL_MON group, but is running
on a CPU that is assigned to some specific group, then the RDT events
for the task will be reported in that group.
3) Otherwise RDT events for the task will be reported in the root level
"mon_data" group.
Notes on cache occupancy monitoring and control
-----------------------------------------------
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.
The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.
Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID)
to identify a control group and a monitoring group respectively. Each of
the resource groups are mapped to these IDs based on the kind of group. The
number of CLOSid and RMID are limited by the hardware and hence the creation of
a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID
and creation of "MON" group may fail if we run out of RMIDs.
max_threshold_occupancy - generic concepts
------------------------------------------
Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged the cache lines of the previous user of RMID.
Hence such RMIDs are placed on limbo list and checked back if the cache
occupancy has gone down. If there is a time when system has a lot of
limbo RMIDs but which are not ready to be used, user may see an -EBUSY
during mkdir.
max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
Schemata files - general concepts
---------------------------------
@ -143,22 +267,22 @@ SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
L3 details (code and data prioritization disabled)
--------------------------------------------------
L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is:
L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3 details (CDP enabled via mount option to resctrl)
----------------------------------------------------
L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:
L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L2 details
----------
L2 schemata file details
------------------------
L2 cache does not support code and data prioritization, so the
schemata format is always:
@ -185,6 +309,8 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
Examples for RDT allocation usage:
Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
@ -410,3 +536,124 @@ void main(void)
/* code to read and write directory contents */
resctrl_release_lock(fd);
}
Examples for RDT Monitoring along with allocation usage:
Reading monitored data
----------------------
Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
show the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.
Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks
The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").
Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.
Create monitor groups and assign a subset of tasks to each monitor group.
# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks
fetch data (data shown in bytes)
# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000
The parent ctrl_mon group shows the aggregated data.
# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
31234000
Example 2 (Monitor a task from its creation)
---------
On a two socket machine (one L3 cache per socket)
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
An RMID is allocated to the group once its created and hence the <cmd>
below is monitored from its creation.
# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>
Fetch the data
# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
31789000
Example 3 (Monitor without CAT support or before creating CAT groups)
---------
Assume a system like HSW has only CQM and no CAT support. In this case
the resctrl will still mount but cannot create CTRL_MON directories.
But user can create different MON groups within the root group thereby
able to monitor all tasks including kernel threads.
This can also be used to profile jobs cache size footprint before being
able to allocate them to different allocation groups.
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02
# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
Monitor the groups separately and also get per domain data. From the
below its apparent that the tasks are mostly doing work on
domain(socket) 0.
# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
32789
Example 4 (Monitor real time tasks)
-----------------------------------
A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1
Move the cpus 4-7 over to p1
# echo f0 > p0/cpus
View the llc occupancy snapshot
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000

View File

@ -11121,7 +11121,7 @@ M: Fenghua Yu <fenghua.yu@intel.com>
L: linux-kernel@vger.kernel.org
S: Supported
F: arch/x86/kernel/cpu/intel_rdt*
F: arch/x86/include/asm/intel_rdt*
F: arch/x86/include/asm/intel_rdt_sched.h
F: Documentation/x86/intel_rdt*
READ-COPY UPDATE (RCU)

View File

@ -429,16 +429,16 @@ config GOLDFISH
def_bool y
depends on X86_GOLDFISH
config INTEL_RDT_A
bool "Intel Resource Director Technology Allocation support"
config INTEL_RDT
bool "Intel Resource Director Technology support"
default n
depends on X86 && CPU_SUP_INTEL
select KERNFS
help
Select to enable resource allocation which is a sub-feature of
Intel Resource Director Technology(RDT). More information about
RDT can be found in the Intel x86 Architecture Software
Developer Manual.
Select to enable resource allocation and monitoring which are
sub-features of Intel Resource Director Technology(RDT). More
information about RDT can be found in the Intel x86
Architecture Software Developer Manual.
Say N if unsure.

View File

@ -1,4 +1,4 @@
obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o cqm.o
obj-$(CONFIG_CPU_SUP_INTEL) += core.o bts.o
obj-$(CONFIG_CPU_SUP_INTEL) += ds.o knc.o
obj-$(CONFIG_CPU_SUP_INTEL) += lbr.o p4.o p6.o pt.o
obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL) += intel-rapl-perf.o

File diff suppressed because it is too large Load Diff

View File

@ -1,286 +0,0 @@
#ifndef _ASM_X86_INTEL_RDT_H
#define _ASM_X86_INTEL_RDT_H
#ifdef CONFIG_INTEL_RDT_A
#include <linux/sched.h>
#include <linux/kernfs.h>
#include <linux/jump_label.h>
#include <asm/intel_rdt_common.h>
#define IA32_L3_QOS_CFG 0xc81
#define IA32_L3_CBM_BASE 0xc90
#define IA32_L2_CBM_BASE 0xd10
#define IA32_MBA_THRTL_BASE 0xd50
#define L3_QOS_CDP_ENABLE 0x01ULL
/**
* struct rdtgroup - store rdtgroup's data in resctrl file system.
* @kn: kernfs node
* @rdtgroup_list: linked list for all rdtgroups
* @closid: closid for this rdtgroup
* @cpu_mask: CPUs assigned to this rdtgroup
* @flags: status bits
* @waitcount: how many cpus expect to find this
* group when they acquire rdtgroup_mutex
*/
struct rdtgroup {
struct kernfs_node *kn;
struct list_head rdtgroup_list;
int closid;
struct cpumask cpu_mask;
int flags;
atomic_t waitcount;
};
/* rdtgroup.flags */
#define RDT_DELETED 1
/* rftype.flags */
#define RFTYPE_FLAGS_CPUS_LIST 1
/* List of all resource groups */
extern struct list_head rdt_all_groups;
extern int max_name_width, max_data_width;
int __init rdtgroup_init(void);
/**
* struct rftype - describe each file in the resctrl file system
* @name: File name
* @mode: Access mode
* @kf_ops: File operations
* @flags: File specific RFTYPE_FLAGS_* flags
* @seq_show: Show content of the file
* @write: Write to the file
*/
struct rftype {
char *name;
umode_t mode;
struct kernfs_ops *kf_ops;
unsigned long flags;
int (*seq_show)(struct kernfs_open_file *of,
struct seq_file *sf, void *v);
/*
* write() is the generic write callback which maps directly to
* kernfs write operation and overrides all other operations.
* Maximum write size is determined by ->max_write_len.
*/
ssize_t (*write)(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
};
/**
* struct rdt_domain - group of cpus sharing an RDT resource
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which cpus share this resource
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @new_ctrl: new ctrl value to be loaded
* @have_new_ctrl: did user provide new_ctrl for this domain
*/
struct rdt_domain {
struct list_head list;
int id;
struct cpumask cpu_mask;
u32 *ctrl_val;
u32 new_ctrl;
bool have_new_ctrl;
};
/**
* struct msr_param - set a range of MSRs from a domain
* @res: The resource to use
* @low: Beginning index from base MSR
* @high: End index
*/
struct msr_param {
struct rdt_resource *res;
int low;
int high;
};
/**
* struct rdt_cache - Cache allocation related data
* @cbm_len: Length of the cache bit mask
* @min_cbm_bits: Minimum number of consecutive bits to be set
* @cbm_idx_mult: Multiplier of CBM index
* @cbm_idx_offset: Offset of CBM index. CBM index is computed by:
* closid * cbm_idx_multi + cbm_idx_offset
* in a cache bit mask
*/
struct rdt_cache {
unsigned int cbm_len;
unsigned int min_cbm_bits;
unsigned int cbm_idx_mult;
unsigned int cbm_idx_offset;
};
/**
* struct rdt_membw - Memory bandwidth allocation related data
* @max_delay: Max throttle delay. Delay is the hardware
* representation for memory bandwidth.
* @min_bw: Minimum memory bandwidth percentage user can request
* @bw_gran: Granularity at which the memory bandwidth is allocated
* @delay_linear: True if memory B/W delay is in linear scale
* @mb_map: Mapping of memory B/W percentage to memory B/W delay
*/
struct rdt_membw {
u32 max_delay;
u32 min_bw;
u32 bw_gran;
u32 delay_linear;
u32 *mb_map;
};
/**
* struct rdt_resource - attributes of an RDT resource
* @enabled: Is this feature enabled on this machine
* @capable: Is this feature available on this machine
* @name: Name to use in "schemata" file
* @num_closid: Number of CLOSIDs available
* @cache_level: Which cache level defines scope of this resource
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
* @msr_base: Base MSR address for CBMs
* @msr_update: Function pointer to update QOS MSRs
* @data_width: Character width of data when displaying
* @domains: All domains for this resource
* @cache: Cache allocation related data
* @info_files: resctrl info files for the resource
* @nr_info_files: Number of info files
* @format_str: Per resource format string to show domain value
* @parse_ctrlval: Per resource function pointer to parse control values
*/
struct rdt_resource {
bool enabled;
bool capable;
char *name;
int num_closid;
int cache_level;
u32 default_ctrl;
unsigned int msr_base;
void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
int data_width;
struct list_head domains;
struct rdt_cache cache;
struct rdt_membw membw;
struct rftype *info_files;
int nr_info_files;
const char *format_str;
int (*parse_ctrlval) (char *buf, struct rdt_resource *r,
struct rdt_domain *d);
};
void rdt_get_cache_infofile(struct rdt_resource *r);
void rdt_get_mba_infofile(struct rdt_resource *r);
int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d);
int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d);
extern struct mutex rdtgroup_mutex;
extern struct rdt_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
int __init rdtgroup_init(void);
enum {
RDT_RESOURCE_L3,
RDT_RESOURCE_L3DATA,
RDT_RESOURCE_L3CODE,
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
/* Must be the last */
RDT_NUM_RESOURCES,
};
#define for_each_capable_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
if (r->capable)
#define for_each_enabled_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
if (r->enabled)
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
struct {
unsigned int cbm_len:5;
} split;
unsigned int full;
};
/* CPUID.(EAX=10H, ECX=ResID=3).EAX */
union cpuid_0x10_3_eax {
struct {
unsigned int max_delay:12;
} split;
unsigned int full;
};
/* CPUID.(EAX=10H, ECX=ResID).EDX */
union cpuid_0x10_x_edx {
struct {
unsigned int cos_max:16;
} split;
unsigned int full;
};
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_closid);
void rdt_ctrl_update(void *arg);
struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
void rdtgroup_kn_unlock(struct kernfs_node *kn);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
/*
* intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
*
* Following considerations are made so that this has minimal impact
* on scheduler hot path:
* - This will stay as no-op unless we are running on an Intel SKU
* which supports resource control and we enable by mounting the
* resctrl file system.
* - Caches the per cpu CLOSid values and does the MSR write only
* when a task with a different CLOSid is scheduled in.
*
* Must be called with preemption disabled.
*/
static inline void intel_rdt_sched_in(void)
{
if (static_branch_likely(&rdt_enable_key)) {
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
int closid;
/*
* If this task has a closid assigned, use it.
* Else use the closid assigned to this cpu.
*/
closid = current->closid;
if (closid == 0)
closid = this_cpu_read(cpu_closid);
if (closid != state->closid) {
state->closid = closid;
wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, closid);
}
}
}
#else
static inline void intel_rdt_sched_in(void) {}
#endif /* CONFIG_INTEL_RDT_A */
#endif /* _ASM_X86_INTEL_RDT_H */

View File

@ -1,27 +0,0 @@
#ifndef _ASM_X86_INTEL_RDT_COMMON_H
#define _ASM_X86_INTEL_RDT_COMMON_H
#define MSR_IA32_PQR_ASSOC 0x0c8f
/**
* struct intel_pqr_state - State cache for the PQR MSR
* @rmid: The cached Resource Monitoring ID
* @closid: The cached Class Of Service ID
* @rmid_usecnt: The usage counter for rmid
*
* The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
* lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
* contains both parts, so we need to cache them.
*
* The cache also helps to avoid pointless updates if the value does
* not change.
*/
struct intel_pqr_state {
u32 rmid;
u32 closid;
int rmid_usecnt;
};
DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
#endif /* _ASM_X86_INTEL_RDT_COMMON_H */

View File

@ -0,0 +1,92 @@
#ifndef _ASM_X86_INTEL_RDT_SCHED_H
#define _ASM_X86_INTEL_RDT_SCHED_H
#ifdef CONFIG_INTEL_RDT
#include <linux/sched.h>
#include <linux/jump_label.h>
#define IA32_PQR_ASSOC 0x0c8f
/**
* struct intel_pqr_state - State cache for the PQR MSR
* @cur_rmid: The cached Resource Monitoring ID
* @cur_closid: The cached Class Of Service ID
* @default_rmid: The user assigned Resource Monitoring ID
* @default_closid: The user assigned cached Class Of Service ID
*
* The upper 32 bits of IA32_PQR_ASSOC contain closid and the
* lower 10 bits rmid. The update to IA32_PQR_ASSOC always
* contains both parts, so we need to cache them. This also
* stores the user configured per cpu CLOSID and RMID.
*
* The cache also helps to avoid pointless updates if the value does
* not change.
*/
struct intel_pqr_state {
u32 cur_rmid;
u32 cur_closid;
u32 default_rmid;
u32 default_closid;
};
DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
/*
* __intel_rdt_sched_in() - Writes the task's CLOSid/RMID to IA32_PQR_MSR
*
* Following considerations are made so that this has minimal impact
* on scheduler hot path:
* - This will stay as no-op unless we are running on an Intel SKU
* which supports resource control or monitoring and we enable by
* mounting the resctrl file system.
* - Caches the per cpu CLOSid/RMID values and does the MSR write only
* when a task with a different CLOSid/RMID is scheduled in.
* - We allocate RMIDs/CLOSids globally in order to keep this as
* simple as possible.
* Must be called with preemption disabled.
*/
static void __intel_rdt_sched_in(void)
{
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
u32 closid = state->default_closid;
u32 rmid = state->default_rmid;
/*
* If this task has a closid/rmid assigned, use it.
* Else use the closid/rmid assigned to this cpu.
*/
if (static_branch_likely(&rdt_alloc_enable_key)) {
if (current->closid)
closid = current->closid;
}
if (static_branch_likely(&rdt_mon_enable_key)) {
if (current->rmid)
rmid = current->rmid;
}
if (closid != state->cur_closid || rmid != state->cur_rmid) {
state->cur_closid = closid;
state->cur_rmid = rmid;
wrmsr(IA32_PQR_ASSOC, rmid, closid);
}
}
static inline void intel_rdt_sched_in(void)
{
if (static_branch_likely(&rdt_enable_key))
__intel_rdt_sched_in();
}
#else
static inline void intel_rdt_sched_in(void) {}
#endif /* CONFIG_INTEL_RDT */
#endif /* _ASM_X86_INTEL_RDT_SCHED_H */

View File

@ -33,7 +33,7 @@ obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o
obj-$(CONFIG_INTEL_RDT_A) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_schemata.o
obj-$(CONFIG_INTEL_RDT) += intel_rdt.o intel_rdt_rdtgroup.o intel_rdt_monitor.o intel_rdt_ctrlmondata.o
obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/

View File

@ -30,7 +30,8 @@
#include <linux/cpuhotplug.h>
#include <asm/intel-family.h>
#include <asm/intel_rdt.h>
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"
#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
@ -38,7 +39,13 @@
/* Mutex to protect rdtgroup access. */
DEFINE_MUTEX(rdtgroup_mutex);
DEFINE_PER_CPU_READ_MOSTLY(int, cpu_closid);
/*
* The cached intel_pqr_state is strictly per CPU and can never be
* updated from a remote CPU. Functions which modify the state
* are called with interrupts disabled and no preemption, which
* is sufficient for the protection.
*/
DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
/*
* Used to store the max resource name width and max resource data width
@ -46,6 +53,12 @@ DEFINE_PER_CPU_READ_MOSTLY(int, cpu_closid);
*/
int max_name_width, max_data_width;
/*
* Global boolean for rdt_alloc which is true if any
* resource allocation is enabled.
*/
bool rdt_alloc_capable;
static void
mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
static void
@ -54,7 +67,9 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r);
#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].domains)
struct rdt_resource rdt_resources_all[] = {
[RDT_RESOURCE_L3] =
{
.rid = RDT_RESOURCE_L3,
.name = "L3",
.domains = domain_init(RDT_RESOURCE_L3),
.msr_base = IA32_L3_CBM_BASE,
@ -67,8 +82,11 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L3DATA] =
{
.rid = RDT_RESOURCE_L3DATA,
.name = "L3DATA",
.domains = domain_init(RDT_RESOURCE_L3DATA),
.msr_base = IA32_L3_CBM_BASE,
@ -81,8 +99,11 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L3CODE] =
{
.rid = RDT_RESOURCE_L3CODE,
.name = "L3CODE",
.domains = domain_init(RDT_RESOURCE_L3CODE),
.msr_base = IA32_L3_CBM_BASE,
@ -95,8 +116,11 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_L2] =
{
.rid = RDT_RESOURCE_L2,
.name = "L2",
.domains = domain_init(RDT_RESOURCE_L2),
.msr_base = IA32_L2_CBM_BASE,
@ -109,8 +133,11 @@ struct rdt_resource rdt_resources_all[] = {
},
.parse_ctrlval = parse_cbm,
.format_str = "%d=%0*x",
.fflags = RFTYPE_RES_CACHE,
},
[RDT_RESOURCE_MBA] =
{
.rid = RDT_RESOURCE_MBA,
.name = "MB",
.domains = domain_init(RDT_RESOURCE_MBA),
.msr_base = IA32_MBA_THRTL_BASE,
@ -118,6 +145,7 @@ struct rdt_resource rdt_resources_all[] = {
.cache_level = 3,
.parse_ctrlval = parse_bw,
.format_str = "%d=%*d",
.fflags = RFTYPE_RES_MB,
},
};
@ -144,33 +172,28 @@ static unsigned int cbm_idx(struct rdt_resource *r, unsigned int closid)
* is always 20 on hsw server parts. The minimum cache bitmask length
* allowed for HSW server is always 2 bits. Hardcode all of them.
*/
static inline bool cache_alloc_hsw_probe(void)
static inline void cache_alloc_hsw_probe(void)
{
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
boot_cpu_data.x86 == 6 &&
boot_cpu_data.x86_model == INTEL_FAM6_HASWELL_X) {
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
u32 l, h, max_cbm = BIT_MASK(20) - 1;
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
u32 l, h, max_cbm = BIT_MASK(20) - 1;
if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0))
return false;
rdmsr(IA32_L3_CBM_BASE, l, h);
if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0))
return;
rdmsr(IA32_L3_CBM_BASE, l, h);
/* If all the bits were set in MSR, return success */
if (l != max_cbm)
return false;
/* If all the bits were set in MSR, return success */
if (l != max_cbm)
return;
r->num_closid = 4;
r->default_ctrl = max_cbm;
r->cache.cbm_len = 20;
r->cache.min_cbm_bits = 2;
r->capable = true;
r->enabled = true;
r->num_closid = 4;
r->default_ctrl = max_cbm;
r->cache.cbm_len = 20;
r->cache.shareable_bits = 0xc0000;
r->cache.min_cbm_bits = 2;
r->alloc_capable = true;
r->alloc_enabled = true;
return true;
}
return false;
rdt_alloc_capable = true;
}
/*
@ -213,15 +236,14 @@ static bool rdt_get_mem_config(struct rdt_resource *r)
return false;
}
r->data_width = 3;
rdt_get_mba_infofile(r);
r->capable = true;
r->enabled = true;
r->alloc_capable = true;
r->alloc_enabled = true;
return true;
}
static void rdt_get_cache_config(int idx, struct rdt_resource *r)
static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
{
union cpuid_0x10_1_eax eax;
union cpuid_0x10_x_edx edx;
@ -231,10 +253,10 @@ static void rdt_get_cache_config(int idx, struct rdt_resource *r)
r->num_closid = edx.split.cos_max + 1;
r->cache.cbm_len = eax.split.cbm_len + 1;
r->default_ctrl = BIT_MASK(eax.split.cbm_len + 1) - 1;
r->cache.shareable_bits = ebx & r->default_ctrl;
r->data_width = (r->cache.cbm_len + 3) / 4;
rdt_get_cache_infofile(r);
r->capable = true;
r->enabled = true;
r->alloc_capable = true;
r->alloc_enabled = true;
}
static void rdt_get_cdp_l3_config(int type)
@ -246,12 +268,12 @@ static void rdt_get_cdp_l3_config(int type)
r->cache.cbm_len = r_l3->cache.cbm_len;
r->default_ctrl = r_l3->default_ctrl;
r->data_width = (r->cache.cbm_len + 3) / 4;
r->capable = true;
r->alloc_capable = true;
/*
* By default, CDP is disabled. CDP can be enabled by mount parameter
* "cdp" during resctrl file system mount time.
*/
r->enabled = false;
r->alloc_enabled = false;
}
static int get_cache_id(int cpu, int level)
@ -300,6 +322,19 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r)
wrmsrl(r->msr_base + cbm_idx(r, i), d->ctrl_val[i]);
}
struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r)
{
struct rdt_domain *d;
list_for_each_entry(d, &r->domains, list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->cpu_mask))
return d;
}
return NULL;
}
void rdt_ctrl_update(void *arg)
{
struct msr_param *m = arg;
@ -307,12 +342,10 @@ void rdt_ctrl_update(void *arg)
int cpu = smp_processor_id();
struct rdt_domain *d;
list_for_each_entry(d, &r->domains, list) {
/* Find the domain that contains this CPU */
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
r->msr_update(d, m, r);
return;
}
d = get_domain_from_cpu(cpu, r);
if (d) {
r->msr_update(d, m, r);
return;
}
pr_warn_once("cpu %d not found in any domain for resource %s\n",
cpu, r->name);
@ -326,8 +359,8 @@ void rdt_ctrl_update(void *arg)
* caller, return the first domain whose id is bigger than the input id.
* The domain list is sorted by id in ascending order.
*/
static struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head **pos)
struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head **pos)
{
struct rdt_domain *d;
struct list_head *l;
@ -377,6 +410,44 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
return 0;
}
static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d)
{
size_t tsize;
if (is_llc_occupancy_enabled()) {
d->rmid_busy_llc = kcalloc(BITS_TO_LONGS(r->num_rmid),
sizeof(unsigned long),
GFP_KERNEL);
if (!d->rmid_busy_llc)
return -ENOMEM;
INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo);
}
if (is_mbm_total_enabled()) {
tsize = sizeof(*d->mbm_total);
d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
if (!d->mbm_total) {
kfree(d->rmid_busy_llc);
return -ENOMEM;
}
}
if (is_mbm_local_enabled()) {
tsize = sizeof(*d->mbm_local);
d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
if (!d->mbm_local) {
kfree(d->rmid_busy_llc);
kfree(d->mbm_total);
return -ENOMEM;
}
}
if (is_mbm_enabled()) {
INIT_DELAYED_WORK(&d->mbm_over, mbm_handle_overflow);
mbm_setup_overflow_handler(d, MBM_OVERFLOW_INTERVAL);
}
return 0;
}
/*
* domain_add_cpu - Add a cpu to a resource's domain list.
*
@ -412,14 +483,26 @@ static void domain_add_cpu(int cpu, struct rdt_resource *r)
return;
d->id = id;
cpumask_set_cpu(cpu, &d->cpu_mask);
if (domain_setup_ctrlval(r, d)) {
if (r->alloc_capable && domain_setup_ctrlval(r, d)) {
kfree(d);
return;
}
if (r->mon_capable && domain_setup_mon_state(r, d)) {
kfree(d);
return;
}
cpumask_set_cpu(cpu, &d->cpu_mask);
list_add_tail(&d->list, add_pos);
/*
* If resctrl is mounted, add
* per domain monitor data directories.
*/
if (static_branch_unlikely(&rdt_mon_enable_key))
mkdir_mondata_subdir_allrdtgrp(r, d);
}
static void domain_remove_cpu(int cpu, struct rdt_resource *r)
@ -435,19 +518,58 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
cpumask_clear_cpu(cpu, &d->cpu_mask);
if (cpumask_empty(&d->cpu_mask)) {
/*
* If resctrl is mounted, remove all the
* per domain monitor data directories.
*/
if (static_branch_unlikely(&rdt_mon_enable_key))
rmdir_mondata_subdir_allrdtgrp(r, d->id);
kfree(d->ctrl_val);
kfree(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
list_del(&d->list);
if (is_mbm_enabled())
cancel_delayed_work(&d->mbm_over);
if (is_llc_occupancy_enabled() && has_busy_rmid(r, d)) {
/*
* When a package is going down, forcefully
* decrement rmid->ebusy. There is no way to know
* that the L3 was flushed and hence may lead to
* incorrect counts in rare scenarios, but leaving
* the RMID as busy creates RMID leaks if the
* package never comes back.
*/
__check_limbo(d, true);
cancel_delayed_work(&d->cqm_limbo);
}
kfree(d);
return;
}
if (r == &rdt_resources_all[RDT_RESOURCE_L3]) {
if (is_mbm_enabled() && cpu == d->mbm_work_cpu) {
cancel_delayed_work(&d->mbm_over);
mbm_setup_overflow_handler(d, 0);
}
if (is_llc_occupancy_enabled() && cpu == d->cqm_work_cpu &&
has_busy_rmid(r, d)) {
cancel_delayed_work(&d->cqm_limbo);
cqm_setup_limbo_handler(d, 0);
}
}
}
static void clear_closid(int cpu)
static void clear_closid_rmid(int cpu)
{
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
per_cpu(cpu_closid, cpu) = 0;
state->closid = 0;
wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);
state->default_closid = 0;
state->default_rmid = 0;
state->cur_closid = 0;
state->cur_rmid = 0;
wrmsr(IA32_PQR_ASSOC, 0, 0);
}
static int intel_rdt_online_cpu(unsigned int cpu)
@ -459,12 +581,23 @@ static int intel_rdt_online_cpu(unsigned int cpu)
domain_add_cpu(cpu, r);
/* The cpu is set in default rdtgroup after online. */
cpumask_set_cpu(cpu, &rdtgroup_default.cpu_mask);
clear_closid(cpu);
clear_closid_rmid(cpu);
mutex_unlock(&rdtgroup_mutex);
return 0;
}
static void clear_childcpus(struct rdtgroup *r, unsigned int cpu)
{
struct rdtgroup *cr;
list_for_each_entry(cr, &r->mon.crdtgrp_list, mon.crdtgrp_list) {
if (cpumask_test_and_clear_cpu(cpu, &cr->cpu_mask)) {
break;
}
}
}
static int intel_rdt_offline_cpu(unsigned int cpu)
{
struct rdtgroup *rdtgrp;
@ -474,10 +607,12 @@ static int intel_rdt_offline_cpu(unsigned int cpu)
for_each_capable_rdt_resource(r)
domain_remove_cpu(cpu, r);
list_for_each_entry(rdtgrp, &rdt_all_groups, rdtgroup_list) {
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask))
if (cpumask_test_and_clear_cpu(cpu, &rdtgrp->cpu_mask)) {
clear_childcpus(rdtgrp, cpu);
break;
}
}
clear_closid(cpu);
clear_closid_rmid(cpu);
mutex_unlock(&rdtgroup_mutex);
return 0;
@ -492,7 +627,7 @@ static __init void rdt_init_padding(void)
struct rdt_resource *r;
int cl;
for_each_capable_rdt_resource(r) {
for_each_alloc_capable_rdt_resource(r) {
cl = strlen(r->name);
if (cl > max_name_width)
max_name_width = cl;
@ -502,38 +637,153 @@ static __init void rdt_init_padding(void)
}
}
static __init bool get_rdt_resources(void)
enum {
RDT_FLAG_CMT,
RDT_FLAG_MBM_TOTAL,
RDT_FLAG_MBM_LOCAL,
RDT_FLAG_L3_CAT,
RDT_FLAG_L3_CDP,
RDT_FLAG_L2_CAT,
RDT_FLAG_MBA,
};
#define RDT_OPT(idx, n, f) \
[idx] = { \
.name = n, \
.flag = f \
}
struct rdt_options {
char *name;
int flag;
bool force_off, force_on;
};
static struct rdt_options rdt_options[] __initdata = {
RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC),
RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL),
RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL),
RDT_OPT(RDT_FLAG_L3_CAT, "l3cat", X86_FEATURE_CAT_L3),
RDT_OPT(RDT_FLAG_L3_CDP, "l3cdp", X86_FEATURE_CDP_L3),
RDT_OPT(RDT_FLAG_L2_CAT, "l2cat", X86_FEATURE_CAT_L2),
RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
};
#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
static int __init set_rdt_options(char *str)
{
struct rdt_options *o;
bool force_off;
char *tok;
if (*str == '=')
str++;
while ((tok = strsep(&str, ",")) != NULL) {
force_off = *tok == '!';
if (force_off)
tok++;
for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
if (strcmp(tok, o->name) == 0) {
if (force_off)
o->force_off = true;
else
o->force_on = true;
break;
}
}
}
return 1;
}
__setup("rdt", set_rdt_options);
static bool __init rdt_cpu_has(int flag)
{
bool ret = boot_cpu_has(flag);
struct rdt_options *o;
if (!ret)
return ret;
for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
if (flag == o->flag) {
if (o->force_off)
ret = false;
if (o->force_on)
ret = true;
break;
}
}
return ret;
}
static __init bool get_rdt_alloc_resources(void)
{
bool ret = false;
if (cache_alloc_hsw_probe())
if (rdt_alloc_capable)
return true;
if (!boot_cpu_has(X86_FEATURE_RDT_A))
return false;
if (boot_cpu_has(X86_FEATURE_CAT_L3)) {
rdt_get_cache_config(1, &rdt_resources_all[RDT_RESOURCE_L3]);
if (boot_cpu_has(X86_FEATURE_CDP_L3)) {
if (rdt_cpu_has(X86_FEATURE_CAT_L3)) {
rdt_get_cache_alloc_cfg(1, &rdt_resources_all[RDT_RESOURCE_L3]);
if (rdt_cpu_has(X86_FEATURE_CDP_L3)) {
rdt_get_cdp_l3_config(RDT_RESOURCE_L3DATA);
rdt_get_cdp_l3_config(RDT_RESOURCE_L3CODE);
}
ret = true;
}
if (boot_cpu_has(X86_FEATURE_CAT_L2)) {
if (rdt_cpu_has(X86_FEATURE_CAT_L2)) {
/* CPUID 0x10.2 fields are same format at 0x10.1 */
rdt_get_cache_config(2, &rdt_resources_all[RDT_RESOURCE_L2]);
rdt_get_cache_alloc_cfg(2, &rdt_resources_all[RDT_RESOURCE_L2]);
ret = true;
}
if (boot_cpu_has(X86_FEATURE_MBA)) {
if (rdt_cpu_has(X86_FEATURE_MBA)) {
if (rdt_get_mem_config(&rdt_resources_all[RDT_RESOURCE_MBA]))
ret = true;
}
return ret;
}
static __init bool get_rdt_mon_resources(void)
{
if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC))
rdt_mon_features |= (1 << QOS_L3_OCCUP_EVENT_ID);
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL))
rdt_mon_features |= (1 << QOS_L3_MBM_TOTAL_EVENT_ID);
if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))
rdt_mon_features |= (1 << QOS_L3_MBM_LOCAL_EVENT_ID);
if (!rdt_mon_features)
return false;
return !rdt_get_mon_l3_config(&rdt_resources_all[RDT_RESOURCE_L3]);
}
static __init void rdt_quirks(void)
{
switch (boot_cpu_data.x86_model) {
case INTEL_FAM6_HASWELL_X:
if (!rdt_options[RDT_FLAG_L3_CAT].force_off)
cache_alloc_hsw_probe();
break;
case INTEL_FAM6_SKYLAKE_X:
if (boot_cpu_data.x86_mask <= 4)
set_rdt_options("!cmt,!mbmtotal,!mbmlocal,!l3cat");
}
}
static __init bool get_rdt_resources(void)
{
rdt_quirks();
rdt_alloc_capable = get_rdt_alloc_resources();
rdt_mon_capable = get_rdt_mon_resources();
return (rdt_mon_capable || rdt_alloc_capable);
}
static int __init intel_rdt_late_init(void)
{
struct rdt_resource *r;
@ -556,9 +806,12 @@ static int __init intel_rdt_late_init(void)
return ret;
}
for_each_capable_rdt_resource(r)
for_each_alloc_capable_rdt_resource(r)
pr_info("Intel RDT %s allocation detected\n", r->name);
for_each_mon_capable_rdt_resource(r)
pr_info("Intel RDT %s monitoring detected\n", r->name);
return 0;
}

View File

@ -0,0 +1,440 @@
#ifndef _ASM_X86_INTEL_RDT_H
#define _ASM_X86_INTEL_RDT_H
#include <linux/sched.h>
#include <linux/kernfs.h>
#include <linux/jump_label.h>
#define IA32_L3_QOS_CFG 0xc81
#define IA32_L3_CBM_BASE 0xc90
#define IA32_L2_CBM_BASE 0xd10
#define IA32_MBA_THRTL_BASE 0xd50
#define L3_QOS_CDP_ENABLE 0x01ULL
/*
* Event IDs are used to program IA32_QM_EVTSEL before reading event
* counter from IA32_QM_CTR
*/
#define QOS_L3_OCCUP_EVENT_ID 0x01
#define QOS_L3_MBM_TOTAL_EVENT_ID 0x02
#define QOS_L3_MBM_LOCAL_EVENT_ID 0x03
#define CQM_LIMBOCHECK_INTERVAL 1000
#define MBM_CNTR_WIDTH 24
#define MBM_OVERFLOW_INTERVAL 1000
#define RMID_VAL_ERROR BIT_ULL(63)
#define RMID_VAL_UNAVAIL BIT_ULL(62)
DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
/**
* struct mon_evt - Entry in the event list of a resource
* @evtid: event id
* @name: name of the event
*/
struct mon_evt {
u32 evtid;
char *name;
struct list_head list;
};
/**
* struct mon_data_bits - Monitoring details for each event file
* @rid: Resource id associated with the event file.
* @evtid: Event id associated with the event file
* @domid: The domain to which the event file belongs
*/
union mon_data_bits {
void *priv;
struct {
unsigned int rid : 10;
unsigned int evtid : 8;
unsigned int domid : 14;
} u;
};
struct rmid_read {
struct rdtgroup *rgrp;
struct rdt_domain *d;
int evtid;
bool first;
u64 val;
};
extern unsigned int intel_cqm_threshold;
extern bool rdt_alloc_capable;
extern bool rdt_mon_capable;
extern unsigned int rdt_mon_features;
enum rdt_group_type {
RDTCTRL_GROUP = 0,
RDTMON_GROUP,
RDT_NUM_GROUP,
};
/**
* struct mongroup - store mon group's data in resctrl fs.
* @mon_data_kn kernlfs node for the mon_data directory
* @parent: parent rdtgrp
* @crdtgrp_list: child rdtgroup node list
* @rmid: rmid for this rdtgroup
*/
struct mongroup {
struct kernfs_node *mon_data_kn;
struct rdtgroup *parent;
struct list_head crdtgrp_list;
u32 rmid;
};
/**
* struct rdtgroup - store rdtgroup's data in resctrl file system.
* @kn: kernfs node
* @rdtgroup_list: linked list for all rdtgroups
* @closid: closid for this rdtgroup
* @cpu_mask: CPUs assigned to this rdtgroup
* @flags: status bits
* @waitcount: how many cpus expect to find this
* group when they acquire rdtgroup_mutex
* @type: indicates type of this rdtgroup - either
* monitor only or ctrl_mon group
* @mon: mongroup related data
*/
struct rdtgroup {
struct kernfs_node *kn;
struct list_head rdtgroup_list;
u32 closid;
struct cpumask cpu_mask;
int flags;
atomic_t waitcount;
enum rdt_group_type type;
struct mongroup mon;
};
/* rdtgroup.flags */
#define RDT_DELETED 1
/* rftype.flags */
#define RFTYPE_FLAGS_CPUS_LIST 1
/*
* Define the file type flags for base and info directories.
*/
#define RFTYPE_INFO BIT(0)
#define RFTYPE_BASE BIT(1)
#define RF_CTRLSHIFT 4
#define RF_MONSHIFT 5
#define RFTYPE_CTRL BIT(RF_CTRLSHIFT)
#define RFTYPE_MON BIT(RF_MONSHIFT)
#define RFTYPE_RES_CACHE BIT(8)
#define RFTYPE_RES_MB BIT(9)
#define RF_CTRL_INFO (RFTYPE_INFO | RFTYPE_CTRL)
#define RF_MON_INFO (RFTYPE_INFO | RFTYPE_MON)
#define RF_CTRL_BASE (RFTYPE_BASE | RFTYPE_CTRL)
/* List of all resource groups */
extern struct list_head rdt_all_groups;
extern int max_name_width, max_data_width;
int __init rdtgroup_init(void);
/**
* struct rftype - describe each file in the resctrl file system
* @name: File name
* @mode: Access mode
* @kf_ops: File operations
* @flags: File specific RFTYPE_FLAGS_* flags
* @fflags: File specific RF_* or RFTYPE_* flags
* @seq_show: Show content of the file
* @write: Write to the file
*/
struct rftype {
char *name;
umode_t mode;
struct kernfs_ops *kf_ops;
unsigned long flags;
unsigned long fflags;
int (*seq_show)(struct kernfs_open_file *of,
struct seq_file *sf, void *v);
/*
* write() is the generic write callback which maps directly to
* kernfs write operation and overrides all other operations.
* Maximum write size is determined by ->max_write_len.
*/
ssize_t (*write)(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
};
/**
* struct mbm_state - status for each MBM counter in each domain
* @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
* @prev_msr Value of IA32_QM_CTR for this RMID last time we read it
*/
struct mbm_state {
u64 chunks;
u64 prev_msr;
};
/**
* struct rdt_domain - group of cpus sharing an RDT resource
* @list: all instances of this resource
* @id: unique id for this instance
* @cpu_mask: which cpus share this resource
* @rmid_busy_llc:
* bitmap of which limbo RMIDs are above threshold
* @mbm_total: saved state for MBM total bandwidth
* @mbm_local: saved state for MBM local bandwidth
* @mbm_over: worker to periodically read MBM h/w counters
* @cqm_limbo: worker to periodically read CQM h/w counters
* @mbm_work_cpu:
* worker cpu for MBM h/w counters
* @cqm_work_cpu:
* worker cpu for CQM h/w counters
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @new_ctrl: new ctrl value to be loaded
* @have_new_ctrl: did user provide new_ctrl for this domain
*/
struct rdt_domain {
struct list_head list;
int id;
struct cpumask cpu_mask;
unsigned long *rmid_busy_llc;
struct mbm_state *mbm_total;
struct mbm_state *mbm_local;
struct delayed_work mbm_over;
struct delayed_work cqm_limbo;
int mbm_work_cpu;
int cqm_work_cpu;
u32 *ctrl_val;
u32 new_ctrl;
bool have_new_ctrl;
};
/**
* struct msr_param - set a range of MSRs from a domain
* @res: The resource to use
* @low: Beginning index from base MSR
* @high: End index
*/
struct msr_param {
struct rdt_resource *res;
int low;
int high;
};
/**
* struct rdt_cache - Cache allocation related data
* @cbm_len: Length of the cache bit mask
* @min_cbm_bits: Minimum number of consecutive bits to be set
* @cbm_idx_mult: Multiplier of CBM index
* @cbm_idx_offset: Offset of CBM index. CBM index is computed by:
* closid * cbm_idx_multi + cbm_idx_offset
* in a cache bit mask
* @shareable_bits: Bitmask of shareable resource with other
* executing entities
*/
struct rdt_cache {
unsigned int cbm_len;
unsigned int min_cbm_bits;
unsigned int cbm_idx_mult;
unsigned int cbm_idx_offset;
unsigned int shareable_bits;
};
/**
* struct rdt_membw - Memory bandwidth allocation related data
* @max_delay: Max throttle delay. Delay is the hardware
* representation for memory bandwidth.
* @min_bw: Minimum memory bandwidth percentage user can request
* @bw_gran: Granularity at which the memory bandwidth is allocated
* @delay_linear: True if memory B/W delay is in linear scale
* @mb_map: Mapping of memory B/W percentage to memory B/W delay
*/
struct rdt_membw {
u32 max_delay;
u32 min_bw;
u32 bw_gran;
u32 delay_linear;
u32 *mb_map;
};
static inline bool is_llc_occupancy_enabled(void)
{
return (rdt_mon_features & (1 << QOS_L3_OCCUP_EVENT_ID));
}
static inline bool is_mbm_total_enabled(void)
{
return (rdt_mon_features & (1 << QOS_L3_MBM_TOTAL_EVENT_ID));
}
static inline bool is_mbm_local_enabled(void)
{
return (rdt_mon_features & (1 << QOS_L3_MBM_LOCAL_EVENT_ID));
}
static inline bool is_mbm_enabled(void)
{
return (is_mbm_total_enabled() || is_mbm_local_enabled());
}
static inline bool is_mbm_event(int e)
{
return (e >= QOS_L3_MBM_TOTAL_EVENT_ID &&
e <= QOS_L3_MBM_LOCAL_EVENT_ID);
}
/**
* struct rdt_resource - attributes of an RDT resource
* @rid: The index of the resource
* @alloc_enabled: Is allocation enabled on this machine
* @mon_enabled: Is monitoring enabled for this feature
* @alloc_capable: Is allocation available on this machine
* @mon_capable: Is monitor feature available on this machine
* @name: Name to use in "schemata" file
* @num_closid: Number of CLOSIDs available
* @cache_level: Which cache level defines scope of this resource
* @default_ctrl: Specifies default cache cbm or memory B/W percent.
* @msr_base: Base MSR address for CBMs
* @msr_update: Function pointer to update QOS MSRs
* @data_width: Character width of data when displaying
* @domains: All domains for this resource
* @cache: Cache allocation related data
* @format_str: Per resource format string to show domain value
* @parse_ctrlval: Per resource function pointer to parse control values
* @evt_list: List of monitoring events
* @num_rmid: Number of RMIDs available
* @mon_scale: cqm counter * mon_scale = occupancy in bytes
* @fflags: flags to choose base and info files
*/
struct rdt_resource {
int rid;
bool alloc_enabled;
bool mon_enabled;
bool alloc_capable;
bool mon_capable;
char *name;
int num_closid;
int cache_level;
u32 default_ctrl;
unsigned int msr_base;
void (*msr_update) (struct rdt_domain *d, struct msr_param *m,
struct rdt_resource *r);
int data_width;
struct list_head domains;
struct rdt_cache cache;
struct rdt_membw membw;
const char *format_str;
int (*parse_ctrlval) (char *buf, struct rdt_resource *r,
struct rdt_domain *d);
struct list_head evt_list;
int num_rmid;
unsigned int mon_scale;
unsigned long fflags;
};
int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d);
int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d);
extern struct mutex rdtgroup_mutex;
extern struct rdt_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
int __init rdtgroup_init(void);
enum {
RDT_RESOURCE_L3,
RDT_RESOURCE_L3DATA,
RDT_RESOURCE_L3CODE,
RDT_RESOURCE_L2,
RDT_RESOURCE_MBA,
/* Must be the last */
RDT_NUM_RESOURCES,
};
#define for_each_capable_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
if (r->alloc_capable || r->mon_capable)
#define for_each_alloc_capable_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
if (r->alloc_capable)
#define for_each_mon_capable_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
if (r->mon_capable)
#define for_each_alloc_enabled_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
if (r->alloc_enabled)
#define for_each_mon_enabled_rdt_resource(r) \
for (r = rdt_resources_all; r < rdt_resources_all + RDT_NUM_RESOURCES;\
r++) \
if (r->mon_enabled)
/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
union cpuid_0x10_1_eax {
struct {
unsigned int cbm_len:5;
} split;
unsigned int full;
};
/* CPUID.(EAX=10H, ECX=ResID=3).EAX */
union cpuid_0x10_3_eax {
struct {
unsigned int max_delay:12;
} split;
unsigned int full;
};
/* CPUID.(EAX=10H, ECX=ResID).EDX */
union cpuid_0x10_x_edx {
struct {
unsigned int cos_max:16;
} split;
unsigned int full;
};
void rdt_ctrl_update(void *arg);
struct rdtgroup *rdtgroup_kn_lock_live(struct kernfs_node *kn);
void rdtgroup_kn_unlock(struct kernfs_node *kn);
struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
struct list_head **pos);
ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int rdtgroup_schemata_show(struct kernfs_open_file *of,
struct seq_file *s, void *v);
struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r);
int alloc_rmid(void);
void free_rmid(u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
void mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void rmdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
unsigned int dom_id);
void mkdir_mondata_subdir_allrdtgrp(struct rdt_resource *r,
struct rdt_domain *d);
void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
struct rdtgroup *rdtgrp, int evtid, int first);
void mbm_setup_overflow_handler(struct rdt_domain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
void __check_limbo(struct rdt_domain *d, bool force_free);
#endif /* _ASM_X86_INTEL_RDT_H */

View File

@ -26,7 +26,7 @@
#include <linux/kernfs.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <asm/intel_rdt.h>
#include "intel_rdt.h"
/*
* Check whether MBA bandwidth percentage value is correct. The value is
@ -192,7 +192,7 @@ static int rdtgroup_parse_resource(char *resname, char *tok, int closid)
{
struct rdt_resource *r;
for_each_enabled_rdt_resource(r) {
for_each_alloc_enabled_rdt_resource(r) {
if (!strcmp(resname, r->name) && closid < r->num_closid)
return parse_line(tok, r);
}
@ -221,7 +221,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
closid = rdtgrp->closid;
for_each_enabled_rdt_resource(r) {
for_each_alloc_enabled_rdt_resource(r) {
list_for_each_entry(dom, &r->domains, list)
dom->have_new_ctrl = false;
}
@ -237,7 +237,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of,
goto out;
}
for_each_enabled_rdt_resource(r) {
for_each_alloc_enabled_rdt_resource(r) {
ret = update_domains(r, closid);
if (ret)
goto out;
@ -269,12 +269,13 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
{
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
int closid, ret = 0;
int ret = 0;
u32 closid;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
if (rdtgrp) {
closid = rdtgrp->closid;
for_each_enabled_rdt_resource(r) {
for_each_alloc_enabled_rdt_resource(r) {
if (closid < r->num_closid)
show_doms(s, r, closid);
}
@ -284,3 +285,57 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of,
rdtgroup_kn_unlock(of->kn);
return ret;
}
void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
struct rdtgroup *rdtgrp, int evtid, int first)
{
/*
* setup the parameters to send to the IPI to read the data.
*/
rr->rgrp = rdtgrp;
rr->evtid = evtid;
rr->d = d;
rr->val = 0;
rr->first = first;
smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
}
int rdtgroup_mondata_show(struct seq_file *m, void *arg)
{
struct kernfs_open_file *of = m->private;
u32 resid, evtid, domid;
struct rdtgroup *rdtgrp;
struct rdt_resource *r;
union mon_data_bits md;
struct rdt_domain *d;
struct rmid_read rr;
int ret = 0;
rdtgrp = rdtgroup_kn_lock_live(of->kn);
md.priv = of->kn->priv;
resid = md.u.rid;
domid = md.u.domid;
evtid = md.u.evtid;
r = &rdt_resources_all[resid];
d = rdt_find_domain(r, domid, NULL);
if (!d) {
ret = -ENOENT;
goto out;
}
mon_event_read(&rr, d, rdtgrp, evtid, false);
if (rr.val & RMID_VAL_ERROR)
seq_puts(m, "Error\n");
else if (rr.val & RMID_VAL_UNAVAIL)
seq_puts(m, "Unavailable\n");
else
seq_printf(m, "%llu\n", rr.val * r->mon_scale);
out:
rdtgroup_kn_unlock(of->kn);
return ret;
}

View File

@ -0,0 +1,499 @@
/*
* Resource Director Technology(RDT)
* - Monitoring code
*
* Copyright (C) 2017 Intel Corporation
*
* Author:
* Vikas Shivappa <vikas.shivappa@intel.com>
*
* This replaces the cqm.c based on perf but we reuse a lot of
* code and datastructures originally from Peter Zijlstra and Matt Fleming.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*
* More information about RDT be found in the Intel (R) x86 Architecture
* Software Developer Manual June 2016, volume 3, section 17.17.
*/
#include <linux/module.h>
#include <linux/slab.h>
#include <asm/cpu_device_id.h>
#include "intel_rdt.h"
#define MSR_IA32_QM_CTR 0x0c8e
#define MSR_IA32_QM_EVTSEL 0x0c8d
struct rmid_entry {
u32 rmid;
int busy;
struct list_head list;
};
/**
* @rmid_free_lru A least recently used list of free RMIDs
* These RMIDs are guaranteed to have an occupancy less than the
* threshold occupancy
*/
static LIST_HEAD(rmid_free_lru);
/**
* @rmid_limbo_count count of currently unused but (potentially)
* dirty RMIDs.
* This counts RMIDs that no one is currently using but that
* may have a occupancy value > intel_cqm_threshold. User can change
* the threshold occupancy value.
*/
unsigned int rmid_limbo_count;
/**
* @rmid_entry - The entry in the limbo and free lists.
*/
static struct rmid_entry *rmid_ptrs;
/*
* Global boolean for rdt_monitor which is true if any
* resource monitoring is enabled.
*/
bool rdt_mon_capable;
/*
* Global to indicate which monitoring events are enabled.
*/
unsigned int rdt_mon_features;
/*
* This is the threshold cache occupancy at which we will consider an
* RMID available for re-allocation.
*/
unsigned int intel_cqm_threshold;
static inline struct rmid_entry *__rmid_entry(u32 rmid)
{
struct rmid_entry *entry;
entry = &rmid_ptrs[rmid];
WARN_ON(entry->rmid != rmid);
return entry;
}
static u64 __rmid_read(u32 rmid, u32 eventid)
{
u64 val;
/*
* As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
* with a valid event code for supported resource type and the bits
* IA32_QM_EVTSEL.RMID (bits 41:32) are configured with valid RMID,
* IA32_QM_CTR.data (bits 61:0) reports the monitored data.
* IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
* are error bits.
*/
wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
rdmsrl(MSR_IA32_QM_CTR, val);
return val;
}
static bool rmid_dirty(struct rmid_entry *entry)
{
u64 val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID);
return val >= intel_cqm_threshold;
}
/*
* Check the RMIDs that are marked as busy for this domain. If the
* reported LLC occupancy is below the threshold clear the busy bit and
* decrement the count. If the busy count gets to zero on an RMID, we
* free the RMID
*/
void __check_limbo(struct rdt_domain *d, bool force_free)
{
struct rmid_entry *entry;
struct rdt_resource *r;
u32 crmid = 1, nrmid;
r = &rdt_resources_all[RDT_RESOURCE_L3];
/*
* Skip RMID 0 and start from RMID 1 and check all the RMIDs that
* are marked as busy for occupancy < threshold. If the occupancy
* is less than the threshold decrement the busy counter of the
* RMID and move it to the free list when the counter reaches 0.
*/
for (;;) {
nrmid = find_next_bit(d->rmid_busy_llc, r->num_rmid, crmid);
if (nrmid >= r->num_rmid)
break;
entry = __rmid_entry(nrmid);
if (force_free || !rmid_dirty(entry)) {
clear_bit(entry->rmid, d->rmid_busy_llc);
if (!--entry->busy) {
rmid_limbo_count--;
list_add_tail(&entry->list, &rmid_free_lru);
}
}
crmid = nrmid + 1;
}
}
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d)
{
return find_first_bit(d->rmid_busy_llc, r->num_rmid) != r->num_rmid;
}
/*
* As of now the RMIDs allocation is global.
* However we keep track of which packages the RMIDs
* are used to optimize the limbo list management.
*/
int alloc_rmid(void)
{
struct rmid_entry *entry;
lockdep_assert_held(&rdtgroup_mutex);
if (list_empty(&rmid_free_lru))
return rmid_limbo_count ? -EBUSY : -ENOSPC;
entry = list_first_entry(&rmid_free_lru,
struct rmid_entry, list);
list_del(&entry->list);
return entry->rmid;
}
static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r;
struct rdt_domain *d;
int cpu;
u64 val;
r = &rdt_resources_all[RDT_RESOURCE_L3];
entry->busy = 0;
cpu = get_cpu();
list_for_each_entry(d, &r->domains, list) {
if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
val = __rmid_read(entry->rmid, QOS_L3_OCCUP_EVENT_ID);
if (val <= intel_cqm_threshold)
continue;
}
/*
* For the first limbo RMID in the domain,
* setup up the limbo worker.
*/
if (!has_busy_rmid(r, d))
cqm_setup_limbo_handler(d, CQM_LIMBOCHECK_INTERVAL);
set_bit(entry->rmid, d->rmid_busy_llc);
entry->busy++;
}
put_cpu();
if (entry->busy)
rmid_limbo_count++;
else
list_add_tail(&entry->list, &rmid_free_lru);
}
void free_rmid(u32 rmid)
{
struct rmid_entry *entry;
if (!rmid)
return;
lockdep_assert_held(&rdtgroup_mutex);
entry = __rmid_entry(rmid);
if (is_llc_occupancy_enabled())
add_rmid_to_limbo(entry);
else
list_add_tail(&entry->list, &rmid_free_lru);
}
static int __mon_event_count(u32 rmid, struct rmid_read *rr)
{
u64 chunks, shift, tval;
struct mbm_state *m;
tval = __rmid_read(rmid, rr->evtid);
if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL)) {
rr->val = tval;
return -EINVAL;
}
switch (rr->evtid) {
case QOS_L3_OCCUP_EVENT_ID:
rr->val += tval;
return 0;
case QOS_L3_MBM_TOTAL_EVENT_ID:
m = &rr->d->mbm_total[rmid];
break;
case QOS_L3_MBM_LOCAL_EVENT_ID:
m = &rr->d->mbm_local[rmid];
break;
default:
/*
* Code would never reach here because
* an invalid event id would fail the __rmid_read.
*/
return -EINVAL;
}
if (rr->first) {
m->prev_msr = tval;
m->chunks = 0;
return 0;
}
shift = 64 - MBM_CNTR_WIDTH;
chunks = (tval << shift) - (m->prev_msr << shift);
chunks >>= shift;
m->chunks += chunks;
m->prev_msr = tval;
rr->val += m->chunks;
return 0;
}
/*
* This is called via IPI to read the CQM/MBM counters
* on a domain.
*/
void mon_event_count(void *info)
{
struct rdtgroup *rdtgrp, *entry;
struct rmid_read *rr = info;
struct list_head *head;
rdtgrp = rr->rgrp;
if (__mon_event_count(rdtgrp->mon.rmid, rr))
return;
/*
* For Ctrl groups read data from child monitor groups.
*/
head = &rdtgrp->mon.crdtgrp_list;
if (rdtgrp->type == RDTCTRL_GROUP) {
list_for_each_entry(entry, head, mon.crdtgrp_list) {
if (__mon_event_count(entry->mon.rmid, rr))
return;
}
}
}
static void mbm_update(struct rdt_domain *d, int rmid)
{
struct rmid_read rr;
rr.first = false;
rr.d = d;
/*
* This is protected from concurrent reads from user
* as both the user and we hold the global mutex.
*/
if (is_mbm_total_enabled()) {
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
__mon_event_count(rmid, &rr);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
__mon_event_count(rmid, &rr);
}
}
/*
* Handler to scan the limbo list and move the RMIDs
* to free list whose occupancy < threshold_occupancy.
*/
void cqm_handle_limbo(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(CQM_LIMBOCHECK_INTERVAL);
int cpu = smp_processor_id();
struct rdt_resource *r;
struct rdt_domain *d;
mutex_lock(&rdtgroup_mutex);
r = &rdt_resources_all[RDT_RESOURCE_L3];
d = get_domain_from_cpu(cpu, r);
if (!d) {
pr_warn_once("Failure to get domain for limbo worker\n");
goto out_unlock;
}
__check_limbo(d, false);
if (has_busy_rmid(r, d))
schedule_delayed_work_on(cpu, &d->cqm_limbo, delay);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
struct rdt_resource *r;
int cpu;
r = &rdt_resources_all[RDT_RESOURCE_L3];
cpu = cpumask_any(&dom->cpu_mask);
dom->cqm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->cqm_limbo, delay);
}
void mbm_handle_overflow(struct work_struct *work)
{
unsigned long delay = msecs_to_jiffies(MBM_OVERFLOW_INTERVAL);
struct rdtgroup *prgrp, *crgrp;
int cpu = smp_processor_id();
struct list_head *head;
struct rdt_domain *d;
mutex_lock(&rdtgroup_mutex);
if (!static_branch_likely(&rdt_enable_key))
goto out_unlock;
d = get_domain_from_cpu(cpu, &rdt_resources_all[RDT_RESOURCE_L3]);
if (!d)
goto out_unlock;
list_for_each_entry(prgrp, &rdt_all_groups, rdtgroup_list) {
mbm_update(d, prgrp->mon.rmid);
head = &prgrp->mon.crdtgrp_list;
list_for_each_entry(crgrp, head, mon.crdtgrp_list)
mbm_update(d, crgrp->mon.rmid);
}
schedule_delayed_work_on(cpu, &d->mbm_over, delay);
out_unlock:
mutex_unlock(&rdtgroup_mutex);
}
void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms)
{
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;
if (!static_branch_likely(&rdt_enable_key))
return;
cpu = cpumask_any(&dom->cpu_mask);
dom->mbm_work_cpu = cpu;
schedule_delayed_work_on(cpu, &dom->mbm_over, delay);
}
static int dom_data_init(struct rdt_resource *r)
{
struct rmid_entry *entry = NULL;
int i, nr_rmids;
nr_rmids = r->num_rmid;
rmid_ptrs = kcalloc(nr_rmids, sizeof(struct rmid_entry), GFP_KERNEL);
if (!rmid_ptrs)
return -ENOMEM;
for (i = 0; i < nr_rmids; i++) {
entry = &rmid_ptrs[i];
INIT_LIST_HEAD(&entry->list);
entry->rmid = i;
list_add_tail(&entry->list, &rmid_free_lru);
}
/*
* RMID 0 is special and is always allocated. It's used for all
* tasks that are not monitored.
*/
entry = __rmid_entry(0);
list_del(&entry->list);
return 0;
}
static struct mon_evt llc_occupancy_event = {
.name = "llc_occupancy",
.evtid = QOS_L3_OCCUP_EVENT_ID,
};
static struct mon_evt mbm_total_event = {
.name = "mbm_total_bytes",
.evtid = QOS_L3_MBM_TOTAL_EVENT_ID,
};
static struct mon_evt mbm_local_event = {
.name = "mbm_local_bytes",
.evtid = QOS_L3_MBM_LOCAL_EVENT_ID,
};
/*
* Initialize the event list for the resource.
*
* Note that MBM events are also part of RDT_RESOURCE_L3 resource
* because as per the SDM the total and local memory bandwidth
* are enumerated as part of L3 monitoring.
*/
static void l3_mon_evt_init(struct rdt_resource *r)
{
INIT_LIST_HEAD(&r->evt_list);
if (is_llc_occupancy_enabled())
list_add_tail(&llc_occupancy_event.list, &r->evt_list);
if (is_mbm_total_enabled())
list_add_tail(&mbm_total_event.list, &r->evt_list);
if (is_mbm_local_enabled())
list_add_tail(&mbm_local_event.list, &r->evt_list);
}
int rdt_get_mon_l3_config(struct rdt_resource *r)
{
int ret;
r->mon_scale = boot_cpu_data.x86_cache_occ_scale;
r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
/*
* A reasonable upper limit on the max threshold is the number
* of lines tagged per RMID if all RMIDs have the same number of
* lines tagged in the LLC.
*
* For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
*/
intel_cqm_threshold = boot_cpu_data.x86_cache_size * 1024 / r->num_rmid;
/* h/w works in units of "boot_cpu_data.x86_cache_occ_scale" */
intel_cqm_threshold /= r->mon_scale;
ret = dom_data_init(r);
if (ret)
return ret;
l3_mon_evt_init(r);
r->mon_capable = true;
r->mon_enabled = true;
return 0;
}

File diff suppressed because it is too large Load Diff

View File

@ -56,7 +56,7 @@
#include <asm/debugreg.h>
#include <asm/switch_to.h>
#include <asm/vm86.h>
#include <asm/intel_rdt.h>
#include <asm/intel_rdt_sched.h>
#include <asm/proto.h>
void __show_regs(struct pt_regs *regs, int all)

View File

@ -52,7 +52,7 @@
#include <asm/switch_to.h>
#include <asm/xen/hypervisor.h>
#include <asm/vdso.h>
#include <asm/intel_rdt.h>
#include <asm/intel_rdt_sched.h>
#include <asm/unistd.h>
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */

View File

@ -139,14 +139,6 @@ struct hw_perf_event {
/* for tp_event->class */
struct list_head tp_list;
};
struct { /* intel_cqm */
int cqm_state;
u32 cqm_rmid;
int is_group_event;
struct list_head cqm_events_entry;
struct list_head cqm_groups_entry;
struct list_head cqm_group_entry;
};
struct { /* amd_power */
u64 pwr_acc;
u64 ptsc;
@ -413,11 +405,6 @@ struct pmu {
size_t task_ctx_size;
/*
* Return the count value for a counter.
*/
u64 (*count) (struct perf_event *event); /*optional*/
/*
* Set up pmu-private data structures for an AUX area
*/
@ -1112,11 +1099,6 @@ static inline void perf_event_task_sched_out(struct task_struct *prev,
__perf_event_task_sched_out(prev, next);
}
static inline u64 __perf_event_count(struct perf_event *event)
{
return local64_read(&event->count) + atomic64_read(&event->child_count);
}
extern void perf_event_mmap(struct vm_area_struct *vma);
extern struct perf_guest_info_callbacks *perf_guest_cbs;
extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);

View File

@ -909,8 +909,9 @@ struct task_struct {
/* cg_list protected by css_set_lock and tsk->alloc_lock: */
struct list_head cg_list;
#endif
#ifdef CONFIG_INTEL_RDT_A
int closid;
#ifdef CONFIG_INTEL_RDT
u32 closid;
u32 rmid;
#endif
#ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;

View File

@ -3673,10 +3673,7 @@ static void __perf_event_read(void *info)
static inline u64 perf_event_count(struct perf_event *event)
{
if (event->pmu->count)
return event->pmu->count(event);
return __perf_event_count(event);
return local64_read(&event->count) + atomic64_read(&event->child_count);
}
/*
@ -3707,15 +3704,6 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
goto out;
}
/*
* It must not have a pmu::count method, those are not
* NMI safe.
*/
if (event->pmu->count) {
ret = -EOPNOTSUPP;
goto out;
}
/* If this is a per-task event, it must be for current */
if ((event->attach_state & PERF_ATTACH_TASK) &&
event->hw.target != current) {