Skip to content

Commit

Permalink
Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/li…
Browse files Browse the repository at this point in the history
…nux/kernel/git/tip/tip

Pull x86 cache resource controller updates from Thomas Gleixner:
 "An update for the Intel Resource Director Technolgy (RDT) which adds a
  feedback driven software controller to runtime adjust the bandwidth
  allocation MSRs.

  This makes the allocations more accurate and allows to use bandwidth
  values in understandable units (MB/s) instead of using percentage
  based allocations as the original, still available, interface.

  The software controller can be enabled with a new mount option for the
  resctrl filesystem"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/intel_rdt/mba_sc: Feedback loop to dynamically update mem bandwidth
  x86/intel_rdt/mba_sc: Prepare for feedback loop
  x86/intel_rdt/mba_sc: Add schemata support
  x86/intel_rdt/mba_sc: Add initialization support
  x86/intel_rdt/mba_sc: Enable/disable MBA software controller
  x86/intel_rdt/mba_sc: Documentation for MBA software controller(mba_sc)
  • Loading branch information
Linus Torvalds committed Jun 5, 2018
2 parents ba252f1 + de73f38 commit ab20fd0
Show file tree
Hide file tree
Showing 6 changed files with 337 additions and 33 deletions.
75 changes: 67 additions & 8 deletions Documentation/x86/intel_rdt_ui.txt
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,14 @@ MBA (Memory Bandwidth Allocation) - "mba"

To use the feature mount the file system:

# mount -t resctrl resctrl [-o cdp[,cdpl2]] /sys/fs/resctrl
# mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.
"cdpl2": Enable code/data prioritization in L2 cache allocations.
"mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA
bandwidth in MBps

L2 and L3 CDP are controlled seperately.

Expand Down Expand Up @@ -270,10 +272,11 @@ and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

Memory bandwidth(b/w) percentage
--------------------------------
For Memory b/w resource, user controls the resource by indicating the
percentage of total memory b/w.
Memory bandwidth Allocation and monitoring
------------------------------------------

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
Expand All @@ -285,7 +288,47 @@ to the next control step available on the hardware.
The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
low bandwidth. The fact that Memory bandwidth allocation(MBA) is a core
specific mechanism where as memory bandwidth monitoring(MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external is 10GBps (hence aggregate L2 external bandwidth is
240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
bandwidth of 100GBps although the percentage value specified is only 50%
<< 100%. Hence increasing the bandwidth percentage will not yeild any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although
they have same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjust the memowy bandwidth percentages to ensure

"actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
where as user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
Expand All @@ -308,13 +351,20 @@ schemata format is always:

L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

Memory b/w Allocation details
-----------------------------
Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.

MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.

MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
Expand Down Expand Up @@ -358,6 +408,15 @@ allocations can overlap or not. The allocations specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB
rather than the percentage values.

# echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
of 1024MB where as on socket 1 they would use 500MB.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.
Expand Down
50 changes: 37 additions & 13 deletions arch/x86/kernel/cpu/intel_rdt.c
Original file line number Diff line number Diff line change
Expand Up @@ -33,8 +33,8 @@
#include <asm/intel_rdt_sched.h>
#include "intel_rdt.h"

#define MAX_MBA_BW 100u
#define MBA_IS_LINEAR 0x4
#define MBA_MAX_MBPS U32_MAX

/* Mutex to protect rdtgroup access. */
DEFINE_MUTEX(rdtgroup_mutex);
Expand Down Expand Up @@ -178,7 +178,7 @@ struct rdt_resource rdt_resources_all[] = {
.msr_update = mba_wrmsr,
.cache_level = 3,
.parse_ctrlval = parse_bw,
.format_str = "%d=%*d",
.format_str = "%d=%*u",
.fflags = RFTYPE_RES_MB,
},
};
Expand Down Expand Up @@ -230,6 +230,14 @@ static inline void cache_alloc_hsw_probe(void)
rdt_alloc_capable = true;
}

bool is_mba_sc(struct rdt_resource *r)
{
if (!r)
return rdt_resources_all[RDT_RESOURCE_MBA].membw.mba_sc;

return r->membw.mba_sc;
}

/*
* rdt_get_mb_table() - get a mapping of bandwidth(b/w) percentage values
* exposed to user interface and the h/w understandable delay values.
Expand Down Expand Up @@ -341,7 +349,7 @@ static int get_cache_id(int cpu, int level)
* that can be written to QOS_MSRs.
* There are currently no SKUs which support non linear delay values.
*/
static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
{
if (r->membw.delay_linear)
return MAX_MBA_BW - bw;
Expand Down Expand Up @@ -431,25 +439,40 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
return NULL;
}

void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
{
int i;

/*
* Initialize the Control MSRs to having no control.
* For Cache Allocation: Set all bits in cbm
* For Memory Allocation: Set b/w requested to 100%
* and the bandwidth in MBps to U32_MAX
*/
for (i = 0; i < r->num_closid; i++, dc++, dm++) {
*dc = r->default_ctrl;
*dm = MBA_MAX_MBPS;
}
}

static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
{
struct msr_param m;
u32 *dc;
int i;
u32 *dc, *dm;

dc = kmalloc_array(r->num_closid, sizeof(*d->ctrl_val), GFP_KERNEL);
if (!dc)
return -ENOMEM;

d->ctrl_val = dc;
dm = kmalloc_array(r->num_closid, sizeof(*d->mbps_val), GFP_KERNEL);
if (!dm) {
kfree(dc);
return -ENOMEM;
}

/*
* Initialize the Control MSRs to having no control.
* For Cache Allocation: Set all bits in cbm
* For Memory Allocation: Set b/w requested to 100
*/
for (i = 0; i < r->num_closid; i++, dc++)
*dc = r->default_ctrl;
d->ctrl_val = dc;
d->mbps_val = dm;
setup_default_ctrlval(r, dc, dm);

m.low = 0;
m.high = r->num_closid;
Expand Down Expand Up @@ -588,6 +611,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
}

kfree(d->ctrl_val);
kfree(d->mbps_val);
kfree(d->rmid_busy_llc);
kfree(d->mbm_total);
kfree(d->mbm_local);
Expand Down
18 changes: 18 additions & 0 deletions arch/x86/kernel/cpu/intel_rdt.h
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@

#define MBM_CNTR_WIDTH 24
#define MBM_OVERFLOW_INTERVAL 1000
#define MAX_MBA_BW 100u

#define RMID_VAL_ERROR BIT_ULL(63)
#define RMID_VAL_UNAVAIL BIT_ULL(62)
Expand Down Expand Up @@ -180,10 +181,20 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
* @prev_msr Value of IA32_QM_CTR for this RMID last time we read it
* @chunks_bw Total local data moved. Used for bandwidth calculation
* @prev_bw_msr:Value of previous IA32_QM_CTR for bandwidth counting
* @prev_bw The most recent bandwidth in MBps
* @delta_bw Difference between the current and previous bandwidth
* @delta_comp Indicates whether to compute the delta_bw
*/
struct mbm_state {
u64 chunks;
u64 prev_msr;
u64 chunks_bw;
u64 prev_bw_msr;
u32 prev_bw;
u32 delta_bw;
bool delta_comp;
};

/**
Expand All @@ -202,6 +213,7 @@ struct mbm_state {
* @cqm_work_cpu:
* worker cpu for CQM h/w counters
* @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
* @mbps_val: When mba_sc is enabled, this holds the bandwidth in MBps
* @new_ctrl: new ctrl value to be loaded
* @have_new_ctrl: did user provide new_ctrl for this domain
*/
Expand All @@ -217,6 +229,7 @@ struct rdt_domain {
int mbm_work_cpu;
int cqm_work_cpu;
u32 *ctrl_val;
u32 *mbps_val;
u32 new_ctrl;
bool have_new_ctrl;
};
Expand Down Expand Up @@ -259,13 +272,15 @@ struct rdt_cache {
* @min_bw: Minimum memory bandwidth percentage user can request
* @bw_gran: Granularity at which the memory bandwidth is allocated
* @delay_linear: True if memory B/W delay is in linear scale
* @mba_sc: True if MBA software controller(mba_sc) is enabled
* @mb_map: Mapping of memory B/W percentage to memory B/W delay
*/
struct rdt_membw {
u32 max_delay;
u32 min_bw;
u32 bw_gran;
u32 delay_linear;
bool mba_sc;
u32 *mb_map;
};

Expand Down Expand Up @@ -445,6 +460,9 @@ void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
void mbm_setup_overflow_handler(struct rdt_domain *dom,
unsigned long delay_ms);
void mbm_handle_overflow(struct work_struct *work);
bool is_mba_sc(struct rdt_resource *r);
void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm);
u32 delay_bw_map(unsigned long bw, struct rdt_resource *r);
void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
void cqm_handle_limbo(struct work_struct *work);
bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
Expand Down
24 changes: 19 additions & 5 deletions arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,8 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
return false;
}

if (bw < r->membw.min_bw || bw > r->default_ctrl) {
if ((bw < r->membw.min_bw || bw > r->default_ctrl) &&
!is_mba_sc(r)) {
rdt_last_cmd_printf("MB value %ld out of range [%d,%d]\n", bw,
r->membw.min_bw, r->default_ctrl);
return false;
Expand Down Expand Up @@ -179,6 +180,8 @@ static int update_domains(struct rdt_resource *r, int closid)
struct msr_param msr_param;
cpumask_var_t cpu_mask;
struct rdt_domain *d;
bool mba_sc;
u32 *dc;
int cpu;

if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
Expand All @@ -188,13 +191,20 @@ static int update_domains(struct rdt_resource *r, int closid)
msr_param.high = msr_param.low + 1;
msr_param.res = r;

mba_sc = is_mba_sc(r);
list_for_each_entry(d, &r->domains, list) {
if (d->have_new_ctrl && d->new_ctrl != d->ctrl_val[closid]) {
dc = !mba_sc ? d->ctrl_val : d->mbps_val;
if (d->have_new_ctrl && d->new_ctrl != dc[closid]) {
cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
d->ctrl_val[closid] = d->new_ctrl;
dc[closid] = d->new_ctrl;
}
}
if (cpumask_empty(cpu_mask))

/*
* Avoid writing the control msr with control values when
* MBA software controller is enabled
*/
if (cpumask_empty(cpu_mask) || mba_sc)
goto done;
cpu = get_cpu();
/* Update CBM on this cpu if it's in cpu_mask. */
Expand Down Expand Up @@ -282,13 +292,17 @@ static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid)
{
struct rdt_domain *dom;
bool sep = false;
u32 ctrl_val;

seq_printf(s, "%*s:", max_name_width, r->name);
list_for_each_entry(dom, &r->domains, list) {
if (sep)
seq_puts(s, ";");

ctrl_val = (!is_mba_sc(r) ? dom->ctrl_val[closid] :
dom->mbps_val[closid]);
seq_printf(s, r->format_str, dom->id, max_data_width,
dom->ctrl_val[closid]);
ctrl_val);
sep = true;
}
seq_puts(s, "\n");
Expand Down
Loading

0 comments on commit ab20fd0

Please sign in to comment.