Merge branch 'pm-em'
Merge Energy Model changes for 6.9-rc1:

 - Allow the Energy Model to be updated dynamically (Lukasz Luba).

* pm-em: (24 commits)
  PM: EM: Fix nr_states warnings in static checks
  Documentation: EM: Update with runtime modification design
  PM: EM: Add em_dev_compute_costs()
  PM: EM: Remove old table
  PM: EM: Change debugfs configuration to use runtime EM table data
  drivers/thermal/devfreq_cooling: Use new Energy Model interface
  drivers/thermal/cpufreq_cooling: Use new Energy Model interface
  powercap/dtpm_devfreq: Use new Energy Model interface to get table
  powercap/dtpm_cpu: Use new Energy Model interface to get table
  PM: EM: Optimize em_cpu_energy() and remove division
  PM: EM: Support late CPUs booting and capacity adjustment
  PM: EM: Add performance field to struct em_perf_state and optimize
  PM: EM: Add em_perf_state_from_pd() to get performance states table
  PM: EM: Introduce em_dev_update_perf_domain() for EM updates
  PM: EM: Add functions for memory allocations for new EM tables
  PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  PM: EM: Introduce runtime modifiable table
  PM: EM: Split the allocation and initialization of the EM table
  PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  PM: EM: Introduce em_compute_costs()
  ...
Rafael J. Wysocki committed Mar 11, 2024
2 parents c907ab5 + 3a561ea commit 3bd8346
Showing 7 changed files with 821 additions and 170 deletions.
183 changes: 179 additions & 4 deletions Documentation/power/energy-model.rst
@@ -71,6 +71,31 @@ whose performance is scaled together. Performance domains generally have a
required to have the same micro-architecture. CPUs in different performance
domains can have different micro-architectures.

To better reflect power variation due to static power (leakage) the EM
supports runtime modifications of the power values. The mechanism relies on
RCU to free the modifiable EM perf_state table memory. Its user, the task
scheduler, also uses RCU to access this memory. The EM framework provides
API for allocating/freeing the new memory for the modifiable EM table.
The old memory is freed automatically using RCU callback mechanism when there
are no owners anymore for the given EM runtime table instance. This is tracked
using kref mechanism. The device driver which provided the new EM at runtime,
should call EM API to free it safely when it's no longer needed. The EM
framework will handle the clean-up when it's possible.

The kernel code which wants to modify the EM values is protected from concurrent
access using a mutex. Therefore, the device driver code must run in sleeping
context when it tries to modify the EM.

With the runtime modifiable EM we switch from a 'single and during the entire
runtime static EM' (system property) design to a 'single EM which can be
changed during runtime according e.g. to the workload' (system and workload
property) design.

It is possible also to modify the CPU performance values for each EM's
performance state. Thus, the full power and performance profile (which
is an exponential curve) can be changed according e.g. to the workload
or system property.


2. Core APIs
------------
@@ -175,10 +200,82 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
not provided for other type of devices.

More details about the above APIs can be found in ``<linux/energy_model.h>``
or in Section 2.4
or in Section 2.5


2.4 Runtime modifications
^^^^^^^^^^^^^^^^^^^^^^^^^

Drivers willing to update the EM at runtime should use the following dedicated
function to allocate a new instance of the modified EM. The API is listed
below::

struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);

This allocates a structure which contains the new EM table together with the
RCU and kref fields needed by the EM framework. The 'struct em_perf_table'
contains array 'struct em_perf_state state[]' which is a list of performance
states in ascending order. That list must be populated by the device driver
which wants to update the EM. The list of frequencies can be taken from
existing EM (created during boot). The content in the 'struct em_perf_state'
must be populated by the driver as well.

This is the API which does the EM update, using RCU pointers swap::

int em_dev_update_perf_domain(struct device *dev,
struct em_perf_table __rcu *new_table);

Drivers must provide a pointer to the allocated and initialized new EM
'struct em_perf_table'. That new EM will be safely used inside the EM framework
and will be visible to other sub-systems in the kernel (thermal, powercap).
The main design goal for this API is to be fast and avoid extra calculations
or memory allocations at runtime. When pre-computed EMs are available in the
device driver, then it should be possible to simply re-use them with low
performance overhead.

In order to free the EM, provided earlier by the driver (e.g. when the module
is unloaded), there is a need to call the API::

void em_table_free(struct em_perf_table __rcu *table);

It will allow the EM framework to safely remove the memory, when there is
no other sub-system using it, e.g. EAS.
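
The ownership rules described above (one reference held by the driver, one by
the framework, memory released when the last owner lets go) can be illustrated
with a small userspace model. This is only a sketch under stated assumptions:
every ``mock_*`` name is hypothetical, a plain integer stands in for the kref,
and the deferred RCU free is replaced by an immediate ``free()``.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel types. */
struct mock_perf_state {
	unsigned long frequency;
	unsigned long power;
};

struct mock_em_table {
	int refcount;			/* stands in for the kref */
	int nr_states;
	struct mock_perf_state state[];
};

struct mock_perf_domain {
	struct mock_em_table *em_table;	/* stands in for the RCU pointer */
};

static int mock_tables_freed;		/* demo-only counter of released tables */

/* Models em_table_alloc(): the caller owns the initial reference. */
static struct mock_em_table *mock_table_alloc(int nr_states)
{
	struct mock_em_table *t;

	t = calloc(1, sizeof(*t) + (size_t)nr_states * sizeof(t->state[0]));
	if (!t)
		return NULL;
	t->refcount = 1;
	t->nr_states = nr_states;
	return t;
}

/* Models em_table_free(): drop one reference, release on the last one. */
static void mock_table_put(struct mock_em_table *t)
{
	if (t && --t->refcount == 0) {
		/* The kernel would defer this step with an RCU callback. */
		free(t);
		mock_tables_freed++;
	}
}

/* Models em_dev_update_perf_domain(): swap pointers, adjust ownership. */
static void mock_update(struct mock_perf_domain *pd,
			struct mock_em_table *new_table)
{
	struct mock_em_table *old = pd->em_table;

	new_table->refcount++;		/* the framework takes its own reference */
	pd->em_table = new_table;	/* the pointer swap */
	mock_table_put(old);		/* the framework lets go of the old table */
}
```

In this model a driver that performs a one-time update calls
``mock_table_put()`` right after ``mock_update()``; the table then stays alive
only through the framework's reference and is released when a later update
replaces it.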

To use the power values in other sub-systems (like thermal, powercap) there is
a need to call an API which protects the reader and provides consistency of the EM
table data::

struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd);

It returns the 'struct em_perf_state' pointer which is an array of performance
states in ascending order.
This function must be called in the RCU read lock section (after the
rcu_read_lock()). When the EM table is not needed anymore there is a need to
call rcu_read_unlock(). In this way the EM safely uses the RCU read section
and protects the users. It also allows the EM framework to manage the memory
and free it. More details how to use it can be found in Section 3.2 in the
example driver.
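
The reader-side rule can be sketched in plain C as follows. This is a
userspace illustration, not kernel code: ``demo_perf_state`` is a simplified
stand-in for 'struct em_perf_state', and ``demo_rcu_read_lock()`` and
``demo_rcu_read_unlock()`` are hypothetical stubs that only track nesting. The
point it tries to show is that the value is copied out of the table inside the
read section, and that every exit path leaves the section before returning.

```c
#include <stddef.h>

/* Simplified stand-in for 'struct em_perf_state' (assumption). */
struct demo_perf_state {
	unsigned long frequency;
	unsigned long power;
};

/* Hypothetical stubs: they only count nesting, they are not real RCU. */
static int demo_rcu_depth;

static void demo_rcu_read_lock(void)
{
	demo_rcu_depth++;
}

static void demo_rcu_read_unlock(void)
{
	demo_rcu_depth--;
}

/*
 * Return the power of the lowest performance state whose frequency is at
 * least 'freq', or 0 when no state qualifies. The table is assumed to be
 * sorted in ascending frequency order, as the EM requires.
 */
static unsigned long demo_power_at_freq(const struct demo_perf_state *table,
					int nr_states, unsigned long freq)
{
	unsigned long power = 0;
	int i;

	demo_rcu_read_lock();
	for (i = 0; i < nr_states; i++) {
		if (table[i].frequency < freq)
			continue;

		/* Copy the value; do not keep a pointer past the unlock. */
		power = table[i].power;
		break;
	}
	demo_rcu_read_unlock();

	return power;
}
```

Using ``break`` with a local variable instead of returning from inside the
loop keeps the lock and unlock calls balanced on every path.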

There is dedicated API for device drivers to calculate em_perf_state::cost
values::

int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
int nr_states);

These 'cost' values from EM are used in EAS. The new EM table should be passed
together with the number of entries and device pointer. When the computation
of the cost values is done properly the return value from the function is 0.
The function also marks each inefficient performance state by updating
em_perf_state::flags accordingly.
Then such prepared new EM can be passed to the em_dev_update_perf_domain()
function, which will allow to use it.
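
To make the 'cost' and inefficiency handling more concrete, here is a
userspace sketch of one way such a computation could look. It is not the
kernel implementation: the text above does not give the formula, so the
expression ``power * 10 / performance`` and the top-down walk are assumptions
made for illustration, and all ``sketch_*`` and ``SKETCH_*`` names are
hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_STATE_INEFFICIENT 0x1UL	/* hypothetical flag value */

/* Simplified stand-in for 'struct em_perf_state' (assumption). */
struct sketch_perf_state {
	unsigned long performance;
	unsigned long power;
	unsigned long cost;
	unsigned long flags;
};

/*
 * Fill in 'cost' and 'flags' for every state. Returns 0 on success and -1
 * on invalid input. The table is assumed to be in ascending order.
 */
static int sketch_compute_costs(struct sketch_perf_state *table, int nr_states)
{
	unsigned long prev_cost = ULONG_MAX;
	int i;

	if (!table || nr_states <= 0)
		return -1;

	/* Walk from the highest state down to the lowest one. */
	for (i = nr_states - 1; i >= 0; i--) {
		if (!table[i].performance)
			return -1;

		/* Assumed formula; the factor 10 only adds resolution. */
		table[i].cost = table[i].power * 10 / table[i].performance;

		if (table[i].cost >= prev_cost) {
			/* A higher state delivers more at no greater cost. */
			table[i].flags = SKETCH_STATE_INEFFICIENT;
		} else {
			table[i].flags = 0;
			prev_cost = table[i].cost;
		}
	}

	return 0;
}
```

With this model a state is flagged when some higher state has an equal or
lower cost, which matches the intent of skipping inefficient states during
frequency selection.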

More details about the above APIs can be found in ``<linux/energy_model.h>``
or in Section 3.2 with an example code showing simple implementation of the
updating mechanism in a device driver.


2.4 Description details of this API
2.5 Description details of this API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. kernel-doc:: include/linux/energy_model.h
:internal:
@@ -187,8 +284,11 @@ or in Section 2.4
:export:


3. Example driver
-----------------
3. Examples
-----------

3.1 Example driver with EM registration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The CPUFreq framework supports dedicated callback for registering
the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
@@ -242,3 +342,78 @@ EM framework::
39 static struct cpufreq_driver foo_cpufreq_driver = {
40 .register_em = foo_cpufreq_register_em,
41 };


3.2 Example driver with EM modification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides a simple example of a thermal driver modifying the EM.
The driver implements a foo_thermal_em_update() function. The driver is woken
up periodically to check the temperature and modify the EM data::

-> drivers/soc/example/example_em_mod.c

01 static void foo_get_new_em(struct foo_context *ctx)
02 {
03 struct em_perf_table __rcu *em_table;
04 struct em_perf_state *table, *new_table;
05 struct device *dev = ctx->dev;
06 struct em_perf_domain *pd;
07 unsigned long freq;
08 int i, ret;
09
10 pd = em_pd_get(dev);
11 if (!pd)
12 return;
13
14 em_table = em_table_alloc(pd);
15 if (!em_table)
16 return;
17
18 new_table = em_table->state;
19
20 rcu_read_lock();
21 table = em_perf_state_from_pd(pd);
22 for (i = 0; i < pd->nr_perf_states; i++) {
23 freq = table[i].frequency;
24 foo_get_power_perf_values(dev, freq, &new_table[i]);
25 }
26 rcu_read_unlock();
27
28 /* Calculate 'cost' values for EAS */
29 ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
30 if (ret) {
31 dev_warn(dev, "EM: compute costs failed %d\n", ret);
32 em_table_free(em_table);
33 return;
34 }
35
36 ret = em_dev_update_perf_domain(dev, em_table);
37 if (ret) {
38 dev_warn(dev, "EM: update failed %d\n", ret);
39 em_table_free(em_table);
40 return;
41 }
42
43 /*
44 * Since it's one-time-update drop the usage counter.
45 * The EM framework will later free the table when needed.
46 */
47 em_table_free(em_table);
48 }
49
50 /*
51 * Function called periodically to check the temperature and
52 * update the EM if needed
53 */
54 static void foo_thermal_em_update(struct foo_context *ctx)
55 {
56 struct device *dev = ctx->dev;
57 int cpu;
58
59 ctx->temperature = foo_get_temp(dev, ctx);
60 if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
61 return;
62
63 foo_get_new_em(ctx);
64 }
41 changes: 30 additions & 11 deletions drivers/powercap/dtpm_cpu.c
@@ -42,6 +42,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
{
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
struct em_perf_domain *pd = em_cpu_get(dtpm_cpu->cpu);
struct em_perf_state *table;
struct cpumask cpus;
unsigned long freq;
u64 power;
@@ -50,20 +51,22 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
nr_cpus = cpumask_weight(&cpus);

rcu_read_lock();
table = em_perf_state_from_pd(pd);
for (i = 0; i < pd->nr_perf_states; i++) {

power = pd->table[i].power * nr_cpus;
power = table[i].power * nr_cpus;

if (power > power_limit)
break;
}

freq = pd->table[i - 1].frequency;
freq = table[i - 1].frequency;
power_limit = table[i - 1].power * nr_cpus;
rcu_read_unlock();

freq_qos_update_request(&dtpm_cpu->qos_req, freq);

power_limit = pd->table[i - 1].power * nr_cpus;

return power_limit;
}

@@ -87,9 +90,11 @@ static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
static u64 get_pd_power_uw(struct dtpm *dtpm)
{
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
struct em_perf_state *table;
struct em_perf_domain *pd;
struct cpumask *pd_mask;
unsigned long freq;
u64 power = 0;
int i;

pd = em_cpu_get(dtpm_cpu->cpu);
@@ -98,33 +103,43 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)

freq = cpufreq_quick_get(dtpm_cpu->cpu);

rcu_read_lock();
table = em_perf_state_from_pd(pd);
for (i = 0; i < pd->nr_perf_states; i++) {

if (pd->table[i].frequency < freq)
if (table[i].frequency < freq)
continue;

return scale_pd_power_uw(pd_mask, pd->table[i].power);
power = scale_pd_power_uw(pd_mask, table[i].power);
break;
}
rcu_read_unlock();

return 0;
return power;
}

static int update_pd_power_uw(struct dtpm *dtpm)
{
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
struct em_perf_domain *em = em_cpu_get(dtpm_cpu->cpu);
struct em_perf_state *table;
struct cpumask cpus;
int nr_cpus;

cpumask_and(&cpus, cpu_online_mask, to_cpumask(em->cpus));
nr_cpus = cpumask_weight(&cpus);

dtpm->power_min = em->table[0].power;
rcu_read_lock();
table = em_perf_state_from_pd(em);

dtpm->power_min = table[0].power;
dtpm->power_min *= nr_cpus;

dtpm->power_max = em->table[em->nr_perf_states - 1].power;
dtpm->power_max = table[em->nr_perf_states - 1].power;
dtpm->power_max *= nr_cpus;

rcu_read_unlock();

return 0;
}

@@ -143,7 +158,7 @@ static void pd_release(struct dtpm *dtpm)

cpufreq_cpu_put(policy);
}

kfree(dtpm_cpu);
}

@@ -180,6 +195,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
{
struct dtpm_cpu *dtpm_cpu;
struct cpufreq_policy *policy;
struct em_perf_state *table;
struct em_perf_domain *pd;
char name[CPUFREQ_NAME_LEN];
int ret = -ENOMEM;
@@ -216,9 +232,12 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
if (ret)
goto out_kfree_dtpm_cpu;

rcu_read_lock();
table = em_perf_state_from_pd(pd);
ret = freq_qos_add_request(&policy->constraints,
&dtpm_cpu->qos_req, FREQ_QOS_MAX,
pd->table[pd->nr_perf_states - 1].frequency);
table[pd->nr_perf_states - 1].frequency);
rcu_read_unlock();
if (ret < 0)
goto out_dtpm_unregister;
