Merge branches 'pm-cpuidle' and 'pm-em'
* pm-cpuidle:
  cpuidle: Select polling interval based on a c-state with a longer target residency
  cpuidle: psci: Enable suspend-to-idle for PSCI OSI mode
  PM: domains: Enable dev_pm_genpd_suspend|resume() for suspend-to-idle
  PM: domains: Rename pm_genpd_syscore_poweroff|poweron()

* pm-em:
  PM / EM: Micro optimization in em_cpu_energy
  PM: EM: Update Energy Model with new flag indicating power scale
  PM: EM: update the comments related to power scale
  PM: EM: Clarify abstract scale usage for power values in Energy Model
Rafael J. Wysocki committed Dec 15, 2020
3 parents e1f1320 + 7a25759 + 1080399 commit 4c5744a
Showing 13 changed files with 154 additions and 49 deletions.
12 changes: 11 additions & 1 deletion Documentation/driver-api/thermal/power_allocator.rst
@@ -71,7 +71,9 @@ to the speed-grade of the silicon. `sustainable_power` is therefore
simply an estimate, and may be tuned to affect the aggressiveness of
the thermal ramp. For reference, the sustainable power of a 4" phone
is typically 2000mW, while on a 10" tablet is around 4500mW (may vary
depending on screen size).
depending on screen size). The power value may also be expressed in an
'abstract scale'; in that case, the sustainable power must use the same
scale as the related cooling devices.

If you are using device tree, do add it as a property of the
thermal-zone. For example::
@@ -269,3 +271,11 @@ won't be very good. Note that this is not particular to this
governor, step-wise will also misbehave if you call its throttle()
faster than the normal thermal framework tick (due to interrupts for
example) as it will overreact.

Energy Model requirements
=========================

Another important requirement is a consistent scale for the power values
provided by the cooling devices. All of the cooling devices in a single
thermal zone should report their power values either in milli-Watts or
scaled to the same 'abstract scale'.
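As an illustration only, the kind of consistency check this implies could look
like the sketch below; the cooling_dev structure and the uses_milliwatts field
are invented for the example and are not the kernel's thermal API::

  /* Sketch: all cooling devices bound to a zone must share one power scale. */
  #include <stdbool.h>
  #include <stddef.h>

  struct cooling_dev {
      const char *name;
      bool uses_milliwatts;   /* true: milli-Watts, false: abstract scale */
  };

  static bool zone_power_scale_consistent(const struct cooling_dev *devs,
                                          size_t count)
  {
      size_t i;

      for (i = 1; i < count; i++) {
          /* Mixed scales make the power allocation unreliable. */
          if (devs[i].uses_milliwatts != devs[0].uses_milliwatts)
              return false;
      }

      return true;
  }
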
30 changes: 25 additions & 5 deletions Documentation/power/energy-model.rst
@@ -20,6 +20,21 @@ possible source of information on its own, the EM framework intervenes as an
abstraction layer which standardizes the format of power cost tables in the
kernel, hence enabling to avoid redundant work.

The power values might be expressed in milli-Watts or in an 'abstract scale'.
Multiple subsystems might use the EM, and it is up to the system integrator to
check that the requirements for the power value scale types are met. An example
can be found in the Energy-Aware Scheduler documentation
Documentation/scheduler/sched-energy.rst. For some subsystems, such as thermal
or powercap, power values expressed in an 'abstract scale' might cause issues.
These subsystems are more interested in estimating the power used in the past,
so real milli-Watts might be needed. An example of these requirements can be
found in the Intelligent Power Allocation documentation in
Documentation/driver-api/thermal/power_allocator.rst.
Kernel subsystems might implement automatic detection of an inconsistent scale
among EM-registered devices (based on an internal EM flag).
An important thing to keep in mind is that when power values are expressed in
an 'abstract scale', deriving real energy in milli-Joules is not possible.
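A small stand-alone illustration of that last point (not kernel code): with
real milli-Watts, energy follows directly from power multiplied by time, which
is exactly what an 'abstract scale' cannot provide::

  /* 200 mW for 50 ms -> 10 mJ; meaningless if 200 is an abstract unit. */
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint64_t power_mw = 200;  /* power from the EM table, in milli-Watts */
      uint64_t time_ms = 50;    /* time spent at that performance state */

      /* mW * ms = uJ, so divide by 1000 to get milli-Joules. */
      uint64_t energy_mj = power_mw * time_ms / 1000;

      printf("%llu mJ\n", (unsigned long long)energy_mj);
      return 0;
  }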

The figure below depicts an example of drivers (Arm-specific here, but the
approach is applicable to any architecture) providing power costs to the EM
framework, and interested clients reading the data from it::
@@ -73,14 +88,18 @@ Drivers are expected to register performance domains into the EM framework by
calling the following API::

int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
struct em_data_callback *cb, cpumask_t *cpus);
struct em_data_callback *cb, cpumask_t *cpus, bool milliwatts);

Drivers must provide a callback function returning <frequency, power> tuples
for each performance state. The callback function provided by the driver is free
to fetch data from any relevant location (DT, firmware, ...), and by any mean
deemed necessary. Only for CPU devices, drivers must specify the CPUs of the
performance domains using cpumask. For other devices than CPUs the last
argument must be set to NULL.
It is important to set the last argument, 'milliwatts', to the correct value.
Kernel subsystems which use the EM might rely on this flag to check whether all
EM devices use the same scale. If the scales differ, those subsystems might
decide to return a warning or an error, stop working, or panic.
See Section 3. for an example of driver implementing this
callback, and kernel/power/energy_model.c for further documentation on this
API.
@@ -156,7 +175,8 @@ EM framework::
37 nr_opp = foo_get_nr_opp(policy);
38
39 /* And register the new performance domain */
40 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus);
41
42 return 0;
43 }
40 em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
41 true);
42
43 return 0;
44 }
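For reference, a condensed sketch of the updated registration flow is shown
below; the foo_*() helpers and the est_power() callback body are illustrative
placeholders, and the callback prototype is assumed to match the one used in
Section 3 of this document::

  #include <linux/cpu.h>
  #include <linux/cpufreq.h>
  #include <linux/cpumask.h>
  #include <linux/energy_model.h>

  /* Return the <frequency, power> tuple for the next performance state. */
  static int est_power(unsigned long *mW, unsigned long *KHz,
                       struct device *dev)
  {
      /* e.g. look the values up in DT or firmware tables */
      *KHz = foo_get_next_freq(dev, *KHz);
      *mW = foo_estimate_power(dev, *KHz);
      return 0;
  }

  static struct em_data_callback em_cb = EM_DATA_CB(est_power);

  static int foo_cpufreq_init(struct cpufreq_policy *policy)
  {
      struct device *cpu_dev = get_cpu_device(cpumask_first(policy->cpus));
      int nr_opp = foo_get_nr_opp(policy);

      /* Power values here are real milli-Watts, hence 'true'. */
      return em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb,
                                         policy->cpus, true);
  }
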
5 changes: 5 additions & 0 deletions Documentation/scheduler/sched-energy.rst
@@ -350,6 +350,11 @@ independent EM framework in Documentation/power/energy-model.rst.
Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.

EAS uses the EM to forecast energy usage, so it is mainly interested in the
difference in energy between the possible options for task placement. For EAS
it therefore does not matter whether the EM power values are expressed in
milli-Watts or in an 'abstract scale'.
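A toy illustration of that property (a deliberately simplified cost model, not
the scheduler's code): multiplying every power value by the same constant never
changes which candidate placement wins::

  #include <stdio.h>

  /* Simplified energy estimate: power scaled by utilization over capacity. */
  static unsigned long estimate(unsigned long power, unsigned long util,
                                unsigned long capacity)
  {
      return power * util / capacity;
  }

  int main(void)
  {
      unsigned long scale;

      for (scale = 1; scale <= 1000; scale *= 10) {
          unsigned long little = estimate(150 * scale, 200, 512);
          unsigned long big    = estimate(600 * scale, 200, 1024);

          /* The cheaper CPU is the same at every scale. */
          printf("scale %4lu: little=%lu big=%lu -> %s\n",
                 scale, little, big, little < big ? "little" : "big");
      }
      return 0;
  }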


6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Expand Down
51 changes: 35 additions & 16 deletions drivers/base/power/domain.c
@@ -1363,41 +1363,60 @@ static void genpd_complete(struct device *dev)
genpd_unlock(genpd);
}

/**
* genpd_syscore_switch - Switch power during system core suspend or resume.
* @dev: Device that normally is marked as "always on" to switch power for.
*
* This routine may only be called during the system core (syscore) suspend or
* resume phase for devices whose "always on" flags are set.
*/
static void genpd_syscore_switch(struct device *dev, bool suspend)
static void genpd_switch_state(struct device *dev, bool suspend)
{
struct generic_pm_domain *genpd;
bool use_lock;

genpd = dev_to_genpd_safe(dev);
if (!genpd)
return;

use_lock = genpd_is_irq_safe(genpd);

if (use_lock)
genpd_lock(genpd);

if (suspend) {
genpd->suspended_count++;
genpd_sync_power_off(genpd, false, 0);
genpd_sync_power_off(genpd, use_lock, 0);
} else {
genpd_sync_power_on(genpd, false, 0);
genpd_sync_power_on(genpd, use_lock, 0);
genpd->suspended_count--;
}

if (use_lock)
genpd_unlock(genpd);
}

void pm_genpd_syscore_poweroff(struct device *dev)
/**
* dev_pm_genpd_suspend - Synchronously try to suspend the genpd for @dev
* @dev: The device that is attached to the genpd, that can be suspended.
*
* This routine should typically be called for a device that needs to be
* suspended during the syscore suspend phase. It may also be called during
* suspend-to-idle to suspend a corresponding CPU device that is attached to a
* genpd.
*/
void dev_pm_genpd_suspend(struct device *dev)
{
genpd_syscore_switch(dev, true);
genpd_switch_state(dev, true);
}
EXPORT_SYMBOL_GPL(pm_genpd_syscore_poweroff);
EXPORT_SYMBOL_GPL(dev_pm_genpd_suspend);

void pm_genpd_syscore_poweron(struct device *dev)
/**
* dev_pm_genpd_resume - Synchronously try to resume the genpd for @dev
* @dev: The device that is attached to the genpd, which needs to be resumed.
*
* This routine should typically be called for a device that needs to be resumed
* during the syscore resume phase. It may also be called during suspend-to-idle
* to resume a corresponding CPU device that is attached to a genpd.
*/
void dev_pm_genpd_resume(struct device *dev)
{
genpd_syscore_switch(dev, false);
genpd_switch_state(dev, false);
}
EXPORT_SYMBOL_GPL(pm_genpd_syscore_poweron);
EXPORT_SYMBOL_GPL(dev_pm_genpd_resume);

#else /* !CONFIG_PM_SLEEP */

8 changes: 4 additions & 4 deletions drivers/clocksource/sh_cmt.c
@@ -658,7 +658,7 @@ static void sh_cmt_clocksource_suspend(struct clocksource *cs)
return;

sh_cmt_stop(ch, FLAG_CLOCKSOURCE);
pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev);
dev_pm_genpd_suspend(&ch->cmt->pdev->dev);
}

static void sh_cmt_clocksource_resume(struct clocksource *cs)
@@ -668,7 +668,7 @@ static void sh_cmt_clocksource_resume(struct clocksource *cs)
if (!ch->cs_enabled)
return;

pm_genpd_syscore_poweron(&ch->cmt->pdev->dev);
dev_pm_genpd_resume(&ch->cmt->pdev->dev);
sh_cmt_start(ch, FLAG_CLOCKSOURCE);
}

@@ -760,7 +760,7 @@ static void sh_cmt_clock_event_suspend(struct clock_event_device *ced)
{
struct sh_cmt_channel *ch = ced_to_sh_cmt(ced);

pm_genpd_syscore_poweroff(&ch->cmt->pdev->dev);
dev_pm_genpd_suspend(&ch->cmt->pdev->dev);
clk_unprepare(ch->cmt->clk);
}

@@ -769,7 +769,7 @@ static void sh_cmt_clock_event_resume(struct clock_event_device *ced)
struct sh_cmt_channel *ch = ced_to_sh_cmt(ced);

clk_prepare(ch->cmt->clk);
pm_genpd_syscore_poweron(&ch->cmt->pdev->dev);
dev_pm_genpd_resume(&ch->cmt->pdev->dev);
}

static int sh_cmt_register_clockevent(struct sh_cmt_channel *ch,
4 changes: 2 additions & 2 deletions drivers/clocksource/sh_mtu2.c
@@ -297,12 +297,12 @@ static int sh_mtu2_clock_event_set_periodic(struct clock_event_device *ced)

static void sh_mtu2_clock_event_suspend(struct clock_event_device *ced)
{
pm_genpd_syscore_poweroff(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
dev_pm_genpd_suspend(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
}

static void sh_mtu2_clock_event_resume(struct clock_event_device *ced)
{
pm_genpd_syscore_poweron(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
dev_pm_genpd_resume(&ced_to_sh_mtu2(ced)->mtu->pdev->dev);
}

static void sh_mtu2_register_clockevent(struct sh_mtu2_channel *ch,
8 changes: 4 additions & 4 deletions drivers/clocksource/sh_tmu.c
@@ -292,7 +292,7 @@ static void sh_tmu_clocksource_suspend(struct clocksource *cs)

if (--ch->enable_count == 0) {
__sh_tmu_disable(ch);
pm_genpd_syscore_poweroff(&ch->tmu->pdev->dev);
dev_pm_genpd_suspend(&ch->tmu->pdev->dev);
}
}

@@ -304,7 +304,7 @@ static void sh_tmu_clocksource_resume(struct clocksource *cs)
return;

if (ch->enable_count++ == 0) {
pm_genpd_syscore_poweron(&ch->tmu->pdev->dev);
dev_pm_genpd_resume(&ch->tmu->pdev->dev);
__sh_tmu_enable(ch);
}
}
@@ -394,12 +394,12 @@ static int sh_tmu_clock_event_next(unsigned long delta,

static void sh_tmu_clock_event_suspend(struct clock_event_device *ced)
{
pm_genpd_syscore_poweroff(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
dev_pm_genpd_suspend(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
}

static void sh_tmu_clock_event_resume(struct clock_event_device *ced)
{
pm_genpd_syscore_poweron(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
dev_pm_genpd_resume(&ced_to_sh_tmu(ced)->tmu->pdev->dev);
}

static void sh_tmu_register_clockevent(struct sh_tmu_channel *ch,
2 changes: 2 additions & 0 deletions drivers/cpuidle/cpuidle-psci-domain.c
@@ -327,6 +327,8 @@ struct device *psci_dt_attach_cpu(int cpu)
if (cpu_online(cpu))
pm_runtime_get_sync(dev);

dev_pm_syscore_device(dev, true);

return dev;
}

34 changes: 30 additions & 4 deletions drivers/cpuidle/cpuidle-psci.c
@@ -19,6 +19,7 @@
#include <linux/of_device.h>
#include <linux/platform_device.h>
#include <linux/psci.h>
#include <linux/pm_domain.h>
#include <linux/pm_runtime.h>
#include <linux/slab.h>
#include <linux/string.h>
@@ -52,8 +53,9 @@ static inline int psci_enter_state(int idx, u32 state)
return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, state);
}

static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int idx)
static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int idx,
bool s2idle)
{
struct psci_cpuidle_data *data = this_cpu_ptr(&psci_cpuidle_data);
u32 *states = data->psci_states;
@@ -66,15 +68,25 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
return -1;

/* Do runtime PM to manage a hierarchical CPU topology. */
RCU_NONIDLE(pm_runtime_put_sync_suspend(pd_dev));
rcu_irq_enter_irqson();
if (s2idle)
dev_pm_genpd_suspend(pd_dev);
else
pm_runtime_put_sync_suspend(pd_dev);
rcu_irq_exit_irqson();

state = psci_get_domain_state();
if (!state)
state = states[idx];

ret = psci_cpu_suspend_enter(state) ? -1 : idx;

RCU_NONIDLE(pm_runtime_get_sync(pd_dev));
rcu_irq_enter_irqson();
if (s2idle)
dev_pm_genpd_resume(pd_dev);
else
pm_runtime_get_sync(pd_dev);
rcu_irq_exit_irqson();

cpu_pm_exit();

@@ -83,6 +95,19 @@ static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
return ret;
}

static int psci_enter_domain_idle_state(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int idx)
{
return __psci_enter_domain_idle_state(dev, drv, idx, false);
}

static int psci_enter_s2idle_domain_idle_state(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int idx)
{
return __psci_enter_domain_idle_state(dev, drv, idx, true);
}

static int psci_idle_cpuhp_up(unsigned int cpu)
{
struct device *pd_dev = __this_cpu_read(psci_cpuidle_data.dev);
@@ -170,6 +195,7 @@ static int psci_dt_cpu_init_topology(struct cpuidle_driver *drv,
* deeper states.
*/
drv->states[state_count - 1].enter = psci_enter_domain_idle_state;
drv->states[state_count - 1].enter_s2idle = psci_enter_s2idle_domain_idle_state;
psci_cpuidle_use_cpuhp = true;

return 0;
25 changes: 23 additions & 2 deletions drivers/cpuidle/cpuidle.c
@@ -368,6 +368,19 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
cpuidle_curr_governor->reflect(dev, index);
}

/*
* Min polling interval of 10usec is a guess. It assumes that
* for most users, the time for a single ping-pong workload like
* perf bench pipe would generally complete within 10usec but
* this is hardware dependent. Actual time can be estimated with
*
* perf bench sched pipe -l 10000
*
* Run multiple times to avoid cpufreq effects.
*/
#define CPUIDLE_POLL_MIN 10000
#define CPUIDLE_POLL_MAX (TICK_NSEC / 16)

/**
* cpuidle_poll_time - return amount of time to poll for,
* governors can override dev->poll_limit_ns if necessary
@@ -382,15 +395,23 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv,
int i;
u64 limit_ns;

BUILD_BUG_ON(CPUIDLE_POLL_MIN > CPUIDLE_POLL_MAX);

if (dev->poll_limit_ns)
return dev->poll_limit_ns;

limit_ns = TICK_NSEC;
limit_ns = CPUIDLE_POLL_MAX;
for (i = 1; i < drv->state_count; i++) {
u64 state_limit;

if (dev->states_usage[i].disable)
continue;

limit_ns = drv->states[i].target_residency_ns;
state_limit = drv->states[i].target_residency_ns;
if (state_limit < CPUIDLE_POLL_MIN)
continue;

limit_ns = min_t(u64, state_limit, CPUIDLE_POLL_MAX);
break;
}

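To make the new bounds concrete, the following stand-alone sketch mirrors the
selection logic above; TICK_NSEC is hard-coded for HZ=250 here and the state
table is invented for the example::

  #include <stdint.h>
  #include <stdio.h>

  #define TICK_NSEC         4000000ULL        /* 4 ms tick (HZ=250) */
  #define CPUIDLE_POLL_MIN  10000ULL          /* 10 us */
  #define CPUIDLE_POLL_MAX  (TICK_NSEC / 16)  /* 250 us */

  struct state {
      uint64_t target_residency_ns;
      int disabled;
  };

  /* Poll limit: residency of the shallowest usable state, clamped to the bounds. */
  static uint64_t poll_time(const struct state *states, int count)
  {
      uint64_t limit_ns = CPUIDLE_POLL_MAX;
      int i;

      for (i = 1; i < count; i++) {
          uint64_t state_limit;

          if (states[i].disabled)
              continue;

          state_limit = states[i].target_residency_ns;
          if (state_limit < CPUIDLE_POLL_MIN)
              continue;

          limit_ns = state_limit < CPUIDLE_POLL_MAX ?
                     state_limit : CPUIDLE_POLL_MAX;
          break;
      }

      return limit_ns;
  }

  int main(void)
  {
      /*
       * Index 0 is the polling state itself.  State 1 (2 us) is below
       * CPUIDLE_POLL_MIN and is skipped, so state 2 (20 us) sets the limit.
       */
      struct state states[] = {
          { 0, 0 }, { 2000, 0 }, { 20000, 0 }, { 1000000, 0 },
      };

      printf("poll limit: %llu ns\n",
             (unsigned long long)poll_time(states, 4));  /* prints 20000 ns */
      return 0;
  }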