Commit 6453dbdd authored by Linus Torvalds

Merge tag 'pm-4.8-rc1' of git://

Pull power management updates from Rafael Wysocki:
 "Again, the majority of changes go into the cpufreq subsystem, but
  there are no big features this time.  The cpufreq changes that stand
  out somewhat are the governor interface rework and improvements
  related to the handling of frequency tables.  Apart from those, there
  are fixes and new device/CPU IDs in drivers, cleanups and an
  improvement of the new schedutil governor.

  Next, there are some changes in the hibernation core, including a fix
  for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
  during resume from hibernation, a few core improvements related to
  memory management during resume, a couple of additional debug features
  and cleanups.

  Finally, we have some fixes and cleanups in the devfreq subsystem,
  generic power domains framework improvements related to system
  suspend/resume, support for some new chips in intel_idle and in the
  power capping RAPL driver, a new version of the AnalyzeSuspend utility
  and some assorted fixes and cleanups.


   - Rework the cpufreq governor interface to make it more
     straightforward and modify the conservative governor to avoid using
     transition notifications (Rafael Wysocki).

   - Rework the handling of frequency tables by the cpufreq core to make
     it more efficient (Viresh Kumar).

   - Modify the schedutil governor to reduce the number of wakeups it
     causes to occur in cases when the CPU frequency doesn't need to be
     changed (Steve Muckle, Viresh Kumar).

   - Fix some minor issues and clean up code in the cpufreq core and
     governors (Rafael Wysocki, Viresh Kumar).

   - Add Intel Broxton support to the intel_pstate driver (Srinivas

   - Fix problems related to the config TDP feature and to the validity
     of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
     Srinivas Pandruvada).

   - Make intel_pstate update the cpu_frequency tracepoint even if the
     frequency doesn't change to avoid confusing powertop (Rafael

   - Clean up the usage of __init/__initdata in intel_pstate, mark some
     of its internal variables as __read_mostly and drop an unused
     structure element from it (Jisheng Zhang, Carsten Emde).

   - Clean up the usage of some duplicate MSR symbols in intel_pstate
     and turbostat (Srinivas Pandruvada).

   - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
     Adiga, Viresh Kumar, Ben Dooks).

   - Fix a regression (introduced during the 4.5 cycle) in the
     pcc-cpufreq driver by reverting the problematic commit (Andreas

   - Add support for Intel Denverton to intel_idle, clean up Broxton
     support in it and make it explicitly non-modular (Jacob Pan, Jan
     Beulich, Paul Gortmaker).

   - Add support for Denverton and Ivy Bridge server to the Intel RAPL
     power capping driver and make it more careful about the handling of
     MSRs that may not be present (Jacob Pan, Xiaolong Wang).

   - Fix resume from hibernation on x86-64 by making the CPU offline
     during resume avoid using MONITOR/MWAIT in the "play dead" loop
     which may lead to an inadvertent "revival" of a "dead" CPU and a
     page fault leading to a kernel crash from it (Rafael Wysocki).

   - Make memory management during resume from hibernation more
     straightforward (Rafael Wysocki).

   - Add debug features that should help to detect problems related to
     hibernation and resume from it (Rafael Wysocki, Chen Yu).

   - Clean up hibernation core somewhat (Rafael Wysocki).

   - Prevent KASAN from instrumenting the hibernation core, which leads
     to large numbers of false positives from it (James Morse).

   - Prevent PM (hibernate and suspend) notifiers from being called
     during the cleanup phase if they have not been called during the
     corresponding preparation phase which is possible if one of the
     other notifiers returns an error at that time (Lianwei Wang).

   - Improve suspend-related debug printout in the tasks freezer and
     clean up suspend-related console handling (Roger Lu, Borislav

   - Update the AnalyzeSuspend script in the kernel sources to version
     4.2 (Todd Brandt).

   - Modify the generic power domains framework to make it handle system
     suspend/resume better (Ulf Hansson).

   - Make the runtime PM framework avoid resuming devices synchronously
     when user space changes the runtime PM settings for them and
     improve its error reporting (Rafael Wysocki, Linus Walleij).

   - Fix error paths in devfreq drivers (exynos, exynos-ppmu,
     exynos-bus) and in the core, make some devfreq code explicitly
     non-modular and change some of it into tristate (Bartlomiej
     Zolnierkiewicz, Peter Chen, Paul Gortmaker).

   - Add DT support to the generic PM clocks management code and make it
     export some more symbols (Jon Hunter, Paul Gortmaker).

   - Make the PCI PM core code slightly more robust against possible
     driver errors (Andy Shevchenko).

   - Make it possible to change DESTDIR and PREFIX in turbostat (Andy

* tag 'pm-4.8-rc1' of git:// (89 commits)
  Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
  PM / hibernate: Introduce test_resume mode for hibernation
  cpufreq: export cpufreq_driver_resolve_freq()
  cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
  PCI / PM: check all fields in pci_set_platform_pm()
  cpufreq: acpi-cpufreq: use cached frequency mapping when possible
  cpufreq: schedutil: map raw required frequency to driver frequency
  cpufreq: add cpufreq_driver_resolve_freq()
  cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
  intel_pstate: Update cpu_frequency tracepoint every time
  cpufreq: intel_pstate: clean remnant struct element
  PM / tools: scripts: AnalyzeSuspend v4.2
  x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  cpufreq: powernv: Replacing pstate_id with frequency table index
  intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
  PM / hibernate: Image data protection during restoration
  PM / hibernate: Add missing braces in __register_nosave_region()
  PM / hibernate: Clean up comments in snapshot.c
  PM / hibernate: Clean up function headers in snapshot.c
  PM / hibernate: Add missing braces in hibernate_setup()
parents 27b79027 bc841e26
......@@ -96,7 +96,7 @@ new - new frequency
For details about OPP, see Documentation/power/opp.txt
dev_pm_opp_init_cpufreq_table - cpufreq framework typically is initialized with
cpufreq_frequency_table_cpuinfo which is provided with the list of
cpufreq_table_validate_and_show() which is provided with the list of
frequencies that are available for operation. This function provides
a ready to use conversion routine to translate the OPP layer's internal
information about the available frequencies into a format readily
......@@ -110,7 +110,7 @@ dev_pm_opp_init_cpufreq_table - cpufreq framework typically is initialized with
/* Do things */
r = dev_pm_opp_init_cpufreq_table(dev, &freq_table);
if (!r)
cpufreq_frequency_table_cpuinfo(policy, freq_table);
cpufreq_table_validate_and_show(policy, freq_table);
/* Do other things */
......@@ -231,7 +231,7 @@ if you want to skip one entry in the table, set the frequency to
CPUFREQ_ENTRY_INVALID. The entries don't need to be in ascending
By calling cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
By calling cpufreq_table_validate_and_show(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table);
the cpuinfo.min_freq and cpuinfo.max_freq values are detected, and
policy->min and policy->max are set to the same values. This is
......@@ -244,14 +244,12 @@ policy->max, and all other criteria are met. This is helpful for the
->verify call.
int cpufreq_frequency_table_target(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int target_freq,
unsigned int relation,
unsigned int *index);
unsigned int relation);
is the corresponding frequency table helper for the ->target
stage. Just pass the values to this function, and the unsigned int
index returns the number of the frequency table entry which contains
stage. Just pass the values to this function, and this function
returns the number of the frequency table entry which contains
the frequency the CPU shall be set to.
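The documentation change above moves the table index from an out-parameter to the return value. As a rough userspace illustration of that calling convention (not the kernel implementation), the sketch below picks an entry for a target frequency and returns its index. The names `freq_entry`, `TABLE_END`, `RELATION_L`, `RELATION_H` and `table_target()` are hypothetical stand-ins, and the fallback behaviour when no entry satisfies the relation is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel definitions; not the real headers. */
#define TABLE_END (~0u)

enum relation {
	RELATION_L, /* lowest frequency at or above the target */
	RELATION_H  /* highest frequency at or below the target */
};

struct freq_entry {
	unsigned int frequency; /* kHz; TABLE_END terminates the table */
};

/*
 * Return the index of the entry selected for target_freq, or -1 for an
 * empty table. The table does not need to be sorted. If no entry
 * satisfies the relation, the closest entry on the other side is used
 * (an assumption made for this sketch).
 */
int table_target(const struct freq_entry *table, unsigned int target_freq,
		 enum relation rel)
{
	int best = -1, fallback = -1;
	int i;

	assert(table != NULL);

	for (i = 0; table[i].frequency != TABLE_END; i++) {
		unsigned int f = table[i].frequency;
		int ok = (rel == RELATION_L) ? (f >= target_freq)
					     : (f <= target_freq);

		if (ok) {
			/* L wants the smallest qualifying value, H the largest. */
			if (best < 0 ||
			    (rel == RELATION_L && f < table[best].frequency) ||
			    (rel == RELATION_H && f > table[best].frequency))
				best = i;
		} else {
			/* Closest non-qualifying entry, used only as a fallback. */
			if (fallback < 0 ||
			    (rel == RELATION_L && f > table[fallback].frequency) ||
			    (rel == RELATION_H && f < table[fallback].frequency))
				fallback = i;
		}
	}

	return best >= 0 ? best : fallback;
}
```

A caller would use the result directly, for example `idx = table_target(table, 1000000, RELATION_L);`, instead of passing a pointer to receive the index.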
The following macros can be used as iterators over cpufreq_frequency_table:
......@@ -159,8 +159,8 @@ to be strictly associated with a P-state.
2.2 cpuinfo_transition_latency:
The cpuinfo_transition_latency field is CPUFREQ_ETERNAL. The PCC specification
does not include a field to expose this value currently.
The cpuinfo_transition_latency field is 0. The PCC specification does
not include a field to expose this value currently.
2.3 cpuinfo_cur_freq:
......@@ -3598,6 +3598,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
present during boot.
nocompress Don't compress/decompress hibernation images.
no Disable hibernation and resume.
protect_image Turn on image protection during restoration
(that will set all pages holding image data
during restoration read-only).
retain_initrd [RAM] Keep initrd memory after extraction
......@@ -85,61 +85,57 @@ static void spu_gov_cancel_work(struct spu_gov_info_struct *info)
static int spu_gov_govern(struct cpufreq_policy *policy, unsigned int event)
static int spu_gov_start(struct cpufreq_policy *policy)
unsigned int cpu = policy->cpu;
struct spu_gov_info_struct *info, *affected_info;
struct spu_gov_info_struct *info = &per_cpu(spu_gov_info, cpu);
struct spu_gov_info_struct *affected_info;
int i;
int ret = 0;
info = &per_cpu(spu_gov_info, cpu);
switch (event) {
if (!cpu_online(cpu)) {
printk(KERN_ERR "cpu %d is not online\n", cpu);
ret = -EINVAL;
if (!cpu_online(cpu)) {
printk(KERN_ERR "cpu %d is not online\n", cpu);
return -EINVAL;
if (!policy->cur) {
printk(KERN_ERR "no cpu specified in policy\n");
ret = -EINVAL;
if (!policy->cur) {
printk(KERN_ERR "no cpu specified in policy\n");
return -EINVAL;
/* initialize spu_gov_info for all affected cpus */
for_each_cpu(i, policy->cpus) {
affected_info = &per_cpu(spu_gov_info, i);
affected_info->policy = policy;
/* initialize spu_gov_info for all affected cpus */
for_each_cpu(i, policy->cpus) {
affected_info = &per_cpu(spu_gov_info, i);
affected_info->policy = policy;
info->poll_int = POLL_TIME;
info->poll_int = POLL_TIME;
/* setup timer */
/* setup timer */
return 0;
/* cancel timer */
static void spu_gov_stop(struct cpufreq_policy *policy)
unsigned int cpu = policy->cpu;
struct spu_gov_info_struct *info = &per_cpu(spu_gov_info, cpu);
int i;
/* clean spu_gov_info for all affected cpus */
for_each_cpu (i, policy->cpus) {
info = &per_cpu(spu_gov_info, i);
info->policy = NULL;
/* cancel timer */
/* clean spu_gov_info for all affected cpus */
for_each_cpu (i, policy->cpus) {
info = &per_cpu(spu_gov_info, i);
info->policy = NULL;
return ret;
static struct cpufreq_governor spu_governor = {
.name = "spudemand",
.governor = spu_gov_govern,
.start = spu_gov_start,
.stop = spu_gov_stop,
.owner = THIS_MODULE,
......@@ -64,8 +64,6 @@
#define MSR_OFFCORE_RSP_0 0x000001a6
#define MSR_OFFCORE_RSP_1 0x000001a7
#define MSR_NHM_TURBO_RATIO_LIMIT 0x000001ad
#define MSR_IVT_TURBO_RATIO_LIMIT 0x000001ae
#define MSR_TURBO_RATIO_LIMIT 0x000001ad
#define MSR_TURBO_RATIO_LIMIT1 0x000001ae
#define MSR_TURBO_RATIO_LIMIT2 0x000001af
......@@ -135,6 +135,7 @@ int native_cpu_up(unsigned int cpunum, struct task_struct *tidle);
int native_cpu_disable(void);
int common_cpu_die(unsigned int cpu);
void native_cpu_die(unsigned int cpu);
void hlt_play_dead(void);
void native_play_dead(void);
void play_dead_common(void);
void wbinvd_on_cpu(int cpu);
......@@ -1644,7 +1644,7 @@ static inline void mwait_play_dead(void)
static inline void hlt_play_dead(void)
void hlt_play_dead(void)
if (__this_cpu_read(cpu_info.x86) >= 4)
......@@ -12,6 +12,7 @@
#include <linux/export.h>
#include <linux/smp.h>
#include <linux/perf_event.h>
#include <linux/tboot.h>
#include <asm/pgtable.h>
#include <asm/proto.h>
......@@ -266,6 +267,35 @@ void notrace restore_processor_state(void)
static void resume_play_dead(void)
int hibernate_resume_nonboot_cpu_disable(void)
void (*play_dead)(void) = smp_ops.play_dead;
int ret;
* Ensure that MONITOR/MWAIT will not be used in the "play dead" loop
* during hibernate image restoration, because it is likely that the
* monitored address will be actually written to at that time and then
* the "dead" CPU will attempt to execute instructions again, but the
* address in its instruction pointer may not be possible to resolve
* any more at that point (the page tables used by it previously may
* have been overwritten by hibernate image data).
smp_ops.play_dead = resume_play_dead;
ret = disable_nonboot_cpus();
smp_ops.play_dead = play_dead;
return ret;
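The function above follows a save, override, restore pattern around a callback pointer. The userspace sketch below shows only that pattern; `cpu_ops`, `mwait_idle()`, `hlt_idle()`, `offline_cpus()` and `last_idle_method` are hypothetical names invented for the illustration and do not correspond to kernel symbols.

```c
#include <assert.h>

/* Hypothetical miniature of an ops table; not the kernel's smp_ops. */
struct cpu_ops {
	void (*play_dead)(void);
};

/* Records which idle variant was "entered", purely for illustration. */
int last_idle_method;

void mwait_idle(void) { last_idle_method = 1; }
void hlt_idle(void)   { last_idle_method = 2; }

struct cpu_ops ops = { .play_dead = mwait_idle };

/* Stand-in for taking CPUs offline: each offlined CPU calls the hook. */
int offline_cpus(void)
{
	assert(ops.play_dead != 0);
	ops.play_dead();
	return 0;
}

/*
 * Force the safe idle variant for the duration of one operation, then
 * put the original callback back whether or not the operation failed.
 */
int resume_offline_cpus(void)
{
	void (*saved)(void) = ops.play_dead;
	int ret;

	ops.play_dead = hlt_idle;
	ret = offline_cpus();
	ops.play_dead = saved;

	return ret;
}
```

Restoring the pointer unconditionally keeps the override scoped to this one call, so later CPU offline operations use the default variant again.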
* When bsp_check() is called in hibernate and suspend, cpu hotplug
* is disabled already. So it's unnecessary to handle race condition between
......@@ -121,6 +121,7 @@ int pm_clk_add(struct device *dev, const char *con_id)
return __pm_clk_add(dev, con_id, NULL);
* pm_clk_add_clk - Start using a device clock for power management.
......@@ -136,8 +137,41 @@ int pm_clk_add_clk(struct device *dev, struct clk *clk)
return __pm_clk_add(dev, NULL, clk);
* of_pm_clk_add_clk - Start using a device clock for power management.
* @dev: Device whose clock is going to be used for power management.
* @name: Name of clock that is going to be used for power management.
* Add the clock described in the 'clocks' device-tree node that matches
* with the 'name' provided, to the list of clocks used for the power
* management of @dev. On success, returns 0. Returns a negative error
* code if the clock is not found or cannot be added.
int of_pm_clk_add_clk(struct device *dev, const char *name)
struct clk *clk;
int ret;
if (!dev || !dev->of_node || !name)
return -EINVAL;
clk = of_clk_get_by_name(dev->of_node, name);
if (IS_ERR(clk))
return PTR_ERR(clk);
ret = pm_clk_add_clk(dev, clk);
if (ret) {
return ret;
return 0;
* of_pm_clk_add_clks - Start using device clock(s) for power management.
* @dev: Device whose clock(s) is going to be used for power management.
......@@ -192,6 +226,7 @@ int of_pm_clk_add_clks(struct device *dev)
return ret;
* __pm_clk_remove - Destroy PM clock entry.
......@@ -252,6 +287,7 @@ void pm_clk_remove(struct device *dev, const char *con_id)
* pm_clk_remove_clk - Stop using a device clock for power management.
......@@ -285,6 +321,7 @@ void pm_clk_remove_clk(struct device *dev, struct clk *clk)
* pm_clk_init - Initialize a device's list of power management clocks.
......@@ -299,6 +336,7 @@ void pm_clk_init(struct device *dev)
if (psd)
* pm_clk_create - Create and initialize a device's list of PM clocks.
......@@ -311,6 +349,7 @@ int pm_clk_create(struct device *dev)
return dev_pm_get_subsys_data(dev);
* pm_clk_destroy - Destroy a device's list of power management clocks.
......@@ -345,6 +384,7 @@ void pm_clk_destroy(struct device *dev)
* pm_clk_suspend - Disable clocks in a device's PM clock list.
......@@ -375,6 +415,7 @@ int pm_clk_suspend(struct device *dev)
return 0;
* pm_clk_resume - Enable clocks in a device's PM clock list.
......@@ -400,6 +441,7 @@ int pm_clk_resume(struct device *dev)
return 0;
* pm_clk_notify - Notify routine for device addition and removal.
......@@ -480,6 +522,7 @@ int pm_clk_runtime_suspend(struct device *dev)
return 0;
int pm_clk_runtime_resume(struct device *dev)
......@@ -495,6 +538,7 @@ int pm_clk_runtime_resume(struct device *dev)
return pm_generic_runtime_resume(dev);
#else /* !CONFIG_PM_CLK */
......@@ -598,3 +642,4 @@ void pm_clk_add_notifier(struct bus_type *bus,
clknb->nb.notifier_call = pm_clk_notify;
bus_register_notifier(bus, &clknb->nb);
......@@ -1045,10 +1045,14 @@ int __pm_runtime_set_status(struct device *dev, unsigned int status)
if (!parent->power.disable_depth
&& !parent->power.ignore_children
&& parent->power.runtime_status != RPM_ACTIVE)
&& parent->power.runtime_status != RPM_ACTIVE) {
dev_err(dev, "runtime PM trying to activate child device %s but parent (%s) is not active\n",
error = -EBUSY;
else if (dev->power.runtime_status == RPM_SUSPENDED)
} else if (dev->power.runtime_status == RPM_SUSPENDED) {
......@@ -1256,7 +1260,7 @@ void pm_runtime_allow(struct device *dev)
dev->power.runtime_auto = true;
if (atomic_dec_and_test(&dev->power.usage_count))
rpm_idle(dev, RPM_AUTO);
rpm_idle(dev, RPM_AUTO | RPM_ASYNC);
......@@ -1506,6 +1510,9 @@ int pm_runtime_force_resume(struct device *dev)
goto out;
if (!pm_runtime_status_suspended(dev))
goto out;
ret = pm_runtime_set_active(dev);
if (ret)
goto out;
......@@ -31,23 +31,18 @@ config CPU_FREQ_BOOST_SW
depends on THERMAL
tristate "CPU frequency translation statistics"
bool "CPU frequency transition statistics"
default y
This driver exports CPU frequency statistics information through sysfs
file system.
To compile this driver as a module, choose M here: the
module will be called cpufreq_stats.
Export CPU frequency statistics information through sysfs.
If in doubt, say N.
bool "CPU frequency translation statistics details"
bool "CPU frequency transition statistics details"
depends on CPU_FREQ_STAT
This will show detail CPU frequency translation table in sysfs file
Show detailed CPU frequency transition table in sysfs.
If in doubt, say N.
......@@ -468,20 +468,17 @@ unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
struct acpi_cpufreq_data *data = policy->driver_data;
struct acpi_processor_performance *perf;
struct cpufreq_frequency_table *entry;
unsigned int next_perf_state, next_freq, freq;
unsigned int next_perf_state, next_freq, index;
* Find the closest frequency above target_freq.
* The table is sorted in the reverse order with respect to the
* frequency and all of the entries are valid (see the initialization).
entry = policy->freq_table;
do {
freq = entry->frequency;
} while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
if (policy->cached_target_freq == target_freq)
index = policy->cached_resolved_idx;
index = cpufreq_table_find_index_dl(policy, target_freq);
entry = &policy->freq_table[index];
next_freq = entry->frequency;
next_perf_state = entry->driver_data;
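The hunk above reuses a previously resolved table index when the target frequency has not changed, and otherwise searches a table sorted in descending order. A minimal userspace sketch of that idea follows. The `policy` layout, `find_index_desc_l()` and `resolve_index()` are hypothetical, and the sketch fills the cache itself to stay self-contained, which may differ from where the kernel populates it.

```c
#include <assert.h>

/* Hypothetical policy object; field names only loosely follow the diff. */
struct policy {
	const unsigned int *freqs;  /* sorted in descending order */
	int nfreqs;
	unsigned int cached_target_freq;
	int cached_idx;
	int cache_valid;
	int lookups;                /* counts real table searches */
};

/*
 * Descending table, "L" relation: index of the lowest frequency that is
 * still at or above the target. Falls back to index 0 (the highest
 * frequency) when the target exceeds every entry.
 */
int find_index_desc_l(const struct policy *p, unsigned int target)
{
	int i, best = 0;

	for (i = 0; i < p->nfreqs; i++) {
		if (p->freqs[i] < target)
			break;
		best = i;
	}
	return best;
}

/* Return the cached index on a repeat request, otherwise search and cache. */
int resolve_index(struct policy *p, unsigned int target)
{
	assert(p != 0 && p->nfreqs > 0);

	if (p->cache_valid && p->cached_target_freq == target)
		return p->cached_idx;

	p->lookups++;
	p->cached_idx = find_index_desc_l(p, target);
	p->cached_target_freq = target;
	p->cache_valid = 1;
	return p->cached_idx;
}
```

The cache only pays off when the same target is requested repeatedly; a changed target always falls through to the table search.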
......@@ -48,9 +48,8 @@ static unsigned int amd_powersave_bias_target(struct cpufreq_policy *policy,
struct policy_dbs_info *policy_dbs = policy->governor_data;
struct dbs_data *od_data = policy_dbs->dbs_data;
struct od_dbs_tuners *od_tuners = od_data->tuners;
struct od_policy_dbs_info *od_info = to_dbs_info(policy_dbs);
if (!od_info->freq_table)
if (!policy->freq_table)
return freq_next;
rdmsr_on_cpu(policy->cpu, MSR_AMD64_FREQ_SENSITIVITY_ACTUAL,
......@@ -92,10 +91,9 @@ static unsigned int amd_powersave_bias_target(struct cpufreq_policy *policy,
else {
unsigned int index;
od_info->freq_table, policy->cur - 1,
freq_next = od_info->freq_table[index].frequency;
index = cpufreq_table_find_index_h(policy,
policy->cur - 1);
freq_next = policy->freq_table[index].frequency;
data->freq_prev = freq_next;
......@@ -17,7 +17,6 @@
struct cs_policy_dbs_info {
struct policy_dbs_info policy_dbs;
unsigned int down_skip;
unsigned int requested_freq;
static inline struct cs_policy_dbs_info *to_dbs_info(struct policy_dbs_info *policy_dbs)
......@@ -75,19 +74,17 @@ static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
/* Check for frequency increase */
if (load > dbs_data->up_threshold) {
unsigned int requested_freq = policy->cur;
dbs_info->down_skip = 0;
/* if we are already at full speed then break out early */
if (dbs_info->requested_freq == policy->max)
if (requested_freq == policy->max)
goto out;
dbs_info->requested_freq += get_freq_target(cs_tuners, policy);
if (dbs_info->requested_freq > policy->max)
dbs_info->requested_freq = policy->max;
requested_freq += get_freq_target(cs_tuners, policy);
__cpufreq_driver_target(policy, dbs_info->requested_freq,
__cpufreq_driver_target(policy, requested_freq, CPUFREQ_RELATION_H);
goto out;
......@@ -98,36 +95,27 @@ static unsigned int cs_dbs_timer(struct cpufreq_policy *policy)
/* Check for frequency decrease */
if (load < cs_tuners->down_threshold) {
unsigned int freq_target;
unsigned int freq_target, requested_freq = policy->cur;
* if we cannot reduce the frequency anymore, break out early
if (policy->cur == policy->min)
if (requested_freq == policy->min)
goto out;
freq_target = get_freq_target(cs_tuners, policy);
if (dbs_info->requested_freq > freq_target)
dbs_info->requested_freq -= freq_target;
if (requested_freq > freq_target)
requested_freq -= freq_target;
dbs_info->requested_freq = policy->min;
requested_freq = policy->min;
__cpufreq_driver_target(policy, dbs_info->requested_freq,
__cpufreq_driver_target(policy, requested_freq, CPUFREQ_RELATION_L);
return dbs_data->sampling_rate;
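After this change the conservative governor derives each request from the current frequency (`policy->cur`) rather than from a separately tracked `requested_freq` field. The arithmetic can be sketched in userspace as below. `step_up()` and `step_down()` are hypothetical helpers, and the explicit clamping to the limits is an addition of this sketch; the diff itself drops the upper clamp, presumably because the request is bounded elsewhere.

```c
#include <assert.h>

/*
 * One step up from the current frequency, never exceeding max.
 * Assumes cur <= max.
 */
unsigned int step_up(unsigned int cur, unsigned int max, unsigned int step)
{
	assert(cur <= max);

	if (cur == max)
		return cur;            /* already at full speed */
	/* Compare against the remaining headroom to avoid overflow. */
	return (max - cur < step) ? max : cur + step;
}

/*
 * One step down from the current frequency, never going below min.
 * Assumes cur >= min.
 */
unsigned int step_down(unsigned int cur, unsigned int min, unsigned int step)
{
	assert(cur >= min);

	if (cur == min)
		return cur;            /* cannot reduce any further */
	/* Guard the subtraction so it cannot wrap below zero. */
	return (cur > step && cur - step > min) ? cur - step : min;
}
```

Starting from the current frequency on every sample means there is no stale state to resynchronize when something else changes the frequency.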
static int dbs_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
void *data);
static struct notifier_block cs_cpufreq_notifier_block = {
.notifier_call = dbs_cpufreq_notifier,
/************************** sysfs interface ************************/
static struct dbs_governor cs_dbs_gov;
static ssize_t store_sampling_down_factor(struct gov_attr_set *attr_set,
const char *buf, size_t count)
......@@ -268,15 +256,13 @@ static void cs_free(struct policy_dbs_info *policy_dbs)
static int cs_init(struct dbs_data *dbs_data, bool notify)
static int cs_init(struct dbs_data *dbs_data)
struct cs_dbs_tuners *tuners;
tuners = kzalloc(sizeof(*tuners), GFP_KERNEL);
if (!tuners) {
pr_err("%s: kzalloc failed\n", __func__);
if (!tuners)
return -ENOMEM;
tuners->down_threshold = DEF_FREQUENCY_DOWN_THRESHOLD;
tuners->freq_step = DEF_FREQUENCY_STEP;
......@@ -288,19 +274,11 @@ static int cs_init(struct dbs_data *dbs_data, bool notify)
dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
if (notify)
return 0;
static void cs_exit(struct dbs_data *dbs_data, bool notify)
static void cs_exit(struct dbs_data *dbs_data)
if (notify)
......@@ -309,16 +287,10 @@ static void cs_start(struct cpufreq_policy *policy)
struct cs_policy_dbs_info *dbs_info = to_dbs_info(policy->governor_data);
dbs_info->down_skip = 0;
dbs_info->requested_freq = policy->cur;
static struct dbs_governor cs_dbs_gov = {
.gov = {
.name = "conservative",
.governor = cpufreq_governor_dbs,
.max_transition_latency = TRANSITION_LATENCY_LIMIT,
.owner = THIS_MODULE,
static struct dbs_governor cs_governor = {
.kobj_type = { .default_attrs = cs_attributes },
.gov_dbs_timer = cs_dbs_timer,
.alloc = cs_alloc,
......@@ -328,33 +300,7 @@ static struct dbs_governor cs_dbs_gov = {
.start = cs_start,
static int dbs_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
void *data)