Hi all,
This patch set is a continuation of my previous work, which aimed to add Energy Model to all devices. This series is a follow up for the patches which got merged to v5.9-rc1. It aims to change the thermal devfreq cooling and use the Energy Model instead of private power table and structures. The new registration interface in the patch 3/5 helps to register devfreq cooling and the EM in one call. There is also another improvement, patch 2/5 is changing the way how thermal gets the device status. Now it's taken on demand and stored as a copy. The last patch wouldn't go through thermal tree, but it's here for consistency.
The patch set is based on current next-20201118, which has new EM API in the pm/linux-next tree.
changes: v2: - renamed freq_get_state() and related to perf_idx pattern as suggested by Ionela v1 [2]
Regards, Lukasz Luba
[1] https://lkml.org/lkml/2020/5/11/326 [2] https://lore.kernel.org/linux-pm/20200921122007.29610-1-lukasz.luba@arm.com/
Lukasz Luba (5): thermal: devfreq_cooling: change tracing function and arguments thermal: devfreq_cooling: get a copy of device status thermal: devfreq_cooling: add new registration functions with Energy Model thermal: devfreq_cooling: remove old power model and use EM drm/panfrost: Register devfreq cooling and attempt to add Energy Model
drivers/gpu/drm/panfrost/panfrost_devfreq.c | 2 +- drivers/thermal/devfreq_cooling.c | 434 ++++++++++---------- include/linux/devfreq_cooling.h | 39 +- include/trace/events/thermal.h | 19 +- 4 files changed, 259 insertions(+), 235 deletions(-)
Prepare for deleting the static and dynamic power calculation and clean the trace function. These two fields are going to be removed in the next changes.
Reviewed-by: Steven Rostedt (VMware) rostedt@goodmis.org # for tracing code Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 3 +-- include/trace/events/thermal.h | 19 +++++++++---------- 2 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index dfab49a67252..659c0143c9f0 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -277,8 +277,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd *power = dyn_power + static_power; }
- trace_thermal_power_devfreq_get_power(cdev, status, freq, dyn_power, - static_power, *power); + trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);
return 0; fail: diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h index 135e5421f003..8a5f04888abd 100644 --- a/include/trace/events/thermal.h +++ b/include/trace/events/thermal.h @@ -153,31 +153,30 @@ TRACE_EVENT(thermal_power_cpu_limit, TRACE_EVENT(thermal_power_devfreq_get_power, TP_PROTO(struct thermal_cooling_device *cdev, struct devfreq_dev_status *status, unsigned long freq, - u32 dynamic_power, u32 static_power, u32 power), + u32 power),
- TP_ARGS(cdev, status, freq, dynamic_power, static_power, power), + TP_ARGS(cdev, status, freq, power),
TP_STRUCT__entry( __string(type, cdev->type ) __field(unsigned long, freq ) - __field(u32, load ) - __field(u32, dynamic_power ) - __field(u32, static_power ) + __field(u32, busy_time) + __field(u32, total_time) __field(u32, power) ),
TP_fast_assign( __assign_str(type, cdev->type); __entry->freq = freq; - __entry->load = (100 * status->busy_time) / status->total_time; - __entry->dynamic_power = dynamic_power; - __entry->static_power = static_power; + __entry->busy_time = status->busy_time; + __entry->total_time = status->total_time; __entry->power = power; ),
- TP_printk("type=%s freq=%lu load=%u dynamic_power=%u static_power=%u power=%u", + TP_printk("type=%s freq=%lu load=%u power=%u", __get_str(type), __entry->freq, - __entry->load, __entry->dynamic_power, __entry->static_power, + __entry->total_time == 0 ? 0 : + (100 * __entry->busy_time) / __entry->total_time, __entry->power) );
On Wednesday 18 Nov 2020 at 12:03:54 (+0000), Lukasz Luba wrote:
Reviewed-by: Ionela Voinescu ionela.voinescu@arm.com
Regards, Ionela.
Devfreq cooling needs to now the correct status of the device in order to operate. Do not rely on Devfreq last_status which might be a stale data and get more up-to-date values of the load.
Devfreq framework can change the device status in the background. To mitigate this situation make a copy of the status structure and use it for internal calculations.
In addition this patch adds normalization function, which also makes sure that whatever data comes from the device, it is in a sane range.
Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------ 1 file changed, 43 insertions(+), 9 deletions(-)
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 659c0143c9f0..925523694462 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -227,20 +227,46 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, voltage); }
+static void _normalize_load(struct devfreq_dev_status *status) +{ + /* Make some space if needed */ + if (status->busy_time > 0xffff) { + status->busy_time >>= 10; + status->total_time >>= 10; + } + + if (status->busy_time > status->total_time) + status->busy_time = status->total_time; + + status->busy_time *= 100; + status->busy_time /= status->total_time ? : 1; + + /* Avoid division by 0 */ + status->busy_time = status->busy_time ? : 1; + status->total_time = 100; +}
static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev, u32 *power) { struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; - struct devfreq_dev_status *status = &df->last_status; + struct devfreq_dev_status status; unsigned long state; - unsigned long freq = status->current_frequency; + unsigned long freq; unsigned long voltage; u32 dyn_power = 0; u32 static_power = 0; int res;
+ mutex_lock(&df->lock); + res = df->profile->get_dev_status(df->dev.parent, &status); + mutex_unlock(&df->lock); + if (res) + return res; + + freq = status.current_frequency; + state = freq_get_state(dfc, freq); if (state == THERMAL_CSTATE_INVALID) { res = -EAGAIN; @@ -268,16 +294,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd } else { dyn_power = dfc->power_table[state];
+ _normalize_load(&status); + /* Scale dynamic power for utilization */ - dyn_power *= status->busy_time; - dyn_power /= status->total_time; + dyn_power *= status.busy_time; + dyn_power /= status.total_time; /* Get static power */ static_power = get_static_power(dfc, freq);
*power = dyn_power + static_power; }
- trace_thermal_power_devfreq_get_power(cdev, status, freq, *power); + trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);
return 0; fail: @@ -309,14 +337,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, { struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; - struct devfreq_dev_status *status = &df->last_status; - unsigned long freq = status->current_frequency; + struct devfreq_dev_status status; unsigned long busy_time; + unsigned long freq; s32 dyn_power; u32 static_power; s32 est_power; int i;
+ mutex_lock(&df->lock); + status = df->last_status; + mutex_unlock(&df->lock); + + freq = status.current_frequency; + if (dfc->power_ops->get_real_power) { /* Scale for resource utilization */ est_power = power * dfc->res_util; @@ -328,8 +362,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, dyn_power = dyn_power > 0 ? dyn_power : 0;
/* Scale dynamic power for utilization */ - busy_time = status->busy_time ?: 1; - est_power = (dyn_power * status->total_time) / busy_time; + busy_time = status.busy_time ?: 1; + est_power = (dyn_power * status.total_time) / busy_time; }
/*
On Wednesday 18 Nov 2020 at 12:03:55 (+0000), Lukasz Luba wrote:
Reviewed-by: Ionela Voinescu ionela.voinescu@arm.com
Thanks, Ionela.
On 18/11/2020 13:03, Lukasz Luba wrote:
How the condition above is possible?
Why not base the normalization on 1024? and use an intermediate u64.
For example:
static u32 _normalize_load(struct devfreq_dev_status *status) { u64 load = 0;
/* Prevent divison by zero */ if (!status->busy_time) return 0;
/* * Assuming status->total_time is always greater or equal * to status->busy_time, it can not be equal to zero because * of the test above */ load = status->busy_time * 1024; load /= status->total_time;
/* * load is always [1..1024[, so it can not be truncated by a * u64 -> u32 coercive cast */ return (u32)load; }
load = _normalize_load(&status);
/* * May be change dyn_power to a u64 to prevent overflow * when multiplied by 'load' */ dyn_power = (dyn_power * load) / 1024;
On 12/3/20 1:09 PM, Daniel Lezcano wrote:
They should, be checked by the driver, but I cannot trust and have to check for all corner cases: (div by 0, overflow one of them, etc). The busy_time and total_time are unsigned long, which means 4B on 32bit machines. If these values are coming from device counters, which count every busy cycle and total cycles of a clock of a device running at e.g. 1GHz they would overflow every ~4s.
Normally IPA polling are 1s and 100ms, it's platform specific. But there are also 'empty' periods when IPA sees temperature very low and does not even call the .get_requested_power() callbacks for the cooling devices, just grants max freq to all. This is problematic. I am investigating it and will propose a solution for IPA soon.
I would avoid all of this if devfreq core would have default for all devices a reliable polling timer... Let me check some possibilities also for this case.
You are the 2nd reviewer who is asking this. I tried to keep 'load' as in range [0, 100] since we also have 'load' in cpufreq cooling in this range. Maybe I should switch to 1024 (Ionela was also asking for this).
I wanted to avoid any divisions which involve 64bit var on 32bit machine.
dyn_power value from EM should fit in 16bit [1], so we should be safe.
I will experiment with the 1024 code and check some corner cases.
Thank you Daniel for the review!
Regards, Lukasz
[1] https://elixir.bootlin.com/linux/v5.10-rc5/source/kernel/power/energy_model....
On 03/12/2020 16:38, Lukasz Luba wrote:
I don't think it is up to this routine to check the driver is correctly implemented, especially at every call to get_requested_power.
If the normalization ends up by doing this kind of thing, there is certainly something wrong in the 'status' computation to be fixed before submitting this series.
Ok, may be create an API to compute the 'idle,busy,total times' to be used by the different the devfreq drivers and then fix the overflow in this common place.
Well it is common practice to compute normalization with 1024 because the division is a bit shift and the compiler optimize the code very well with that value.
On 12/3/20 4:09 PM, Daniel Lezcano wrote:
Yes, I have this plan, but I have to close this patch series. To go forward with this, I will drop the normalization function and will keep only the code of safe copy of the 'status', so using busy_time and total_time will be safe.
I will address this computation and normalization in different patch series. There might be a need of a new API as you pointed out, which is out-of-scope of this patch set.
I will keep this 1024 in mind for the next topic series.
Regards, Lukasz
Hi Daniel,
On 12/7/20 12:41 PM, Lukasz Luba wrote:
I did experiments and actually I cannot drop this function. Drivers can feed total_time and busy_time which are in nanoseconds, e.g. [1] 50ms => 50.000.000ns which is then when multiplied by 1024 and exceed the u32. I want to avoid 64bit variables and divisions, so shifting them earlier would help. IMHO it does not harm this devfreq cooling to make that check and handle ns values.
I am going to use the normalization into 0..1024 as you and Ionela suggested. I will also drop the direct device status check. That would be a different patch series. In that patch set I will try to come with a generic solution and some API.
Regards, Lukasz
[1] https://elixir.bootlin.com/linux/v5.10-rc5/source/drivers/gpu/drm/panfrost/p...
The Energy Model (EM) framework supports devices such as Devfreq. Create new registration functions which automatically register EM for the thermal devfreq_cooling devices. This patch prepares the code for coming changes which are going to replace old power model with the new EM.
Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++- include/linux/devfreq_cooling.h | 22 +++++++ 2 files changed, 120 insertions(+), 1 deletion(-)
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 925523694462..b354271742c5 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida); * @capped_state: index to cooling state with in dynamic power budget * @req_max_freq: PM QoS request for limiting the maximum frequency * of the devfreq device. + * @em: Energy Model for the associated Devfreq device + * @em_registered: Devfreq cooling registered the EM and should free it. */ struct devfreq_cooling_device { int id; @@ -63,6 +65,8 @@ struct devfreq_cooling_device { u32 res_util; int capped_state; struct dev_pm_qos_request req_max_freq; + struct em_perf_domain *em; + bool em_registered; };
static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev, @@ -583,22 +587,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df) } EXPORT_SYMBOL_GPL(devfreq_cooling_register);
+/** + * devfreq_cooling_em_register_power() - Register devfreq cooling device with + * power information and attempt to register Energy Model (EM) + * @df: Pointer to devfreq device. + * @dfc_power: Pointer to devfreq_cooling_power. + * @em_cb: Callback functions providing the data of the EM + * + * Register a devfreq cooling device and attempt to register Energy Model. The + * available OPPs must be registered for the device. + * + * If @dfc_power is provided, the cooling device is registered with the + * power extensions. If @em_cb is provided it will be called for each OPP to + * calculate power value and cost. If @em_cb is not provided then simple Energy + * Model is going to be used, which requires "dynamic-power-coefficient" a + * devicetree property. + */ +struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df, + struct devfreq_cooling_power *dfc_power, + struct em_data_callback *em_cb) +{ + struct thermal_cooling_device *cdev; + struct devfreq_cooling_device *dfc; + struct device_node *np = NULL; + struct device *dev; + int nr_opp, ret; + + if (IS_ERR_OR_NULL(df)) + return ERR_PTR(-EINVAL); + + dev = df->dev.parent; + + if (em_cb) { + nr_opp = dev_pm_opp_get_opp_count(dev); + if (nr_opp <= 0) { + dev_err(dev, "No valid OPPs found\n"); + return ERR_PTR(-EINVAL); + } + + ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL, false); + } else { + ret = dev_pm_opp_of_register_em(dev, NULL); + } + + if (ret) + dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n", + ret); + + if (dev->of_node) + np = of_node_get(dev->of_node); + + cdev = of_devfreq_cooling_register_power(np, df, dfc_power); + + if (np) + of_node_put(np); + + if (IS_ERR_OR_NULL(cdev)) { + if (!ret) + em_dev_unregister_perf_domain(dev); + } else { + dfc = cdev->devdata; + dfc->em_registered = !ret; + } + + return cdev; +} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power); + +/** + * devfreq_cooling_em_register() - Register devfreq cooling device together + * with Energy Model. + * @df: Pointer to devfreq device. + * @em_cb: Callback functions providing the data of the Energy Model + * + * This function attempts to register Energy Model for devfreq device and then + * register the devfreq cooling device. + */ +struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{ + return devfreq_cooling_em_register_power(df, NULL, em_cb); +} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register); + /** * devfreq_cooling_unregister() - Unregister devfreq cooling device. * @cdev: Pointer to devfreq cooling device to unregister. + * + * Unregisters devfreq cooling device and related Energy Model if it was + * present. */ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev) { struct devfreq_cooling_device *dfc; + struct device *dev;
- if (!cdev) + if (IS_ERR_OR_NULL(cdev)) return;
dfc = cdev->devdata; + dev = dfc->devfreq->dev.parent;
thermal_cooling_device_unregister(dfc->cdev); ida_simple_remove(&devfreq_ida, dfc->id); dev_pm_qos_remove_request(&dfc->req_max_freq); + + if (dfc->em_registered) + em_dev_unregister_perf_domain(dev); + kfree(dfc->power_table); kfree(dfc->freq_table);
diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h index 9df2dfca68dd..19868fb922f1 100644 --- a/include/linux/devfreq_cooling.h +++ b/include/linux/devfreq_cooling.h @@ -11,6 +11,7 @@ #define __DEVFREQ_COOLING_H__
#include <linux/devfreq.h> +#include <linux/energy_model.h> #include <linux/thermal.h>
@@ -65,6 +66,13 @@ struct thermal_cooling_device * of_devfreq_cooling_register(struct device_node *np, struct devfreq *df); struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df); void devfreq_cooling_unregister(struct thermal_cooling_device *dfc); +struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df, + struct devfreq_cooling_power *dfc_power, + struct em_data_callback *em_cb); +struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, + struct em_data_callback *em_cb);
#else /* !CONFIG_DEVFREQ_THERMAL */
@@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df) return ERR_PTR(-EINVAL); }
+static inline struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df, + struct devfreq_cooling_power *dfc_power, + struct em_data_callback *em_cb) +{ + return ERR_PTR(-EINVAL); +} + +static inline struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{ + return ERR_PTR(-EINVAL); +} + static inline void devfreq_cooling_unregister(struct thermal_cooling_device *dfc) {
Hi Lukasz,
On Wednesday 18 Nov 2020 at 12:03:56 (+0000), Lukasz Luba wrote:
Should np be checked before use? I'm not sure if it's better to do the assign first and then the check on np before use. It depends on the consequences of passing a NULL node pointer later on.
Otherwise it looks good to me:
Reviewed-by: Ionela Voinescu ionela.voinescu@arm.com
Ionela.
Hi Ionela,
On 12/2/20 10:24 AM, Ionela Voinescu wrote:
Hi Lukasz,
On Wednesday 18 Nov 2020 at 12:03:56 (+0000), Lukasz Luba wrote:
[snip]
[snip]
The np is actually dev->of_node (or left NULL, as set at the begging). The only meaning of the line above is to increment the counter and then decrement if CONFIG_OF_DYNAMIC was used. The devfreq_cooling_register() has np = NULL and the registration can handle it, so we should be OK here as well.
[snip]
Otherwise it looks good to me:
Reviewed-by: Ionela Voinescu ionela.voinescu@arm.com
Thank you for the review.
Regards, Lukasz
Ionela.
On Wednesday 02 Dec 2020 at 11:14:02 (+0000), Lukasz Luba wrote:
Yes, I just wanted to make sure later registration can handle np = NULL, or whether we need to bail out.
In this case, you can drop both ifs - for (dev->of_node) before get and for np before put below, as of_node_get/of_node_put can handle NULL pointers themselves.
Thanks, Ionela.
On 18/11/2020 13:03, Lukasz Luba wrote:
This pointer is not needed, it is in the struct device.
The boolean em_registered is not needed because of the test in the function em_dev_unregister_perf_domain():
if (IS_ERR_OR_NULL(dev) || !dev->em_pd) return;
Logically if the 'em' was not initialized, it must be NULL, the corresponding struct device was zero-allocated.
Why the parent ?
Why this additional IS_ERR check ?
As stated before it can be called unconditionally
On 12/3/20 3:40 PM, Daniel Lezcano wrote:
It is just a helper pointer, to make the code simpler and avoid nested pointers:
struct device *dev = dfc->devfreq->dev.parent and then using dev->em
The code is cleaner with dfc->em, but let me have a look if I can remove it and still have a clean code.
It was needed for devfreq cooling to know who registered the EM. If there is 2 frameworks and driver and all could register EM, this code cannot blindly unregister EM in it's code. The EM might be used still by PowerCap DTM, so the unregister might be called explicitly by the driver.
But I will rewrite the register function and make it way simpler, just registration of EM (stopping when it failed) and then cooling device. Also unregister will be simpler.
Driver will have to keep the order of unregister functions for two frameworks and call unregister devfreq cooling device as last one, because it will remove the EM.
The parent has OPPs and we are calling em_perf_domain_register() or dev_pm_opp_of_register_em() (which in addition needs DT node) for that device.
The old devfreq cooling code also had dev parent, to enable/disenable OPPs.
Not needed too much, but helps if driver doesn't check the result of registration function and then just calls unregister function, i.e.
if (pfdev->devfreq.cooling) devfreq_cooling_unregister(pfdev->devfreq.cooling);
OK, I will rewrite it. The goal was to be able handle many situations in register/unregister function, but I will make them simpler.
Thank you Daniel for review comments. I will address them in next version.
Regards, Lukasz
Remove old power model and use new Energy Model to calculate the power budget. It drops static + dynamic power calculations and power table in order to use Energy Model performance domain data. This model should be easy to use and could find more users. It is also less complicated to setup the needed structures.
Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 302 ++++++++++-------------------- include/linux/devfreq_cooling.h | 17 -- 2 files changed, 96 insertions(+), 223 deletions(-)
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index b354271742c5..28754ad46b96 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -33,20 +33,17 @@ static DEFINE_IDA(devfreq_ida); * @cdev: Pointer to associated thermal cooling device. * @devfreq: Pointer to associated devfreq device. * @cooling_state: Current cooling state. - * @power_table: Pointer to table with maximum power draw for each - * cooling state. State is the index into the table, and - * the power is in mW. * @freq_table: Pointer to a table with the frequencies sorted in descending * order. You can index the table by cooling device state - * @freq_table_size: Size of the @freq_table and @power_table - * @power_ops: Pointer to devfreq_cooling_power, used to generate the - * @power_table. + * @max_state: It is the last index, that is, one less than the number of the + * OPPs + * @power_ops: Pointer to devfreq_cooling_power, a more precised model. * @res_util: Resource utilization scaling factor for the power. * It is multiplied by 100 to minimize the error. It is used * for estimation of the power budget instead of using - * 'utilization' (which is 'busy_time / 'total_time'). - * The 'res_util' range is from 100 to (power_table[state] * 100) - * for the corresponding 'state'. + * 'utilization' (which is 'busy_time' / 'total_time'). + * The 'res_util' range is from 100 to power * 100 for the + * corresponding 'state'. * @capped_state: index to cooling state with in dynamic power budget * @req_max_freq: PM QoS request for limiting the maximum frequency * of the devfreq device. @@ -58,9 +55,8 @@ struct devfreq_cooling_device { struct thermal_cooling_device *cdev; struct devfreq *devfreq; unsigned long cooling_state; - u32 *power_table; u32 *freq_table; - size_t freq_table_size; + size_t max_state; struct devfreq_cooling_power *power_ops; u32 res_util; int capped_state; @@ -74,7 +70,7 @@ static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev, { struct devfreq_cooling_device *dfc = cdev->devdata;
- *state = dfc->freq_table_size - 1; + *state = dfc->max_state;
return 0; } @@ -96,16 +92,22 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev, struct devfreq *df = dfc->devfreq; struct device *dev = df->dev.parent; unsigned long freq; + int perf_idx;
if (state == dfc->cooling_state) return 0;
dev_dbg(dev, "Setting cooling state %lu\n", state);
- if (state >= dfc->freq_table_size) + if (state > dfc->max_state) return -EINVAL;
- freq = dfc->freq_table[state]; + if (dfc->em) { + perf_idx = dfc->max_state - state; + freq = dfc->em->table[perf_idx].frequency * 1000; + } else { + freq = dfc->freq_table[state]; + }
dev_pm_qos_update_request(&dfc->req_max_freq, DIV_ROUND_UP(freq, HZ_PER_KHZ)); @@ -116,24 +118,24 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev, }
/** - * freq_get_state() - get the cooling state corresponding to a frequency + * get_perf_idx() - get the performance index corresponding to a frequency * @dfc: Pointer to devfreq cooling device - * @freq: frequency in Hz + * @freq: frequency in kHz * - * Return: the cooling state associated with the @freq, or - * THERMAL_CSTATE_INVALID if it wasn't found. + * Return: the performance index associated with the @freq, or + * -EINVAL if it wasn't found. */ -static unsigned long -freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq) +static int get_perf_idx(struct devfreq_cooling_device *dfc, unsigned long freq) { + struct em_perf_domain *em = dfc->em; int i;
- for (i = 0; i < dfc->freq_table_size; i++) { - if (dfc->freq_table[i] == freq) + for (i = 0; i < em->nr_perf_states; i++) { + if (em->table[i].frequency == freq) return i; }
- return THERMAL_CSTATE_INVALID; + return -EINVAL; }
static unsigned long get_voltage(struct devfreq *df, unsigned long freq) @@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq) return voltage; }
-/** - * get_static_power() - calculate the static power - * @dfc: Pointer to devfreq cooling device - * @freq: Frequency in Hz - * - * Calculate the static power in milliwatts using the supplied - * get_static_power(). The current voltage is calculated using the - * OPP library. If no get_static_power() was supplied, assume the - * static power is negligible. - */ -static unsigned long -get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq) -{ - struct devfreq *df = dfc->devfreq; - unsigned long voltage; - - if (!dfc->power_ops->get_static_power) - return 0; - - voltage = get_voltage(df, freq); - - if (voltage == 0) - return 0; - - return dfc->power_ops->get_static_power(df, voltage); -} - -/** - * get_dynamic_power - calculate the dynamic power - * @dfc: Pointer to devfreq cooling device - * @freq: Frequency in Hz - * @voltage: Voltage in millivolts - * - * Calculate the dynamic power in milliwatts consumed by the device at - * frequency @freq and voltage @voltage. If the get_dynamic_power() - * was supplied as part of the devfreq_cooling_power struct, then that - * function is used. Otherwise, a simple power model (Pdyn = Coeff * - * Voltage^2 * Frequency) is used. - */ -static unsigned long -get_dynamic_power(struct devfreq_cooling_device *dfc, unsigned long freq, - unsigned long voltage) +static void dfc_em_get_requested_power(struct em_perf_domain *em, + struct devfreq_dev_status *status, + u32 *power, int perf_idx) { - u64 power; - u32 freq_mhz; - struct devfreq_cooling_power *dfc_power = dfc->power_ops; - - if (dfc_power->get_dynamic_power) - return dfc_power->get_dynamic_power(dfc->devfreq, freq, - voltage); + *power = em->table[perf_idx].power;
- freq_mhz = freq / 1000000; - power = (u64)dfc_power->dyn_power_coeff * freq_mhz * voltage * voltage; - do_div(power, 1000000000); - - return power; -} - - -static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, - unsigned long freq, - unsigned long voltage) -{ - return get_static_power(dfc, freq) + get_dynamic_power(dfc, freq, - voltage); + /* Scale power for utilization */ + *power *= status->busy_time; + *power /= status->total_time; }
static void _normalize_load(struct devfreq_dev_status *status) @@ -259,9 +205,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd unsigned long state; unsigned long freq; unsigned long voltage; - u32 dyn_power = 0; - u32 static_power = 0; - int res; + int res, perf_idx;
mutex_lock(&df->lock); res = df->profile->get_dev_status(df->dev.parent, &status); @@ -271,13 +215,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd
freq = status.current_frequency;
- state = freq_get_state(dfc, freq); - if (state == THERMAL_CSTATE_INVALID) { - res = -EAGAIN; - goto fail; - } - - if (dfc->power_ops->get_real_power) { + if (dfc->power_ops && dfc->power_ops->get_real_power) { voltage = get_voltage(df, freq); if (voltage == 0) { res = -EINVAL; @@ -287,7 +225,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd res = dfc->power_ops->get_real_power(df, power, freq, voltage); if (!res) { state = dfc->capped_state; - dfc->res_util = dfc->power_table[state]; + dfc->res_util = dfc->em->table[state].power; dfc->res_util *= SCALE_ERROR_MITIGATION;
if (*power > 1) @@ -296,17 +234,15 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd goto fail; } } else { - dyn_power = dfc->power_table[state]; + /* Energy Model frequencies are in kHz */ + perf_idx = get_perf_idx(dfc, freq / 1000); + if (perf_idx < 0) { + res = -EAGAIN; + goto fail; + }
_normalize_load(&status); - - /* Scale dynamic power for utilization */ - dyn_power *= status.busy_time; - dyn_power /= status.total_time; - /* Get static power */ - static_power = get_static_power(dfc, freq); - - *power = dyn_power + static_power; + dfc_em_get_requested_power(dfc->em, &status, power, perf_idx); }
trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power); @@ -319,20 +255,17 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd }
static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev, - unsigned long state, - u32 *power) + unsigned long state, u32 *power) { struct devfreq_cooling_device *dfc = cdev->devdata; - unsigned long freq; - u32 static_power; + int perf_idx;
- if (state >= dfc->freq_table_size) + if (state > dfc->max_state) return -EINVAL;
- freq = dfc->freq_table[state]; - static_power = get_static_power(dfc, freq); + perf_idx = dfc->max_state - state; + *power = dfc->em->table[perf_idx].power;
- *power = dfc->power_table[state] + static_power; return 0; }
@@ -342,10 +275,7 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; struct devfreq_dev_status status; - unsigned long busy_time; unsigned long freq; - s32 dyn_power; - u32 static_power; s32 est_power; int i;
@@ -355,31 +285,27 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,
freq = status.current_frequency;
- if (dfc->power_ops->get_real_power) { + if (dfc->power_ops && dfc->power_ops->get_real_power) { /* Scale for resource utilization */ est_power = power * dfc->res_util; est_power /= SCALE_ERROR_MITIGATION; } else { - static_power = get_static_power(dfc, freq); - - dyn_power = power - static_power; - dyn_power = dyn_power > 0 ? dyn_power : 0; - - /* Scale dynamic power for utilization */ - busy_time = status.busy_time ?: 1; - est_power = (dyn_power * status.total_time) / busy_time; + _normalize_load(&status); + est_power = power * status.total_time; + est_power /= status.busy_time; }
/* * Find the first cooling state that is within the power - * budget for dynamic power. + * budget. The EM power table is sorted ascending. */ - for (i = 0; i < dfc->freq_table_size - 1; i++) - if (est_power >= dfc->power_table[i]) + for (i = dfc->max_state; i > 0; i--) + if (est_power >= dfc->em->table[i].power) break;
- *state = i; - dfc->capped_state = i; + *state = dfc->max_state - i; + dfc->capped_state = *state; + trace_thermal_power_devfreq_limit(cdev, freq, *state, power); return 0; } @@ -391,91 +317,43 @@ static struct thermal_cooling_device_ops devfreq_cooling_ops = { };
/** - * devfreq_cooling_gen_tables() - Generate power and freq tables. - * @dfc: Pointer to devfreq cooling device. - * - * Generate power and frequency tables: the power table hold the - * device's maximum power usage at each cooling state (OPP). The - * static and dynamic power using the appropriate voltage and - * frequency for the state, is acquired from the struct - * devfreq_cooling_power, and summed to make the maximum power draw. + * devfreq_cooling_gen_tables() - Generate frequency table. + * @dfc: Pointer to devfreq cooling device. + * @num_opps: Number of OPPs * - * The frequency table holds the frequencies in descending order. - * That way its indexed by cooling device state. - * - * The tables are malloced, and pointers put in dfc. They must be - * freed when unregistering the devfreq cooling device. + * Generate frequency table which holds the frequencies in descending + * order. That way its indexed by cooling device state. This is for + * compatibility with drivers which do not register Energy Model. * * Return: 0 on success, negative error code on failure. */ -static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc) +static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc, + int num_opps) { struct devfreq *df = dfc->devfreq; struct device *dev = df->dev.parent; - int ret, num_opps; unsigned long freq; - u32 *power_table = NULL; - u32 *freq_table; int i;
- num_opps = dev_pm_opp_get_opp_count(dev); - - if (dfc->power_ops) { - power_table = kcalloc(num_opps, sizeof(*power_table), - GFP_KERNEL); - if (!power_table) - return -ENOMEM; - } - - freq_table = kcalloc(num_opps, sizeof(*freq_table), + dfc->freq_table = kcalloc(num_opps, sizeof(*dfc->freq_table), GFP_KERNEL); - if (!freq_table) { - ret = -ENOMEM; - goto free_power_table; - } + if (!dfc->freq_table) + return -ENOMEM;
for (i = 0, freq = ULONG_MAX; i < num_opps; i++, freq--) { - unsigned long power, voltage; struct dev_pm_opp *opp;
opp = dev_pm_opp_find_freq_floor(dev, &freq); if (IS_ERR(opp)) { - ret = PTR_ERR(opp); - goto free_tables; + kfree(dfc->freq_table); + return PTR_ERR(opp); }
- voltage = dev_pm_opp_get_voltage(opp) / 1000; /* mV */ dev_pm_opp_put(opp); - - if (dfc->power_ops) { - if (dfc->power_ops->get_real_power) - power = get_total_power(dfc, freq, voltage); - else - power = get_dynamic_power(dfc, freq, voltage); - - dev_dbg(dev, "Power table: %lu MHz @ %lu mV: %lu = %lu mW\n", - freq / 1000000, voltage, power, power); - - power_table[i] = power; - } - - freq_table[i] = freq; + dfc->freq_table[i] = freq; }
- if (dfc->power_ops) - dfc->power_table = power_table; - - dfc->freq_table = freq_table; - dfc->freq_table_size = num_opps; - return 0; - -free_tables: - kfree(freq_table); -free_power_table: - kfree(power_table); - - return ret; }
/** @@ -500,7 +378,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df, struct thermal_cooling_device *cdev; struct devfreq_cooling_device *dfc; char dev_name[THERMAL_NAME_LENGTH]; - int err; + int err, num_opps;
dfc = kzalloc(sizeof(*dfc), GFP_KERNEL); if (!dfc) @@ -508,28 +386,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
dfc->devfreq = df;
- if (dfc_power) { - dfc->power_ops = dfc_power; - + dfc->em = em_pd_get(df->dev.parent); + if (dfc->em) { devfreq_cooling_ops.get_requested_power = devfreq_cooling_get_requested_power; devfreq_cooling_ops.state2power = devfreq_cooling_state2power; devfreq_cooling_ops.power2state = devfreq_cooling_power2state; + + dfc->power_ops = dfc_power; + + num_opps = em_pd_nr_perf_states(dfc->em); + } else { + /* Backward compatibility for drivers which do not use IPA */ + dev_dbg(df->dev.parent, "missing EM for cooling device\n"); + + num_opps = dev_pm_opp_get_opp_count(df->dev.parent); + + err = devfreq_cooling_gen_tables(dfc, num_opps); + if (err) + goto free_dfc; }
- err = devfreq_cooling_gen_tables(dfc); - if (err) + if (num_opps <= 0) { + err = -EINVAL; goto free_dfc; + } + + /* max_state is an index, not a counter */ + dfc->max_state = num_opps - 1;
err = dev_pm_qos_add_request(df->dev.parent, &dfc->req_max_freq, DEV_PM_QOS_MAX_FREQUENCY, PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE); if (err < 0) - goto free_tables; + goto free_table;
err = ida_simple_get(&devfreq_ida, 0, 0, GFP_KERNEL); if (err < 0) goto remove_qos_req; + dfc->id = err;
snprintf(dev_name, sizeof(dev_name), "thermal-devfreq-%d", dfc->id); @@ -550,12 +445,9 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
release_ida: ida_simple_remove(&devfreq_ida, dfc->id); - remove_qos_req: dev_pm_qos_remove_request(&dfc->req_max_freq); - -free_tables: - kfree(dfc->power_table); +free_table: kfree(dfc->freq_table); free_dfc: kfree(dfc); @@ -696,9 +588,7 @@ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev) if (dfc->em_registered) em_dev_unregister_perf_domain(dev);
- kfree(dfc->power_table); kfree(dfc->freq_table); - kfree(dfc); } EXPORT_SYMBOL_GPL(devfreq_cooling_unregister); diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h index 19868fb922f1..4890b12b54b4 100644 --- a/include/linux/devfreq_cooling.h +++ b/include/linux/devfreq_cooling.h @@ -17,17 +17,6 @@
/** * struct devfreq_cooling_power - Devfreq cooling power ops - * @get_static_power: Take voltage, in mV, and return the static power - * in mW. If NULL, the static power is assumed - * to be 0. - * @get_dynamic_power: Take voltage, in mV, and frequency, in HZ, and - * return the dynamic power draw in mW. If NULL, - * a simple power model is used. - * @dyn_power_coeff: Coefficient for the simple dynamic power model in - * mW/(MHz mV mV). - * If get_dynamic_power() is NULL, then the - * dynamic power is calculated as - * @dyn_power_coeff * frequency * voltage^2 * @get_real_power: When this is set, the framework uses it to ask the * device driver for the actual power. * Some devices have more sophisticated methods @@ -47,14 +36,8 @@ * max total (static + dynamic) power value for each OPP. */ struct devfreq_cooling_power { - unsigned long (*get_static_power)(struct devfreq *devfreq, - unsigned long voltage); - unsigned long (*get_dynamic_power)(struct devfreq *devfreq, - unsigned long freq, - unsigned long voltage); int (*get_real_power)(struct devfreq *df, u32 *power, unsigned long freq, unsigned long voltage); - unsigned long dyn_power_coeff; };
#ifdef CONFIG_DEVFREQ_THERMAL
On Wednesday 18 Nov 2020 at 12:03:57 (+0000), Lukasz Luba wrote:
Nit: Reiterating my question on whether it's worth having this additional function or whether its contents should be collapsed into the single caller.
Reviewed-by: Ionela Voinescu ionela.voinescu@arm.com
Thanks, Ionela.
Register devfreq cooling device and attempt to register Energy Model. This will add the devfreq device to the Energy Model framework. It will create a dedicated and unified data structures used i.e. in thermal framework. The last NULL parameter indicates that the power model is simplified and created based on DT 'dynamic-power-coefficient', voltage and frequency.
Reviewed-by: Steven Price steven.price@arm.com Reviewed-by: Alyssa Rosenzweig alyssa.rosenzweig@collabora.com Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/gpu/drm/panfrost/panfrost_devfreq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c index 78e9d82f7318..f44d28fad085 100644 --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c @@ -138,7 +138,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev) } pfdevfreq->devfreq = devfreq;
- cooling = of_devfreq_cooling_register(dev->of_node, devfreq); + cooling = devfreq_cooling_em_register(devfreq, NULL); if (IS_ERR(cooling)) DRM_DEV_INFO(dev, "Failed to register cooling device\n"); else
On 18/11/2020 13:03, Lukasz Luba wrote:
The patch 5/5 is reviewed by the maintainers. If they agree, I can apply the patch with this series.
dri-devel@lists.freedesktop.org