[PATCH 0/5] Thermal devfreq cooling improvements with Energy Model

List overview All Threads
Download

newer

older

[PATCH v3 0/6] drm/tidss: Use new...

[PATCH v1 00/30] Introduce core...

Lukasz Luba

21 Sep 2020 21 Sep '20

12:20 p.m.

Hi all,

This path set is a continuation of my previous work, which aimed to add Energy Model to all devices. This series is a follow up for the patches which got merged to v5.9-rc1. It aims to change the thermal devfreq cooling and use the Energy Model instead of private calculation and structures. The new registration interface in the patch 3/5 helps to register devfreq cooling and the EM in one call. There is also another improvement, patch 2/5 is changing the way how thermal gets the device status. Now it's taken on demand and stored as a copy. The last patch wouldn't go through thermal tree, but it's here to consistency.

The patch set is based on current next (next-20200921).

Regards, Lukasz Luba

[1] https://lkml.org/lkml/2020/5/11/326

Lukasz Luba (5): thermal: devfreq_cooling: change tracing function and arguments thermal: devfreq_cooling: get a copy of device status thermal: devfreq_cooling: add new registration functions with Energy Model thermal: devfreq_cooling: remove old power model and use EM drm/panfrost: Register devfreq cooling and attempt to add Energy Model

drivers/gpu/drm/panfrost/panfrost_devfreq.c | 2 +- drivers/thermal/devfreq_cooling.c | 414 ++++++++++---------- include/linux/devfreq_cooling.h | 39 +- include/trace/events/thermal.h | 19 +- 4 files changed, 249 insertions(+), 225 deletions(-)

-- 2.17.1

Show replies by date

Lukasz Luba

21 Sep 21 Sep

12:20 p.m.

New subject: [PATCH 1/5] thermal: devfreq_cooling: change tracing function and arguments

Prepare for deleting the static and dynamic power calculation and clean the trace function. These two fields are going to be removed in the next changes.

Reviewed-by: Steven Rostedt (VMware) rostedt@goodmis.org # for tracing code Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 3 +-- include/trace/events/thermal.h | 19 +++++++++---------- 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index a12d29096229..7063ccb7b86d 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -278,8 +278,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd *power = dyn_power + static_power; }

- trace_thermal_power_devfreq_get_power(cdev, status, freq, dyn_power, - static_power, *power); + trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);

return 0; fail: diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h index 135e5421f003..8a5f04888abd 100644 --- a/include/trace/events/thermal.h +++ b/include/trace/events/thermal.h @@ -153,31 +153,30 @@ TRACE_EVENT(thermal_power_cpu_limit, TRACE_EVENT(thermal_power_devfreq_get_power, TP_PROTO(struct thermal_cooling_device *cdev, struct devfreq_dev_status *status, unsigned long freq, - u32 dynamic_power, u32 static_power, u32 power), + u32 power),

- TP_ARGS(cdev, status, freq, dynamic_power, static_power, power), + TP_ARGS(cdev, status, freq, power),

TP_STRUCT__entry( __string(type, cdev->type ) __field(unsigned long, freq ) - __field(u32, load ) - __field(u32, dynamic_power ) - __field(u32, static_power ) + __field(u32, busy_time) + __field(u32, total_time) __field(u32, power) ),

TP_fast_assign( __assign_str(type, cdev->type); __entry->freq = freq; - __entry->load = (100 * status->busy_time) / status->total_time; - __entry->dynamic_power = dynamic_power; - __entry->static_power = static_power; + __entry->busy_time = status->busy_time; + __entry->total_time = status->total_time; __entry->power = power; ),

- TP_printk("type=%s freq=%lu load=%u dynamic_power=%u static_power=%u power=%u", + TP_printk("type=%s freq=%lu load=%u power=%u", __get_str(type), __entry->freq, - __entry->load, __entry->dynamic_power, __entry->static_power, + __entry->total_time == 0 ? 0 : + (100 * __entry->busy_time) / __entry->total_time, __entry->power) );

-- 2.17.1

Lukasz Luba

12:20 p.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

Devfreq cooling needs to now the correct status of the device in order to operate. Do not rely on Devfreq last_status which might be a stale data and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To mitigate this situation make a copy of the status structure and use it for internal calculations.

In addition this patch adds normalization function, which also makes sure that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------ 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 7063ccb7b86d..cf045bd4d16b 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, voltage); }

+static void _normalize_load(struct devfreq_dev_status *status) +{ + /* Make some space if needed */ + if (status->busy_time > 0xffff) { + status->busy_time >>= 10; + status->total_time >>= 10; + } + + if (status->busy_time > status->total_time) + status->busy_time = status->total_time; + + status->busy_time *= 100; + status->busy_time /= status->total_time ? : 1; + + /* Avoid division by 0 */ + status->busy_time = status->busy_time ? : 1; + status->total_time = 100; +}

static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev, struct thermal_zone_device *tz, @@ -234,14 +252,22 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd { struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; - struct devfreq_dev_status *status = &df->last_status; + struct devfreq_dev_status status; unsigned long state; - unsigned long freq = status->current_frequency; + unsigned long freq; unsigned long voltage; u32 dyn_power = 0; u32 static_power = 0; int res;

+ mutex_lock(&df->lock); + res = df->profile->get_dev_status(df->dev.parent, &status); + mutex_unlock(&df->lock); + if (res) + return res; + + freq = status.current_frequency; + state = freq_get_state(dfc, freq); if (state == THERMAL_CSTATE_INVALID) { res = -EAGAIN; @@ -269,16 +295,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd } else { dyn_power = dfc->power_table[state];

+ _normalize_load(&status); + /* Scale dynamic power for utilization */ - dyn_power *= status->busy_time; - dyn_power /= status->total_time; + dyn_power *= status.busy_time; + dyn_power /= status.total_time; /* Get static power */ static_power = get_static_power(dfc, freq);

*power = dyn_power + static_power; }

- trace_thermal_power_devfreq_get_power(cdev, status, freq, *power); + trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);

return 0; fail: @@ -312,14 +340,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, { struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; - struct devfreq_dev_status *status = &df->last_status; - unsigned long freq = status->current_frequency; + struct devfreq_dev_status status; unsigned long busy_time; + unsigned long freq; s32 dyn_power; u32 static_power; s32 est_power; int i;

+ mutex_lock(&df->lock); + status = df->last_status; + mutex_unlock(&df->lock); + + freq = status.current_frequency; + if (dfc->power_ops->get_real_power) { /* Scale for resource utilization */ est_power = power * dfc->res_util; @@ -331,8 +365,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, dyn_power = dyn_power > 0 ? dyn_power : 0;

/* Scale dynamic power for utilization */ - busy_time = status->busy_time ?: 1; - est_power = (dyn_power * status->total_time) / busy_time; + busy_time = status.busy_time ?: 1; + est_power = (dyn_power * status.total_time) / busy_time; }

-- 2.17.1

Ionela Voinescu

7 Oct 7 Oct

4:11 p.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

On Monday 21 Sep 2020 at 13:20:04 (+0100), Lukasz Luba wrote:

...

Devfreq cooling needs to now the correct status of the device in order to operate. Do not rely on Devfreq last_status which might be a stale data and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To mitigate this situation make a copy of the status structure and use it for internal calculations.

In addition this patch adds normalization function, which also makes sure that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com

drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------ 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 7063ccb7b86d..cf045bd4d16b 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, voltage); }

+static void _normalize_load(struct devfreq_dev_status *status)

Is there a reason for the leading "_" ? AFAIK, "__name()" is meant to suggest a "worker" function for another "name()" function, but that would not apply here.

...

+{
/* Make some space if needed */

if (status->busy_time > 0xffff) {
status->busy_time >>= 10;
status->total_time >>= 10;
}

How about removing the above code and adding here:

status->busy_time = status->busy_time ? : 1;

...

if (status->busy_time > status->total_time)

This check would then cover the possibility that total_time is 0.

...

status->busy_time = status->total_time;

But a reversal is needed here: status->total_time = status->busy_time;

...

status->busy_time *= 100;

status->busy_time /= status->total_time ? : 1;

/* Avoid division by 0 */

status->busy_time = status->busy_time ? : 1;

status->total_time = 100;

Then all of this code can be replaced by:

status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10, status->total_time); status->total_time = 1 << 10;

This way you gain some resolution to busy_time and the divisions in the callers would just become shifts by 10.

Hope it helps, Ionela.

Lukasz Luba

22 Oct 22 Oct

10:55 a.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

Hi Ionela,

On 10/7/20 5:11 PM, Ionela Voinescu wrote:

...

On Monday 21 Sep 2020 at 13:20:04 (+0100), Lukasz Luba wrote:

...
Devfreq cooling needs to now the correct status of the device in order to operate. Do not rely on Devfreq last_status which might be a stale data and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To mitigate this situation make a copy of the status structure and use it for internal calculations.

In addition this patch adds normalization function, which also makes sure that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com

drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------ 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 7063ccb7b86d..cf045bd4d16b 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, voltage); }

+static void _normalize_load(struct devfreq_dev_status *status)

Is there a reason for the leading "_" ? AFAIK, "__name()" is meant to suggest a "worker" function for another "name()" function, but that would not apply here.

It is just a local name. Check e.g. ./drivers/opp/core.c there is a few: _generic_set_opp_regulator(), _generic_set_opp_clk_only(), _get_opp_count(), _find_opp_table(), _set_opp_bw(), etc.

It is just a shorter name for me, '_' means here locality. Instead of calling it devfreq_cooling_normalize_load().

...

...
+{
/* Make some space if needed */

if (status->busy_time > 0xffff) {
status->busy_time >>= 10;
status->total_time >>= 10;
}
How about removing the above code and adding here:

status->busy_time = status->busy_time ? : 1;

It's not equivalent. The code operates on raw device values, which might be big (e.g. read from counters). If it's lager than the 0xffff, it is going to be shifted to get smaller.

...

...

if (status->busy_time > status->total_time)

This check would then cover the possibility that total_time is 0.

...
status->busy_time = status->total_time;
But a reversal is needed here: status->total_time = status->busy_time;

No, I want to clamp the busy_time, which should not be bigger that total time. It could happen when we deal with 'raw' values from device counters.

...

...

status->busy_time *= 100;

status->busy_time /= status->total_time ? : 1;

/* Avoid division by 0 */

status->busy_time = status->busy_time ? : 1;

status->total_time = 100;

Then all of this code can be replaced by:

status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10, status->total_time); status->total_time = 1 << 10;

No, the total_time closed to 'unsigned long' would overflow.

...

This way you gain some resolution to busy_time and the divisions in the callers would just become shifts by 10.

I don't want to gain more resolution here. I want to be prepare for raw (not processed yet) big values coming from driver.

Regards, Lukasz

...

Hope it helps, Ionela.

Ionela Voinescu

1 Dec 1 Dec

10:36 a.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

Hi,

Sorry for the delay and for the noise on this older version. I first want to understand the code better.

On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote: [..]

...

...
...
+{
/* Make some space if needed */

if (status->busy_time > 0xffff) {
status->busy_time >>= 10;
status->total_time >>= 10;
}
How about removing the above code and adding here:

status->busy_time = status->busy_time ? : 1;
It's not equivalent. The code operates on raw device values, which might be big (e.g. read from counters). If it's lager than the 0xffff, it is going to be shifted to get smaller.

Yes, the big values are handled below through the division and by making total_time = 1024. These two initial checks are only to cover the possibility for busy_time and total_time being 0, or busy_time > total_time.

...

...
...

if (status->busy_time > status->total_time)

This check would then cover the possibility that total_time is 0.

...
status->busy_time = status->total_time;
But a reversal is needed here: status->total_time = status->busy_time;
No, I want to clamp the busy_time, which should not be bigger that total time. It could happen when we deal with 'raw' values from device counters.

Yes, I understand. But isn't making total_time = busy_time accomplishing the same thing?

...

...
...

status->busy_time *= 100;

status->busy_time /= status->total_time ? : 1;

/* Avoid division by 0 */

status->busy_time = status->busy_time ? : 1;

status->total_time = 100;

Then all of this code can be replaced by:

status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10, status->total_time); status->total_time = 1 << 10;

No, the total_time closed to 'unsigned long' would overflow.

I'm not sure I understand. total_time gets a value of 1024, it's not itself shifted by 10.

...

...
This way you gain some resolution to busy_time and the divisions in the callers would just become shifts by 10.

I don't want to gain more resolution here. I want to be prepare for raw (not processed yet) big values coming from driver.

Agreed! The higher resolution is an extra benefit. The more important benefit is that, through my suggestion, you'd be replacing all future divisions by shifts.

Thanks, Ionela.

...

Regards, Lukasz

...
Hope it helps, Ionela.

Lukasz Luba

12:19 p.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

On 12/1/20 10:36 AM, Ionela Voinescu wrote:

...

Hi,

Sorry for the delay and for the noise on this older version. I first want to understand the code better.

On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote: [..]

...
...
...
+{
/* Make some space if needed */

if (status->busy_time > 0xffff) {
status->busy_time >>= 10;
status->total_time >>= 10;
}
How about removing the above code and adding here:

status->busy_time = status->busy_time ? : 1;
It's not equivalent. The code operates on raw device values, which might be big (e.g. read from counters). If it's lager than the 0xffff, it is going to be shifted to get smaller.
Yes, the big values are handled below through the division and by making total_time = 1024. These two initial checks are only to cover the possibility for busy_time and total_time being 0, or busy_time > total_time.

...
...
...

if (status->busy_time > status->total_time)

This check would then cover the possibility that total_time is 0.

...
status->busy_time = status->total_time;
But a reversal is needed here: status->total_time = status->busy_time;
No, I want to clamp the busy_time, which should not be bigger that total time. It could happen when we deal with 'raw' values from device counters.
Yes, I understand. But isn't making total_time = busy_time accomplishing the same thing?

...
...
...

status->busy_time *= 100;

status->busy_time /= status->total_time ? : 1;

/* Avoid division by 0 */

status->busy_time = status->busy_time ? : 1;

status->total_time = 100;

Then all of this code can be replaced by:

status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10, status->total_time); status->total_time = 1 << 10;

No, the total_time closed to 'unsigned long' would overflow.

I'm not sure I understand. total_time gets a value of 1024, it's not itself shifted by 10.

...
...
This way you gain some resolution to busy_time and the divisions in the callers would just become shifts by 10.

I don't want to gain more resolution here. I want to be prepare for raw (not processed yet) big values coming from driver.

Agreed! The higher resolution is an extra benefit. The more important benefit is that, through my suggestion, you'd be replacing all future divisions by shifts.

You have probably missed some bits. I don't see benefits, you have div64_u64() which is heavy on 32bit CPUs.

Then, what is the range of these values: busy_time [0, 1024], total_time 1024 in your case. These values are used for estimating power in two cases: 1. in devfreq_cooling_get_requested_power() est_power = power * busy_time / total_time 2. in devfreq_cooling_power2state(): est_power = power * total_time / busy_time

As you can see above, the est_power values could overflow if total_time, busy_time are raw values (like in old implementation). So normalize them into 'some' scale. That was the motivation ('scale' motivation below).

In your case you cannot avoid division in 2. use case, because busy_time can be any value in range [0, 1024]. We could avoid the division in 1. use case, but load in cpufreq cooling is also in range of [0, 100], so this devfreq cooling is aligned. I would like to avoid situation when someone is parsing the traces and these two devices present different load scale.

I will think about better 'devfreq utilization' (as also Daniel suggested)in future, but first this EM must be in mainline and cpufreq cooling changes made by Viresh also there. But it would be more then just scale change to [0, 1024]...

Regards, Lukasz

...

Thanks, Ionela.

...
Regards, Lukasz

...
Hope it helps, Ionela.

Ionela Voinescu

2:55 p.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

On Tuesday 01 Dec 2020 at 12:19:18 (+0000), Lukasz Luba wrote:

...

On 12/1/20 10:36 AM, Ionela Voinescu wrote:

...
Hi,

Sorry for the delay and for the noise on this older version. I first want to understand the code better.

On Thursday 22 Oct 2020 at 11:55:28 (+0100), Lukasz Luba wrote: [..]

...
...
...
+{
/* Make some space if needed */

if (status->busy_time > 0xffff) {
status->busy_time >>= 10;
status->total_time >>= 10;
}
How about removing the above code and adding here:

status->busy_time = status->busy_time ? : 1;
It's not equivalent. The code operates on raw device values, which might be big (e.g. read from counters). If it's lager than the 0xffff, it is going to be shifted to get smaller.
Yes, the big values are handled below through the division and by making total_time = 1024. These two initial checks are only to cover the possibility for busy_time and total_time being 0, or busy_time > total_time.

...
...
...

if (status->busy_time > status->total_time)

This check would then cover the possibility that total_time is 0.

...
status->busy_time = status->total_time;
But a reversal is needed here: status->total_time = status->busy_time;
No, I want to clamp the busy_time, which should not be bigger that total time. It could happen when we deal with 'raw' values from device counters.
Yes, I understand. But isn't making total_time = busy_time accomplishing the same thing?

...
...
...

status->busy_time *= 100;

status->busy_time /= status->total_time ? : 1;

/* Avoid division by 0 */

status->busy_time = status->busy_time ? : 1;

status->total_time = 100;

Then all of this code can be replaced by:

status->busy_time = (unsigned long)div64_u64((u64)status->busy_time << 10, status->total_time); status->total_time = 1 << 10;

No, the total_time closed to 'unsigned long' would overflow.

I'm not sure I understand. total_time gets a value of 1024, it's not itself shifted by 10.

...
...
This way you gain some resolution to busy_time and the divisions in the callers would just become shifts by 10.

I don't want to gain more resolution here. I want to be prepare for raw (not processed yet) big values coming from driver.

Agreed! The higher resolution is an extra benefit. The more important benefit is that, through my suggestion, you'd be replacing all future divisions by shifts.
You have probably missed some bits. I don't see benefits, you have div64_u64() which is heavy on 32bit CPUs.

Then, what is the range of these values: busy_time [0, 1024], total_time 1024 in your case. These values are used for estimating power in two cases:

in devfreq_cooling_get_requested_power() est_power = power * busy_time / total_time

in devfreq_cooling_power2state(): est_power = power * total_time / busy_time

As you can see above, the est_power values could overflow if total_time, busy_time are raw values (like in old implementation). So normalize them into 'some' scale. That was the motivation ('scale' motivation below).

Agreed! I do think scaling is necessary, but in my mind the [0, 1024] scale made more sense.

...

In your case you cannot avoid division in 2. use case, because busy_time can be any value in range [0, 1024]. We could avoid the division in 1. use case, but load in cpufreq cooling is also in range of [0, 100], so this devfreq cooling is aligned. I would like to avoid situation when someone is parsing the traces and these two devices present different load scale.

Got it! Looking through the code I did overlook that 2 was reversed.

...

I will think about better 'devfreq utilization' (as also Daniel suggested)in future, but first this EM must be in mainline and cpufreq cooling changes made by Viresh also there. But it would be more then just scale change to [0, 1024]...

Okay, looking forward to this. It would be nice to align all of these utilization metrics in the future for all kinds of devices.

Thanks, Ionela.

Daniel Lezcano

14 Oct 14 Oct

2:34 p.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

On 21/09/2020 14:20, Lukasz Luba wrote:

...

Devfreq cooling needs to now the correct status of the device in order to operate. Do not rely on Devfreq last_status which might be a stale data and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To mitigate this situation make a copy of the status structure and use it for internal calculations.

In addition this patch adds normalization function, which also makes sure that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com

drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------ 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 7063ccb7b86d..cf045bd4d16b 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, voltage); }

+static void _normalize_load(struct devfreq_dev_status *status) +{
/* Make some space if needed */

if (status->busy_time > 0xffff) {
status->busy_time >>= 10;
status->total_time >>= 10;
}

if (status->busy_time > status->total_time)
status->busy_time = status->total_time;
status->busy_time *= 100;

status->busy_time /= status->total_time ? : 1;

/* Avoid division by 0 */

status->busy_time = status->busy_time ? : 1;

status->total_time = 100;
+}

Not sure that works if the devfreq governor is not on-demand.

Is it possible to use the energy model directly in devfreq to compute the energy consumption given the state transitions since the last reading ?

The power will be read directly from devfreq which will return:

nrj + (current_power * (jiffies - last_update)) / (jiffies - last_update)

The devfreq cooling device driver would result in a much simpler code, no?

...

static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cdev, struct thermal_zone_device *tz, @@ -234,14 +252,22 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd { struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq;

struct devfreq_dev_status *status = &df->last_status;

struct devfreq_dev_status status; unsigned long state;

unsigned long freq = status->current_frequency;
unsigned long freq; unsigned long voltage; u32 dyn_power = 0; u32 static_power = 0; int res;

mutex_lock(&df->lock);

res = df->profile->get_dev_status(df->dev.parent, &status);

mutex_unlock(&df->lock);

if (res)
return res;
freq = status.current_frequency;

state = freq_get_state(dfc, freq); if (state == THERMAL_CSTATE_INVALID) { res = -EAGAIN;
@@ -269,16 +295,18 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd } else { dyn_power = dfc->power_table[state];
_normalize_load(&status);
/* Scale dynamic power for utilization */
dyn_power *= status->busy_time;
dyn_power /= status->total_time;
dyn_power *= status.busy_time;
dyn_power /= status.total_time;
/* Get static power */ static_power = get_static_power(dfc, freq);

*power = dyn_power + static_power; }
trace_thermal_power_devfreq_get_power(cdev, status, freq, *power);

trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power);

return 0;

fail: @@ -312,14 +340,20 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, { struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq;

struct devfreq_dev_status *status = &df->last_status;

unsigned long freq = status->current_frequency;

struct devfreq_dev_status status; unsigned long busy_time;

unsigned long freq; s32 dyn_power; u32 static_power; s32 est_power; int i;

mutex_lock(&df->lock);

status = df->last_status;

mutex_unlock(&df->lock);

freq = status.current_frequency;

if (dfc->power_ops->get_real_power) { /* Scale for resource utilization */ est_power = power * dfc->res_util;

@@ -331,8 +365,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, dyn_power = dyn_power > 0 ? dyn_power : 0;
/* Scale dynamic power for utilization */
busy_time = status->busy_time ?: 1;
est_power = (dyn_power * status->total_time) / busy_time;
busy_time = status.busy_time ?: 1;
est_power = (dyn_power * status.total_time) / busy_time;
}

/*

-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog

Lukasz Luba

22 Oct 22 Oct

11:45 a.m.

New subject: [PATCH 2/5] thermal: devfreq_cooling: get a copy of device status

Hi Daniel,

On 10/14/20 3:34 PM, Daniel Lezcano wrote:

...

On 21/09/2020 14:20, Lukasz Luba wrote:

...
Devfreq cooling needs to now the correct status of the device in order to operate. Do not rely on Devfreq last_status which might be a stale data and get more up-to-date values of the load.

Devfreq framework can change the device status in the background. To mitigate this situation make a copy of the status structure and use it for internal calculations.

In addition this patch adds normalization function, which also makes sure that whatever data comes from the device, it is in a sane range.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com

drivers/thermal/devfreq_cooling.c | 52 +++++++++++++++++++++++++------ 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 7063ccb7b86d..cf045bd4d16b 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -227,6 +227,24 @@ static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, voltage); }

+static void _normalize_load(struct devfreq_dev_status *status) +{
/* Make some space if needed */

if (status->busy_time > 0xffff) {
status->busy_time >>= 10;
status->total_time >>= 10;
}

if (status->busy_time > status->total_time)
status->busy_time = status->total_time;
status->busy_time *= 100;

status->busy_time /= status->total_time ? : 1;

/* Avoid division by 0 */

status->busy_time = status->busy_time ? : 1;

status->total_time = 100;
+}
Not sure that works if the devfreq governor is not on-demand.

Is it possible to use the energy model directly in devfreq to compute the energy consumption given the state transitions since the last reading ?

This change is actually trying to create a safety net for what we do.

In the original code we take the last_state directly: - struct devfreq_dev_status *status = &df->last_status;

Then we simply multiply by 'busy_time' and div by 'total_time', without checks... The values might be huge or zero, etc. The _normalize_load() introduces this safety.

Apart from that, just simply taking a pointer to &df->last_status does not protect us from: - working on a struct which might be modified at the same time in background - not safe - that struct might not be updated by long time, because devfreq didn't check it for a long (there are two polling modes in devfreq)

So taking a mutex and then a trigger the device status check and make a copy of newest data. It is more safe.

I think this can be treated as a fix, not a feature.

...

The power will be read directly from devfreq which will return:

nrj + (current_power * (jiffies - last_update)) / (jiffies - last_update)

The devfreq cooling device driver would result in a much simpler code, no?

This is something that I would like to address after the EM changes are merged. It would be the next step, how to estimate the power by taking into consideration more information. This patch series just tries to make it possible to use EM. The model improvements would be next.

Thank you Daniel for your review.

Regards, Lukasz

Lukasz Luba

21 Sep 21 Sep

12:20 p.m.

New subject: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model

The Energy Model (EM) framework supports devices such as Devfreq. Create new registration functions which automatically register EM for the thermal devfreq_cooling devices. This patch prepares the code for coming changes which are going to replace old power model with the new EM.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++- include/linux/devfreq_cooling.h | 22 +++++++ 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index cf045bd4d16b..7e091e795284 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida); * @capped_state: index to cooling state with in dynamic power budget * @req_max_freq: PM QoS request for limiting the maximum frequency * of the devfreq device. + * @em: Energy Model which represents the associated Devfreq device + * @em_registered: Devfreq cooling registered the EM and should free it. */ struct devfreq_cooling_device { int id; @@ -63,6 +65,8 @@ struct devfreq_cooling_device { u32 res_util; int capped_state; struct dev_pm_qos_request req_max_freq; + struct em_perf_domain *em; + bool em_registered; };

static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev, @@ -586,22 +590,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df) } EXPORT_SYMBOL_GPL(devfreq_cooling_register);

+/** + * devfreq_cooling_em_register_power() - Register devfreq cooling device with + * power information and attempt to register Energy Model (EM) + * @df: Pointer to devfreq device. + * @dfc_power: Pointer to devfreq_cooling_power. + * @em_cb: Callback functions providing the data of the EM + * + * Register a devfreq cooling device and attempt to register Energy Model. The + * available OPPs must be registered for the device. + * + * If @dfc_power is provided, the cooling device is registered with the + * power extensions. If @em_cb is provided it will be called for each OPP to + * calculate power value and cost. If @em_cb is not provided then simple Energy + * Model is going to be used, which requires "dynamic-power-coefficient" a + * devicetree property. + */ +struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df, + struct devfreq_cooling_power *dfc_power, + struct em_data_callback *em_cb) +{ + struct thermal_cooling_device *cdev; + struct devfreq_cooling_device *dfc; + struct device_node *np = NULL; + struct device *dev; + int nr_opp, ret; + + if (IS_ERR_OR_NULL(df)) + return ERR_PTR(-EINVAL); + + dev = df->dev.parent; + + if (em_cb) { + nr_opp = dev_pm_opp_get_opp_count(dev); + if (nr_opp <= 0) { + dev_err(dev, "No valid OPPs found\n"); + return ERR_PTR(-EINVAL); + } + + ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL); + } else { + ret = dev_pm_opp_of_register_em(dev, NULL); + } + + if (ret) + dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n", + ret); + + if (dev->of_node) + np = of_node_get(dev->of_node); + + cdev = of_devfreq_cooling_register_power(np, df, dfc_power); + + if (np) + of_node_put(np); + + if (IS_ERR_OR_NULL(cdev)) { + if (!ret) + em_dev_unregister_perf_domain(dev); + } else { + dfc = cdev->devdata; + dfc->em_registered = !ret; + } + + return cdev; +} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power); + +/** + * devfreq_cooling_em_register() - Register devfreq cooling device together + * with Energy Model. + * @df: Pointer to devfreq device. + * @em_cb: Callback functions providing the data of the Energy Model + * + * This function attempts to register Energy Model for devfreq device and then + * register the devfreq cooling device. + */ +struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{ + return devfreq_cooling_em_register_power(df, NULL, em_cb); +} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register); + /** * devfreq_cooling_unregister() - Unregister devfreq cooling device. * @cdev: Pointer to devfreq cooling device to unregister. + * + * Unregisters devfreq cooling device and related Energy Model if it was + * present. */ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev) { struct devfreq_cooling_device *dfc; + struct device *dev;

- if (!cdev) + if (IS_ERR_OR_NULL(cdev)) return;

dfc = cdev->devdata; + dev = dfc->devfreq->dev.parent;

thermal_cooling_device_unregister(dfc->cdev); ida_simple_remove(&devfreq_ida, dfc->id); dev_pm_qos_remove_request(&dfc->req_max_freq); + + if (dfc->em_registered) + em_dev_unregister_perf_domain(dev); + kfree(dfc->power_table); kfree(dfc->freq_table);

diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h index 9df2dfca68dd..19868fb922f1 100644 --- a/include/linux/devfreq_cooling.h +++ b/include/linux/devfreq_cooling.h @@ -11,6 +11,7 @@ #define __DEVFREQ_COOLING_H__

#include <linux/devfreq.h> +#include <linux/energy_model.h> #include <linux/thermal.h>

@@ -65,6 +66,13 @@ struct thermal_cooling_device * of_devfreq_cooling_register(struct device_node *np, struct devfreq *df); struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df); void devfreq_cooling_unregister(struct thermal_cooling_device *dfc); +struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df, + struct devfreq_cooling_power *dfc_power, + struct em_data_callback *em_cb); +struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, + struct em_data_callback *em_cb);

#else /* !CONFIG_DEVFREQ_THERMAL */

@@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df) return ERR_PTR(-EINVAL); }

+static inline struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df, + struct devfreq_cooling_power *dfc_power, + struct em_data_callback *em_cb) +{ + return ERR_PTR(-EINVAL); +} + +static inline struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{ + return ERR_PTR(-EINVAL); +} + static inline void devfreq_cooling_unregister(struct thermal_cooling_device *dfc) {

-- 2.17.1

Ionela Voinescu

7 Oct 7 Oct

12:07 p.m.

New subject: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model

Hi Lukasz,

On Monday 21 Sep 2020 at 13:20:05 (+0100), Lukasz Luba wrote:

...

The Energy Model (EM) framework supports devices such as Devfreq. Create new registration functions which automatically register EM for the thermal devfreq_cooling devices. This patch prepares the code for coming changes which are going to replace old power model with the new EM.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com

drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++- include/linux/devfreq_cooling.h | 22 +++++++ 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index cf045bd4d16b..7e091e795284 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
@capped_state: index to cooling state with in dynamic power budget

@req_max_freq: PM QoS request for limiting the maximum frequency
	of the devfreq device.
@em: Energy Model which represents the associated Devfreq device

^^^^^^^^^^^^^^^^ for

...

@em_registered: Devfreq cooling registered the EM and should free it.

*/

struct devfreq_cooling_device { int id; @@ -63,6 +65,8 @@ struct devfreq_cooling_device { u32 res_util; int capped_state; struct dev_pm_qos_request req_max_freq;

struct em_perf_domain *em;

bool em_registered;

};

static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev, @@ -586,22 +590,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df) } EXPORT_SYMBOL_GPL(devfreq_cooling_register);

+/**
devfreq_cooling_em_register_power() - Register devfreq cooling device with
power information and attempt to register Energy Model (EM)

It took me a while to understand the differences between devfreq register functions and it left me with a nagging feeling that we don't need all of them. Also, looking over the cpufreq cooling devices, they keep their registering interfaces quite simple.

With the functions added by this patch, the devfreq cooling devices will have: - old: of_devfreq_cooling_register_power of_devfreq_cooling_register devfreq_cooling_register devfreq_cooling_unregister - new: devfreq_cooling_em_register_power devfreq_cooling_em_register

My question is whether we actually need the two new devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?

The power_ops and the em are dependent on one another, so could we extend the of_devfreq_cooling_register_power() to do the additional em registration. We only need a way to pass the em_cb and I think that could fit nicely in devfreq_cooling_power.

To be noted that I've reviewed these interfaces in the context of the final state of devfreq_cooling.c, after the changes in 4/5.

...

@df: Pointer to devfreq device.

@dfc_power: Pointer to devfreq_cooling_power.

@em_cb: Callback functions providing the data of the EM

Register a devfreq cooling device and attempt to register Energy Model. The

available OPPs must be registered for the device.

If @dfc_power is provided, the cooling device is registered with the

power extensions. If @em_cb is provided it will be called for each OPP to

calculate power value and cost. If @em_cb is not provided then simple Energy

Model is going to be used, which requires "dynamic-power-coefficient" a

devicetree property.

*/

+struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df,
		  struct devfreq_cooling_power *dfc_power,
		  struct em_data_callback *em_cb)
+{
struct thermal_cooling_device *cdev;

struct devfreq_cooling_device *dfc;

struct device_node *np = NULL;

struct device *dev;

int nr_opp, ret;

if (IS_ERR_OR_NULL(df))
return ERR_PTR(-EINVAL);
dev = df->dev.parent;

if (em_cb) {
nr_opp = dev_pm_opp_get_opp_count(dev);
if (nr_opp <= 0) {
	dev_err(dev, "No valid OPPs found\n");
	return ERR_PTR(-EINVAL);
}
ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL);
} else {
ret = dev_pm_opp_of_register_em(dev, NULL);
}

if (ret)
dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
	 ret);
if (dev->of_node)
np = of_node_get(dev->of_node);
cdev = of_devfreq_cooling_register_power(np, df, dfc_power);

if (np)
of_node_put(np);
if (IS_ERR_OR_NULL(cdev)) {
if (!ret)
	em_dev_unregister_perf_domain(dev);
} else {
dfc = cdev->devdata;
dfc->em_registered = !ret;
}

return cdev;
+} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);

+/**
devfreq_cooling_em_register() - Register devfreq cooling device together
		with Energy Model.
@df: Pointer to devfreq device.

@em_cb: Callback functions providing the data of the Energy Model

This function attempts to register Energy Model for devfreq device and then

register the devfreq cooling device.

*/
+struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{

return devfreq_cooling_em_register_power(df, NULL, em_cb);

+} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);

/**

devfreq_cooling_unregister() - Unregister devfreq cooling device.

@cdev: Pointer to devfreq cooling device to unregister.

Unregisters devfreq cooling device and related Energy Model if it was

present.

*/

void devfreq_cooling_unregister(struct thermal_cooling_device *cdev) { struct devfreq_cooling_device *dfc;

struct device *dev;

if (!cdev)
if (IS_ERR_OR_NULL(cdev)) return;

dfc = cdev->devdata;

dev = dfc->devfreq->dev.parent;

thermal_cooling_device_unregister(dfc->cdev); ida_simple_remove(&devfreq_ida, dfc->id); dev_pm_qos_remove_request(&dfc->req_max_freq);

if (dfc->em_registered)
em_dev_unregister_perf_domain(dev);

Nit: Isn't it enough to check if dev->em_pd != NULL to be able to unregister the perf_domain? That would remove the need for dfc->em_registered.

I suppose one could say that's using implementation details on how the EM is built and stored and we should not rely on it, so it's up to you if you want to change it.

Kind regards, Ionela.

...

kfree(dfc->power_table); kfree(dfc->freq_table);

diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h index 9df2dfca68dd..19868fb922f1 100644 --- a/include/linux/devfreq_cooling.h +++ b/include/linux/devfreq_cooling.h @@ -11,6 +11,7 @@ #define __DEVFREQ_COOLING_H__

#include <linux/devfreq.h> +#include <linux/energy_model.h> #include <linux/thermal.h>

@@ -65,6 +66,13 @@ struct thermal_cooling_device * of_devfreq_cooling_register(struct device_node *np, struct devfreq *df); struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df); void devfreq_cooling_unregister(struct thermal_cooling_device *dfc); +struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df,
		  struct devfreq_cooling_power *dfc_power,
		  struct em_data_callback *em_cb);
+struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df,
	    struct em_data_callback *em_cb);
#else /* !CONFIG_DEVFREQ_THERMAL */

@@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df) return ERR_PTR(-EINVAL); }

+static inline struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df,
		  struct devfreq_cooling_power *dfc_power,
		  struct em_data_callback *em_cb)
+{

return ERR_PTR(-EINVAL);

+}

+static inline struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{

return ERR_PTR(-EINVAL);

+}

static inline void devfreq_cooling_unregister(struct thermal_cooling_device *dfc) { -- 2.17.1

dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel

Lukasz Luba

22 Oct 22 Oct

11:17 a.m.

New subject: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model

On 10/7/20 1:07 PM, Ionela Voinescu wrote:

...

Hi Lukasz,

On Monday 21 Sep 2020 at 13:20:05 (+0100), Lukasz Luba wrote:

...
The Energy Model (EM) framework supports devices such as Devfreq. Create new registration functions which automatically register EM for the thermal devfreq_cooling devices. This patch prepares the code for coming changes which are going to replace old power model with the new EM.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com

drivers/thermal/devfreq_cooling.c | 99 ++++++++++++++++++++++++++++++- include/linux/devfreq_cooling.h | 22 +++++++ 2 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index cf045bd4d16b..7e091e795284 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -50,6 +50,8 @@ static DEFINE_IDA(devfreq_ida);
@capped_state: index to cooling state with in dynamic power budget

@req_max_freq: PM QoS request for limiting the maximum frequency
	of the devfreq device.
@em: Energy Model which represents the associated Devfreq device
                                  ^^^^^^^^^^^^^^^^
		     for

I will change it.

...

...
@em_registered: Devfreq cooling registered the EM and should free it.

*/ struct devfreq_cooling_device { int id;

@@ -63,6 +65,8 @@ struct devfreq_cooling_device { u32 res_util; int capped_state; struct dev_pm_qos_request req_max_freq;

struct em_perf_domain *em;

bool em_registered; };

static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev,

@@ -586,22 +590,115 @@ struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df) } EXPORT_SYMBOL_GPL(devfreq_cooling_register);

+/**
devfreq_cooling_em_register_power() - Register devfreq cooling device with
power information and attempt to register Energy Model (EM)
It took me a while to understand the differences between devfreq register functions and it left me with a nagging feeling that we don't need all of them. Also, looking over the cpufreq cooling devices, they keep their registering interfaces quite simple.

This was discussed in previous series, related to EM core changes. It was requested to have a helper registration function which would create EM automatically.

...

With the functions added by this patch, the devfreq cooling devices will have:

old: of_devfreq_cooling_register_power of_devfreq_cooling_register devfreq_cooling_register devfreq_cooling_unregister

new: devfreq_cooling_em_register_power devfreq_cooling_em_register

My question is whether we actually need the two new devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?

It is just for consistency, with older scheme. It is only a wrapper, one line, with default NULL. This scheme is common in thermal and some other frameworks.

...

The power_ops and the em are dependent on one another, so could we extend the of_devfreq_cooling_register_power() to do the additional em registration. We only need a way to pass the em_cb and I think that could fit nicely in devfreq_cooling_power.

No, they aren't 'dependent on one another'. The EM usage doesn't depend on presence of power_ops. Drivers might not support power_ops, but want the framework still use EM and do power estimation.

...

To be noted that I've reviewed these interfaces in the context of the final state of devfreq_cooling.c, after the changes in 4/5.

...
@df: Pointer to devfreq device.

@dfc_power: Pointer to devfreq_cooling_power.

@em_cb: Callback functions providing the data of the EM

Register a devfreq cooling device and attempt to register Energy Model. The

available OPPs must be registered for the device.

If @dfc_power is provided, the cooling device is registered with the

power extensions. If @em_cb is provided it will be called for each OPP to

calculate power value and cost. If @em_cb is not provided then simple Energy

Model is going to be used, which requires "dynamic-power-coefficient" a

devicetree property.

*/

+struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df,
		  struct devfreq_cooling_power *dfc_power,
		  struct em_data_callback *em_cb)
+{
struct thermal_cooling_device *cdev;

struct devfreq_cooling_device *dfc;

struct device_node *np = NULL;

struct device *dev;

int nr_opp, ret;

if (IS_ERR_OR_NULL(df))
return ERR_PTR(-EINVAL);
dev = df->dev.parent;

if (em_cb) {
nr_opp = dev_pm_opp_get_opp_count(dev);
if (nr_opp <= 0) {
	dev_err(dev, "No valid OPPs found\n");
	return ERR_PTR(-EINVAL);
}
ret = em_dev_register_perf_domain(dev, nr_opp, em_cb, NULL);
} else {
ret = dev_pm_opp_of_register_em(dev, NULL);
}

if (ret)
dev_warn(dev, "Unable to register EM for devfreq cooling device (%d)\n",
	 ret);
if (dev->of_node)
np = of_node_get(dev->of_node);
cdev = of_devfreq_cooling_register_power(np, df, dfc_power);

if (np)
of_node_put(np);
if (IS_ERR_OR_NULL(cdev)) {
if (!ret)
	em_dev_unregister_perf_domain(dev);
} else {
dfc = cdev->devdata;
dfc->em_registered = !ret;
}

return cdev;
+} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register_power);

+/**
devfreq_cooling_em_register() - Register devfreq cooling device together
		with Energy Model.
@df: Pointer to devfreq device.

@em_cb: Callback functions providing the data of the Energy Model

This function attempts to register Energy Model for devfreq device and then

register the devfreq cooling device.

*/
+struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{

return devfreq_cooling_em_register_power(df, NULL, em_cb);

+} +EXPORT_SYMBOL_GPL(devfreq_cooling_em_register);

/**

devfreq_cooling_unregister() - Unregister devfreq cooling device.

@cdev: Pointer to devfreq cooling device to unregister.

Unregisters devfreq cooling device and related Energy Model if it was

present.

*/ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev) { struct devfreq_cooling_device *dfc;

struct device *dev;

if (!cdev)
if (IS_ERR_OR_NULL(cdev)) return;

dfc = cdev->devdata;

dev = dfc->devfreq->dev.parent;

thermal_cooling_device_unregister(dfc->cdev); ida_simple_remove(&devfreq_ida, dfc->id); dev_pm_qos_remove_request(&dfc->req_max_freq);

if (dfc->em_registered)
em_dev_unregister_perf_domain(dev);
Nit: Isn't it enough to check if dev->em_pd != NULL to be able to unregister the perf_domain? That would remove the need for dfc->em_registered.

The devfreq cooling may only unregister the EM if it has registered it. If any other code did the registration, it should unregister when it finished using it.

...

I suppose one could say that's using implementation details on how the EM is built and stored and we should not rely on it, so it's up to you if you want to change it.

Kind regards, Ionela.

...
kfree(dfc->power_table); kfree(dfc->freq_table);

diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h index 9df2dfca68dd..19868fb922f1 100644 --- a/include/linux/devfreq_cooling.h +++ b/include/linux/devfreq_cooling.h @@ -11,6 +11,7 @@ #define __DEVFREQ_COOLING_H__

#include <linux/devfreq.h> +#include <linux/energy_model.h> #include <linux/thermal.h>

@@ -65,6 +66,13 @@ struct thermal_cooling_device * of_devfreq_cooling_register(struct device_node *np, struct devfreq *df); struct thermal_cooling_device *devfreq_cooling_register(struct devfreq *df); void devfreq_cooling_unregister(struct thermal_cooling_device *dfc); +struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df,
		  struct devfreq_cooling_power *dfc_power,
		  struct em_data_callback *em_cb);
+struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df,
	    struct em_data_callback *em_cb);
#else /* !CONFIG_DEVFREQ_THERMAL */
@@ -87,6 +95,20 @@ devfreq_cooling_register(struct devfreq *df) return ERR_PTR(-EINVAL); }

+static inline struct thermal_cooling_device * +devfreq_cooling_em_register_power(struct devfreq *df,
		  struct devfreq_cooling_power *dfc_power,
		  struct em_data_callback *em_cb)
+{

return ERR_PTR(-EINVAL);

+}

+static inline struct thermal_cooling_device * +devfreq_cooling_em_register(struct devfreq *df, struct em_data_callback *em_cb) +{

return ERR_PTR(-EINVAL);

+}

static inline void devfreq_cooling_unregister(struct thermal_cooling_device *dfc) {

-- 2.17.1

dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel

Ionela Voinescu

1 Dec 1 Dec

2:05 p.m.

New subject: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model

Hi,

On Thursday 22 Oct 2020 at 12:17:31 (+0100), Lukasz Luba wrote: [..]

...

...
...
+/**
devfreq_cooling_em_register_power() - Register devfreq cooling device with
power information and attempt to register Energy Model (EM)
It took me a while to understand the differences between devfreq register functions and it left me with a nagging feeling that we don't need all of them. Also, looking over the cpufreq cooling devices, they keep their registering interfaces quite simple.
This was discussed in previous series, related to EM core changes. It was requested to have a helper registration function which would create EM automatically.

...
With the functions added by this patch, the devfreq cooling devices will have:

old: of_devfreq_cooling_register_power of_devfreq_cooling_register devfreq_cooling_register devfreq_cooling_unregister

new: devfreq_cooling_em_register_power devfreq_cooling_em_register

My question is whether we actually need the two new devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?

It is just for consistency, with older scheme. It is only a wrapper, one line, with default NULL. This scheme is common in thermal and some other frameworks.

...
The power_ops and the em are dependent on one another, so could we extend the of_devfreq_cooling_register_power() to do the additional em registration. We only need a way to pass the em_cb and I think that could fit nicely in devfreq_cooling_power.

No, they aren't 'dependent on one another'. The EM usage doesn't depend on presence of power_ops. Drivers might not support power_ops, but want the framework still use EM and do power estimation.

Okay, wrong choice of words. There's only a one way dependency: you can't use power_ops without an em, according to of_devfreq_cooling_register_power().

Correct me if I'm wrong, but I see this as being okay as you still need an em to give you the maximum power of a device in a certain state.

With this in mind, and taking in detail the possible calls of the devfreq cooling register functions:

1. Register devfreq cooling device with energy model. (used in patch 5/5)

-> devfreq_cooling_em_register() -> devfreq_cooling_em_register_power(dfc_power = NULL, em obtained through various methods) -> of_devfreq_cooling_register_power(same as above)

2. Register devfreq cooling device with power_ops and em: (not used)

-> devfreq_cooling_em_register_power(dfc_power != NULL, em obtained through various methods) -> of_devfreq_cooling_register_power(same as above)

3. Register a devfreq cooling devices with power_ops but no em (not used)

-> of_devfreq_cooling_register_power(dfc_power != NULL)

4. Register a devfreq cooling devices without any kind of power information (em or dfc_power/power_ops)

-> devfreq_cooling_register() or of_devfreq_cooling_register() -> of_devfreq_cooling_register_power(dfc_power = NULL)

Given this, aren't we ending up with some possible calls to these registration functions that don't make sense? That is case 3, as of_devfreq_cooling_register_power() could not assign and later use power_ops without an em. For this usecase, 2 should be used instead.

Therefore, can't the same be achieved by collapsing devfreq_cooling_em_register_power() into of_devfreq_cooling_register_power()? (with the user having the possibility to provide the em callback similarly to how get_real_power() is provided - in devfreq_cooling_power).

IMO is cleaner to unify the functionality (registration and callbacks) of cooling devices with power capabilities (based on em alone or together with power_ops). Otherwise we just create confusion for users registering cooling devices not knowing which function to call.

If this has been discussed previously and I'm missing some details, please provide some links to the discussions.

Thank you for the patience :).

Ionela.

Lukasz Luba

2:37 p.m.

New subject: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model

On 12/1/20 2:05 PM, Ionela Voinescu wrote:

...

Hi,

On Thursday 22 Oct 2020 at 12:17:31 (+0100), Lukasz Luba wrote: [..]

...
...
...
+/**
devfreq_cooling_em_register_power() - Register devfreq cooling device with
power information and attempt to register Energy Model (EM)
It took me a while to understand the differences between devfreq register functions and it left me with a nagging feeling that we don't need all of them. Also, looking over the cpufreq cooling devices, they keep their registering interfaces quite simple.
This was discussed in previous series, related to EM core changes. It was requested to have a helper registration function which would create EM automatically.

...
With the functions added by this patch, the devfreq cooling devices will have:

old: of_devfreq_cooling_register_power of_devfreq_cooling_register devfreq_cooling_register devfreq_cooling_unregister

new: devfreq_cooling_em_register_power devfreq_cooling_em_register

My question is whether we actually need the two new devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?

It is just for consistency, with older scheme. It is only a wrapper, one line, with default NULL. This scheme is common in thermal and some other frameworks.

...
The power_ops and the em are dependent on one another, so could we extend the of_devfreq_cooling_register_power() to do the additional em registration. We only need a way to pass the em_cb and I think that could fit nicely in devfreq_cooling_power.

No, they aren't 'dependent on one another'. The EM usage doesn't depend on presence of power_ops. Drivers might not support power_ops, but want the framework still use EM and do power estimation.
Okay, wrong choice of words. There's only a one way dependency: you can't use power_ops without an em, according to of_devfreq_cooling_register_power().

Correct me if I'm wrong, but I see this as being okay as you still need an em to give you the maximum power of a device in a certain state.

With this in mind, and taking in detail the possible calls of the devfreq cooling register functions:

Register devfreq cooling device with energy model. (used in patch 5/5)

-> devfreq_cooling_em_register() -> devfreq_cooling_em_register_power(dfc_power = NULL, em obtained through various methods) -> of_devfreq_cooling_register_power(same as above)

Register devfreq cooling device with power_ops and em: (not used)

-> devfreq_cooling_em_register_power(dfc_power != NULL, em obtained through various methods) -> of_devfreq_cooling_register_power(same as above)

Register a devfreq cooling devices with power_ops but no em (not used)

-> of_devfreq_cooling_register_power(dfc_power != NULL)

Register a devfreq cooling devices without any kind of power information (em or dfc_power/power_ops)

-> devfreq_cooling_register() or of_devfreq_cooling_register() -> of_devfreq_cooling_register_power(dfc_power = NULL)

Given this, aren't we ending up with some possible calls to these registration functions that don't make sense? That is case 3, as of_devfreq_cooling_register_power() could not assign and later use power_ops without an em. For this usecase, 2 should be used instead.

In use case 3. you missed that the driver could registered EM by itself. Maybe wanted to manage the EM internally, for various reasons. Then this registration use case 3. makes sense.

...

Therefore, can't the same be achieved by collapsing devfreq_cooling_em_register_power() into of_devfreq_cooling_register_power()? (with the user having the possibility to provide the em callback similarly to how get_real_power() is provided - in devfreq_cooling_power).

IMO is cleaner to unify the functionality (registration and callbacks) of cooling devices with power capabilities (based on em alone or together with power_ops). Otherwise we just create confusion for users registering cooling devices not knowing which function to call.

I don't want to add the code from devfreq_cooling_em_register_power() into the of_devfreq_cooling_register_power(), these are pretty dense functions with complicated error handling paths. In this shape and a few wrappers, which help users to register according to their needs, it looks OK.

There will be always a review of the coming drivers which would like to register.

...

If this has been discussed previously and I'm missing some details, please provide some links to the discussions.

Thank you for the patience :).

Ionela.

Ionela Voinescu

3:02 p.m.

New subject: [PATCH 3/5] thermal: devfreq_cooling: add new registration functions with Energy Model

On Tuesday 01 Dec 2020 at 14:37:58 (+0000), Lukasz Luba wrote:

...

On 12/1/20 2:05 PM, Ionela Voinescu wrote:

...
Hi,

On Thursday 22 Oct 2020 at 12:17:31 (+0100), Lukasz Luba wrote: [..]

...
...
...
+/**
devfreq_cooling_em_register_power() - Register devfreq cooling device with
power information and attempt to register Energy Model (EM)
It took me a while to understand the differences between devfreq register functions and it left me with a nagging feeling that we don't need all of them. Also, looking over the cpufreq cooling devices, they keep their registering interfaces quite simple.
This was discussed in previous series, related to EM core changes. It was requested to have a helper registration function which would create EM automatically.

...
With the functions added by this patch, the devfreq cooling devices will have:

old: of_devfreq_cooling_register_power of_devfreq_cooling_register devfreq_cooling_register devfreq_cooling_unregister

new: devfreq_cooling_em_register_power devfreq_cooling_em_register

My question is whether we actually need the two new devfreq_cooling_em_register_power() and devfreq_cooling_em_register()?

It is just for consistency, with older scheme. It is only a wrapper, one line, with default NULL. This scheme is common in thermal and some other frameworks.

...
The power_ops and the em are dependent on one another, so could we extend the of_devfreq_cooling_register_power() to do the additional em registration. We only need a way to pass the em_cb and I think that could fit nicely in devfreq_cooling_power.

No, they aren't 'dependent on one another'. The EM usage doesn't depend on presence of power_ops. Drivers might not support power_ops, but want the framework still use EM and do power estimation.
Okay, wrong choice of words. There's only a one way dependency: you can't use power_ops without an em, according to of_devfreq_cooling_register_power().

Correct me if I'm wrong, but I see this as being okay as you still need an em to give you the maximum power of a device in a certain state.

With this in mind, and taking in detail the possible calls of the devfreq cooling register functions:

Register devfreq cooling device with energy model. (used in patch 5/5)

-> devfreq_cooling_em_register() -> devfreq_cooling_em_register_power(dfc_power = NULL, em obtained through various methods) -> of_devfreq_cooling_register_power(same as above)

Register devfreq cooling device with power_ops and em: (not used)

-> devfreq_cooling_em_register_power(dfc_power != NULL, em obtained through various methods) -> of_devfreq_cooling_register_power(same as above)

Register a devfreq cooling devices with power_ops but no em (not used)

-> of_devfreq_cooling_register_power(dfc_power != NULL)

Register a devfreq cooling devices without any kind of power information (em or dfc_power/power_ops)

-> devfreq_cooling_register() or of_devfreq_cooling_register() -> of_devfreq_cooling_register_power(dfc_power = NULL)

Given this, aren't we ending up with some possible calls to these registration functions that don't make sense? That is case 3, as of_devfreq_cooling_register_power() could not assign and later use power_ops without an em. For this usecase, 2 should be used instead.
In use case 3. you missed that the driver could registered EM by itself. Maybe wanted to manage the EM internally, for various reasons. Then this registration use case 3. makes sense.

Yes, the code allows it but it would be unlikely.

...

...
Therefore, can't the same be achieved by collapsing devfreq_cooling_em_register_power() into of_devfreq_cooling_register_power()? (with the user having the possibility to provide the em callback similarly to how get_real_power() is provided - in devfreq_cooling_power).

IMO is cleaner to unify the functionality (registration and callbacks) of cooling devices with power capabilities (based on em alone or together with power_ops). Otherwise we just create confusion for users registering cooling devices not knowing which function to call.

I don't want to add the code from devfreq_cooling_em_register_power() into the of_devfreq_cooling_register_power(), these are pretty dense functions with complicated error handling paths. In this shape and a few wrappers, which help users to register according to their needs, it looks OK.

There will be always a review of the coming drivers which would like to register.

Okay, no other arguments from my part.

I'll now take a look over v2. I just wanted to get some of these design choices out of the way first.

Thanks, Ionela.

...

...
If this has been discussed previously and I'm missing some details, please provide some links to the discussions.

Thank you for the patience :).

Ionela.

Lukasz Luba

21 Sep 21 Sep

12:20 p.m.

New subject: [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM

Remove old power model and use new Energy Model to calculate the power budget. It drops static + dynamic power calculations and power table in order to use Energy Model performance domain data. This model should be easy to use and could find more users. It is also less complicated to setup the needed structures.

Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/thermal/devfreq_cooling.c | 282 +++++++++--------------------- include/linux/devfreq_cooling.h | 17 -- 2 files changed, 86 insertions(+), 213 deletions(-)

diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c index 7e091e795284..9b66bba51d09 100644 --- a/drivers/thermal/devfreq_cooling.c +++ b/drivers/thermal/devfreq_cooling.c @@ -33,20 +33,17 @@ static DEFINE_IDA(devfreq_ida); * @cdev: Pointer to associated thermal cooling device. * @devfreq: Pointer to associated devfreq device. * @cooling_state: Current cooling state. - * @power_table: Pointer to table with maximum power draw for each - * cooling state. State is the index into the table, and - * the power is in mW. * @freq_table: Pointer to a table with the frequencies sorted in descending * order. You can index the table by cooling device state - * @freq_table_size: Size of the @freq_table and @power_table - * @power_ops: Pointer to devfreq_cooling_power, used to generate the - * @power_table. + * @max_state: It is the last index, that is, one less than the number of the + * OPPs + * @power_ops: Pointer to devfreq_cooling_power, a more precised model. * @res_util: Resource utilization scaling factor for the power. * It is multiplied by 100 to minimize the error. It is used * for estimation of the power budget instead of using - * 'utilization' (which is 'busy_time / 'total_time'). - * The 'res_util' range is from 100 to (power_table[state] * 100) - * for the corresponding 'state'. + * 'utilization' (which is 'busy_time' / 'total_time'). + * The 'res_util' range is from 100 to power * 100 for the + * corresponding 'state'. * @capped_state: index to cooling state with in dynamic power budget * @req_max_freq: PM QoS request for limiting the maximum frequency * of the devfreq device. @@ -58,9 +55,8 @@ struct devfreq_cooling_device { struct thermal_cooling_device *cdev; struct devfreq *devfreq; unsigned long cooling_state; - u32 *power_table; u32 *freq_table; - size_t freq_table_size; + size_t max_state; struct devfreq_cooling_power *power_ops; u32 res_util; int capped_state; @@ -74,7 +70,7 @@ static int devfreq_cooling_get_max_state(struct thermal_cooling_device *cdev, { struct devfreq_cooling_device *dfc = cdev->devdata;

- *state = dfc->freq_table_size - 1; + *state = dfc->max_state;

return 0; } @@ -102,10 +98,16 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev,

dev_dbg(dev, "Setting cooling state %lu\n", state);

- if (state >= dfc->freq_table_size) + if (state > dfc->max_state) return -EINVAL;

- freq = dfc->freq_table[state]; + if (dfc->em) { + /* Energy Model frequencies are in kHz */ + freq = dfc->em->table[dfc->max_state - state].frequency; + freq *= 1000; + } else { + freq = dfc->freq_table[state]; + }

dev_pm_qos_update_request(&dfc->req_max_freq, DIV_ROUND_UP(freq, HZ_PER_KHZ)); @@ -116,11 +118,11 @@ static int devfreq_cooling_set_cur_state(struct thermal_cooling_device *cdev, }

/** - * freq_get_state() - get the cooling state corresponding to a frequency + * freq_get_state() - get the performance index corresponding to a frequency * @dfc: Pointer to devfreq cooling device - * @freq: frequency in Hz + * @freq: frequency in kHz * - * Return: the cooling state associated with the @freq, or + * Return: the performance index associated with the @freq, or * THERMAL_CSTATE_INVALID if it wasn't found. */ static unsigned long @@ -128,8 +130,8 @@ freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq) { int i;

- for (i = 0; i < dfc->freq_table_size; i++) { - if (dfc->freq_table[i] == freq) + for (i = 0; i <= dfc->max_state; i++) { + if (dfc->em->table[i].frequency == freq) return i; }

@@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq) return voltage; }

-/** - * get_static_power() - calculate the static power - * @dfc: Pointer to devfreq cooling device - * @freq: Frequency in Hz - * - * Calculate the static power in milliwatts using the supplied - * get_static_power(). The current voltage is calculated using the - * OPP library. If no get_static_power() was supplied, assume the - * static power is negligible. - */ -static unsigned long -get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq) +static void dfc_em_get_requested_power(struct em_perf_domain *em, + struct devfreq_dev_status *status, + u32 *power, int em_perf_idx) { - struct devfreq *df = dfc->devfreq; - unsigned long voltage; + *power = em->table[em_perf_idx].power;

- if (!dfc->power_ops->get_static_power) - return 0; - - voltage = get_voltage(df, freq); - - if (voltage == 0) - return 0; - - return dfc->power_ops->get_static_power(df, voltage); -} - -/** - * get_dynamic_power - calculate the dynamic power - * @dfc: Pointer to devfreq cooling device - * @freq: Frequency in Hz - * @voltage: Voltage in millivolts - * - * Calculate the dynamic power in milliwatts consumed by the device at - * frequency @freq and voltage @voltage. If the get_dynamic_power() - * was supplied as part of the devfreq_cooling_power struct, then that - * function is used. Otherwise, a simple power model (Pdyn = Coeff * - * Voltage^2 * Frequency) is used. - */ -static unsigned long -get_dynamic_power(struct devfreq_cooling_device *dfc, unsigned long freq, - unsigned long voltage) -{ - u64 power; - u32 freq_mhz; - struct devfreq_cooling_power *dfc_power = dfc->power_ops; - - if (dfc_power->get_dynamic_power) - return dfc_power->get_dynamic_power(dfc->devfreq, freq, - voltage); - - freq_mhz = freq / 1000000; - power = (u64)dfc_power->dyn_power_coeff * freq_mhz * voltage * voltage; - do_div(power, 1000000000); - - return power; -} - - -static inline unsigned long get_total_power(struct devfreq_cooling_device *dfc, - unsigned long freq, - unsigned long voltage) -{ - return get_static_power(dfc, freq) + get_dynamic_power(dfc, freq, - voltage); + /* Scale power for utilization */ + *power *= status->busy_time; + *power /= status->total_time; }

static void _normalize_load(struct devfreq_dev_status *status) @@ -260,8 +206,6 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd unsigned long state; unsigned long freq; unsigned long voltage; - u32 dyn_power = 0; - u32 static_power = 0; int res;

mutex_lock(&df->lock); @@ -272,13 +216,14 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd

freq = status.current_frequency;

- state = freq_get_state(dfc, freq); + /* Energy Model frequencies are in kHz */ + state = freq_get_state(dfc, freq / 1000); if (state == THERMAL_CSTATE_INVALID) { res = -EAGAIN; goto fail; }

- if (dfc->power_ops->get_real_power) { + if (dfc->power_ops && dfc->power_ops->get_real_power) { voltage = get_voltage(df, freq); if (voltage == 0) { res = -EINVAL; @@ -288,7 +233,7 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd res = dfc->power_ops->get_real_power(df, power, freq, voltage); if (!res) { state = dfc->capped_state; - dfc->res_util = dfc->power_table[state]; + dfc->res_util = dfc->em->table[state].power; dfc->res_util *= SCALE_ERROR_MITIGATION;

if (*power > 1) @@ -297,17 +242,8 @@ static int devfreq_cooling_get_requested_power(struct thermal_cooling_device *cd goto fail; } } else { - dyn_power = dfc->power_table[state]; - _normalize_load(&status); - - /* Scale dynamic power for utilization */ - dyn_power *= status.busy_time; - dyn_power /= status.total_time; - /* Get static power */ - static_power = get_static_power(dfc, freq); - - *power = dyn_power + static_power; + dfc_em_get_requested_power(dfc->em, &status, power, state); }

trace_thermal_power_devfreq_get_power(cdev, &status, freq, *power); @@ -325,16 +261,14 @@ static int devfreq_cooling_state2power(struct thermal_cooling_device *cdev, u32 *power) { struct devfreq_cooling_device *dfc = cdev->devdata; - unsigned long freq; - u32 static_power; + int idx;

- if (state >= dfc->freq_table_size) + if (state > dfc->max_state) return -EINVAL;

- freq = dfc->freq_table[state]; - static_power = get_static_power(dfc, freq); + idx = dfc->max_state - state; + *power = dfc->em->table[idx].power;

- *power = dfc->power_table[state] + static_power; return 0; }

@@ -345,11 +279,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; struct devfreq_dev_status status; - unsigned long busy_time; + u32 est_power = power; unsigned long freq; - s32 dyn_power; - u32 static_power; - s32 est_power; int i;

mutex_lock(&df->lock); @@ -358,31 +289,26 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,

freq = status.current_frequency;

- if (dfc->power_ops->get_real_power) { + if (dfc->power_ops && dfc->power_ops->get_real_power) { /* Scale for resource utilization */ est_power = power * dfc->res_util; est_power /= SCALE_ERROR_MITIGATION; } else { - static_power = get_static_power(dfc, freq); - - dyn_power = power - static_power; - dyn_power = dyn_power > 0 ? dyn_power : 0; - - /* Scale dynamic power for utilization */ - busy_time = status.busy_time ?: 1; - est_power = (dyn_power * status.total_time) / busy_time; + _normalize_load(&status); + est_power *= status.total_time; + est_power /= status.busy_time; }

/* * Find the first cooling state that is within the power - * budget for dynamic power. + * budget. The EM power table is sorted ascending. */ - for (i = 0; i < dfc->freq_table_size - 1; i++) - if (est_power >= dfc->power_table[i]) + for (i = dfc->max_state; i > 0; i--) + if (est_power >= dfc->em->table[i].power) break;

- *state = i; - dfc->capped_state = i; + *state = dfc->max_state - i; + dfc->capped_state = *state; trace_thermal_power_devfreq_limit(cdev, freq, *state, power); return 0; } @@ -394,91 +320,43 @@ static struct thermal_cooling_device_ops devfreq_cooling_ops = { };

/** - * devfreq_cooling_gen_tables() - Generate power and freq tables. - * @dfc: Pointer to devfreq cooling device. - * - * Generate power and frequency tables: the power table hold the - * device's maximum power usage at each cooling state (OPP). The - * static and dynamic power using the appropriate voltage and - * frequency for the state, is acquired from the struct - * devfreq_cooling_power, and summed to make the maximum power draw. - * - * The frequency table holds the frequencies in descending order. - * That way its indexed by cooling device state. + * devfreq_cooling_gen_tables() - Generate frequency table. + * @dfc: Pointer to devfreq cooling device. + * @num_opps: Number of OPPs * - * The tables are malloced, and pointers put in dfc. They must be - * freed when unregistering the devfreq cooling device. + * Generate frequency table which holds the frequencies in descending + * order. That way its indexed by cooling device state. This is for + * compatibility with drivers which do not register Energy Model. * * Return: 0 on success, negative error code on failure. */ -static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc) +static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc, + int num_opps) { struct devfreq *df = dfc->devfreq; struct device *dev = df->dev.parent; - int ret, num_opps; unsigned long freq; - u32 *power_table = NULL; - u32 *freq_table; int i;

- num_opps = dev_pm_opp_get_opp_count(dev); - - if (dfc->power_ops) { - power_table = kcalloc(num_opps, sizeof(*power_table), - GFP_KERNEL); - if (!power_table) - return -ENOMEM; - } - - freq_table = kcalloc(num_opps, sizeof(*freq_table), + dfc->freq_table = kcalloc(num_opps, sizeof(*dfc->freq_table), GFP_KERNEL); - if (!freq_table) { - ret = -ENOMEM; - goto free_power_table; - } + if (!dfc->freq_table) + return -ENOMEM;

for (i = 0, freq = ULONG_MAX; i < num_opps; i++, freq--) { - unsigned long power, voltage; struct dev_pm_opp *opp;

opp = dev_pm_opp_find_freq_floor(dev, &freq); if (IS_ERR(opp)) { - ret = PTR_ERR(opp); - goto free_tables; + kfree(dfc->freq_table); + return PTR_ERR(opp); }

- voltage = dev_pm_opp_get_voltage(opp) / 1000; /* mV */ dev_pm_opp_put(opp); - - if (dfc->power_ops) { - if (dfc->power_ops->get_real_power) - power = get_total_power(dfc, freq, voltage); - else - power = get_dynamic_power(dfc, freq, voltage); - - dev_dbg(dev, "Power table: %lu MHz @ %lu mV: %lu = %lu mW\n", - freq / 1000000, voltage, power, power); - - power_table[i] = power; - } - - freq_table[i] = freq; + dfc->freq_table[i] = freq; }

- if (dfc->power_ops) - dfc->power_table = power_table; - - dfc->freq_table = freq_table; - dfc->freq_table_size = num_opps; - return 0; - -free_tables: - kfree(freq_table); -free_power_table: - kfree(power_table); - - return ret; }

/** @@ -503,7 +381,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df, struct thermal_cooling_device *cdev; struct devfreq_cooling_device *dfc; char dev_name[THERMAL_NAME_LENGTH]; - int err; + int err, num_opps;

dfc = kzalloc(sizeof(*dfc), GFP_KERNEL); if (!dfc) @@ -511,28 +389,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,

dfc->devfreq = df;

- if (dfc_power) { - dfc->power_ops = dfc_power; - + dfc->em = em_pd_get(df->dev.parent); + if (dfc->em) { devfreq_cooling_ops.get_requested_power = devfreq_cooling_get_requested_power; devfreq_cooling_ops.state2power = devfreq_cooling_state2power; devfreq_cooling_ops.power2state = devfreq_cooling_power2state; + + dfc->power_ops = dfc_power; + + num_opps = em_pd_nr_perf_states(dfc->em); + } else { + /* Backward compatibility for drivers which do not use IPA */ + dev_dbg(df->dev.parent, "missing EM for cooling device\n"); + + num_opps = dev_pm_opp_get_opp_count(df->dev.parent); + + err = devfreq_cooling_gen_tables(dfc, num_opps); + if (err) + goto free_dfc; }

- err = devfreq_cooling_gen_tables(dfc); - if (err) + if (num_opps <= 0) { + err = -EINVAL; goto free_dfc; + } + + /* max_state is an index, not a counter */ + dfc->max_state = num_opps - 1;

err = dev_pm_qos_add_request(df->dev.parent, &dfc->req_max_freq, DEV_PM_QOS_MAX_FREQUENCY, PM_QOS_MAX_FREQUENCY_DEFAULT_VALUE); if (err < 0) - goto free_tables; + goto free_table;

err = ida_simple_get(&devfreq_ida, 0, 0, GFP_KERNEL); if (err < 0) goto remove_qos_req; + dfc->id = err;

snprintf(dev_name, sizeof(dev_name), "thermal-devfreq-%d", dfc->id); @@ -553,12 +448,9 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,

release_ida: ida_simple_remove(&devfreq_ida, dfc->id); - remove_qos_req: dev_pm_qos_remove_request(&dfc->req_max_freq); - -free_tables: - kfree(dfc->power_table); +free_table: kfree(dfc->freq_table); free_dfc: kfree(dfc); @@ -699,9 +591,7 @@ void devfreq_cooling_unregister(struct thermal_cooling_device *cdev) if (dfc->em_registered) em_dev_unregister_perf_domain(dev);

- kfree(dfc->power_table); kfree(dfc->freq_table); - kfree(dfc); } EXPORT_SYMBOL_GPL(devfreq_cooling_unregister); diff --git a/include/linux/devfreq_cooling.h b/include/linux/devfreq_cooling.h index 19868fb922f1..4890b12b54b4 100644 --- a/include/linux/devfreq_cooling.h +++ b/include/linux/devfreq_cooling.h @@ -17,17 +17,6 @@

/** * struct devfreq_cooling_power - Devfreq cooling power ops - * @get_static_power: Take voltage, in mV, and return the static power - * in mW. If NULL, the static power is assumed - * to be 0. - * @get_dynamic_power: Take voltage, in mV, and frequency, in HZ, and - * return the dynamic power draw in mW. If NULL, - * a simple power model is used. - * @dyn_power_coeff: Coefficient for the simple dynamic power model in - * mW/(MHz mV mV). - * If get_dynamic_power() is NULL, then the - * dynamic power is calculated as - * @dyn_power_coeff * frequency * voltage^2 * @get_real_power: When this is set, the framework uses it to ask the * device driver for the actual power. * Some devices have more sophisticated methods @@ -47,14 +36,8 @@ * max total (static + dynamic) power value for each OPP. */ struct devfreq_cooling_power { - unsigned long (*get_static_power)(struct devfreq *devfreq, - unsigned long voltage); - unsigned long (*get_dynamic_power)(struct devfreq *devfreq, - unsigned long freq, - unsigned long voltage); int (*get_real_power)(struct devfreq *df, u32 *power, unsigned long freq, unsigned long voltage); - unsigned long dyn_power_coeff; };

#ifdef CONFIG_DEVFREQ_THERMAL

-- 2.17.1

Ionela Voinescu

7 Oct 7 Oct

3:12 p.m.

New subject: [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM

Hi Lukasz,

On Monday 21 Sep 2020 at 13:20:06 (+0100), Lukasz Luba wrote: [..]

...

/**

freq_get_state() - get the cooling state corresponding to a frequency

freq_get_state() - get the performance index corresponding to a frequency

If we change the meaning of the return value, I think the function needs a name change as well.

Also, we do treat this as a cooling state when we do validation and compare it to THERMAL_CSTATE_INVALID, but it's not actually a cooling state (it's max_state - state). It does create confusion if we name "state" both a performance index and a cooling state.

Given that the only user is devfreq_cooling_get_requested_power(), might be good to collapse freq_get_state() in that function and rename the "state" variable in there to "em_perf_idx".

...

@dfc: Pointer to devfreq cooling device

@freq: frequency in Hz

@freq: frequency in kHz

Return: the cooling state associated with the @freq, or

Return: the performance index associated with the @freq, or

THERMAL_CSTATE_INVALID if it wasn't found.

*/

static unsigned long @@ -128,8 +130,8 @@ freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq) { int i;
for (i = 0; i < dfc->freq_table_size; i++) {
if (dfc->freq_table[i] == freq)
for (i = 0; i <= dfc->max_state; i++) {
if (dfc->em->table[i].frequency == freq)
return i;
}
@@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq) return voltage; }

-/**

get_static_power() - calculate the static power

@dfc: Pointer to devfreq cooling device

@freq: Frequency in Hz

Calculate the static power in milliwatts using the supplied

get_static_power(). The current voltage is calculated using the

OPP library. If no get_static_power() was supplied, assume the

static power is negligible.

*/

-static unsigned long -get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq) +static void dfc_em_get_requested_power(struct em_perf_domain *em,
		       struct devfreq_dev_status *status,
		       u32 *power, int em_perf_idx)

Is there a reason for not directly returning the power value in this function? Also, this only does a few arithmetic operations and it's only called in one place. Is it worth to have this in a separate function?

[..]

...

@@ -345,11 +279,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; struct devfreq_dev_status status;

unsigned long busy_time;

u32 est_power = power;

Nit: You could use power directly and remove est_power as well.

...

unsigned long freq;

s32 dyn_power;

u32 static_power;

s32 est_power; int i;

mutex_lock(&df->lock);

@@ -358,31 +289,26 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,

freq = status.current_frequency;

if (dfc->power_ops->get_real_power) {

if (dfc->power_ops && dfc->power_ops->get_real_power) { /* Scale for resource utilization */ est_power = power * dfc->res_util; est_power /= SCALE_ERROR_MITIGATION; } else {
static_power = get_static_power(dfc, freq);
dyn_power = power - static_power;
dyn_power = dyn_power > 0 ? dyn_power : 0;
/* Scale dynamic power for utilization */
busy_time = status.busy_time ?: 1;
est_power = (dyn_power * status.total_time) / busy_time;
_normalize_load(&status);
est_power *= status.total_time;
est_power /= status.busy_time;
}

/*

Find the first cooling state that is within the power
* budget for dynamic power.
* budget. The EM power table is sorted ascending.
*/
for (i = 0; i < dfc->freq_table_size - 1; i++)
if (est_power >= dfc->power_table[i])
for (i = dfc->max_state; i > 0; i--)
if (est_power >= dfc->em->table[i].power)
break;
*state = i;

dfc->capped_state = i;

*state = dfc->max_state - i;

dfc->capped_state = *state; trace_thermal_power_devfreq_limit(cdev, freq, *state, power); return 0;

}

[..]

...

/** @@ -503,7 +381,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df, struct thermal_cooling_device *cdev; struct devfreq_cooling_device *dfc; char dev_name[THERMAL_NAME_LENGTH];

int err;

int err, num_opps;

dfc = kzalloc(sizeof(*dfc), GFP_KERNEL); if (!dfc)

@@ -511,28 +389,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,

dfc->devfreq = df;
if (dfc_power) {
dfc->power_ops = dfc_power;
dfc->em = em_pd_get(df->dev.parent);

if (dfc->em) { devfreq_cooling_ops.get_requested_power = devfreq_cooling_get_requested_power; devfreq_cooling_ops.state2power = devfreq_cooling_state2power; devfreq_cooling_ops.power2state = devfreq_cooling_power2state;
dfc->power_ops = dfc_power;
num_opps = em_pd_nr_perf_states(dfc->em);
} else {
/* Backward compatibility for drivers which do not use IPA */
dev_dbg(df->dev.parent, "missing EM for cooling device\n");
num_opps = dev_pm_opp_get_opp_count(df->dev.parent);
err = devfreq_cooling_gen_tables(dfc, num_opps);
if (err)
	goto free_dfc;
}
err = devfreq_cooling_gen_tables(dfc);

if (err)
if (num_opps <= 0) {
err = -EINVAL;
goto free_dfc;
}

/* max_state is an index, not a counter */

Nit: Might be more clear to replace "index" with cooling state. Then knowledge about cooling states would make this more clear.

Regards, Ionela.

Lukasz Luba

22 Oct 22 Oct

11:26 a.m.

New subject: [PATCH 4/5] thermal: devfreq_cooling: remove old power model and use EM

On 10/7/20 4:12 PM, Ionela Voinescu wrote:

...

Hi Lukasz,

On Monday 21 Sep 2020 at 13:20:06 (+0100), Lukasz Luba wrote: [..]

...
/**

freq_get_state() - get the cooling state corresponding to a frequency

freq_get_state() - get the performance index corresponding to a frequency

If we change the meaning of the return value, I think the function needs a name change as well.

Also, we do treat this as a cooling state when we do validation and compare it to THERMAL_CSTATE_INVALID, but it's not actually a cooling state (it's max_state - state). It does create confusion if we name "state" both a performance index and a cooling state.

Given that the only user is devfreq_cooling_get_requested_power(), might be good to collapse freq_get_state() in that function and rename the "state" variable in there to "em_perf_idx".

I will have a look into this.

...

...
@dfc: Pointer to devfreq cooling device

@freq: frequency in Hz

@freq: frequency in kHz

Return: the cooling state associated with the @freq, or

Return: the performance index associated with the @freq, or

THERMAL_CSTATE_INVALID if it wasn't found.

*/ static unsigned long

@@ -128,8 +130,8 @@ freq_get_state(struct devfreq_cooling_device *dfc, unsigned long freq) { int i;
for (i = 0; i < dfc->freq_table_size; i++) {
if (dfc->freq_table[i] == freq)
for (i = 0; i <= dfc->max_state; i++) {
if (dfc->em->table[i].frequency == freq)
return i;
}
@@ -164,71 +166,15 @@ static unsigned long get_voltage(struct devfreq *df, unsigned long freq) return voltage; }

-/**

get_static_power() - calculate the static power

@dfc: Pointer to devfreq cooling device

@freq: Frequency in Hz

Calculate the static power in milliwatts using the supplied

get_static_power(). The current voltage is calculated using the

OPP library. If no get_static_power() was supplied, assume the

static power is negligible.

*/

-static unsigned long -get_static_power(struct devfreq_cooling_device *dfc, unsigned long freq) +static void dfc_em_get_requested_power(struct em_perf_domain *em,
		       struct devfreq_dev_status *status,
		       u32 *power, int em_perf_idx)
Is there a reason for not directly returning the power value in this function? Also, this only does a few arithmetic operations and it's only called in one place. Is it worth to have this in a separate function?

Good question, maybe I will just put this code where it's called.

...

[..]

...
@@ -345,11 +279,8 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev, struct devfreq_cooling_device *dfc = cdev->devdata; struct devfreq *df = dfc->devfreq; struct devfreq_dev_status status;

unsigned long busy_time;

u32 est_power = power;

Nit: You could use power directly and remove est_power as well.

...
unsigned long freq;

s32 dyn_power;

u32 static_power;

s32 est_power; int i;

mutex_lock(&df->lock);

@@ -358,31 +289,26 @@ static int devfreq_cooling_power2state(struct thermal_cooling_device *cdev,

freq = status.current_frequency;

if (dfc->power_ops->get_real_power) {

if (dfc->power_ops && dfc->power_ops->get_real_power) { /* Scale for resource utilization */ est_power = power * dfc->res_util; est_power /= SCALE_ERROR_MITIGATION; } else {
static_power = get_static_power(dfc, freq);
dyn_power = power - static_power;
dyn_power = dyn_power > 0 ? dyn_power : 0;
/* Scale dynamic power for utilization */
busy_time = status.busy_time ?: 1;
est_power = (dyn_power * status.total_time) / busy_time;
_normalize_load(&status);
est_power *= status.total_time;
est_power /= status.busy_time;
}

/*

Find the first cooling state that is within the power
* budget for dynamic power.
* budget. The EM power table is sorted ascending.
*/
for (i = 0; i < dfc->freq_table_size - 1; i++)
if (est_power >= dfc->power_table[i])
for (i = dfc->max_state; i > 0; i--)
if (est_power >= dfc->em->table[i].power)
break;
*state = i;

dfc->capped_state = i;

*state = dfc->max_state - i;

dfc->capped_state = *state; trace_thermal_power_devfreq_limit(cdev, freq, *state, power); return 0; }
[..]

...
/** @@ -503,7 +381,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df, struct thermal_cooling_device *cdev; struct devfreq_cooling_device *dfc; char dev_name[THERMAL_NAME_LENGTH];

int err;

int err, num_opps;

dfc = kzalloc(sizeof(*dfc), GFP_KERNEL); if (!dfc)

@@ -511,28 +389,45 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,

dfc->devfreq = df;
if (dfc_power) {
dfc->power_ops = dfc_power;
dfc->em = em_pd_get(df->dev.parent);

if (dfc->em) { devfreq_cooling_ops.get_requested_power = devfreq_cooling_get_requested_power; devfreq_cooling_ops.state2power = devfreq_cooling_state2power; devfreq_cooling_ops.power2state = devfreq_cooling_power2state;
dfc->power_ops = dfc_power;
num_opps = em_pd_nr_perf_states(dfc->em);
} else {
/* Backward compatibility for drivers which do not use IPA */
dev_dbg(df->dev.parent, "missing EM for cooling device\n");
num_opps = dev_pm_opp_get_opp_count(df->dev.parent);
err = devfreq_cooling_gen_tables(dfc, num_opps);
if (err)
	goto free_dfc;
}
err = devfreq_cooling_gen_tables(dfc);

if (err)
if (num_opps <= 0) {
err = -EINVAL;
goto free_dfc;
}

/* max_state is an index, not a counter */
Nit: Might be more clear to replace "index" with cooling state. Then knowledge about cooling states would make this more clear.

Similar comment is in cpufreq_cooling.c. The 'index' here means the last valid index in the array.

Thank you for the review comments for all patches.

Regards, Lukasz

...

Regards, Ionela.

Lukasz Luba

21 Sep 21 Sep

12:20 p.m.

New subject: [PATCH 5/5] drm/panfrost: Register devfreq cooling and attempt to add Energy Model

Register devfreq cooling device and attempt to register Energy Model. This will add the devfreq device to the Energy Model framework. It will create a dedicated and unified data structures used i.e. in thermal framework. The last NULL parameter indicates that the power model is simplified and created based on DT 'dynamic-power-coefficient', voltage and frequency.

Reviewed-by: Steven Price steven.price@arm.com Reviewed-by: Alyssa Rosenzweig alyssa.rosenzweig@collabora.com Signed-off-by: Lukasz Luba lukasz.luba@arm.com --- drivers/gpu/drm/panfrost/panfrost_devfreq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_devfreq.c b/drivers/gpu/drm/panfrost/panfrost_devfreq.c index 8ab025d0035f..0d97176c9fb5 100644 --- a/drivers/gpu/drm/panfrost/panfrost_devfreq.c +++ b/drivers/gpu/drm/panfrost/panfrost_devfreq.c @@ -143,7 +143,7 @@ int panfrost_devfreq_init(struct panfrost_device *pfdev) } pfdevfreq->devfreq = devfreq;

- cooling = of_devfreq_cooling_register(dev->of_node, devfreq); + cooling = devfreq_cooling_em_register(devfreq, NULL); if (IS_ERR(cooling)) DRM_DEV_INFO(dev, "Failed to register cooling device\n"); else

-- 2.17.1

1603

Age (days ago)

1674

Last active (days ago)

dri-devel@lists.freedesktop.org

19 comments

3 participants

tags (0)

participants (3)

Daniel Lezcano
Ionela Voinescu
Lukasz Luba