[PATCH] drm/atomic: Convert a warning to dbg atomic printk

List overview All Threads
Download

newer

older

[PATCH 0/3] Correct disp_aal...

[PATCH] drm/i915: consider...

Zack Rusin

4 Apr 2022 4 Apr '22

6:21 p.m.

From: Zack Rusin zackr@vmware.com

By default each flip times out after 0.1 sec and a warning about the time out is added to the kernel log. The warning is harmless because there's another flip coming right after but it can quickly fill up the log, e.g. missing 2 flips every second over a 24 hour span will add about 172 thousand lines to the log. Skip printing it by default and instead add it to the atomic drm debug category. It can be trivially enabled via drm.debug=0x10 and it avoids filling up logs by default.

This is particularly important on vm's where presentation might be CPU bound and running multiple vm's or under a CPU heavy load can easily lead to missing mutliple flips and any actual errors are often blamed on this due to it being so prevalent in the logs. Fixes a bunch of bugs reported to RedHat (1792515, 1884401, 1896139 and 1884401) and SUSE (210123) where people complain about having their logs spammed by this message and blame various errors on it.

Signed-off-by: Zack Rusin zackr@vmware.com --- drivers/gpu/drm/drm_atomic_helper.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index 9603193d2fa1..e88b4d83fd5c 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -1566,8 +1566,8 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev,

ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ); if (ret == 0) - drm_err(dev, "[CRTC:%d:%s] flip_done timed out\n", - crtc->base.id, crtc->name); + drm_dbg_atomic(dev, "[CRTC:%d:%s] flip_done timed out\n", + crtc->base.id, crtc->name); }

if (old_state->fake_commit)

-- 2.32.0

Show replies by date

Michel Dänzer

5 Apr 5 Apr

5:09 p.m.

On 2022-04-04 20:21, Zack Rusin wrote:

...

From: Zack Rusin zackr@vmware.com

By default each flip times out after 0.1 sec

10 * HZ is 10 seconds, not 0.1.

...

and a warning about the time out is added to the kernel log. The warning is harmless because there's another flip coming right after but it can quickly fill up the log, e.g. missing 2 flips every second over a 24 hour span will add about 172 thousand lines to the log.

As we discussed before, while this might be true for the vmwgfx driver, for other drivers this message indicates that either the GPU hung, or something else went wrong spectacularly. As such, I think we do want to see these messages by default for other drivers at least.

I suspect this just papers over the real issue even with vmwgfx though.

-- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and Xwayland developer

Zack Rusin

10:47 p.m.

On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:

...

⚠ External Email

On 2022-04-04 20:21, Zack Rusin wrote:

...
From: Zack Rusin zackr@vmware.com

By default each flip times out after 0.1 sec

10 * HZ is 10 seconds, not 0.1.

Yea, sorry, this entire commit message is not correct. I've sent out a very old diff with a log the best I could remember. I recall our conversation now and iirc we said that maybe an interface through drm atomic code to enable/disable this error is the way to go but after looking at this again I'm not sure. More below.

...

...
and a warning about the time out is added to the kernel log. The warning is harmless because there's another flip coming right after but it can quickly fill up the log, e.g. missing 2 flips every second over a 24 hour span will add about 172 thousand lines to the log.

As we discussed before, while this might be true for the vmwgfx driver, for other drivers this message indicates that either the GPU hung, or something else went wrong spectacularly. As such, I think we do want to see these messages by default for other drivers at least.

I'm not going to argue for or against that but I am curious what's the point of the message. The message is basically saying "something could possibly have went very wrong". OK, what's next? Especially if there's no visible problems and it's not reproducible. Even if it would be reproducible there's nothing actionable from the message itself. If the system has no output connected and no users are currently logged in and we missed a flip, does it matter?

...

I suspect this just papers over the real issue even with vmwgfx though.

This goes back to the above. I think you, me and Dave looked at the logs from those bugs and none of us seem to know what do do about it. Lacking some other error messages there seems to be nothing this error adds with vmwgfx or without (we have been adding more logging to vmwgfx so hopefully with newer kernels we could get some actionable errors but that's orthogonal to this).

Michel Dänzer

6 Apr 6 Apr

8:30 a.m.

On 2022-04-06 00:47, Zack Rusin wrote:

...

On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:

...
On 2022-04-04 20:21, Zack Rusin wrote:

...
From: Zack Rusin zackr@vmware.com

By default each flip times out after 0.1 sec

10 * HZ is 10 seconds, not 0.1.

Yea, sorry, this entire commit message is not correct. I've sent out a very old diff with a log the best I could remember. I recall our conversation now and iirc we said that maybe an interface through drm atomic code to enable/disable this error is the way to go but after looking at this again I'm not sure. More below.

...
...
and a warning about the time out is added to the kernel log. The warning is harmless because there's another flip coming right after but it can quickly fill up the log, e.g. missing 2 flips every second over a 24 hour span will add about 172 thousand lines to the log.

As we discussed before, while this might be true for the vmwgfx driver, for other drivers this message indicates that either the GPU hung, or something else went wrong spectacularly. As such, I think we do want to see these messages by default for other drivers at least.

I'm not going to argue for or against that but I am curious what's the point of the message. The message is basically saying "something could possibly have went very wrong". OK, what's next? Especially if there's no visible problems and it's not reproducible. Even if it would be reproducible there's nothing actionable from the message itself. If the system has no output connected and no users are currently logged in and we missed a flip, does it matter?

I don't think waiting for a 10 second timeout is the appropriate behaviour in that case. While a KMS CRTC is enabled, the driver needs to make it work, in the worst case via a timer which ticks at the CRTC refresh rate.

...

...
I suspect this just papers over the real issue even with vmwgfx though.

This goes back to the above. I think you, me and Dave looked at the logs from those bugs and none of us seem to know what do do about it. Lacking some other error messages there seems to be nothing this error adds with vmwgfx or without (we have been adding more logging to vmwgfx so hopefully with newer kernels we could get some actionable errors but that's orthogonal to this).

The error means that either:

* A flip actually didn't complete in 10 seconds. * There's some kind of time tracking issue which results in the timer firing after less than 10 seconds (of the system actually running).

Either way, it's an issue which should be fixed rather than just swept under the rug.

-- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and Xwayland developer

Zack Rusin

2:10 p.m.

On Wed, 2022-04-06 at 10:30 +0200, Michel Dänzer wrote:

...

⚠ External Email

On 2022-04-06 00:47, Zack Rusin wrote:

...
On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:

...
On 2022-04-04 20:21, Zack Rusin wrote:

...
From: Zack Rusin zackr@vmware.com

By default each flip times out after 0.1 sec

10 * HZ is 10 seconds, not 0.1.

Yea, sorry, this entire commit message is not correct. I've sent out a very old diff with a log the best I could remember. I recall our conversation now and iirc we said that maybe an interface through drm atomic code to enable/disable this error is the way to go but after looking at this again I'm not sure. More below.

...
...
and a warning about the time out is added to the kernel log. The warning is harmless because there's another flip coming right after but it can quickly fill up the log, e.g. missing 2 flips every second over a 24 hour span will add about 172 thousand lines to the log.

As we discussed before, while this might be true for the vmwgfx driver, for other drivers this message indicates that either the GPU hung, or something else went wrong spectacularly. As such, I think we do want to see these messages by default for other drivers at least.

I'm not going to argue for or against that but I am curious what's the point of the message. The message is basically saying "something could possibly have went very wrong". OK, what's next? Especially if there's no visible problems and it's not reproducible. Even if it would be reproducible there's nothing actionable from the message itself. If the system has no output connected and no users are currently logged in and we missed a flip, does it matter?

I don't think waiting for a 10 second timeout is the appropriate behaviour in that case. While a KMS CRTC is enabled, the driver needs to make it work, in the worst case via a timer which ticks at the CRTC refresh rate.

What if it isn't a driver bug and it just so happens that vmwgfx is the only one that's running on some specific system without a display long enough? Without being able to reproduce how would you know?

...

The error means that either:

A flip actually didn't complete in 10 seconds.

There's some kind of time tracking issue which results in the timer

firing after less than 10 seconds (of the system actually running).

Either way, it's an issue which should be fixed rather than just swept under the rug.

Without being able to reproduce or really confirm that it's an actual problem in vmwgfx I'm going to leave this to you then. I'd be happy to look at it again or help out in any way I can if you find any problems we can do anything about.

Michel Dänzer

7 Apr 7 Apr

8:34 a.m.

On 2022-04-06 16:10, Zack Rusin wrote:

...

On Wed, 2022-04-06 at 10:30 +0200, Michel Dänzer wrote:

...
On 2022-04-06 00:47, Zack Rusin wrote:

...
On Tue, 2022-04-05 at 19:09 +0200, Michel Dänzer wrote:

...
On 2022-04-04 20:21, Zack Rusin wrote:

...
From: Zack Rusin zackr@vmware.com

By default each flip times out after 0.1 sec

10 * HZ is 10 seconds, not 0.1.

Yea, sorry, this entire commit message is not correct. I've sent out a very old diff with a log the best I could remember. I recall our conversation now and iirc we said that maybe an interface through drm atomic code to enable/disable this error is the way to go but after looking at this again I'm not sure. More below.

...
...
and a warning about the time out is added to the kernel log. The warning is harmless because there's another flip coming right after but it can quickly fill up the log, e.g. missing 2 flips every second over a 24 hour span will add about 172 thousand lines to the log.

As we discussed before, while this might be true for the vmwgfx driver, for other drivers this message indicates that either the GPU hung, or something else went wrong spectacularly. As such, I think we do want to see these messages by default for other drivers at least.

I'm not going to argue for or against that but I am curious what's the point of the message. The message is basically saying "something could possibly have went very wrong". OK, what's next? Especially if there's no visible problems and it's not reproducible. Even if it would be reproducible there's nothing actionable from the message itself. If the system has no output connected and no users are currently logged in and we missed a flip, does it matter?

I don't think waiting for a 10 second timeout is the appropriate behaviour in that case. While a KMS CRTC is enabled, the driver needs to make it work, in the worst case via a timer which ticks at the CRTC refresh rate.

What if it isn't a driver bug

Does it make a difference? It's still a bug.

...

and it just so happens that vmwgfx is the only one that's running on some specific system without a display long enough?

FWIW, other drivers have hit this same error before as well, but not just because no outputs were enabled. If it's really related to that, a vmwgfx bug seems most likely.

...

...
The error means that either:

A flip actually didn't complete in 10 seconds.

There's some kind of time tracking issue which results in the timer

firing after less than 10 seconds (of the system actually running).

Either way, it's an issue which should be fixed rather than just swept under the rug.

Without being able to reproduce or really confirm that it's an actual problem in vmwgfx I'm going to leave this to you then.

Nice try. :) This is for vmwgfx developers.

-- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and Xwayland developer

1119

Age (days ago)

1122

Last active (days ago)

dri-devel@lists.freedesktop.org

5 comments

3 participants

tags (0)

participants (3)

Michel Dänzer
Zack Rusin
Zack Rusin