Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
+Nick
It looks to be the old drm_plane_state->fb holds that reference. See dm_plane_helper_cleanup_fb() in amdgpu_dm.c.
Harry
On 2021-11-04 08:51, Christian König wrote:
Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
On Thu, Nov 04, 2021 at 12:44:34PM -0400, Harry Wentland wrote:
+Nick
It looks to be the old drm_plane_state->fb holds that reference. See dm_plane_helper_cleanup_fb() in amdgpu_dm.c.
BTW looks like you have a possible leak during fb init; amdgpu_display_framebuffer_init() grabs the refs to the BOs, but drm_framebuffer_init() might still fail (at least theoretically) which will then leak those BO refs.
Harry
On 2021-11-04 08:51, Christian König wrote:
Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
On Thu, Nov 04, 2021 at 12:44:34PM -0400, Harry Wentland wrote:
+Nick
It looks to be the old drm_plane_state->fb holds that reference. See dm_plane_helper_cleanup_fb() in amdgpu_dm.c.
Yup plane state holds reference for its entire existing (well this holds in general as design principle for atomic state structs, just makes life easier). And the plane state is guaranteed to exist from when we first pin (prepare_fb plane hook) to when it's getting unpinned (cleanup_fb plane hook).
Out of curiosity, what's blowing up? -Daniel
Harry
On 2021-11-04 08:51, Christian König wrote:
Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
Am 05.11.21 um 19:13 schrieb Daniel Vetter:
On Thu, Nov 04, 2021 at 12:44:34PM -0400, Harry Wentland wrote:
+Nick
It looks to be the old drm_plane_state->fb holds that reference. See dm_plane_helper_cleanup_fb() in amdgpu_dm.c.
Yup plane state holds reference for its entire existing (well this holds in general as design principle for atomic state structs, just makes life easier). And the plane state is guaranteed to exist from when we first pin (prepare_fb plane hook) to when it's getting unpinned (cleanup_fb plane hook).
Out of curiosity, what's blowing up?
The TTM pin count warning. What happens is that we try to free up a BO while it is still being pinned.
My best guess is that some DMA-buf client is doing something wrong, but it could of course also be that the BO was pinned for scanout.
Christian.
-Daniel
Harry
On 2021-11-04 08:51, Christian König wrote:
Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
On Mon, Nov 08, 2021 at 08:44:24AM +0100, Christian König wrote:
Am 05.11.21 um 19:13 schrieb Daniel Vetter:
On Thu, Nov 04, 2021 at 12:44:34PM -0400, Harry Wentland wrote:
+Nick
It looks to be the old drm_plane_state->fb holds that reference. See dm_plane_helper_cleanup_fb() in amdgpu_dm.c.
Yup plane state holds reference for its entire existing (well this holds in general as design principle for atomic state structs, just makes life easier). And the plane state is guaranteed to exist from when we first pin (prepare_fb plane hook) to when it's getting unpinned (cleanup_fb plane hook).
Out of curiosity, what's blowing up?
The TTM pin count warning. What happens is that we try to free up a BO while it is still being pinned.
My best guess is that some DMA-buf client is doing something wrong, but it could of course also be that the BO was pinned for scanout.
We check in dma_buf_release whether there's anything left over, so I think the dma-buf scenario is rather unlikely.
I guess worst case we could add a cookie struct to dma_buf_pin that you need to pass to dma_buf_unpin, and wherein we can capture a backtrace. Or maybe implement that in ttm even. Otherwise I don't have good ideas. -Daniel
Christian.
-Daniel
Harry
On 2021-11-04 08:51, Christian König wrote:
Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
Am 08.11.21 um 15:59 schrieb Daniel Vetter:
On Mon, Nov 08, 2021 at 08:44:24AM +0100, Christian König wrote:
Am 05.11.21 um 19:13 schrieb Daniel Vetter:
On Thu, Nov 04, 2021 at 12:44:34PM -0400, Harry Wentland wrote:
+Nick
It looks to be the old drm_plane_state->fb holds that reference. See dm_plane_helper_cleanup_fb() in amdgpu_dm.c.
Yup plane state holds reference for its entire existing (well this holds in general as design principle for atomic state structs, just makes life easier). And the plane state is guaranteed to exist from when we first pin (prepare_fb plane hook) to when it's getting unpinned (cleanup_fb plane hook).
Out of curiosity, what's blowing up?
The TTM pin count warning. What happens is that we try to free up a BO while it is still being pinned.
My best guess is that some DMA-buf client is doing something wrong, but it could of course also be that the BO was pinned for scanout.
We check in dma_buf_release whether there's anything left over, so I think the dma-buf scenario is rather unlikely.
That was unfortunately only added quite recently and is currently backported to older kernels.
I guess worst case we could add a cookie struct to dma_buf_pin that you need to pass to dma_buf_unpin, and wherein we can capture a backtrace. Or maybe implement that in ttm even. Otherwise I don't have good ideas.
I was thinking about something similar for ttm_bo_pin().
BTW: How would I do that? I know that dump_stack() prints the current stack trace into the system log, but how would I get this as string?
Thanks, Christian.
-Daniel
Christian.
-Daniel
Harry
On 2021-11-04 08:51, Christian König wrote:
Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
On Wed, Nov 10, 2021 at 11:18:02AM +0100, Christian König wrote:
Am 08.11.21 um 15:59 schrieb Daniel Vetter:
On Mon, Nov 08, 2021 at 08:44:24AM +0100, Christian König wrote:
Am 05.11.21 um 19:13 schrieb Daniel Vetter:
On Thu, Nov 04, 2021 at 12:44:34PM -0400, Harry Wentland wrote:
+Nick
It looks to be the old drm_plane_state->fb holds that reference. See dm_plane_helper_cleanup_fb() in amdgpu_dm.c.
Yup plane state holds reference for its entire existing (well this holds in general as design principle for atomic state structs, just makes life easier). And the plane state is guaranteed to exist from when we first pin (prepare_fb plane hook) to when it's getting unpinned (cleanup_fb plane hook).
Out of curiosity, what's blowing up?
The TTM pin count warning. What happens is that we try to free up a BO while it is still being pinned.
My best guess is that some DMA-buf client is doing something wrong, but it could of course also be that the BO was pinned for scanout.
We check in dma_buf_release whether there's anything left over, so I think the dma-buf scenario is rather unlikely.
That was unfortunately only added quite recently and is currently backported to older kernels.
I guess worst case we could add a cookie struct to dma_buf_pin that you need to pass to dma_buf_unpin, and wherein we can capture a backtrace. Or maybe implement that in ttm even. Otherwise I don't have good ideas.
I was thinking about something similar for ttm_bo_pin().
BTW: How would I do that? I know that dump_stack() prints the current stack trace into the system log, but how would I get this as string?
stack depot, not stuff it into a string. stack depot is a lot more efficient and capturing backtraces, since it de-dupes and hashes and all you need is a pointer iirc. linux/stackdepot.h for interface, I think we recently gained a few more users of it in drm. It's _really_ nifty. -Daniel
Thanks, Christian.
-Daniel
Christian.
-Daniel
Harry
On 2021-11-04 08:51, Christian König wrote:
Hi guys,
adding the usual suspects which might know that of hand: When we do a KMS page flip, who keeps the reference to the BO while it is scanned out?
We are running into warning backtraces from TTM which look more than odd.
Thanks, Christian.
Hi guys,
Am 10.11.21 um 14:26 schrieb Daniel Vetter:
[SNIP] stack depot, not stuff it into a string. stack depot is a lot more efficient and capturing backtraces, since it de-dupes and hashes and all you need is a pointer iirc. linux/stackdepot.h for interface, I think we recently gained a few more users of it in drm. It's _really_ nifty. -Daniel
thanks for the pointer Daniel, that framework indeed looks like it can become handy.
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=214621
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
What should we do about that?
Regards, Christian.
On 2021-11-12 13:47, Christian König wrote:
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=214621
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb & dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
On 2021-11-12 15:29, Michel Dänzer wrote:
On 2021-11-12 13:47, Christian König wrote:
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=214621
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb
This should say amdgpu_display_unpin_work_func.
& dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Am 12.11.21 um 15:30 schrieb Michel Dänzer:
On 2021-11-12 15:29, Michel Dänzer wrote:
On 2021-11-12 13:47, Christian König wrote:
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=214621
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb
This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
& dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
Regards, Christian.
On 2021-11-12 16:03, Christian König wrote:
Am 12.11.21 um 15:30 schrieb Michel Dänzer:
On 2021-11-12 15:29, Michel Dänzer wrote:
On 2021-11-12 13:47, Christian König wrote:
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=214621
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb
This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
Presumably.
& dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
On 2021-11-12 16:03, Christian König wrote:
Am 12.11.21 um 15:30 schrieb Michel Dänzer:
On 2021-11-12 15:29, Michel Dänzer wrote:
On 2021-11-12 13:47, Christian König wrote:
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k...
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb
This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
Presumably.
& dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Actually, each call to amdgpu_display_crtc_page_flip_target() will
1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer (crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it? Next call.
The problem is the pinned buffer of last call to amdgpu_display_crtc_page_flip_target() is not unpinned.
It should be an imbalance call to pin/unpin.
-- Earthling Michel Dänzer | https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fredhat.com... Libre software enthusiast | Mesa and Xwayland developer
On 2021-11-15 07:41, Lang Yu wrote:
On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
On 2021-11-12 16:03, Christian König wrote:
Am 12.11.21 um 15:30 schrieb Michel Dänzer:
On 2021-11-12 15:29, Michel Dänzer wrote:
On 2021-11-12 13:47, Christian König wrote:
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k...
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb
This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
Presumably.
& dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Actually, each call to amdgpu_display_crtc_page_flip_target() will
1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer (crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it? Next call.
The problem is the pinned buffer of last call to amdgpu_display_crtc_page_flip_target() is not unpinned.
It's unpinned in dce_v*_0_crtc_disable.
I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO from target_fb unconditionally, but unpin the BO from the fb parameter only if it's different from the former. So if they're the same, the BO's pin count is incremented by 1.
On Mon, Nov 15, 2021 at 09:38:47AM +0100, Michel DDDnzer wrote:
On 2021-11-15 07:41, Lang Yu wrote:
On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
On 2021-11-12 16:03, Christian König wrote:
Am 12.11.21 um 15:30 schrieb Michel Dänzer:
On 2021-11-12 15:29, Michel Dänzer wrote:
On 2021-11-12 13:47, Christian König wrote: > Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k... > > Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target(). > > In other words we somehow have an unbalanced pinning of the scanout buffer in DC. DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb
This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
Presumably.
& dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Actually, each call to amdgpu_display_crtc_page_flip_target() will
1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer (crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it? Next call.
The problem is the pinned buffer of last call to amdgpu_display_crtc_page_flip_target() is not unpinned.
It's unpinned in dce_v*_0_crtc_disable.
I just found crtc->primary->fb is NULL when came in dce_v*_0_crtc_disable(). So it's not unpinned...
I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO from target_fb unconditionally, but unpin the BO from the fb parameter only if it's different from the former. So if they're the same, the BO's pin count is incremented by 1.
-- Earthling Michel Dänzer | https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fredhat.com... Libre software enthusiast | Mesa and Xwayland developer
On 2021-11-15 10:04, Lang Yu wrote:
On Mon, Nov 15, 2021 at 09:38:47AM +0100, Michel DDDnzer wrote:
On 2021-11-15 07:41, Lang Yu wrote:
On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
On 2021-11-12 16:03, Christian König wrote:
Am 12.11.21 um 15:30 schrieb Michel Dänzer:
On 2021-11-12 15:29, Michel Dänzer wrote: > On 2021-11-12 13:47, Christian König wrote: >> Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k... >> >> Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target(). >> >> In other words we somehow have an unbalanced pinning of the scanout buffer in DC. > DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in > dm_plane_helper_cleanup_fb. > > > With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
Presumably.
> & dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Actually, each call to amdgpu_display_crtc_page_flip_target() will
1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer (crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it? Next call.
The problem is the pinned buffer of last call to amdgpu_display_crtc_page_flip_target() is not unpinned.
It's unpinned in dce_v*_0_crtc_disable.
I just found crtc->primary->fb is NULL when came in dce_v*_0_crtc_disable(). So it's not unpinned...
__drm_helper_disable_unused_functions sets crtc->primary->fb = NULL only after calling crtc_funcs->disable. Maybe this path can get hit for a CRTC which was already disabled, in which case crtc->primary->fb == NULL in dce_v*_0_crtc_disable is harmless.
Have you checked for the issue I described below? Should be pretty easy to catch.
I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO from target_fb unconditionally, but unpin the BO from the fb parameter only if it's different from the former. So if they're the same, the BO's pin count is incremented by 1.
On Mon, Nov 15, 2021 at 10:49:39AM +0100, Michel DDDnzer wrote:
On 2021-11-15 10:04, Lang Yu wrote:
On Mon, Nov 15, 2021 at 09:38:47AM +0100, Michel DDDnzer wrote:
On 2021-11-15 07:41, Lang Yu wrote:
On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
On 2021-11-12 16:03, Christian König wrote:
Am 12.11.21 um 15:30 schrieb Michel Dänzer: > On 2021-11-12 15:29, Michel Dänzer wrote: >> On 2021-11-12 13:47, Christian König wrote: >>> Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k... >>> >>> Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target(). >>> >>> In other words we somehow have an unbalanced pinning of the scanout buffer in DC. >> DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in >> dm_plane_helper_cleanup_fb. >> >> >> With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb > This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
Presumably.
>> & dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Actually, each call to amdgpu_display_crtc_page_flip_target() will
1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer (crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it? Next call.
The problem is the pinned buffer of last call to amdgpu_display_crtc_page_flip_target() is not unpinned.
It's unpinned in dce_v*_0_crtc_disable.
I just found crtc->primary->fb is NULL when came in dce_v*_0_crtc_disable(). So it's not unpinned...
__drm_helper_disable_unused_functions sets crtc->primary->fb = NULL only after calling crtc_funcs->disable. Maybe this path can get hit for a CRTC which was already disabled, in which case crtc->primary->fb == NULL in dce_v*_0_crtc_disable is harmless.
Have you checked for the issue I described below? Should be pretty easy to catch.
I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO from target_fb unconditionally, but unpin the BO from the fb parameter only if it's different from the former. So if they're the same, the BO's pin count is incremented by 1.
Form my observations, amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is never called. Though a single call to dce_v*_0_crtc_do_set_base() will only pin the BO, I found it will be unpinned in next call to dce_v*_0_crtc_do_set_base(). Anyway, these pin/unpin operations looks error-prone.
Regards, Lang
-- Earthling Michel Dänzer | https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fredhat.com... Libre software enthusiast | Mesa and Xwayland developer
On 2021-11-15 12:31, Lang Yu wrote:
On Mon, Nov 15, 2021 at 10:49:39AM +0100, Michel DDDnzer wrote:
On 2021-11-15 10:04, Lang Yu wrote:
On Mon, Nov 15, 2021 at 09:38:47AM +0100, Michel DDDnzer wrote:
On 2021-11-15 07:41, Lang Yu wrote:
On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
On 2021-11-12 16:03, Christian König wrote: > Am 12.11.21 um 15:30 schrieb Michel Dänzer: >> On 2021-11-12 15:29, Michel Dänzer wrote: >>> On 2021-11-12 13:47, Christian König wrote: >>>> Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k... >>>> >>>> Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target(). >>>> >>>> In other words we somehow have an unbalanced pinning of the scanout buffer in DC. >>> DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in >>> dm_plane_helper_cleanup_fb. >>> >>> >>> With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb >> This should say amdgpu_display_unpin_work_func. > > Ah! So that is the classic (e.g. non atomic) path?
Presumably.
>>> & dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned? > > Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called. > > E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Actually, each call to amdgpu_display_crtc_page_flip_target() will
1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer (crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it? Next call.
The problem is the pinned buffer of last call to amdgpu_display_crtc_page_flip_target() is not unpinned.
It's unpinned in dce_v*_0_crtc_disable.
I just found crtc->primary->fb is NULL when came in dce_v*_0_crtc_disable(). So it's not unpinned...
__drm_helper_disable_unused_functions sets crtc->primary->fb = NULL only after calling crtc_funcs->disable. Maybe this path can get hit for a CRTC which was already disabled, in which case crtc->primary->fb == NULL in dce_v*_0_crtc_disable is harmless.
Have you checked for the issue I described below? Should be pretty easy to catch.
I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO from target_fb unconditionally, but unpin the BO from the fb parameter only if it's different from the former. So if they're the same, the BO's pin count is incremented by 1.
Form my observations, amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is never called.
It would be expected to happen when the screen turns off, e.g. due to DPMS.
Though a single call to dce_v*_0_crtc_do_set_base() will only pin the BO, I found it will be unpinned in next call to dce_v*_0_crtc_do_set_base().
Yeah, that's the normal case when the new BO is different from the old one.
To catch the case I described, try something like
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..5726bd87a355 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc *crtc,
return r;
if (!atomic) {
+ WARN_ON_ONCE(target_fb == fb);
r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM);
if (unlikely(r != 0)) {
amdgpu_bo_unreserve(abo);
On Mon, Nov 15, 2021 at 01:04:15PM +0100, Michel DDDnzer wrote:
On 2021-11-15 12:31, Lang Yu wrote:
On Mon, Nov 15, 2021 at 10:49:39AM +0100, Michel DDDnzer wrote:
On 2021-11-15 10:04, Lang Yu wrote:
On Mon, Nov 15, 2021 at 09:38:47AM +0100, Michel DDDnzer wrote:
On 2021-11-15 07:41, Lang Yu wrote:
On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote: > On 2021-11-12 16:03, Christian König wrote: >> Am 12.11.21 um 15:30 schrieb Michel Dänzer: >>> On 2021-11-12 15:29, Michel Dänzer wrote: >>>> On 2021-11-12 13:47, Christian König wrote: >>>>> Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k... >>>>> >>>>> Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target(). >>>>> >>>>> In other words we somehow have an unbalanced pinning of the scanout buffer in DC. >>>> DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in >>>> dm_plane_helper_cleanup_fb. >>>> >>>> >>>> With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb >>> This should say amdgpu_display_unpin_work_func. >> >> Ah! So that is the classic (e.g. non atomic) path? > > Presumably. > > >>>> & dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned? >> >> Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called. >> >> E.g. not pin unbalance, but rather use after free. > > amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Actually, each call to amdgpu_display_crtc_page_flip_target() will
1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer (crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it? Next call.
The problem is the pinned buffer of last call to amdgpu_display_crtc_page_flip_target() is not unpinned.
It's unpinned in dce_v*_0_crtc_disable.
I just found crtc->primary->fb is NULL when came in dce_v*_0_crtc_disable(). So it's not unpinned...
__drm_helper_disable_unused_functions sets crtc->primary->fb = NULL only after calling crtc_funcs->disable. Maybe this path can get hit for a CRTC which was already disabled, in which case crtc->primary->fb == NULL in dce_v*_0_crtc_disable is harmless.
Have you checked for the issue I described below? Should be pretty easy to catch.
I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO from target_fb unconditionally, but unpin the BO from the fb parameter only if it's different from the former. So if they're the same, the BO's pin count is incremented by 1.
Form my observations, amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is never called.
It would be expected to happen when the screen turns off, e.g. due to DPMS.
Though a single call to dce_v*_0_crtc_do_set_base() will only pin the BO, I found it will be unpinned in next call to dce_v*_0_crtc_do_set_base().
Yeah, that's the normal case when the new BO is different from the old one.
To catch the case I described, try something like
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..5726bd87a355 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc *crtc,
return r; if (!atomic) {
WARN_ON_ONCE(target_fb == fb); r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM); if (unlikely(r != 0)) { amdgpu_bo_unreserve(abo);
I did some tests, the warning can be triggered.
pin/unpin operations in *_crtc_do_set_base() and amdgpu_display_crtc_page_flip_target() are mixed.
Regards, Lang
-- Earthling Michel Dänzer | https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fredhat.com... Libre software enthusiast | Mesa and Xwayland developer
Am 16.11.21 um 04:27 schrieb Lang Yu:
On Mon, Nov 15, 2021 at 01:04:15PM +0100, Michel DDDnzer wrote:
[SNIP]
Though a single call to dce_v*_0_crtc_do_set_base() will only pin the BO, I found it will be unpinned in next call to dce_v*_0_crtc_do_set_base().
Yeah, that's the normal case when the new BO is different from the old one.
To catch the case I described, try something like
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..5726bd87a355 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc *crtc,
return r; if (!atomic) {
WARN_ON_ONCE(target_fb == fb); r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM); if (unlikely(r != 0)) { amdgpu_bo_unreserve(abo);
I did some tests, the warning can be triggered.
pin/unpin operations in *_crtc_do_set_base() and amdgpu_display_crtc_page_flip_target() are mixed.
Ok sounds like we narrowed down the root cause pretty well.
Question is now how can we fix this? Just not pin the BO when target_fb == fb?
Thanks, Christian.
Regards, Lang
On Tue, Nov 16, 2021 at 08:14:08AM +0100, Christian KKKnig wrote:
Am 16.11.21 um 04:27 schrieb Lang Yu:
On Mon, Nov 15, 2021 at 01:04:15PM +0100, Michel DDDnzer wrote:
[SNIP]
Though a single call to dce_v*_0_crtc_do_set_base() will only pin the BO, I found it will be unpinned in next call to dce_v*_0_crtc_do_set_base().
Yeah, that's the normal case when the new BO is different from the old one.
To catch the case I described, try something like
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..5726bd87a355 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc *crtc,
return r; if (!atomic) {
WARN_ON_ONCE(target_fb == fb); r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM); if (unlikely(r != 0)) { amdgpu_bo_unreserve(abo);
I did some tests, the warning can be triggered.
pin/unpin operations in *_crtc_do_set_base() and amdgpu_display_crtc_page_flip_target() are mixed.
Ok sounds like we narrowed down the root cause pretty well.
Question is now how can we fix this? Just not pin the BO when target_fb == fb?
That worked. I did a few simple tests and didn't observe ttm_bo_release warnings any more.
The pin/unpin logic,
1, fist crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), old_fb is NULL.
2, second crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb.
3, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
4, third crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb (it is pinned in last call to amdgpu_display_crtc_page_flip_target)
5, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
.....
x, reboot, amdgpu_display_suspend_helper() is called, the last pinned fb was unpinned.
And I didn't observe amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is called.
If the logic is wrong, please correct me.
Regards, Lang
Thanks, Christian.
Regards, Lang
Am 16.11.21 um 09:00 schrieb Lang Yu:
On Tue, Nov 16, 2021 at 08:14:08AM +0100, Christian KKKnig wrote:
Am 16.11.21 um 04:27 schrieb Lang Yu:
On Mon, Nov 15, 2021 at 01:04:15PM +0100, Michel DDDnzer wrote:
[SNIP]
Though a single call to dce_v*_0_crtc_do_set_base() will only pin the BO, I found it will be unpinned in next call to dce_v*_0_crtc_do_set_base().
Yeah, that's the normal case when the new BO is different from the old one.
To catch the case I described, try something like
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..5726bd87a355 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc *crtc,
return r; if (!atomic) {
WARN_ON_ONCE(target_fb == fb); r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM); if (unlikely(r != 0)) { amdgpu_bo_unreserve(abo);
I did some tests, the warning can be triggered.
pin/unpin operations in *_crtc_do_set_base() and amdgpu_display_crtc_page_flip_target() are mixed.
Ok sounds like we narrowed down the root cause pretty well.
Question is now how can we fix this? Just not pin the BO when target_fb == fb?
That worked. I did a few simple tests and didn't observe ttm_bo_release warnings any more.
The pin/unpin logic,
1, fist crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), old_fb is NULL.
2, second crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb.
3, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
4, third crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb (it is pinned in last call to amdgpu_display_crtc_page_flip_target)
5, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
.....
x, reboot, amdgpu_display_suspend_helper() is called, the last pinned fb was unpinned.
And I didn't observe amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is called.
If the logic is wrong, please correct me.
I can't fully judge because I'm not that deep with my head in the old display code, but from a ten mile high view it sounds sane to me. Michel what do you think?
BTW: Nicholas are there any plans to get rid of all that stuff? It would be a really nice cleanup of rather flaky code I think.
Thanks, Christian.
Regards, Lang
Thanks, Christian.
Regards, Lang
On Tue, Nov 16, 2021 at 3:09 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 16.11.21 um 09:00 schrieb Lang Yu:
On Tue, Nov 16, 2021 at 08:14:08AM +0100, Christian KKKnig wrote:
Am 16.11.21 um 04:27 schrieb Lang Yu:
On Mon, Nov 15, 2021 at 01:04:15PM +0100, Michel DDDnzer wrote:
[SNIP]
Though a single call to dce_v*_0_crtc_do_set_base() will only pin the BO, I found it will be unpinned in next call to dce_v*_0_crtc_do_set_base().
Yeah, that's the normal case when the new BO is different from the old one.
To catch the case I described, try something like
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..5726bd87a355 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc *crtc,
return r; if (!atomic) {
WARN_ON_ONCE(target_fb == fb); r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM); if (unlikely(r != 0)) { amdgpu_bo_unreserve(abo);
I did some tests, the warning can be triggered.
pin/unpin operations in *_crtc_do_set_base() and amdgpu_display_crtc_page_flip_target() are mixed.
Ok sounds like we narrowed down the root cause pretty well.
Question is now how can we fix this? Just not pin the BO when target_fb == fb?
That worked. I did a few simple tests and didn't observe ttm_bo_release warnings any more.
The pin/unpin logic,
1, fist crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), old_fb is NULL.
2, second crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb.
3, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
4, third crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb (it is pinned in last call to amdgpu_display_crtc_page_flip_target)
5, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
.....
x, reboot, amdgpu_display_suspend_helper() is called, the last pinned fb was unpinned.
And I didn't observe amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is called.
If the logic is wrong, please correct me.
I can't fully judge because I'm not that deep with my head in the old display code, but from a ten mile high view it sounds sane to me. Michel what do you think?
BTW: Nicholas are there any plans to get rid of all that stuff? It would be a really nice cleanup of rather flaky code I think.
We just need to add analog support to the DC code. Darren was looking into it.
Alex
Thanks, Christian.
Regards, Lang
Thanks, Christian.
Regards, Lang
On 2021-11-16 15:10, Alex Deucher wrote:
On Tue, Nov 16, 2021 at 3:09 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 16.11.21 um 09:00 schrieb Lang Yu:
On Tue, Nov 16, 2021 at 08:14:08AM +0100, Christian KKKnig wrote:
Am 16.11.21 um 04:27 schrieb Lang Yu:
On Mon, Nov 15, 2021 at 01:04:15PM +0100, Michel DDDnzer wrote:
[SNIP] > Though a single call to dce_v*_0_crtc_do_set_base() will > only pin the BO, I found it will be unpinned in next call to > dce_v*_0_crtc_do_set_base(). Yeah, that's the normal case when the new BO is different from the old one.
To catch the case I described, try something like
diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
index 18a7b3bd633b..5726bd87a355 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c
@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc *crtc,
return r; if (!atomic) {
WARN_ON_ONCE(target_fb == fb); r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM); if (unlikely(r != 0)) { amdgpu_bo_unreserve(abo);
I did some tests, the warning can be triggered.
pin/unpin operations in *_crtc_do_set_base() and amdgpu_display_crtc_page_flip_target() are mixed.
Ok sounds like we narrowed down the root cause pretty well.
Question is now how can we fix this? Just not pin the BO when target_fb == fb?
That worked. I did a few simple tests and didn't observe ttm_bo_release warnings any more.
The pin/unpin logic,
1, fist crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), old_fb is NULL.
2, second crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb.
3, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
4, third crtc_mode_set, dce_v*_0_crtc_do_set_base() pins crtc->primary->fb(new), unpins old fb (it is pinned in last call to amdgpu_display_crtc_page_flip_target)
5, amdgpu_display_crtc_page_flip_target() pin/unpin operations.
.....
x, reboot, amdgpu_display_suspend_helper() is called, the last pinned fb was unpinned.
And I didn't observe amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is called.
If the logic is wrong, please correct me.
I can't fully judge because I'm not that deep with my head in the old display code, but from a ten mile high view it sounds sane to me. Michel what do you think?
Sounds good to me.
Would be nice to address the other issue identified in this thread as well:
The pin in amdgpu_display_crtc_page_flip_target is guarded by if (!adev->enable_virtual_display), but the corresponding unpins in amdgpu_display_unpin_work_func & dce_v*_0_crtc_disable aren't. This probably results in pin count underflows with virtual display.
Am 12.11.21 um 17:10 schrieb Michel Dänzer:
On 2021-11-12 16:03, Christian König wrote:
Am 12.11.21 um 15:30 schrieb Michel Dänzer:
On 2021-11-12 15:29, Michel Dänzer wrote:
On 2021-11-12 13:47, Christian König wrote:
Anyway this unfortunately turned out to be work for Harray and Nicholas. In detail it's about this bug report here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.k...
Lang was able to reproduce the issue and narrow it down to the pin in amdgpu_display_crtc_page_flip_target().
In other words we somehow have an unbalanced pinning of the scanout buffer in DC.
DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired with the unpin in dm_plane_helper_cleanup_fb.
With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired with the unpin in dm_plane_helper_cleanup_fb
This should say amdgpu_display_unpin_work_func.
Ah! So that is the classic (e.g. non atomic) path?
Presumably.
& dce_v*_crtc_disable. One thing I notice is that the pin is guarded by if (!adev->enable_virtual_display), but the unpins seem unconditional. So could this be about virtual display, and the problem is actually trying to unpin a BO that was never pinned?
Nope, my educated guess is rather that we free up the BO before amdgpu_display_unpin_work_func is called.
E.g. not pin unbalance, but rather use after free.
amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(&work->old_abo) only after amdgpu_bo_unpin. So what you describe could only happen if there's an imbalance elsewhere such that amdgpu_bo_unref is called more often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer after flip" error message should appear in dmesg).
Yeah, seen that in the meantime as well.
But we also have a WARN_ON() when the pincount overruns, so that can't be it either.
Long story short I have no idea what's going on here.
Regards, Christian.
dri-devel@lists.freedesktop.org