Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)

13 Mar 2020

      On Fri, Mar 13, 2020 at 5:02 PM Michel Dänzer michel@daenzer.net wrote:
...
On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
...
On 2020-03-12 10:32 a.m., Alex Deucher wrote:
...
On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
mario.kleiner.de@gmail.com wrote:
...
Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
events at vsartup for DCN")' introduces a new way of pageflip
completion handling for DCN, and some trouble.
The current implementation introduces a race condition, which
can cause pageflip completion events to be sent out one vblank
too early, thereby confusing userspace and causing flicker:
prepare_flip_isr():

Pageflip programming takes the ddev->event_lock.
Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
Releases ddev->event_lock.

--> Deadline for surface address regs double-buffering passes on
     target pipe.

dc_commit_updates_for_stream() MMIO programs the new pageflip
 into hw, but too late for current vblank.

=> pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
    in current vblank due to missing the double-buffering deadline
    by a tiny bit.

VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
 dm_dcn_crtc_high_irq() gets called.

Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
 pageflip has been completed/will complete in this vblank and
 sends out pageflip completion event to userspace and resets
 pflip_status = AMDGPU_FLIP_NONE.

=> Flip completion event sent out one vblank too early.
This behaviour has been observed during my testing with measurement
hardware a couple of time.
The commit message says that the extra flip event code was added to
dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
in case the pflip irq doesn't fire, because the "DCH HUBP" component
is clock gated and doesn't fire pflip irqs in that state. Also that
this clock gating may happen if no planes are active. According to
Nicholas, the clock gating can also happen if psr is active, and the
gating is controlled independently by the hardware, so difficult to
detect if and when the completion code in above commit is needed.
This patch tries the following solution: It only executes the extra
pflip
completion code in dm_dcn_crtc_high_irq() iff the hardware reports
that there aren't any surface updated pending in the double-buffered
surface scanout address registers. Otherwise it leaves pflip completion
to the pflip irq handler, for a more race-free experience.
This would only guard against the order of events mentioned above.
If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
at all, because 1-3 + 5 might happen even without the hw being
programmed
at all, ie. no surface update pending because none yet programmed
into hw.
Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
under event_lock protection within the same critical section.
v2: Take Nicholas comments into account, try a different solution.
Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
Seems to work without causing obvious new trouble.
Nick, any comments on this?  Can we get this committed or do you think
it needs additional rework?
Thanks,
Alex
Hi Alex, Mario,
This might be a little strange, but if we want to get this in as a fix
for regressions caused by the original vblank and user events at
vstartup patch then I'm actually going to give my reviewed by on the
*v1* of this patch (but not this v2):
Reviewed-by: Nicholas Kazlauskas nicholas.kazlauskas@amd.com
You can feel free to apply that one.
Reason 1: After having thought about it some more I don't think we
enable anything today that has hubp powered down at the same time we
expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
entry. Static screen interrupt should happen after that flip finishes I
think.
The CRTC can still be powered on with zero planes, and I don't think any
userspace explicitly asks for vblank events in this case but it doesn't
hurt to have the check.
Ok, so the original commit that causes the races currently solves a
non-existing - but potential future - problem?
I guess then my v1 patch makes sense for the moment and fixes the immediate
problem for Linux 5.6-rc.
Maybe there's a way to ask the hw if hubp is clock-gated and so if that
code needs to run or not in the future?
As mentioned before, it would be helpful at least for me to get a better
overview about which hw events happen when in vblank, which irq's fire in
response etc., how this relates to things like power management actions,
psr, vrr / drr, etc. A lot has changed in the hw during the last 10 years,
and my guideline are still the public pdf files with the DCE register specs
for DCE-1'ish hw from sometime around the year 2007.
...
Reason 2: This new patch will need much more thorough testing from side
...
to fully understand the consequences of locking the entire DC commit
sequence. For just a page flip that sounds fine, but for anything more
than (eg. full updates, modesets, etc) I don't think we want to be
disabling interrupts for potentially many milliseconds.
Ah! I was wondering where the attached splat comes from, but I think
this explains it: With this patch amdgpu_dm_commit_planes keeps the
pcrtc->dev->event_lock spinlock locked while calling
dc_commit_updates_for_stream, which ends up calling
smu_set_display_count, which tries to lock a mutex.
Yep, sorry about that. My most modern machine has Raven Ridge / DCN1, and
that function only gets called on Navi /DCN2 afaics. All my testing is
limited to Polaris / DCE11 and Raven DCN1, my most modern hw atm.
thanks,
-mario
-- 
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer
>

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)