On Fri, Jun 30, 2017 at 03:30:33PM +0200, Daniel Vetter wrote:
On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
How does this solve the deadlock? Afaiui the deadlock is that the gpu reset stopped unconditionally completing all requests before it did anything else, including anything with the hw or taking modeset locks.
This ensured that any outstanding flips (we only had pageflips, no atomic ones back then) could complete (although maybe displaying garbage). The only thing we had to do was grab the locks (to avoid concurrent modesets) and then we could happily nuke the hw (since the flips where lost in the CS anyway), and restore it afterwards.
Then the TDR rewrite came around and broke that, which now makes atomic time out waiting for the gpu to complete (because the gpu reset waits for the modeset to complete first). Which means if you want to fix this without breaking TDR, then you need to cancel the pending atomic commits. That seems somewhat what you're attempting here with trying to figure out what the latest hw-committed step is (but that function gets it wrong), and lots more locking and stuff on top.
Why exactly can't we just go back to the old world of force-completing dead requests on gen4 and earlier? That would be tons simpler imo instead of throwing a pile of hacks (which really needs a complete rewrite of the atomic code in i915) in as a work-around. We didn't have TDR on gen4 and earlier for years, going back to that isn't going to hurt anyone.
Making working gen4 gpu reset contigent on cancellable atomic commits and the full commit queue is imo nuts.
And if the GEM folks insist the old behavior can't be restored, then we just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere in i915_sw_fence. Force-completing all render requests atomic updates depend upon is imo the simplest solution to this, and we've had a driver that worked like that for years.
And it used to break all the time. I think we've had to fix it at least three times by now. So I tend to think it's better to fix it in a way that won't break so easily.