On 10/29/21 00:55, Matthew Brost wrote:
On Thu, Oct 28, 2021 at 02:01:27PM +0200, Thomas Hellström wrote:
With asynchronous migrations, the vma state may be several migrations ahead of the state that matches the request we're capturing. Address that by introducing an i915_vma_snapshot structure that can be used to snapshot relevant state at request submission. In order to make sure we access the correct memory, the snapshots take references on relevant sg-tables and memory regions.
Also move the capture list allocation out of the fence signaling critical path and use the CONFIG_DRM_I915_CAPTURE_ERROR define to avoid compiling in members and functions used for error capture when they're not used.
Finally, correct lockdep annotation would reveal that error capture is typically done in the fence signalling critical path. Alter the error capture memory allocation mode accordingly.
I've seen this as well: https://patchwork.freedesktop.org/patch/451415/?series=93704&rev=5
John Harrison's and Daniele's feeling was that if a NOWAIT memory allocation context was used while the system was under any amount of memory pressure, the error capture would be likely to fail due to the size of the objects being allocated. Daniel Vetter has proposed another solution, basically allocating a page in the NOWAIT context, which is a larger rework.
We have a Jira for this. I'll dig it up and send it over off the list if you want to join that discussion.
Matt
Please do. I basically agree with John and Daniele that error capture may fail under memory pressure, but I couldn't see how we could avoid that short of exposing ourselves to dma-fence deadlocks.
I figure basically we'd have to pin all vmas, reset, retire the request and *then* do the allocating parts of the capture.
Meanwhile, I'll ping Daniel about the best course of action for the above series.
/Thomas