Hi, Christian,
On Thu, 2021-10-07 at 16:57 +0200, Christian König wrote:
Am 07.10.21 um 15:26 schrieb Thomas Hellström:
While the range notifier is executing, we have the write-side mmu interval seqlock, and mmu_interval_read_retry() is always returning true, which means that if amdgpu_cs_submit grabs the notifier lock during the fence wait, it will retry anyway when checking the userptr validity and block when retrying in mmu_interval_read_begin(). (See the extensive comments in mmu_interval_read_begin())
Hence we can release the notifier lock before the fence wait and avoid a device-wide command submission block during invalidation.
First of all I'm not convinced that this works and second blocking the CS while an MMU invalidation is underway is completely intentional.
In other words when for example fork() is called in parallel with a CS the CS should be blocked until the invalidation caused by the fork() operation is completed and *NOT* risk that the CS succeeds and adds another dependency to the MMU invalidation.
The point is that command submission can't succeed since while the notifer runs, we have the write-side seqlock. It's released when the notifier ends, and we can rely on that. The following will happen:
Thread1 Thread 2
enter_cs(); enter_invalidation(); notifier_lock() update_seqno(); notifier_unlock(); notifier_lock(); fence_wait(); validate_userptr(); (invalid_seqno) notifier_unlock(); (retry) mmu_interval_read_begin(); (blocks) fence_signal(); (notifier ends) mn_itree_inv_end(); (unblocks, new seqno); enter_cs(); (succeeds)
So the difference is that you block only CS that affect that particular userptr, not CS across the entire device.
So for the sake of discussion, is there a particular situation where you see a race that might happen here?
Thanks, Thomas
Christian.
Cc: Christian König christian.koenig@amd.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index 4b153daf283d..d3d340a6129c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -75,9 +75,10 @@ static bool amdgpu_mn_invalidate_gfx(struct mmu_interval_notifier *mni, mmu_interval_set_seq(mni, cur_seq); + mutex_unlock(&adev->notifier_lock);
r = dma_resv_wait_timeout(bo->tbo.base.resv, true, false, MAX_SCHEDULE_TIMEOUT); - mutex_unlock(&adev->notifier_lock); if (r <= 0) DRM_ERROR("(%ld) failed to wait for user bo\n", r); return true;