Yiqing reported a negative fence refcount for resubmitted jobs in amdgpu and suggested a workaround in [1]. I took a look myself and found some deeper problems in both the amdgpu and the scheduler code.
Yiqing helped test the new code and also drew a detailed diagram [2] tracing the life cycle and refcount of the parent (HW) fence under various cases for the proposed patchset.

[1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4db0@amd.com/t/#...
[2] - https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view?usp=s...

Andrey Grodzovsky (5):
  drm/amdgpu: Fix possible refcount leak for release of external_hw_fence
  drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences
  drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'
  drm/amdgpu: Follow up change to previous drm scheduler change.

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 37 ++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 12 +++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
 drivers/gpu/drm/scheduler/sched_main.c     | 16 ++++++++--
 6 files changed, 78 insertions(+), 17 deletions(-)