These patches look good to me. JingWen will pull these patches and do some basic TDR test on sriov environment, and give feedback.
Best wishes Emily Deng
-----Original Message----- From: Liu, Monk Monk.Liu@amd.com Sent: Thursday, December 23, 2021 6:14 PM To: Koenig, Christian Christian.Koenig@amd.com; Grodzovsky, Andrey Andrey.Grodzovsky@amd.com; dri-devel@lists.freedesktop.org; amd- gfx@lists.freedesktop.org; Chen, Horace Horace.Chen@amd.com; Chen, JingWen JingWen.Chen2@amd.com; Deng, Emily Emily.Deng@amd.com Cc: daniel@ffwll.ch Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
[AMD Official Use Only]
@Chen, Horace @Chen, JingWen @Deng, Emily
Please take a review on Andrey's patch
Thanks
Monk Liu | Cloud GPU & Virtualization Solution | AMD
we are hiring software manager for CVS core team
-----Original Message----- From: Koenig, Christian Christian.Koenig@amd.com Sent: Thursday, December 23, 2021 4:42 PM To: Grodzovsky, Andrey Andrey.Grodzovsky@amd.com; dri- devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org Cc: daniel@ffwll.ch; Liu, Monk Monk.Liu@amd.com; Chen, Horace Horace.Chen@amd.com Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
Since now flr work is serialized against GPU resets there is no need for this.
Signed-off-by: Andrey Grodzovsky andrey.grodzovsky@amd.com
Acked-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ----------- drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ----------- 2 files changed, 22 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c index 487cd654b69e..7d59a66e3988 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
work_struct *work)
struct amdgpu_device *adev = container_of(virt, struct
amdgpu_device, virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
* otherwise the mailbox msg will be ruined/reseted by
* the VF FLR.
*/
if (!down_write_trylock(&adev->reset_sem))
return;
amdgpu_virt_fini_data_exchange(adev);
atomic_set(&adev->in_gpu_reset, 1);
xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
work_struct *work)
} while (timeout > 1);
flr_done:
- atomic_set(&adev->in_gpu_reset, 0);
- up_write(&adev->reset_sem);
- /* Trigger recovery for world switch failure if no TDR */ if (amdgpu_device_should_recover_gpu(adev) && (!amdgpu_device_has_job_running(adev) || diff --git
a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c index e3869067a31d..f82c066c8e8d 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
work_struct *work)
struct amdgpu_device *adev = container_of(virt, struct
amdgpu_device, virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
* otherwise the mailbox msg will be ruined/reseted by
* the VF FLR.
*/
if (!down_write_trylock(&adev->reset_sem))
return;
amdgpu_virt_fini_data_exchange(adev);
atomic_set(&adev->in_gpu_reset, 1);
xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
work_struct *work)
} while (timeout > 1);
flr_done:
- atomic_set(&adev->in_gpu_reset, 0);
- up_write(&adev->reset_sem);
- /* Trigger recovery for world switch failure if no TDR */ if (amdgpu_device_should_recover_gpu(adev) && (!amdgpu_device_has_job_running(adev) ||