[AMD Official Use Only]
Hi Daniel/Christian/Andrey
It looks like the opinions from you three are spread across the flood of emails to me, and the feature we are working on (the diagnostic TDR scheme) has been pending for more than six months (we started it in February 2021).
Honestly speaking, the email workflow we are using now is not friendly and is quite painful for me. Can we put all our opinions, suggestions, and even objections together here and go through them one by one? It is too hard for us to reply to each email on different questions.
For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
This is a fix for the timeout timer in the scheduler; can we land this one first? It should already address all the questions and suggestions.
For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
I think I already explained the questions raised by Daniel in another thread, regarding why I use __kthread_should_park(). For the other aspects, can we synthesize all our opinions here?
Thanks !
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
In the previous discussion, you stated that we should drop the "kthread_should_park" check in cleanup_job:
@@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
{
struct drm_sched_job *job, *next;
- /*
- * Don't destroy jobs while the timeout worker is running OR thread
- * is being parked and hence assumed to not touch pending_list
- */
- if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
- !cancel_delayed_work(&sched->work_tdr)) ||
- kthread_should_park())
- return NULL;
But I suddenly have a question here: if we return the timed-out job regardless of kthread_should_park(), then we are back to the original problem: the timed-out job suddenly signals, cleanup_job still returns it to sched_main, and the job is freed while it is still being handled by the vendor's timeout callback.
If we return NULL from cleanup_job when kthread_should_park() is true, we can prevent the above scenario: once a job is processed by job_timedout we can stop its scheduler, and after that, even if this job suddenly signals, cleanup_job won't return it, so sched_main won't free it in parallel.
What do you think? Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
From: Liu, Monk
Sent: Wednesday, September 1, 2021 9:23 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen <JingWen.Chen2@amd.com>
Cc: DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
I will answer everything here -
On 2021-08-31 9:58 p.m., Liu, Monk wrote:
In the previous discussion, you stated that we should drop the "kthread_should_park" check in cleanup_job:
@@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
{
struct drm_sched_job *job, *next;
- /*
- * Don't destroy jobs while the timeout worker is running OR thread
- * is being parked and hence assumed to not touch pending_list
- */
- if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
- !cancel_delayed_work(&sched->work_tdr)) ||
- kthread_should_park())
- return NULL;
But I suddenly have a question here: if we return the timed-out job regardless of kthread_should_park(), then we are back to the original problem: the timed-out job suddenly signals, cleanup_job still returns it to sched_main, and the job is freed while it is still being handled by the vendor's timeout callback.
If we return NULL from cleanup_job when kthread_should_park() is true, we can prevent the above scenario: once a job is processed by job_timedout we can stop its scheduler, and after that, even if this job suddenly signals, cleanup_job won't return it, so sched_main won't free it in parallel.
What do you think?
Does your analysis above take into account that you also submit '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler'? If so, I don't see a problem: as long as you put kthread_park(sched->thread) BEFORE fetching the next bad job from the pending list (under the spinlock), there is no such issue as in the case you describe, because the potential bad job that became signaled will be removed from the pending list before you even fetch the next job, and by the time you fetch it the scheduler thread is already stopped anyway.
If you don't submit it and we keep the removal hack for now, then there is also no problem, because we temporarily remove the job we fetch for TDR from the pending list under the spinlock exactly to avoid this race.
For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
This is a fix for the timeout timer in the scheduler; can we land this one first? It should already address all the questions and suggestions.
I have no objections to this one besides getting rid of the kthread_should_park() return-NULL part; if my answer above is not wrong, then it seems superfluous to me.
For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
I think I already explained the questions raised by Daniel in another thread, regarding why I use __kthread_should_park().
Is this race free? Can't the other thread execute kthread_park() after the check?
For the other aspects, can we synthesize all our opinions here?
So, to summarize from previous threads: I think the best solution to the problem being solved in this patch is to do the HW fence embedding at the drm_sched_job level instead of doing it only for amdgpu. By modifying all the drivers to support this we can both remove this hack and solve the race against concurrent drm_sched_cleanup_jobs job freeing, simply by taking a reference to the job's HW fence at the beginning of drm_sched_job_timedout.
If doing this refactoring for all the drivers is not an option now and you need a quick solution such as the serialization you do here, then take into account other concurrent TDR handlers that might run. As mentioned before, without serializing against TDR handlers from other schedulers you just race against them: you parked this one now, but another handler in progress will unpark it as part of calling drm_sched_start for other rings. So you either need a global lock or a dedicated single-threaded queue, as Daniel suggested. At a minimum we should change cancel_delayed_work in drm_sched_stop to cancel_delayed_work_sync, to cancel and flush all concurrent TDRs, and probably move it to the beginning of the function, after kthread_park and before we start playing with the pending list.
P.S. One comment I had regarding the single-threaded queue: in case we have multiple TDRs arising from the same event, we will proceed to reset multiple times, something that with trylock we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and amdgpu_device_lock_hive_adev). Daniel mentioned it's not a problem, but I still haven't understood why not.
Andrey
On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
Does your analysis above take into account that you also submit '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler'? If so, I don't see a problem.
Hi Andrey, Monk has talked to me and we agreed that, since there are multiple opinions about '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch 1 is an independent patch that fixes an error, we should not take patch 2 into the analysis.
So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)? Patch v3 keeps this kthread_should_park check.
Best Regards,
JingWen
On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)? Patch v3 keeps this kthread_should_park check.
But since in both cases it looks like there is no danger of use-after-free, I see no reason to keep the kthread_should_park check.
Andrey
On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote:
But since in both cases it looks like there is no danger of use-after-free, I see no reason to keep the kthread_should_park check.
Andrey
OK, I get it. Patch v4 has removed this check; can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)?
Best Regards,
JingWen
On 2021-09-01 12:40 a.m., Jingwen Chen wrote:
OK, I get it. Patch v4 has removed this check; can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)?
Sure
Andrey
Best Regards, JingWen
Thanks ------------------------------------------ Monk Liu | Cloud-GPU Core team ------------------------------------------ From: Liu, Monk Sent: Wednesday, September 1, 2021 9:23 AM To: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Daniel Vetter <daniel@ffwll.ch>; Chen, JingWen <JingWen.Chen2@amd.com> Cc: DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org Subject: [diagnostic TDR mode patches] unify our solution opinions/ suggestions in one thread [AMD Official Use Only] Hi Daniel/Christian/Andrey It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021). Honestly speaking the email ways that we are using now is not friendly and quite painful to me …. Can we try to put all our opinions, suggestions, or even objects here together, let’s go through them one by one, it’s too hard for us to reply each email on different questions . For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4) This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
I have no objections for this one besides getting rid of the kthread_should_park()) return null part, if my answer above is not wrong then it seems superfluous to me
For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler I think I already explained the questions raised by Daniel in other thread , regarding why I use __kthread_should_park()
Is this race-free? Can't the other thread execute kthread_park() after the check?
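The interleaving being asked about here can be sketched like this (illustrative pseudocode only, using the kernel's kthread API names; this is not claiming the race actually occurs, just showing the window under discussion):

```text
  scheduler thread (drm_sched_main)        reset/park path
  ---------------------------------        ---------------
  if (!__kthread_should_park())
          /* check passes */
                                           kthread_park(sched->thread)
                                             /* SHOULD_PARK is set only
                                                after the check above */
  /* scheduler thread proceeds to touch
     pending_list / free the job based
     on a stale answer */
```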
For other aspects, can we synthesize all our opinions here?
So, to summarize from previous threads: I think the best solution to the problem this patch addresses is to do HW fence embedding at the drm_sched_job level instead of only in amdgpu. By modifying all the drivers to support this, we can both remove this hack and solve the race against concurrent drm_sched_cleanup_jobs job freeing, simply by taking a reference to the job's HW fence at the beginning of drm_sched_job_timedout.
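A hedged sketch of that proposal (kernel-style pseudocode, not a buildable patch; the field names such as s_fence->parent follow the drm/sched structures of that era and should be treated as assumptions, and locking of the pending list is omitted for brevity):

```c
/* Sketch: pin the job's embedded HW fence for the duration of the
 * timeout handler, so a concurrent free_job() cannot free the job
 * while the driver's timedout_job() callback is still using it.
 */
static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;

	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job) {
		/* With the HW fence embedded in the job, holding a
		 * reference on the fence keeps the job's memory alive
		 * across the vendor timeout callback.
		 */
		struct dma_fence *f = dma_fence_get(job->s_fence->parent);

		sched->ops->timedout_job(job);
		dma_fence_put(f);
	}
}
```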
If refactoring all the drivers is not an option now and you need a quick solution such as the serialization you do here, then take into account the other concurrent TDR handlers that might run. As mentioned before, without serializing against TDR handlers from other schedulers you simply race against them: e.g. you parked the thread now, but another handler already in progress will unpark it when it calls drm_sched_start for the other rings. So you either need a global lock or a dedicated single-threaded queue, as Daniel suggested. At minimum, we should change cancel_delayed_work in drm_sched_stop to cancel_delayed_work_sync, to cancel and flush all concurrent TDRs, and probably move it to the beginning of the function, after kthread_park and before we start touching the pending list.
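The ordering suggested at the end of that paragraph can be sketched as follows (kernel-style pseudocode, not a complete patch; the real drm_sched_stop of that era does considerably more with the pending list than shown here):

```c
/* Sketch of the suggested ordering for drm_sched_stop():
 * park first, then synchronously cancel/flush any concurrently
 * running TDR work, and only then touch the pending list.
 */
void drm_sched_stop(struct drm_gpu_scheduler *sched,
		    struct drm_sched_job *bad)
{
	kthread_park(sched->thread);

	/* cancel_delayed_work() can return while a TDR handler is
	 * still running; the _sync variant also waits for a handler
	 * already in flight to finish.
	 */
	cancel_delayed_work_sync(&sched->work_tdr);

	/* ... only now iterate/prune sched->pending_list ... */
}
```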
P.S. One comment I had regarding the single-threaded queue: in case we have multiple TDRs arising from the same event, we will proceed to reset multiple times, something that with trylock we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and amdgpu_device_lock_hive_adev). Daniel mentioned it's not a problem, but I still haven't understood why not.
Andrey
Thanks ! ------------------------------------------ Monk Liu | Cloud-GPU Core team ------------------------------------------
Hi Monk,
On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk Monk.Liu@amd.com wrote:
[AMD Official Use Only]
Hi Daniel/Christian/Andrey
It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).
For me your project exists since a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking off any thread on dri-devel isn't great, but also not something we can fix, since time machines don't exist.
So we have to make the best out of the situation and move ahead where we are. From my understanding you've done a bunch of changes to the scheduler code. As far as I can see there's been two related things your team has done:
- remove some allocations from scheduler code, because that can lead to deadlocks. I've kicked up this topic quite a while ago here
https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffw...
This is just one patch of the entire series. This is an area where we really need a consistent solution across all drm/sched drivers, not something that individual drivers just fix in their own way.
- the other one is the timeout issue for the patches you cite here. Again there's been discussion on this on dri-devel with Boris from panfrost about how we can handle at least some of the races in tdr. That resulted in lots of discussions and documentation improvements. Those patches are merged now, link https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@c...
There's been more than just this, also quite some doc patches from Boris that explain how it's all supposed to work and be race-free. Again your driver isn't the only one with interesting TDR races.
Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.
The other thing where I'm struggling is that there's a lot of missing context for outsiders. The patches sometimes come with zero commit message, for tricky concurrency bugs. And there's no context with what you've done already on the amdgpu side (since that never showed up on dri-devel), which makes constructive discussions here really hard.
Now fixing these bugs is obviously good, but the way this is supposed to work when touching shared infrastructure is:
- Before you start merging anything kick off an RFC thread on dri-devel (or whatever the topic really is about) about the problem you have and how you're trying to solve it. This can be just text if it's a big thing, but it can also already include some proof of concept solution in the form of patches.
- Then we iterate on the solution, across drivers and shared code _together_. Not "merge amdgpu code first, then get annoyed when the core changes don't land immediately after you've practically finished the project".
- This might mean changes to other drivers if we need to adjust interfaces.
On the plus side you can plan much better, because you know you have upstream buy-in before you start to put in real work on the project.
Honestly speaking the email ways that we are using now is not friendly and quite painful to me ….
Yes this is painful :-(
I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.
The not-so-painful approach would have been to do this from the start, 6 months ago. It would definitely have helped if the tdr discussion we've had just a few months ago would have involved your team too, I'm sure there would have been some good insights from amd's side. I'd really want you and your engineers involved here, so let's do this properly!
Cheers, Daniel
Can we try to put all our opinions, suggestions, or even objects here together, let’s go through them one by one, it’s too hard for us to reply each email on different questions .
For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
I think I already explained the questions raised by Daniel in other thread , regarding why I use __kthread_should_park()
For other aspects, can we put all our opinion synthesized here ?
Thanks !
Monk Liu | Cloud-GPU Core team
[AMD Official Use Only]
Daniel
From the link you shared, it looks like you (or someone else) have quite a bunch of patches that change DRM_SCHED or even amdgpu. In that case, I'm wondering whether any AMD developer reviewed them before they were merged to the kernel tree?
They look to me to conflict with what we changed in our repo...
It is really chaos for AMDers if someone outside of AMD changes our kernel driver (and/or scheduler) without review by an AMDer, just like we require your review if we intend to change the scheduler's logic here...
This one changes AMD's code: https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@c... And I didn't see any Reviewed-by from AMDers...
This one also touches AMD's code: https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffw... It conflicts with one patch we submitted (in our repo right now), and I don't see that any AMDer gave a Reviewed-by on this one either (let me know if I missed it).
Thanks
------------------------------------------ Monk Liu | Cloud-GPU Core team ------------------------------------------
On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk Monk.Liu@amd.com wrote:
Monk, this is not how upstream works. You need to participate. That's how communities work. There's a reason all these discussions happen on public mailing lists. The patch author can't be expected to know every person on every vendor team to CC with a patch. If you have concerns, you need to raise them when the patches are being discussed.
Alex
I'm not sure I can add much to help this along; I'm sure Alex has some internal training.
Once your driver is upstream, it belongs to upstream. You can maintain it, but you no longer control it 100%. That's a tradeoff, and it's not one companies always understand.
Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
The best time to interact with upstream was 6 months ago, the second best time is now.
Dave.
[AMD Official Use Only]
I'm not sure I can add much to help this along, I'm sure Alex has some internal training, Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand. Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier. The best time to interact with upstream was 6 months ago, the second best time is now. <<<
Daniel/AlexD
I didn't mean your changes to the AMD driver need my personal approval or review... and I'm already used to the fact that our driver is not 100% under AMDers' control. But supposedly anyone from the community (including you) who intends to change AMD's driver needs at least approval from someone in AMD, e.g. AlexD or Christian; isn't that reasonable? Just like we need your approval if we try to modify drm-sched, or panfrost's approval if we need to change panfrost code...
By only CCing AMD's engineers, how do you know whether your changes (to the AMD code) conflict with AMD's ongoing internal features/refactoring or not?
Thanks
------------------------------------------ Monk Liu | Cloud-GPU Core team ------------------------------------------
Hi Monk,
Am 02.09.21 um 07:52 schrieb Liu, Monk:
[AMD Official Use Only]
I'm not sure I can add much to help this along, I'm sure Alex has some internal training, Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand. Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier. The best time to interact with upstream was 6 months ago, the second best time is now. <<<
Daniel/AlexD
I didn't mean your changes on AMD driver need my personal approval or review ... and I'm totally already get used that our driver is not 100% under control by AMDers, but supposedly any one from community (including you) who tend to change AMD's driver need at least to get approvement from someone in AMD, e.g.: AlexD or Christian, doesn't that reasonable?
I fear I'm just repeating what Alex said, but to make it clear once more: that is *not* necessary!
The shared repository is owned by the upstream maintainers, and they are usually free to do restructuring work without getting an acknowledgement from every single driver maintainer.
Anybody can of course technically object to upstream design decisions, but that means that you need to pay attention to the mailing lists in the first place.
just like we need your approve if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...
by only CC AMD's engineers looks not quite properly, how do you know if your changes (on AMD code part) are conflicting with AMD's on-going internal features/refactoring or not ?
Well, because AMD is supposed to work in public as much as possible and to ask upstream before making changes to the code base.
In addition to that, design decisions are supposed to be discussed on the mailing list and *not* internally.
Regards, Christian.
Thanks
Monk Liu | Cloud-GPU Core team
-----Original Message----- From: Dave Airlie airlied@gmail.com Sent: Thursday, September 2, 2021 2:51 AM To: Alex Deucher alexdeucher@gmail.com Cc: Liu, Monk Monk.Liu@amd.com; Daniel Vetter daniel@ffwll.ch; Koenig, Christian Christian.Koenig@amd.com; Grodzovsky, Andrey Andrey.Grodzovsky@amd.com; Chen, JingWen JingWen.Chen2@amd.com; DRI Development dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
On Thu, 2 Sept 2021 at 01:20, Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk Monk.Liu@amd.com wrote:
[AMD Official Use Only]
Daniel
From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?
They looks to me somehow conflicting with what we changed in our repo....
It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here ....
This one changes AMD's code: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flo re.kernel.org%2Fdri-devel%2F20210625133327.2598825-2-boris.brezillon %40collabora.com%2F&data=04%7C01%7CMonk.Liu%40amd.com%7C6c507d18 d65341ef53bb08d96d7976e6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0% 7C637661190727875969%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJ QIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=BWJSkKN y2%2BwjxbQrfxGPzuJ5PBpBwB4aV0ZH6QoJGEg%3D&reserved=0 And I didn't see any reviewed-by from AMDers ...
This one also touches AMD's code: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flo re.kernel.org%2Fdri-devel%2F20200604081224.863494-12-daniel.vetter%4 0ffwll.ch%2F&data=04%7C01%7CMonk.Liu%40amd.com%7C6c507d18d65341e f53bb08d96d7976e6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63766 1190727885929%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2 luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2F8vIVXCWjHkM 56pcYI9EvuzhbsZhV9WczkKaBJE67KQ%3D&reserved=0 Which is conflicting with one patch we submitted (in our repo rightnow), and neither see AMDder gave a review-by on this one (let me know if I missed it)
Monk, this is not how upstream works. You need to participate. That's how communities work. There's a reason all these discussions happen on public mailing lists. The patch author can't be expected to know every person on every vendor team to CC with a patch. If you have concerns, you need to raise them when the patches are being discussed.
I'm not sure I can add much to help this along, I'm sure Alex has some internal training,
Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
The best time to interact with upstream was 6 months ago, the second best time is now.
Dave.
On Thu, Sep 2, 2021 at 1:00 PM Christian König christian.koenig@amd.com wrote:
Hi Monk,
On 02.09.21 at 07:52, Liu, Monk wrote:
I'm not sure I can add much to help this along, I'm sure Alex has some internal training, Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand. Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier. The best time to interact with upstream was 6 months ago, the second best time is now. <<<
Daniel/AlexD
I didn't mean that your changes to the AMD driver need my personal approval or review ... and I'm already used to the fact that our driver is not 100% under AMD's control. But shouldn't anyone from the community (including you) who intends to change AMD's driver at least get approval from someone at AMD, e.g. AlexD or Christian? Isn't that reasonable?
I fear I'm just repeating what Alex said, but to make it clear once more: that is *not* necessary!
The shared repository is owned by the upstream maintainers, and they are usually free to do restructuring work without getting an acknowledgement from every single driver maintainer.
Anybody can of course technically object to upstream design decisions, but that means that you need to pay attention to the mailing lists in the first place.
Just like we need your approval if we try to modify DRM-sched, or panfrost's approval if we need to change panfrost code ...
Only CCing AMD's engineers doesn't look quite proper; how do you know whether your changes (to the AMD code) conflict with AMD's ongoing internal features/refactoring?
Well because AMD is supposed to work in public as much as possible and ask upstream before doing changes to the code base.
In addition to that, design decisions are supposed to be discussed on the mailing list and *not* internally.
Yeah, I'm honestly really surprised about the course of this discussion here. With Alex, Christian and others, AMD has a lot of folks with years/decades of experience in how to collaborate upstream, when to pull in others proactively and when that's not needed, and in general how to plan upstream work with the least amount of risk and surprises.
I think step zero here needs to be some training at AMD and then re-planning this effort, before we get back to technical stuff. Otherwise we'll just get bogged down in pain because expectations about the process don't pan out. -Daniel
Regards, Christian.
Thanks
Monk Liu | Cloud-GPU Core team
-----Original Message----- From: Dave Airlie airlied@gmail.com Sent: Thursday, September 2, 2021 2:51 AM To: Alex Deucher alexdeucher@gmail.com Cc: Liu, Monk Monk.Liu@amd.com; Daniel Vetter daniel@ffwll.ch; Koenig, Christian Christian.Koenig@amd.com; Grodzovsky, Andrey Andrey.Grodzovsky@amd.com; Chen, JingWen JingWen.Chen2@amd.com; DRI Development dri-devel@lists.freedesktop.org; amd-gfx@lists.freedesktop.org Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
On Thu, 2 Sept 2021 at 01:20, Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk Monk.Liu@amd.com wrote:
Daniel
From the link you shared, it looks like you (or someone else) have quite a bunch of patches that change DRM_SCHED or even amdgpu; in that case I'm wondering whether any AMD developer reviewed them before they were merged to the kernel tree?
They look to me to conflict somewhat with what we changed in our repo....
I fear I'm just repeating what Alex said, but to make it clear once more: that is *not* necessary!
The shared repository is owned by the upstream maintainers, and they are usually free to do restructuring work without getting an acknowledgement from every single driver maintainer.
Hi Daniel
Anyway, thanks for officially confirming the community's working model & policy to me. I won't put my own opinion here, because that's not my call to change either way. I only want to bring this diagnostic TDR scheme to a good end, for AMD or even for all DRM vendors.
How about this: we still have a final patch that hasn't landed in the DRM scheduler, and I would like JingWen to present it to you and AlexD/Christian/Andrey. I believe you will have concerns or objections regarding this patch, but that's fine; let us figure out together how to make it acceptable to you and the other vendors that work with the DRM scheduler.
P.S.: I have to repeat myself again: we are not suddenly popping up with a new idea, it is a disconnection issue. We didn't have changes (or plans for changes) in the DRM scheduler before, but eventually we found that we must make job_timeout and sched_main run serialized; otherwise it won't work, given the current scheduler's code structure.
Thanks
------------------------------------------ Monk Liu | Cloud-GPU Core team ------------------------------------------
Hi Christian/Andrey/Daniel,
I read Boris's patch about ordered workqueue and I think maybe we can leverage this change. https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@c...
The TDR race condition we are talking about is caused by a bailing job being deleted from the pending list. If we instead use an ordered workqueue for the timeout work in the driver, there will be no bailing job.
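For reference, the direction described above could be sketched roughly like this (a hypothetical kernel-code fragment, not the actual patch; `gpu_tdr_init()` and `gpu_start_timeout()` are illustrative names):

```c
/* Sketch only: one ordered workqueue (max_active == 1) shared by all
 * schedulers of a device serializes every timedout_job handler, so no
 * handler races with another handler or deletes jobs concurrently. */

static struct workqueue_struct *tdr_wq;

static int gpu_tdr_init(void)
{
	tdr_wq = alloc_ordered_workqueue("gpu-tdr", 0);
	return tdr_wq ? 0 : -ENOMEM;
}

/* Wherever the scheduler currently does
 *     schedule_delayed_work(&sched->work_tdr, sched->timeout);
 * it would instead arm the timeout on the shared ordered queue: */
static void gpu_start_timeout(struct drm_gpu_scheduler *sched)
{
	queue_delayed_work(tdr_wq, &sched->work_tdr, sched->timeout);
}
```

Because an ordered workqueue executes at most one work item at a time, the timeout handlers become naturally serialized without extra locking in each handler.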
Do you have any suggestions?
Best Regards, JingWen Chen
On Thu, Sep 2, 2021 at 1:52 AM Liu, Monk Monk.Liu@amd.com wrote:
I'm not sure I can add much to help this along, I'm sure Alex has some internal training, Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand. Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier. The best time to interact with upstream was 6 months ago, the second best time is now. <<<
Daniel/AlexD
I didn't mean that your changes to the AMD driver need my personal approval or review ... and I'm already used to the fact that our driver is not 100% under AMD's control. But shouldn't anyone from the community (including you) who intends to change AMD's driver at least get approval from someone at AMD, e.g. AlexD or Christian? Isn't that reasonable? Just like we need your approval if we try to modify DRM-sched, or panfrost's approval if we need to change panfrost code ...
Only CCing AMD's engineers doesn't look quite proper; how do you know whether your changes (to the AMD code) conflict with AMD's ongoing internal features/refactoring?
We keep as up to date as possible with upstream. I don't have the bandwidth to verify every patch, but in most cases I try to look at them. In your first example, the patch basically just adds a new parameter to some common functions; drivers that don't need that parameter don't use it. It shouldn't really affect the functionality. There are lots of changes that touch our driver that we are largely not aware of. E.g., APIs that we use may change their function signatures with no intended functional changes. If problems are found, they are reported and resolved. It is a collective effort.
If there are changes that would conflict with stuff we are doing in our tree, we should bring them up when the relevant patches are being discussed. We can also make changes to core functionality like the scheduler, TTM, etc. that would affect other drivers. When we send out the patches we CC the relevant maintainers, but ultimately the ones who participate in the discussion set the direction. That's why participation is important.
Alex
Hi Monk,
On Thu, 2 Sept 2021 at 06:52, Liu, Monk Monk.Liu@amd.com wrote:
I didn't mean that your changes to the AMD driver need my personal approval or review ... and I'm already used to the fact that our driver is not 100% under AMD's control. But shouldn't anyone from the community (including you) who intends to change AMD's driver at least get approval from someone at AMD, e.g. AlexD or Christian? Isn't that reasonable? Just like we need your approval if we try to modify DRM-sched, or panfrost's approval if we need to change panfrost code ...
Only CCing AMD's engineers doesn't look quite proper; how do you know whether your changes (to the AMD code) conflict with AMD's ongoing internal features/refactoring?
Looking at the patches in question, they were (at least mostly) CCed both to the amd-gfx@ mailing list and also to ckoenig. Unfortunately it is not possible for every single patch to get mandatory signoff from every single stakeholder - e.g. if every AMD patch which touched the scheduler required explicit approval from Etnaviv, Freedreno, Lima, Panfrost, and V3D teams, it would become very difficult for AMD to merge any code.
So the approach is that patches are sent for approval, they are CCed to people who should be interested, and after some time with no comments, they may be merged if it seems like a reasonable thing to do.
The problem with internal work is that, well, it's internal. If the community sends patches to amd-gfx@, there is no comment from AMD, and then months later we are told that it should not have happened because it conflicts with development that AMD has been doing - how should the rest of the community have known about this? So unfortunately this is the compromise: if you decide to do private development, not inform anyone about your plans, and not join in any common discussion, then it is your responsibility to deal with any changes or conflicts that happen whilst you are developing privately.
The only way we can successfully have support in the same ecosystem for AMD, Arm, Broadcom, Intel, NVIDIA, Qualcomm, and VeriSilicon, is that we are all working together openly. If community development had to stop because each of these vendors had been doing internal development for several months without even informing the community of their plans, any kind of shared development is clearly impossible.
Cheers, Daniel
For me your project has existed for a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within AMD, on a code area that very much affects shared code, without kicking off any thread on dri-devel isn't great, but it's also not something we can fix, since time machines don't exist.
This is partially true: in the past months our changes only resided in the AMD driver; it is only now that we found we had to make changes at the scheduler level.
Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.
If our changes at the DRI level cannot get merged soon, that's fine; we can discuss more. But they didn't suddenly pop up from nowhere; we have already worked on this for months inside the AMD driver.
I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.
We are not objecting this approach, we didn't do that because previously all what we need to do is resident inside AMD driver ... because we try to avoid change DRI/DRM interface part ...
For the patches you shows to us with links I'm sorry that due to some IT infrastructure reason me and my team didn't see it before (we kind of work in AMD repo ... the patches you shows didn't get merged in our repo yet...) One thing I also want to emphasis here: if any code need change inside AMD driver please always let us know and review.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk <Monk.Liu@amd.com>
Cc: Koenig, Christian <Christian.Koenig@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Chen, JingWen <JingWen.Chen2@amd.com>; DRI Development <dri-devel@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
Hi Monk,
On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk Monk.Liu@amd.com wrote:
> Hi Daniel/Christian/Andrey
> It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).
For me your project exists since a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within AMD, on a code area that very much affects shared code, without kicking off any thread on dri-devel isn't great, but also not something we can fix, since time machines don't exist.
So we have to make the best out of the situation and move ahead where we are. From my understanding you've done a bunch of changes to the scheduler code. As far as I can see there's been two related things your team has done:
- remove some allocations from scheduler code, because that can lead to deadlocks. I kicked off this topic quite a while ago here
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...
This is just one patch of the entire series. This is an area where we really need a consistent solution across all drm/sched drivers, not something that individual drivers just fix in their own way.
- the other one is the timeout issue for the patches you cite here. Again there's been discussion on this on dri-devel with Boris from panfrost about how we can handle at least some of the races in tdr. That resulted in lots of discussions and documentation improvements. Those patches are merged now, link https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kerne...
There's been more than just this, also quite some doc patches from Boris that explain how it's all supposed to work and be race-free. Again your driver isn't the only one with interesting TDR races.
Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.
The other thing where I'm struggling is that there's a lot of missing context for outsiders. The patches sometimes come with zero commit message, for tricky concurrency bugs. And there's no context with what you've done already on the amdgpu side (since that never showed up on dri-devel), which makes constructive discussions here really hard.
Now fixing these bugs is obviously good, but the way this is supposed to work when touching shared infrastructure is:
- Before you start merging anything kick off an RFC thread on dri-devel (or whatever the topic really is about) about the problem you have and how you're trying to solve it. This can be just text if it's a big thing, but it can also already include some proof of concept solution in the form of patches.
- Then we iterate on the solution, across drivers and shared code _together_. Not "merge amdgpu code first, then get annoyed when the core changes don't land immediately after you've practically finished the project".
- This might mean changes to other drivers if we need to adjust interfaces.
On the plus side you can plan much better, because you know you have upstream buy-in before you start to put in real work on the project.
> Honestly speaking the email ways that we are using now is not friendly and quite painful to me ....
Yes this is painful :-(
I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.
The not-so-painful approach would have been to do this from the start, 6 months ago. It would definitely have helped if the tdr discussion we've had just a few months ago would have involved your team too, I'm sure there would have been some good insights from amd's side. I'd really want you and your engineers involved here, so let's do this properly!
Cheers, Daniel
> Can we try to put all our opinions, suggestions, or even objects here together, let's go through them one by one, it's too hard for us to reply each email on different questions .
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
> This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> I think I already explained the questions raised by Daniel in other thread , regarding why I use __kthread_should_park()
> For other aspects, can we put all our opinion synthesized here ?
> Thanks !
> Monk Liu | Cloud-GPU Core team
--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll....