Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

26 Aug 2021

Am 26.08.21 um 06:55 schrieb Monk Liu:
...
issue:
in cleanup_job the cancle_delayed_work will cancel a TO timer
even the its corresponding job is still running.
Yeah, that makes a lot more sense.
...
fix:
do not cancel the timer in cleanup_job, instead do the cancelling
only when the heading job is signaled, and if there is a "next" job
we start_timeout again.
v2:
further cleanup the logic, and do the TDR timer cancelling if the signaled job
is the last one in its scheduler.
v3:
change the issue description
remove the cancel_delayed_work in the begining of the cleanup_job
recover the implement of drm_sched_job_begin.
TODO:
1)introduce pause/resume scheduler in job_timeout to serial the handling
of scheduler and job_timeout.
2)drop the bad job's del and insert in scheduler due to above serialization
(no race issue anymore with the serialization)
Signed-off-by: Monk Liu Monk.Liu@amd.com
drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
  1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..ecf8140 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
  {
   struct drm_sched_job *job, *next;

/*
* Don't destroy jobs while the timeout worker is running  OR thread


* is being parked and hence assumed to not touch pending_list


*/


if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
   !cancel_delayed_work(&sched->work_tdr)) ||


   kthread_should_park())




if (kthread_should_park())
return NULL;
spin_lock(&sched->job_list_lock);


@@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
   if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
   	/* remove job from pending_list */
   	list_del_init(&job->list);


/* cancel this job's TO timer */


cancel_delayed_work(&sched->work_tdr);



I'm not sure if the work_tdr is initialized when a maximum timeout is 
specified. Please double check.
BTW: Can we please drop the "tdr" naming from the scheduler? That is 
just a timeout functionality and not related to recovery in any way.
We even do not start hardware recovery in a lot of cases now (when wave 
kill is successfully).
Regards,
Christian.
...
/* make the scheduled timestamp more accurate */
next = list_first_entry_or_null(&sched->pending_list,
				typeof(*next), list);


if (next)





if (next) {
next->s_fence->scheduled.timestamp =
	job->s_fence->finished.timestamp;







	/* start TO timer for next job */


	drm_sched_start_timeout(sched);


}

} else {
job = NULL;


/* queue timeout for next job */


drm_sched_start_timeout(sched);

}
spin_unlock(&sched->job_list_lock);


@@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
   				  (entity = drm_sched_select_entity(sched))) ||
   				 kthread_should_stop());

if (cleanup_job) {




if (cleanup_job)
sched->ops->free_job(cleanup_job);




	/* queue timeout for next job */


	drm_sched_start_timeout(sched);


}


if (!entity)
	continue;



    

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

Signed-off-by: Monk Liu Monk.Liu@amd.com