Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
Are you referring to drm_sched_stop()?
That's different, and we don't need its logic: it walks the pending list, removes all the callbacks, and so on. Meanwhile, the vendor's timeout callback will call drm_sched_stop() in the proper way. All my patch wants is to simply park the scheduler. Besides, even if you call drm_sched_stop() in job_timeout, you still run into the warning issue I hit.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
-----Original Message-----
From: Daniel Vetter <daniel@ffwll.ch>
Sent: Tuesday, August 31, 2021 9:02 PM
To: Liu, Monk <Monk.Liu@amd.com>
Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Chen, Jingwen <Jingwen.Chen@amd.com>
Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
Can we please have an actual commit message here, with a detailed explanation of the race/bug/whatever, how you fix it and why this is the best option?
On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
tested-by: jingwen chen <jingwen.chen@amd.com>
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: jingwen chen <jingwen.chen@amd.com>

 drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
 1 file changed, 4 insertions(+), 20 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ecf8140..894fdb24 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

 	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+	if (!__kthread_should_park(sched->thread))
This is a __ function, i.e. considered internal, and it's lockless atomic, i.e. unordered. And you're not explaining why this works.
Iow it's probably buggy, and just unconditionally parking the kthread is probably the right thing to do. If it's not the right thing to do, there's a bug here for sure.
Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much. -Daniel
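To make the concern about an unordered check followed by a park concrete, here is a minimal userspace sketch using pthreads. It is not the kernel kthread API: `struct worker`, `worker_park()`, `worker_unpark()` and `run_demo()` are hypothetical names invented for illustration. In this model the park request is set and acknowledged under one mutex, so a caller can park unconditionally and a repeated call is harmless. That is a property of this sketch only, not a claim about how `kthread_park()` behaves when called twice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a parkable scheduler thread. */
struct worker {
	pthread_t tid;
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool should_park;		/* park requested; only touched under lock */
	bool parked;			/* worker has acknowledged the request */
	bool should_stop;
	unsigned long iterations;	/* stands in for cleanup work */
};

static void short_sleep(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */
	nanosleep(&ts, NULL);
}

static void *worker_fn(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	while (!w->should_stop) {
		if (w->should_park) {
			/* Acknowledge, then sleep until unparked or stopped. */
			w->parked = true;
			pthread_cond_broadcast(&w->cond);
			while (w->should_park && !w->should_stop)
				pthread_cond_wait(&w->cond, &w->lock);
			w->parked = false;
			continue;
		}
		w->iterations++;
		pthread_mutex_unlock(&w->lock);
		short_sleep();
		pthread_mutex_lock(&w->lock);
	}
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* No unlocked pre-check needed: a repeated call just finds parked == true. */
static void worker_park(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->should_park = true;
	while (!w->parked)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);
}

static void worker_unpark(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	w->should_park = false;
	pthread_cond_broadcast(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

static unsigned long worker_iterations(struct worker *w)
{
	pthread_mutex_lock(&w->lock);
	unsigned long n = w->iterations;
	pthread_mutex_unlock(&w->lock);
	return n;
}

/* Returns 0 if the worker made no progress while it was parked. */
int run_demo(void)
{
	struct worker w = { .iterations = 0 };
	int rc;

	pthread_mutex_init(&w.lock, NULL);
	pthread_cond_init(&w.cond, NULL);
	rc = pthread_create(&w.tid, NULL, worker_fn, &w);
	assert(rc == 0);
	(void)rc;

	worker_park(&w);
	worker_park(&w);	/* second, unconditional park returns at once */
	unsigned long before = worker_iterations(&w);
	short_sleep();
	int ok = (worker_iterations(&w) == before);

	worker_unpark(&w);
	while (worker_iterations(&w) == before)
		short_sleep();	/* wait until the worker runs again */

	pthread_mutex_lock(&w.lock);
	w.should_stop = true;
	pthread_cond_broadcast(&w.cond);
	pthread_mutex_unlock(&w.lock);
	pthread_join(w.tid, NULL);
	pthread_cond_destroy(&w.cond);
	pthread_mutex_destroy(&w.lock);
	return ok ? 0 : 1;
}
```

Calling `run_demo()` from a `main()` should return 0. The design point is only that the "is it already parked?" decision lives inside the same lock as the park request, rather than in a separate lockless test made by the caller.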
+		kthread_park(sched->thread);
+
 	spin_lock(&sched->job_list_lock);
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);

 	if (job) {
-		/*
-		 * Remove the bad job so it cannot be freed by concurrent
-		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-		 * is parked at which point it's safe.
-		 */
-		list_del_init(&job->list);
 		spin_unlock(&sched->job_list_lock);

+		/* vendor's timeout_job should call drm_sched_start() */
 		status = job->sched->ops->timedout_job(job);

 		/*
@@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 	kthread_park(sched->thread);

 	/*
-	 * Reinsert back the bad job here - now it's safe as
-	 * drm_sched_get_cleanup_job cannot race against us and release the
-	 * bad job at this point - we parked (waited for) any in progress
-	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-	 * now until the scheduler thread is unparked.
-	 */
-	if (bad && bad->sched == sched)
-		/*
-		 * Add at the head of the queue to reflect it was the earliest
-		 * job extracted.
-		 */
-		list_add(&bad->list, &sched->pending_list);
-
-	/*
 	 * Iterate the job list from later to earlier one and either deactive
 	 * their HW callbacks or remove them from pending list if they already
 	 * signaled.
-- 2.7.4
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch