Am 07.09.21 um 10:47 schrieb Daniel Vetter:
[SNIP]
If we embed it, then I think it should start existing latest from drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So the right thing to do here is to have the hw fence as a pointer in struct drm_sched_job. And check in drm_sched_job_arm() that it's at least allocated.
Why we need to allocate the HW fence if it's embedded within a job struct ?
the hw fence is a refcounted struct, and the drm_sched_job is a different struct. And we didn't have a dri-devel discussion about whether it's correct to conflate these two lifetimes, amdgpu folks simply hacked something together.
Obviously scheduler level changes must be discussed at dri-devel forum level. What happened here and as Monk already mentioned - we had internal discussion about how to fix the problem in the header of this thread - avoiding accessing feed job from TDR handler without the current hack in place of removal and back insertion into pending list. It's there we we came up (I think Christian first mentioned this) with the idea of embedding the HW fence into amdgpu job - both for avoiding memory allocations but also - because this allows to use the HW fence's recounting as a solution to the above problem by simply grabbing a reference to the next fence in the pending list as a first step in the TDR handler. What we didn't take into account at the time is that indeed this change cannot be limited to amdgpu level - this we noticed much later during final code reviews.
Not sure where this fell through cracks, but imo at least changing where the hw fence is allocated is a very fundamental change, and latest then you should have discussed this on dri-devel.
I'm the one who kicked this off in April and I made a nice internal presentation to explain what the problems is etc... So the idea of embedding the hardware fence into the job came from me.
But during the presentation I also noted that we need to sync up with a guy named Daniel Vetter because it was his patch set which surfaced this issue by annotating fence completion prerequisite in lockdep.
But even the tdr races would probably have been good to start on dri-devel. Now it looks like Monk&team have lost 6 months for nothing.
Well to make it clear I've noted during the presentation in April that this needs to be discussed with you, I've also noted to the first guy working on this that this needs to be discussed on dri-devel instead of internally and I'm pretty sure that I've noted this a couple of more times after it moved to somebody else. And IIRC Andrey also noted that we should not discuss this internally pretty early as well.
So if people are not listening it is not a surprise that they spend time on stuff which isn't upstreamable like this.
Christian.
-Daniel
Andrey
Otherwise we're just diverging across drivers and tempting them to do the wrong thing with the current ->run_job callback interface.
Maybe we should switch from embedding in driver level job struct as it's now to drm_sched_job and just leave the fence initialization to driver specific code ?
Maybe? Like I've not been involved in these discussion ont he amd side at all, I'm just noticing that we do have a now rather inconsistently used inteface across drivers. Which is no good. -Daniel
Andrey
Can you guys look into this? -Daniel
>> Another thing I recently pondered for tdr races looking at i915 code is >> whether the tdr should first block the completion fence for that job. My >> motivation is to have a race-free error capture (if the completion races >> then we might start evicting memory and everything goes boom), but maybe >> that helps here too. Some kind of atomic "block this fence from >> completing thing. >> >> Or I'm I completely guessing in the wrong direction? > I think we already do it here - https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Felixir.boo... Ah yes this works becase drm/sched has separate hw fence from the logical job fence. -Daniel
> Andrey > > >> -Daniel >> >>> Andrey >>> >>> >>>>> Andrey >>>>> >>>>> >>>>>>> Andrey >>>>>>> >>>>>>> >>>>>>>>> spin_unlock(&sched->job_list_lock); >>>>>>>>> >>>>>>>>> status = job->sched->ops->timedout_job(job); >>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work) >>>>>>>>> job->sched->ops->free_job(job); >>>>>>>>> sched->free_guilty = false; >>>>>>>>> } >>>>>>>>> + dma_fence_put(fence); >>>>>>>>> } else { >>>>>>>>> spin_unlock(&sched->job_list_lock); >>>>>>>>> } >>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad) >>>>>>>>> >>>>>>>>> kthread_park(sched->thread); >>>>>>>>> >>>>>>>>> - /* >>>>>>>>> - * Reinsert back the bad job here - now it's safe as >>>>>>>>> - * drm_sched_get_cleanup_job cannot race against us and release the >>>>>>>>> - * bad job at this point - we parked (waited for) any in progress >>>>>>>>> - * (earlier) cleanups and drm_sched_get_cleanup_job will not be called >>>>>>>>> - * now until the scheduler thread is unparked. >>>>>>>>> - */ >>>>>>>>> - if (bad && bad->sched == sched) >>>>>>>>> - /* >>>>>>>>> - * Add at the head of the queue to reflect it was the earliest >>>>>>>>> - * job extracted. >>>>>>>>> - */ >>>>>>>>> - list_add(&bad->list, &sched->pending_list); >>>>>>>>> - >>>>>>>>> /* >>>>>>>>> * Iterate the job list from later to earlier one and either deactive >>>>>>>>> * their HW callbacks or remove them from pending list if they already >>>>>>>>> -- >>>>>>>>> 2.25.1
>>>>>>>>>
Daniel Vetter Software Engineer, Intel Corporation https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll....