On 2021-09-08 2:50 a.m., Boris Brezillon wrote:
On Tue, 7 Sep 2021 14:53:58 -0400 Andrey Grodzovsky andrey.grodzovsky@amd.com wrote:
On 2021-06-29 7:24 a.m., Christian König wrote:
Am 29.06.21 um 13:18 schrieb Boris Brezillon:
Hi Christian,
On Tue, 29 Jun 2021 13:03:58 +0200 Christian König christian.koenig@amd.com wrote:
Am 29.06.21 um 09:34 schrieb Boris Brezillon:
Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU reset. This leads to extra complexity when we need to synchronize timeout works with the reset work. One solution to address that is to have an ordered workqueue at the driver level that will be used by the different schedulers to queue their timeout work. Thanks to the serialization provided by the ordered workqueue we are guaranteed that timeout handlers are executed sequentially, and can thus easily reset the GPU from the timeout handler without extra synchronization.
Well, we had already tried this and it didn't worked the way it is expected.
The major problem is that you not only want to serialize the queue, but rather have a single reset for all queues.
Otherwise you schedule multiple resets for each hardware queue. E.g. for your 3 hardware queues you would reset the GPU 3 times if all of them time out at the same time (which is rather likely).
Using a single delayed work item doesn't work either because you then only have one timeout.
What could be done is to cancel all delayed work items from all stopped schedulers.
drm_sched_stop() does that already, and since we call drm_sched_stop() on all queues in the timeout handler, we end up with only one global reset happening even if several queues report a timeout at the same time.
Ah, nice. Yeah, in this case it should indeed work as expected.
Feel free to add an Acked-by: Christian König christian.koenig@amd.com to it.
Regards, Christian.
Seems to me that for this to work we need to change cancel_delayed_work to cancel_delayed_work_sync so not only pending TO handlers are cancelled but also any in progress are waited for and to to prevent rearming. Also move it right after kthread_park - before we start touching pending list.
I'm probably missing something, but I don't really see why this specific change would require replacing cancel_delayed_work() calls by the sync variant.
I see, I missed the point that since now we have a single threaded processing and only one TDR handler runs at given time there is no need to wait for other parallel in flight TDR handlers.
I mean, if there's a situation where we need to wait for in-flight timeout handler to return, it was already the case before that patch.
In amdgpu case we avoided that by trylock on a common lock and returning early in case it was already taken by another TDR handler
Note that we need to be careful to not call the sync variant in helpers that are called from the interrupt handler itself to avoid deadlocks (i.e. drm_sched_stop()).
I am not clear here - which interrupt handler is drm_sched_stop called from ? It's called from TDR work as far as I see in the code.
Andrey