On Tue, Jul 3, 2012 at 5:26 AM, Christian König deathsimple@vodafone.de wrote:
On 02.07.2012 19:27, Jerome Glisse wrote:
On Mon, Jul 2, 2012 at 1:05 PM, Christian König deathsimple@vodafone.de wrote:
On 02.07.2012 18:41, Jerome Glisse wrote:
On Mon, Jul 2, 2012 at 12:26 PM, Christian König deathsimple@vodafone.de wrote:
On 02.07.2012 17:39, j.glisse@gmail.com wrote:
From: Jerome Glisse jglisse@redhat.com
A GPU reset needs to be exclusive, with only one happening at a time. For this, add a rw semaphore so that any path that triggers GPU activity has to take the semaphore as a reader, thus allowing concurrency.
The GPU reset path takes the semaphore as a writer, ensuring that no concurrent resets take place.
Signed-off-by: Jerome Glisse jglisse@redhat.com
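[A minimal sketch of the locking pattern described in the commit message, under the assumption that the semaphore lives in struct radeon_device under a name like exclusive_lock; the trimmed-down struct and the function names below are illustrative, not the actual patch hunks:]

/* Illustrative sketch only; field and function names are assumptions. */
#include <linux/rwsem.h>

struct radeon_device {
        /* ... existing members elided ... */
        struct rw_semaphore exclusive_lock; /* readers: GPU activity, writer: GPU reset */
};

/* Any path that triggers GPU activity takes the semaphore as a reader,
 * so normal submissions still run concurrently with each other. */
static int radeon_cs_submit_sketch(struct radeon_device *rdev)
{
        int r;

        down_read(&rdev->exclusive_lock);
        r = 0; /* ... validate and submit the command stream ... */
        up_read(&rdev->exclusive_lock);
        return r;
}

/* The GPU reset path takes the semaphore as a writer, so it waits for
 * in-flight activity to drain and blocks new activity until it is done. */
static int radeon_gpu_reset_sketch(struct radeon_device *rdev)
{
        int r;

        down_write(&rdev->exclusive_lock);
        r = 0; /* ... suspend, asic reset, resume ... */
        up_write(&rdev->exclusive_lock);
        return r;
}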
NAK, that isn't as bad as the cs mutex was, but it's still too complicated. It's just too far up in the call stack, e.g. it tries to catch ioctl operations instead of catching the underlying hardware operation which is caused by the ioctl/ttm/etc...
Why not just take the ring lock as I suggested?
No, we can't use the ring lock: we need to protect the suspend/resume path, and we need an exclusive lock for that, so we need a reset mutex at the very least. But instead of having a reset mutex I prefer using a rw lock, so that we can stop ioctls until a reset goes through and let pending ioctls take proper action. Think about multiple contexts trying to reset the GPU ...
Really, this is the best option; the rw locking won't induce any lock contention except in the GPU reset case, which is bad news anyway.
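[To illustrate the "multiple contexts trying to reset the GPU" point, here is a hedged sketch; the gpu_locked_up flag and all names below are hypothetical, not code from the patch. With the semaphore taken as a writer, concurrent resets are serialized, and whichever context gets the lock second can notice the GPU was already recovered and back off instead of resetting it twice:]

#include <linux/rwsem.h>
#include <linux/types.h>

struct radeon_device_sketch {
        struct rw_semaphore exclusive_lock;
        bool gpu_locked_up; /* hypothetical lockup state for this sketch */
};

static int radeon_gpu_reset_serialized_sketch(struct radeon_device_sketch *rdev)
{
        down_write(&rdev->exclusive_lock);
        if (!rdev->gpu_locked_up) {
                /* another context finished its reset while we waited for the lock */
                up_write(&rdev->exclusive_lock);
                return 0;
        }
        /* ... heavy reset: suspend, asic reset, resume ... */
        rdev->gpu_locked_up = false;
        up_write(&rdev->exclusive_lock);
        return 0;
}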
Why? That makes no sense to me. My point isn't about preventing lock contention; I just don't understand why we should add locking at the ioctl level. That violates locking rule number one, "lock data instead of code" (or in our case "lock hardware access instead of code paths"), and it is really the reason why we ended up with the cs_mutex protecting practically everything.
Multiple contexts trying to reset the GPU should be pretty fine; currently it would just reset the GPU twice, but in the future asic_reset should be much more fine-grained and only reset the parts of the GPU which really need a reset.
Cheers, Christian.
No, multiple resets are not fine; try it yourself and you will see all kinds of oops (I strongly advise you to sync your hd before stress testing that). Yes, we need to protect the code path, because the suspend/resume code path is a special one: it touches a lot of data in the device structure. GPU reset is a rare event, or should be a rare event.
Yeah, but I thought of fixing those oops as the second step. I see the fact that suspend/resume is unpinning all the ttm memory and then pinning it again as a bug that needs to be fixed. Or as an alternative we could split the suspend/resume calls into suspend/disable and resume/enable calls, where we only call disable/enable in the gpu_reset path rather than doing a complete suspend/resume (not really sure about that).
Fixing oops is not the second step, it is the first step. I am not saying that the suspend/resume as it happens right now is a good thing; I am saying it is what it is, and we have to deal with it until we do something better. There is no excuse not to fix an oops with a simple 16-line patch.
Also, a GPU reset isn't such a rare event; currently it just occurs when userspace is doing something bad, for example submitting an invalid shader or stuff like that. But with VM and IO protection coming into the driver we are going to need a GPU reset whenever there is a protection fault, and with that it is really desirable to just reset the hardware in a way where we can say: this IB was faulty, skip over it and resume with whatever is after it on the ring.
There is a mechanism to handle that properly from the irq handler that AMD needs to release; the pagefault thing could be transparent and should only need a few lines in the irq handler (I think I did a patch for that and sent it to AMD for review, but I am wondering if I wasn't lacking some doc).
And to do that we need to keep the auxiliary data like the sub allocator memory, the blitting shader bo, and especially the vm page tables at the same place in hardware memory.
I agree that we need a lightweight reset, but we need to keep the heavy reset as it is right now; if you want to do a lightweight reset, do it as a new function. I always intended to have two reset paths (hint: the gpu soft reset name vs what is called hard reset but not released); I even made patches for that a long time ago but never got them cleared through AMD review.
I stress it: we need at the very least a mutex to protect gpu reset, and I will stand on that position because there is no way around it. Using a rw lock has the bonus of allowing proper handling of gpu reset failure, and that is what the patch I sent for Linus' fixes tree is about; so to make drm-next merge properly while preserving proper behavior on gpu reset failure, the rw semaphore is the best option.
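[A hedged sketch of that last point about reset failure; the gpu_is_dead flag and the names below are assumptions for illustration only. If the heavy reset fails, the writer records it, and every ioctl that subsequently takes the read lock can bail out cleanly instead of touching dead hardware:]

#include <linux/errno.h>
#include <linux/rwsem.h>
#include <linux/types.h>

struct radeon_device_sketch {
        struct rw_semaphore exclusive_lock;
        bool gpu_is_dead; /* hypothetical flag, set by the reset path if the heavy reset fails */
};

static int radeon_ioctl_sketch(struct radeon_device_sketch *rdev)
{
        int r;

        down_read(&rdev->exclusive_lock);
        if (rdev->gpu_is_dead) {
                /* the reset already failed; refuse new GPU work instead of oopsing */
                up_read(&rdev->exclusive_lock);
                return -EIO;
        }
        r = 0; /* ... normal GPU work ... */
        up_read(&rdev->exclusive_lock);
        return r;
}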
Oh well, I'm not arguing that we don't need a lock here. I'm just questioning putting it at the ioctl level (e.g. the driver entry from userspace); that wasn't such a good idea with the cs_mutex and doesn't seem like a good idea now. Instead we should place the locking between the ioctl/ttm layer and the actual hardware submission, and that brings it pretty close (if not identical) to the ring mutex.
Cheers, Christian.
No, locking at the ioctl level makes sense; please show me figures demonstrating that it hurts performance, because I did a quick sysprof and couldn't see it impacting anything. No matter how much you hate this, this is the best solution: it avoids each ioctl doing useless work in case of a gpu lockup, and it touches a lot less code than moving a lock down the call path would. So again, this is the best solution for the heavy reset, and I am not saying that a soft reset would need to take this lock or that we can't improve the way it's done. All I am saying is that the ring lock is the wrong solution for the heavy reset; it should be ok for a lightweight reset.
Cheers, Jerome