Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

List overview All Threads
Download

newer

older

[Bug 70035] GPU Lockup on AMD...

[Bug 60827] X under radeon driver...

Christian König

2 Oct 2013 2 Oct '13

2:50 p.m.

Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency. Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...

I'm afraid signalling the fences with an IB test is not reliable.

Marek

On Wed, Oct 2, 2013 at 3:52 PM, Christian König deathsimple@vodafone.de wrote:

...
NAK, after recovering from a lockup the first thing we do is signalling all remaining fences with an IB test.

If we don't recover we indeed signal all fences manually.

Signalling all fences regardless of the outcome of the reset creates problems with both types of partial resets.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
From: Marek Olšák marek.olsak@amd.com

After a lockup, fences are not signalled sometimes, causing the GEM_WAIT_IDLE ioctl to never return, which sometimes results in an X server freeze.

This fixes only one of many deadlocks which can occur during a lockup.

Signed-off-by: Marek Olšák marek.olsak@amd.com

drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 841d0e0..7b97baa 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev) radeon_save_bios_scratch_regs(rdev); /* block TTM */ resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
 mutex_lock(&rdev->ring_lock);
 radeon_fence_driver_force_completion(rdev);
 mutex_unlock(&rdev->ring_lock);
radeon_pm_suspend(rdev);
radeon_suspend(rdev);
-- 1.8.1.2

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Show replies by date

Marek Olšák

3 Oct 3 Oct

12:45 a.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due to a GPU hang while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König deathsimple@vodafone.de wrote:

...

Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency. Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
I'm afraid signalling the fences with an IB test is not reliable.

Marek

On Wed, Oct 2, 2013 at 3:52 PM, Christian König deathsimple@vodafone.de wrote:

...
NAK, after recovering from a lockup the first thing we do is signalling all remaining fences with an IB test.

If we don't recover we indeed signal all fences manually.

Signalling all fences regardless of the outcome of the reset creates problems with both types of partial resets.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
From: Marek Olšák marek.olsak@amd.com

After a lockup, fences are not signalled sometimes, causing the GEM_WAIT_IDLE ioctl to never return, which sometimes results in an X server freeze.

This fixes only one of many deadlocks which can occur during a lockup.

Signed-off-by: Marek Olšák marek.olsak@amd.com

drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 841d0e0..7b97baa 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev) radeon_save_bios_scratch_regs(rdev); /* block TTM */ resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
 mutex_lock(&rdev->ring_lock);
 radeon_fence_driver_force_completion(rdev);
 mutex_unlock(&rdev->ring_lock);
radeon_pm_suspend(rdev);
radeon_suspend(rdev);
-- 1.8.1.2

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Christian König

7 Oct 7 Oct

11:08 a.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

...

First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Alex and I spend quite some time on making this reliable again after activating more rings and adding VM support. The main problem is that I couldn't figure out where the CPU deadlock comes from, cause I couldn't reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads are waiting for fences on one of them detects that we have a lockup while the other one keeps holding the exclusive lock. Signaling all fences might work around that problem, but it probably would be better to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák:

...

First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due to a GPU hang while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König deathsimple@vodafone.de wrote:

...
Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency. Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
I'm afraid signalling the fences with an IB test is not reliable.

Marek

On Wed, Oct 2, 2013 at 3:52 PM, Christian König deathsimple@vodafone.de wrote:

...
NAK, after recovering from a lockup the first thing we do is signalling all remaining fences with an IB test.

If we don't recover we indeed signal all fences manually.

Signalling all fences regardless of the outcome of the reset creates problems with both types of partial resets.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
From: Marek Olšák marek.olsak@amd.com

After a lockup, fences are not signalled sometimes, causing the GEM_WAIT_IDLE ioctl to never return, which sometimes results in an X server freeze.

This fixes only one of many deadlocks which can occur during a lockup.

Signed-off-by: Marek Olšák marek.olsak@amd.com

drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 841d0e0..7b97baa 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev) radeon_save_bios_scratch_regs(rdev); /* block TTM */ resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
 mutex_lock(&rdev->ring_lock);
 radeon_fence_driver_force_completion(rdev);
 mutex_unlock(&rdev->ring_lock);
 radeon_pm_suspend(rdev);
 radeon_suspend(rdev);
-- 1.8.1.2

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Christian König

8 Oct 8 Oct

4:21 p.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Hi Marek,

please try the attached patch as a replacement for your signaling all fences patch. I'm not 100% sure if it fixes all issues, but it's at least a start.

Thanks, Christian.

Am 07.10.2013 13:08, schrieb Christian König:

...

...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Alex and I spend quite some time on making this reliable again after activating more rings and adding VM support. The main problem is that I couldn't figure out where the CPU deadlock comes from, cause I couldn't reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads are waiting for fences on one of them detects that we have a lockup while the other one keeps holding the exclusive lock. Signaling all fences might work around that problem, but it probably would be better to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák:

...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due to a GPU hang while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König deathsimple@vodafone.de wrote:

...
Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency. Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
I'm afraid signalling the fences with an IB test is not reliable.

Marek

On Wed, Oct 2, 2013 at 3:52 PM, Christian König deathsimple@vodafone.de wrote:

...
NAK, after recovering from a lockup the first thing we do is signalling all remaining fences with an IB test.

If we don't recover we indeed signal all fences manually.

Signalling all fences regardless of the outcome of the reset creates problems with both types of partial resets.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
From: Marek Olšák marek.olsak@amd.com

After a lockup, fences are not signalled sometimes, causing the GEM_WAIT_IDLE ioctl to never return, which sometimes results in an X server freeze.

This fixes only one of many deadlocks which can occur during a lockup.

Signed-off-by: Marek Olšák marek.olsak@amd.com

drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c index 841d0e0..7b97baa 100644 --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev) radeon_save_bios_scratch_regs(rdev); /* block TTM */ resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
 mutex_lock(&rdev->ring_lock);
 radeon_fence_driver_force_completion(rdev);
 mutex_unlock(&rdev->ring_lock);
 radeon_pm_suspend(rdev);
 radeon_suspend(rdev);
-- 1.8.1.2

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Marek Olšák

9 Oct 9 Oct

10:36 a.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

I'm afraid your patch sometimes causes the GPU reset to fail, which had never happened before IIRC.

The dmesg log from the failure is attached.

Marek

On Tue, Oct 8, 2013 at 6:21 PM, Christian König deathsimple@vodafone.de wrote:

...

Hi Marek,

please try the attached patch as a replacement for your signaling all fences patch. I'm not 100% sure if it fixes all issues, but it's at least a start.

Thanks, Christian.

Am 07.10.2013 13:08, schrieb Christian König:

...
...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Alex and I spend quite some time on making this reliable again after activating more rings and adding VM support. The main problem is that I couldn't figure out where the CPU deadlock comes from, cause I couldn't reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads are waiting for fences on one of them detects that we have a lockup while the other one keeps holding the exclusive lock. Signaling all fences might work around that problem, but it probably would be better to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák:

...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due to a GPU hang while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König deathsimple@vodafone.de wrote:

...
Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency. Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
I'm afraid signalling the fences with an IB test is not reliable.

Marek

On Wed, Oct 2, 2013 at 3:52 PM, Christian König deathsimple@vodafone.de wrote:

...
NAK, after recovering from a lockup the first thing we do is signalling all remaining fences with an IB test.

If we don't recover we indeed signal all fences manually.

Signalling all fences regardless of the outcome of the reset creates problems with both types of partial resets.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

> From: Marek Olšák marek.olsak@amd.com > > After a lockup, fences are not signalled sometimes, causing > the GEM_WAIT_IDLE ioctl to never return, which sometimes results > in an X server freeze. > > This fixes only one of many deadlocks which can occur during a > lockup. > > Signed-off-by: Marek Olšák marek.olsak@amd.com > --- > drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/radeon/radeon_device.c > b/drivers/gpu/drm/radeon/radeon_device.c > index 841d0e0..7b97baa 100644 > --- a/drivers/gpu/drm/radeon/radeon_device.c > +++ b/drivers/gpu/drm/radeon/radeon_device.c > @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device > *rdev) > radeon_save_bios_scratch_regs(rdev); > /* block TTM */ > resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); > + > + mutex_lock(&rdev->ring_lock); > + radeon_fence_driver_force_completion(rdev); > + mutex_unlock(&rdev->ring_lock); > + > radeon_pm_suspend(rdev); > radeon_suspend(rdev); > > -- > 1.8.1.2 > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/dri-devel

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Christian König

11:09 a.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Mhm, that doesn't looks like anything related but more like the reset of the compute ring didn't worked.

How often does that happen? And do you still get the problem where X waits for a fence that never comes back?

Christian.

Am 09.10.2013 12:36, schrieb Marek Olšák:

...

I'm afraid your patch sometimes causes the GPU reset to fail, which had never happened before IIRC.

The dmesg log from the failure is attached.

Marek

On Tue, Oct 8, 2013 at 6:21 PM, Christian König deathsimple@vodafone.de wrote:

...
Hi Marek,

please try the attached patch as a replacement for your signaling all fences patch. I'm not 100% sure if it fixes all issues, but it's at least a start.

Thanks, Christian.

Am 07.10.2013 13:08, schrieb Christian König:

...
...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Alex and I spend quite some time on making this reliable again after activating more rings and adding VM support. The main problem is that I couldn't figure out where the CPU deadlock comes from, cause I couldn't reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads are waiting for fences on one of them detects that we have a lockup while the other one keeps holding the exclusive lock. Signaling all fences might work around that problem, but it probably would be better to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák:

...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due to a GPU hang while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König deathsimple@vodafone.de wrote:

...
Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency. Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

...
I'm afraid signalling the fences with an IB test is not reliable.

Marek

On Wed, Oct 2, 2013 at 3:52 PM, Christian König deathsimple@vodafone.de wrote: > NAK, after recovering from a lockup the first thing we do is > signalling all remaining fences with an IB test. > > If we don't recover we indeed signal all fences manually. > > Signalling all fences regardless of the outcome of the reset creates > problems with both types of partial resets. > > Christian. > > Marek Olšák maraeo@gmail.com schrieb: > >> From: Marek Olšák marek.olsak@amd.com >> >> After a lockup, fences are not signalled sometimes, causing >> the GEM_WAIT_IDLE ioctl to never return, which sometimes results >> in an X server freeze. >> >> This fixes only one of many deadlocks which can occur during a >> lockup. >> >> Signed-off-by: Marek Olšák marek.olsak@amd.com >> --- >> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >> 1 file changed, 5 insertions(+) >> >> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >> b/drivers/gpu/drm/radeon/radeon_device.c >> index 841d0e0..7b97baa 100644 >> --- a/drivers/gpu/drm/radeon/radeon_device.c >> +++ b/drivers/gpu/drm/radeon/radeon_device.c >> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device >> *rdev) >> radeon_save_bios_scratch_regs(rdev); >> /* block TTM */ >> resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); >> + >> + mutex_lock(&rdev->ring_lock); >> + radeon_fence_driver_force_completion(rdev); >> + mutex_unlock(&rdev->ring_lock); >> + >> radeon_pm_suspend(rdev); >> radeon_suspend(rdev); >> >> -- >> 1.8.1.2 >> >> _______________________________________________ >> dri-devel mailing list >> dri-devel@lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/dri-devel

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Marek Olšák

12:04 p.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

The ring test of the first compute ring always fails and it shouldn't affect the GPU reset in any way.

I can't tell if the deadlock issue is fixed, because the GPU reset usually fails with your patch. It always succeeded without your patch.

Marek

On Wed, Oct 9, 2013 at 1:09 PM, Christian König deathsimple@vodafone.de wrote:

...

Mhm, that doesn't looks like anything related but more like the reset of the compute ring didn't worked.

How often does that happen? And do you still get the problem where X waits for a fence that never comes back?

Christian.

Am 09.10.2013 12:36, schrieb Marek Olšák:

...
I'm afraid your patch sometimes causes the GPU reset to fail, which had never happened before IIRC.

The dmesg log from the failure is attached.

Marek

On Tue, Oct 8, 2013 at 6:21 PM, Christian König deathsimple@vodafone.de wrote:

...
Hi Marek,

please try the attached patch as a replacement for your signaling all fences patch. I'm not 100% sure if it fixes all issues, but it's at least a start.

Thanks, Christian.

Am 07.10.2013 13:08, schrieb Christian König:

...
...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Alex and I spend quite some time on making this reliable again after activating more rings and adding VM support. The main problem is that I couldn't figure out where the CPU deadlock comes from, cause I couldn't reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads are waiting for fences on one of them detects that we have a lockup while the other one keeps holding the exclusive lock. Signaling all fences might work around that problem, but it probably would be better to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák:

...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due to a GPU hang while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König deathsimple@vodafone.de wrote:

...
Possible, but I would rather guess that this doesn't work because the IB test runs into a deadlock situation and so the GPU reset never fully completes.

Can you reproduce the problem?

If you want to make GPU resets more reliable I would rather suggest to remove the ring lock dependency. Then we should try to give all the fence wait functions a (reliable) timeout and move reset handling a layer up into the ioctl functions. But for this you need to rip out the old PM code first.

Christian.

Marek Olšák maraeo@gmail.com schrieb:

> I'm afraid signalling the fences with an IB test is not reliable. > > Marek > > On Wed, Oct 2, 2013 at 3:52 PM, Christian König > deathsimple@vodafone.de wrote: >> >> NAK, after recovering from a lockup the first thing we do is >> signalling all remaining fences with an IB test. >> >> If we don't recover we indeed signal all fences manually. >> >> Signalling all fences regardless of the outcome of the reset creates >> problems with both types of partial resets. >> >> Christian. >> >> Marek Olšák maraeo@gmail.com schrieb: >> >>> From: Marek Olšák marek.olsak@amd.com >>> >>> After a lockup, fences are not signalled sometimes, causing >>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results >>> in an X server freeze. >>> >>> This fixes only one of many deadlocks which can occur during a >>> lockup. >>> >>> Signed-off-by: Marek Olšák marek.olsak@amd.com >>> --- >>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >>> b/drivers/gpu/drm/radeon/radeon_device.c >>> index 841d0e0..7b97baa 100644 >>> --- a/drivers/gpu/drm/radeon/radeon_device.c >>> +++ b/drivers/gpu/drm/radeon/radeon_device.c >>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device >>> *rdev) >>> radeon_save_bios_scratch_regs(rdev); >>> /* block TTM */ >>> resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); >>> + >>> + mutex_lock(&rdev->ring_lock); >>> + radeon_fence_driver_force_completion(rdev); >>> + mutex_unlock(&rdev->ring_lock); >>> + >>> radeon_pm_suspend(rdev); >>> radeon_suspend(rdev); >>> >>> -- >>> 1.8.1.2 >>> >>> _______________________________________________ >>> dri-devel mailing list >>> dri-devel@lists.freedesktop.org >>> http://lists.freedesktop.org/mailman/listinfo/dri-devel

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Christian König

13 Oct 13 Oct

12:47 p.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

I've figured out what was wrong with the patch. We need to reset the "needs_reset" flag earlier, otherwise the IB test might think we are in a lockup and aborts the reset after waiting for the minimum timeout period.

Please try the attached patch instead.

Thanks, Christian.

Am 09.10.2013 14:04, schrieb Marek Olšák:

...

The ring test of the first compute ring always fails and it shouldn't affect the GPU reset in any way.

I can't tell if the deadlock issue is fixed, because the GPU reset usually fails with your patch. It always succeeded without your patch.

Marek

On Wed, Oct 9, 2013 at 1:09 PM, Christian König deathsimple@vodafone.de wrote:

...
Mhm, that doesn't looks like anything related but more like the reset of the compute ring didn't worked.

How often does that happen? And do you still get the problem where X waits for a fence that never comes back?

Christian.

Am 09.10.2013 12:36, schrieb Marek Olšák:

...
I'm afraid your patch sometimes causes the GPU reset to fail, which had never happened before IIRC.

The dmesg log from the failure is attached.

Marek

On Tue, Oct 8, 2013 at 6:21 PM, Christian König deathsimple@vodafone.de wrote:

...
Hi Marek,

please try the attached patch as a replacement for your signaling all fences patch. I'm not 100% sure if it fixes all issues, but it's at least a start.

Thanks, Christian.

Am 07.10.2013 13:08, schrieb Christian König:

...
...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Alex and I spend quite some time on making this reliable again after activating more rings and adding VM support. The main problem is that I couldn't figure out where the CPU deadlock comes from, cause I couldn't reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads are waiting for fences on one of them detects that we have a lockup while the other one keeps holding the exclusive lock. Signaling all fences might work around that problem, but it probably would be better to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák:

...
First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due to a GPU hang while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König deathsimple@vodafone.de wrote: > Possible, but I would rather guess that this doesn't work because the > IB > test runs into a deadlock situation and so the GPU reset never fully > completes. > > Can you reproduce the problem? > > If you want to make GPU resets more reliable I would rather suggest to > remove the ring lock dependency. > Then we should try to give all the fence wait functions a (reliable) > timeout and move reset handling a layer up into the ioctl functions. > But for > this you need to rip out the old PM code first. > > Christian. > > Marek Olšák maraeo@gmail.com schrieb: > >> I'm afraid signalling the fences with an IB test is not reliable. >> >> Marek >> >> On Wed, Oct 2, 2013 at 3:52 PM, Christian König >> deathsimple@vodafone.de wrote: >>> NAK, after recovering from a lockup the first thing we do is >>> signalling all remaining fences with an IB test. >>> >>> If we don't recover we indeed signal all fences manually. >>> >>> Signalling all fences regardless of the outcome of the reset creates >>> problems with both types of partial resets. >>> >>> Christian. >>> >>> Marek Olšák maraeo@gmail.com schrieb: >>> >>>> From: Marek Olšák marek.olsak@amd.com >>>> >>>> After a lockup, fences are not signalled sometimes, causing >>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results >>>> in an X server freeze. >>>> >>>> This fixes only one of many deadlocks which can occur during a >>>> lockup. >>>> >>>> Signed-off-by: Marek Olšák marek.olsak@amd.com >>>> --- >>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >>>> 1 file changed, 5 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >>>> b/drivers/gpu/drm/radeon/radeon_device.c >>>> index 841d0e0..7b97baa 100644 >>>> --- a/drivers/gpu/drm/radeon/radeon_device.c >>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c >>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device >>>> *rdev) >>>> radeon_save_bios_scratch_regs(rdev); >>>> /* block TTM */ >>>> resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); >>>> + >>>> + mutex_lock(&rdev->ring_lock); >>>> + radeon_fence_driver_force_completion(rdev); >>>> + mutex_unlock(&rdev->ring_lock); >>>> + >>>> radeon_pm_suspend(rdev); >>>> radeon_suspend(rdev); >>>> >>>> -- >>>> 1.8.1.2 >>>> >>>> _______________________________________________ >>>> dri-devel mailing list >>>> dri-devel@lists.freedesktop.org >>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Marek Olšák

8:16 p.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

This seems to be better. It can do about 3-5 resets correctly, then the GPU resuming fails:

[ 246.882780] [drm:cik_resume] *ERROR* cik startup failed on resume

and then the GPU is being reset again and again endlessly without success.

The dmesg of the endless resets is attached.

Marek

On Sun, Oct 13, 2013 at 2:47 PM, Christian König deathsimple@vodafone.de wrote:

...

I've figured out what was wrong with the patch. We need to reset the "needs_reset" flag earlier, otherwise the IB test might think we are in a lockup and aborts the reset after waiting for the minimum timeout period.

Please try the attached patch instead.

Thanks, Christian.

Am 09.10.2013 14:04, schrieb Marek Olšák:

...
The ring test of the first compute ring always fails and it shouldn't affect the GPU reset in any way.

I can't tell if the deadlock issue is fixed, because the GPU reset usually fails with your patch. It always succeeded without your patch.

Marek

On Wed, Oct 9, 2013 at 1:09 PM, Christian König deathsimple@vodafone.de wrote:

...
Mhm, that doesn't looks like anything related but more like the reset of the compute ring didn't worked.

How often does that happen? And do you still get the problem where X waits for a fence that never comes back?

Christian.

Am 09.10.2013 12:36, schrieb Marek Olšák:

...
I'm afraid your patch sometimes causes the GPU reset to fail, which had never happened before IIRC.

The dmesg log from the failure is attached.

Marek

On Tue, Oct 8, 2013 at 6:21 PM, Christian König deathsimple@vodafone.de wrote:

...
Hi Marek,

please try the attached patch as a replacement for your signaling all fences patch. I'm not 100% sure if it fixes all issues, but it's at least a start.

Thanks, Christian.

Am 07.10.2013 13:08, schrieb Christian König:

...
> First of all, I can't complain about the reliability of the hardware > GPU reset. It's mostly the kernel driver that happens to run into a > deadlock at the same time.

Alex and I spend quite some time on making this reliable again after activating more rings and adding VM support. The main problem is that I couldn't figure out where the CPU deadlock comes from, cause I couldn't reliable reproduce the issue.

What is the content of /proc/<pid of X server>/task/*/stack and sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the deadlock situation?

I'm pretty sure that we nearly always have a problem when two threads are waiting for fences on one of them detects that we have a lockup while the other one keeps holding the exclusive lock. Signaling all fences might work around that problem, but it probably would be better to fix the underlying issue.

Going to take a deeper look into it.

Christian.

Am 03.10.2013 02:45, schrieb Marek Olšák: > > First of all, I can't complain about the reliability of the hardware > GPU reset. It's mostly the kernel driver that happens to run into a > deadlock at the same time. > > Regarding the issue with fences, the problem is that the GPU reset > completes successfully according to dmesg, but X doesn't respond. I > can move the cursor on the screen, but I can't do anything else and > the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can > easily reproduce this, because it's the most common reason why a GPU > lockup leads to frozen X. The GPU actually recovers, but X is hung. I > can't tell whether the fences are just not signalled or whether there > is actually a real CPU deadlock I can't see. > > This patch makes the problem go away and GPU resets are successful > (except for extreme cases, see below). With a small enough lockup > timeout, the lockups are just a minor annoyance and I thought I could > get through a piglit run just with a few tens or hundreds of GPU > resets... > > A different type of deadlock showed up, though it needs a lot of > concurrently-running apps like piglit. What happened is that the > kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due > to a GPU hang while holding onto the exclusive lock, and another > thread wanting to do the GPU reset was unable to acquire the lock. > > That said, I will use the patch locally, because it helps a lot. I > got > a few lockups while writing this email and I'm glad I didn't have to > reboot. > > Marek > > On Wed, Oct 2, 2013 at 4:50 PM, Christian König > deathsimple@vodafone.de > wrote: >> >> Possible, but I would rather guess that this doesn't work because >> the >> IB >> test runs into a deadlock situation and so the GPU reset never fully >> completes. >> >> Can you reproduce the problem? >> >> If you want to make GPU resets more reliable I would rather suggest >> to >> remove the ring lock dependency. >> Then we should try to give all the fence wait functions a (reliable) >> timeout and move reset handling a layer up into the ioctl functions. >> But for >> this you need to rip out the old PM code first. >> >> Christian. >> >> Marek Olšák maraeo@gmail.com schrieb: >> >>> I'm afraid signalling the fences with an IB test is not reliable. >>> >>> Marek >>> >>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König >>> deathsimple@vodafone.de wrote: >>>> >>>> NAK, after recovering from a lockup the first thing we do is >>>> signalling all remaining fences with an IB test. >>>> >>>> If we don't recover we indeed signal all fences manually. >>>> >>>> Signalling all fences regardless of the outcome of the reset >>>> creates >>>> problems with both types of partial resets. >>>> >>>> Christian. >>>> >>>> Marek Olšák maraeo@gmail.com schrieb: >>>> >>>>> From: Marek Olšák marek.olsak@amd.com >>>>> >>>>> After a lockup, fences are not signalled sometimes, causing >>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results >>>>> in an X server freeze. >>>>> >>>>> This fixes only one of many deadlocks which can occur during a >>>>> lockup. >>>>> >>>>> Signed-off-by: Marek Olšák marek.olsak@amd.com >>>>> --- >>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >>>>> 1 file changed, 5 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >>>>> b/drivers/gpu/drm/radeon/radeon_device.c >>>>> index 841d0e0..7b97baa 100644 >>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c >>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c >>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device >>>>> *rdev) >>>>> radeon_save_bios_scratch_regs(rdev); >>>>> /* block TTM */ >>>>> resched = >>>>> ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); >>>>> + >>>>> + mutex_lock(&rdev->ring_lock); >>>>> + radeon_fence_driver_force_completion(rdev); >>>>> + mutex_unlock(&rdev->ring_lock); >>>>> + >>>>> radeon_pm_suspend(rdev); >>>>> radeon_suspend(rdev); >>>>> >>>>> -- >>>>> 1.8.1.2 >>>>> >>>>> _______________________________________________ >>>>> dri-devel mailing list >>>>> dri-devel@lists.freedesktop.org >>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Christian König

14 Oct 14 Oct

9:19 a.m.

New subject: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Ok, that one was easy to fix. Please apply the attached patch as well.

Going to send out both for inclusion in 3.12 in a minute.

Christian.

Am 13.10.2013 22:16, schrieb Marek Olšák:

...

This seems to be better. It can do about 3-5 resets correctly, then the GPU resuming fails:

[ 246.882780] [drm:cik_resume] *ERROR* cik startup failed on resume

and then the GPU is being reset again and again endlessly without success.

The dmesg of the endless resets is attached.

Marek

On Sun, Oct 13, 2013 at 2:47 PM, Christian König deathsimple@vodafone.de wrote:

...
I've figured out what was wrong with the patch. We need to reset the "needs_reset" flag earlier, otherwise the IB test might think we are in a lockup and aborts the reset after waiting for the minimum timeout period.

Please try the attached patch instead.

Thanks, Christian.

Am 09.10.2013 14:04, schrieb Marek Olšák:

...
The ring test of the first compute ring always fails and it shouldn't affect the GPU reset in any way.

I can't tell if the deadlock issue is fixed, because the GPU reset usually fails with your patch. It always succeeded without your patch.

Marek

On Wed, Oct 9, 2013 at 1:09 PM, Christian König deathsimple@vodafone.de wrote:

...
Mhm, that doesn't looks like anything related but more like the reset of the compute ring didn't worked.

How often does that happen? And do you still get the problem where X waits for a fence that never comes back?

Christian.

Am 09.10.2013 12:36, schrieb Marek Olšák:

...
I'm afraid your patch sometimes causes the GPU reset to fail, which had never happened before IIRC.

The dmesg log from the failure is attached.

Marek

On Tue, Oct 8, 2013 at 6:21 PM, Christian König deathsimple@vodafone.de wrote:

...
Hi Marek,

please try the attached patch as a replacement for your signaling all fences patch. I'm not 100% sure if it fixes all issues, but it's at least a start.

Thanks, Christian.

Am 07.10.2013 13:08, schrieb Christian König:

>> First of all, I can't complain about the reliability of the hardware >> GPU reset. It's mostly the kernel driver that happens to run into a >> deadlock at the same time. > > Alex and I spend quite some time on making this reliable again after > activating more rings and adding VM support. The main problem is that > I > couldn't figure out where the CPU deadlock comes from, cause I > couldn't > reliable reproduce the issue. > > What is the content of /proc/<pid of X server>/task/*/stack and > sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in > the > deadlock situation? > > I'm pretty sure that we nearly always have a problem when two threads > are > waiting for fences on one of them detects that we have a lockup while > the > other one keeps holding the exclusive lock. Signaling all fences might > work > around that problem, but it probably would be better to fix the > underlying > issue. > > Going to take a deeper look into it. > > Christian. > > Am 03.10.2013 02:45, schrieb Marek Olšák: >> First of all, I can't complain about the reliability of the hardware >> GPU reset. It's mostly the kernel driver that happens to run into a >> deadlock at the same time. >> >> Regarding the issue with fences, the problem is that the GPU reset >> completes successfully according to dmesg, but X doesn't respond. I >> can move the cursor on the screen, but I can't do anything else and >> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can >> easily reproduce this, because it's the most common reason why a GPU >> lockup leads to frozen X. The GPU actually recovers, but X is hung. I >> can't tell whether the fences are just not signalled or whether there >> is actually a real CPU deadlock I can't see. >> >> This patch makes the problem go away and GPU resets are successful >> (except for extreme cases, see below). With a small enough lockup >> timeout, the lockups are just a minor annoyance and I thought I could >> get through a piglit run just with a few tens or hundreds of GPU >> resets... >> >> A different type of deadlock showed up, though it needs a lot of >> concurrently-running apps like piglit. What happened is that the >> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due >> to a GPU hang while holding onto the exclusive lock, and another >> thread wanting to do the GPU reset was unable to acquire the lock. >> >> That said, I will use the patch locally, because it helps a lot. I >> got >> a few lockups while writing this email and I'm glad I didn't have to >> reboot. >> >> Marek >> >> On Wed, Oct 2, 2013 at 4:50 PM, Christian König >> deathsimple@vodafone.de >> wrote: >>> Possible, but I would rather guess that this doesn't work because >>> the >>> IB >>> test runs into a deadlock situation and so the GPU reset never fully >>> completes. >>> >>> Can you reproduce the problem? >>> >>> If you want to make GPU resets more reliable I would rather suggest >>> to >>> remove the ring lock dependency. >>> Then we should try to give all the fence wait functions a (reliable) >>> timeout and move reset handling a layer up into the ioctl functions. >>> But for >>> this you need to rip out the old PM code first. >>> >>> Christian. >>> >>> Marek Olšák maraeo@gmail.com schrieb: >>> >>>> I'm afraid signalling the fences with an IB test is not reliable. >>>> >>>> Marek >>>> >>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König >>>> deathsimple@vodafone.de wrote: >>>>> NAK, after recovering from a lockup the first thing we do is >>>>> signalling all remaining fences with an IB test. >>>>> >>>>> If we don't recover we indeed signal all fences manually. >>>>> >>>>> Signalling all fences regardless of the outcome of the reset >>>>> creates >>>>> problems with both types of partial resets. >>>>> >>>>> Christian. >>>>> >>>>> Marek Olšák maraeo@gmail.com schrieb: >>>>> >>>>>> From: Marek Olšák marek.olsak@amd.com >>>>>> >>>>>> After a lockup, fences are not signalled sometimes, causing >>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results >>>>>> in an X server freeze. >>>>>> >>>>>> This fixes only one of many deadlocks which can occur during a >>>>>> lockup. >>>>>> >>>>>> Signed-off-by: Marek Olšák marek.olsak@amd.com >>>>>> --- >>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >>>>>> 1 file changed, 5 insertions(+) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >>>>>> b/drivers/gpu/drm/radeon/radeon_device.c >>>>>> index 841d0e0..7b97baa 100644 >>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c >>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c >>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device >>>>>> *rdev) >>>>>> radeon_save_bios_scratch_regs(rdev); >>>>>> /* block TTM */ >>>>>> resched = >>>>>> ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); >>>>>> + >>>>>> + mutex_lock(&rdev->ring_lock); >>>>>> + radeon_fence_driver_force_completion(rdev); >>>>>> + mutex_unlock(&rdev->ring_lock); >>>>>> + >>>>>> radeon_pm_suspend(rdev); >>>>>> radeon_suspend(rdev); >>>>>> >>>>>> -- >>>>>> 1.8.1.2 >>>>>> >>>>>> _______________________________________________ >>>>>> dri-devel mailing list >>>>>> dri-devel@lists.freedesktop.org >>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/dri-devel

4242

Age (days ago)

4254

Last active (days ago)

dri-devel@lists.freedesktop.org

9 comments

2 participants

tags (0)

participants (2)

Christian König
Marek Olšák