Re: [OSADL QA 3.18.9-rt4 #1] Radeon driver hangs

22 Mar 2015


Hi Michel,
[..]
The most striking problem of kernel 3.18.9-rt4 affects all systems that
are equipped with Radeon graphics (irrespective of whether they use PCIe
cards or APUs with on-chip graphics): the radeon driver hangs. The hang
occurs when accelerated graphics load is created by x11perf or
gltestperf. Sometimes only the graphics are frozen while ssh login is
still possible; sometimes the entire box is no longer accessible at all.
In any case, a reboot is needed to recover from this situation.
Here is a selection of kernel messages:
[...]
The commits from
http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f95706...
to
http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd...
and
http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b66101...
might help with this.
Thanks a lot. I have applied these patches to a number of systems:
# quilt applied | tail -7
patches/drm-radeon-do-a-posting-read-in-r100_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-rs600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-r600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-si_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-cik_set_irq.patch
patches/drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch
The graphics boards still crash and freeze the screen, but in contrast
to the earlier situation the systems remain accessible, and the X
Window server can be restarted after the offending programs are
removed. The crashes were reliably triggered by

gltestperf
 or
x11perf -repeat 3 -subs 25 -time 2 -rect10

This is not entirely correct, since gltestperf does not reliably crash
the graphics controller. However, "x11perf -repeat 3 -subs 25 -time 2
-rect10" always triggers the crash.
but the crashes also occur several times per day during normal work
such as browsing the Internet or writing a text document. If you wish
me to provide additional diagnostic information, such as the output of
test programs run while the graphics boards are unresponsive, I can
certainly do that.
Does it also happen with a kernel built from a current drm-fixes tree?
http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes
No. Apparently, you need full preemption to expose the problem.
The following list shows, for each kernel, whether the command "x11perf
-repeat 3 -subs 25 -time 2 -rect10" freezes the Radeon board under test
(Radeon HD 7970 XFS / R9 280X):
linux-3.12.33-rt47               no
linux-3.14.34-rt32               no
linux-3.14.34-drm-3.16.7-rt32*   no
linux-3.18.7-rt1                YES
linux-3.18.9-rt4                YES
linux-3.18.9-rt5                YES
linux-3.18.9-drm-3.16.7-rt5**    no
linux-4.0.0-rc4                  no
linux-drm-fixes                  no
*DRM subsystem backported from linux-3.16.7 to linux-3.14.34-rt32.
**DRM subsystem ported from linux-3.16.7 to linux-3.18.9-rt5.
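A per-kernel check like the one behind this list could be scripted
roughly as follows. This is only a sketch under assumptions: the helper
names hangs_within and check_radeon_freeze are hypothetical, the 60
second limit is a guess, and a run that does not finish in time is
merely taken as a hint that the board froze.

```shell
#!/bin/sh
# Run a command with a time limit. Succeeds (exit 0) if the limit was hit,
# i.e. the command did not finish on its own. Relies on coreutils `timeout`,
# which exits with status 124 when it had to stop the command.
hangs_within() {
    limit="$1"; shift
    timeout "$limit" "$@" >/dev/null 2>&1
    [ "$?" -eq 124 ]
}

# Hypothetical check for the currently running kernel.
# Assumption: x11perf never returns once the board is frozen.
check_radeon_freeze() {
    if hangs_within 60 x11perf -repeat 3 -subs 25 -time 2 -rect10; then
        echo "$(uname -r) YES"
    else
        echo "$(uname -r) no"
    fi
}
```

Calling check_radeon_freeze from within an X session on each kernel
under test would print one line per run in the style of the list above.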
More observations:
If full function tracing is enabled (which makes the system about five
times slower), the graphics controller no longer freezes. With partial
function tracing such as "echo *drm* >set_ftrace_filter", the
controller still freezes. The trace then contains only vblank interrupt
processing; ioctls are no longer executed.
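For reference, the partial tracing setup can be sketched as below. The
function names are hypothetical, and the default path assumes that
debugfs is mounted at /sys/kernel/debug. The filter pattern is quoted so
that the shell cannot expand it against file names in the current
directory.

```shell
#!/bin/sh
# Assumed default location of the ftrace control files.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"

# Restrict the function tracer to functions matching a glob such as '*drm*'
# and switch tracing on.
start_function_trace() {
    pattern="$1"
    echo 0 > "$TRACING_DIR/tracing_on" &&
    printf '%s\n' "$pattern" > "$TRACING_DIR/set_ftrace_filter" &&
    echo function > "$TRACING_DIR/current_tracer" &&
    echo 1 > "$TRACING_DIR/tracing_on"
}

# Switch tracing off again and return to the no-op tracer.
stop_function_trace() {
    echo 0 > "$TRACING_DIR/tracing_on" &&
    echo nop > "$TRACING_DIR/current_tracer"
}
```

Run as root, "start_function_trace '*drm*'" followed by the x11perf
command and then "cat $TRACING_DIR/trace" would show which DRM functions
were still being called at the time of the freeze.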
This is the location where the driver hangs:
[25104.509258] INFO: task Xorg.bin:16591 blocked for more than 120 seconds.
[25104.516322]       Not tainted 3.18.9-rt5 #2
[25104.520715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25104.528853] Xorg.bin        D ffffffff8171ed90     0 16591  16239 0x10400080
[25104.536102]  ffff8800ba0bb8d8 0000000000000002 ffff8800ba0bbfd8 0000000000000006
[25104.536103]  000000000000dc08 ffff880626d0dc08 ffff8800ba0bbfd8 000000000000dc08
[25104.536104]  ffff88061b2cdcd0 ffff880616d3a940 ffff880035c10000 ffff880616d3a940
[25104.559274] Call Trace:
[25104.561844]  [<ffffffff8171bb54>] schedule+0x34/0xa0
[25104.561846]  [<ffffffff8171e2ac>] schedule_timeout+0x23c/0x2a0
[25104.561870]  [<ffffffffa00e3ab6>] ? radeon_fence_process+0x16/0x40 [radeon]
[25104.561879]  [<ffffffffa00e3b24>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[25104.561887]  [<ffffffffa00e3e97>] radeon_fence_wait_seq_timeout.constprop.8+0x327/0x380 [radeon]
[25104.561889]  [<ffffffff810d19c0>] ? __wake_up_sync+0x20/0x20
[25104.561898]  [<ffffffffa00e4287>] radeon_fence_wait_any+0x57/0x70 [radeon]
[25104.561914]  [<ffffffffa015a36f>] radeon_sa_bo_new+0x2af/0x4b0 [radeon]
[25104.561916]  [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
[25104.561918]  [<ffffffff811d0b4a>] ? __kmalloc+0x8a/0x300
[25104.561932]  [<ffffffffa01b2197>] radeon_ib_get+0x37/0xe0 [radeon]
[25104.561943]  [<ffffffffa01003ee>] radeon_cs_ioctl+0x22e/0x860 [radeon]
[25104.561952]  [<ffffffffa0005bc7>] drm_ioctl+0x197/0x670 [drm]
[25104.561954]  [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
[25104.561956]  [<ffffffff810901ba>] ? unpin_current_cpu+0x1a/0x80
[25104.561959]  [<ffffffff810ba200>] ? migrate_enable+0x90/0x1a0
[25104.561966]  [<ffffffffa00c604c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[25104.561967]  [<ffffffff811fdb88>] do_vfs_ioctl+0x2c8/0x4c0
[25104.561969]  [<ffffffff81208a92>] ? __fget+0x72/0xb0
[25104.561970]  [<ffffffff811fde01>] SyS_ioctl+0x81/0xa0
[25104.561971]  [<ffffffff8171f99e>] tracesys_phase2+0xd4/0xd9
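To reduce such a trace to the radeon call chain, a small filter along
the following lines may help. It is a sketch only: radeon_frames is a
hypothetical helper, it expects one frame per line as dmesg prints them,
and it skips the frames marked with "?", which the kernel flags as
unreliable.

```shell
#!/bin/sh
# Read kernel log text on stdin and print the function names of the
# reliable (not "?"-marked) stack frames that belong to the radeon module.
radeon_frames() {
    sed -nE 's|^.*\] ([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[radeon\].*$|\1|p'
}
```

For example, "dmesg | radeon_frames" lists the frames innermost first,
which for the trace above starts at the fence wait and ends at
radeon_drm_ioctl.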
Conclusion:
A change to the DRM subsystem between 3.16.7 and 3.18.9 introduced a
race condition that freezes Radeon graphics. It requires full
preemption to be exposed reliably.
Thanks,
    -Carsten.
