(About 30 OSADL QA Farm systems are now running 3.18.9-rt4. BTW: To check out what kernels are under test you may sort the kernel list (https://www.osadl.org/?id=933) by kernel version (https://www.osadl.org/?id=1001) and scroll down the page.)
The most striking problem of kernel 3.18.9-rt4 affects all systems that are equipped with Radeon graphics (irrespective of whether PCIe cards or APUs with on-chip graphics). They suffer from a hanging radeon driver. The block occurs when accelerated graphics load is created by x11perf or gltestperf. Sometimes only the graphics are frozen while ssh login is still possible, sometimes the entire box is no longer accessible at all. In any case, a reboot is needed to recover from this situation.
Here is a selection of kernel messages:
Rack #0/Slot #3 [AMD/ATI] RV730 XT [Radeon HD 4670]:
[16081.272035] INFO: task kworker/u24:4:268 blocked for more than 120 seconds.
[16081.285776] Not tainted 3.18.9-rt4 #26
[16081.294286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[16081.309901] kworker/u24:4 D ffff88081ed8b340 0 268 2 0x10000000
[16081.309938] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
[16081.309960] ffff880805ccfbe8 0000000000000046 ffff88081ed0c700 0000000000000000
[16081.309962] 0000000000009000 000000000000c920 ffff8808112fb420 ffff880805cc1a10
[16081.309963] ffff880805ccfbf8 000001008108a0da ffff880805ccfc98 ffff880805cc1a10
[16081.309966] Call Trace:
[16081.309972] [<ffffffff81721ce4>] schedule+0x34/0xa0
[16081.309974] [<ffffffff8172425c>] schedule_timeout+0x22c/0x2d0
[16081.309984] [<ffffffffa046ca86>] ? radeon_fence_process+0x16/0x40 [radeon]
[16081.309993] [<ffffffffa046caf4>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[16081.310001] [<ffffffffa046ce27>] radeon_fence_wait_seq_timeout.constprop.8+0x2e7/0x340 [radeon]
[16081.310004] [<ffffffff81098be0>] ? __wake_up_sync+0x20/0x20
[16081.310013] [<ffffffffa046d186>] radeon_fence_wait+0x86/0xc0 [radeon]
[16081.310023] [<ffffffffa047af6c>] radeon_flip_work_func+0x15c/0x190 [radeon]
[16081.310025] [<ffffffff810709c4>] process_one_work+0x154/0x450
[16081.310026] [<ffffffff81070fbb>] worker_thread+0x6b/0x4d0
[16081.310028] [<ffffffff81070f50>] ? rescuer_thread+0x290/0x290
[16081.310029] [<ffffffff81075fed>] kthread+0xcd/0xf0
[16081.310031] [<ffffffff81075f20>] ? kthread_worker_fn+0x1d0/0x1d0
[16081.310034] [<ffffffff81725aec>] ret_from_fork+0x7c/0xb0
[16081.310035] [<ffffffff81075f20>] ? kthread_worker_fn+0x1d0/0x1d0
Rack #0/Slot #7 [AMD/ATI] Cayman XT [Radeon HD 6970]:
INFO: task Xorg:10038 blocked for more than 120 seconds.
Not tainted 3.18.9-rt4 #25
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Xorg D ffffffff816b7f88 0 10038 10032 0x10400004
ffff8800c5ad78e8 0000000000000002 ffff88041e80c460 000000000000c5c8
ffff88041e80c5c8 0000000000000002 000000000000c5a8 000000000000c5c8
ffff880417728000 ffff880414010000 000000000000000c ffff880414010000
Call Trace:
[<ffffffff816b50f4>] schedule+0x34/0xa0
[<ffffffff816b72f4>] schedule_timeout+0x204/0x270
[<ffffffffa00cd8e6>] ? radeon_fence_process+0x16/0x40 [radeon]
[<ffffffffa00cd954>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[<ffffffffa00cdbc7>] radeon_fence_wait_seq_timeout.constprop.7+0x227/0x330 [radeon]
[<ffffffff810ac310>] ? prepare_to_wait_event+0x110/0x110
[<ffffffffa00ce027>] radeon_fence_wait_any+0x57/0x70 [radeon]
[<ffffffffa014334f>] radeon_sa_bo_new+0x2cf/0x4e0 [radeon]
[<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20
[<ffffffffa019d477>] radeon_ib_get+0x37/0xf0 [radeon]
[<ffffffffa00e9a3d>] radeon_cs_ioctl+0x22d/0x820 [radeon]
[<ffffffffa001bc04>] drm_ioctl+0x1a4/0x630 [drm]
[<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20
[<ffffffff8106e8da>] ? unpin_current_cpu+0x1a/0x70
[<ffffffff81097440>] ? migrate_enable+0xb0/0x1b0
[<ffffffffa00b004b>] radeon_drm_ioctl+0x4b/0x80 [radeon]
[<ffffffff811c7040>] do_vfs_ioctl+0x2e0/0x4d0
[<ffffffff811d1aa2>] ? __fget+0x72/0xa0
[<ffffffff811c72b1>] SyS_ioctl+0x81/0xa0
[<ffffffff816b8cb2>] tracesys_phase2+0xd4/0xd9
Rack #4/Slot #1 Chipset: "KAVERI" (ChipID = 0x130c):
[ 600.266245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.281856] Xorg D 0000000000000002 0 3821 3812 0x00400080
[ 600.281865] ffff880223ddf908 0000000000000082 000000000000c1c0 000000000000c328
[ 600.281867] ffff88023720c328 0000000000000002 000000000000c308 000000000000c328
[ 600.281869] ffffffff81c1b480 ffff880036cfcb60 000000000000000c ffff880036cfcb60
[ 600.281873] Call Trace:
[ 600.281882] [<ffffffff81736a14>] schedule+0x34/0xa0
[ 600.281885] [<ffffffff81738a44>] schedule_timeout+0x204/0x270
[ 600.281929] [<ffffffffa00b8756>] ? radeon_fence_process+0x16/0x40 [radeon]
[ 600.281949] [<ffffffffa00b87c4>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[ 600.281968] [<ffffffffa00b8a37>] radeon_fence_wait_seq_timeout.constprop.7+0x227/0x330 [radeon]
[ 600.281972] [<ffffffff810815c0>] ? prepare_to_wait_event+0x110/0x110
[ 600.281992] [<ffffffffa00b8e97>] radeon_fence_wait_any+0x57/0x70 [radeon]
[ 600.282023] [<ffffffffa012df5f>] radeon_sa_bo_new+0x2cf/0x4e0 [radeon]
[ 600.282027] [<ffffffff81077a1e>] ? dequeue_task_fair+0x43e/0x650
[ 600.282055] [<ffffffffa0188087>] radeon_ib_get+0x37/0xf0 [radeon]
[ 600.282078] [<ffffffffa00d46bd>] radeon_cs_ioctl+0x22d/0x820 [radeon]
[ 600.282098] [<ffffffffa000ec04>] drm_ioctl+0x1a4/0x630 [drm]
[ 600.282104] [<ffffffff810b2489>] ? do_futex+0x109/0xb20
[ 600.282106] [<ffffffff810787c6>] ? put_prev_entity+0x96/0x3f0
[ 600.282122] [<ffffffffa009b00e>] radeon_drm_ioctl+0xe/0x10 [radeon]
[ 600.282125] [<ffffffff81190db0>] do_vfs_ioctl+0x2e0/0x4d0
[ 600.282128] [<ffffffff8119b792>] ? __fget+0x72/0xa0
[ 600.282131] [<ffffffff81191021>] SyS_ioctl+0x81/0xa0
[ 600.282134] [<ffffffff810d45c6>] ? __audit_syscall_exit+0x236/0x2e0
[ 600.282137] [<ffffffff8173a1d6>] system_call_fastpath+0x16/0x1b
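When many systems report at once, the hung-task headers in logs like the ones above can be condensed to a list of task names and PIDs. The following is only a minimal sketch and not part of the original report: the function name summarize_hung_tasks is hypothetical, and it assumes the kernel log arrives on stdin with one message per line, as dmesg prints it.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): condense hung-task
# headers from a kernel log on stdin into "task pid" pairs.
# Assumes one kernel message per line, as printed by dmesg.
summarize_hung_tasks() {
    # The task name may itself contain colons (e.g. kworker/u24:4), so the
    # greedy first group keeps everything up to the last ":<pid>".
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than [0-9][0-9]* seconds.*/\1 \2/p'
}
```

A possible invocation would be "dmesg | summarize_hung_tasks"; for the first report above it would print "kworker/u24:4 268".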
On 13.03.2015 08:23, Carsten Emde wrote:
(About 30 OSADL QA Farm systems are now running 3.18.9-rt4. BTW: To check out what kernels are under test you may sort the kernel list (https://www.osadl.org/?id=933) by kernel version (https://www.osadl.org/?id=1001) and scroll down the page.)
The most striking problem of kernel 3.18.9-rt4 affects all systems that are equipped with Radeon graphics (irrespective of whether PCIe cards or APUs with on-chip graphics). They suffer from a hanging radeon driver. The block occurs when accelerated graphics load is created by x11perf or gltestperf. Sometimes only the graphics are frozen while ssh login is still possible, sometimes the entire box is no longer accessible at all. In any case, a reboot is needed to recover from this situation.
Here is a selection of kernel messages:
[...]
The commits from http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f95706... to http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd... and http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b66101... might help for this.
On 03/13/2015 03:23 AM, Michel Dänzer wrote:
The commits from http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f95706... to http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd... and http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b66101... might help for this.
Thanks.
I can't reproduce this myself but I pulled in the commits you mentioned and "drm/radeon: only enable kv/kb dpm interrupts once v3" to avoid a reject. The box runs, glxgears and so on seem to do something, can't look at the screen :) All of those commits (and a ton more) are marked stable so I will probably get them anyway…
Sebastian
Hi Michel,
[..] The most striking problem of kernel 3.18.9-rt4 affects all systems that are equipped with Radeon graphics (irrespective of whether PCIe cards or APUs with on-chip graphics). They suffer from a hanging radeon driver. The block occurs when accelerated graphics load is created by x11perf or gltestperf. Sometimes only the graphics are frozen while ssh login is still possible, sometimes the entire box is no longer accessible at all. In any case, a reboot is needed to recover from this situation.
Here is a selection of kernel messages:
[...] The commits from http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f95706... to http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd... and http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b66101... might help for this.
Thanks a lot. I have applied these patches to a number of systems:
# quilt applied | tail -7
patches/drm-radeon-do-a-posting-read-in-r100_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-rs600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-r600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-si_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-cik_set_irq.patch
patches/drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch
The graphics boards still crash and freeze the screen, but in contrast to the earlier situation the systems remain accessible, and the X Window server can be restarted after the offending programs are removed. The crashes were reliably triggered by
- gltestperf or
- x11perf -repeat 3 -subs 25 -time 2 -rect10
but the crashes also occur several times per day during normal work such as browsing the Internet or writing a text document. If you wish me to provide additional diagnostic information such as running test programs while the graphics boards are unresponsive, I certainly can do that.
Below are the related kernel messages.
Thanks, -Carsten.
Rack #0/Slot #3 [AMD/ATI] RV730 XT [Radeon HD 4670]:
[21001.244036] INFO: task kworker/u24:6:267 blocked for more than 120 seconds.
[21001.257773] Not tainted 3.18.9-rt4 #27
[21001.266284] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21001.281911] kworker/u24:6 D ffff88081ed8b340 0 267 2 0x10000000
[21001.281937] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
[21001.281940] ffff880805d2fbe8 0000000000000046 ffff88081ed0c700 0000000000000000
[21001.281941] 0000000000009000 000000000000c920 ffff8808112fb420 ffff880035254e30
[21001.281943] 000000000000c280 000001000000c280 0000000000000003 ffff880035254e30
[21001.281945] Call Trace:
[21001.281950] [<ffffffff81721ce4>] schedule+0x34/0xa0
[21001.281953] [<ffffffff8172425c>] schedule_timeout+0x22c/0x2d0
[21001.281962] [<ffffffffa0439a06>] ? radeon_fence_process+0x16/0x40 [radeon]
[21001.281971] [<ffffffffa0439a74>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[21001.281979] [<ffffffffa0439da7>] radeon_fence_wait_seq_timeout.constprop.8+0x2e7/0x340 [radeon]
[21001.281982] [<ffffffff81098be0>] ? __wake_up_sync+0x20/0x20
[21001.281991] [<ffffffffa043a106>] radeon_fence_wait+0x86/0xc0 [radeon]
[21001.282000] [<ffffffffa0447eec>] radeon_flip_work_func+0x15c/0x190 [radeon]
[21001.282003] [<ffffffff810709c4>] process_one_work+0x154/0x450
[21001.282004] [<ffffffff81070fbb>] worker_thread+0x6b/0x4d0
[21001.282006] [<ffffffff81070f50>] ? rescuer_thread+0x290/0x290
[21001.282007] [<ffffffff81070f50>] ? rescuer_thread+0x290/0x290
[21001.282009] [<ffffffff81075fed>] kthread+0xcd/0xf0
[21001.282010] [<ffffffff81075f20>] ? kthread_worker_fn+0x1d0/0x1d0
[21001.282013] [<ffffffff81725aec>] ret_from_fork+0x7c/0xb0
[21001.282014] [<ffffffff81075f20>] ? kthread_worker_fn+0x1d0/0x1d0
Rack #0/Slot #7 [AMD/ATI] Cayman XT [Radeon HD 6970]:
[ 481.091132] INFO: task Xorg:3459 blocked for more than 120 seconds.
[ 481.103594] Not tainted 3.18.9-rt4 #28
[ 481.112101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 481.127746] Xorg D ffff88041e68ab40 0 3459 3452 0x10400004
[ 481.141882] ffff880413da38e8 0000000000000002 ffff88041e60c460 ffff8800c3ea3380
[ 481.141882] ffff880413da38d8 ffffffff8108603f 000000000000c5a8 000000000000c5c8
[ 481.141883] ffffffff81c19460 ffff8800c3ea3380 000000000000000c ffff8800c3ea3380
[ 481.186228] Call Trace:
[ 481.191114] [<ffffffff8108603f>] ? queue_delayed_work_on+0xff/0x110
[ 481.191118] [<ffffffff816b50f4>] schedule+0x34/0xa0
[ 481.191119] [<ffffffff816b72f4>] schedule_timeout+0x204/0x270
[ 481.191148] [<ffffffffa00cd826>] ? radeon_fence_process+0x16/0x40 [radeon]
[ 481.191157] [<ffffffffa00cd894>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[ 481.191165] [<ffffffffa00cdb07>] radeon_fence_wait_seq_timeout.constprop.7+0x227/0x330 [radeon]
[ 481.191167] [<ffffffff810ac310>] ? prepare_to_wait_event+0x110/0x110
[ 481.191175] [<ffffffffa00cdf67>] radeon_fence_wait_any+0x57/0x70 [radeon]
[ 481.191191] [<ffffffffa01432af>] radeon_sa_bo_new+0x2cf/0x4e0 [radeon]
[ 481.191194] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20
[ 481.191207] [<ffffffffa019d3e7>] radeon_ib_get+0x37/0xf0 [radeon]
[ 481.191218] [<ffffffffa00e997d>] radeon_cs_ioctl+0x22d/0x820 [radeon]
[ 481.191219] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20
[ 481.191228] [<ffffffffa001bc04>] drm_ioctl+0x1a4/0x630 [drm]
[ 481.191231] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20
[ 481.191234] [<ffffffff8106e8da>] ? unpin_current_cpu+0x1a/0x70
[ 481.191237] [<ffffffff81097440>] ? migrate_enable+0xb0/0x1b0
[ 481.191243] [<ffffffffa00b004b>] radeon_drm_ioctl+0x4b/0x80 [radeon]
[ 481.191245] [<ffffffff811c7040>] do_vfs_ioctl+0x2e0/0x4d0
[ 481.191247] [<ffffffff811d1aa2>] ? __fget+0x72/0xa0
[ 481.191248] [<ffffffff811c72b1>] SyS_ioctl+0x81/0xa0
[ 481.191250] [<ffffffff816b8cb2>] tracesys_phase2+0xd4/0xd9
Rack #0/Slot #8 [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X]:
[19579.220958] INFO: task Xorg.bin:16569 blocked for more than 120 seconds.
[19579.228008] Not tainted 3.18.9-rt4 #25
[19579.232491] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19579.240719] Xorg.bin D ffffffff81716c70 0 16569 16215 0x10400080
[19579.248076] ffff8805f78bf818 0000000000000002 ffff8805f78bf7f8 0000000000000002
[19579.248077] 000000000000dc08 ffff880626a0dc08 000000000000dbe8 000000000000dc08
[19579.248078] ffffffff81c1b500 ffff880606c614a0 ffff880614f7c000 ffff880606c614a0
[19579.271393] Call Trace:
[19579.273964] [<ffffffff81713da4>] schedule+0x34/0xa0
[19579.273965] [<ffffffff817162dc>] schedule_timeout+0x1fc/0x280
[19579.273990] [<ffffffffa00c7aa6>] ? radeon_fence_process+0x16/0x40 [radeon]
[19579.273999] [<ffffffffa00c7b14>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[19579.274008] [<ffffffffa00c7e47>] radeon_fence_wait_seq_timeout.constprop.8+0x2e7/0x340 [radeon]
[19579.274011] [<ffffffff810cf310>] ? __wake_up_sync+0x20/0x20
[19579.274020] [<ffffffffa00c8237>] radeon_fence_wait_any+0x57/0x70 [radeon]
[19579.274035] [<ffffffffa013e2cf>] radeon_sa_bo_new+0x2af/0x4b0 [radeon]
[19579.274049] [<ffffffffa0196077>] radeon_ib_get+0x37/0xe0 [radeon]
[19579.274062] [<ffffffffa0194bbc>] radeon_vm_update_page_directory+0x6c/0x290 [radeon]
[19579.274078] [<ffffffffa0144916>] ? si_ib_parse+0x396/0x430 [radeon]
[19579.274089] [<ffffffffa00e44ab>] radeon_cs_ioctl+0x35b/0x850 [radeon]
[19579.274098] [<ffffffffa0005bc7>] drm_ioctl+0x197/0x670 [drm]
[19579.274102] [<ffffffff81373337>] ? debug_smp_processor_id+0x17/0x20
[19579.274103] [<ffffffff8108ec2a>] ? unpin_current_cpu+0x1a/0x80
[19579.274105] [<ffffffff810b85c4>] ? migrate_enable+0x84/0x160
[19579.274111] [<ffffffffa00aa04c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[19579.274114] [<ffffffff811f8ae8>] do_vfs_ioctl+0x2c8/0x4c0
[19579.274116] [<ffffffff81203902>] ? __fget+0x72/0xb0
[19579.274117] [<ffffffff811f8d61>] SyS_ioctl+0x81/0xa0
[19579.274118] [<ffffffff817179de>] tracesys_phase2+0xd4/0xd9
Rack #4/Slot #1 Chipset: "KAVERI" (ChipID = 0x130c):
[21721.088164] INFO: task Xorg:7436 blocked for more than 120 seconds.
[21721.100625] Not tainted 3.18.9-rt4 #26
[21721.109150] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[21721.124795] Xorg D ffffffff816b7f88 0 7436 7430 0x10400004
[21721.138897] ffff880409f278e8 0000000000000002 ffff88041e90c460 000000000000c5c8
[21721.138898] ffff88041e90c5c8 0000000000000006 000000000000c5a8 000000000000c5c8
[21721.138899] ffff8804177299c0 ffff880409f299c0 000000000000000c ffff880409f299c0
[21721.183222] Call Trace:
[21721.188110] [<ffffffff816b50f4>] schedule+0x34/0xa0
[21721.188112] [<ffffffff816b72f4>] schedule_timeout+0x204/0x270
[21721.188143] [<ffffffffa00cd826>] ? radeon_fence_process+0x16/0x40 [radeon]
[21721.188153] [<ffffffffa00cd894>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[21721.188163] [<ffffffffa00cdb07>] radeon_fence_wait_seq_timeout.constprop.7+0x227/0x330 [radeon]
[21721.188165] [<ffffffff810ac310>] ? prepare_to_wait_event+0x110/0x110
[21721.188176] [<ffffffffa00cdf67>] radeon_fence_wait_any+0x57/0x70 [radeon]
[21721.188193] [<ffffffffa01432af>] radeon_sa_bo_new+0x2cf/0x4e0 [radeon]
[21721.188196] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20
[21721.188210] [<ffffffffa019d3e7>] radeon_ib_get+0x37/0xf0 [radeon]
[21721.188223] [<ffffffffa00e997d>] radeon_cs_ioctl+0x22d/0x820 [radeon]
[21721.188233] [<ffffffffa001bc04>] drm_ioctl+0x1a4/0x630 [drm]
[21721.188236] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20
[21721.188238] [<ffffffff8106e8da>] ? unpin_current_cpu+0x1a/0x70
[21721.188240] [<ffffffff81097440>] ? migrate_enable+0xb0/0x1b0
[21721.188248] [<ffffffffa00b004b>] radeon_drm_ioctl+0x4b/0x80 [radeon]
[21721.188250] [<ffffffff811c7040>] do_vfs_ioctl+0x2e0/0x4d0
[21721.188252] [<ffffffff811d1aa2>] ? __fget+0x72/0xa0
[21721.188254] [<ffffffff811c72b1>] SyS_ioctl+0x81/0xa0
[21721.188255] [<ffffffff816b8cb2>] tracesys_phase2+0xd4/0xd9
Rack #c/Slot #5 Chipset: "ATI Radeon HD 5800 Series" (ChipID = 0x6898):
[19711.965733] INFO: task kworker/u24:13:197 blocked for more than 120 seconds.
[19711.965737] Not tainted 3.18.9-rt4 #26
[19711.965749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19711.965751] kworker/u24:13 D ffff88032901a560 0 197 2 0x10000000
[19711.965784] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
[19711.965788] ffff880328b3bc58 0000000000000002 000000000001d65e 0000000000000000
[19711.965789] ffff880328b3bfd8 000000000008a5c0 ffff880328b3bc78 ffffffffa0482589
[19711.965791] ffff88032fa81920 ffff880328b30000 ffff88032c63d5f0 ffff880328b30000
[19711.965794] Call Trace:
[19711.965813] [<ffffffffa0482589>] ? radeon_fence_activity+0x160/0x172 [radeon]
[19711.965818] [<ffffffff814e0d38>] schedule+0x7e/0x90
[19711.965820] [<ffffffff814e2143>] schedule_timeout+0x25/0xd3
[19711.965835] [<ffffffffa0482ba3>] ? radeon_fence_any_seq_signaled+0x52/0x69 [radeon]
[19711.965850] [<ffffffffa0482d8d>] radeon_fence_wait_seq_timeout.constprop.6+0x1d3/0x2be [radeon]
[19711.965853] [<ffffffff81066166>] ? __wake_up_sync+0x12/0x12
[19711.965869] [<ffffffffa04830e1>] radeon_fence_wait+0x92/0xaa [radeon]
[19711.965886] [<ffffffffa048dae1>] radeon_flip_work_func+0x11e/0x14f [radeon]
[19711.965889] [<ffffffff8104cac1>] process_one_work+0x16e/0x2ae
[19711.965891] [<ffffffff8104d0fe>] worker_thread+0x1df/0x2ca
[19711.965892] [<ffffffff8104cf1f>] ? cancel_delayed_work+0x91/0x91
[19711.965894] [<ffffffff8104cf1f>] ? cancel_delayed_work+0x91/0x91
[19711.965895] [<ffffffff81051324>] kthread+0xae/0xb6
[19711.965897] [<ffffffff81051276>] ? __kthread_parkme+0x61/0x61
[19711.965899] [<ffffffff814e322c>] ret_from_fork+0x7c/0xb0
[19711.965901] [<ffffffff81051276>] ? __kthread_parkme+0x61/0x61
[19711.965916] INFO: task compiz:2626 blocked for more than 120 seconds.
[19711.965929] Not tainted 3.18.9-rt4 #26
[19711.965931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19711.965932] compiz D ffff88032901a560 0 2626 2186 0x30020000
[19711.965937] ffff8800b8ee7bc8 0000000000200002 ffff88032bb9e480 0000000000000000
[19711.965942] ffff8800b8ee7fd8 000000000008a5c0 0000000000000000 ffff8800b8ee7ee0
[19711.965951] ffffffff81a25450 ffff88032bb9e480 ffff8800b8ee7c28 ffff88032bb9e480
[19711.965954] Call Trace:
[19711.965958] [<ffffffff814e0d38>] schedule+0x7e/0x90
[19711.965959] [<ffffffff814e1ab7>] __rt_mutex_slowlock+0x9f/0xdc
[19711.965961] [<ffffffff814e1f7b>] rt_mutex_slowlock+0x123/0x236
[19711.965964] [<ffffffff8106b234>] rt_mutex_fastlock.constprop.24+0x2e/0x30
[19711.965965] [<ffffffff814e2103>] rt_mutex_lock+0x13/0x15
[19711.965967] [<ffffffff8106b613>] __rt_down_read.isra.1+0x29/0x30
[19711.965968] [<ffffffff8106b628>] rt_down_read+0xe/0x10
[19711.965988] [<ffffffffa04942ff>] radeon_gem_create_ioctl+0x2c/0xc6 [radeon]
[19711.965990] [<ffffffff812004f9>] ? avc_has_perm_noaudit+0xf7/0x109
[19711.966004] [<ffffffffa010bc26>] drm_ioctl+0x380/0x3f8 [drm]
[19711.966025] [<ffffffffa04942d3>] ? radeon_gem_pwrite_ioctl+0x28/0x28 [radeon]
[19711.966027] [<ffffffff81200ca6>] ? inode_has_perm+0x2f/0x34
[19711.966029] [<ffffffff81200e58>] ? file_has_perm+0x5d/0x81
[19711.966040] [<ffffffffa046e00e>] radeon_drm_ioctl+0xe/0x10 [radeon]
[19711.966067] [<ffffffffa0518b9c>] radeon_kms_compat_ioctl+0x1b/0x1f [radeon]
[19711.966070] [<ffffffff8115e692>] compat_SyS_ioctl+0x1c3/0xf6e
[19711.966072] [<ffffffff8100e7b1>] ? syscall_trace_enter+0x52/0x57
[19711.966074] [<ffffffff814e5679>] ia32_do_call+0x13/0x13
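When comparing traces like these across systems, it can help to list only the reliable frames that belong to the radeon module. Entries prefixed with "?" are addresses that were found on the stack but are not confirmed as part of the call chain, so they are skipped here. This is only a sketch for illustration: the function name radeon_frames is hypothetical, and the log is assumed to arrive on stdin with one message per line.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): print the symbol names
# of reliable call-trace frames that belong to the radeon module.
# Assumes one kernel message per line on stdin.
radeon_frames() {
    # 1. keep lines tagged with the module name "[radeon]"
    # 2. drop unreliable frames, which the kernel marks with "? "
    # 3. print only the symbol in front of "+0x<offset>"; lines without
    #    an offset (e.g. the "Workqueue:" line) are not printed at all
    grep -F '[radeon]' |
        grep -vF '] ? ' |
        sed -n 's/.*] \([A-Za-z0-9_.]*\)+0x.*/\1/p'
}
```

Applied to the reports above, the kworker tasks wait in radeon_fence_wait called from radeon_flip_work_func, while the Xorg tasks wait in radeon_fence_wait_any reached through radeon_sa_bo_new and radeon_ib_get from radeon_cs_ioctl.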
On 16.03.2015 23:52, Carsten Emde wrote:
Hi Michel,
[..] The most striking problem of kernel 3.18.9-rt4 affects all systems that are equipped with Radeon graphics (irrespective of whether PCIe cards or APUs with on-chip graphics). They suffer from a hanging radeon driver. The block occurs when accelerated graphics load is created by x11perf or gltestperf. Sometimes only the graphics are frozen while ssh login is still possible, sometimes the entire box is no longer accessible at all. In any case, a reboot is needed to recover from this situation.
Here is a selection of kernel messages:
[...] The commits from http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f95706...
to http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd...
and http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b66101...
might help for this.
Thanks a lot. I have applied these patches to a number of systems:
# quilt applied | tail -7
patches/drm-radeon-do-a-posting-read-in-r100_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-rs600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-r600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-si_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-cik_set_irq.patch
patches/drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch
The graphics boards still crash and freeze the screen, but in contrast to the earlier situation the systems remain accessible, and the X Window server can be restarted after the offending programs are removed. The crashes were reliably triggered by
- gltestperf or
- x11perf -repeat 3 -subs 25 -time 2 -rect10
but the crashes also occur several times per day during normal work such as browsing the Internet or writing a text document. If you wish me to provide additional diagnostic information such as running test programs while the graphics boards are unresponsive, I certainly can do that.
Does it also happen with a kernel built from a current drm-fixes tree? http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes
I might have missed other needed fixes.
Rack #0/Slot #3 [AMD/ATI] RV730 XT [Radeon HD 4670]:
[21001.244036] INFO: task kworker/u24:6:267 blocked for more than 120 seconds. [21001.257773] Not tainted 3.18.9-rt4 #27 [21001.266284] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21001.281911] kworker/u24:6 D ffff88081ed8b340 0 267 2 0x10000000 [21001.281937] Workqueue: radeon-crtc radeon_flip_work_func [radeon] [21001.281940] ffff880805d2fbe8 0000000000000046 ffff88081ed0c700 0000000000000000 [21001.281941] 0000000000009000 000000000000c920 ffff8808112fb420 ffff880035254e30 [21001.281943] 000000000000c280 000001000000c280 0000000000000003 ffff880035254e30 [21001.281945] Call Trace: [21001.281950] [<ffffffff81721ce4>] schedule+0x34/0xa0 [21001.281953] [<ffffffff8172425c>] schedule_timeout+0x22c/0x2d0 [21001.281962] [<ffffffffa0439a06>] ? radeon_fence_process+0x16/0x40 [radeon] [21001.281971] [<ffffffffa0439a74>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon] [21001.281979] [<ffffffffa0439da7>] radeon_fence_wait_seq_timeout.constprop.8+0x2e7/0x340 [radeon] [21001.281982] [<ffffffff81098be0>] ? __wake_up_sync+0x20/0x20 [21001.281991] [<ffffffffa043a106>] radeon_fence_wait+0x86/0xc0 [radeon] [21001.282000] [<ffffffffa0447eec>] radeon_flip_work_func+0x15c/0x190 [radeon] [21001.282003] [<ffffffff810709c4>] process_one_work+0x154/0x450 [21001.282004] [<ffffffff81070fbb>] worker_thread+0x6b/0x4d0 [21001.282006] [<ffffffff81070f50>] ? rescuer_thread+0x290/0x290 [21001.282007] [<ffffffff81070f50>] ? rescuer_thread+0x290/0x290 [21001.282009] [<ffffffff81075fed>] kthread+0xcd/0xf0 [21001.282010] [<ffffffff81075f20>] ? kthread_worker_fn+0x1d0/0x1d0 [21001.282013] [<ffffffff81725aec>] ret_from_fork+0x7c/0xb0 [21001.282014] [<ffffffff81075f20>] ? kthread_worker_fn+0x1d0/0x1d0
Rack #0/Slot #7 [AMD/ATI] Cayman XT [Radeon HD 6970]
[ 481.091132] INFO: task Xorg:3459 blocked for more than 120 seconds. [ 481.103594] Not tainted 3.18.9-rt4 #28 [ 481.112101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 481.127746] Xorg D ffff88041e68ab40 0 3459 3452 0x10400004 [ 481.141882] ffff880413da38e8 0000000000000002 ffff88041e60c460 ffff8800c3ea3380 [ 481.141882] ffff880413da38d8 ffffffff8108603f 000000000000c5a8 000000000000c5c8 [ 481.141883] ffffffff81c19460 ffff8800c3ea3380 000000000000000c ffff8800c3ea3380 [ 481.186228] Call Trace: [ 481.191114] [<ffffffff8108603f>] ? queue_delayed_work_on+0xff/0x110 [ 481.191118] [<ffffffff816b50f4>] schedule+0x34/0xa0 [ 481.191119] [<ffffffff816b72f4>] schedule_timeout+0x204/0x270 [ 481.191148] [<ffffffffa00cd826>] ? radeon_fence_process+0x16/0x40 [radeon] [ 481.191157] [<ffffffffa00cd894>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon] [ 481.191165] [<ffffffffa00cdb07>] radeon_fence_wait_seq_timeout.constprop.7+0x227/0x330 [radeon] [ 481.191167] [<ffffffff810ac310>] ? prepare_to_wait_event+0x110/0x110 [ 481.191175] [<ffffffffa00cdf67>] radeon_fence_wait_any+0x57/0x70 [radeon] [ 481.191191] [<ffffffffa01432af>] radeon_sa_bo_new+0x2cf/0x4e0 [radeon] [ 481.191194] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20 [ 481.191207] [<ffffffffa019d3e7>] radeon_ib_get+0x37/0xf0 [radeon] [ 481.191218] [<ffffffffa00e997d>] radeon_cs_ioctl+0x22d/0x820 [radeon] [ 481.191219] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20 [ 481.191228] [<ffffffffa001bc04>] drm_ioctl+0x1a4/0x630 [drm] [ 481.191231] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20 [ 481.191234] [<ffffffff8106e8da>] ? unpin_current_cpu+0x1a/0x70 [ 481.191237] [<ffffffff81097440>] ? migrate_enable+0xb0/0x1b0 [ 481.191243] [<ffffffffa00b004b>] radeon_drm_ioctl+0x4b/0x80 [radeon] [ 481.191245] [<ffffffff811c7040>] do_vfs_ioctl+0x2e0/0x4d0 [ 481.191247] [<ffffffff811d1aa2>] ? 
__fget+0x72/0xa0 [ 481.191248] [<ffffffff811c72b1>] SyS_ioctl+0x81/0xa0 [ 481.191250] [<ffffffff816b8cb2>] tracesys_phase2+0xd4/0xd9
Rack #0/Slot #8 [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X]:
[19579.220958] INFO: task Xorg.bin:16569 blocked for more than 120 seconds. [19579.228008] Not tainted 3.18.9-rt4 #25 [19579.232491] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [19579.240719] Xorg.bin D ffffffff81716c70 0 16569 16215 0x10400080 [19579.248076] ffff8805f78bf818 0000000000000002 ffff8805f78bf7f8 0000000000000002 [19579.248077] 000000000000dc08 ffff880626a0dc08 000000000000dbe8 000000000000dc08 [19579.248078] ffffffff81c1b500 ffff880606c614a0 ffff880614f7c000 ffff880606c614a0 [19579.271393] Call Trace: [19579.273964] [<ffffffff81713da4>] schedule+0x34/0xa0 [19579.273965] [<ffffffff817162dc>] schedule_timeout+0x1fc/0x280 [19579.273990] [<ffffffffa00c7aa6>] ? radeon_fence_process+0x16/0x40 [radeon] [19579.273999] [<ffffffffa00c7b14>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon] [19579.274008] [<ffffffffa00c7e47>] radeon_fence_wait_seq_timeout.constprop.8+0x2e7/0x340 [radeon] [19579.274011] [<ffffffff810cf310>] ? __wake_up_sync+0x20/0x20 [19579.274020] [<ffffffffa00c8237>] radeon_fence_wait_any+0x57/0x70 [radeon] [19579.274035] [<ffffffffa013e2cf>] radeon_sa_bo_new+0x2af/0x4b0 [radeon] [19579.274049] [<ffffffffa0196077>] radeon_ib_get+0x37/0xe0 [radeon] [19579.274062] [<ffffffffa0194bbc>] radeon_vm_update_page_directory+0x6c/0x290 [radeon] [19579.274078] [<ffffffffa0144916>] ? si_ib_parse+0x396/0x430 [radeon] [19579.274089] [<ffffffffa00e44ab>] radeon_cs_ioctl+0x35b/0x850 [radeon] [19579.274098] [<ffffffffa0005bc7>] drm_ioctl+0x197/0x670 [drm] [19579.274102] [<ffffffff81373337>] ? debug_smp_processor_id+0x17/0x20 [19579.274103] [<ffffffff8108ec2a>] ? unpin_current_cpu+0x1a/0x80 [19579.274105] [<ffffffff810b85c4>] ? migrate_enable+0x84/0x160 [19579.274111] [<ffffffffa00aa04c>] radeon_drm_ioctl+0x4c/0x80 [radeon] [19579.274114] [<ffffffff811f8ae8>] do_vfs_ioctl+0x2c8/0x4c0 [19579.274116] [<ffffffff81203902>] ? 
__fget+0x72/0xb0 [19579.274117] [<ffffffff811f8d61>] SyS_ioctl+0x81/0xa0 [19579.274118] [<ffffffff817179de>] tracesys_phase2+0xd4/0xd9
Rack #4/Slot #1 Chipset: "KAVERI" (ChipID = 0x130c):
[21721.088164] INFO: task Xorg:7436 blocked for more than 120 seconds. [21721.100625] Not tainted 3.18.9-rt4 #26 [21721.109150] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21721.124795] Xorg D ffffffff816b7f88 0 7436 7430 0x10400004 [21721.138897] ffff880409f278e8 0000000000000002 ffff88041e90c460 000000000000c5c8 [21721.138898] ffff88041e90c5c8 0000000000000006 000000000000c5a8 000000000000c5c8 [21721.138899] ffff8804177299c0 ffff880409f299c0 000000000000000c ffff880409f299c0 [21721.183222] Call Trace: [21721.188110] [<ffffffff816b50f4>] schedule+0x34/0xa0 [21721.188112] [<ffffffff816b72f4>] schedule_timeout+0x204/0x270 [21721.188143] [<ffffffffa00cd826>] ? radeon_fence_process+0x16/0x40 [radeon] [21721.188153] [<ffffffffa00cd894>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon] [21721.188163] [<ffffffffa00cdb07>] radeon_fence_wait_seq_timeout.constprop.7+0x227/0x330 [radeon] [21721.188165] [<ffffffff810ac310>] ? prepare_to_wait_event+0x110/0x110 [21721.188176] [<ffffffffa00cdf67>] radeon_fence_wait_any+0x57/0x70 [radeon] [21721.188193] [<ffffffffa01432af>] radeon_sa_bo_new+0x2cf/0x4e0 [radeon] [21721.188196] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20 [21721.188210] [<ffffffffa019d3e7>] radeon_ib_get+0x37/0xf0 [radeon] [21721.188223] [<ffffffffa00e997d>] radeon_cs_ioctl+0x22d/0x820 [radeon] [21721.188233] [<ffffffffa001bc04>] drm_ioctl+0x1a4/0x630 [drm] [21721.188236] [<ffffffff8133c2a7>] ? debug_smp_processor_id+0x17/0x20 [21721.188238] [<ffffffff8106e8da>] ? unpin_current_cpu+0x1a/0x70 [21721.188240] [<ffffffff81097440>] ? migrate_enable+0xb0/0x1b0 [21721.188248] [<ffffffffa00b004b>] radeon_drm_ioctl+0x4b/0x80 [radeon] [21721.188250] [<ffffffff811c7040>] do_vfs_ioctl+0x2e0/0x4d0 [21721.188252] [<ffffffff811d1aa2>] ? __fget+0x72/0xa0 [21721.188254] [<ffffffff811c72b1>] SyS_ioctl+0x81/0xa0 [21721.188255] [<ffffffff816b8cb2>] tracesys_phase2+0xd4/0xd9
Rack #c/Slot #5 Chipset: "ATI Radeon HD 5800 Series" (ChipID = 0x6898)
[19711.965733] INFO: task kworker/u24:13:197 blocked for more than 120 seconds.
[19711.965737] Not tainted 3.18.9-rt4 #26
[19711.965749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19711.965751] kworker/u24:13 D ffff88032901a560 0 197 2 0x10000000
[19711.965784] Workqueue: radeon-crtc radeon_flip_work_func [radeon]
[19711.965788] ffff880328b3bc58 0000000000000002 000000000001d65e 0000000000000000
[19711.965789] ffff880328b3bfd8 000000000008a5c0 ffff880328b3bc78 ffffffffa0482589
[19711.965791] ffff88032fa81920 ffff880328b30000 ffff88032c63d5f0 ffff880328b30000
[19711.965794] Call Trace:
[19711.965813] [<ffffffffa0482589>] ? radeon_fence_activity+0x160/0x172 [radeon]
[19711.965818] [<ffffffff814e0d38>] schedule+0x7e/0x90
[19711.965820] [<ffffffff814e2143>] schedule_timeout+0x25/0xd3
[19711.965835] [<ffffffffa0482ba3>] ? radeon_fence_any_seq_signaled+0x52/0x69 [radeon]
[19711.965850] [<ffffffffa0482d8d>] radeon_fence_wait_seq_timeout.constprop.6+0x1d3/0x2be [radeon]
[19711.965853] [<ffffffff81066166>] ? __wake_up_sync+0x12/0x12
[19711.965869] [<ffffffffa04830e1>] radeon_fence_wait+0x92/0xaa [radeon]
[19711.965886] [<ffffffffa048dae1>] radeon_flip_work_func+0x11e/0x14f [radeon]
[19711.965889] [<ffffffff8104cac1>] process_one_work+0x16e/0x2ae
[19711.965891] [<ffffffff8104d0fe>] worker_thread+0x1df/0x2ca
[19711.965892] [<ffffffff8104cf1f>] ? cancel_delayed_work+0x91/0x91
[19711.965894] [<ffffffff8104cf1f>] ? cancel_delayed_work+0x91/0x91
[19711.965895] [<ffffffff81051324>] kthread+0xae/0xb6
[19711.965897] [<ffffffff81051276>] ? __kthread_parkme+0x61/0x61
[19711.965899] [<ffffffff814e322c>] ret_from_fork+0x7c/0xb0
[19711.965901] [<ffffffff81051276>] ? __kthread_parkme+0x61/0x61
[19711.965916] INFO: task compiz:2626 blocked for more than 120 seconds.
[19711.965929] Not tainted 3.18.9-rt4 #26
[19711.965931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19711.965932] compiz D ffff88032901a560 0 2626 2186 0x30020000
[19711.965937] ffff8800b8ee7bc8 0000000000200002 ffff88032bb9e480 0000000000000000
[19711.965942] ffff8800b8ee7fd8 000000000008a5c0 0000000000000000 ffff8800b8ee7ee0
[19711.965951] ffffffff81a25450 ffff88032bb9e480 ffff8800b8ee7c28 ffff88032bb9e480
[19711.965954] Call Trace:
[19711.965958] [<ffffffff814e0d38>] schedule+0x7e/0x90
[19711.965959] [<ffffffff814e1ab7>] __rt_mutex_slowlock+0x9f/0xdc
[19711.965961] [<ffffffff814e1f7b>] rt_mutex_slowlock+0x123/0x236
[19711.965964] [<ffffffff8106b234>] rt_mutex_fastlock.constprop.24+0x2e/0x30
[19711.965965] [<ffffffff814e2103>] rt_mutex_lock+0x13/0x15
[19711.965967] [<ffffffff8106b613>] __rt_down_read.isra.1+0x29/0x30
[19711.965968] [<ffffffff8106b628>] rt_down_read+0xe/0x10
[19711.965988] [<ffffffffa04942ff>] radeon_gem_create_ioctl+0x2c/0xc6 [radeon]
[19711.965990] [<ffffffff812004f9>] ? avc_has_perm_noaudit+0xf7/0x109
[19711.966004] [<ffffffffa010bc26>] drm_ioctl+0x380/0x3f8 [drm]
[19711.966025] [<ffffffffa04942d3>] ? radeon_gem_pwrite_ioctl+0x28/0x28 [radeon]
[19711.966027] [<ffffffff81200ca6>] ? inode_has_perm+0x2f/0x34
[19711.966029] [<ffffffff81200e58>] ? file_has_perm+0x5d/0x81
[19711.966040] [<ffffffffa046e00e>] radeon_drm_ioctl+0xe/0x10 [radeon]
[19711.966067] [<ffffffffa0518b9c>] radeon_kms_compat_ioctl+0x1b/0x1f [radeon]
[19711.966070] [<ffffffff8115e692>] compat_SyS_ioctl+0x1c3/0xf6e
[19711.966072] [<ffffffff8100e7b1>] ? syscall_trace_enter+0x52/0x57
[19711.966074] [<ffffffff814e5679>] ia32_do_call+0x13/0x13
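Hung-task reports like the ones above are easier to compare across machines when reduced to the task name and the reliable call-trace frames. The following is only a minimal sketch of such a reduction; the function name `summarize_hung_tasks`, the regular expressions, and the returned dictionary layout are assumptions made for illustration and are not part of any kernel tool. Frames marked with "?" (unreliable stack entries) are skipped.

```python
import re

# Header of a hung-task report, e.g.
# "INFO: task compiz:2626 blocked for more than 120 seconds."
# Assumption: the task name contains no whitespace.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than \d+ seconds"
)

# One call-trace frame, e.g.
# "[<ffffffffa04830e1>] radeon_fence_wait+0x92/0xaa [radeon]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)

# Excerpt of the compiz report shown above.
SAMPLE = """\
[19711.965916] INFO: task compiz:2626 blocked for more than 120 seconds.
[19711.965954] Call Trace:
[19711.965958] [<ffffffff814e0d38>] schedule+0x7e/0x90
[19711.965968] [<ffffffff8106b628>] rt_down_read+0xe/0x10
[19711.965988] [<ffffffffa04942ff>] radeon_gem_create_ioctl+0x2c/0xc6 [radeon]
[19711.965990] [<ffffffff812004f9>] ? avc_has_perm_noaudit+0xf7/0x109
"""


def summarize_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text."""
    headers = list(HUNG_RE.finditer(log_text))
    reports = []
    for i, header in enumerate(headers):
        # A report extends up to the next header (or the end of the log).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        body = log_text[header.end():end]
        frames = [
            (f.group("sym"), f.group("mod"))
            for f in FRAME_RE.finditer(body)
            if not f.group("guess")  # drop "?" frames
        ]
        reports.append(
            {"comm": header.group("comm"), "pid": int(header.group("pid")), "frames": frames}
        )
    return reports


if __name__ == "__main__":
    for report in summarize_hung_tasks(SAMPLE):
        print(report["comm"], report["pid"], report["frames"])
```

Because the expressions do not rely on line breaks, the same function should also work on log text whose newlines were lost in transit.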
Hi Michel,
[..] The most striking problem of kernel 3.18.9-rt4 affects all systems that are equipped with Radeon graphics (irrespective of whether PCIe cards or APUs with on-chip graphics). They suffer from a hanging radeon driver. The block occurs when accelerated graphics load is created by x11perf or gltestperf. Sometimes only the graphics are frozen while ssh login is still possible, sometimes the entire box is no longer accessible at all. In any case, a reboot is needed to recover from this situation.
Here is a selection of kernel messages:
[...] The commits from http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=f95706...
to http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=cffefd...
and http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=b66101...
might help for this.
Thanks a lot. I have applied these patches to a number of systems:
# quilt applied | tail -7
patches/drm-radeon-do-a-posting-read-in-r100_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-rs600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-r600_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-si_set_irq.patch
patches/drm-radeon-do-a-posting-read-in-cik_set_irq.patch
patches/drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch
The graphics boards still crash and freeze the screen, but in contrast to the earlier situation the systems remain accessible, and the X Window server can be restarted after the offending programs are removed. The crashes were reliably triggered by
- gltestperf or
- x11perf -repeat 3 -subs 25 -time 2 -rect10
This is not entirely correct, since gltestperf does not reliably crash the graphics controller. However, "x11perf -repeat 3 -subs 25 -time 2 -rect10" reliably triggers the crash.
but the crashes also occur several times per day during normal work such as browsing the Internet or writing a text document. If you wish me to provide additional diagnostic information such as running test programs while the graphic boards are unresponsive, I certainly can do that.
Does it also happen with a kernel built from a current drm-fixes tree? http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-fixes
No. Apparently, you need full preemption to expose the problem.
The following list shows whether the command "x11perf -repeat 3 -subs 25 -time 2 -rect10" freezes the Radeon board under test (Radeon HD 7970 XFS / R9 280X):
linux-3.12.33-rt47               no
linux-3.14.34-rt32               no
linux-3.14.34-drm-3.16.7-rt32*   no
linux-3.18.7-rt1                 YES
linux-3.18.9-rt4                 YES
linux-3.18.9-rt5                 YES
linux-3.18.9-drm-3.16.7-rt5**    no
linux-4.0.0-rc4                  no
linux-drm-fixes                  no
*DRM subsystem backported from linux-3.16.7 to linux-3.14.34-rt32.
**DRM subsystem ported from linux-3.16.7 to linux-3.18.9-rt5.
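The reasoning behind these results can be written down as a small consistency check. The sketch below encodes the outcomes listed above and tests one candidate rule: a freeze requires both the RT patch and the DRM code of the 3.18 series. The helper names (`classify`, `hypothesis`, `check`) are made up for this illustration, and the assumption that a kernel without an explicit "-drm-X.Y.Z" part carries the DRM code of its base version is mine.

```python
import re

# Outcomes as reported in the list above: True means x11perf froze the board.
RESULTS = {
    "linux-3.12.33-rt47": False,
    "linux-3.14.34-rt32": False,
    "linux-3.14.34-drm-3.16.7-rt32": False,
    "linux-3.18.7-rt1": True,
    "linux-3.18.9-rt4": True,
    "linux-3.18.9-rt5": True,
    "linux-3.18.9-drm-3.16.7-rt5": False,
    "linux-4.0.0-rc4": False,
    "linux-drm-fixes": False,
}


def classify(name):
    """Split a kernel name into (has_rt_patch, drm_series).

    Assumption: without an explicit "-drm-X.Y.Z" part, the DRM code is the
    one shipped with the base kernel.
    """
    has_rt = re.search(r"-rt\d+$", name) is not None
    ported = re.search(r"-drm-(\d+\.\d+)", name)
    base = re.match(r"linux-(\d+\.\d+)", name)
    if ported:
        drm = ported.group(1)
    elif base:
        drm = base.group(1)
    else:
        drm = name[len("linux-"):]  # e.g. "drm-fixes"
    return has_rt, drm


def hypothesis(has_rt, drm):
    """Candidate rule: a freeze needs both the RT patch and the 3.18 DRM code."""
    return has_rt and drm == "3.18"


def check(results):
    """Return (rule matches every row, combinations of interest never tested)."""
    tested = {classify(name) for name in results}
    matches = all(hypothesis(*classify(n)) == froze for n, froze in results.items())
    gaps = [combo for combo in [(False, "3.18")] if combo not in tested]
    return matches, gaps


if __name__ == "__main__":
    print(check(RESULTS))
```

The rule agrees with all nine rows, but the check also reports that no kernel with the 3.18 DRM code and without the RT patch appears in the list, so the role of full preemption is inferred rather than measured directly.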
More observations: If full function tracing is enabled (which makes the system about five times slower), the graphics controller no longer freezes. With partial function tracing such as "echo *drm* >set_ftrace_filter", the controller still freezes. The trace then contains vblank interrupt processing only, ioctls are no longer executed.
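For anyone who wants to repeat the tracing experiment, the two setups (full and partial function tracing) can be scripted along the following lines. This is a hedged sketch: `start_function_trace` and `read_trace` are hypothetical helper names, the tracing directory is passed in because its mount point varies (commonly /sys/kernel/debug/tracing), and root privileges are needed on a real system. The control files `set_ftrace_filter`, `current_tracer`, `tracing_on` and `trace` are the standard ftrace ones.

```python
from pathlib import Path


def start_function_trace(tracing_dir, pattern=None):
    """Enable the ftrace "function" tracer below tracing_dir.

    pattern=None clears the filter (every function is traced); a glob such
    as "*drm*" restricts tracing to matching function names.
    """
    root = Path(tracing_dir)
    # Writing an empty line clears any previously set filter.
    (root / "set_ftrace_filter").write_text((pattern or "") + "\n")
    (root / "current_tracer").write_text("function\n")
    (root / "tracing_on").write_text("1\n")


def read_trace(tracing_dir):
    """Return the current contents of the trace buffer."""
    return (Path(tracing_dir) / "trace").read_text()


if __name__ == "__main__":
    # Partial tracing as described above; run as root on the test system.
    start_function_trace("/sys/kernel/debug/tracing", "*drm*")
```

Calling `start_function_trace(path)` without a pattern corresponds to the full-tracing case, in which the freeze was no longer observed.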
This is the location where the driver hangs:
[25104.509258] INFO: task Xorg.bin:16591 blocked for more than 120 seconds.
[25104.516322] Not tainted 3.18.9-rt5 #2
[25104.520715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25104.528853] Xorg.bin D ffffffff8171ed90 0 16591 16239 0x10400080
[25104.536102] ffff8800ba0bb8d8 0000000000000002 ffff8800ba0bbfd8 0000000000000006
[25104.536103] 000000000000dc08 ffff880626d0dc08 ffff8800ba0bbfd8 000000000000dc08
[25104.536104] ffff88061b2cdcd0 ffff880616d3a940 ffff880035c10000 ffff880616d3a940
[25104.559274] Call Trace:
[25104.561844] [<ffffffff8171bb54>] schedule+0x34/0xa0
[25104.561846] [<ffffffff8171e2ac>] schedule_timeout+0x23c/0x2a0
[25104.561870] [<ffffffffa00e3ab6>] ? radeon_fence_process+0x16/0x40 [radeon]
[25104.561879] [<ffffffffa00e3b24>] ? radeon_fence_any_seq_signaled+0x44/0x90 [radeon]
[25104.561887] [<ffffffffa00e3e97>] radeon_fence_wait_seq_timeout.constprop.8+0x327/0x380 [radeon]
[25104.561889] [<ffffffff810d19c0>] ? __wake_up_sync+0x20/0x20
[25104.561898] [<ffffffffa00e4287>] radeon_fence_wait_any+0x57/0x70 [radeon]
[25104.561914] [<ffffffffa015a36f>] radeon_sa_bo_new+0x2af/0x4b0 [radeon]
[25104.561916] [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
[25104.561918] [<ffffffff811d0b4a>] ? __kmalloc+0x8a/0x300
[25104.561932] [<ffffffffa01b2197>] radeon_ib_get+0x37/0xe0 [radeon]
[25104.561943] [<ffffffffa01003ee>] radeon_cs_ioctl+0x22e/0x860 [radeon]
[25104.561952] [<ffffffffa0005bc7>] drm_ioctl+0x197/0x670 [drm]
[25104.561954] [<ffffffff81379b07>] ? debug_smp_processor_id+0x17/0x20
[25104.561956] [<ffffffff810901ba>] ? unpin_current_cpu+0x1a/0x80
[25104.561959] [<ffffffff810ba200>] ? migrate_enable+0x90/0x1a0
[25104.561966] [<ffffffffa00c604c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
[25104.561967] [<ffffffff811fdb88>] do_vfs_ioctl+0x2c8/0x4c0
[25104.561969] [<ffffffff81208a92>] ? __fget+0x72/0xb0
[25104.561970] [<ffffffff811fde01>] SyS_ioctl+0x81/0xa0
[25104.561971] [<ffffffff8171f99e>] tracesys_phase2+0xd4/0xd9
Conclusion: A change in the DRM subsystem between 3.16.7 and 3.18.9 introduced a race condition that freezes Radeon graphics. It requires full preemption to be exposed reliably.
Thanks, -Carsten.
On 23.03.2015 07:14, Carsten Emde wrote:
[...]
Can you test a non-rt 3.18.y kernel? There were some intermittent issues around 3.18 fixed by the patches I referenced above. Maybe I missed some other fixes, though. Maarten, do you remember any other fixes offhand that might help?
dri-devel@lists.freedesktop.org