https://bugs.freedesktop.org/show_bug.cgi?id=62997
Priority: medium
Bug ID: 62997
Assignee: dri-devel@lists.freedesktop.org
Summary: GPU fault unless R600_DEBUG=nodma
Severity: major
Classification: Unclassified
OS: Linux (All)
Reporter: udovdh@xs4all.nl
Hardware: x86-64 (AMD64)
Status: NEW
Version: git
Component: Drivers/Gallium/r600
Product: Mesa
Ever since booting into kernel.org 3.8.4 on my AMD A10-5800K (ARUBA graphics), running git mesa and git xf86-video-ati, I get short uptimes (15 minutes, around one hour max) due to crashes. The logs mention stuff like:
[ 1332.480233] radeon 0000:00:01.0: GPU fault detected: 146 0x0134710c
[ 1332.480243] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000813
[ 1332.480250] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0407100C
Watching YouTube seems to help trigger the issue (so far this is only a correlation, no causation established). Having R600_DEBUG=nodma in the environment solves the problem.
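As a minimal sketch of the workaround (not taken from the report itself), the variable can be set for a single program instead of the whole session by launching it with a modified environment. The helper names `nodma_env` and `run_with_nodma` are hypothetical; only the variable name and its value come from the report. Note that this sketch overwrites any value R600_DEBUG already had.

```python
import os
import subprocess


def nodma_env(base=None):
    """Return a copy of the environment with R600_DEBUG=nodma set.

    `base` defaults to the current process environment. Any existing
    R600_DEBUG value is replaced, not extended.
    """
    env = dict(os.environ if base is None else base)
    env["R600_DEBUG"] = "nodma"
    return env


def run_with_nodma(argv):
    """Run the command `argv` (a list of strings) with the workaround applied.

    Returns the exit code of the child process.
    """
    return subprocess.run(argv, env=nodma_env()).returncode
```

For example, `run_with_nodma(["glxgears"])` would start that one program with the workaround while leaving the rest of the session untouched (the program name here is only an illustration).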
Occasionally I see a GPU lockup, if that is related:
[29648.098135] disk 0, wo:0, o:1, dev:sda2
[29648.098140] disk 1, wo:0, o:1, dev:sdb2
[29648.098142] disk 2, wo:0, o:1, dev:sdc2
[29648.098145] disk 3, wo:0, o:1, dev:sdd2
[68707.166021] radeon 0000:00:01.0: GPU fault detected: 146 0x0d4c2604
[68707.166030] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x000008D4
[68707.166043] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C026004
[70621.378798] radeon 0000:00:01.0: GPU fault detected: 146 0x013c710c
[70621.378808] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000813
[70621.378815] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C07100C
[70621.378837] radeon 0000:00:01.0: GPU fault detected: 147 0x0f0c7102
[70621.378843] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[70621.378848] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[70621.378854] radeon 0000:00:01.0: GPU fault detected: 147 0x0f1c7102
[70621.378859] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[70621.378864] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[70631.857918] radeon 0000:00:01.0: GPU lockup CP stall for more than 10000msec
[70631.857927] radeon 0000:00:01.0: GPU lockup (waiting for 0x00000000007e1fe5 last fence id 0x00000000007e1fe3)
[70631.858436] radeon 0000:00:01.0: sa_manager is not empty, clearing anyway
[70631.859755] radeon 0000:00:01.0: Saved 951 dwords of commands on ring 0.
[70631.859761] radeon 0000:00:01.0: GPU softreset: 0x00000003
[70631.859766] radeon 0000:00:01.0: VM_CONTEXT0_PROTECTION_FAULT_ADDR 0x00000000
[70631.859770] radeon 0000:00:01.0: VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00000000
[70631.859774] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[70631.859778] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[70631.867299] radeon 0000:00:01.0: GRBM_STATUS = 0xA2703828
[70631.867305] radeon 0000:00:01.0: GRBM_STATUS_SE0 = 0x1D000007
[70631.867309] radeon 0000:00:01.0: GRBM_STATUS_SE1 = 0x00000007
[70631.867313] radeon 0000:00:01.0: SRBM_STATUS = 0x20000040
[70631.867317] radeon 0000:00:01.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[70631.867321] radeon 0000:00:01.0: R_008678_CP_STALLED_STAT2 = 0x00018000
[70631.867325] radeon 0000:00:01.0: R_00867C_CP_BUSY_STAT = 0x00008006
[70631.867328] radeon 0000:00:01.0: R_008680_CP_STAT = 0x80038647
[70631.867332] radeon 0000:00:01.0: GRBM_SOFT_RESET=0x0000DF7B
[70631.867386] radeon 0000:00:01.0: GRBM_STATUS = 0x00003828
[70631.867390] radeon 0000:00:01.0: GRBM_STATUS_SE0 = 0x00000007
[70631.867393] radeon 0000:00:01.0: GRBM_STATUS_SE1 = 0x00000007
[70631.867397] radeon 0000:00:01.0: SRBM_STATUS = 0x20000040
[70631.867400] radeon 0000:00:01.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[70631.867404] radeon 0000:00:01.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[70631.867408] radeon 0000:00:01.0: R_00867C_CP_BUSY_STAT = 0x00000000
[70631.867411] radeon 0000:00:01.0: R_008680_CP_STAT = 0x00000000
[70631.883681] radeon 0000:00:01.0: GPU reset succeeded, trying to resume
[70631.916445] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[70631.916534] radeon 0000:00:01.0: WB enabled
[70631.916536] radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000030000c00 and cpu addr 0xffff880235891c00
[70631.916538] radeon 0000:00:01.0: fence driver on ring 1 use gpu addr 0x0000000030000c04 and cpu addr 0xffff880235891c04
[70631.916540] radeon 0000:00:01.0: fence driver on ring 2 use gpu addr 0x0000000030000c08 and cpu addr 0xffff880235891c08
[70631.916541] radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000030000c0c and cpu addr 0xffff880235891c0c
[70631.916543] radeon 0000:00:01.0: fence driver on ring 4 use gpu addr 0x0000000030000c10 and cpu addr 0xffff880235891c10
[70631.935206] [drm] ring test on 0 succeeded in 3 usecs
[70631.935264] [drm] ring test on 3 succeeded in 2 usecs
[70631.935271] [drm] ring test on 4 succeeded in 1 usecs
[70631.949531] [drm] ib test on ring 0 succeeded in 0 usecs
[70631.950057] [drm] ib test on ring 3 succeeded in 0 usecs
[70631.950576] [drm] ib test on ring 4 succeeded in 1 usecs
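For anyone comparing logs like the ones above, the following is a rough sketch of how the fault triples (the "GPU fault detected" message plus the ADDR and STATUS messages that follow it) could be pulled out of dmesg or syslog text. The function name `parse_faults` and the field names are hypothetical, and the two numbers after "detected:" are kept as raw values because the report does not say how to interpret them. The sketch scans the whole text rather than individual lines, so it also copes with logs whose line breaks were lost.

```python
import re

# Start of a fault report: timestamp, device, then two raw values.
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+radeon \S+ GPU fault detected: "
    r"(?P<code>\d+) (?P<data>0x[0-9a-fA-F]+)"
)
ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+(0x[0-9a-fA-F]+)")
STATUS_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_STATUS\s+(0x[0-9a-fA-F]+)")


def parse_faults(text):
    """Return one dict per 'GPU fault detected' message found in `text`."""
    faults = []
    matches = list(FAULT_RE.finditer(text))
    for i, m in enumerate(matches):
        # Only look for ADDR/STATUS between this fault and the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunk = text[m.end():end]
        addr = ADDR_RE.search(chunk)
        status = STATUS_RE.search(chunk)
        faults.append({
            "time": float(m.group("ts")),
            "code": int(m.group("code")),
            "data": int(m.group("data"), 16),
            "addr": int(addr.group(1), 16) if addr else None,
            "status": int(status.group(1), 16) if status else None,
        })
    return faults
```

Filtering the result for entries with a non-zero `addr` separates the first fault of each burst from the follow-up messages that report 0x00000000.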
--- Comment #1 from udo udovdh@xs4all.nl --- Created attachment 77277 --> https://bugs.freedesktop.org/attachment.cgi?id=77277&action=edit Xorg.0.log with R600_DEBUG=nodma
--- Comment #2 from udo udovdh@xs4all.nl --- Created attachment 77278 --> https://bugs.freedesktop.org/attachment.cgi?id=77278&action=edit dmesg
udo udovdh@xs4all.nl changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|GPU fault unless            |Crashes on ARUBA unless
                   |R600_DEBUG=nodma            |R600_DEBUG=nodma
--- Comment #3 from udo udovdh@xs4all.nl --- With R600_DEBUG=nodma we still get some GPU fault messages, but not as often, and the whole PC no longer crashes.
--- Comment #4 from udo udovdh@xs4all.nl --- Is https://bugs.freedesktop.org/show_bug.cgi?id=58667 a related issue?
--- Comment #5 from udo udovdh@xs4all.nl --- It does crash, but without a reboot. The GUI disappears. A pure text-mode screen showing the first few seconds of boot is displayed. No network. The kernel is still alive.
Apr 7 07:59:47 surfplank2 dbus[3118]: [system] Rejected send message, 2 matched rules; type="method_return", sender=":1.2" (uid=0 pid=3090 comm="/usr/lib/systemd/systemd-logind ") interface="(unset)" member ="(unset)" error name="(unset)" requested_reply="0" destination=":1.34" (uid=500 pid=4127 comm="gnome-session ")
Apr 7 08:11:39 surfplank2 kernel: [406000.278385] radeon 0000:00:01.0: GPU fault detected: 147 0x0f727102
Apr 7 08:11:39 surfplank2 kernel: [406000.278390] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x000018F7
Apr 7 08:11:39 surfplank2 kernel: [406000.278393] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02071002
Apr 7 08:11:39 surfplank2 kernel: [406000.278396] radeon 0000:00:01.0: GPU fault detected: 147 0x0f627102
Apr 7 08:11:39 surfplank2 kernel: [406000.278399] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278401] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278404] radeon 0000:00:01.0: GPU fault detected: 147 0x07527102
Apr 7 08:11:39 surfplank2 kernel: [406000.278406] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278409] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278411] radeon 0000:00:01.0: GPU fault detected: 147 0x07627102
Apr 7 08:11:39 surfplank2 kernel: [406000.278413] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278416] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278418] radeon 0000:00:01.0: GPU fault detected: 147 0x00a27102
Apr 7 08:11:39 surfplank2 kernel: [406000.278420] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278423] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
Apr 7 08:11:39 surfplank2 kernel: [406000.278426] radeon 0000:00:01.0: GPU fault detected: 147 0x00a27102
Apr 7 08:11:39 surfplank2 kernel: [406000.278428] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x000000
Apr 7 08:17:11 surfplank2 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Apr 7 08:17:11 surfplank2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="3041" x-info="http://www.rsyslog.com"] start
--- Comment #6 from udo udovdh@xs4all.nl --- FWIW: Another lockup..
[ 9912.997377] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpers instead.
[16500.596325] radeon 0000:00:01.0: GPU fault detected: 146 0x0eb27104
[16500.596330] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x000008EB
[16500.596332] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02071004
[16500.596335] radeon 0000:00:01.0: GPU fault detected: 146 0x0ec27104
[16500.596337] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[16500.596340] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16500.596342] radeon 0000:00:01.0: GPU fault detected: 147 0x06b27102
[16500.596344] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[16500.596347] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16500.596349] radeon 0000:00:01.0: GPU fault detected: 147 0x06c27102
[16500.596351] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[16500.596353] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16511.077533] radeon 0000:00:01.0: GPU lockup CP stall for more than 10000msec
[16511.077537] radeon 0000:00:01.0: GPU lockup (waiting for 0x000000000038b92b last fence id 0x000000000038b928)
[16511.078189] radeon 0000:00:01.0: sa_manager is not empty, clearing anyway
[16511.079467] radeon 0000:00:01.0: Saved 215 dwords of commands on ring 0.
[16511.079470] radeon 0000:00:01.0: GPU softreset: 0x00000003
[16511.079473] radeon 0000:00:01.0: VM_CONTEXT0_PROTECTION_FAULT_ADDR 0x00000000
[16511.079475] radeon 0000:00:01.0: VM_CONTEXT0_PROTECTION_FAULT_STATUS 0x00000000
[16511.079478] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[16511.079480] radeon 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[16511.261445] radeon 0000:00:01.0: GRBM_STATUS = 0xE5702828
[16511.261447] radeon 0000:00:01.0: GRBM_STATUS_SE0 = 0xFC000005
[16511.261450] radeon 0000:00:01.0: GRBM_STATUS_SE1 = 0x00000007
[16511.261451] radeon 0000:00:01.0: SRBM_STATUS = 0x20000040
[16511.261454] radeon 0000:00:01.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[16511.261456] radeon 0000:00:01.0: R_008678_CP_STALLED_STAT2 = 0x00018000
[16511.261458] radeon 0000:00:01.0: R_00867C_CP_BUSY_STAT = 0x00008006
[16511.261461] radeon 0000:00:01.0: R_008680_CP_STAT = 0x80038647
[16511.261462] radeon 0000:00:01.0: GRBM_SOFT_RESET=0x0000DF7B
[16511.261515] radeon 0000:00:01.0: GRBM_STATUS = 0x00003828
[16511.261517] radeon 0000:00:01.0: GRBM_STATUS_SE0 = 0x00000007
[16511.261519] radeon 0000:00:01.0: GRBM_STATUS_SE1 = 0x00000007
[16511.261521] radeon 0000:00:01.0: SRBM_STATUS = 0x20000040
[16511.261523] radeon 0000:00:01.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[16511.261525] radeon 0000:00:01.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[16511.261527] radeon 0000:00:01.0: R_00867C_CP_BUSY_STAT = 0x00000000
[16511.261528] radeon 0000:00:01.0: R_008680_CP_STAT = 0x00000000
[16511.274728] radeon 0000:00:01.0: GPU reset succeeded, trying to resume
[16511.463803] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
[16511.463892] radeon 0000:00:01.0: WB enabled
[16511.463895] radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000030000c00 and cpu addr 0xffff8802331cdc00
[16511.463897] radeon 0000:00:01.0: fence driver on ring 1 use gpu addr 0x0000000030000c04 and cpu addr 0xffff8802331cdc04
[16511.463900] radeon 0000:00:01.0: fence driver on ring 2 use gpu addr 0x0000000030000c08 and cpu addr 0xffff8802331cdc08
[16511.463902] radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000030000c0c and cpu addr 0xffff8802331cdc0c
[16511.463903] radeon 0000:00:01.0: fence driver on ring 4 use gpu addr 0x0000000030000c10 and cpu addr 0xffff8802331cdc10
[16511.482550] [drm] ring test on 0 succeeded in 2 usecs
[16511.482609] [drm] ring test on 3 succeeded in 2 usecs
[16511.482617] [drm] ring test on 4 succeeded in 1 usecs
[16511.497231] [drm] ib test on ring 0 succeeded in 0 usecs
[16511.497751] [drm] ib test on ring 3 succeeded in 0 usecs
[16511.498269] [drm] ib test on ring 4 succeeded in 1 usecs
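The timing in the log above can be checked with a small script. This is only a sketch: the function name `lockup_summary` is hypothetical, and reading the difference between the two fence values as "how far the GPU is behind" is an assumption, not something the report states. It measures the time from the last "GPU fault detected" message to the "GPU lockup (waiting for ...)" message and the gap between the awaited and the last fence id.

```python
import re

TS = r"\[\s*(\d+\.\d+)\]"
FAULT_RE = re.compile(TS + r"\s+radeon \S+ GPU fault detected:")
LOCKUP_RE = re.compile(
    TS + r"\s+radeon \S+ GPU lockup \(waiting for (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+)\)"
)


def lockup_summary(text):
    """Summarise the first lockup message in `text`.

    Returns (delay_seconds, fence_gap), or None if no lockup is found.
    delay_seconds is None when no fault precedes the lockup.
    """
    lockup = LOCKUP_RE.search(text)
    if lockup is None:
        return None
    t_lockup = float(lockup.group(1))
    # Only faults logged before the lockup message are considered.
    fault_times = [
        float(m.group(1)) for m in FAULT_RE.finditer(text[:lockup.start()])
    ]
    delay = t_lockup - fault_times[-1] if fault_times else None
    fence_gap = int(lockup.group(2), 16) - int(lockup.group(3), 16)
    return delay, fence_gap
```

Applied to the log in this comment, the delay comes out at roughly 10.48 seconds (16511.077537 minus 16500.596349), which fits the "CP stall for more than 10000msec" message, and the fence gap is 3.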
--- Comment #7 from Alex Deucher agd5f@yahoo.com --- This may be related to bug 62959. Does attachment 72794 (kernel patch) fix the issue?
--- Comment #8 from udo udovdh@xs4all.nl --- Will start testing on 3.8.6 in a few minutes.
--- Comment #9 from udo udovdh@xs4all.nl --- 3.8.6, both with and without the patch, had crashes of various kinds (even a hard freeze!). Now running 3.8.5 without patch, waiting for the RAID check to complete.
--- Comment #10 from udo udovdh@xs4all.nl --- Despite crashes for other reasons (ARUBA (Cayman) not yet ready for OpenCL) I saw no GPU faults etc in the logs since booting into 3.8.5 with the patch. I want to give it a few more days without OpenCL disruptions to be sure.
--- Comment #11 from Alex Deucher agd5f@yahoo.com --- This is starting to look like a duplicate of bug 62959. Can you try attachment 77608? That seems to fix 62959, hopefully it will fix this one as well.
--- Comment #12 from udo udovdh@xs4all.nl --- So I undo the previous patch and try this new one? (Or try them combined?)
--- Comment #13 from Alex Deucher agd5f@yahoo.com --- (In reply to comment #12)
> So I undo the previous patch and try this new one? (Or try them combined?)
Try them separately, not combined.
--- Comment #14 from udo udovdh@xs4all.nl --- I guess the second patch also fixes the issue. After 1 day, 15:11 of uptime I saw no GPU faults, hangs, etc. Normally they occurred much sooner than that.
Alex Deucher agd5f@yahoo.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |DUPLICATE
--- Comment #15 from Alex Deucher agd5f@yahoo.com ---
*** This bug has been marked as a duplicate of bug 62959 ***