https://bugs.freedesktop.org/show_bug.cgi?id=107154
Bug ID: 107154
Summary: [drm] GPU recovery disabled.
Product: DRI
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: freedesktop.org@nentwig.biz
Hi!
This is a surprisingly long-standing problem with an RX 460, more precisely since 4.15 all the way up to the 4.18 AMD staging DRM next kernel [1]. After resuming from sleep (echo -n mem > /sys/power/state), amdgpu is dead (always, reliably). Here's what dmesg has to say about it:
[Sun Jul 8 11:01:17 2018] PM: suspend exit
[Sun Jul 8 11:01:19 2018] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
[Sun Jul 8 11:01:19 2018] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on GFX ring (-110).
[Sun Jul 8 11:01:19 2018] [drm:process_one_work] *ERROR* ib ring test failed (-110).
[Sun Jul 8 11:01:28 2018] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=864, last emitted seq=868
[Sun Jul 8 11:01:28 2018] [drm] GPU recovery disabled.
From earlier versions:
[ 42.802559] PM: suspend exit
[ 42.824332] amdgpu 0000:41:00.0: GPU fault detected: 147 0x0bd84802
[ 42.824338] amdgpu 0000:41:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0034F97B
[ 42.824341] amdgpu 0000:41:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C048002
[ 42.824345] amdgpu 0000:41:00.0: VM fault (0x02, vmid 6) at page 3471739, read from 'TC0' (0x54433000) (72)
[ 52.956306] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1287, last emitted seq=1289
[ 52.956316] [drm] IP block:gfx_v8_0 is hung!
[ 52.956362] [drm] GPU recovery disabled.
I've also seen fault 146 but other than that it mostly looks the same. 4.14-lts (with dc=0) works fine.
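Editorial note: both logs end with the same "ring gfx timeout" message, and the gap between the emitted and the signaled sequence number tells how many submissions were still outstanding when the job timed out. The (-110) in the first log is the kernel's -ETIMEDOUT. The following is a minimal sketch for pulling those numbers out of a dmesg line; the function name and the regular expression are illustrative assumptions, not part of any amdgpu tooling.

```python
import re

# Matches the amdgpu_job_timedout message as it is quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, outstanding), or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # outstanding = sequence numbers emitted but never signaled
    return m.group("ring"), signaled, emitted, emitted - signaled


line = ("[Sun Jul 8 11:01:28 2018] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring gfx timeout, last signaled seq=864, last emitted seq=868")
print(parse_ring_timeout(line))  # ('gfx', 864, 868, 4)
```

Feeding it the full output of dmesg line by line and keeping the non-None results is enough to compare timeouts across kernel versions.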
RX 460, Zenith Extreme, 1950x.
[1] Arch Linux AUR; this versioning is a bit confusing, it may actually already be the 4.19 branch. The latest commit is 3838e387fd1eb17bfcf6ff7d443d931adb5cb41b
--- Comment #1 from dwagner jb5sgc1n.nya@20mm.eu ---
Indeed, crashes upon S3 resumes have been abundant with amdgpu.dc=1 for many months now, and seemingly for more than one reason.
One bug I reported in August 2017 with https://bugs.freedesktop.org/show_bug.cgi?id=102323 - that one was fixed quickly.
The next S3 resume crash I reported in October 2017 in https://bugs.freedesktop.org/show_bug.cgi?id=103277, that one stayed without any resolution until April 2018, and the fix found in that report only works if no "drm.edid_firmware=..." kernel command line option is used.
Another crash bug with S3 resumes I reported for 4.17.2 kernels in https://bugs.freedesktop.org/show_bug.cgi?id=107065 - then realized that 4.18 pre-releases exhibit the very same kind of crash immediately upon starting X11. For this crash upon X11 startup, there is a patch in the bug report, but it does not prevent the S3 resume crash.
I currently work around S3 resume crashes by switching to the console display before entering S3 sleep - but this is really an awkward work-around.
--- Comment #2 from freedesktop.org@nentwig.biz ---
(In reply to dwagner from comment #1)
> I currently work around S3 resume crashes by switching to the console display before entering S3 sleep - but this is really an awkward work-around.
Oh, that doesn't help either. It crashes the very moment I switch back to X.
And what's more, starting with 4.15, amdgpu.dc=0 doesn't appear to make any difference.
--- Comment #3 from Michel Dänzer michel@daenzer.net ---
Please attach the full dmesg output.
Can you bisect between 4.14 and 4.15?
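Editorial note: bisecting means repeatedly testing a revision halfway between the last known good and the first known bad one; in a kernel tree, git bisect drives this search, and each step here would mean building, booting, and running a suspend/resume cycle. The sketch below only illustrates the search itself on a made-up linear history. The revision names and the culprit "r6" are invented for the example.

```python
def bisect_first_bad(revisions, is_bad):
    """Return the first revision for which is_bad() is True.

    Assumes revisions[0] is known good, revisions[-1] is known bad, and
    every revision after the first bad one is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # breakage is at mid or earlier
        else:
            lo = mid  # breakage is after mid
    return revisions[hi]


# Hypothetical history r0..r9; pretend the resume crash appears at r6.
revisions = [f"r{i}" for i in range(10)]
first_bad = bisect_first_bad(revisions, lambda r: int(r[1:]) >= 6)
print(first_bad)  # r6
```

The number of test cycles grows with the logarithm of the number of revisions, which is what makes bisecting a whole release range practical.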
--- Comment #4 from Christian König ckoenig.leichtzumerken@gmail.com ---
Do you have a full dmesg?
--- Comment #5 from freedesktop.org@nentwig.biz ---
Created attachment 140525
  --> https://bugs.freedesktop.org/attachment.cgi?id=140525&action=edit
dmesg amdgpu.dc=1
Booted with amdgpu.dc=1.
--- Comment #6 from freedesktop.org@nentwig.biz ---
Created attachment 140526
  --> https://bugs.freedesktop.org/attachment.cgi?id=140526&action=edit
dmesg /etc/modprobe.d/
Booted with amdgpu.dc=1 in /etc/modprobe.d/
--- Comment #7 from freedesktop.org@nentwig.biz ---
Sure, attached. AMD staging kernel. I don't know how to tell whether DC=1 is really enabled, so I did two runs: one with amdgpu.dc=1 as boot parameter and one with /etc/modprobe.d/ on top of that.
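Editorial note: readable module parameters normally show up under /sys/module/<module>/parameters/, so one way to check the value the driver actually loaded with is to read it back from there. The sketch below assumes that the running amdgpu build exposes its dc parameter this way; the helper name and the sysfs_root argument are illustrative.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys"):
    """Return a module parameter's current value as a string.

    Returns None if the module is not loaded or the parameter is not
    exposed (assumption: amdgpu's "dc" parameter is readable via sysfs).
    """
    path = Path(sysfs_root) / "module" / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


print("amdgpu.dc =", read_module_param("amdgpu", "dc"))
```

If this prints None, the parameter is not visible in sysfs on that system and the value has to be inferred from the boot log instead.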
Procedure was the same both times:
- boot
- X login
- switch to console
- sleep, wakeup
- switch to X
The drm/amdgpu lines appear already in the console right after waking up, prior to switching to X.
This time "only" X crashed (could still move the pointer); at times the complete machine is dead, no switching to console and no SSH.
(As a side note: is it normal that waking up on Ryzen takes something on the order of 10-30s? I'm used to split-second wakeups on Intel.)
HTH
--- Comment #8 from freedesktop.org@nentwig.biz ---
Created attachment 140528
  --> https://bugs.freedesktop.org/attachment.cgi?id=140528&action=edit
dmesg 4.14 LTS
Sorry, forgot about the requested 4.14 dmesg log. Attached as well.
This is: boot, login (to KDE this time), do stuff, remember, sleep, wakeup.
Christian König ckoenig.leichtzumerken@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
--- Comment #9 from Christian König ckoenig.leichtzumerken@gmail.com ---
Yeah, that is a known problem in the PCI subsystem. Will be fixed with 4.19 and then backported to older kernels.
--- Comment #10 from freedesktop.org@nentwig.biz ---
So, there's 4.19rc1-amd-next \o/
echo: write error: Device or resource busy
This started to happen with 4.18. dmesg:
[ 171.245467] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 171.245484] systemd-udevd D 0 700 615 0x80000124
So, is this something to report to fricking systemd?
Gee, really...?!
--- Comment #11 from kyle.devir@mykolab.com ---
> systemd-udevd
This is not systemd's fault, but indicative of something hanging in kernel land, which udevd ends up being blocked on.
I experienced this a few major kernel releases ago, and it was resolved by the next major version. Never did figure out what caused udevd to block... :/
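Editorial note: the "D" after systemd-udevd in the freezer message is the task state for uninterruptible sleep, which fits the explanation that udevd is stuck waiting on something inside the kernel. A minimal sketch for listing processes currently in that state by reading /proc/<pid>/stat follows; the helper names are illustrative, and the sample line in the comment is a hypothetical stat line built from the PID and parent PID shown in the log.

```python
import os


def parse_stat(stat_text):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def tasks_in_state(wanted="D", proc_root="/proc"):
    """Return [(pid, comm)] for processes whose main thread is in `wanted`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == wanted:
            found.append((pid, comm))
    return found


# Hypothetical stat line: "700 (systemd-udevd) D 615 ..."
if os.path.isdir("/proc"):
    print(tasks_in_state("D"))
```

This only looks at each process's main thread; individual threads live under /proc/<pid>/task/ and would need the same treatment.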