https://bugs.freedesktop.org/show_bug.cgi?id=107572
Bug ID: 107572
Summary: Unrecoverable GPU hang with IP block:gfx_v8_0 is hung
Product: DRI
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: madcatx@atlas.cz
Hello,
I have been experiencing a worrying number of these hangs ever since I got my RX 570 a few months ago. I can reproduce the hang quite reliably with some 3D workloads; for instance, Unigine Superposition run on High quality or Witcher 3 (through WINE) crashes the GPU within minutes.
Once that happens I can always SSH into the machine and try to get at least some debugging information. Unfortunately, there does not seem to be much to go on.
dmesg does not tell me more than this:
[ 254.704581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=103742, last emitted seq=103745
[ 254.704586] [drm] IP block:gfx_v8_0 is hung!
[ 254.704629] [drm] GPU recovery disabled.
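For anyone collecting several of these logs, the timeout line can be picked apart mechanically. The following is only a minimal Python sketch: the regular expression is derived from the single sample line quoted above and may not match other kernel versions, and the function name parse_ring_timeout is made up for illustration. The difference between the two sequence numbers is read here as the number of fences emitted but not yet signaled when the timeout fired, which is an interpretation rather than something the log states.

```python
import re

# Pattern built from the one sample line in this report; the exact wording
# of the message is an assumption for any other kernel version.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # pending = fences emitted but not signaled (interpretation, see lead-in)
    return m.group("ring"), signaled, emitted, emitted - signaled


sample = (
    "[ 254.704581] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=103742, last emitted seq=103745"
)
print(parse_ring_timeout(sample))  # ('gfx', 103742, 103745, 3)
```

Feeding it the output of dmesg line by line and keeping the non-None results would give a quick overview of which ring timed out and how far behind it was.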
Here are a few things I have tried so far:
- Boot with amdgpu.dc=0
- Boot with amdgpu.vm_update_mode=3
- Force the GPU to max power state
- Disable IOMMU (both by iommu=off and by disabling VT-d in BIOS)
- Boot with amdgpu.gpu_recovery=1 (does not produce any additional info)
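When comparing reports it helps to confirm which of these boot options were actually in effect for a given run. A minimal Python sketch follows, assuming the running kernel's command line is available at /proc/cmdline; the helper names boot_params and read_cmdline and the example command line are hypothetical and only there for illustration.

```python
def boot_params(cmdline, prefixes=("amdgpu.", "iommu")):
    """Return {name: value} for kernel parameters whose name starts with a prefix."""
    found = {}
    for token in cmdline.split():
        # "amdgpu.dc=0" -> name "amdgpu.dc", value "0"; flags without "=" get ""
        name, _, value = token.partition("=")
        if name.startswith(prefixes):
            found[name] = value
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Hypothetical command line, not taken from the reporter's machine.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro amdgpu.dc=0 amdgpu.gpu_recovery=1 iommu=off"
print(boot_params(example))
```

On a live system, boot_params(read_cmdline()) lists the amdgpu and IOMMU options that were passed at boot, which can be pasted next to the dmesg output.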
I grabbed the umr tool to try to get the state of the GPU when it crashes, but it does not seem to be able to read anything. Running:
umr -R gfx[.]
Leaves me with:
[ERROR]: Could not open ring debugfs file#
I checked that the entries in /sys/kernel/debug/amdgpu that look relevant are there, but cat'ing them gives me "Operation not permitted". Yes, I am doing it as root.
Once this happens the only way out is a hard reboot.
I am running up-to-date Fedora 28, kernel 4.17.2, Mesa 18.0 series, LLVM 6.0.1.
Is there anything else I can do?
Thanks.
--- Comment #1 from Michel Dänzer michel@daenzer.net --- Can you try latest Mesa / LLVM?
Please attach the corresponding Xorg log file and output of dmesg.
--- Comment #2 from madcatx@atlas.cz --- I remember trying an RC of Mesa 18.2 and kernel 4.18-rc6, which didn't help in any way. If you want me to try the latest code from git/SVN, I'll see what I can do (I can't exactly mess up my production box). In the meantime, is there any way I can get some more useful debugging output?
--- Comment #3 from madcatx@atlas.cz --- Created attachment 141125 --> https://bugs.freedesktop.org/attachment.cgi?id=141125&action=edit dmesg right after the GPU hanged
--- Comment #4 from madcatx@atlas.cz --- Created attachment 141126 --> https://bugs.freedesktop.org/attachment.cgi?id=141126&action=edit Xorg log
--- Comment #5 from madcatx@atlas.cz --- Requested logs attached, I'm afraid they do not contain anything particularly revealing though. Just FTR, my exact version of mesa is 18.0.5, libdrm 2.4.93.
Michel Dänzer michel@daenzer.net changed:
What               |Removed     |Added
----------------------------------------------------------------------------
Attachment #141126 |text/x-log  |text/plain
mime type          |            |
--- Comment #6 from Asseon asseon@posteo.de --- I believe I have the exact same or at least a very similar issue. I have an RX 480 though. I can reproduce this very reliably with Witcher 3 as well, unless I use dxvk (a Vulkan-based DX11 implementation for Wine): with dxvk I can play for hours without any issues, compared to a few minutes without it. This makes me think that the issue might be somewhere in the OpenGL machinery. "Normal" usage, i.e. browsing and watching videos, does occasionally trigger it too.
Relevant software versions:
linux: 4.17.14
mesa: 18.1.5
llvm: 6.0.1
I'm trying to compile current git/svn versions of llvm and mesa right now, but it will take some time. Let's see if that helps.
--- Comment #7 from madcatx@atlas.cz --- I don't think this is isolated to OpenGL, as I got the very same hang in the Vulkan beta of The Talos Principle, although it happened only once. If it is any help, I believe that the Unigine Superposition benchmark always crashes the GPU at a specific point during the run. Reducing the image quality level to "medium" makes the benchmark finish correctly.
Michel Dänzer michel@daenzer.net changed:
What       |Removed      |Added
----------------------------------------------------------------------------
Version    |unspecified  |18.2
Component  |DRM/AMDgpu   |Drivers/Gallium/radeonsi
Product    |DRI          |Mesa
QA Contact |             |dri-devel@lists.freedesktop.org
--- Comment #8 from Michel Dänzer michel@daenzer.net --- Reassigning this to Mesa for now; GFX ring hangs are indeed most likely triggered by userspace issues.
Beware that there might be multiple separate issues with similar symptoms but different causes. It's better to track each issue separately until it's clear that some of them have the same cause. In particular, issues which can be reliably reproduced with a certain application should be kept apart from those which happen randomly.
Asseon asseon@posteo.de changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |asseon@posteo.de
--- Comment #9 from Asseon asseon@posteo.de --- I just tried running Witcher 3 with Wine's own DX11 implementation and the svn/git versions of LLVM and Mesa, and it hung again.
--- Comment #10 from Paju gert.pajuvali@eesti.ee --- I'm using an RX 480 and experiencing the same kind of problems. Running Unigine Superposition crashes the GPU 4 times out of 5. I can also reproduce these crashes by playing Euro Truck Simulator 2, where the crash rate depends directly on how high I set the resolution scale in the game settings: a larger scale causes crashes to occur more often. When I boot the machine into Win10 (I'm running dual boot), everything works fine.
System info:
CPU: Intel i7-3770K
GPU: AMD RX480
Arch Linux
Linux: 4.17.14
Mesa: 18.1.6
LLVM: 6.0.1
--- Comment #11 from madcatx@atlas.cz --- Just out of curiosity, do either of you have a card that is supposed to have some small overclocking done by the manufacturer? My RX570 is supposed to have this and I’m wondering if it could be responsible in any way.
--- Comment #12 from Paju gert.pajuvali@eesti.ee --- I'm using a reference RX480 with default clocks.
--- Comment #13 from Andrew Cook ariscop@gmail.com --- I am having this issue too. I thought it might be bug 105733, but there is no vmfault in dmesg.
For the last few kernel releases I've been checking for the bug by running Obduction under Wine using dxvk; the GPU hangs before the game loads. IIRC the first time I launched Obduction it was without dxvk, and it did run.
Is there something like apitrace for Vulkan? Maybe the hang could be reproduced using one.
Asus GL702ZC, Bios 305
CPU: Ryzen 1700
GPU: RX580
Fedora
Kernel: 4.17.14-202.fc28.x86_64
Mesa: 18.0.1
llvm: 6.0.1
--- Comment #14 from madcatx@atlas.cz --- @Andrew: Could you check that you can reproduce the crash with Unigine Superposition run at High or Ultra quality in 1920x1080? This is what crashes my GPU very reliably. It would be good to have some kind of freely available baseline for this. Note that U:S depends on the older OpenSSL 1.0.2 so a bit of manual library juggling is needed to get it going on F28.
--- Comment #16 from Andrew Cook ariscop@gmail.com --- https://github.com/ValveSoftware/Proton/blob/proton_3.7/PREREQS.md#directx-1...
This suggests using LLVM 7 to avoid GPU hangs. Is someone able to test that?
In addition, is it expected for userspace to be capable of hanging the GPU? That really seems like something the kernel should prevent.
--- Comment #17 from madcatx@atlas.cz --- I just ran a few tests with git/svn versions of LLVM 8.0 and mesa 18.3 and the problem is still there. I attached a dmesg log of the crash in Unigine Superposition. Just FTR the crash with LLVM 8.0/mesa 18.3 happens only on the Extreme settings, High settings survive without a hitch.
--- Comment #18 from madcatx@atlas.cz --- Created attachment 141261 --> https://bugs.freedesktop.org/attachment.cgi?id=141261&action=edit dmesg log of the crash in Unigine Superposition
--- Comment #19 from Andrew Cook ariscop@gmail.com --- I tried again using the Fedora debug kernel.
I couldn't reproduce the Unigine crash. Obduction crashed in the same way, with nothing new in dmesg.
Kernel: 4.17.19-200.fc28.x86_64+debug
--- Comment #20 from Paju gert.pajuvali@eesti.ee --- I ran some Unigine tests with different kernels. No crashes with 4.13.12 and older kernels. Maybe somebody could try to run these tests too and confirm this?
--- Comment #21 from madcatx@atlas.cz --- I just tried to run Unigine Superposition with llvm-6.0.1-7 and kernel 4.18.5 as they arrived in F28, and it finished fine twice. Witcher 3 still crashes though.
--- Comment #22 from Andrew Cook ariscop@gmail.com --- I installed this: https://copr.fedorainfracloud.org/coprs/jerbear64/mesa_dxvk/
It provides Mesa 18.2, and with it the Obduction crash seems to have disappeared.
--- Comment #23 from madcatx@atlas.cz --- OK, I just tried Mesa 18.2 from the Copr suggested by Andrew, but it does not fix Witcher 3 for me. Unigine Superposition seems to have been fixed by the 4.18 kernel, as I just ran it multiple times, even with the 4K profile, and it always finished successfully. The only thing I cannot try easily is LLVM 7, because it breaks too many dependencies on my Fedora box.
GitLab Migration User gitlab-migration@fdo.invalid changed:
What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |---     |MOVED
--- Comment #24 from GitLab Migration User gitlab-migration@fdo.invalid --- -- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1323.