https://bugs.freedesktop.org/show_bug.cgi?id=104362
Bug ID: 104362
Summary: GPU fault detected on wine-nine Path of Exile
Product: DRI
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: grantipak@gmail.com
Created attachment 136347 --> https://bugs.freedesktop.org/attachment.cgi?id=136347&action=edit dmesg log
When I start the game Path of Exile on wine-nine, the computer freezes. I can't switch to a kernel console. When I connect over SSH, I can still get dmesg output.
Radeon HD 7950 (TAHITI)
kernel 4.13.4 - 4.14.6 kernel module AMDGPU
Mesa 17.2.1 - 17.3.1
wine-nine 2.20 - 2.21
--- Comment #1 from Vladimir Usikov grantipak@gmail.com --- What I mean by freezing: the computer does not respond to the keyboard or mouse. When I press 'Num Lock' or 'Caps Lock', the LED does not light up. Pressing Ctrl+Alt+F# does not switch to TTY#. Today, when I came back to the hung computer, I saw a new error in dmesg.
[26173.119284] INFO: task amdgpu_cs:0:660 blocked for more than 120 seconds.
[26173.119292] Tainted: G O 4.14.9-1-ARCH #1
[26173.119295] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26173.119299] amdgpu_cs:0 D 0 660 629 0x00000000
[26173.119303] Call Trace:
[26173.119316] ? __schedule+0x290/0x890
[26173.119320] schedule+0x2f/0x90
[26173.119407] amd_sched_entity_push_job+0xd2/0x110 [amdgpu]
[26173.119415] ? wait_woken+0x80/0x80
[26173.119488] amdgpu_job_submit+0x76/0x90 [amdgpu]
[26173.119550] amdgpu_vm_bo_update_mapping.constprop.25+0x35a/0x3c0 [amdgpu]
[26173.119612] ? amdgpu_vm_prt_cb+0x20/0x20 [amdgpu]
[26173.119673] amdgpu_vm_bo_update+0x272/0x550 [amdgpu]
[26173.119734] amdgpu_cs_ioctl+0x12a9/0x1a50 [amdgpu]
[26173.119797] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119826] drm_ioctl_kernel+0x59/0xb0 [drm]
[26173.119851] drm_ioctl+0x2d5/0x370 [drm]
[26173.119910] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu]
[26173.119964] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[26173.119971] do_vfs_ioctl+0xa1/0x610
[26173.119976] ? SyS_futex+0x12d/0x180
[26173.119980] SyS_ioctl+0x74/0x80
[26173.119984] entry_SYSCALL_64_fastpath+0x1a/0x7d
[26173.119988] RIP: 0033:0x7effda21d337
[26173.119990] RSP: 002b:00007effd028eb08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[26173.119993] RAX: ffffffffffffffda RBX: 0000555f9a2f4870 RCX: 00007effda21d337
[26173.119994] RDX: 00007effd028eb70 RSI: 00000000c0186444 RDI: 0000000000000018
[26173.119996] RBP: 00007effd028eae0 R08: 00007effd028ec10 R09: 00007effd028eb50
[26173.119998] R10: 00007effd028ec10 R11: 0000000000000246 R12: 0000000040086409
[26173.119999] R13: 0000000000000018 R14: 0000555f99557420 R15: 0000555f9a1ecf60
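For anyone triaging similar logs, a hung-task report like the one above can be picked out of saved dmesg output with a small shell helper. This is only a sketch: the function name list_hung_tasks is made up for illustration, and the pattern assumes the "INFO: task NAME:PID blocked for more than N seconds." wording seen in this trace.

```shell
# list_hung_tasks (hypothetical helper): read kernel log text on stdin and
# print "TASK PID SECONDS" for every hung-task report it contains.
# Assumes the message wording shown in the trace above.
list_hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}
```

Typical use would be `dmesg | list_hung_tasks`, or `list_hung_tasks < dmesg_dump` on a saved log. For the report above it prints `amdgpu_cs:0 660 120`.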
--- Comment #2 from Vladimir Usikov grantipak@gmail.com --- Created attachment 141280 --> https://bugs.freedesktop.org/attachment.cgi?id=141280&action=edit dmesg
The freeze still occurs on Linux 4.18.4 and Mesa 18.1.7. The dmesg output is different this time.
After the freeze I tried: cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
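A slightly more defensive way to run the same recovery attempt is sketched below. It assumes, as the command above implies, that reading the amdgpu_gpu_recover debugfs node is what requests the recovery; the function name trigger_gpu_recover and its optional path argument are hypothetical and exist only to make the check explicit.

```shell
# trigger_gpu_recover (hypothetical helper): read the debugfs node used above.
# The default path is the one from this comment; the card index (0) may differ
# on other systems, so it can be overridden with the first argument.
trigger_gpu_recover() {
    node=${1:-/sys/kernel/debug/dri/0/amdgpu_gpu_recover}
    if [ ! -r "$node" ]; then
        # Usually means debugfs is not mounted, the caller is not root,
        # or the device index is wrong.
        echo "cannot read $node" >&2
        return 1
    fi
    cat "$node"
}
```

It would normally be run as root over SSH after the hang, followed by `dmesg` to see whether the recovery attempt logged anything.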
--- Comment #3 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to Vladimir Usikov from comment #2)
> Created attachment 141280 [details] dmesg
> Freeze still going on Linux 4.18.4 and mesa 18.1.7. Dmesg different.
> After freeze i try cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
Please provide a clean, new dmesg log and also glxinfo output.
--- Comment #4 from Vladimir Usikov grantipak@gmail.com --- Created attachment 141329 --> https://bugs.freedesktop.org/attachment.cgi?id=141329&action=edit clean dmesg
--- Comment #5 from Vladimir Usikov grantipak@gmail.com --- Created attachment 141330 --> https://bugs.freedesktop.org/attachment.cgi?id=141330&action=edit glxinfo
Vladimir Usikov grantipak@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #136347|text/x-log                  |text/plain
          mime type|                            |
--- Comment #6 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- Did you run pm-hibernate before the hang happened? I see a hibernate message just before the hang in the previous log. In the latest log I see no hang messages.
--- Comment #7 from Vladimir Usikov grantipak@gmail.com --- Created attachment 141405 --> https://bugs.freedesktop.org/attachment.cgi?id=141405&action=edit dmesg_2
> Did you do pm-hibernate before the hang happened ? I see a hibernate print just before the hang in the previous log ?
Yes, several times.
> In latest log i see no prints of hang.
Yes, you requested a clean dmesg.
I am now attaching dmesg output taken without pm-hibernate.
--- Comment #8 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- We can try and check the gfx command buffer for latest commands and CUs status -
Clone and build our open source register analyzer from here - https://cgit.freedesktop.org/amd/umr/
After hang happens please get following outputs -
sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump
dmesg > dmesg_dump
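Since the machine is typically only reachable over SSH once the hang occurs, it may help to have the collection steps above ready as a single function. The following is a sketch, not part of umr: the name collect_hang_dumps and the UMR_CMD and DMESG_CMD override variables are assumptions added for illustration, while the umr options are exactly the ones listed above.

```shell
# collect_hang_dumps OUTDIR (hypothetical helper): run the umr and dmesg
# commands listed above and store their output in OUTDIR.
# UMR_CMD and DMESG_CMD are illustrative overrides, e.g. to point at a
# locally built umr binary; they default to the commands used above.
collect_hang_dumps() {
    outdir=${1:-.}
    umr_cmd=${UMR_CMD:-sudo umr}
    dmesg_cmd=${DMESG_CMD:-dmesg}
    mkdir -p "$outdir" || return 1
    # Same invocations, in the same order, as in the list above.
    # $umr_cmd is left unquoted on purpose so "sudo umr" splits into two words.
    $umr_cmd -lb > "$outdir/umr_dump"
    # gfx[.] is quoted so the shell cannot expand it as a glob pattern.
    $umr_cmd -O verbose,use_colour -R 'gfx[.]' >> "$outdir/umr_dump"
    $umr_cmd -O halt_waves,use_colour -wa >> "$outdir/umr_dump"
    $dmesg_cmd > "$outdir/dmesg_dump"
}
```

After a hang, something like `collect_hang_dumps ~/hang-$(date +%s)` over SSH would leave umr_dump and dmesg_dump in a fresh directory, ready to attach here.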
--- Comment #9 from Vladimir Usikov grantipak@gmail.com --- My Radeon 7950 has died, so I can't test any more.
nmr nnmmrr88+fd@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nnmmrr88+fd@gmail.com
--- Comment #10 from nmr nnmmrr88+fd@gmail.com --- Created attachment 143036 --> https://bugs.freedesktop.org/attachment.cgi?id=143036&action=edit UMR dump for PoE/gallium-nine induced AMDGPU hang
I am experiencing the same bug, here is the UMR dump.
--- Comment #11 from nmr nnmmrr88+fd@gmail.com --- Created attachment 143037 --> https://bugs.freedesktop.org/attachment.cgi?id=143037&action=edit dmesg dump for hang
--- Comment #12 from nmr nnmmrr88+fd@gmail.com --- Andrey is there anything I can do to help resolve this? Be happy to help. Haven't looked at the ring buffers, is there some kind of deadlock in there?
--- Comment #13 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to nmr from comment #10)
> Created attachment 143036 [details] UMR dump for PoE/gallium-nine induced AMDGPU hang
> I am experiencing the same bug, here is the UMR dump.
Marek, I am seeing waves dumps in here during the hang, could you please take a look and advise ?
--- Comment #14 from Marek Olšák maraeo@gmail.com --- There is some branching and SGPR spilling, so I guess that's the problem.
--- Comment #15 from nmr nnmmrr88+fd@gmail.com --- Marek forgive my ignorance but why would SGPR spilling or branching cause the hang? Is the shader just timing out somehow and the timeout resulting in a kernel module abort?
--- Comment #16 from Marek Olšák maraeo@gmail.com --- The wave dump suggests that image_sample_lz might be responsible for the hang, but its SGPR inputs seem to contain valid descriptors.
--- Comment #17 from Marek Olšák maraeo@gmail.com --- (In reply to nmr from comment #15)
> Marek forgive my ignorance but why would SGPR spilling or branching cause the hang? Is the shader just timing out somehow and the timeout resulting in a kernel module abort?
Pretty much. The shader is stuck and doesn't continue. Also the shader is insanely huge with lots of SGPR spilling and branching.
--- Comment #18 from nmr nnmmrr88+fd@gmail.com --- Created attachment 143124 --> https://bugs.freedesktop.org/attachment.cgi?id=143124&action=edit dmesg during reboot/recovery
On the basis that it may be shader induced I repro'd a similar hang with dxvk (v0.95-5-gcc38412).
FWIW here's dmesg during subsequent GPU recovery (which fails :( ) and reboot (which hangs.) It appears hung on a DMA, and/or hung doing a modeset, acquiring the modeset lock.
--- Comment #19 from nmr nnmmrr88+fd@gmail.com --- Created attachment 143125 --> https://bugs.freedesktop.org/attachment.cgi?id=143125&action=edit UMR dump for similar dxvk hang
I see that there is only one wave noted in the dump, and the shader appears to be of reasonable length.
pgm[7@0x800100025000 + 0x94 ] = 0x3727c5ac ;;
Are these timed NOPs or something to achieve the correct cycle delay to avoid load hazards or something?
--- Comment #20 from Axel Davy davyaxel0@gmail.com --- From the previous comments, it looks like the problem is in radeonsi.
As a 'temporary fix', you could try this patch: https://github.com/iXit/Mesa-3D/commit/976f3fe791b0aa34cc04eaac53147eb60089e...
This patch recompiles the shaders with the boolean and integer constant values given by the app, thus the branches controlled by them are simplified.
--- Comment #21 from nmr nnmmrr88+fd@gmail.com --- There may also be a bug in radeonsi, and thanks for the heads up, but every circumstance where user code causes a kernel hang is a bug.
--- Comment #22 from nmr nnmmrr88+fd@gmail.com --- amdgpu still hangs kernel in Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux
--- Comment #23 from nmr nnmmrr88+fd@gmail.com --- Marek, is this even the right bug tracker for the kernel module or is this just for user space?
--- Comment #24 from Alex Deucher alexdeucher@gmail.com --- (In reply to nmr from comment #23)
> Marek, is this even the right bug tracker for the kernel module or is this just for user space?
Same bug tracker for all components.
--- Comment #25 from nmr nnmmrr88+fd@gmail.com --- Is it likely that this hang will get any traction with the AMDGPU team? Or should I just close it and reset my expectations?
Alex Deucher alexdeucher@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Product|DRI                         |Mesa
          Component|DRM/AMDgpu                  |Drivers/Gallium/radeonsi
         QA Contact|                            |dri-devel@lists.freedesktop.org
--- Comment #26 from nmr nnmmrr88+fd@gmail.com --- I'm getting the impression that AMD does not regard the kernel hang as the underlying issue. Is that correct?
--- Comment #27 from Alex Deucher alexdeucher@gmail.com --- (In reply to nmr from comment #26)
> I'm getting the impression that AMD does not regard the kernel hang as the underlying issue. Is that correct?
The GPU hang is most likely caused by a bug in Mesa. What kernel are you using? GPU reset was only recently enabled by default on certain ASICs. Even if a GPU reset is successful, user mode programs (like X or the Wayland desktop compositor) need to properly catch and handle GPU resets, which they currently don't. Can you try 4.20 or newer?
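Whether a machine already runs 4.20 or newer can be checked from the output of `uname -r`. The helper below is a rough sketch: kernel_at_least is a made-up name, it compares only the major and minor numbers, and it assumes the release string starts with "MAJOR.MINOR" as in the versions quoted in this thread.

```shell
# kernel_at_least MAJOR MINOR [RELEASE] (hypothetical helper):
# succeed if RELEASE (default: output of `uname -r`) is at least MAJOR.MINOR.
kernel_at_least() {
    want_major=$1
    want_minor=$2
    release=${3:-$(uname -r)}
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

For example, `kernel_at_least 4 20 && echo "4.20 or newer"` prints nothing on a 4.19.0-2-amd64 kernel.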
--- Comment #28 from nmr nnmmrr88+fd@gmail.com --- I get that it's triggered by Mesa, but don't you think it's a bug itself that user-space can hang the kernel? I can't even switch virtual consoles when it hangs.
I'm currently running Linux waldorf 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux
I'll report back when I upgrade to 4.20
GitLab Migration User gitlab-migration@fdo.invalid changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |MOVED
--- Comment #29 from GitLab Migration User gitlab-migration@fdo.invalid --- -- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1295.