https://bugs.freedesktop.org/show_bug.cgi?id=103736
Bug ID: 103736
Summary: Sudden system freezes, dmesg errors
Product: DRI
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: shiverly@mt2015.com
Created attachment 135450 --> https://bugs.freedesktop.org/attachment.cgi?id=135450&action=edit dmesg errors
I installed Ubuntu MATE 17.10 and the M-Bab drivers (https://github.com/M-Bab/linux-kernel-amdgpu-binaries; without them one monitor is always black but powered on).
Almost every day the system freezes suddenly after a random amount of time, anywhere from 5 minutes to 3+ hours. Only the power button helps; no logs are saved, but dmesg has errors.
I think this is either AMDGPU bug or something ryzen related (most likely not, because they manifest as sudden reboots, never as system freezes. And last bios update stopped them 2 months ago).
Graphics:
  Card: Advanced Micro Devices [AMD/ATI] Tonga PRO [Radeon R9 285/380]
  Display Server: x11 (X.Org 1.19.5) drivers: ati,amdgpu (unloaded: modesetting,fbdev,vesa,radeon)
  Resolution: 1920x1080@60.00hz, 1920x1080@60.00hz
  OpenGL: renderer: AMD Radeon R9 200 Series (TONGA / DRM 3.23.0 / 4.13.11+, LLVM 5.0.1) version: 4.5 Mesa 17.4.0-devel
Michel Dänzer michel@daenzer.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andresx7@gmail.com
--- Comment #1 from Michel Dänzer michel@daenzer.net ---
(In reply to Shiverly from comment #0)
> [...] dmesg has errors.
I only see messages about failing to allocate a larger BAR, which is harmless.
> I think this is either AMDGPU bug or something ryzen related (most likely not, because they manifest as sudden reboots, never as system freezes. And last bios update stopped them 2 months ago).
FWIW, Andres Rodriguez reported similar symptoms with a Ryzen system on IRC, and raising voltages / disabling Cool'n'Quiet / disabling C6 states fixed them for him.
--- Comment #2 from Andres Rodriguez andresx7@gmail.com ---
> FWIW, Andres Rodriguez reported similar symptoms with a Ryzen system on IRC, and raising voltages / disabling Cool'n'Quiet / disabling C6 states fixed them for him.
I raised the memory and the core voltages specifically. The other voltages like SoC were left untouched.
--- Comment #3 from Shiverly shiverly@mt2015.com ---
(In reply to Andres Rodriguez from comment #2)
> > FWIW, Andres Rodriguez reported similar symptoms with a Ryzen system on IRC, and raising voltages / disabling Cool'n'Quiet / disabling C6 states fixed them for him.
> I raised the memory and the core voltages specifically. The other voltages like SoC were left untouched.
I didn't have these symptoms in Arch or Ubuntu 16.04 LTS, only when using this driver/kernel combination (which is the only one that keeps both monitors usable). Long compilation jobs don't cause system freezes either.
--- Comment #4 from Shiverly shiverly@mt2015.com ---
I got some logs. Maybe they are related (found them in journalctl)
Nov 15 20:09:06 tibu-pc kernel: gmc_v8_0_process_interrupt: 626 callbacks suppressed
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068154, read from 'TC5' (0x54433500) (192)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A0C0001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EE4BA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x05d0c001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x02, vmid 5) at page 5545728, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044002
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00549F00
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x05d00001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068154, read from 'TC0' (0x54433000) (8)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A008001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EE4BA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x05d00801
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x02, vmid 5) at page 5541638, read from 'TC9' (0x54433900) (136)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088002
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00548F06
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06500001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC0' (0x54433000) (8)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A008001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06500801
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x02, vmid 5) at page 5541634, read from 'TC8' (0x54433800) (64)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A040002
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00548F02
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06504001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06504401
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC2' (0x54433200) (0)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A000001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06500001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068171, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EE4CB
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06584401
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 5) at page 154068170, read from 'TC7' (0x54433700) (68)
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A044001
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EE4CA
Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x06504401
Nov 15 20:08:20 tibu-pc kernel: gmc_v8_0_process_interrupt: 1830 callbacks suppressed
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 1) at page 154054911, read from 'TC5' (0x54433500) (192)
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C0001
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EB0FF
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x07f8c001
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 1) at page 154054911, read from 'TC0' (0x54433000) (8)
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02008001
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x092EB0FF
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: GPU fault detected: 147 0x07f80801
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM fault (0x01, vmid 1) at page 154054911, read from 'TC11' (0x54433131) (128)
Nov 15 20:08:03 tibu-pc kernel: amdgpu 0000:22:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02080001
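For anyone reading these faults, here is a minimal Python sketch of how a single "VM fault" line can be picked apart. It relies only on what the log itself shows: the hex value after the client tag is the tag's ASCII bytes (0x54433500 reads as 'TC5'), and the page number is the decimal form of the VM_CONTEXT1_PROTECTION_FAULT_ADDR value printed next to it. The names VM_FAULT_RE, decode_tag and parse_vm_fault are made up for this illustration; they are not part of amdgpu or of any existing tool.

```python
import re

# Hypothetical pattern matching the "VM fault" lines quoted in this bug.
VM_FAULT_RE = re.compile(
    r"VM fault \(0x(?P<prot>[0-9a-fA-F]+), vmid (?P<vmid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from "
    r"'(?P<tag>[^']*)' \(0x(?P<tag_hex>[0-9a-fA-F]{8})\) \((?P<client>\d+)\)"
)


def decode_tag(tag_hex):
    """Read the 32-bit tag as ASCII bytes, e.g. '54433500' -> 'TC5'."""
    raw = bytes.fromhex(tag_hex)
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


def parse_vm_fault(line):
    """Return the fields of one 'VM fault' line, or None if it does not match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "protections": int(m.group("prot"), 16),
        "vmid": int(m.group("vmid")),
        "page": int(m.group("page")),
        "access": m.group("access"),
        "tag": decode_tag(m.group("tag_hex")),
        "client_id": int(m.group("client")),
    }


line = (
    "Nov 15 20:08:20 tibu-pc kernel: amdgpu 0000:22:00.0: "
    "VM fault (0x01, vmid 5) at page 154068154, "
    "read from 'TC5' (0x54433500) (192)"
)
fault = parse_vm_fault(line)
print(fault)
# The page number is the same value as the FAULT_ADDR line (0x092EE4BA).
print(hex(fault["page"]))
```

Grouping the parsed faults by vmid and page shows quickly whether a burst of messages points at a handful of addresses or at many different ones.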
Shiverly shiverly@mt2015.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Sudden system freezes,      |Sudden system freezes, GPU
                   |dmesg errors                |fault detected
--- Comment #5 from Shiverly shiverly@mt2015.com ---
One way to trigger the crash quickly is to play the Overpass map in CS:GO, at the terrorist spawn. Textures near the stairs show up corrupted, and the system always hangs within the first 5 minutes of gameplay. I think it's 3D related: just using a simple text editor or sitting in a Ctrl-Alt-Fx terminal never causes a hang. A browser can cause one too, but it takes longer to manifest than when playing a 3D game.
--- Comment #6 from Lennart Sauerbeck fdobugs@lennart.sauerbeck.org ---
Created attachment 137005 --> https://bugs.freedesktop.org/attachment.cgi?id=137005&action=edit Crash while playing Counter-Strike: Global Offensive
I think I'm running into the same issues. Attached is the kernel output while playing Counter-Strike: Global Offensive. It worked during the warmup, but froze in the first round, so I'd say about 3-5 minutes after starting the game.
I'm running an up-to-date Debian unstable with Linux 4.14.13 and Mesa 17.3.3.
--- Comment #7 from Lennart Sauerbeck fdobugs@lennart.sauerbeck.org ---
Created attachment 137006 --> https://bugs.freedesktop.org/attachment.cgi?id=137006&action=edit Errors while playing CS:GO, crash and reboot after opening VLC
Another crash pretty much right after the one from my previous comment. After rebooting the system to continue playing Counter-Strike: Global Offensive the errors kept coming, though the system did not freeze (note the timestamps in the error log).
After shutting down the game, I started VLC to watch a stream and the system froze immediately. After a short while (<5 minutes) I used the Magic SysRq keys to reboot the system safely, which can also be seen in the log.
A possibly important detail: My system doesn't freeze entirely, only the graphics output does. Sound still works for a time, even voice chatting continues to work. However, all X output freezes (e.g. conky on desktop).
I haven't tried going to a virtual console, so do not know whether that still works.
I also had the same issue while playing Euro Truck Simulator 2, but it never happened while playing Dota 2. Given this, it seems like some illegal instruction is passed to the graphics driver. Would an ApiTrace help? If so, I can try to record one.
--- Comment #8 from Lennart Sauerbeck fdobugs@lennart.sauerbeck.org ---
I was able to record an ApiTrace which shows the problem consistently. However, it's 2.5 gigabytes and contains personal information I'd rather not share on a public bugtracker. I think a trace can only be truncated, and removing stuff from the beginning messes up the OpenGL context?
I cannot switch to the virtual console when the freeze is triggered.
I also built radeonsi from current Mesa git (9b9a89cd795fda462a6ee898ef6e5135ca79d94e) but the problem persisted.
--- Comment #9 from Ernst Sjöstrand ernstp@gmail.com ---
I get
[ 133.978908] amdgpu 0000:09:00.0: GPU fault detected: 147 0x00198802
[ 133.978911] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500003
[ 133.978912] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02188002
[ 133.978914] amdgpu 0000:09:00.0: VM fault (0x02, vmid 1) at page 5242883, read from 'TC4' (0x54433400) (392)
or from another boot
[ 204.841497] amdgpu 0000:09:00.0: GPU fault detected: 147 0x00188402
[ 204.841501] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00500003
[ 204.841502] amdgpu 0000:09:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A084002
[ 204.841504] amdgpu 0000:09:00.0: VM fault (0x02, vmid 5) at page 5242883, read from '' (0x00000000) (132)
when I try to launch Steam. It never gets to draw any UI; the computer just freezes. This happens with both the 4.13(-ubuntu33) and 4.15.2 kernels with Mesa/LLVM from git (padoka). When I reverted to Mesa 17.2.8 + LLVM 5.0.0 I could launch Steam again.
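The VM_CONTEXT1_PROTECTION_FAULT_STATUS words above seem to carry the same numbers that the "VM fault" line prints in decimal. A rough Python sketch of that relationship follows. The bit positions are an assumption inferred from the log lines in this bug, not taken from register documentation, and decode_fault_status is a hypothetical helper.

```python
def decode_fault_status(status):
    """Split a fault status word into the fields the 'VM fault' line prints.

    ASSUMPTION: the bit positions below were inferred by comparing the
    status values and 'VM fault' lines in this bug report only.
    """
    return {
        "protections": status & 0xFF,         # bits 0-7, printed as 0x01 / 0x02
        "client_id": (status >> 12) & 0x1FF,  # bits 12-20, the trailing (392)
        "vmid": (status >> 25) & 0xF,         # bits 25-28, printed as "vmid N"
    }


# Status words copied from the two boots quoted above.
for status in (0x02188002, 0x0A084002):
    print(hex(status), decode_fault_status(status))
```

If a status word decodes to values that disagree with its "VM fault" line, the assumed layout does not apply to that kernel version and should not be trusted.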
--- Comment #10 from Ernst Sjöstrand ernstp@gmail.com ---
The Vehicle Game demo seems to trigger this quite reliably for me: https://wiki.unrealengine.com/Linux_Demos
--- Comment #11 from Ernst Sjöstrand ernstp@gmail.com ---
Ok, the VM faults I see are caused by using the Padoka PPA, which currently has https://cgit.freedesktop.org/mesa/mesa/commit/?id=847d0a393d7f0f967f39302900... but not https://reviews.llvm.org/D41663
That means it can't be the same as the original issue, and also that the solution for me is just to update to more recent versions. Sorry for the noise in this bug.
--- Comment #12 from aceman acelists@atlas.sk ---
Ernst, I have also traced the error you have to the use of OpenCL in the Mesa clover driver on an RX560 with LLVM upgraded from 5.0.1 to 6.0. What do you say is the solution? Is Mesa using intrinsics that are only in LLVM git? Or is that LLVM changeset you posted already in the LLVM 6.0 release?
--- Comment #13 from Ernst Sjöstrand ernstp@gmail.com ---
aceman: the problem was mismatching development snapshots; it couldn't happen if you have any real releases in the mix.
--- Comment #14 from aceman acelists@atlas.sk ---
I'm using Mesa git, but LLVM 6.0 release. Is that fine wrt. this mismatch?
Martin Peres martin.peres@free.fr changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|NEW                         |RESOLVED
--- Comment #15 from Martin Peres martin.peres@free.fr ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/258.