https://bugzilla.kernel.org/show_bug.cgi?id=204559
Bug ID: 204559
Summary: amdgpu: kernel oops with constant gpu resets while using mpv
Product: Drivers
Version: 2.5
Kernel Version: 5.2.7
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: shoegaze@tutanota.com
Regression: No
Created attachment 284335
  --> https://bugzilla.kernel.org/attachment.cgi?id=284335&action=edit
oops.txt
While watching a video using mpv (default config), the system eventually hangs. This is actually a kernel oops that happens after many GPU resets, roughly one every second or so, over a span of about 5 minutes; everything seems fine at the beginning:
Aug 12 00:46:49 mashedpotato kernel: [drm] UVD and UVD ENC initialized successfully.
Aug 12 00:46:49 mashedpotato kernel: [drm] VCE initialized successfully.
Aug 12 00:46:56 mashedpotato kernel: amdgpu 0000:01:00.0: GPU pci config reset
Aug 12 00:46:59 mashedpotato kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
This block of warnings repeats many times and is then followed by this error:
Aug 12 00:52:20 mashedpotato kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
Aug 12 00:52:20 mashedpotato kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* resume of IP block <sdma_v3_0> failed -110
Aug 12 00:52:20 mashedpotato kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
Aug 12 00:52:25 mashedpotato kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Aug 12 00:52:25 mashedpotato kernel: #PF: supervisor instruction fetch in kernel mode
Aug 12 00:52:25 mashedpotato kernel: #PF: error_code(0x0010) - not-present page
The end result is a kernel oops; the log is in the attachment. Afterwards the system is recoverable only via a hard reset, though the sound from the video keeps playing just fine.
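To quantify how often the resets occur before the first error, journal text in the format quoted above can be scanned for the "GPU pci config reset" message. The following is only a rough Python sketch: the function name `summarize` is hypothetical, and because the short journal timestamps carry no year, an arbitrary placeholder year is assumed.

```python
import re
from datetime import datetime

# Syslog-style prefix as seen in the excerpts above, e.g.
# "Aug 12 00:46:56 mashedpotato kernel: amdgpu 0000:01:00.0: GPU pci config reset"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: (.*)$")


def summarize(log_text, year=2000):
    """Return (reset_times, first_error) found in journal text.

    `year` is an assumption: the log lines do not include one, so a
    placeholder (a leap year, so that Feb 29 parses) is used.
    """
    resets = []
    first_error = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a kernel line in the expected format
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        msg = m.group(2)
        if "GPU pci config reset" in msg:
            resets.append(stamp)
        elif "*ERROR*" in msg and first_error is None:
            first_error = (stamp, msg)
    return resets, first_error
```

Feeding it the saved output of `journalctl --dmesg` gives the list of reset timestamps and the first `*ERROR*` line, from which the interval between resets can be computed.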
My system is an ASUS laptop, TUF FX505-DY with the latest BIOS. lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:01.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP Bridge [6:0]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal PCIe GPP Bridge 0 to Bus A
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: Function 7
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
02:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 5008 (rev 01)
03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8821CE 802.11ac PCIe Wireless Network Adapter
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2)
05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Raven/Raven2/Fenghuang HDMI/DP Audio Controller
05:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
05:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1
05:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1
05:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) HD Audio Controller
I have amdgpu.gpu_reset=1 on my kernel command line because I want to figure out another issue: sometimes the system hangs after locking and turning off the screen, and I guess it is related to GPU reset.
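When comparing boots with different options, it can help to record which amdgpu.* parameters were actually passed. A small Python sketch, assuming the running command line is readable from /proc/cmdline as on a typical Linux system; both function names are hypothetical:

```python
def amdgpu_params(cmdline):
    """Extract amdgpu.<name>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


def running_amdgpu_params(path="/proc/cmdline"):
    """Same as above, for the currently booted kernel (path is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return amdgpu_params(f.read())
```

For a command line containing the option mentioned above, `amdgpu_params` returns a dictionary with the key `gpu_reset` and the value `"1"`.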
Alex Deucher (alexdeucher@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com
--- Comment #1 from Alex Deucher (alexdeucher@gmail.com) ---
Please attach your full dmesg output from boot.
--- Comment #2 from Maxim Sheviakov (shoegaze@tutanota.com) ---
Created attachment 284337
  --> https://bugzilla.kernel.org/attachment.cgi?id=284337&action=edit
journalctl --dmesg output
As I can't run dmesg after the system's hung, here's journalctl --dmesg output from that particular boot.
--- Comment #3 from Alex Deucher (alexdeucher@gmail.com) ---
You can fetch the output before the hang.
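One possible way to keep the kernel log from a session that ends in a hang is to stream it to a file as it is produced and sync every line to disk. This is only a sketch in Python, assuming `dmesg -w` is available (it is used for the later attachments in this report); the helper names and the output path are hypothetical.

```python
import os
import subprocess


def persist_lines(lines, path):
    """Append each line to `path`, forcing it to disk before reading the next."""
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()             # push Python's buffer to the OS
            os.fsync(out.fileno())  # ask the OS to write it to storage
            count += 1
    return count


def follow_dmesg(path):
    """Run `dmesg -w` and persist its output until it exits or the system hangs."""
    proc = subprocess.Popen(["dmesg", "-w"], stdout=subprocess.PIPE, text=True)
    return persist_lines(proc.stdout, path)
```

Calling `follow_dmesg("dmesg-follow.log")` before starting playback should leave the messages written up to the hang in that file after the hard reset, although nothing guarantees the very last lines reach the disk.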
--- Comment #4 from Alex Deucher (alexdeucher@gmail.com) ---
Looks like your system has two GPUs. Can you try booting with amdgpu.runpm=0? Does that fix the issue?
--- Comment #5 from Maxim Sheviakov (shoegaze@tutanota.com) ---
Created attachment 284341
  --> https://bugzilla.kernel.org/attachment.cgi?id=284341&action=edit
dmesg -w without runpm parameter
Here's the whole dmesg from a fresh boot up until the hang; no kernel parameters were modified.
--- Comment #6 from Maxim Sheviakov (shoegaze@tutanota.com) ---
Created attachment 284345
  --> https://bugzilla.kernel.org/attachment.cgi?id=284345&action=edit
dmesg -w with runpm=0 parameter
I left my laptop with a video playing for about half an hour and no GPU-related warnings have been produced so far, only RTL8821CE spam. As far as I can see, the root cause of the problem is somewhere in the runtime power management and/or GPU switching code.
--- Comment #7 from Maxim Sheviakov (shoegaze@tutanota.com) ---
By the way, how *exactly* does disabling runpm affect the system? Does it leave the discrete GPU always-on or vice versa? Or does it vary from system to system? I have tried running The Crew via Wine + DXVK while having amdgpu.runpm=0 in my kernel params and it seems that the discrete GPU was being used, as the framerate was more than fine.
--- Comment #8 from Alex Deucher (alexdeucher@gmail.com) ---
(In reply to Maxim Sheviakov from comment #7)
> By the way, how *exactly* does disabling runpm affect the system? Does it leave the discrete GPU always-on or vice versa? Or does it vary on each system?
It leaves the dGPU powered up all the time rather than dynamically powering it on/off as needed.
> I have tried running The Crew via Wine + DXVK while having amdgpu.runpm=0 in my kernel params and it seems that discrete GPU was being used as the framerate was more than fine.
You can use xrandr to pick which GPU you want to use for rendering.
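Whether the dGPU is currently powered can be checked from the PCI device's runtime power management attributes in sysfs. A minimal Python sketch, assuming the standard `power/control` and `power/runtime_status` files and the dGPU address 0000:01:00.0 taken from the log above; the function name and the `sysfs_root` parameter are hypothetical:

```python
from pathlib import Path


def runtime_pm_state(pci_addr, sysfs_root="/sys"):
    """Read the runtime PM attributes of a PCI device, or None where missing."""
    power_dir = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr / "power"
    state = {}
    for name in ("control", "runtime_status"):
        attr = power_dir / name
        state[name] = attr.read_text().strip() if attr.exists() else None
    return state
```

With runtime power management active, `runtime_pm_state("0000:01:00.0")` would be expected to report `runtime_status` alternating between `active` and `suspended`; with amdgpu.runpm=0 it should stay `active`.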
--- Comment #9 from Maxim Sheviakov (shoegaze@tutanota.com) ---
Thanks for your explanation. By the way, disabling runpm also seems to fix the other issue, where the system hangs when the display is turned off as a power-saving measure after the lockscreen activates. Is there anything else I can do to help with this one? The whole thing seems to be an issue somewhere in the dynamic switching mechanism, which works but is not really stable, given all these hangs under certain conditions.
Christopher Snowhill (kode54@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kode54@gmail.com
--- Comment #10 from Christopher Snowhill (kode54@gmail.com) ---
This looks like an issue I'm having intermittently with the GPU failing to resume from system sleep mode. Do I need to report a separate issue for this? Should I also bother to test the runpm=0 workaround?
--- Comment #11 from Christopher Snowhill (kode54@gmail.com) ---
Oops, I neglected to mention: the system is non-responsive to input devices, as all USB input appears to be completely powered off after the GPU crashes, but the network interface is still working, as is sound output, and I'm able to log into the machine via SSH. It does, however, lock up if I attempt to soft reboot it.
The full dmesg from the session that eventually crashed is still available in the journal, up to the point where it was flooded with sdma0 timeouts and failures.
thejoe@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thejoe@gmail.com
--- Comment #12 from thejoe@gmail.com ---
I have seen the same kernel oops on a Dell XPS 15 2-in-1 9575 with Vega M hybrid graphics. As far as I could tell it was not triggered by anything specific (e.g. mpv playback), though. I am running with runpm=0 now and haven't seen it again yet, but I had only seen it once or twice without runpm=0.
--- Comment #13 from Maxim Sheviakov (shoegaze@tutanota.com) ---
I'm on kernel 5.4.7 now and it seems like this particular issue is fixed: I tried playing some movies with runpm enabled and things seemed to be okay. It does look like dGPU performance with runpm is considerably worse than without runpm, but I guess that's another issue :)
Can anyone confirm if everything's fine now?
--- Comment #14 from thejoe@gmail.com ---
I have not seen the oops on a 5.3.x kernel (Ubuntu Eoan), even without tweaking the runpm setting (again, I only saw it a few times on an earlier kernel).
Maxim Sheviakov (shoegaze@tutanota.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |CODE_FIX
dri-devel@lists.freedesktop.org