https://bugzilla.kernel.org/show_bug.cgi?id=212739
Bug ID: 212739
Summary: [amdgpu] Sporadic GPU errors, screen artifacts and GPU-induced system lockups on Vega 10 (Raven Ridge)
Product: Drivers
Version: 2.5
Kernel Version: 5.11.14-1, 5.12.rc7.d0411.gd434405-1
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: tunas@cryptolab.net
Regression: No
Created attachment 296449
  --> https://bugzilla.kernel.org/attachment.cgi?id=296449&action=edit
Example of GPU artifacts from the recoverable variant of this error
From time to time, the amdgpu driver reports a page fault. The faulting
process varies: sometimes it is pid 0, sometimes the web browser, the screen
compositor or Xorg, a video player, etc. A typical report looks like this:
kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:0, for process pid 0 thread pid 0)
kernel: amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x800101606000 from client 27
kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
kernel: amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
kernel: amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x1
kernel: amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu: RW: 0x0
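For anyone comparing fault reports, the individual fields can be recovered from the raw VM_L2_PROTECTION_FAULT_STATUS value. The following is a minimal sketch, not the driver's code: the bit offsets and widths in FIELDS are an assumption, chosen because they reproduce the values printed in the log above (including vmid:4 and client ID 0x8), and decode_fault_status is a hypothetical helper name.

```python
# Assumed field layout as (bit offset, width). This is NOT taken from the
# report; it was chosen because it reproduces the values the driver printed.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status word into named fields using the assumed layout."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


# Status value copied from the log in this report.
decoded = decode_fault_status(0x00401031)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

If the assumed layout is right, the decoded CID should match the "Faulty UTCL2 client ID" line and the decoded VMID should match the vmid in the first line of the fault report.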
This message is repeated several thousand times in dmesg (interleaved with "x callbacks suppressed" notices), each time with a different address of the form 0x80010160Y000, where Y is a hex digit from 1 to 8. In the meantime, the display is completely frozen: inputs still go through and music keeps playing, but the screen is static.
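To see which pages are faulting and how often, the addresses can be pulled out of a saved kernel log. This is a rough sketch under stated assumptions: count_fault_pages is a hypothetical helper, the regular expression only matches the exact wording quoted in this report, and only the first sample line is real; the others are illustrative lines built from the 0x80010160Y000 pattern described above. Because the kernel rate-limits these messages, the counts are a lower bound.

```python
import re
from collections import Counter

# Matches the address line of a page-fault report as worded in this bug.
FAULT_ADDR = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")


def count_fault_pages(log_text):
    """Return a Counter mapping each faulting page address to its hit count."""
    return Counter(int(m.group(1), 16) for m in FAULT_ADDR.finditer(log_text))


# The first line is copied from the report. The remaining lines are
# illustrative only, constructed from the address pattern described above.
sample_log = "\n".join([
    "kernel: amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x800101606000 from client 27",
    "kernel: amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x800101601000 from client 27",
    "kernel: amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x800101606000 from client 27",
])

pages = count_fault_pages(sample_log)
for addr, hits in sorted(pages.items()):
    print(f"{addr:#x}: {hits}")
```

In practice the input would be the text of a saved dmesg or journal dump rather than the inline sample.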
Then, several seconds later, it's followed by:
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
And finally,
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
After this, the computer resumes operation (but with GPU artifacts having appeared on the screen - for an example of these, see attached screenshot).
Alternatively, instead of the soft recovery message, the GPU sometimes fails to recover and the following messages appear in the kernel log:
kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3356413, emitted seq=3356415
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 14524 thread Xorg:cs0 pid 14539
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
kernel: [drm] free PSP TMR buffer
kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
kernel: [drm] PSP is resuming...
kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: [drm] kiq ring mec 2 pipe 1 q 0
kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(4) failed
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
at which point rebooting is necessary as the GPU will not resume operation.
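The two variants described above can be told apart mechanically when triaging saved logs. The sketch below is only an illustration: classify_hang is a hypothetical helper, and it keys on the exact message wording quoted in this report, which may differ between kernel versions. The -110 return value seen in the failing case corresponds to -ETIMEDOUT on Linux.

```python
import re

# Final line of a full GPU reset attempt, as worded in this report.
RESET_RET = re.compile(r"GPU reset end with ret = (-?\d+)")


def classify_hang(log_text):
    """Classify a log excerpt as one of the variants described in this bug."""
    match = RESET_RET.search(log_text)
    if match:
        # A full reset was attempted; a non-zero return code means it failed.
        return "reset ok" if int(match.group(1)) == 0 else "reset failed"
    if "ring gfx timeout, but soft recovered" in log_text:
        return "soft recovered"
    return "unknown"


# Both sample lines are copied from the logs quoted in this report.
recoverable = "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered"
fatal = "kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110"

print(classify_hang(recoverable))
print(classify_hang(fatal))
```

The reset result is checked first so that an excerpt containing both a soft recovery and a later failed reset is reported as the more severe case.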
This also happens on the latest 5.12 rc (as of the writing of this bug report, this is rc7).