[Bug 212739] New: [amdgpu] Sporadic GPU errors, screen artifacts and GPU-induced system lockups on Vega 10 (Raven Ridge) - dri-devel - freedesktop.org experimental mailing list

21 Apr 2021

      https://bugzilla.kernel.org/show_bug.cgi?id=212739
Bug ID: 212739
           Summary: [amdgpu] Sporadic GPU errors, screen artifacts and
                    GPU-induced system lockups on Vega 10 (Raven Ridge)
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.11.14-1, 5.12.rc7.d0411.gd434405-1
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: tunas@cryptolab.net
        Regression: No
Created attachment 296449
  --> https://bugzilla.kernel.org/attachment.cgi?id=296449&action=edit
Example of GPU artifacts from the recoverable variant of this error
...
From time to time, the amdgpu driver will report a page fault (sometimes coming
from pid 0, sometimes coming from the web browser, sometimes the screen
compositor or Xorg, sometimes a video player, etc.) as shown below:
...
kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0
ring:0 vmid:4 pasid:0, for process  pid 0 thread  pid 0)
kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address
0x800101606000 from client 27
kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
kernel: amdgpu 0000:05:00.0: amdgpu:          Faulty UTCL2 client ID: TCP
(0x8)
kernel: amdgpu 0000:05:00.0: amdgpu:          MORE_FAULTS: 0x1
kernel: amdgpu 0000:05:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:05:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:05:00.0: amdgpu:          RW: 0x0
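The decoded fields in the log above can be reproduced from the raw VM_L2_PROTECTION_FAULT_STATUS value. The following is a minimal sketch, not the driver's code: the bit offsets and widths in `FIELDS` are an assumption chosen to match the values printed in this log (including vmid:4 and client ID 0x8), and should be checked against the amdgpu register headers before being relied on. The names `FIELDS` and `decode_fault_status` are made up for this example.

```python
# Assumed field layout: name -> (lowest bit, width in bits).
# Chosen so that 0x00401031 decodes to the values shown in the log;
# verify against the amdgpu register headers before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # UTCL2 client ID, 0x8 is reported as TCP in the log
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw fault status value into the named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    for name, value in decode_fault_status(0x00401031).items():
        print(f"{name}: {value:#x}")
```

Under this layout, PERMISSION_FAULTS is 0x3 and RW is 0x0, matching the driver's own decode of the register.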
The page fault message is repeated several thousand times in dmesg (along with
"x callbacks suppressed" notices), with different addresses of the form
0x80010160Y000, where Y is a hex digit from 1 to 8.
While this is happening, the display is completely frozen: input still goes
through and music keeps playing, but the screen is static.
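To check the address pattern described above across a full kernel log, the faulting addresses can be extracted and counted. This is a rough sketch that assumes the log text has already been saved (for example from dmesg). The names `FAULT_RE` and `summarize_faults` are hypothetical, and the `sample` text is made-up input: it reuses the one address from this report plus a second address that merely follows the stated pattern.

```python
import re
from collections import Counter

# Matches the "in page starting at address ... from client N" part of the
# fault message; \s+ also tolerates the line wrapping seen in this email.
FAULT_RE = re.compile(
    r"in page starting at address\s+(0x[0-9a-fA-F]+)\s+from client\s+(\d+)"
)


def summarize_faults(log_text: str) -> Counter:
    """Count how often each faulting page address appears in the log text."""
    return Counter(int(addr, 16) for addr, _client in FAULT_RE.findall(log_text))


# Illustrative input only, not a real capture.
sample = (
    "kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address\n"
    "0x800101606000 from client 27\n"
    "kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address "
    "0x800101601000 from client 27\n"
    "kernel: amdgpu 0000:05:00.0: amdgpu:   in page starting at address "
    "0x800101606000 from client 27\n"
)

if __name__ == "__main__":
    for addr, count in sorted(summarize_faults(sample).items()):
        print(f"{addr:#x}: {count}")
```

Because of the "callbacks suppressed" rate limiting, the counts only reflect the messages that actually reached the log, not every fault.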
Then, several seconds later, it's followed by:
...
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences
timed out!
And finally,
...
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft
recovered
After this, the computer resumes operation (but with GPU artifacts having
appeared on the screen - for an example of these, see attached screenshot).
Alternatively, instead of the soft recovery message, the GPU sometimes fails to
recover and the following messages appear in the kernel log:
...
kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access
in command stream
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
seq=3356413, emitted seq=3356415
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process Xorg pid 14524 thread Xorg:cs0 pid 14539
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
kernel: [drm] free PSP TMR buffer
kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
kernel: [drm] PSP is resuming...
kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not
available
kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not
available
kernel: [drm] kiq ring mec 2 pipe 1 q 0
kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR*
ring sdma0 test failed (-110)
kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP
block <sdma_v4_0> failed -110
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(4) failed
kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
at which point rebooting is necessary as the GPU will not resume operation.
This also happens on the latest 5.12 rc (rc7 at the time of writing this bug
report).
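The two outcomes described in this report (soft recovery versus a failed reset) can be told apart mechanically from the log text. The sketch below matches only the message forms quoted in this report, and `classify_outcome` and `LINUX_ERRNO_NAMES` are hypothetical names. The only errno mapping included is 110, which is ETIMEDOUT on Linux.

```python
import re

# Linux errno value seen in this report; other codes are left unnamed.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}


def classify_outcome(log_text: str) -> str:
    """Classify a log excerpt by the message forms quoted in this report."""
    reset = re.search(r"GPU reset end with ret = (-?\d+)", log_text)
    if reset:
        ret = int(reset.group(1))
        if ret != 0:
            name = LINUX_ERRNO_NAMES.get(-ret, "unknown errno")
            return f"reset failed ({name})"
        return "reset reported ret = 0"
    # \s+ tolerates the wrapped "soft\nrecovered" seen in this email.
    if re.search(r"ring \w+ timeout, but soft\s+recovered", log_text):
        return "soft recovered"
    return "no timeout found"


if __name__ == "__main__":
    print(classify_outcome("amdgpu: GPU reset end with ret = -110"))
    print(classify_outcome("*ERROR* ring gfx timeout, but soft\nrecovered"))
```

The -110 in the sdma0 ring test and the IP block resume failure is the same timeout code, so the reset fails because the SDMA ring does not respond after MODE2 reset.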