https://bugs.freedesktop.org/show_bug.cgi?id=102322
--- Comment #55 from Andrey Grodzovsky andrey.grodzovsky@amd.com ---
(In reply to dwagner from comment #54)
(In reply to Andrey Grodzovsky from comment #53)
Created attachment 141198 [details] [review]
add_debug_info2.patch
Try this patch instead, I might be missing some prints in the first one.
Can try that this evening.
In the last log you attached I haven't seen any UMR dumps or GPU fault prints in dmesg. The GPU fault has to be in the log so that the faulty address can be compared against the debug prints from the patch.
In the file attached above, "xz-compressed output of gpu_debug3.sh", there is umr output from the time of the crash (238 seconds after the reboot):
...
mpv/vo-897 [005] .... 235.191542: dma_fence_wait_start: driver=drm_sched timeline=gfx context=162 seqno=87
mpv/vo-897 [005] d... 235.191548: dma_fence_enable_signal: driver=drm_sched timeline=gfx context=162 seqno=87
kworker/0:2-92 [000] .... 238.275988: dma_fence_signaled: driver=amdgpu timeline=sdma1 context=11 seqno=210
kworker/0:2-92 [000] .... 238.276004: dma_fence_signaled: driver=amdgpu timeline=sdma1 context=11 seqno=211
[ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=32624, emitted seq=32626
[ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
[ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!
crash detected!
executing umr -O halt_waves -wa
No active waves!
Did you use the amdgpu.vm_fault_stop=2 parameter? If a fault had happened, that should have frozen the GPU's compute units, and hence the above command would have produced a lot of wave info.
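To rule out that the parameter was simply not in effect, it may help to read back its current value before reproducing the crash. The following is only a minimal sketch: it assumes the parameter is exposed at the usual sysfs location for module parameters (/sys/module/amdgpu/parameters/vm_fault_stop), and the helper name read_vm_fault_stop is made up for illustration.

```python
from pathlib import Path

# Assumption: amdgpu exposes the parameter at the usual sysfs location
# for module parameters. Adjust the path if your kernel differs.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/vm_fault_stop")


def read_vm_fault_stop(path=PARAM_PATH):
    """Return the parameter value as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (module not loaded, parameter not exposed)
        # or content that is not a plain integer.
        return None


if __name__ == "__main__":
    value = read_vm_fault_stop()
    if value is None:
        print("vm_fault_stop could not be read")
    elif value == 2:
        print("vm_fault_stop=2 is in effect")
    else:
        print(f"vm_fault_stop={value}, not 2 as requested")
```

Running this once after boot, before starting the test workload, would show whether the value given on the kernel command line actually reached the driver.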
executing umr -O verbose -R gfx[.]
polaris11.gfx.rptr == 1792
polaris11.gfx.wptr == 1792
polaris11.gfx.drv_wptr == 1792
polaris11.gfx.ring[1761] == 0xffff1000 ...
polaris11.gfx.ring[1762] == 0xffff1000 ...
polaris11.gfx.ring[1763] == 0xffff1000 ...
polaris11.gfx.ring[1764] == 0xffff1000 ...
polaris11.gfx.ring[1765] == 0xffff1000 ...
polaris11.gfx.ring[1766] == 0xffff1000 ...
polaris11.gfx.ring[1767] == 0xffff1000 ...
polaris11.gfx.ring[1768] == 0xffff1000 ...
polaris11.gfx.ring[1769] == 0xffff1000 ...
polaris11.gfx.ring[1770] == 0xffff1000 ...
polaris11.gfx.ring[1771] == 0xffff1000 ...
polaris11.gfx.ring[1772] == 0xffff1000 ...
polaris11.gfx.ring[1773] == 0xffff1000 ...
polaris11.gfx.ring[1774] == 0xffff1000 ...
polaris11.gfx.ring[1775] == 0xffff1000 ...
polaris11.gfx.ring[1776] == 0xffff1000 ...
polaris11.gfx.ring[1777] == 0xffff1000 ...
polaris11.gfx.ring[1778] == 0xffff1000 ...
polaris11.gfx.ring[1779] == 0xffff1000 ...
polaris11.gfx.ring[1780] == 0xffff1000 ...
polaris11.gfx.ring[1781] == 0xffff1000 ...
polaris11.gfx.ring[1782] == 0xffff1000 ...
polaris11.gfx.ring[1783] == 0xffff1000 ...
polaris11.gfx.ring[1784] == 0xffff1000 ...
polaris11.gfx.ring[1785] == 0xffff1000 ...
polaris11.gfx.ring[1786] == 0xffff1000 ...
polaris11.gfx.ring[1787] == 0xffff1000 ...
polaris11.gfx.ring[1788] == 0xffff1000 ...
polaris11.gfx.ring[1789] == 0xffff1000 ...
polaris11.gfx.ring[1790] == 0xffff1000 ...
polaris11.gfx.ring[1791] == 0xffff1000 ...
polaris11.gfx.ring[1792] == 0xc0032200 rwD
trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
done after crash, flashing NUMLOCK LED.
amdgpu_cs:0-799 [001] .... 286.852838: amdgpu_bo_list_set: list=0000000099c16b5c, bo=000000001771c26f, bo_size=131072
amdgpu_cs:0-799 [001] .... 286.852846: amdgpu_bo_list_set: list=0000000099c16b5c, bo=0000000046bfd439, bo_size=131072
...
But sure, there were no "VM_CONTEXT1_PROTECTION_FAULT_ADDR" error messages this time. Sometimes such messages are emitted, sometimes not.
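Since the fault messages only show up in some runs, a quick check of each captured log could tell whether it contains a fault worth comparing against the patch's debug prints. The sketch below is an illustration only: it matches the ring timeout line in the form seen in the log above and merely tests for the presence of the string "VM_CONTEXT1_PROTECTION_FAULT_ADDR", without assuming anything about the rest of that line's format. The function name summarize_dmesg is hypothetical.

```python
import re

# Matches the timeout message in the form seen in the attached log, e.g.
# "ring sdma0 timeout, signaled seq=32624, emitted seq=32626".
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

# Only the marker string is checked; the layout of the rest of the
# fault line is deliberately not assumed.
FAULT_MARKER = "VM_CONTEXT1_PROTECTION_FAULT_ADDR"

SAMPLE = (
    "[ 238.180634] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 "
    "timeout, signaled seq=32624, emitted seq=32626\n"
    "[ 238.180641] amdgpu 0000:0a:00.0: GPU reset begin!\n"
)


def summarize_dmesg(text):
    """Collect ring timeouts and any lines mentioning a VM fault address."""
    timeouts = []
    fault_lines = []
    for line in text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.append(
                (
                    match.group("ring"),
                    int(match.group("signaled")),
                    int(match.group("emitted")),
                )
            )
        if FAULT_MARKER in line:
            fault_lines.append(line.strip())
    return {"timeouts": timeouts, "fault_lines": fault_lines}
```

Calling summarize_dmesg(SAMPLE) reports the sdma0 timeout and an empty fault_lines list, which corresponds to the situation described here: a timeout and reset, but no fault address to compare.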