https://bugs.freedesktop.org/show_bug.cgi?id=102322
--- Comment #60 from Andrey Grodzovsky andrey.grodzovsky@amd.com ---
(In reply to dwagner from comment #58)
Here comes another trace log, with your info2.patch applied.
Something must have changed since the last test, as it took pretty long this time to reproduce the crash. Could that have been caused by https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8 now being part of the kernel?
Don't think it's related. This code is more related to virtualization.
However, the latest trace, attached below, is not much different from the last one; xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you:
[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=475104, emitted seq=475106
[ 1510.023117] [drm] GPU recovery disabled.
That just means you are again running with GPU VM update mode set to use SDMA, which is seen in your dmesg (amdgpu.vm_update_mode=0), so you are again experiencing the original SDMA hang issue. Please use amdgpu.vm_update_mode=3 to get back to the VM_FAULT issue.
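To make it easier to tell which of the two issues a given log shows, the check above can be sketched in Python. This is only a rough sketch: the function names `parse_timeout` and `vm_update_mode` are made up for illustration, the regular expression assumes the timeout line has exactly the format quoted above, and the command-line string is assumed to be whatever the kernel was booted with (for example the contents of /proc/cmdline).

```python
import re

# Matches the job-timeout message in the format quoted above (assumed stable).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeout(line):
    """Return ring name and fence sequence numbers from a timeout line, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return {
        "ring": m.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Number of emitted fences that had not signaled when the timeout fired.
        "pending": emitted - signaled,
    }


def vm_update_mode(cmdline):
    """Return the amdgpu.vm_update_mode value from a kernel command line, or None."""
    m = re.search(r"(?:^|\s)amdgpu\.vm_update_mode=(-?\d+)", cmdline)
    return int(m.group(1)) if m else None
```

If `parse_timeout` reports ring sdma0 and `vm_update_mode` returns 0, the log most likely shows the SDMA hang again rather than the VM_FAULT case.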
amdgpu_cs:0-806 [012] .... 1787.493126: amdgpu_vm_bo_cs: soffs=00001001a0, eoffs=00001001b9, flags=70
amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: soffs=0000100200, eoffs=00001021e0, flags=70
amdgpu_cs:0-806 [012] .... 1787.493127: amdgpu_vm_bo_cs: soffs=0000102200, eoffs=00001041e0, flags=70
amdgpu_cs:0-806 [012] .... 1787.493129: amdgpu_vm_bo_cs: soffs=000010c1e0, eoffs=000010c2e1, flags=70
amdgpu_cs:0-806 [012] .... 1787.493131: drm_sched_job: entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job count:8, hw job count:0
And later in the file you can find:
crash detected!
executing umr -O halt_waves -wa
No active waves!
executing umr -O verbose -R gfx[.]
polaris11.gfx.rptr == 512
polaris11.gfx.wptr == 512
polaris11.gfx.drv_wptr == 512
polaris11.gfx.ring[ 481] == 0xffff1000 ...
polaris11.gfx.ring[ 482] == 0xffff1000 ...
polaris11.gfx.ring[ 483] == 0xffff1000 ...
polaris11.gfx.ring[ 484] == 0xffff1000 ...
polaris11.gfx.ring[ 485] == 0xffff1000 ...
polaris11.gfx.ring[ 486] == 0xffff1000 ...
polaris11.gfx.ring[ 487] == 0xffff1000 ...
polaris11.gfx.ring[ 488] == 0xffff1000 ...
polaris11.gfx.ring[ 489] == 0xffff1000 ...
polaris11.gfx.ring[ 490] == 0xffff1000 ...
polaris11.gfx.ring[ 491] == 0xffff1000 ...
polaris11.gfx.ring[ 492] == 0xffff1000 ...
polaris11.gfx.ring[ 493] == 0xffff1000 ...
polaris11.gfx.ring[ 494] == 0xffff1000 ...
polaris11.gfx.ring[ 495] == 0xffff1000 ...
polaris11.gfx.ring[ 496] == 0xffff1000 ...
polaris11.gfx.ring[ 497] == 0xffff1000 ...
polaris11.gfx.ring[ 498] == 0xffff1000 ...
polaris11.gfx.ring[ 499] == 0xffff1000 ...
polaris11.gfx.ring[ 500] == 0xffff1000 ...
polaris11.gfx.ring[ 501] == 0xffff1000 ...
polaris11.gfx.ring[ 502] == 0xffff1000 ...
polaris11.gfx.ring[ 503] == 0xffff1000 ...
polaris11.gfx.ring[ 504] == 0xffff1000 ...
polaris11.gfx.ring[ 505] == 0xffff1000 ...
polaris11.gfx.ring[ 506] == 0xffff1000 ...
polaris11.gfx.ring[ 507] == 0xffff1000 ...
polaris11.gfx.ring[ 508] == 0xffff1000 ...
polaris11.gfx.ring[ 509] == 0xffff1000 ...
polaris11.gfx.ring[ 510] == 0xffff1000 ...
polaris11.gfx.ring[ 511] == 0xffff1000 ...
polaris11.gfx.ring[ 512] == 0xc0032200 rwD
trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
done after crash.
So even without GPU reset, still no "waves". And the error message also does not state any VM fault address.
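
For what it's worth, the ring pointers in the umr dump above can be compared mechanically. The following is a minimal sketch, assuming the `rptr` / `wptr` / `drv_wptr` lines keep the format shown in the dump and assuming that equal read and write pointers mean the gfx ring has consumed everything that was submitted to it. The helper names `ring_pointers` and `ring_is_drained` are made up for illustration and are not part of umr.

```python
import re

# Matches e.g. "polaris11.gfx.rptr == 512" (decimal values, as in the dump above).
PTR_RE = re.compile(r"\.gfx\.(rptr|wptr|drv_wptr) == (\d+)")


def ring_pointers(dump):
    """Collect the gfx ring pointer values found in a umr ring dump."""
    return {name: int(value) for name, value in PTR_RE.findall(dump)}


def ring_is_drained(dump):
    """True if rptr, wptr and drv_wptr are all present and all equal."""
    pointers = ring_pointers(dump)
    return len(pointers) == 3 and len(set(pointers.values())) == 1
```

For the dump above all three pointers are 512, so under these assumptions the gfx ring looks drained, which would fit with umr reporting no active waves.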