https://bugs.freedesktop.org/show_bug.cgi?id=102322
--- Comment #19 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- Can you use addr2line or gdb with 'list' command to give the line number matching (In reply to dwagner from comment #18)
The good news: So far no crashes during normal uptime with amdgpu.vm_update_mode=3
The bad news: System crashes immediately upon S3 resume (with messages quite different from the ones I saw with earlier S3-resume crashes) - I filed bug report https://bugs.freedesktop.org/show_bug.cgi?id=107065 on this.
(In reply to Andrey Grodzovsky from comment #17)
dwagner, this is obviously just a work around and not a fix. It points to some problem with SDMA packets, if you want to continue exploring we can try to dump some fence traces and SDMA HW ring content to examine the latest packets before the hang happened.
If you can include some debug output into "amd-staging-drm-next" that helps finding the root cause, I might be able to provide some output - if the kernel survives long enough after the crash to write the system journal - this has not always been the case.
No need to recompile, just need to see what is the content of SDMA ring buffer when the hang occurs.
Clone and build our register analyzer from here - https://cgit.freedesktop.org/amd/umr/ and once the hang happens just run
sudo umr -lb sudo umr -R gfx[.] sudo umr -R sdma0[.] sudo umr -R sdma1[.]
I will probably need more info later but let's try this first.