[Bug 102322] System crashes after "[drm] IP block:gmc_v8_0 is hung!" / [drm] IP block:sdma_v3_0 is hung!

22 Aug 2018


      https://bugs.freedesktop.org/show_bug.cgi?id=102322
--- Comment #60 from Andrey Grodzovsky andrey.grodzovsky@amd.com ---
(In reply to dwagner from comment #58)
...
Here comes another trace log, with your info2.patch applied.
Something must have changed since the last test, as it took pretty long this
time to reproduce the crash. Could that have been caused by
https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c?h=amd-staging-drm-next&id=b385925f3922faca7435e50e31380bb2602fd6b8
now being part of the kernel?
Don't think it's related. This code is more related to virtualization.
...
However, the latest trace, attached below, is not much different from
the last one; xzcat /tmp/gpu_debug5.txt.xz | grep '^\[' will tell you:
[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=475104, emitted seq=475106
[ 1510.023117] [drm] GPU recovery disabled.
That just means you are again running with the GPU VM update mode set to use
SDMA, as seen in your dmesg (amdgpu.vm_update_mode=0), so you are again
experiencing the original SDMA hang issue. Please use
amdgpu.vm_update_mode=3 to get back to the VM_FAULT issue.
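
The reasoning above (an sdma0 ring timeout combined with vm_update_mode=0) can be illustrated with a small sketch. It assumes the timeout message has exactly the shape quoted in this thread, and that vm_update_mode is a bitmask where bit 0 selects CPU updates for graphics and bit 1 for compute, as the amdgpu module parameter description suggests. The function names are hypothetical helpers, not part of any tool.

```python
import re

# Matches the job-timeout message in the form quoted in this thread.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeout(line):
    """Return ring name and fence sequence numbers, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return {
        "ring": m.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Fences emitted to the ring that never signaled.
        "pending": emitted - signaled,
    }


def describe_vm_update_mode(mode):
    """Assumption: bit 0 = graphics via CPU, bit 1 = compute via CPU.

    Where the bit is clear, page table updates are taken to go through SDMA.
    """
    return {
        "graphics": "CPU" if mode & 0x1 else "SDMA",
        "compute": "CPU" if mode & 0x2 else "SDMA",
    }


if __name__ == "__main__":
    line = ("[ 1510.023112] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 "
            "timeout, signaled seq=475104, emitted seq=475106")
    print(parse_timeout(line))
    print(describe_vm_update_mode(0))
    print(describe_vm_update_mode(3))
```

With mode 0 both graphics and compute page table updates would go through SDMA, which matches the hang being reported on the sdma0 ring; with mode 3 neither would.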
...
     amdgpu_cs:0-806   [012] ....  1787.493126: amdgpu_vm_bo_cs: soffs=00001001a0, eoffs=00001001b9, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000100200, eoffs=00001021e0, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493127: amdgpu_vm_bo_cs: soffs=0000102200, eoffs=00001041e0, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493129: amdgpu_vm_bo_cs: soffs=000010c1e0, eoffs=000010c2e1, flags=70
     amdgpu_cs:0-806   [012] ....  1787.493131: drm_sched_job: entity=00000000406345a7, id=10239, fence=000000007a120377, ring=gfx, job count:8, hw job count:0
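
For anyone wanting to sift through a trace like the one above, here is a rough sketch of a parser for these event lines. It assumes each event sits on a single line (mail-wrapped lines rejoined first) and only knows the amdgpu_vm_bo_cs payload format shown here. The units of soffs and eoffs are not stated in this thread, so the sketch reports only their raw difference. All names are hypothetical.

```python
import re

# task-pid  [cpu]  flags  timestamp: event: payload
EVENT_RE = re.compile(
    r"^\s*(?P<task>\S+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+\S+\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<payload>.*)$"
)

# Payload of amdgpu_vm_bo_cs as printed in the trace; values are hexadecimal.
VM_BO_CS_RE = re.compile(
    r"soffs=(?P<soffs>[0-9a-fA-F]+), eoffs=(?P<eoffs>[0-9a-fA-F]+), "
    r"flags=(?P<flags>[0-9a-fA-F]+)"
)


def parse_event(line):
    """Split one trace line into task, pid, cpu, timestamp, event, payload."""
    m = EVENT_RE.match(line)
    if m is None:
        return None
    return {
        "task": m.group("task"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        "ts": float(m.group("ts")),
        "event": m.group("event"),
        "payload": m.group("payload"),
    }


def parse_vm_bo_cs(payload):
    """Decode the hex fields of an amdgpu_vm_bo_cs payload."""
    m = VM_BO_CS_RE.search(payload)
    if m is None:
        return None
    soffs = int(m.group("soffs"), 16)
    eoffs = int(m.group("eoffs"), 16)
    return {
        "soffs": soffs,
        "eoffs": eoffs,
        "flags": int(m.group("flags"), 16),
        # Raw difference only; units are not given in the thread.
        "delta": eoffs - soffs,
    }


if __name__ == "__main__":
    line = ("     amdgpu_cs:0-806   [012] ....  1787.493126: amdgpu_vm_bo_cs: "
            "soffs=00001001a0, eoffs=00001001b9, flags=70")
    ev = parse_event(line)
    print(ev["event"], parse_vm_bo_cs(ev["payload"]))
```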
And later in the file you can find:
crash detected!
executing umr -O halt_waves -wa
No active waves!
executing umr -O verbose -R gfx[.]
polaris11.gfx.rptr == 512
polaris11.gfx.wptr == 512
polaris11.gfx.drv_wptr == 512
polaris11.gfx.ring[ 481] == 0xffff1000    ... 
polaris11.gfx.ring[ 482] == 0xffff1000    ... 
polaris11.gfx.ring[ 483] == 0xffff1000    ... 
polaris11.gfx.ring[ 484] == 0xffff1000    ... 
polaris11.gfx.ring[ 485] == 0xffff1000    ... 
polaris11.gfx.ring[ 486] == 0xffff1000    ... 
polaris11.gfx.ring[ 487] == 0xffff1000    ... 
polaris11.gfx.ring[ 488] == 0xffff1000    ... 
polaris11.gfx.ring[ 489] == 0xffff1000    ... 
polaris11.gfx.ring[ 490] == 0xffff1000    ... 
polaris11.gfx.ring[ 491] == 0xffff1000    ... 
polaris11.gfx.ring[ 492] == 0xffff1000    ... 
polaris11.gfx.ring[ 493] == 0xffff1000    ... 
polaris11.gfx.ring[ 494] == 0xffff1000    ... 
polaris11.gfx.ring[ 495] == 0xffff1000    ... 
polaris11.gfx.ring[ 496] == 0xffff1000    ... 
polaris11.gfx.ring[ 497] == 0xffff1000    ... 
polaris11.gfx.ring[ 498] == 0xffff1000    ... 
polaris11.gfx.ring[ 499] == 0xffff1000    ... 
polaris11.gfx.ring[ 500] == 0xffff1000    ... 
polaris11.gfx.ring[ 501] == 0xffff1000    ... 
polaris11.gfx.ring[ 502] == 0xffff1000    ... 
polaris11.gfx.ring[ 503] == 0xffff1000    ... 
polaris11.gfx.ring[ 504] == 0xffff1000    ... 
polaris11.gfx.ring[ 505] == 0xffff1000    ... 
polaris11.gfx.ring[ 506] == 0xffff1000    ... 
polaris11.gfx.ring[ 507] == 0xffff1000    ... 
polaris11.gfx.ring[ 508] == 0xffff1000    ... 
polaris11.gfx.ring[ 509] == 0xffff1000    ... 
polaris11.gfx.ring[ 510] == 0xffff1000    ... 
polaris11.gfx.ring[ 511] == 0xffff1000    ... 
polaris11.gfx.ring[ 512] == 0xc0032200    rwD
trying to get ADR from dmesg output for 'umr -O verbose -vm ...'
trying to get VMID from dmesg output for 'umr -O verbose -vm ...'
done after crash.
So even without GPU reset, still no "waves". And the error message also does
not state any VM fault address.
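
As an aid to reading the umr ring dump above, here is a minimal sketch that extracts the ring pointers and decodes the two distinct ring words. It assumes the PM4 type-3 header layout used by the amdgpu driver's PACKET3 macro (type in bits 31:30, count in bits 29:16, opcode in bits 15:8) and names only the two opcodes that occur in this dump. The helper names are hypothetical, and treating rptr == wptr as "nothing left to fetch" is the usual ring-buffer reading, not something the thread states.

```python
import re

# Only the opcodes that appear in the dump above; values as in the amdgpu headers.
PM4_OPCODES = {0x10: "NOP", 0x22: "COND_EXEC"}

# Matches lines such as "polaris11.gfx.rptr == 512".
PTR_RE = re.compile(r"^\S+\.(rptr|wptr|drv_wptr) == (\d+)")


def decode_pm4_header(word):
    """Decode a 32-bit PM4 header; only type-3 packets are broken down."""
    ptype = (word >> 30) & 0x3
    if ptype != 3:
        return {"type": ptype}
    opcode = (word >> 8) & 0xFF
    return {
        "type": 3,
        "opcode": opcode,
        "count": (word >> 16) & 0x3FFF,
        "name": PM4_OPCODES.get(opcode, "unknown"),
    }


def ring_pointers(lines):
    """Collect rptr, wptr and drv_wptr from umr ring output lines."""
    ptrs = {}
    for line in lines:
        m = PTR_RE.match(line.strip())
        if m:
            ptrs[m.group(1)] = int(m.group(2))
    return ptrs


def ring_is_drained(ptrs):
    """True when the read pointer has caught up with the write pointer."""
    return ptrs["rptr"] == ptrs["wptr"]


if __name__ == "__main__":
    dump = [
        "polaris11.gfx.rptr == 512",
        "polaris11.gfx.wptr == 512",
        "polaris11.gfx.drv_wptr == 512",
    ]
    ptrs = ring_pointers(dump)
    print(ptrs, "drained:", ring_is_drained(ptrs))
    for word in (0xFFFF1000, 0xC0032200):
        print(hex(word), decode_pm4_header(word))
```

Under these assumptions 0xffff1000 decodes as a NOP packet and 0xc0032200 as COND_EXEC, and equal rptr and wptr would suggest the gfx ring had no unfetched commands at the time of the dump.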
