https://bugs.freedesktop.org/show_bug.cgi?id=111551
Bug ID: 111551
Summary: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Product: DRI
Version: XOrg git
Hardware: ARM
OS: Linux (All)
Status: NEW
Severity: major
Priority: not set
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: 78666679@qq.com
The amdgpu (Polaris10, WX 5100) DRM driver sometimes reports:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
and many threads end up in the disk sleep state.
kernel version: 4.19.36
mesa: 18.3.6
yanhua 78666679@qq.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |78666679@qq.com
--- Comment #1 from yanhua 78666679@qq.com ---
Created attachment 145253
  --> https://bugs.freedesktop.org/attachment.cgi?id=145253&action=edit
dmesg output
Running grep drm dmesg.txt shows the sdma1 ring timeouts.
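As a rough sketch of how the attached log could be filtered beyond a plain grep, the following Python snippet extracts the ring name and sequence numbers from timeout lines. The pattern is modelled only on the single message quoted in this report, so other kernel versions may format it differently; the function name find_ring_timeouts and the "pending" value (emitted minus signaled, read here as the number of fences submitted but not yet signaled) are illustrative assumptions, not part of the driver or of any AMD tool.

```python
import re

# Pattern modelled on the one log line quoted in this bug report (assumption:
# other kernel versions may word the message differently).
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(dmesg_text):
    """Return (ring, signaled, emitted, pending) for each timeout line.

    'pending' is emitted - signaled; interpreting it as the number of
    outstanding fences is an assumption made for this sketch.
    """
    results = []
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            signaled = int(match.group("signaled"))
            emitted = int(match.group("emitted"))
            results.append(
                (match.group("ring"), signaled, emitted, emitted - signaled)
            )
    return results


if __name__ == "__main__":
    # The line quoted in the original report.
    sample = (
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, "
        "signaled seq=24423862, emitted seq=24423865"
    )
    for ring, signaled, emitted, pending in find_ring_timeouts(sample):
        print(f"{ring}: signaled={signaled} emitted={emitted} pending={pending}")
```

For real use, the text of dmesg.txt would be read from the file and passed to the function; grouping the results by ring would show whether sdma1 is the only ring affected.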
--- Comment #2 from yanhua 78666679@qq.com ---
Created attachment 145260
  --> https://bugs.freedesktop.org/attachment.cgi?id=145260&action=edit
Some messages in the previous dmesg.txt had been overwritten. dmesg-full.txt contains more information.
--- Comment #3 from Christian König christian.koenig@amd.com ---
As far as I can see this is a really large box with multiple GPUs installed.
The SDMA rarely locks up, especially not while executing page table updates. So there is most likely something wrong with the hardware here.
Are you sure that the power supply is large enough for that system?
What system/platform is that? Could this be a coherency problem?
--- Comment #4 from yanhua 78666679@qq.com ---
I have asked the hardware team; they have tested and are sure there is no power supply problem.
The system is arm64 with 64 cores, and there are three amdgpu cards on the board.
We occasionally see gfx, sdma, and vce ring timeouts. When a ring timeout occurs, we can use umr, the tool supplied by AMD, to read the chip registers. Can we determine the real cause from the register values?
As for the coherency problem you mentioned, I think that if it were the cause, the problem would occur more frequently, but I'm not sure.
Christian König christian.koenig@amd.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |INVALID
Status|NEW |RESOLVED
--- Comment #5 from Christian König christian.koenig@amd.com ---
amdgpu was known not to work on arm64 until very recently.
So it is not a surprise that this isn't working. Please switch to a newer kernel and re-test.
Apart from that there isn't much we can do about it.
--- Comment #6 from yanhua 78666679@qq.com ---
As far as I know, arm64 does not support WC memory, and we have already changed the WC flag the way newer kernel versions do.