https://bugs.freedesktop.org/show_bug.cgi?id=111808
Bug ID: 111808
Summary: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout cause process into Disk sleep state
Product: DRI
Version: DRI git
Hardware: ARM
OS: Linux (All)
Status: NEW
Severity: major
Priority: not set
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: liansz@fzcyjh.com
Created attachment 145507 --> https://bugs.freedesktop.org/attachment.cgi?id=145507&action=edit timeoutlog
We are running into gfx ring timeouts. We currently use kernel 4.19.36, with some GPU-related patches merged from the community. Each server has multiple GPUs, and each GPU runs several rendering programs. We see two different failure cases.

Case 1: One graphics card in a server fails, but the rendering programs do not enter the D (uninterruptible sleep) state. Reading /sys/kernel/debug/dri/1/amdgpu_test_ib reports error code 110 (ETIMEDOUT) on the first test and passes on a second test. See tmp-618-2.zip for details.

Case 2: One graphics card in a server fails, and all rendering programs running on that server fail and enter the D state. The failure occurs in drm_release. See tmp-619.zip for details.

Could you please help us out?
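
For anyone trying to tell the two cases apart on an affected server, here is a minimal sketch of listing the processes currently in D state. It assumes the standard Linux /proc/<pid>/stat layout. The function names `parse_state` and `d_state_processes` are hypothetical helpers made up for this illustration; they are not part of amdgpu or of the attached logs.

```python
import os


def parse_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    left, _, rest = stat_text.rpartition(")")
    comm = left.split("(", 1)[1]
    state = rest.split()[0]
    return comm, state


def d_state_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes whose state is 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_state(f.read())
        except (OSError, IndexError, ValueError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((int(entry), comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

On a server in the second failure case, this would be expected to list the rendering processes; in the first case it should not. Where the kernel exposes it, reading /proc/<pid>/stack as root for a listed PID shows where the task is blocked.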