https://bugs.freedesktop.org/show_bug.cgi?id=109649
Bug ID: 109649
Summary: [raven] gfx ring timeout when running clover apps
Product: DRI
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: jv356@scarletmail.rutgers.edu
This is a regression in 4.20.x; the same userspace works fine on 4.19. I could bisect, but this is my main machine, so I can't easily dedicate the time. Any hint would be appreciated. The kernel is booted with iommu=soft: the full IOMMU hangs on boot, and disabling the IOMMU disables the Wi-Fi.
Dmesg:
[ 702.207054] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1340, emitted seq=1342
[ 702.207061] [drm] GPU recovery disabled.
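To make the two sequence numbers in the timeout message easier to compare across logs, here is a minimal Python sketch that extracts them. The regular expression and the helper name `parse_ring_timeout` are illustrative assumptions, not part of any amdgpu tooling. The difference between the two values is presumably the number of submissions the ring had not yet completed when the timeout fired.

```python
import re

# Pattern modelled on the dmesg line quoted in this report (an assumption:
# other kernel versions may word the message differently).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted), or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return (
        match.group("ring"),
        int(match.group("signaled")),
        int(match.group("emitted")),
    )


line = (
    "[ 702.207054] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "signaled seq=1340, emitted seq=1342"
)
ring, signaled, emitted = parse_ring_timeout(line)
# Number of emitted-but-not-signaled sequence numbers on this ring.
print(ring, emitted - signaled)
```

For the line above, the gap between emitted and signaled is 2.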
lspci -nn:
05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:15dd] (rev c4)
It's a ThinkPad E485 laptop with:
AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx (family: 0x17, model: 0x11, stepping: 0x0)
Jan Vesely jv356@scarletmail.rutgers.edu changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[raven] gfx ring timeout    |[bisected][raven] gfx ring
                   |when running clover apps    |timeout when running clover
                   |                            |apps
                 CC|                            |christian.koenig@amd.com
--- Comment #1 from Jan Vesely jv356@scarletmail.rutgers.edu ---
Bisection shows that the first bad commit is:
commit 09b6f25b55d9c66af7302e1f09ad90aa5b1dfbcb (HEAD, refs/bisect/bad)
Author: Christian König christian.koenig@amd.com
Date: Wed Aug 15 14:04:47 2018 +0200
drm/amdgpu: fix VM size reporting on Raven
Raven doesn't have an VCE block and so also no buggy VCE firmware.
Signed-off-by: Christian König christian.koenig@amd.com
Reviewed-by: Alex Deucher alexander.deucher@amd.com
Reviewed-by: Huang Rui ray.huang@amd.com
Acked-by: Chunming Zhou david1.zhou@amd.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
I guess there is some other buggy firmware or limitation?
# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 40, firmware version: 0x00000099
PFP feature version: 40, firmware version: 0x000000ae
CE feature version: 40, firmware version: 0x0000004d
RLC feature version: 1, firmware version: 0x0000d237
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 40, firmware version: 0x0000018b
MEC2 feature version: 40, firmware version: 0x0000018b
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x0017ba78
SMC feature version: 0, firmware version: 0x00001e49
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x01004912
VBIOS version: 113-RAVEN-106
--- Comment #2 from Jan Vesely jv356@scarletmail.rutgers.edu ---
I've confirmed that reverting the change on top of 4.20.13 fixes the issue.
--- Comment #3 from Jan Vesely jv356@scarletmail.rutgers.edu ---
The bug is still present in 5.0.0-rc8.
--- Comment #4 from Jan Vesely jv356@scarletmail.rutgers.edu ---
The issue appears fixed with new firmware, but now the laptop won't suspend.
# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 40, firmware version: 0x00000099
PFP feature version: 40, firmware version: 0x000000ae
CE feature version: 40, firmware version: 0x0000004d
RLC feature version: 1, firmware version: 0x0000d237
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 40, firmware version: 0x0000018b
MEC2 feature version: 40, firmware version: 0x0000018b
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x0017ba78
SMC feature version: 0, firmware version: 0x00001e49
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x01004912
DMCU feature version: 0, firmware version: 0x00000001
VBIOS version: 113-RAVEN-106
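Comparing two of these dumps by eye is error-prone, so the following is a small Python sketch that parses the text into a dictionary and reports entries that differ. The helper names `parse_firmware_info` and `diff_firmware` are hypothetical, and the regular expression assumes the "NAME feature version: N, firmware version: 0x..." layout shown above. The sample strings are a shortened excerpt of the two dumps in this report.

```python
import re

# One entry looks like: "RLC SRLC feature version: 1, firmware version: 0x00000001"
ENTRY_RE = re.compile(
    r"(?P<name>[A-Z0-9 ]+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map block name -> (feature version, firmware version)."""
    return {
        m.group("name").strip(): (int(m.group("feature")), int(m.group("fw"), 16))
        for m in ENTRY_RE.finditer(text)
    }


def diff_firmware(old, new):
    """Return {name: (old_entry, new_entry)} for entries that differ."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


# Shortened excerpts of the two dumps quoted in this report.
old_dump = (
    "RLC feature version: 1, firmware version: 0x0000d237\n"
    "SDMA0 feature version: 41, firmware version: 0x000000a9\n"
    "VCN feature version: 0, firmware version: 0x01004912\n"
)
new_dump = old_dump + "DMCU feature version: 0, firmware version: 0x00000001\n"

print(diff_firmware(parse_firmware_info(old_dump), parse_firmware_info(new_dump)))
```

On these excerpts the only reported difference is the added DMCU entry; the RLC version reads the same in both dumps.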
--- Comment #5 from Jan Vesely jv356@scarletmail.rutgers.edu ---
Since the debugfs output does not show a firmware difference, here's the change in files:
$ diff old_fw new_fw
8,9c8
- e2ddb912bf242e3b1b4219b36a19bff7 /lib/firmware/amdgpu/raven2_rlc.bin
- 27168d5b60ef396926a2aa0e2da00a97 /lib/firmware/amdgpu/raven2_sdma1.bin
---
+ 4ac07f88b9c4aa4fe026be87cb16ceda /lib/firmware/amdgpu/raven2_rlc.bin
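A checksum listing like the one diffed above can also be produced and compared programmatically. The sketch below is only an illustration: the helper names `md5_manifest` and `compare_manifests` and the `raven*.bin` glob are assumptions, and the two dictionaries at the end simply restate the checksums from the diff.

```python
import hashlib
from pathlib import Path


def md5_manifest(directory, pattern="raven*.bin"):
    """Map file name -> MD5 hex digest for matching files in a directory."""
    return {
        path.name: hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(Path(directory).glob(pattern))
    }


def compare_manifests(old, new):
    """Return (removed, added, changed) file names between two manifests."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(name for name in set(old) & set(new) if old[name] != new[name])
    return removed, added, changed


# Checksums restated from the diff in this comment.
old_fw = {
    "raven2_rlc.bin": "e2ddb912bf242e3b1b4219b36a19bff7",
    "raven2_sdma1.bin": "27168d5b60ef396926a2aa0e2da00a97",
}
new_fw = {
    "raven2_rlc.bin": "4ac07f88b9c4aa4fe026be87cb16ceda",
}

removed, added, changed = compare_manifests(old_fw, new_fw)
print("removed:", removed)
print("added:", added)
print("changed:", changed)
```

On a live system, `md5_manifest("/lib/firmware/amdgpu")` could be called before and after a firmware update to build the two dictionaries, assuming the files are stored uncompressed.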
(In reply to Jan Vesely from comment #4)
> The issue appears fixed with new firmware, but now the laptop won't suspend.
The same workaround as before fixes the suspend/resume issue.
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:709
+ vm_size = min(vm_size, 1ULL << 40);
--- Comment #6 from Jan Vesely jv356@scarletmail.rutgers.edu ---
I managed to get the IOMMU working by passing "amd_iommu=pt ivrs_ioapic[32]=00:14.0" on the kernel command line. Now it's back to square one: all clover kernels hang the GPU unless I limit the VM size with 'vm_size = min(vm_size, 1ULL << 40);'. Otherwise the machine works (including 3D graphics and suspend/resume).
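For readers unfamiliar with the workaround, the following Python sketch only illustrates the arithmetic of the clamp: `1ULL << 40` is 1 TiB, so any larger reported VM size is capped at that value. The 48-bit input is a hypothetical figure chosen for illustration, not a value taken from this report, and the function is not the kernel code itself.

```python
VM_SIZE_CAP = 1 << 40  # 1 TiB, the same limit as 1ULL << 40 in the workaround


def clamp_vm_size(vm_size):
    """Mirror of 'vm_size = min(vm_size, 1ULL << 40)' for illustration only."""
    return min(vm_size, VM_SIZE_CAP)


# Hypothetical unclamped size (assumption: a 48-bit address space).
hypothetical_vm_size = 1 << 48

clamped = clamp_vm_size(hypothetical_vm_size)
print(clamped // (1 << 30), "GiB")
```

Sizes at or below the cap pass through unchanged, so the workaround only affects configurations that would report more than 1 TiB.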
--- Comment #7 from Jan Vesely jv356@scarletmail.rutgers.edu ---
The workaround is still necessary in kernel 5.1.0. The failure mode is a bit different: it hangs just the application, not the entire machine.
Martin Peres martin.peres@free.fr changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|NEW                         |RESOLVED

--- Comment #8 from Martin Peres martin.peres@free.fr ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/698.