https://bugs.freedesktop.org/show_bug.cgi?id=107898
Bug ID: 107898
Summary: "kfd: Failed to resume IOMMU for device 1002:15dd" on Raven Ridge
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: major
Priority: medium
Component: DRM/amdkfd
Assignee: dri-devel@lists.freedesktop.org
Reporter: marvin.damschen@gullz.de
Created attachment 141520 --> https://bugs.freedesktop.org/attachment.cgi?id=141520&action=edit dmesg 4.19-rc3
Hey,
I wanted to try the newly added Raven Ridge support in amdkfd, but initialization fails with "kfd: Failed to resume IOMMU for device 1002:15dd" on an AMD Ryzen 5 2500U (Lenovo E485) running 4.19-rc3. The IOMMU itself seems to initialize fine (as I understand it, the "AMD-Vi: Unable to write to IOMMU perf counter." message can be ignored). The full log is attached.
Best regards Marvin
Marvin Damschen marvin.damschen@gullz.de changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #141520|text/x-log                  |text/plain
          mime type|                            |
--- Comment #1 from Oded Gabbay oded.gabbay@gmail.com --- Added Felix to CC
--- Comment #2 from Felix Kuehling felix.kuehling@amd.com --- The AMD-Vi messages in the log look OK. I'm seeing the same on my Raven system (Ryzen 5 2400G desktop).
I'm currently running a 4.19-rc1+ kernel from Alex Deucher's drm-next-4.20-wip branch. I haven't tried rc3 from the master branch yet. I'll try it tonight and see if I can reproduce the issue.
--- Comment #3 from Felix Kühling fxkuehl@gmx.de --- I'm not seeing this problem on my Raven system with 4.19-rc3+ (git describe reports v4.19-rc3-21-g5e335542de83).
The most likely explanation is that on your system IOMMUv2 is not enabled. That may be a BIOS setting. If your system BIOS setup doesn't allow you to enable the IOMMUv2, then you may be out of luck. I'll attach a patch that adds some extra error messages that should confirm that or point to a different source of the problem.
--- Comment #4 from Felix Kühling fxkuehl@gmx.de --- Created attachment 141532 --> https://bugs.freedesktop.org/attachment.cgi?id=141532&action=edit Add iommu init instrumentation
--- Comment #5 from Marvin Damschen marvin.damschen@gullz.de --- Output with patch applied:
Sep 12 12:08:20 zen kernel: kfd kfd: Allocated 3969056 bytes on gart
Sep 12 12:08:20 zen kernel: Topology: Add APU node [0x15dd:0x1002]
Sep 12 12:08:20 zen kernel: Failed to attache to group
Sep 12 12:08:20 zen kernel: amd_iommu_init_device failed: -22
Sep 12 12:08:20 zen kernel: kfd kfd: Failed to resume IOMMU for device 1002:15dd
Sep 12 12:08:20 zen kernel: Creating topology SYSFS entries
Sep 12 12:08:20 zen kernel: kfd kfd: device 1002:15dd NOT added due to errors
Full log attached.
Thank you Marvin
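For readers of this thread: the -22 in "amd_iommu_init_device failed: -22" follows the usual kernel convention of returning a negated errno value. A minimal Python sketch for decoding such a code is below; the helper name decode_kernel_errno is made up for illustration and is not part of any kernel or ROCm tooling.

```python
import errno
import os


def decode_kernel_errno(code: int) -> str:
    """Return the symbolic errno name for a (usually negative) kernel return code."""
    # errno.errorcode maps positive errno numbers to names such as "EINVAL".
    return errno.errorcode.get(abs(code), "UNKNOWN")


if __name__ == "__main__":
    code = -22  # value taken from the log above
    print(decode_kernel_errno(code), "-", os.strerror(abs(code)))
```

On Linux, 22 is EINVAL (invalid argument), which matches the explanation later in the thread that the driver was handed configuration it could not accept.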
Marvin Damschen marvin.damschen@gullz.de changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #141520|0                           |1
        is obsolete|                            |
--- Comment #6 from Marvin Damschen marvin.damschen@gullz.de --- Created attachment 141533 --> https://bugs.freedesktop.org/attachment.cgi?id=141533&action=edit dmesg 4.19-rc3 with iommu init instrumentation
--- Comment #7 from Felix Kuehling felix.kuehling@amd.com --- Good timing. We were just given a laptop that has similar problems and found a partial workaround: Try adding "iommu=pt" to your kernel command line. This may at least get you through the KFD initialization, but there are likely more problems down the line.
The problems are due to BIOS bugs. We're looking into more workarounds to ignore or patch incorrect information in the CRAT ACPI table that describes the compute devices for KFD.
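To confirm that the "iommu=pt" workaround actually reached the running kernel, one option is to inspect /proc/cmdline after rebooting. The following is a minimal sketch, assuming a plain "iommu=pt" token (not a comma-separated option list); has_kernel_param is a hypothetical helper name.

```python
from pathlib import Path


def has_kernel_param(cmdline: str, name: str, value: str) -> bool:
    """True if `name=value` appears as a whole token on the kernel command line."""
    # Splitting on whitespace avoids false matches such as "amd_iommu=pt".
    return f"{name}={value}" in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    cmdline = Path("/proc/cmdline").read_text()
    print("iommu=pt active:", has_kernel_param(cmdline, "iommu", "pt"))
```

How the parameter is added in the first place depends on the bootloader and distribution, so that step is deliberately left out here.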
--- Comment #8 from Marvin Damschen marvin.damschen@gullz.de --- KFD initializes without errors using "iommu=pt". I will see whether I can get ROCm running on top of that.
Unfortunately, the BIOS has been terrible so far on the raven-based Lenovo laptops. I am happy to try any patches or workarounds you have, just let me know.
--- Comment #9 from Marvin Damschen marvin.damschen@gullz.de --- ROCm 1.9 runs OpenCL on the GPU on top of mainline kfd and seems stable. However:
- The CPU is not detected as a compute device (rocminfo attached).
- Performance, at least in darktable, is quite low: the "bench.SRW" benchmark takes more than 3 times longer in OpenCL on the GPU than on the CPU without OpenCL. The problem could be that memory buffers are too small; clinfo reports "Max memory allocation 268435456 (256MiB)", which seems quite small to me.
Are these problems a result of incorrect information in CRAT?
Best regards Marvin
--- Comment #10 from Marvin Damschen marvin.damschen@gullz.de --- Created attachment 141657 --> https://bugs.freedesktop.org/attachment.cgi?id=141657&action=edit ROCm 1.9 info on 4.19-rc4
--- Comment #11 from Felix Kuehling felix.kuehling@amd.com --- rocminfo reports both the CPU and the GPU.
If OpenCL can't use the CPU as a compute device, that's probably a limitation of the OpenCL implementation.
The max memory allocation size is strange. rocminfo reports a single 16GB memory pool attached to the CPU. That's system memory from the CRAT table and looks reasonable. It should be possible to use at least 3/8 of that with the upstream KFD. If clinfo is reporting something different, I'm wondering whether it's an OpenCL limitation rather than a ROCm limitation.
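To put the numbers side by side, here is a small Python calculation. It assumes the 16GB pool means 16 GiB and uses the clinfo value quoted in comment #9; note that the two figures measure different things (usable pool size versus largest single allocation), so this is only a rough comparison.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

system_memory = 16 * GIB                  # pool size from rocminfo, read as 16 GiB (assumption)
expected_usable = system_memory * 3 // 8  # the "at least 3/8" figure from the comment above
clinfo_max_alloc = 268435456              # "Max memory allocation" reported by clinfo

print(expected_usable // GIB, "GiB expected to be usable")
print(clinfo_max_alloc // MIB, "MiB max single allocation per clinfo")
print(expected_usable // clinfo_max_alloc, "times smaller than the expected pool")
```

Under these assumptions the expected usable pool is 6 GiB, 24 times the 256 MiB that clinfo reports as the maximum allocation.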
If you're interested in the raw information reported by KFD to user mode, check out /sys/class/kfd/kfd/topology/nodes. On an APU there should be only one node (0). Underneath that you'll find node properties as well as memory properties that may be interesting.
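The topology files can also be read programmatically. The sketch below assumes each properties file contains one "name value" pair per line and that memory banks expose a size_in_bytes field; both are assumptions about the sysfs layout rather than something stated in this thread, and parse_properties is a hypothetical helper.

```python
from pathlib import Path

TOPOLOGY_ROOT = Path("/sys/class/kfd/kfd/topology/nodes")


def parse_properties(text: str) -> dict[str, int]:
    """Parse 'name value' lines into a dict, skipping malformed or non-integer lines."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            props[parts[0]] = int(parts[1])
    return props


if __name__ == "__main__":
    # Walk every memory bank of every node; an APU is expected to have only node 0.
    for path in sorted(TOPOLOGY_ROOT.glob("*/mem_banks/*/properties")):
        props = parse_properties(path.read_text())
        # size_in_bytes is an assumed field name; .get() returns None if it is absent.
        print(path, props.get("size_in_bytes"))
```

Reading these files may require elevated permissions, and the glob simply matches nothing if the kfd module is not loaded.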
Marvin Damschen marvin.damschen@gullz.de changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |NOTOURBUG
--- Comment #12 from Marvin Damschen marvin.damschen@gullz.de --- Thanks a lot for the info. /sys/class/kfd/kfd/topology/nodes/0/mem_banks/0/properties correctly reports 16GB of RAM. As the issues seem to come from BIOS/OpenCL (not from kfd) and kfd successfully initializes with "iommu=pt", I will close this bug report as resolved.
Best regards Marvin
Chí-Thanh Christopher Nguyễn chithanh@gentoo.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|NOTOURBUG                   |---
--- Comment #13 from Chí-Thanh Christopher Nguyễn chithanh@gentoo.org --- I have the same issue on Dell Latitude 5495 with Linux kernel 4.19.1 and iommu=pt is a workaround here too.
But as AMD is working around other BIOS bugs[1] (rather than getting them fixed quickly with their business partners), I think this bug report should be left open for now.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
Martin Peres martin.peres@free.fr changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|REOPENED                    |RESOLVED
--- Comment #14 from Martin Peres martin.peres@free.fr --- -- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/4.