https://bugzilla.kernel.org/show_bug.cgi?id=214859
Bug ID: 214859
Summary: drm-amdgpu-init-iommu~fd-device-init.patch introduce bug
Product: Drivers
Version: 2.5
Kernel Version: 5.14.15
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: towo@siduction.org
Regression: No
After commit d60096b3b2c2..cd8cc7d31b49 100644 (drm-amdgpu-init-iommu~fd-device-init.patch),
X fails to start properly with kernel 5.14.15 on most Ryzen notebooks. There is a long delay before X starts, and dmesg is spammed with failure messages like
Okt 28 10:28:08 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:28:21 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Okt 28 10:28:34 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:28:47 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Okt 28 10:29:01 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:29:14 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Okt 28 10:29:27 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
and/or
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:128 vmid:0 pasid:0, for process pid 0 thread pid 0)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: in page starting at address 0x0000000000872000 from IH client 0x1b (UTCL2)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00040D00
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: Faulty UTCL2 client ID: CPG (0x6)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: MORE_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: WALKER_ERROR: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: MAPPING_ERROR: 0x1
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: RW: 0x1
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:128 vmid:0 pasid:0, for process pid 0 thread pid 0)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: in page starting at address 0x0000000000872000 from IH client 0x1b (UTCL2)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00040D00
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: Faulty UTCL2 client ID: CPG (0x6)
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: MORE_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: WALKER_ERROR: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: MAPPING_ERROR: 0x1
Okt 28 10:29:40 kernel: amdgpu 0000:04:00.0: amdgpu: RW: 0x1
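For anyone comparing fault dumps from different machines, the raw VM_L2_PROTECTION_FAULT_STATUS value can be split into the fields the driver prints after it. The following is a minimal Python sketch, not driver code. The bit layout in FIELDS is an assumption chosen to reproduce the decoded values in the log above, and the names FIELDS and decode_fault_status are made up for illustration.

```python
# Assumed (shift, width) layout of VM_L2_PROTECTION_FAULT_STATUS.
# This layout is an assumption that matches the decoded fields in the
# log above; check the register headers for your ASIC before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # faulty UTCL2 client ID
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Extract each bit field from the raw status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Value taken from the dmesg output in this report.
    for name, value in decode_fault_status(0x00040D00).items():
        print(f"{name}: {hex(value)}")
```

With the value from this report, the sketch yields MAPPING_ERROR 0x1, RW 0x1 and client ID 0x6, which agrees with the lines the kernel logged.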
After reverting that commit, the kernel works normally again. Here are the related reports from our users (ignore the nvidia posts):
https://forum.siduction.org/index.php?topic=8439.0
--- Comment #1 from Sebastian Dalfuß (sd@sedf.de) ---
I can confirm this for a "04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c2)".
--- Comment #2 from towo@siduction.org ---
The relevant commit is 714d9e4574d54596973ee3b0624ee4a16264d700
towo@siduction.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.14.15                     |5.14.15, 5.15.0
         Regression|No                          |Yes
--- Comment #3 from towo@siduction.org ---
Additional info: after installing the kernel from a working system, the first boot with that kernel works flawlessly. On rebooting with that kernel, the boot hangs for a long time; then the desktop starts, but the system is not really usable. None of these problems occur after reverting 714d9e4574d54596973ee3b0624ee4a16264d700.
Alex Deucher (alexdeucher@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com
--- Comment #4 from Alex Deucher (alexdeucher@gmail.com) ---
I think this patch set should address the issue:
https://patchwork.freedesktop.org/series/96508/
James Zhu (jamesz@amd.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamesz@amd.com
--- Comment #5 from James Zhu (jamesz@amd.com) ---
Created attachment 299413
  --> https://bugzilla.kernel.org/attachment.cgi?id=299413&action=edit
patch to fix
I suggest upgrading to 5.15rc7, applying this patch, and then testing.
--- Comment #6 from James Zhu (jamesz@amd.com) ---
Created attachment 299437
  --> https://bugzilla.kernel.org/attachment.cgi?id=299437&action=edit
analysis for this issue
Linux 5.14.15 + afd1818 fixes the issue.
In Linux 5.15rc7, re-applying "init iommu after amdkfd device init" and "move iommu_resume before ip init/resume", which overwrote afd1818, caused the issue again.
714d9e4 drm/amdgpu: init iommu after amdkfd device init
f02abeb drm/amdgpu: move iommu_resume before ip init/resume
afd1818 drm/amdkfd: fix boot failure when iommu is disabled in Picasso.
286826d drm/amdgpu: init iommu after amdkfd device init
9cec53c drm/amdgpu: move iommu_resume before ip init/resume
--- Comment #7 from towo@siduction.org ---
With Linux 5.14.17-rc1 and 5.15.1-rc1 the problem is gone, so I think this bug is resolved.
spasswolf@web.de changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spasswolf@web.de
--- Comment #8 from spasswolf@web.de ---
*** Bug 214901 has been marked as a duplicate of this bug. ***