https://bugzilla.kernel.org/show_bug.cgi?id=213391
Bug ID: 213391
Summary: AMDGPU retries page fault with some specific processes amdgpu: [gfxhub0] retry page fault until *ERROR* ring gfx timeout, but soft recovered
Product: Drivers
Version: 2.5
Kernel Version: Linux 5.12.9-arch-1-1
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: low
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: samy@lahfa.xyz
Regression: No
Hi,
I recently updated from mainline kernel 5.11.16 to 5.12.9 and have run into this issue. I also updated the Mesa driver from mesa-git (21.1.0_devel.137307.f8e5f945b8f-1) to mesa-git (21.2.0_devel.140633.c04f20e7e01-1).
Current kernel parameters: /vmlinuz-linux zfs=zroot/ROOT/default rw loglevel=3 quiet radeon.si_support=0 amdgpu.si_support=1 radeon.cik_support=0 amdgpu.cik_support=1
My computer is a ThinkPad T495 laptop (AMD Ryzen 7 PRO 3700U with a Radeon RX Vega 10 iGPU, 16GB DDR4 3200MHz). One important detail is that the BIOS can reserve up to 2GB of DDR4 RAM as iGPU VRAM; I currently have it set to 1GB (1024MB). I suspect the page fault retries could be linked to this in some way.
I think this is more likely to happen when RAM is under heavy load and the system is swapping a lot. (I have 12.3GB of swap on an NVMe PCIe 3.0 drive.)
At present I cannot reproduce this issue consistently. It has happened mostly with the Qutebrowser web browser and only once with Chromium. The Chromium incident crashed the X11 server and the computer froze completely; the kernel was still responsive to SysRq keys, so I could get out of that situation safely.
I'll be uploading logs of both crashes, along with lspci output and other log files that could be useful.
Kind regards,
Lahfa Samy
--- Comment #1 from Lahfa Samy (samy@lahfa.xyz) ---
Created attachment 297287
  --> https://bugzilla.kernel.org/attachment.cgi?id=297287&action=edit
dmesg-chromium-amdgpu-retry-page-fault
The dmesg also shows the laptop finishing entry into a sleep state and then resuming from it. A USB-C dock with screens attached was connected to the laptop, but the errors happened both with the dock plugged in and with it unplugged.
Lahfa Samy (samy@lahfa.xyz) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |samy@lahfa.xyz
--- Comment #2 from Lahfa Samy (samy@lahfa.xyz) ---
Created attachment 297291
  --> https://bugzilla.kernel.org/attachment.cgi?id=297291&action=edit
journalctl-amdgpu-qutebrowser-page-retry
This time there was no gfx timeout, so the X11 server did not freeze, and I didn't notice the retry page faults until I ran dmesg.
The log starts with "irq 7: nobody cared (try booting with the "irqpoll" option)" followed by a call trace. This is a known and reported bug that has not affected my computer's functionality in any way since I acquired it.
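A quick way to check whether a saved log contains the two messages named in the bug summary is to count matching lines. The following is a minimal Python sketch, not a tool used in this thread. It matches only the message fragments quoted in the summary, since the full kernel log line format is not shown here, and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Fragments copied from the bug summary. The surrounding log line format
# (timestamps, device prefixes) is not shown in the thread, so it is ignored.
PATTERNS = {
    "retry_page_fault": re.compile(r"\[gfxhub0\] retry page fault"),
    "ring_gfx_timeout": re.compile(r"\*ERROR\* ring gfx timeout"),
}


def count_amdgpu_events(log_text: str) -> Counter:
    """Count the log lines that contain each message fragment."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Hypothetical sample lines, written for illustration only.
SAMPLE_LOG = """\
amdgpu: [gfxhub0] retry page fault
amdgpu: [gfxhub0] retry page fault
*ERROR* ring gfx timeout, but soft recovered
some unrelated log line
"""

if __name__ == "__main__":
    print(dict(count_amdgpu_events(SAMPLE_LOG)))
```

To use it on a real system, the text passed to count_amdgpu_events could come from a file saved from dmesg or journalctl output.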
Lahfa Samy (samy@lahfa.xyz) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|AMDGPU retries page fault   |AMDGPU retries page fault
                   |with some specific          |with some specific
                   |processes amdgpu: [gfxhub0] |processes amdgpu and
                   |retry page fault until      |sometimes [gfxhub0] retry
                   |*ERROR* ring gfx timeout,   |page fault until *ERROR*
                   |but soft recovered          |ring gfx timeout, but soft
                   |                            |recovered
Lahfa Samy (samy@lahfa.xyz) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|AMDGPU retries page fault   |AMDGPU retries page fault
                   |with some specific          |with some specific
                   |processes amdgpu and        |processes amdgpu and
                   |sometimes [gfxhub0] retry   |sometimes followed
                   |page fault until *ERROR*    |[gfxhub0] retry page fault
                   |ring gfx timeout, but soft  |until *ERROR* ring gfx
                   |recovered                   |timeout, but soft recovered
Nirmoy (nirmoy.aiemd@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nirmoy.aiemd@gmail.com
--- Comment #3 from Nirmoy (nirmoy.aiemd@gmail.com) ---
How much VRAM do you have? I can't seem to find that in the dmesg. We recently fixed a similar issue with https://patchwork.freedesktop.org/patch/437369/. Could you try this patch out?
--- Comment #4 from Lahfa Samy (samy@lahfa.xyz) ---
I have about 1GB of VRAM currently set according to glxinfo:
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon(TM) Vega 10 Graphics (RAVEN, DRM 3.40.0, 5.12.9-arch1-1, LLVM 12.0.0) (0x15d8)
    Version: 21.2.0
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 42 MB, largest block: 42 MB
    VBO free aux. memory - total: 2442 MB, largest block: 2442 MB
    Texture free memory - total: 42 MB, largest block: 42 MB
    Texture free aux. memory - total: 2442 MB, largest block: 2442 MB
    Renderbuffer free memory - total: 42 MB, largest block: 42 MB
    Renderbuffer free aux. memory - total: 2442 MB, largest block: 2442 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 1024 MB
    Total available memory: 4096 MB
    Currently available dedicated video memory: 42 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon(TM) Vega 10 Graphics (RAVEN, DRM 3.40.0, 5.12.9-arch1-1, LLVM 12.0.0)
How would I go about testing a patch? (I probably need to rebuild the Linux kernel with the patch and boot with it, right?) I found this link, but it says that the information in it is probably deprecated: https://www.kernel.org/doc/html/v5.12/process/applying-patches.html
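The GL_NVX_gpu_memory_info figures in the glxinfo output above give a rough picture of how full the dedicated VRAM was when the output was captured. The sketch below, in Python, extracts the two relevant fields and computes the fraction in use. The function name is hypothetical, and whether this snapshot has anything to do with the page faults is not established in the thread.

```python
import re
from typing import Tuple


def dedicated_vram_usage(glxinfo_text: str) -> Tuple[int, int, float]:
    """Return (total_mb, free_mb, used_fraction) from glxinfo text.

    Only the two GL_NVX_gpu_memory_info fields shown in comment #4 are read.
    """
    total = re.search(r"Dedicated video memory:\s*(\d+)\s*MB", glxinfo_text)
    free = re.search(
        r"Currently available dedicated video memory:\s*(\d+)\s*MB",
        glxinfo_text,
    )
    if total is None or free is None:
        raise ValueError("GL_NVX_gpu_memory_info fields not found")
    total_mb = int(total.group(1))
    free_mb = int(free.group(1))
    return total_mb, free_mb, (total_mb - free_mb) / total_mb


# The three lines below are copied from the glxinfo output in comment #4.
SAMPLE = """\
Dedicated video memory: 1024 MB
Total available memory: 4096 MB
Currently available dedicated video memory: 42 MB
"""

if __name__ == "__main__":
    total_mb, free_mb, used = dedicated_vram_usage(SAMPLE)
    print(f"{total_mb - free_mb} of {total_mb} MB in use ({used:.1%})")
```

With the values from comment #4, 982 of the 1024 MB of dedicated video memory were in use at the time.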
--- Comment #5 from Nirmoy (nirmoy.aiemd@gmail.com) ---
Please let me know which distro you are using, then I can prepare a complete guide.
--- Comment #6 from Lahfa Samy (samy@lahfa.xyz) ---
I'm on Arch Linux with the ZFS module (I can't boot or mount the root/home "partition" without it). Thanks for taking the time to write this guide; I'll do my best to test the patch in any way I can.
--- Comment #7 from Nirmoy (nirmoy.aiemd@gmail.com) ---
Actually, I was wrong: I checked out v5.12.9-arch1 from Arch and realized the fix I mentioned before isn't valid.
--- Comment #8 from Lahfa Samy (samy@lahfa.xyz) ---
In the meantime, I'll try to find a way to reproduce this issue reliably. If you plan to write a patch for it, I would be glad to help with testing in order to squash this bug.
--- Comment #9 from Michel Dänzer (michel@daenzer.net) ---
If you can, reverting to an older version of the files under /lib/firmware/amdgpu/ may avoid the hangs.
--- Comment #10 from dimitris@gmail.com ---
Seeing the same thing on a T495 running Fedora 33 and Wayland, typically involving Firefox: https://bugzilla.redhat.com/show_bug.cgi?id=1966384
Would it be possible for me to try that patch?
--- Comment #11 from Lahfa Samy (samy@lahfa.xyz) ---
Hi Dimitris, what is your current kernel version under Fedora (the output of "uname --kernel-release" in a terminal)? I cannot try the patch that was mentioned; however, I haven't run into the issue again, and I haven't had the time to put my RAM under heavy load.
--- Comment #12 from dimitris@gmail.com ---
Hi, I've seen this under 5.12.10-200.fc33.x86_64, two incidents hours apart. Earlier I had a number of incidents under 5.12.9.
In all of my cases I was using Firefox "heavily": creating tabs and using graphics-heavy pages.
--- Comment #13 from Nirmoy (nirmoy.aiemd@gmail.com) ---
Hi Dimitris and Lahfa, please try Michel's suggestion.
--- Comment #14 from Dominic Letz (dominic.letz@berlin.de) ---
Having the same issue on an E495 with kernel 5.12.9. I will try to downgrade /lib/firmware/amdgpu. Any hint as to which git tag you would consider safe?
--- Comment #15 from Michel Dänzer (michel@daenzer.net) ---
(In reply to Dominic Letz from comment #14)
> Having the same issue on an E495 with kernel 5.12.9. I will try to downgrade /lib/firmware/amdgpu. Any hint as to which git tag you would consider safe?
20210315 seems to work fine here (on an E595).
--- Comment #16 from Dominic Letz (dominic.letz@berlin.de) ---
(In reply to Michel Dänzer from comment #15)
> (In reply to Dominic Letz from comment #14)
> > Having the same issue on an E495 with kernel 5.12.9. I will try to downgrade /lib/firmware/amdgpu. Any hint as to which git tag you would consider safe?
> 20210315 seems to work fine here (on an E595).
+1 trying that
Leandro Jacques (lsrzj@yahoo.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lsrzj@yahoo.com
--- Comment #17 from Leandro Jacques (lsrzj@yahoo.com) ---
Created attachment 297413
  --> https://bugzilla.kernel.org/attachment.cgi?id=297413&action=edit
Crash log for kernel 5.12.10
I've been having issues with amdgpu since kernel 5.10. I had to downgrade to 5.4 LTS to get rid of them.
--- Comment #18 from Leandro Jacques (lsrzj@yahoo.com) ---
Created attachment 297467
  --> https://bugzilla.kernel.org/attachment.cgi?id=297467&action=edit
amdgpu crash log for kernel 5.4.126
Before 5.4.126 I had no issues at all. I'm downgrading to 5.4.123 to check whether the problem goes away.
--- Comment #19 from dimitris@gmail.com ---
I've also just replaced /lib/firmware/amdgpu with the `20210315` version; I'll see how this goes. Currently running Fedora kernel 5.12.11-200.fc33.x86_64 on a T495.
Question: don't I also need to update the initrd? `lsinitrd` shows that all the amdgpu modules are included in the initrd image. Or is the firmware reloaded once root is mounted?
--- Comment #20 from Michel Dänzer (michel@daenzer.net) ---
(In reply to dimitris from comment #19)
> Question, don't I also need to update the initrd?
Yes you do, if it didn't happen automatically.
--- Comment #21 from Dominic Letz (dominic.letz@berlin.de) ---
I've been running 20210315 since the 16th and it has been stable so far, versus multiple freezes a day before.
--- Comment #22 from dimitris@gmail.com ---
I also updated the initrd to 20210315 and ran under 5.12.11-200.fc33 for a day or so without issues. Now running 5.12.12-200.fc33; we'll see how it goes.
For reference, what's the best way to check the active/loaded firmware? I don't see anything obvious in dmesg or lspci -vv.
--- Comment #23 from Michel Dänzer (michel@daenzer.net) ---
/sys/kernel/debug/dri/0/amdgpu_firmware_info has all the info.
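The contents of that debugfs file can be turned into a small table for comparing firmware versions before and after a downgrade. The Python sketch below assumes each relevant line has the shape "<NAME> feature version: <number>, firmware version: 0x<hex>"; that layout is an assumption, not something shown in this thread. Lines that do not match are skipped, and the sample values are made up for illustration.

```python
import re
from typing import Dict, Tuple

# Assumed line shape (not confirmed by the thread):
#   "<NAME> feature version: <int>, firmware version: 0x<hex>"
LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)


def parse_firmware_info(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each firmware block name to (feature_version, firmware_version)."""
    result: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match.group("name")] = (
                int(match.group("feature")),
                int(match.group("fw"), 16),
            )
    return result


# Hypothetical sample text; the names and numbers are illustrative only.
SAMPLE = """\
ME feature version: 40, firmware version: 0x000000a2
PFP feature version: 40, firmware version: 0x000000be
a line in some other format
"""

if __name__ == "__main__":
    for name, (feature, fw) in parse_firmware_info(SAMPLE).items():
        print(f"{name}: feature {feature}, firmware {fw:#010x}")
```

On a real system the text would be read from /sys/kernel/debug/dri/0/amdgpu_firmware_info, which usually requires root and a mounted debugfs.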
--- Comment #24 from Leandro Jacques (lsrzj@yahoo.com) ---
Created attachment 297557
  --> https://bugzilla.kernel.org/attachment.cgi?id=297557&action=edit
Firmware info
The downgrade to kernel 5.4.123 didn't have any effect; I hit the same bug. I'm now attaching my firmware version information.
--- Comment #25 from Leandro Jacques (lsrzj@yahoo.com) ---
(In reply to Dominic Letz from comment #21)
I'm trying the same linux-firmware version, 20210315. Let's see how it goes.
--- Comment #26 from Lahfa Samy (samy@lahfa.xyz) ---
Created attachment 297669
  --> https://bugzilla.kernel.org/attachment.cgi?id=297669&action=edit
amdgpu-xorg-page-faults-screen-blackout-when-memory-heavily-used
Here are more logs. When I triggered the bug again on the 5.12.10-arch1-1 kernel on Arch Linux, the computer didn't freeze like before; it just stopped displaying anything (Xorg was affected, so I guess that's why). I'm using this version of the linux-firmware package on Arch: linux-firmware-20210511.7685cf4-1
I have not yet tested with a downgraded linux-firmware package; I may try this soon if the issue affects me too frequently.
--- Comment #27 from Lahfa Samy (samy@lahfa.xyz) ---
Created attachment 297671
  --> https://bugzilla.kernel.org/attachment.cgi?id=297671&action=edit
Firmware information for a T495 with an AMD Vega RX 10
Here again is my linux-firmware package version (as reported by pacman, from the Arch Linux core repository): 20210511.7685cf4-1
--- Comment #28 from Leandro Jacques (lsrzj@yahoo.com) ---
(In reply to Leandro Jacques from comment #25)
No problems so far. So the problem is with newer firmware versions: I've been running without any issues since 2021-06-21 19:26:28 UTC with version 20210315.
--- Comment #29 from Leandro Jacques (lsrzj@yahoo.com) ---
How do I file a bug with the linux-firmware project for the amdgpu driver? After the downgrade I haven't experienced any more issues.
--- Comment #30 from Leandro Jacques (lsrzj@yahoo.com) ---
(In reply to Dominic Letz from comment #21)
I did what you suggested and have had no issues since. It was a linux-firmware package problem, not a kernel driver problem.
--- Comment #31 from Lahfa Samy (samy@lahfa.xyz) ---
I have just hit the same error even after downgrading; the currently installed package is linux-firmware 20210315.3568f96-3.
The computer froze for a few seconds, and the logs show many retry page faults from the amdgpu driver.
I'm on Arch Linux and will attach the output of `modinfo amdgpu`. It seems that downgrading linux-firmware on my distro wasn't enough to downgrade the AMDGPU driver.
--- Comment #32 from Lahfa Samy (samy@lahfa.xyz) ---
Created attachment 297781
  --> https://bugzilla.kernel.org/attachment.cgi?id=297781&action=edit
Archlinux-part-of-modinfo-amdgpu
I think my kernel is using the latest amdgpu driver that comes with 5.12.13-arch1-2 and not the version that comes with the linux-firmware package. Please correct me if I'm mistaken.
--- Comment #33 from Leandro Jacques (lsrzj@yahoo.com) ---
Created attachment 297881
  --> https://bugzilla.kernel.org/attachment.cgi?id=297881&action=edit
Kernel crash log for linux firmware version 20210511.7685cf4
amdgpu kernel crash log from when the problem occurred, with the exact same page fault message.
--- Comment #34 from Leandro Jacques (lsrzj@yahoo.com) ---
Created attachment 297883
  --> https://bugzilla.kernel.org/attachment.cgi?id=297883&action=edit
Linux Firmware version info 20210511.7685cf4
Firmware info as of the moment when the system crashed.
mcmarius@gmx.net changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mcmarius@gmx.net
--- Comment #35 from mcmarius@gmx.net ---
I have a Lenovo L340 and the same problem.
Here is the complete dmesg log:
https://gist.github.com/McMarius11/36c8d21a2dcaf5c2289c91a74af4f7fb
Operating System: Manjaro Linux
KDE Plasma Version: 5.22.4
KDE Frameworks Version: 5.84.0
Qt Version: 5.15.2
Kernel Version: 5.11.22-2-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 7 3700U with Radeon Vega Mobile Gfx
Memory: 5,6 GiB of RAM
Graphics Processor: AMD Radeon™ Vega 10 Graphics
--- Comment #36 from Lahfa Samy (samy@lahfa.xyz) ---
Has anyone tested whether this has been fixed in newer firmware updates, or should we stay on version 20210315.3568f96-3?
--- Comment #37 from Michel Dänzer (michel@daenzer.net) ---
(In reply to Lahfa Samy from comment #36)
> Did anyone test whether this has been fixed in newer firmware updates, or should we still stay on version 20210315.3568f96-3 ?
It's fixed in upstream linux-firmware 20210818.
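An installed linux-firmware package version can be compared against that release with a simple date check. The Python sketch below assumes that the package version begins with the eight-digit upstream date tag, as in the Arch versions quoted in this thread, and that a package dated on or after 20210818 contains the upstream fix. Both are assumptions about distro packaging, and the function names are hypothetical.

```python
import re

# Upstream release reported as fixed in comment #37.
FIXED_RELEASE = 20210818


def release_date(pkg_version: str) -> int:
    """Extract the leading YYYYMMDD tag from a linux-firmware version string."""
    match = re.match(r"(\d{8})(?!\d)", pkg_version)
    if match is None:
        raise ValueError(f"no leading date tag in {pkg_version!r}")
    return int(match.group(1))


def has_reported_fix(pkg_version: str) -> bool:
    """True if the package is dated on or after the release reported as fixed.

    Assumption: a later date tag means the package includes that release.
    """
    return release_date(pkg_version) >= FIXED_RELEASE


if __name__ == "__main__":
    # Version strings quoted in this thread, plus the fixed upstream tag.
    for version in ("20210315.3568f96-3", "20210511.7685cf4-1", "20210818"):
        print(version, has_reported_fix(version))
```

Since YYYYMMDD tags sort chronologically as integers, a plain numeric comparison is enough here.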
Lahfa Samy (samy@lahfa.xyz) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |UNREPRODUCIBLE