https://bugzilla.kernel.org/show_bug.cgi?id=213561
Bug ID: 213561
Summary: [bisected] AMD GPU can no longer idle state after commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3
Product: Drivers
Version: 2.5
Kernel Version: 5.13rc7
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: untaintableangel@hotmail.co.uk
Regression: No
Nature of the problem: RX 5700 is unable to enter low power state at idle (see below for usual behaviour)
Sensors at idle prior to the commit:
amdgpu-pci-0f00
Adapter: PCI adapter
vddgfx:      775.00 mV
fan1:        0 RPM (min = 0 RPM, max = 3200 RPM)
edge:        +48.0°C (crit = +100.0°C, hyst = -273.1°C) (emerg = +105.0°C)
junction:    +48.0°C (crit = +110.0°C, hyst = -273.1°C) (emerg = +115.0°C)
mem:         +52.0°C (crit = +105.0°C, hyst = -273.1°C) (emerg = +110.0°C)
power1:      8.00 W (cap = 165.00 W)
After the commit, the lowest is:
amdgpu-pci-0f00
Adapter: PCI adapter
vddgfx:      1.03 V
fan1:        0 RPM (min = 0 RPM, max = 3200 RPM)
edge:        +54.0°C (crit = +100.0°C, hyst = -273.1°C) (emerg = +105.0°C)
junction:    +56.0°C (crit = +110.0°C, hyst = -273.1°C) (emerg = +115.0°C)
mem:         +52.0°C (crit = +105.0°C, hyst = -273.1°C) (emerg = +110.0°C)
power1:      31.00 W (cap = 165.00 W)
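For anyone wanting to compare their own readings, the two `sensors` dumps above can be reduced to the voltage and power figures with a few lines of Python. This is only a sketch: the function name `parse_sensors` is made up for illustration, and it assumes the `label: value unit` layout shown in this report.

```python
import re

def parse_sensors(text):
    """Extract vddgfx (in volts) and power1 (in watts) from `sensors` output."""
    readings = {}
    # vddgfx is printed in mV at idle and in V at higher states
    m = re.search(r"vddgfx:\s*([\d.]+)\s*(mV|V)", text)
    if m:
        value = float(m.group(1))
        readings["vddgfx_v"] = value / 1000 if m.group(2) == "mV" else value
    m = re.search(r"power1:\s*([\d.]+)\s*W", text)
    if m:
        readings["power1_w"] = float(m.group(1))
    return readings

# Values copied from the two dumps in this report
before = parse_sensors("vddgfx: 775.00 mV fan1: 0 RPM power1: 8.00 W (cap = 165.00 W)")
after = parse_sensors("vddgfx: 1.03 V fan1: 0 RPM power1: 31.00 W (cap = 165.00 W)")
print(after["power1_w"] - before["power1_w"])  # extra idle power draw in watts
```

With the figures from this report, the difference is 23 W of additional draw at idle.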
This problem wasn't present in rc6 but is present in 5.13rc7 and bisects to:
1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3 is the first bad commit
commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date:   Thu Jun 10 10:10:07 2021 +0800
drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell.
If GC has entered CGPG, ringing doorbell > first page doesn't wakeup GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround this issue.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
The device is a Sapphire Pulse RX5700 and this problem is seen even with one monitor set at 60Hz.
GPU: 0f:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev c4)
Linux_Chemist (untaintableangel@hotmail.co.uk) changed:
What       |Removed |Added
----------------------------------------------------------------------------
Regression |No      |Yes
Linux_Chemist (untaintableangel@hotmail.co.uk) changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Summary |[bisected] AMD GPU can no   |[bisected] AMD GPU can no
        |longer idle state after     |longer enter idle state
        |commit                      |after commit
        |1c0b0efd148d5b24c4932ddb3fa |1c0b0efd148d5b24c4932ddb3fa
        |03c8edd6097b3               |03c8edd6097b3
Linux_Chemist (untaintableangel@hotmail.co.uk) changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Summary |[bisected] AMD GPU can no   |[bisected][regression] AMD
        |longer enter idle state     |GPU can no longer enter
        |after commit                |idle state after commit
        |1c0b0efd148d5b24c4932ddb3fa |
        |03c8edd6097b3               |
matoro (matoro@airmail.cc) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |matoro@airmail.cc
--- Comment #1 from matoro (matoro@airmail.cc) --- Hi, I am confirming this issue in 5.12.13 on my Colorful Red Devil RX 5700XT. Because of the OC profile it was consuming almost 100W continuously and heated up to nearly 90°C before I realized what was happening and reverted to 5.12.12.
My card: 0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev c1)
--- Comment #2 from tgn-ff (tgnff242@gmail.com) --- It's not just locked at the highest clock states; the GPU seems to be under "load": radeontop shows GPU utilisation at nearly 100%. Consequently, performance is terrible.
hagar-dunor@wanadoo.fr changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |hagar-dunor@wanadoo.fr
--- Comment #3 from hagar-dunor@wanadoo.fr --- Same on vega 64
--- Comment #4 from Linux_Chemist (untaintableangel@hotmail.co.uk) --- Yes, it seems this commit was also pushed into 5.12.13 so users with similar hardware (gfx10) may also be experiencing this.
--- Comment #5 from hagar-dunor@wanadoo.fr --- Sorry, I should have provided more info: I have this on 5.10.46.
cat /sys/kernel/debug/dri/0/amdgpu_pm_info reports 100% GPU usage and ~60W at "idle" on 5.10.46, whereas I get 0% GPU usage and ~7W on 5.10.45.
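If debugfs is not mounted, roughly the same two figures can usually be read from the amdgpu sysfs files `gpu_busy_percent` and the hwmon `power1_average` (reported in microwatts). A rough sketch follows; the helper name `read_gpu_idle_stats` is hypothetical, and `card0` being the affected GPU is an assumption that may not hold on multi-GPU systems.

```python
from pathlib import Path

def read_gpu_idle_stats(device_dir):
    """Return (busy_percent, watts) for an amdgpu device directory in sysfs."""
    device = Path(device_dir)
    busy = int((device / "gpu_busy_percent").read_text().strip())
    # hwmon exposes power1_average in microwatts; not every device has it
    power_files = sorted(device.glob("hwmon/hwmon*/power1_average"))
    watts = None
    if power_files:
        watts = int(power_files[0].read_text().strip()) / 1_000_000
    return busy, watts

card = "/sys/class/drm/card0/device"  # assumption: the GPU is card0
if Path(card, "gpu_busy_percent").exists():
    print(read_gpu_idle_stats(card))
```

On an idle desktop, a busy percentage stuck near 100 would match what is described in this comment.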
Linux_Chemist (untaintableangel@hotmail.co.uk) changed:
What           |Removed |Added
----------------------------------------------------------------------------
Kernel Version |5.13rc7 |5.13rc7, 5.12.13
Linux_Chemist (untaintableangel@hotmail.co.uk) changed:
What           |Removed          |Added
----------------------------------------------------------------------------
Kernel Version |5.13rc7, 5.12.13 |5.13rc7, 5.12.13, 5.10.46,
               |                 |5.4.128
--- Comment #6 from Linux_Chemist (untaintableangel@hotmail.co.uk) --- Commit is also present in kernels 5.10.46 and 5.4.128 (stables) dated 23rd June 2021, so updating info for bug report. Should be a few people hitting this if they update to the latest kernels this week.
It is also conceivable that gfx9 devices hit a similar bug through a similar commit (4cbbe34807938e6e494e535a68d5ff64edac3f20 upstream) that is also in all of these kernels, though I don't know and can't test it. If you don't have a gfx10 device and that commit IS causing you an issue (it may not, I'm just guessing based on reports), you should file a separate bug report. To check, build a Linux kernel and do a git bisect.
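For readers who have not bisected before, the idea behind git bisect is a binary search over the commit history between a known-good and a known-bad kernel. The Python below is only an illustration of that search, not of git itself; `first_bad`, the commit names and the `is_bad` check are all hypothetical stand-ins for building, booting and testing each kernel.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is known good, commits[hi] known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid   # the regression is at mid or earlier
        else:
            lo = mid   # the regression is after mid
    return commits[hi]

# Hypothetical history in which "c5" introduces the regression
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
print(first_bad(history, lambda c: int(c[1:]) >= 5))  # prints c5
```

Each step halves the remaining range, which is why even a few hundred commits between two releases need fewer than ten test builds.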
Linux_Chemist (untaintableangel@hotmail.co.uk) changed:
What    |Removed                    |Added
----------------------------------------------------------------------------
Summary |[bisected][regression] AMD |[bisected][regression]
        |GPU can no longer enter    |GFX10 AMDGPUs can no longer
        |idle state after commit    |enter idle state after
        |                           |commit. Commit has been
        |                           |pushed to stable branches
        |                           |too.
--- Comment #7 from Linux_Chemist (untaintableangel@hotmail.co.uk) --- (In reply to hagar-dunor from comment #3)
> Same on vega 64
I think I'm right in saying Vega is gfx9 rather than gfx10 (navi etc), so you may be affected by a similar commit (4cbbe34807938e6e494e535a68d5ff64edac3f20 upstream) and not the specific one I'm filing for.
Are you able to build a Linux kernel and check whether you are being affected by that commit instead? At any rate, could you file a similar new bug report ("for gfx9 devices") and link it to this one, since the specific commit I've confirmed as causing the problem is not applicable to your device?
--- Comment #8 from hagar-dunor@wanadoo.fr --- Thanks for pointing to a different commit. I don't really have the time currently to revert a specific commit and try it out; pointing out that the problem appears between two consecutive kernel versions should be enough, TBH, for the author to know what this is about.
I don't mind filing another bug if you insist, but it would be nice to have the dev show up here and state whether that's necessary. The problem might not affect the same hwid, but it's basically identical, and I wouldn't be surprised if the dev decided a new bug was a duplicate.
Alan Swanson (reiver@improbability.net) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |reiver@improbability.net
--- Comment #9 from Alan Swanson (reiver@improbability.net) --- These patches have just been reverted for 5.13-rc8 and the reverts should hopefully be backported to stable.
https://lists.freedesktop.org/archives/amd-gfx/2021-June/065575.html
https://lists.freedesktop.org/archives/dri-devel/2021-June/312755.html
Linux_Chemist (untaintableangel@hotmail.co.uk) changed:
What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |---     |CODE_FIX
--- Comment #10 from Linux_Chemist (untaintableangel@hotmail.co.uk) --- Thank you :) I'll mark this as resolved since the problem is known and code has been reverted ready for the next kernels.
Marco Scardovi (marco@scardovi.com) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |marco@scardovi.com
--- Comment #11 from Marco Scardovi (marco@scardovi.com) --- Hi everyone, I'm facing same issue here on kernel 5.12.13 with the AMD 3200U in an HP-15s laptop. Can you confirm these commits will fix for iGPU too?
--- Comment #12 from Linux_Chemist (untaintableangel@hotmail.co.uk) --- (In reply to Marco Scardovi from comment #11)
> Hi everyone, I'm facing same issue here on kernel 5.12.13 with the AMD 3200U
> in an HP-15s laptop. Can you confirm these commits will fix for iGPU too?
Hi Marco, it should do if it's the same issue. Your choice of actions are to either:
1) Downgrade to or use kernel 5.12.12 (I don't know which distro you're using, but it should be available somewhere).
2) Build your own kernel from mainline (currently latest version is 5.13 final)
3) Wait until kernel 5.12.14 or later is available for you (at this time, I don't think it's been released yet).
4) Download and run a kernel from a 3rd party source that doesn't contain these commits.
As you're on a laptop (and thus probably on battery power), I would just pick an earlier kernel for now (option 1). If you've got grub for your bootloader (for example), just install an earlier kernel (or use another one if there's one installed), by choosing it at grub's menu when you boot up, then once you're logged in and confirm you're not on 5.12.13 (confirm with the command 'uname -a'), remove/uninstall 5.12.13 and then return things to how you like it.
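As a convenience, the check against `uname` can be scripted. The sketch below compares the running kernel release with the versions this report lists as containing the commit; the function name `is_affected` is made up, and the assumption that a mainline 5.13rc7 build reports itself as `5.13.0-rc7` (with distro suffixes appended after a dash) should be verified against your own `uname -r` output.

```python
import platform

# Versions named in this report as containing the commit.
# Assumption: 5.13rc7 shows up as "5.13.0-rc7" in the release string.
AFFECTED = ("5.13.0-rc7", "5.12.13", "5.10.46", "5.4.128")

def is_affected(release):
    """True if a `uname -r` style release string matches an affected kernel."""
    for version in AFFECTED:
        # Accept an exact match or a distro suffix such as "-gentoo"
        if release == version or release.startswith(version + "-"):
            return True
    return False

print(platform.release(), is_affected(platform.release()))
```

A result of `False` only means the release string is not one of the four listed here; a distro kernel may still carry the commit as a backport.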
--- Comment #13 from Marco Scardovi (marco@scardovi.com) --- (In reply to Linux_Chemist from comment #12)
> (In reply to Marco Scardovi from comment #11)
> > Hi everyone, I'm facing same issue here on kernel 5.12.13 with the AMD 3200U
> > in an HP-15s laptop. Can you confirm these commits will fix for iGPU too?
>
> Hi Marco, it should do if it's the same issue. Your choice of actions are to either:
>
> 1) Downgrade to or use kernel 5.12.12 (I don't know which distro you're using, but it should be available somewhere).
> 2) Build your own kernel from mainline (currently latest version is 5.13 final)
> 3) Wait until kernel 5.12.14 or later is available for you (at this time, I don't think it's been released yet).
> 4) Download and run a kernel from a 3rd party source that doesn't contain these commits.
>
> As you're on a laptop (and thus probably on battery power), I would just pick an earlier kernel for now (option 1). If you've got grub for your bootloader (for example), just install an earlier kernel (or use another one if there's one installed), by choosing it at grub's menu when you boot up, then once you're logged in and confirm you're not on 5.12.13 (confirm with the command 'uname -a'), remove/uninstall 5.12.13 and then return things to how you like it.
Hi, and thanks for the answer. I'm using Gentoo and waiting for the 5.13 release (it was released upstream today). I hope this will help, as my laptop is running at 73°C at idle.
--- Comment #14 from Marco Scardovi (marco@scardovi.com) --- I can confirm it is fixed on kernel 5.13 final: 44°C instead of 73°C at idle.
soshial@gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |soshial@gmail.com
--- Comment #15 from soshial@gmail.com --- I have exactly the same problem on my Dell XPS 9575 laptop.
GPU: AMD Polaris 22 XL [Radeon RX Vega M GL]. Kernel: 5.15.12
Several months ago there was no such problem. The amdgpu is always in D0 state and fans are spinning all the time. How may I help to fix the problem?