https://bugzilla.kernel.org/show_bug.cgi?id=29842
Summary: Radeon runs very hot
Product: Drivers
Version: 2.5
Kernel Version: 2.6.38-rc6
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
AssignedTo: drivers_video-dri@kernel-bugs.osdl.org
ReportedBy: psusi@cfl.rr.com
Regression: No
The newer kernels read the temperature of my Radeon card, and report that it is running at around 80 C on an idle desktop. For comparison, my CPU is only 34 C.
--- Comment #1 from Phillip Susi psusi@cfl.rr.com 2011-02-25 14:54:00 --- To make sure something in the desktop wasn't causing it, I booted the kernel with init=/bin/bash and the temperature still rose to 83 C. My guess is this is a defect in the firmware, or the driver interface to it, and it is running in an infinite loop instead of going idle.
So I have been looking over the source code in drivers/gpu/drm/radeon. I see various functions to start/stop/resume/initialize "mc" and "cp". I assume those stand for microcode and control program? What exactly is the difference?
It seems like the GPU is executing a few different microcode kernels that process commands placed into ring buffers. When the ring buffers are empty and the gui is idle, it seems like the GPU is still busily executing an infinite loop checking for work in the ring buffers. Shouldn't the driver detect the idle condition and issue an r600_cp_stop() to halt execution and stop wasting power?
On Fri, Feb 25, 2011 at 5:11 PM, Phillip Susi psusi@cfl.rr.com wrote:
So I have been looking over the source code in drivers/gpu/drm/radeon. I see various functions to start/stop/resume/initialize "mc" and "cp". I assume those stand for microcode and control program? What exactly is the difference?
Memory Controller and Command Processor, I believe.
Matt
On Fre, 2011-02-25 at 12:11 -0500, Phillip Susi wrote:
It seems like the GPU is executing a few different microcode kernels that process commands placed into ring buffers. When the ring buffers are empty and the gui is idle, it seems like the GPU is still busily executing an infinite loop checking for work in the ring buffers.
As has been pointed out by Alex, that's not true to the best of our knowledge.
Shouldn't the driver detect the idle condition and issue an r600_cp_stop() to halt execution and stop wasting power?
Feel free to try it, but I wouldn't expect it to make much if any difference.
Did you check that your card runs significantly cooler in the other OS before starting all this ruckus? (Though even if it does, the lack of clock gating might explain the difference)
On 02/26/2011 03:09 AM, Michel Dänzer wrote:
On Fre, 2011-02-25 at 12:11 -0500, Phillip Susi wrote:
It seems like the GPU is executing a few different microcode kernels that process commands placed into ring buffers. When the ring buffers are empty and the gui is idle, it seems like the GPU is still busily executing an infinite loop checking for work in the ring buffers.
As has been pointed out by Alex, that's not true to the best of our knowledge.
I wonder how that is though. I see nothing in the R600 microcode documentation about a way to halt execution, and it explicitly says it does not support interrupts, so I don't see any way for the CP to avoid busy waiting other than to be explicitly stopped by the driver.
Shouldn't the driver detect the idle condition and issue an r600_cp_stop() to halt execution and stop wasting power?
Feel free to try it, but I wouldn't expect it to make much if any difference.
I tried adding a debugfs file to call it and it didn't seem to make any difference.
Did you check that your card runs significantly cooler in the other OS before starting all this ruckus? (Though even if it does, the lack of clock gating might explain the difference)
I don't even have a working copy of the other OS any more. It is on my old first-gen WD Raptor fakeraid 0, which the new system's BIOS and Windows driver won't recognize. Now that you mention it though, I do think it always tended to run hot there, and I usually underclocked it a bit to try to help. Maybe I just have a poorly designed card with an insufficient heatsink and fan?
It seems like clock gating, while helpful to maximize power savings, should not be needed to stay below critical temperatures when idle.
Strange. This morning it seems to be running at "only" 66 C instead of 80+.
2011/2/26 Phillip Susi psusi@cfl.rr.com:
On 02/26/2011 03:09 AM, Michel Dänzer wrote:
On Fre, 2011-02-25 at 12:11 -0500, Phillip Susi wrote:
It seems like the GPU is executing a few different microcode kernels that process commands placed into ring buffers. When the ring buffers are empty and the gui is idle, it seems like the GPU is still busily executing an infinite loop checking for work in the ring buffers.
As has been pointed out by Alex, that's not true to the best of our knowledge.
I wonder how that is though. I see nothing in the R600 microcode documentation about a way to halt execution, and it explicitly says it does not support interrupts, so I don't see any way for the CP to avoid busy waiting other than to be explicitly stopped by the driver.
Changing the wptr is the trigger that starts the CP. When the rptr and wptr are equal, the CP is idle. See chapter 5 of the r5xx acceleration guide, as I mentioned earlier.
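That idle condition can be observed from userspace. As a rough sketch, assuming the driver exposes the ring pointers through a debugfs file (the path and file name here are assumptions; they vary by kernel version and may not exist at all on older kernels):

```shell
#!/bin/sh
# Hypothetical check: an idle CP shows rptr == wptr for the GFX ring.
# The debugfs path below is an assumption, not confirmed for this kernel.
ring=/sys/kernel/debug/dri/0/radeon_ring_gfx

if [ -r "$ring" ]; then
    # Print the read and write pointers; equal values mean the ring is drained.
    grep -Ei 'rptr|wptr' "$ring"
else
    echo "ring debugfs file not available on this kernel" >&2
fi
```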
Shouldn't the driver detect the idle condition and issue an r600_cp_stop() to halt execution and stop wasting power?
Feel free to try it, but I wouldn't expect it to make much if any difference.
I tried adding a debugfs file to call it and it didn't seem to make any difference.
Did you check that your card runs significantly cooler in the other OS before starting all this ruckus? (Though even if it does, the lack of clock gating might explain the difference)
I don't even have a working copy of the other OS any more. It is on my old first-gen WD Raptor fakeraid 0, which the new system's BIOS and Windows driver won't recognize. Now that you mention it though, I do think it always tended to run hot there, and I usually underclocked it a bit to try to help. Maybe I just have a poorly designed card with an insufficient heatsink and fan?
It's possible. The default clocks are designed to be safe, however, even if the temperature seems a bit high.
It seems like clock gating, while helpful to maximize power savings, should not be needed to stay below critical temperatures when idle.
Strange. This morning it seems to be running at "only" 66 C instead of 80+.
Make sure you clean out the fan and heatsink if a lot of dust has built up in there.
Alex
Andrew Morton akpm@linux-foundation.org changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                CC|         |akpm@linux-foundation.org
        Regression|No       |Yes
--- Comment #2 from Andrew Morton akpm@linux-foundation.org 2011-03-01 00:40:31 --- What kernel versions were OK? 2.6.37?
Phillip Susi psusi@cfl.rr.com changed:
           What    |Removed |Added
----------------------------------------------------------------------------
        Regression|Yes      |No
--- Comment #3 from Phillip Susi psusi@cfl.rr.com 2011-03-01 03:19:07 --- It isn't a regression. Older kernels did not report the temperature at all. I noticed an odd pattern today as well. After a cold boot, the temperature runs up to 80+, but after suspending and resuming, it remains around 66.
Andrew Morton akpm@linux-foundation.org changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                CC|         |alexdeucher@gmail.com
--- Comment #4 from Andrew Morton akpm@linux-foundation.org 2011-03-01 03:43:47 --- Is this the same as bug #29572?
--- Comment #5 from Phillip Susi psusi@cfl.rr.com 2011-03-01 03:49:19 --- No.
Rafael J. Wysocki rjw@sisk.pl changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                CC|         |florian@mickler.org,
                  |         |maciej.rutecki@gmail.com,
                  |         |rjw@sisk.pl
            Blocks|         |27352
        Regression|No       |Yes
--- Comment #6 from Alex Deucher alexdeucher@gmail.com 2011-03-01 21:06:17 --- Rafael, why is this marked as a regression? The reporter explicitly stated it was not.
Rafael J. Wysocki rjw@sisk.pl changed:
           What    |Removed                    |Added
----------------------------------------------------------------------------
                CC|florian@mickler.org,        |
                  |maciej.rutecki@gmail.com    |
            Blocks|27352                       |
        Regression|Yes                         |No
--- Comment #7 from Rafael J. Wysocki rjw@sisk.pl 2011-03-01 21:09:18 --- Presumably by mistake. Sorry.
Mike Meehan mjmeehan@gmail.com changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                CC|         |mjmeehan@gmail.com
--- Comment #8 from Mike Meehan mjmeehan@gmail.com 2011-04-12 01:22:03 --- My system is also affected; I noticed it after upgrading to Ubuntu Natty. Kernel version 2.6.38-8.41-generic. Under no load my video card is reporting 82 degrees Celsius. The graphics card is an ATI Technologies Inc Barts PRO [ATI Radeon HD 6800 Series]. I'm using the radeon kernel module with the radeondrmfb frame buffer device.
I think it's related to putting the console in framebuffer mode; the card is quiet in text mode.
--- Comment #9 from Mike Meehan mjmeehan@gmail.com 2011-04-12 02:26:49 ---
# echo low > /sys/class/drm/card0/device/power_profile
"resolves" the issue. Default power management settings for KMS put the card in high performance mode on AC power.

# echo dynpm > /sys/class/drm/card0/device/power_method
Dynamic frequency scaling may work for you, though the screen flashes when power levels change. Still seems to run too hot. I'm sticking to low for most purposes.
This page was very helpful: https://wiki.archlinux.org/index.php?title=ATI&oldid=135045
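For anyone trying the same workaround, a minimal sequence based on the sysfs files named in the comment above (the hwmon path for the temperature readout is an assumption; the hwmon index varies per machine):

```shell
#!/bin/sh
# Inspect the current KMS power management settings.
cat /sys/class/drm/card0/device/power_method    # "profile" or "dynpm"
cat /sys/class/drm/card0/device/power_profile   # e.g. "default", "low", "high"

# Force the low profile (needs root). "profile" mode must be active
# for power_profile to take effect.
echo profile > /sys/class/drm/card0/device/power_method
echo low > /sys/class/drm/card0/device/power_profile

# Read the GPU temperature via hwmon, in millidegrees Celsius.
# The hwmon wildcard is an assumption; adjust for your system.
cat /sys/class/drm/card0/device/hwmon/hwmon*/temp1_input
```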
Igor Rudchenko igor@starrain.org changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |igor@starrain.org
--- Comment #10 from Igor Rudchenko igor@starrain.org 2011-04-12 16:35:47 --- This seems to be a regression in my case.
Mobile FireGL V5250, temperature reading from thinkpad-acpi:
2.6.37.2, KMS, profile=default/high - temperature=67
2.6.37.2, KMS, profile=mid - temperature=64

2.6.38.2, KMS, profile=default/high - temperature=71
2.6.38.2, KMS, profile=mid - temperature=64
Some older kernels and Windows with default GPU clocks - temperature=67
--- Comment #11 from Igor Rudchenko igor@starrain.org 2011-06-27 12:44:42 --- The high temperature of the mobile Radeon is back to normal with pcie_aspm=force.
Rafael J. Wysocki rjw@sisk.pl changed:
           What    |Removed |Added
----------------------------------------------------------------------------
            Status|NEW      |NEEDINFO
--- Comment #12 from Rafael J. Wysocki rjw@sisk.pl 2011-06-27 19:12:07 --- Can you attach dmesg output from your system with the 3.0-rc4 kernel, please?
--- Comment #13 from Igor Rudchenko igor@starrain.org 2012-07-02 11:57:02 ---
Created an attachment (id=64452)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=64452)
dmesg output for 3.0-rc5 kernel
--- Comment #14 from Igor Rudchenko igor@starrain.org 2012-02-22 16:56:23 --- Commit "PCI: Rework ASPM disable code", added in 3.0.20 and 3.2.5, has worsened the situation. I can't enable ASPM on the ThinkPad T60 now even with the "pcie_aspm=force" kernel parameter. So the radeon is always hot now.
3.2.4 kernel:
# dmesg | grep ASPM
[    0.000000] PCIe ASPM is forcibly enabled
[    0.161612] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    3.612673] e1000e 0000:02:00.0: Disabling ASPM L0s L1

# lspci -vv -s 01:00.0 | grep ASPM
	LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
	LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+

# cat /proc/acpi/ibm/thermal
temperatures:   49 41 37 68 36 -128 33 -128 42 54 55 -128 -128 -128 -128 -128
3.2.5 kernel:
# dmesg | grep ASPM
[    0.000000] PCIe ASPM is forcibly enabled
[    0.161614] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    3.523647] e1000e 0000:02:00.0: Disabling ASPM L0s L1

# lspci -vv -s 01:00.0 | grep ASPM
	LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
	LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+

# cat /proc/acpi/ibm/thermal
temperatures:   51 41 37 72 36 -128 33 -128 43 55 59 -128 -128 -128 -128 -128
Already tested kernels 3.2.7 and 3.3-rc4 - same problem.
On Wed, Feb 22, 2012 at 04:56:26PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
--- Comment #14 from Igor Rudchenko igor@starrain.org 2012-02-22 16:56:23 --- Commit "PCI: Rework ASPM disable code" added in 3.0.20 and 3.2.25 has worsened the situation. I can't enable ASPM on ThinkPad T60 now even with "pcie_aspm=force" kernel parameter. So radeon is always hot now.
Are you sure it's about ASPM? My Radeon (on an HP laptop) is very hot because of the unimplemented (and/or buggy) power management in the radeon driver!
I can make the temperature lower if I switch to power_profile "low" instead of the "default".
-- Pasi
https://bugzilla.kernel.org/show_bug.cgi?id=29842
--- Comment #15 from Igor Rudchenko igor@starrain.org 2012-03-19 10:50:43 --- Tested the 3.3.0 kernel today and nothing changed. So I looked deeper into the ASPM registers:
----
Windows XP and Linux kernels prior to 2.6.38:
root complex 00:01.0 0xB0 == 0x03 (L1 and L0s)
video card 01:00.0 0x68 == 0x43 (L1 and L0s)
----
Linux 3.2.4:
00:01.0 0xB0 == 00 (L0 only)
01:00.0 0x68 == 40 (L0 only)
with pcie_aspm=force:
00:01.0 0xB0.b=43 (L1 and L0s)
01:00.0 0x68.b=43 (L1 and L0s)
----
Linux 3.2.5 and 3.3.0:
00:01.0 0xB0.b=40 (L0 only)
01:00.0 0x68.b=40 (L0 only)
with pcie_aspm=force:
00:01.0 0xB0.b=40 (L0 only)
01:00.0 0x68.b=40 (L0 only)
----
I also had working ASPM for the network devices (ethernet and wireless) with Windows XP and kernels prior to 2.6.38. But after 2.6.38, ASPM doesn't turn on for the network devices even with the force option. And after another rework of the ASPM code in the 3.0.20, 3.2.5 and 3.3 kernels, ASPM doesn't turn on for the video card either, despite the force option.
I can enable ASPM on my devices with setpci:
setpci -s 00:01.0 0xB0.b=0x3:3
setpci -s 01:00.0 0x68.b=0x3:3
It works without problems, as it did prior to the 2.6.38 kernel. But, in my opinion, the ASPM handling code in Linux definitely needs another rework.
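The writes above can be verified by reading the same Link Control bytes back. A sketch (the register offsets 0xB0 and 0x68 are specific to this machine; on other hardware, find the PCIe capability offset with `lspci -vv` before poking registers):

```shell
#!/bin/sh
# Read back the Link Control bytes written above (needs root).
# The low two bits encode the ASPM control: 0x0 = disabled,
# 0x1 = L0s, 0x2 = L1, 0x3 = L0s and L1.
setpci -s 00:01.0 0xB0.b   # root complex port
setpci -s 01:00.0 0x68.b   # video card

# Cross-check against lspci's decoded view.
lspci -vv -s 01:00.0 | grep LnkCtl
```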
Matthew Garrett mjg59-kernel@srcf.ucam.org changed:
           What    |Removed |Added
----------------------------------------------------------------------------
                CC|         |mjg59-kernel@srcf.ucam.org
--- Comment #16 from Matthew Garrett mjg59-kernel@srcf.ucam.org 2012-03-19 14:01:30 --- Your network driver is explicitly turning off ASPM, so that's completely unrelated to the core ASPM handling code. pcie_aspm=force will only enable ASPM handling, it won't change the policy. If your BIOS didn't enable L1 and you want L1 enabled, you have to set the policy to powersave.
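Setting the policy as described can be done at runtime through the pcie_aspm module parameter in sysfs. A minimal sketch (paths are the standard sysfs locations; availability depends on the kernel build having CONFIG_PCIEASPM):

```shell
#!/bin/sh
# Show the available policies; the active one is bracketed,
# e.g. "[default] performance powersave".
cat /sys/module/pcie_aspm/parameters/policy

# Switch to the powersave policy so L0s/L1 get enabled where supported
# (needs root; fails if ASPM support was disabled at boot).
echo powersave > /sys/module/pcie_aspm/parameters/policy

# Confirm what actually got enabled on the link.
lspci -vv -s 01:00.0 | grep -E 'LnkCtl'
```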
--- Comment #17 from Igor Rudchenko igor@starrain.org 2012-03-19 16:35:11 --- To Matthew Garrett:
I agree about network cards, but current situation with video card worries me much. Prior 3.0.20, 3.2.5 and 3.3 kernels, users of ThinkPads T60 can simply add key "pcie_aspm=force" to kernel and get ASPM working for their radeon card. But after your last patch to ASPM code we can't get ASPM working simply by keys or sysfs. "pcie_aspm=force" does nothing, but the ability to change policy. But changing policy also does nothing! I tried to change to powersave and then watch into registers - I got the same 0x40 values. So now direct change registers with setpci is our only choice to get ASPM working. And so it should not be.
Alan alan@lxorguk.ukuu.org.uk changed:
           What    |Removed    |Added
----------------------------------------------------------------------------
                CC|            |alan@lxorguk.ukuu.org.uk
    Kernel Version|2.6.38-rc6  |3.3
Alan alan@lxorguk.ukuu.org.uk changed:
           What    |Removed |Added
----------------------------------------------------------------------------
        Regression|No       |Yes