https://bugzilla.kernel.org/show_bug.cgi?id=207673
Bug ID: 207673
Summary: radeon: crash due to over temperature
Product: Drivers
Version: 2.5
Kernel Version: 5.6.x and previous
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: phil@jpmr.org
Regression: No
Created attachment 289045
  --> https://bugzilla.kernel.org/attachment.cgi?id=289045&action=edit
kernel log + lspci + glxinfo + patch
The radeon driver crashes when my AMD Cape Verde Pro graphics card overheats.
On my system, there's no overclocking, and power management is left at its defaults: power_method dpm, power_dpm_state balanced and power_dpm_force_performance_level auto. This GPU is used for display and OpenCL computing.
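For anyone wanting to compare their own settings with the ones listed above, here is a minimal sketch that reads the three values from sysfs. It assumes the card is exposed as card0 under /sys/class/drm (an assumption, the index may differ on other systems); the helper name read_dpm_settings is hypothetical, and files that are missing are reported as None.

```python
from pathlib import Path

# sysfs files named in the report; the directory below assumes the GPU is card0.
DPM_FILES = ("power_method", "power_dpm_state", "power_dpm_force_performance_level")


def read_dpm_settings(device_dir="/sys/class/drm/card0/device"):
    """Return {file name: contents}, with None for any file that cannot be read."""
    settings = {}
    for name in DPM_FILES:
        try:
            settings[name] = (Path(device_dir) / name).read_text().strip()
        except OSError:
            # File absent or unreadable (other driver, other card index, permissions).
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_dpm_settings().items():
        print(f"{name}: {value}")
```

The directory is a parameter so the helper can be pointed at another card index.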
The default over-temperature value in r600_dpm is 120C, which seems to be too high for this chip/card. I patched my system to use a 100C limit, and I no longer get crashes. (I tried 110C, and it is still too high.)
Attached are the full kernel log of the crash event, the lspci and glxinfo output for the graphics card, and the proposed patch.
--- Comment #1 from phileimer (phil@jpmr.org) ---
Created attachment 289047
  --> https://bugzilla.kernel.org/attachment.cgi?id=289047&action=edit
radeon: lower the high temperature limit
Limit the chip temperature to 100C, instead of 120C.
--- Comment #2 from phileimer (phil@jpmr.org) ---
I can give more information about the over temperature problem:
* if I keep the 120C limit, the card runs at power level 3 until the driver crashes
* limiting at 100C allows the driver to decrease power level to 2 after a small overshoot, i.e. the temperature reaches 103/104C
* once at power level 2, the temperature stabilizes around 96C
* to test further, I decreased the case fan speed, and then, even with the 100C limit, the card continues to run at power level 2 until the driver crashes around 112C
So there seem to be two problems:
* the default 120C limit is clearly too high, at least for this board/chip
* the temperature limit is used to go from power level 3 to power level 2, but there is no further decrease to a lower power level (1 or 0) as a safety measure
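To make the second point concrete, here is a small sketch of the kind of step-down behaviour being asked for. This is not the radeon or amdgpu driver logic, only a hypothetical policy: the function name next_power_level, the 5C hysteresis and the one-level-per-check rule are all assumptions chosen for illustration.

```python
def next_power_level(level, temp_c, limit_c=100.0, hysteresis_c=5.0, max_level=3):
    """Hypothetical throttling policy, not the driver's algorithm.

    At or above the limit, drop one power level per check, down to level 0.
    Below (limit - hysteresis), climb back one level, up to max_level.
    In between, hold the current level to avoid oscillating.
    """
    if temp_c >= limit_c:
        return max(level - 1, 0)
    if temp_c < limit_c - hysteresis_c:
        return min(level + 1, max_level)
    return level
```

With these assumed values, a card at power level 2 and 112C (the reduced fan speed case above) would be moved to level 1 instead of staying at level 2, while a card holding 96C at level 2 would stay where it is.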
phileimer (phil@jpmr.org) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|radeon: crash due to over   |amdgpu/radeon: crash due to
                   |temperature                 |over temperature
phileimer (phil@jpmr.org) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.6.x and previous          |5.6.x, 5.7.x
--- Comment #3 from phileimer (phil@jpmr.org) ---
Created attachment 289807
  --> https://bugzilla.kernel.org/attachment.cgi?id=289807&action=edit
amdgpu: lower the temperature limit to avoid kernel crash
--- Comment #4 from phileimer (phil@jpmr.org) ---
I modified my kernel configuration to use the new amdgpu driver for this SI chip, instead of the legacy radeon. The same problem occurs: to avoid frequent kernel crashes, I must apply a patch to lower the maximum temperature allowed.
--- Comment #5 from phileimer (phil@jpmr.org) ---
Created attachment 289897
  --> https://bugzilla.kernel.org/attachment.cgi?id=289897&action=edit
amdgpu: kernel log when over temperature crash
dri-devel@lists.freedesktop.org