https://bugs.freedesktop.org/show_bug.cgi?id=107518
Bug ID: 107518
Summary: polaris powerplay init fails: There must be 1 or more PCIE levels defined in PPTable
Product: DRI
Version: unspecified
Hardware: PowerPC
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: shawnanastasio@yahoo.com
Created attachment 141002
--> https://bugs.freedesktop.org/attachment.cgi?id=141002&action=edit
dmesg for 4.17.11-200
When booting kernel 4.17 or 4.18-rc8+ (git) on a POWER9 system with an ASUS Rx 580 GPU, the following messages are printed to the kernel log:
[ 10.398837] amdgpu: [powerplay] There must be 1 or more PCIE levels defined in PPTable.
[ 10.398839] amdgpu: [powerplay] Failed to populate SCLK during PopulateNewDPMClocksStates Function!
[ 10.398840] amdgpu: [powerplay] Failed to populate and upload SCLK MCLK DPM levels!
Note that the system is booted with the kernel argument `amdgpu.dc=0` to work around this issue: https://bugs.freedesktop.org/show_bug.cgi?id=107049
GPU performance seems to be significantly hindered as a result of these errors.
Booting with `amdgpu.dpm=0` silences the errors but does not improve performance.
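The powerplay messages quoted above can be picked out of a saved kernel log with a few lines of Python. This is a minimal sketch, assuming the log uses the `[ timestamp] amdgpu: [powerplay] ...` layout shown in this report; the function name `powerplay_messages` and the `sample` variable are made up for illustration.

```python
import re

# Matches lines such as:
# [   10.398837] amdgpu: [powerplay] There must be 1 or more PCIE levels ...
POWERPLAY_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+amdgpu: \[powerplay\]\s+(.*)$")


def powerplay_messages(log_text):
    """Return (timestamp, message) pairs for amdgpu powerplay log lines."""
    found = []
    for line in log_text.splitlines():
        match = POWERPLAY_LINE.match(line.strip())
        if match:
            found.append((float(match.group(1)), match.group(2)))
    return found


# The three lines from this report, used as sample input.
sample = """\
[   10.398837] amdgpu: [powerplay] There must be 1 or more PCIE levels defined in PPTable.
[   10.398839] amdgpu: [powerplay] Failed to populate SCLK during PopulateNewDPMClocksStates Function!
[   10.398840] amdgpu: [powerplay] Failed to populate and upload SCLK MCLK DPM levels!
"""

for stamp, message in powerplay_messages(sample):
    print(stamp, message)
```

Feeding it the attached dmesg files instead of `sample` would list every powerplay message in the same way.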
--- Comment #1 from Shawn Anastasio shawnanastasio@yahoo.com ---
Created attachment 141003
--> https://bugs.freedesktop.org/attachment.cgi?id=141003&action=edit
dmesg for 4.18.0-rc8+
--- Comment #2 from Shawn Anastasio shawnanastasio@yahoo.com ---
Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.
--- Comment #3 from Alex Deucher alexdeucher@gmail.com ---
(In reply to Shawn Anastasio from comment #2)
> Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.
The hw requires a special reset before it can be initialized again. This is handled in driver for things like hibernate (S4) support.
--- Comment #4 from Shawn Anastasio shawnanastasio@yahoo.com ---
(In reply to Alex Deucher from comment #3)
> (In reply to Shawn Anastasio from comment #2)
> > Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.
> The hw requires a special reset before it can be initialized again. This is handled in driver for things like hibernate (S4) support.
Does the driver do the reset on a kexec reboot? If so, it seems insufficient to mitigate this issue.
--- Comment #5 from Alex Deucher alexdeucher@gmail.com ---
(In reply to Shawn Anastasio from comment #4)
> Does the driver do the reset on a kexec reboot? If so, it seems insufficient to mitigate this issue.
Probably not. I'm not that familiar with kexec unfortunately.
--- Comment #6 from Shawn Anastasio shawnanastasio@yahoo.com ---
Could you point me towards the applicable routines that perform the reset on hibernate? They may provide some more insight into the situation.
--- Comment #7 from Alex Deucher alexdeucher@gmail.com ---
amdgpu_pmops_freeze() calls amdgpu_device_suspend() which calls amdgpu_asic_reset() at the end. amdgpu_asic_reset() is a macro which calls an asic specific callback to reset the GPU. vi_asic_reset() in vi.c is the callback for polaris and other VI family parts.
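For readers unfamiliar with the pattern, the call chain described above can be modelled as a toy in Python. This is not kernel code: the real functions are C in the amdgpu driver and do far more than shown here. The Python functions only borrow the driver's names, and the `ToyDevice` class and its `calls` list are invented purely to show the order of the calls and the per-ASIC callback.

```python
class ToyDevice:
    """Stand-in for a device object that carries a per-ASIC reset callback."""

    def __init__(self, name, asic_reset):
        self.name = name
        self.asic_reset = asic_reset  # callback chosen per ASIC family
        self.calls = []               # records the order of calls (illustration only)


def vi_asic_reset(dev):
    # Models the callback used for polaris and other VI family parts.
    dev.calls.append("vi_asic_reset")


def amdgpu_asic_reset(dev):
    # Models the macro: it only dispatches to the ASIC-specific callback.
    dev.calls.append("amdgpu_asic_reset")
    dev.asic_reset(dev)


def amdgpu_device_suspend(dev):
    dev.calls.append("amdgpu_device_suspend")
    # (real suspend work omitted) The reset happens at the end.
    amdgpu_asic_reset(dev)


def amdgpu_pmops_freeze(dev):
    # Entry point on the hibernate (S4) path.
    dev.calls.append("amdgpu_pmops_freeze")
    amdgpu_device_suspend(dev)


polaris = ToyDevice("polaris", vi_asic_reset)
amdgpu_pmops_freeze(polaris)
print(" -> ".join(polaris.calls))
```

The point of the model is only that the reset sits at the end of the suspend path, so a path that never goes through suspend would not reach it.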
--- Comment #8 from Timothy Pearson tpearson@raptorengineering.com ---
Would it make sense to call amdgpu_asic_reset() as part of module load to ensure that the GPU is in a known good state?
--- Comment #9 from Timothy Pearson tpearson@raptorengineering.com ---
(In reply to Timothy Pearson from comment #8)
> Would it make sense to call amdgpu_asic_reset() as part of module load to ensure that the GPU is in a known good state?
This didn't fix the problem, but I did note that rmmod / modprobing the amdgpu module from the host is a valid workaround. Something must happen on rmmod-based teardown aside from amdgpu_asic_reset().
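The rmmod / modprobe workaround mentioned above could be scripted roughly as follows. This is a minimal sketch, assuming it is run as root on the host and that nothing else is holding the amdgpu module. The function names `reload_commands` and `reload_amdgpu` are made up for illustration, and by default the sketch only prints the commands instead of running them.

```python
import subprocess


def reload_commands(module="amdgpu"):
    """Commands that unload and then reload a kernel module."""
    return [["rmmod", module], ["modprobe", module]]


def reload_amdgpu(dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for command in reload_commands():
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a command fails,
            # for example when the module is still in use.
            subprocess.run(command, check=True)


reload_amdgpu()  # dry run: prints the two commands only
```

Calling `reload_amdgpu(dry_run=False)` would actually unload and reload the module, which takes down any display driven by the card while it runs.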
--- Comment #10 from Alex Deucher alexdeucher@gmail.com ---
Does this patch help?
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&...
--- Comment #11 from Luigi Laurini moonlght@tiscali.it ---
The same bug is present on amd64 in a KVM guest with an RX 480 passed through. The first time I boot the guest, performance is OK, but if I reboot the guest without rebooting the host, the messages appear.
I tried the patch
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&...
The messages are gone, but the performance problem is still present.
The problem affects the memory/GPU clocking:
First boot of the guest:
cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Clock Gating Flags Mask: 0x37bcf
    Graphics Medium Grain Clock Gating: On
    Graphics Medium Grain memory Light Sleep: On
    Graphics Coarse Grain Clock Gating: On
    Graphics Coarse Grain memory Light Sleep: On
    Graphics Coarse Grain Tree Shader Clock Gating: Off
    Graphics Coarse Grain Tree Shader Light Sleep: Off
    Graphics Command Processor Light Sleep: On
    Graphics Run List Controller Light Sleep: On
    Graphics 3D Coarse Grain Clock Gating: Off
    Graphics 3D Coarse Grain memory Light Sleep: Off
    Memory Controller Light Sleep: On
    Memory Controller Medium Grain Clock Gating: On
    System Direct Memory Access Light Sleep: Off
    System Direct Memory Access Medium Grain Clock Gating: On
    Bus Interface Medium Grain Clock Gating: Off
    Bus Interface Light Sleep: On
    Unified Video Decoder Medium Grain Clock Gating: On
    Video Compression Engine Medium Grain Clock Gating: On
    Host Data Path Light Sleep: Off
    Host Data Path Medium Grain Clock Gating: On
    Digital Right Management Medium Grain Clock Gating: Off
    Digital Right Management Light Sleep: Off
    Rom Medium Grain Clock Gating: On
    Data Fabric Medium Grain Clock Gating: Off
GFX Clocks and Power:
    1750 MHz (MCLK)
    330 MHz (SCLK)
    300 MHz (PSTATE_SCLK)
    300 MHz (PSTATE_MCLK)
    1000 mV (VDDGFX)
    19.127 W (average GPU)
GPU Temperature: 56 C
GPU Load: 0 %
UVD: Disabled
VCE: Disabled
If I reboot the guest:
cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Clock Gating Flags Mask: 0x37bcf
    Graphics Medium Grain Clock Gating: On
    Graphics Medium Grain memory Light Sleep: On
    Graphics Coarse Grain Clock Gating: On
    Graphics Coarse Grain memory Light Sleep: On
    Graphics Coarse Grain Tree Shader Clock Gating: Off
    Graphics Coarse Grain Tree Shader Light Sleep: Off
    Graphics Command Processor Light Sleep: On
    Graphics Run List Controller Light Sleep: On
    Graphics 3D Coarse Grain Clock Gating: Off
    Graphics 3D Coarse Grain memory Light Sleep: Off
    Memory Controller Light Sleep: On
    Memory Controller Medium Grain Clock Gating: On
    System Direct Memory Access Light Sleep: Off
    System Direct Memory Access Medium Grain Clock Gating: On
    Bus Interface Medium Grain Clock Gating: Off
    Bus Interface Light Sleep: On
    Unified Video Decoder Medium Grain Clock Gating: On
    Video Compression Engine Medium Grain Clock Gating: On
    Host Data Path Light Sleep: Off
    Host Data Path Medium Grain Clock Gating: On
    Digital Right Management Medium Grain Clock Gating: Off
    Digital Right Management Light Sleep: Off
    Rom Medium Grain Clock Gating: On
    Data Fabric Medium Grain Clock Gating: Off
GFX Clocks and Power:
    300 MHz (MCLK)
    300 MHz (SCLK)
    300 MHz (PSTATE_SCLK)
    300 MHz (PSTATE_MCLK)
    800 mV (VDDGFX)
    7.162 W (average GPU)
GPU Temperature: 42 C
GPU Load: 0 %
UVD: Disabled
VCE: Disabled
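The difference between the two dumps above is confined to the clock lines, and it can be extracted mechanically. The following is a minimal sketch, assuming the `N MHz (NAME)` layout seen in the output above; the helper names `read_clocks` and `changed_clocks` are made up for illustration, and the two sample strings are copied from the dumps in this comment.

```python
import re

# Matches entries such as "1750 MHz (MCLK)".
CLOCK = re.compile(r"(\d+)\s+MHz\s+\((\w+)\)")


def read_clocks(pm_info_text):
    """Map clock names (MCLK, SCLK, ...) to their MHz values."""
    return {name: int(mhz) for mhz, name in CLOCK.findall(pm_info_text)}


def changed_clocks(before_text, after_text):
    """Return {name: (before, after)} for clocks whose value differs."""
    before, after = read_clocks(before_text), read_clocks(after_text)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


first_boot = (
    "GFX Clocks and Power: 1750 MHz (MCLK) 330 MHz (SCLK) "
    "300 MHz (PSTATE_SCLK) 300 MHz (PSTATE_MCLK) "
    "1000 mV (VDDGFX) 19.127 W (average GPU)"
)
after_reboot = (
    "GFX Clocks and Power: 300 MHz (MCLK) 300 MHz (SCLK) "
    "300 MHz (PSTATE_SCLK) 300 MHz (PSTATE_MCLK) "
    "800 mV (VDDGFX) 7.162 W (average GPU)"
)

print(changed_clocks(first_boot, after_reboot))
# prints {'MCLK': (1750, 300), 'SCLK': (330, 300)}
```

The sketch only reports which values differ between two snapshots; it does not try to decide on its own whether a given clock value is healthy.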
--- Comment #12 from Luigi Laurini moonlght@tiscali.it ---
I need to correct my earlier statement about the patch.
I patched the 4.18.5 kernel, compiled it, and rebooted the guest without rebooting the host. The guest VM was already in a bad state (300 MHz (MCLK)). Using a patched kernel in a guest VM in that state did not resolve the situation.
After this, I tried resetting the host. The first start of the host led to normal behavior. Resetting the guest without resetting the host seems to maintain the correct behavior (1750 MHz (MCLK)).
The patch seems to work only if the card is in a "good" state. If, for some reason, the card gets into a bad state, the patch cannot solve the problem.
I also noticed that the "powerplay init fails" message is gone, but now I get:
[ 36.352542] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353460] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353468] amdgpu: [powerplay] failed to send message 260 ret is 255
[ 36.353471] amdgpu: [powerplay] failed to send message 145 ret is 255
[ 36.353475] amdgpu: [powerplay] last message was failed ret is 255
--- Comment #13 from Timothy Pearson tpearson@raptorengineering.com ---
The patch originally at https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&... is no longer available:
Bad commit reference: 8242308cc3c4419832126ab78ca409ce7110ab33
Is an equivalent now in mainline? I'd like to try it out on one of our POWER9 boxes.
Thanks!
--- Comment #14 from Alex Deucher alexdeucher@gmail.com ---
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
Martin Peres martin.peres@free.fr changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|NEW                         |RESOLVED
--- Comment #15 from Martin Peres martin.peres@free.fr ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/474.