[Bug 107518] polaris powerplay init fails: There must be 1 or more PCIE levels defined in PPTable


bugzilla-daemon＠freedesktop.org

7 Aug 2018

7:15 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

Bug ID: 107518
Summary: polaris powerplay init fails: There must be 1 or more PCIE levels defined in PPTable
Product: DRI
Version: unspecified
Hardware: PowerPC
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: shawnanastasio@yahoo.com

Created attachment 141002 --> https://bugs.freedesktop.org/attachment.cgi?id=141002&action=edit
dmesg for 4.17.11-200

When booting kernel 4.17 or 4.18-rc8+ (git) on a POWER9 system with an ASUS RX 580 GPU, the following messages are printed to the kernel log:

[ 10.398837] amdgpu: [powerplay] There must be 1 or more PCIE levels defined in PPTable.
[ 10.398839] amdgpu: [powerplay] Failed to populate SCLK during PopulateNewDPMClocksStates Function!
[ 10.398840] amdgpu: [powerplay] Failed to populate and upload SCLK MCLK DPM levels!

Note that the system is booted with the kernel argument `amdgpu.dc=0` to work around this issue: https://bugs.freedesktop.org/show_bug.cgi?id=107049

GPU performance seems to be significantly hindered as a result of these errors.

Booting with `amdgpu.dpm=0` silences the errors but does not improve performance.

-- You are receiving this mail because: You are the assignee for the bug.


bugzilla-daemon＠freedesktop.org

7 Aug

7:55 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #1 from Shawn Anastasio shawnanastasio@yahoo.com ---
Created attachment 141003 --> https://bugs.freedesktop.org/attachment.cgi?id=141003&action=edit
dmesg for 4.18.0-rc8+


bugzilla-daemon＠freedesktop.org

8:34 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #2 from Shawn Anastasio shawnanastasio@yahoo.com ---
Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.


bugzilla-daemon＠freedesktop.org

9:24 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #3 from Alex Deucher alexdeucher@gmail.com ---
(In reply to Shawn Anastasio from comment #2)

> Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.

The hw requires a special reset before it can be initialized again. This is handled in the driver for things like hibernate (S4) support.


bugzilla-daemon＠freedesktop.org

8 Aug

3:47 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #4 from Shawn Anastasio shawnanastasio@yahoo.com ---
(In reply to Alex Deucher from comment #3)

> (In reply to Shawn Anastasio from comment #2)
> > Upon further testing, the issue seems to go away when the firmware is removed from petitboot, preventing it from initializing the card before the host OS. This indicates that it may have something to do with the GPU being initialized twice.
>
> The hw requires a special reset before it can be initialized again. This is handled in the driver for things like hibernate (S4) support.

Does the driver do the reset on a kexec reboot? If so, it seems insufficient to mitigate this issue.


bugzilla-daemon＠freedesktop.org

3:04 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #5 from Alex Deucher alexdeucher@gmail.com ---
(In reply to Shawn Anastasio from comment #4)

> Does the driver do the reset on a kexec reboot? If so, it seems insufficient to mitigate this issue.

Probably not. I'm not that familiar with kexec, unfortunately.


bugzilla-daemon＠freedesktop.org

10 Aug

6:06 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #6 from Shawn Anastasio shawnanastasio@yahoo.com ---
Could you point me towards the applicable routines that perform the reset on hibernate? They may provide some more insight into the situation.


bugzilla-daemon＠freedesktop.org

2:46 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #7 from Alex Deucher alexdeucher@gmail.com ---
amdgpu_pmops_freeze() calls amdgpu_device_suspend(), which calls amdgpu_asic_reset() at the end. amdgpu_asic_reset() is a macro that calls an ASIC-specific callback to reset the GPU. vi_asic_reset() in vi.c is the callback for Polaris and other VI family parts.
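Editorial sketch: the call chain described in comment #7 can be modelled in standalone C as below. This is a simplified illustration, not the kernel source. The structures are cut-down stand-ins, the function signatures are reduced to a single device pointer (the kernel versions take different arguments), and the `reset_count` field is hypothetical, added only so the effect of the reset is observable.

```c
#include <assert.h>
#include <stddef.h>

struct amdgpu_device;

/* Cut-down stand-in for the per-ASIC callback table. */
struct amdgpu_asic_funcs {
    int (*reset)(struct amdgpu_device *adev);
};

/* Cut-down stand-in for the device structure. */
struct amdgpu_device {
    const struct amdgpu_asic_funcs *asic_funcs;
    int reset_count; /* hypothetical: counts resets for illustration */
};

/* The reset "function" is a macro that dispatches through the table. */
#define amdgpu_asic_reset(adev) ((adev)->asic_funcs->reset((adev)))

/* Stand-in for vi_asic_reset() in vi.c (Polaris and other VI parts). */
static int vi_asic_reset(struct amdgpu_device *adev)
{
    adev->reset_count++; /* the real callback resets the hardware */
    return 0;
}

static const struct amdgpu_asic_funcs vi_asic_funcs = {
    .reset = vi_asic_reset,
};

/* Stand-in for amdgpu_device_suspend(): the reset happens at the end. */
static int amdgpu_device_suspend(struct amdgpu_device *adev)
{
    assert(adev != NULL && adev->asic_funcs != NULL);
    /* ... the real function suspends the hardware blocks first ... */
    return amdgpu_asic_reset(adev);
}

/* Stand-in for amdgpu_pmops_freeze(), the hibernate (S4) entry point. */
static int amdgpu_pmops_freeze(struct amdgpu_device *adev)
{
    return amdgpu_device_suspend(adev);
}
```

Calling `amdgpu_pmops_freeze()` on a device whose `asic_funcs` points at `vi_asic_funcs` ends in `vi_asic_reset()`. In this model the reset is reached only through the freeze path, which is consistent with the doubt raised earlier in the thread about whether a kexec reboot triggers it.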


bugzilla-daemon＠freedesktop.org

25 Aug

7:19 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #8 from Timothy Pearson tpearson@raptorengineering.com ---
Would it make sense to call amdgpu_asic_reset() as part of module load to ensure that the GPU is in a known good state?


bugzilla-daemon＠freedesktop.org

26 Aug

12:08 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #9 from Timothy Pearson tpearson@raptorengineering.com ---
(In reply to Timothy Pearson from comment #8)

> Would it make sense to call amdgpu_asic_reset() as part of module load to ensure that the GPU is in a known good state?

This didn't fix the problem, but I did note that unloading and reloading the amdgpu module from the host (rmmod followed by modprobe) is a valid workaround. Something other than amdgpu_asic_reset() must happen during the rmmod-based teardown.


bugzilla-daemon＠freedesktop.org

27 Aug

1:56 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #10 from Alex Deucher alexdeucher@gmail.com ---
Does this patch help?
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&...


bugzilla-daemon＠freedesktop.org

29 Aug

9:09 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #11 from Luigi Laurini moonlght@tiscali.it ---
The same bug is present on the amd64 architecture in a KVM guest with an RX 480 passed through. The first time I boot the guest, performance is fine, but if I reboot the guest without rebooting the host, the messages appear.

I tried the patch

https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&...

The messages are gone, but the performance problem is still present.

The problem affects the memory and GPU clocks:

First boot of the guest:

cat /sys/kernel/debug/dri/0/amdgpu_pm_info

Clock Gating Flags Mask: 0x37bcf
Graphics Medium Grain Clock Gating: On
Graphics Medium Grain memory Light Sleep: On
Graphics Coarse Grain Clock Gating: On
Graphics Coarse Grain memory Light Sleep: On
Graphics Coarse Grain Tree Shader Clock Gating: Off
Graphics Coarse Grain Tree Shader Light Sleep: Off
Graphics Command Processor Light Sleep: On
Graphics Run List Controller Light Sleep: On
Graphics 3D Coarse Grain Clock Gating: Off
Graphics 3D Coarse Grain memory Light Sleep: Off
Memory Controller Light Sleep: On
Memory Controller Medium Grain Clock Gating: On
System Direct Memory Access Light Sleep: Off
System Direct Memory Access Medium Grain Clock Gating: On
Bus Interface Medium Grain Clock Gating: Off
Bus Interface Light Sleep: On
Unified Video Decoder Medium Grain Clock Gating: On
Video Compression Engine Medium Grain Clock Gating: On
Host Data Path Light Sleep: Off
Host Data Path Medium Grain Clock Gating: On
Digital Right Management Medium Grain Clock Gating: Off
Digital Right Management Light Sleep: Off
Rom Medium Grain Clock Gating: On
Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
1750 MHz (MCLK)
330 MHz (SCLK)
300 MHz (PSTATE_SCLK)
300 MHz (PSTATE_MCLK)
1000 mV (VDDGFX)
19.127 W (average GPU)

GPU Temperature: 56 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled

After rebooting the guest:

GFX Clocks and Power:
300 MHz (MCLK)
300 MHz (SCLK)
300 MHz (PSTATE_SCLK)
300 MHz (PSTATE_MCLK)
800 mV (VDDGFX)
7.162 W (average GPU)

GPU Temperature: 42 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled
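Editorial sketch: the two amdgpu_pm_info dumps above differ mainly in the reported MCLK (1750 MHz on first boot, 300 MHz after the guest reboot). A small C helper along these lines could pull a clock value out of such a dump so the two boots can be compared programmatically. The function name `pm_info_clock_mhz` is hypothetical, and the parser assumes only the "<number> MHz (<LABEL>)" layout shown in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the integer MHz value reported for `label` (for example "MCLK")
 * in an amdgpu_pm_info dump, or -1 if the label is not found.
 * Matches entries of the form "1750 MHz (MCLK)".
 */
static int pm_info_clock_mhz(const char *text, const char *label)
{
    char needle[32];
    const char *hit;
    const char *p;
    int mhz = 0;

    assert(text != NULL && label != NULL);

    /* Include the opening parenthesis so "SCLK" does not match "PSTATE_SCLK". */
    snprintf(needle, sizeof needle, " MHz (%s)", label);
    hit = strstr(text, needle);
    if (hit == NULL)
        return -1;

    /* Walk back over the digits that precede " MHz". */
    p = hit;
    while (p > text && p[-1] >= '0' && p[-1] <= '9')
        p--;
    if (p == hit)
        return -1; /* no number in front of the unit */

    for (; p < hit; p++)
        mhz = mhz * 10 + (*p - '0');
    return mhz;
}
```

With the first-boot dump this returns 1750 for "MCLK"; with the post-reboot dump it returns 300. An idle card may legitimately sit at a low clock, so a low reading on its own is only a hint. The comparison between boots under the same load is what matters.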


bugzilla-daemon＠freedesktop.org

1:33 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #12 from Luigi Laurini moonlght@tiscali.it ---
I have to correct my previous statement about the patch.

I patched the 4.18.5 kernel, compiled it, and rebooted the guest without rebooting the host. The guest VM was already in a bad state (300 MHz (MCLK)). Using a patched kernel in a guest VM in that state did not resolve the problem.

After this, I tried resetting the host. After the host restarted, behavior was normal. Resetting the guest without resetting the host seems to maintain the correct behavior (1750 MHz (MCLK)).

The patch seems to work only if the card is in a "good" state. If, for some reason, the card gets into a bad state, the patch cannot fix it.

I also noticed that the "powerplay init fails" message is gone, but now I get:

[ 36.352542] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353460] amdgpu: [powerplay] last message was failed ret is 0
[ 36.353468] amdgpu: [powerplay] failed to send message 260 ret is 255
[ 36.353471] amdgpu: [powerplay] failed to send message 145 ret is 255
[ 36.353475] amdgpu: [powerplay] last message was failed ret is 255
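Editorial sketch: when collecting logs like the ones above from several boots, it can help to extract the message ID and return code from each "failed to send message" line. The C helper below is one way to do that. The struct and function names are hypothetical, and the thread does not say what message IDs 260 and 145 correspond to, so the code treats them as opaque integers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one failed powerplay message. */
struct pp_send_failure {
    int msg; /* message ID as printed in the log */
    int ret; /* return code as printed in the log */
};

/*
 * Parse one kernel log line. Returns 1 and fills `out` if the line contains
 * "failed to send message <id> ret is <code>", otherwise returns 0.
 */
static int parse_send_failure(const char *line, struct pp_send_failure *out)
{
    const char *p;

    assert(line != NULL && out != NULL);

    /* Skip the timestamp and "amdgpu: [powerplay]" prefix. */
    p = strstr(line, "failed to send message ");
    if (p == NULL)
        return 0;

    return sscanf(p, "failed to send message %d ret is %d",
                  &out->msg, &out->ret) == 2;
}
```

Lines of the form "last message was failed ret is ..." do not match and are reported as 0, so only the two send failures in the log above would be recorded.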


bugzilla-daemon＠freedesktop.org

27 Oct

5:29 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #13 from Timothy Pearson tpearson@raptorengineering.com ---
The patch originally at https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&... is no longer available:

> Bad commit reference: 8242308cc3c4419832126ab78ca409ce7110ab33

Is an equivalent now in mainline? I'd like to try it out on one of our POWER9 boxes.

Thanks!


bugzilla-daemon＠freedesktop.org

29 Oct

7:04 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #14 from Alex Deucher alexdeucher@gmail.com ---
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...


bugzilla-daemon＠freedesktop.org

19 Nov

8:46 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=107518

--- Comment #15 from Martin Peres martin.peres@free.fr ---
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/474.
