https://bugzilla.kernel.org/show_bug.cgi?id=199959
Bug ID: 199959
Summary: amdgpu, regression?: system freezes after resume
Product: Drivers
Version: 2.5
Kernel Version: 4.17
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: mezin.alexander@gmail.com
Regression: No
Created attachment 276359 --> https://bugzilla.kernel.org/attachment.cgi?id=276359&action=edit failed resume - journalctl output
After suspend and resume, I see the lock screen, but the mouse cursor doesn't move and pressing keys doesn't seem to change anything (I can't perform a VT switch either).
Sapphire Radeon RX 580 Pulse 8 GB
Two displays connected through DisplayPort: Dell P2415Q and LG 27UD69P
Cinnamon desktop (Xorg)
Arch Linux
Happens on kernels 4.16.13 and 4.17 (even with amdgpu.dc=0).
Doesn't happen with kernel 4.14.48 (and earlier 4.14.*).
--- Comment #1 from Alexander Mezin (mezin.alexander@gmail.com) ---
Just tested 4.15.15 and 4.16.
On 4.15.15 suspend and resume works fine.
On 4.16 the system freezes even with amdgpu.dc=0.
Note that Arch has
CONFIG_DRM_AMD_DC=y
CONFIG_DRM_AMD_DC_PRE_VEGA=y
# CONFIG_DRM_AMD_DC_FBC is not set
CONFIG_DRM_AMD_DC_DCN1_0=y
in kernel config for both 4.15 and 4.16
--- Comment #2 from Alexander Mezin (mezin.alexander@gmail.com) --- Created attachment 276363 --> https://bugzilla.kernel.org/attachment.cgi?id=276363&action=edit Failed resume - 4.16, amdgpu.dc=0
Alexander Mezin (mezin.alexander@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Regression|No                          |Yes
Alexander Mezin (mezin.alexander@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|4.17                        |4.16
Christian König (christian.koenig@amd.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |christian.koenig@amd.com
--- Comment #3 from Christian König (christian.koenig@amd.com) --- Standard question: Can you bisect?
The logs don't show anything suspicious, so without a bisect it is probably really hard to guess what this could be.
--- Comment #4 from Alexander Mezin (mezin.alexander@gmail.com) --- Commit d6895ad39f3b396be199f5b6fdfb8cde4be7bbf7 seems to be the cause. Resume works on 4.16 if I revert that single commit (tested on 4.16.0, 4.16.13, with both amdgpu.dc=0 and amdgpu.dc=1).
--- Comment #5 from Christian König (christian.koenig@amd.com) --- Ok, well that is interesting.
Please provide the output of "sudo cat /proc/iomem" and "lspci -t -v -nn".
In the meantime I will try to reproduce the issue here.
--- Comment #6 from Alexander Mezin (mezin.alexander@gmail.com) --- Created attachment 276391 --> https://bugzilla.kernel.org/attachment.cgi?id=276391&action=edit lspci -t -v -nn
--- Comment #7 from Alexander Mezin (mezin.alexander@gmail.com) --- Created attachment 276393 --> https://bugzilla.kernel.org/attachment.cgi?id=276393&action=edit /proc/iomem
--- Comment #8 from Christian König (christian.koenig@amd.com) --- Mhm, I've tried the same ASIC (Polaris 10, 8 GB) in an AMD Threadripper system, and suspend/resume works fine here.
So the only explanation I have is that this is some strange issue with PCI BAR resizing and Intel hardware.
Is the system completely unresponsive after resume, or can you at least ping it over the network?
--- Comment #9 from Alexander Mezin (mezin.alexander@gmail.com) --- It seems that only the GPU is hung; I can even SSH to the machine. But things like restarting gdm/Xorg or unplugging the monitor didn't "fix" it. "shutdown -h now" didn't work.
--- Comment #10 from Alexander Mezin (mezin.alexander@gmail.com) --- Actually, sometimes the mouse pointer moves and only freezes after I press a few keys or click a few times. Also, sometimes the background shows just a colored pattern instead of the lock screen. With GNOME on Wayland it takes a bit more time to break: after resume I see the desktop, but after a few clicks or key presses I see artifacts and then eventually everything freezes.
And just in case:
- The problem also occurs with only one monitor connected.
- On Windows on the same machine suspend and resume works without any problems.
--- Comment #11 from Alexander Mezin (mezin.alexander@gmail.com) --- I literally have no idea what I'm doing, but adding 'amdgpu_device_resize_fb_bar(adev);' line to all 'gmc_v?_?_resume()' (because I don't know which version is used for my card) "fixed" it somehow. Resume works, but there are some artifacts on screen during resume (they flash only once and then disappear). Before 'amdgpu_device_resize_fb_bar' was introduced, there were no artifacts at all.
--- Comment #12 from Alexander Mezin (mezin.alexander@gmail.com) --- Created attachment 276415 --> https://bugzilla.kernel.org/attachment.cgi?id=276415&action=edit dmesg: resume with device_resize_fb_bar() in gmc_v?_?_resume()
--- Comment #13 from Christian König (christian.koenig@amd.com) --- (In reply to Alexander Mezin from comment #11)
> I literally have no idea what I'm doing, but adding 'amdgpu_device_resize_fb_bar(adev);' line to all 'gmc_v?_?_resume()' (because I don't know which version is used for my card) "fixed" it somehow. Resume works, but there are some artifacts on screen during resume (they flash only once and then disappear). Before 'amdgpu_device_resize_fb_bar' was introduced, there were no artifacts at all.
Hehe, yeah that was a really nice test and confirms my suspicion on what's going wrong here.
Because you tried to resize the BAR once more after resume the resources in the address space are freed up and allocated again:
[ 212.484672] amdgpu 0000:65:00.0: BAR 2: releasing [mem 0xe200000000-0xe2001fffff 64bit pref]
[ 212.484673] amdgpu 0000:65:00.0: BAR 0: releasing [mem 0xe000000000-0xe1ffffffff 64bit pref]
[ 212.484683] pcieport 0000:64:00.0: BAR 15: releasing [mem 0xe000000000-0xe2ffffffff 64bit pref]

[ 212.484691] pcieport 0000:64:00.0: BAR 15: assigned [mem 0xe000000000-0xe2ffffffff 64bit pref]
[ 212.484692] amdgpu 0000:65:00.0: BAR 0: assigned [mem 0xe000000000-0xe1ffffffff 64bit pref]
[ 212.484697] amdgpu 0000:65:00.0: BAR 2: assigned [mem 0xe200000000-0xe2001fffff 64bit pref]
Since it allocates the exact same address we freed up before, the real issue is not the address itself but the fact that the hardware config isn't saved during suspend/resume.
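As a cross-check of the point that the same addresses come back, the BAR lines from the dmesg excerpt can be parsed and compared. This is only a minimal Python sketch: the regular expression and the helper names (`parse_bar_events`, `window_size`) are made up for illustration, and the log text is copied from the comment with the timestamps dropped.

```python
import re

# dmesg lines copied from comment #13 (timestamps dropped).
LOG = """\
amdgpu 0000:65:00.0: BAR 2: releasing [mem 0xe200000000-0xe2001fffff 64bit pref]
amdgpu 0000:65:00.0: BAR 0: releasing [mem 0xe000000000-0xe1ffffffff 64bit pref]
pcieport 0000:64:00.0: BAR 15: releasing [mem 0xe000000000-0xe2ffffffff 64bit pref]
pcieport 0000:64:00.0: BAR 15: assigned [mem 0xe000000000-0xe2ffffffff 64bit pref]
amdgpu 0000:65:00.0: BAR 0: assigned [mem 0xe000000000-0xe1ffffffff 64bit pref]
amdgpu 0000:65:00.0: BAR 2: assigned [mem 0xe200000000-0xe2001fffff 64bit pref]
"""

# Hypothetical pattern for "<device>: BAR <n>: <action> [mem 0x<start>-0x<end> ...".
LINE_RE = re.compile(
    r"(?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): (?P<action>releasing|assigned) "
    r"\[mem 0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)


def parse_bar_events(log):
    """Return {action: {(device, bar): (start, end)}} for every BAR line."""
    events = {"releasing": {}, "assigned": {}}
    for line in log.splitlines():
        m = LINE_RE.search(line)
        if m:
            key = (m["dev"], int(m["bar"]))
            events[m["action"]][key] = (int(m["start"], 16), int(m["end"], 16))
    return events


def window_size(window):
    """Length in bytes of an inclusive (start, end) address range."""
    start, end = window
    return end - start + 1


events = parse_bar_events(LOG)
for key, window in sorted(events["assigned"].items()):
    same = events["releasing"][key] == window
    print(key, f"{window_size(window) >> 20} MiB", "unchanged" if same else "moved")
```

For these lines every released range is assigned again unchanged: BAR 0 spans 8 GiB, BAR 2 spans 2 MiB, and the bridge window (BAR 15) spans 12 GiB.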
That strongly looks like a bug in the BIOS and/or the Linux PCI subsystem driver for Intel hardware to me.
I will try to narrow this down with a few patches on Monday, but don't expect any quick fix.
--- Comment #14 from Christian König (christian.koenig@amd.com) --- Created attachment 276471 --> https://bugzilla.kernel.org/attachment.cgi?id=276471&action=edit Testing patch
Please test if this patch helps as well.
It limits the work done during resume to reprogramming BAR 0 & 2 and not the bridge.
--- Comment #15 from Alexander Mezin (mezin.alexander@gmail.com) --- No, it doesn't change anything, system freezes on resume.
--- Comment #16 from Christian König (christian.koenig@amd.com) --- So the problem seems to be the bridge then.
Please provide me with the output of the following commands, once before you suspended, once after you resumed without any change and once after you resumed with your hack to resize the BAR once more:
sudo setpci -s 64:00.0 COMMAND PREF_MEMORY_BASE PREF_MEMORY_LIMIT PREF_BASE_UPPER32 PREF_LIMIT_UPPER32
sudo lspci -s 64:00.0 -vvvv
--- Comment #17 from Alexander Mezin (mezin.alexander@gmail.com) --- setpci - exactly the same output in all 3 cases (verified with 'diff' to be sure): 0407 0001 fff1 000000e0 000000e2
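For reference, the five values can be decoded by hand. The Python sketch below assumes the standard PCI-to-PCI bridge register layout (bits 15:4 of PREF_MEMORY_BASE/PREF_MEMORY_LIMIT hold address bits 31:20, a low nibble of 1 marks a 64-bit window, and the UPPER32 registers supply address bits 63:32). The helper names and the partial command-bit table are illustrative only.

```python
# Values reported in comment #17, in the order requested in comment #16.
COMMAND = 0x0407
PREF_MEMORY_BASE = 0x0001
PREF_MEMORY_LIMIT = 0xFFF1
PREF_BASE_UPPER32 = 0x000000E0
PREF_LIMIT_UPPER32 = 0x000000E2

# Partial table of PCI command register bits (only the ones needed here).
COMMAND_BITS = {0: "I/O space", 1: "memory space", 2: "bus master", 10: "INTx disable"}


def decode_command(value):
    """List the names of the known command bits that are set."""
    return [name for bit, name in COMMAND_BITS.items() if value & (1 << bit)]


def prefetchable_window(base, limit, base_upper, limit_upper):
    """Combine the 16-bit base/limit with the upper halves into (start, end)."""
    start = (base_upper << 32) | ((base & 0xFFF0) << 16)
    # The limit is inclusive, so the low 20 address bits are all ones.
    end = (limit_upper << 32) | ((limit & 0xFFF0) << 16) | 0xFFFFF
    return start, end


start, end = prefetchable_window(
    PREF_MEMORY_BASE, PREF_MEMORY_LIMIT, PREF_BASE_UPPER32, PREF_LIMIT_UPPER32
)
print(decode_command(COMMAND))
print(f"prefetchable window: 0x{start:x}-0x{end:x}")
```

Under these assumptions the window decodes to 0xe000000000-0xe2ffffffff, the same range dmesg reports for BAR 15 of the bridge in comment #13, and memory decoding and bus mastering are enabled. That is consistent with the bridge configuration being intact in all three cases.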
--- Comment #18 from Alexander Mezin (mezin.alexander@gmail.com) --- Created attachment 276517 --> https://bugzilla.kernel.org/attachment.cgi?id=276517&action=edit lspci before suspend
--- Comment #19 from Alexander Mezin (mezin.alexander@gmail.com) --- Created attachment 276519 --> https://bugzilla.kernel.org/attachment.cgi?id=276519&action=edit lspci after resume, no hack
--- Comment #20 from Alexander Mezin (mezin.alexander@gmail.com) --- Created attachment 276521 --> https://bugzilla.kernel.org/attachment.cgi?id=276521&action=edit lspci after resume with hack
--- Comment #21 from Alexander Mezin (mezin.alexander@gmail.com) --- Not sure if it'll help, but I've added more logging here:
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -436,6 +436,8 @@ int pci_resize_resource(struct pci_dev *dev, int resno, int size)
 	if (ret)
 		return ret;
 
+	pci_info(dev, "BAR %d: resized from %d to %d", resno, old, size);
+
 	res->end = res->start + pci_rebar_size_to_bytes(size) - 1;
 
 	/* Check if the new config works by trying to assign everything. */
And suspend-resume with "re-resize" hack shows this:
amdgpu 0000:65:00.0: BAR 0: resized from 8 to 13
(this message appears in dmesg two times, first one on boot, second one during resume, exactly the same message in both cases)
--- Comment #22 from Christian König (christian.koenig@amd.com) --- Your debugging efforts are better than mine.
Please provide the output of "sudo setpci -s 65:00.0 ECAP15.l ECAP15+4.l ECAP15+8.l" once before suspend and once after suspend without any changes (e.g. when the problem happens).
--- Comment #23 from Alexander Mezin (mezin.alexander@gmail.com) --- (In reply to Christian König from comment #22)
> Your debugging efforts are better than mine.
>
> Please provide the output of "sudo setpci -s 65:00.0 ECAP15.l ECAP15+4.l ECAP15+8.l" once before suspend and once after suspend without any changes (e.g. when the problem happens).
before suspend: 27010015 0003f000 00000d20
after resume: 27010015 0003f000 00000820
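The only difference between the two dumps is the third dword. A minimal Python sketch of how the three dwords can be read, assuming the Resizable BAR extended capability layout used by kernels of this era: capability ID in the low 16 bits of the header, supported sizes from bit 4 upward in the capability register, and BAR index in bits 2:0, number of resizable BARs in bits 7:5, and BAR size in bits 12:8 of the control register. The function names are illustrative.

```python
def size_code_to_bytes(code):
    """A size code n stands for 2**(n + 20) bytes (code 0 = 1 MiB)."""
    return 1 << (code + 20)


def decode_rebar(header, cap, ctrl):
    """Split the three dwords at ECAP15, ECAP15+4 and ECAP15+8 into fields."""
    return {
        "cap_id": header & 0xFFFF,
        # Bit (n + 4) of the capability register means size code n is supported.
        "supported_codes": [n for n in range(20) if cap & (1 << (n + 4))],
        "bar_index": ctrl & 0x7,
        "num_bars": (ctrl >> 5) & 0x7,
        "size_code": (ctrl >> 8) & 0x1F,
    }


before = decode_rebar(0x27010015, 0x0003F000, 0x00000D20)
after = decode_rebar(0x27010015, 0x0003F000, 0x00000820)

for label, regs in (("before suspend", before), ("after resume", after)):
    mib = size_code_to_bytes(regs["size_code"]) >> 20
    print(f"{label}: BAR {regs['bar_index']} size code {regs['size_code']} = {mib} MiB")
```

Read this way, BAR 0 is set to size code 13 (8 GiB) before suspend and is back at size code 8 (256 MiB) after resume, which matches the "resized from 8 to 13" message in comment #21. The capability register advertises size codes 8 through 13.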
--- Comment #24 from Christian König (christian.koenig@amd.com) --- Created attachment 276547 --> https://bugzilla.kernel.org/attachment.cgi?id=276547&action=edit Possible fix
In this case please try the attached patch and see if it helps.
--- Comment #25 from Alexander Mezin (mezin.alexander@gmail.com) --- Yes, it works
dmesg: [ 34.330683] amdgpu 0000:65:00.0: Test 0 from 8 to 13
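The attached patch is not reproduced in this thread, so the following is not the actual fix. It is only a Python model of the register arithmetic implied by the values in comment #23: taking the control dword seen after resume and writing the size code that matches the 8 GiB BAR 0 window back into it. The mask, shift and function names are assumptions for illustration (BAR size field in bits 12:8).

```python
# Assumed layout of the Resizable BAR control register (BAR size in bits 12:8).
REBAR_CTRL_SIZE_SHIFT = 8
REBAR_CTRL_SIZE_MASK = 0x1F << REBAR_CTRL_SIZE_SHIFT


def bytes_to_size_code(nbytes):
    """Size code n means 2**(n + 20) bytes, so n = log2(nbytes) - 20."""
    if nbytes < (1 << 20) or nbytes & (nbytes - 1):
        raise ValueError("BAR size must be a power of two of at least 1 MiB")
    return nbytes.bit_length() - 1 - 20


def with_bar_size(ctrl, size_code):
    """Return ctrl with only the BAR size field replaced (read-modify-write)."""
    return (ctrl & ~REBAR_CTRL_SIZE_MASK) | (size_code << REBAR_CTRL_SIZE_SHIFT)


ctrl_after_resume = 0x00000820                # ECAP15+8 after resume, comment #23
bar0_bytes = 0xE1FFFFFFFF - 0xE000000000 + 1  # BAR 0 range from comment #13
restored = with_bar_size(ctrl_after_resume, bytes_to_size_code(bar0_bytes))
print(f"{restored:08x}")
```

With these inputs the result is 00000d20, the value the register held before suspend.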
Joern Hoffmann (j.hoffmann@quapona.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |j.hoffmann@quapona.com
--- Comment #26 from Joern Hoffmann (j.hoffmann@quapona.com) --- It works for me too.
dmesg | grep amdgpu:
[ 3.437098] [drm] amdgpu kernel modesetting enabled.
[ 3.442103] fb: switching to amdgpudrmfb from EFI VGA
[ 3.442234] amdgpu 0000:01:00.0: enabling device (0006 -> 0007)
[ 3.443795] amdgpu 0000:01:00.0: BAR 2: releasing [mem 0xd0000000-0xd01fffff 64bit pref]
[ 3.443797] amdgpu 0000:01:00.0: BAR 0: releasing [mem 0xc0000000-0xcfffffff 64bit pref]
[ 3.443822] amdgpu 0000:01:00.0: BAR 0: assigned [mem 0x2200000000-0x23ffffffff 64bit pref]
[ 3.443827] amdgpu 0000:01:00.0: BAR 2: assigned [mem 0x2100000000-0x21001fffff 64bit pref]
[ 3.443849] amdgpu 0000:01:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[ 3.443850] amdgpu 0000:01:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[ 3.443917] [drm] amdgpu: 8192M of VRAM memory ready
[ 3.443918] [drm] amdgpu: 8192M of GTT memory ready.
[ 4.239650] fbcon: amdgpudrmfb (fb0) is primary device
[ 4.323338] amdgpu 0000:01:00.0: fb0: amdgpudrmfb frame buffer device
[ 4.340440] [drm] Initialized amdgpu 3.25.0 20150101 for 0000:01:00.0 on minor 0
[ 10.704309] amdgpu 0000:01:00.0: 00000000a78be373 unpin not necessary
[ 10.704310] amdgpu 0000:01:00.0: 00000000a78be373 unpin not necessary
[ 10.704310] amdgpu 0000:01:00.0: 000000006047af5e unpin not necessary
[ 10.704311] amdgpu 0000:01:00.0: 000000002d9a27ec unpin not necessary
[ 11.443673] amdgpu 0000:01:00.0: Test 0 from 8 to 13
--- Comment #27 from Alexander Mezin (mezin.alexander@gmail.com) --- So the patch will only land in 4.19. Are you going to fix the regression (in amdgpu) for 4.15-4.18 somehow?
Aleksandr Mezin (mezin.alexander@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |CODE_FIX
--- Comment #28 from Aleksandr Mezin (mezin.alexander@gmail.com) --- Seems to be fixed in 4.18.5 by backport