https://bugzilla.kernel.org/show_bug.cgi?id=204227
Bug ID: 204227
Summary: Visual artefacts and crash from suspend on amdgpu
Product: Drivers
Version: 2.5
Kernel Version: 5.2.1
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: dolohow@outlook.com
Regression: No
Created attachment 283823 --> https://bugzilla.kernel.org/attachment.cgi?id=283823&action=edit dmesg
After upgrading the kernel from 5.1.14 to 5.2.1 I see many visual artifacts during a desktop session. Also, when resuming from suspend, the external monitor turns green and the kernel crashes. See the attached dmesg.
--- Comment #1 from dolohow (dolohow@outlook.com) --- Created attachment 283825 --> https://bugzilla.kernel.org/attachment.cgi?id=283825&action=edit lspci
Alex Deucher (alexdeucher@gmail.com) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |alexdeucher@gmail.com
--- Comment #2 from Alex Deucher (alexdeucher@gmail.com) --- Can you bisect?
--- Comment #3 from dolohow (dolohow@outlook.com) --- Well, that took me some time...
Looks like this is the cause...
005440066f929ba0dca8f4e0aebfbf8daac592cc is the first bad commit
commit 005440066f929ba0dca8f4e0aebfbf8daac592cc
Author: Huang Rui <ray.huang@amd.com>
Date:   Wed Mar 13 20:21:00 2019 +0800
drm/amdgpu: enable gfxoff again on raven series (v2)
This patch enables gfxoff and stutter mode again, since we take more testing on raven series. For raven2 and picasso, we can enable it directly. And for raven, we need check the RLC/SMC ucode version cannot be less than #531/0x1e45.
v2: add smc version checking for raven.
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com> (v1)
Tested-by: Likun Gao <Likun.Gao@amd.com> (v2)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c          |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c             |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c               | 21 +++++++++++++++++++++
 drivers/gpu/drm/amd/powerplay/hwmgr/smu10_hwmgr.c   | 13 ++++---------
 drivers/gpu/drm/amd/powerplay/smumgr/smu10_smumgr.c |  4 ++++
 5 files changed, 33 insertions(+), 11 deletions(-)
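For readers skimming the thread, the gating rule described in the commit message can be sketched roughly as follows. This is only an illustration in Python, not the driver code (which is C and lives in the files listed in the diffstat). The function name, the ASIC strings and the handling of other ASICs are assumptions made for the sketch; only the two thresholds (531 and 0x1e45) come from the commit message.

```python
# Illustrative sketch only; the real check is implemented in C in the amdgpu driver.
# Thresholds are the ones quoted in the commit message.
MIN_RLC_UCODE_VERSION = 531
MIN_SMC_UCODE_VERSION = 0x1E45


def gfxoff_allowed(asic: str, rlc_version: int, smc_version: int) -> bool:
    """Return True if gfxoff would be enabled under the rule in the commit message.

    `asic` is a hypothetical lowercase name: "raven", "raven2" or "picasso".
    """
    if asic in ("raven2", "picasso"):
        # "For raven2 and picasso, we can enable it directly."
        return True
    if asic == "raven":
        # "RLC/SMC ucode version cannot be less than #531/0x1e45."
        return (rlc_version >= MIN_RLC_UCODE_VERSION
                and smc_version >= MIN_SMC_UCODE_VERSION)
    # Assumption: ASICs outside the raven series are not modelled by this sketch.
    return False
```

Under this reading, a Raven system with older RLC or SMC firmware would keep gfxoff disabled, while one with new enough firmware would have it switched on by the bisected commit.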
tones111@hotmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |tones111@hotmail.com
--- Comment #4 from tones111@hotmail.com --- I'm seeing the same problems when running 5.2.x that were not present in 5.1. The commit above is the source of the visual artifacts, but I believe the lockup issue was introduced later. Is there any help I can provide in testing a fix?
It looks like there might have been some previous effort here: https://www.spinics.net/lists/amd-gfx/msg32192.html
I created https://bugzilla.kernel.org/show_bug.cgi?id=204611 that can be used to track the lockup issue.
--- Comment #5 from Alex Deucher (alexdeucher@gmail.com) --- This issue should be fixed with this patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--- Comment #6 from Łukasz Żarnowiecki (lukasz@zarnowiecki.pl) --- (In reply to Alex Deucher from comment #5)
> This issue should be fixed with this patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=98f58ada2d37e68125c056f1fc005748251879c2
Is this patch going to 5.2?
--- Comment #7 from Alex Deucher (alexdeucher@gmail.com) --- yes.
--- Comment #8 from tones111@hotmail.com --- I applied this to 5.2.10 and I'm still seeing artifacts.
--- Comment #9 from tones111@hotmail.com --- (In reply to tones111 from comment #8)
> I applied this to 5.2.10 and I'm still seeing artifacts.
Sorry, I realized that statement doesn't give much context to work with. My system has an R5 2500U. lspci shows the following:
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
        Flags: bus master, fast devsel, latency 0, IRQ 51
        Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Memory at c0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at 1000 [size=256]
        Memory at c0600000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/4 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=3 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [200] Resizable BAR <?>
        Capabilities: [270] Secondary PCI Express <?>
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
Mirek Kratochvil (exa.exa@gmail.com) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |exa.exa@gmail.com
--- Comment #10 from Mirek Kratochvil (exa.exa@gmail.com) --- Hello everyone,
do the artefacts look like the ones in this picture, or am I having a different issue? http://e-x-a.org/stuff/amdgpu-artefacts.jpg (Taken with a phone, as the artefacts are not screenshottable.)
The squares appear around small things that change (especially terminal text) and disappear after around half a second. Notably, they are only seen in xfce (I suspect a compositor is needed); not in LightDM (which does not do composition) nor around any frequently refreshed/accelerated surface (glxgears and animations in Firefox are clean).
Mine is:
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev c3) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
        Flags: bus master, fast devsel, latency 0, IRQ 58
        Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Memory at c0000000 (64-bit, prefetchable) [size=2M]
        I/O ports at 1000 [size=256]
        Memory at c0800000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/4 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable- Count=3 Masked-
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [200] Resizable BAR <?>
        Capabilities: [270] Secondary PCI Express <?>
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
The problem happens on all 5.2 kernels I tried (from Debian). "Debian stable" 4.19 and the one 5.1 kernel I tried are OK.
If these are a different kind of artifact, please let me know (I'd open a separate bug).
Thanks in advance! -mk
--- Comment #11 from tones111@hotmail.com --- Some good news. After a bios update to Lenovo's E485/E585 1.54 I no longer need to provide additional boot arguments in order for the machine to come up and the visual artifacts have gone away.
I would see issues with some fonts in Firefox that looked similar to your screenshot. The easiest way for me to reproduce the problem was to resize my terminal (Alacritty) or scroll around in gitk or gvim.
After a few days running on the new bios I haven't seen the artifacts, so this bug looks to be resolved for me since kernel 5.2.11.
Thanks!
--- Comment #12 from Mirek Kratochvil (exa.exa@gmail.com) --- That sounds great, thank you very much for the information and confirmation. I will try to update the BIOS and confirm ASAP.
--- Comment #13 from Mirek Kratochvil (exa.exa@gmail.com) --- After the BIOS upgrade the kernel parameters can be removed, but the kernel (5.2.16) now locks up when entering XFCE (it survives lightdm though). The error is almost the same as in the posted dmesg; I'll attach mine with backtraces in a few seconds.
Highlights:
This gets printed out before each warning: [ 66.159175] [drm] pstate TEST_DEBUG_DATA: 0x36F60000
R08 increases by some value between 49 and 56 with each subsequent warning (the value is sometimes in R10 instead)
Userspace otherwise seems to be working (the logs are from syslog); the display just won't show anything.
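If anyone wants to check the register pattern against their own log, something along these lines could pull the values out and print the step between consecutive warnings. It is a rough sketch: the function name is made up, the regular expression assumes the usual "R08: <16 hex digits>" layout of kernel register dumps, and the sample lines below are synthetic, not copied from the attached syslog.

```python
import re
from typing import List


def register_deltas(log_text: str, register: str = "R08") -> List[int]:
    """Return the differences between consecutive values of `register` in a log."""
    # Assumes register dumps of the form "R08: 0000000000000400".
    pattern = re.compile(r"\b" + re.escape(register) + r":\s*([0-9a-fA-F]{16})\b")
    values = [int(match.group(1), 16) for match in pattern.finditer(log_text)]
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Synthetic sample lines, for illustration only.
sample = """\
[   66.159300] RBP: ffff9a8b40a13000 R08: 0000000000000400 R09: 0000000000000001
[   67.201100] RBP: ffff9a8b40a13000 R08: 0000000000000432 R09: 0000000000000001
[   68.244900] RBP: ffff9a8b40a13000 R08: 0000000000000469 R09: 0000000000000001
"""

print(register_deltas(sample))  # prints [50, 55]
```

Passing register="R10" covers the cases where the value shows up in R10 instead.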
I will try a few other kernels available for debian and eventually bisect.
--- Comment #14 from Mirek Kratochvil (exa.exa@gmail.com) --- Created attachment 285069 --> https://bugzilla.kernel.org/attachment.cgi?id=285069&action=edit syslog from 5.2.16 with warnings
Łukasz Żarnowiecki (lukasz@zarnowiecki.pl) changed:
What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |---     |CODE_FIX
--- Comment #15 from Łukasz Żarnowiecki (lukasz@zarnowiecki.pl) --- I updated the kernel to 5.3 and the problem disappeared. I did not update the BIOS or anything like that. Perhaps the problem you guys are facing is different from the one originally reported.