https://bugzilla.kernel.org/show_bug.cgi?id=199357
Bug ID: 199357
Summary: amdgpu: hang a few seconds after logging in, most likely due to regression
Product: Drivers
Version: 2.5
Kernel Version: v4.16
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: master.homer@gmail.com
Regression: No
Created attachment 275291 --> https://bugzilla.kernel.org/attachment.cgi?id=275291&action=edit Kernel log of the hang/crash
I've been testing kernel v4.16 on my computer, but it's basically unusable: a few seconds after logging in, the system hits a soft lockup, and I can't even switch to a VT. I was, however, able to ssh into it, which is how I got the kernel log. Right as the hang happened, I can see this in the log:

Apr 11 14:04:13 homer-desktop kernel: [ 45.532038] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:45:crtc-1] flip_done timed out
Apr 11 14:04:23 homer-desktop kernel: [ 55.772028] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:37:plane-1] flip_done timed out
Apr 11 14:04:33 homer-desktop kernel: [ 66.012282] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:44:plane-7] flip_done timed out

and after that is the regular kernel crash.
I have tried this on both v4.16 and v4.16.1 with the same results. However, it doesn't happen on v4.15 (which is what I'm running now). So there must be some kind of regression between those releases.
I am running stable KDE neon (which is based on Ubuntu LTS) with precompiled kernels from the Ubuntu mainline PPA.
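For anyone triaging a similar log, the spacing of the timeouts quoted above can be checked with a short script. This is a minimal sketch and not part of the original report: the helper names `flip_timeouts` and `gaps` and the regular expression are illustrative assumptions, and the sample lines are the three messages from the report. On those lines the timeouts come out roughly 10.24 seconds apart.

```python
import re

# Assumed pattern: kernel timestamp in brackets, then the DRM object
# ([CRTC:...] or [PLANE:...]) that precedes "flip_done timed out".
FLIP_TIMEOUT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s.*\*ERROR\*\s+\[(?P<obj>[^\]]+)\]\s+flip_done timed out"
)


def flip_timeouts(lines):
    """Return (timestamp_seconds, object_name) for each flip_done timeout line."""
    hits = []
    for line in lines:
        match = FLIP_TIMEOUT.search(line)
        if match:
            hits.append((float(match.group("ts")), match.group("obj")))
    return hits


def gaps(hits):
    """Return the seconds elapsed between consecutive timeouts."""
    return [round(later[0] - earlier[0], 3) for earlier, later in zip(hits, hits[1:])]


# The three log lines quoted in the report.
SAMPLE = [
    "Apr 11 14:04:13 homer-desktop kernel: [ 45.532038] "
    "[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] "
    "*ERROR* [CRTC:45:crtc-1] flip_done timed out",
    "Apr 11 14:04:23 homer-desktop kernel: [ 55.772028] "
    "[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] "
    "*ERROR* [PLANE:37:plane-1] flip_done timed out",
    "Apr 11 14:04:33 homer-desktop kernel: [ 66.012282] "
    "[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] "
    "*ERROR* [PLANE:44:plane-7] flip_done timed out",
]

if __name__ == "__main__":
    found = flip_timeouts(SAMPLE)
    for (ts, obj), gap in zip(found, [None] + gaps(found)):
        suffix = "" if gap is None else f" (+{gap} s since previous)"
        print(f"{ts:>10.6f}  {obj}{suffix}")
```

To run it against a real log, pass the lines of a saved kernel log (for example, the output captured over ssh) to `flip_timeouts` instead of `SAMPLE`.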
--- Comment #1 from Mathias Tillman (master.homer@gmail.com) --- Created attachment 275293 --> https://bugzilla.kernel.org/attachment.cgi?id=275293&action=edit Hardware info
Christian König (christian.koenig@amd.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |harry.wentland@amd.com
--- Comment #2 from Christian König (christian.koenig@amd.com) --- Looks like an issue with DC to me.
Can you bisect?
--- Comment #3 from Mathias Tillman (master.homer@gmail.com) --- I've just finished running a bisect now, and I have concluded that commit 36cc549d59864b7161f0e23d710c1c4d1b9cf022 (drm/amd/display: disable CRTCs with NULL FB on their primary plane (V2)) causes the lock-up. Let me know if you need anything else.
--- Comment #4 from Christian König (christian.koenig@amd.com) --- Thanks, yeah that is clearly DC (display core).
Harry can you take a look?
--- Comment #5 from Harry Wentland (harry.wentland@amd.com) --- I've no idea why this causes "flip_done timed out" and locks the system right now, but we're currently also dealing with some more fallout from that change, in particular blinking/flickering display if redshift/nightlight is on. I'm reluctant to just revert the offending commit as it's not incorrect but seems to expose some other flaws in our atomic check/commit implementation.
--- Comment #6 from Michel Dänzer (michel@daenzer.net) --- (In reply to Harry Wentland from comment #5)
> I'm reluctant to just revert the offending commit as it's not incorrect but seems to expose some other flaws in our atomic check/commit implementation.
Unless a fix is at least on the horizon, since this commit introduced multiple issues, it would be nice to our users to revert it for the time being, then re-apply it when it's safe.
--- Comment #7 from Mathias Tillman (master.homer@gmail.com) --- Wanted to add some more info. The soft lockup clears after approximately 30 seconds, but a few seconds later the system locks up again, and the cycle repeats. Looking at the kernel log, it seems that when the lockup happens, it takes an abnormally long time to reach the dm_pflip_high_irq function, which is supposed to trigger the flip_done message. I've attached a new log with my added logging in case that helps.
Mathias Tillman (master.homer@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #275291|0                           |1
        is obsolete|                            |
--- Comment #8 from Mathias Tillman (master.homer@gmail.com) --- Created attachment 275337 --> https://bugzilla.kernel.org/attachment.cgi?id=275337&action=edit Kernel log with added logging
Mathias Tillman (master.homer@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |CODE_FIX
--- Comment #9 from Mathias Tillman (master.homer@gmail.com) --- Just saw that this has been reverted in git, so I will mark this as resolved.
--- Comment #10 from Mathias Tillman (master.homer@gmail.com) --- Since that commit was pushed to v4.16, shouldn't it also be reverted on linux-stable to make it to a future 4.16.y release?
--- Comment #11 from Alex Deucher (alexdeucher@gmail.com) --- Yes, the revert cc'ed stable so it will show up in 4.16 as well.