https://bugs.freedesktop.org/show_bug.cgi?id=108464
Bug ID: 108464
Summary: System fails to reboot after Ctrl-Alt-Del
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: duncan_roe@optusnet.com.au
Created attachment 142062 --> https://bugs.freedesktop.org/attachment.cgi?id=142062&action=edit kernel config
The system shuts down after a reboot command or Ctrl-Alt-Del but fails to actually reboot. The last line output to the virtual console is "Rebooting". The screen then turns off as normal, but fails to turn on again with the boot menu. Bisection finds that this problem was introduced by commit 0a1d56599b9bb58464a8bf1243191eb32b36b694, which patches drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c.
Hardware: as documented in Bug 108139
Alex Deucher alexdeucher@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |harry.wentland@amd.com,
                   |                            |sunpeng.li@amd.com
--- Comment #1 from Alex Deucher alexdeucher@gmail.com --- Are you sure the bisect is correct? This commit just changes debugfs, which shouldn't be triggered unless you actually write to the file in question. Please attach your xorg log (if using X) and dmesg output.
--- Comment #2 from Duncan Roe duncan_roe@optusnet.com.au --- Yes, I thought that was weird (debugfs). But adjacent commit 30cdbfaa6aa469347db7fcda5949f1ccf7559ecf does not show the problem.
--- Comment #3 from Duncan Roe duncan_roe@optusnet.com.au --- halt is fine btw, it's only reboot that breaks. Do you want extra debug turned on for dmesg?
--- Comment #4 from Duncan Roe duncan_roe@optusnet.com.au --- Created attachment 142063 --> https://bugs.freedesktop.org/attachment.cgi?id=142063&action=edit dmesg o/p as requested
No Xorg involvement - boot up to command line only
--- Comment #5 from Duncan Roe duncan_roe@optusnet.com.au --- Still present at 4.19.0-rc8. Is there any other info I can provide?
--- Comment #6 from Duncan Roe duncan_roe@optusnet.com.au --- (In reply to Duncan Roe from comment #3)
> halt is fine btw, it's only reboot that breaks. Do you want extra debug turned on for dmesg?
At Linux 4.19.0-rc8, power button / halt command also fails. The backlight goes off but the power stays on.
--- Comment #7 from Alex Deucher alexdeucher@gmail.com --- (In reply to Duncan Roe from comment #6)
> (In reply to Duncan Roe from comment #3)
> > halt is fine btw, it's only reboot that breaks. Do you want extra debug turned on for dmesg?
> At Linux 4.19.0-rc8, power button / halt command also fails. The backlight goes off but the power stays on.
Can you bisect that? Is it the same commit?
--- Comment #8 from Duncan Roe duncan_roe@optusnet.com.au --- (In reply to Alex Deucher from comment #1)
> Are you sure the bisect is correct? This commit just changes debugfs, which shouldn't be triggered unless you actually write to the file in question. Please attach your xorg log (if using X) and dmesg output.
No longer sure about that. I had been bisecting in anongit.freedesktop.org/drm/drm. When I switched to git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git, both 0a1d565 and 30cdbfa show the problem. So I can now try to bisect further.
--- Comment #9 from Duncan Roe duncan_roe@optusnet.com.au --- (In reply to Alex Deucher from comment #7)
> (In reply to Duncan Roe from comment #6)
> > (In reply to Duncan Roe from comment #3)
> > > halt is fine btw, it's only reboot that breaks. Do you want extra debug turned on for dmesg?
> > At Linux 4.19.0-rc8, power button / halt command also fails. The backlight goes off but the power stays on.
> Can you bisect that? Is it the same commit?
It is the same commit, i.e. 0a1d565 built from the stable tree consistently fails to power off. 30cdbfa *sometimes* fails to power off - I think I have seen 2 fails in 6 reboots (my spreadsheet isn't set up to count results (yet)).
--- Comment #10 from Duncan Roe duncan_roe@optusnet.com.au --- (In reply to Alex Deucher from comment #1)
> Are you sure the bisect is correct? This commit just changes debugfs, which shouldn't be triggered unless you actually write to the file in question. Please attach your xorg log (if using X) and dmesg output.
A fresh bisect in the stable tree has given a new pair of commits. e1cb3e4801e6896ba93d63222b1052199d2a8c9b has the problem and 899e2aaddbfa0ff96fbaf31f0d9e91427e87dd88 does not have it. (In the stable tree, 30cdbfaa6aa469347db7fcda5949f1ccf7559ecf also has the problem, unlike in the drm tree that I bisected previously). A diff of the new commit pair shows 14 patched files. These are TODO, Makefile and a mixture of .c and .h sources. I am unsure how to proceed with bisecting these diffs: advice welcome.
--- Comment #11 from Duncan Roe duncan_roe@optusnet.com.au --- There is a kernel Oops associated with this problem. I only just discovered it started on the same commit as reboot failure did. You can see the BUG line in attachment 142063 at time 5.075194
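For anyone checking their own logs for the same symptom, the timestamps of BUG or Oops lines can be listed from a saved dmesg dump. What follows is only a minimal sketch in Python, assuming the usual "[ seconds.microseconds] message" prefix. The function name find_bug_lines and the sample text are made up for illustration: only the 5.075194 timestamp comes from this report, and the message text is deliberately left as a placeholder rather than copied from attachment 142063.

```python
import re

# Matches the usual dmesg prefix "[    5.075194] " followed by the message.
DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def find_bug_lines(dmesg_text):
    """Return (timestamp, message) pairs for lines mentioning BUG or Oops."""
    hits = []
    for line in dmesg_text.splitlines():
        match = DMESG_LINE.match(line)
        if match is None:
            continue  # skip lines that carry no timestamp prefix
        stamp, message = match.groups()
        if "BUG" in message or "Oops" in message:
            hits.append((float(stamp), message))
    return hits


# Invented sample, NOT real log output: only the timestamp is from the report.
SAMPLE = """\
[    5.012345] placeholder line before the failure
[    5.075194] BUG: (message text elided)
[    5.075301] placeholder line after the failure
"""

for stamp, message in find_bug_lines(SAMPLE):
    print(f"{stamp:.6f}  {message}")
```

Running the same function over the output of each bisect step would make it easier to record whether the BUG appeared on a given boot.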
--- Comment #12 from Duncan Roe duncan_roe@optusnet.com.au --- Created attachment 142354 --> https://bugs.freedesktop.org/attachment.cgi?id=142354&action=edit Patch for Linux-19.0 to revert e1cb3e4
Revert commit e1cb3e4801e6896ba93d63222b1052199d2a8c9b (drm/amd/display: Convert remaining loggers off dc_logger). Reboot works again and the BUG / Oops is gone.
--- Comment #13 from Duncan Roe duncan_roe@optusnet.com.au --- Created attachment 143007 --> https://bugs.freedesktop.org/attachment.cgi?id=143007&action=edit Diagnostic patches to determine which pointer is null
These patches are against Linux 4.19.12, commit 2a7cb228d29c3882c1414c10a44c5f3f59bfa44d in git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
--- Comment #14 from Duncan Roe duncan_roe@optusnet.com.au --- Created attachment 143008 --> https://bugs.freedesktop.org/attachment.cgi?id=143008&action=edit dmesg o/p with attachment 143007
The exception occurs in dc_link_aux_transfer, which is called by dm_dp_aux_transfer, the top displayed function on the stack after the BUG line in attachment 142063. With the patch there is no BUG entry; instead there is a line "Cowardly refusing to call through null pointer", after which the patch makes dc_link_aux_transfer return -1. Code somewhere up the stack attempts 2 retries.
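The behaviour described above (refuse the call when the engine pointer is missing, return -1, and let the caller retry) can be modelled in a few lines. This is a Python model for illustration only, not the kernel code and not the contents of attachment 143007: the Link class, aux_transfer, and transfer_with_retries are hypothetical names, and None stands in for a NULL aux_engine pointer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Link:
    """Hypothetical stand-in for a display link; None models a NULL pointer."""
    name: str
    aux_engine: Optional[Callable[[bytes], int]] = None


def aux_transfer(link, payload):
    """Guarded call: refuse to call through a missing engine and return -1."""
    if link.aux_engine is None:
        print(f"{link.name}: Cowardly refusing to call through null pointer")
        return -1
    return link.aux_engine(payload)


def transfer_with_retries(link, payload, retries=2):
    """Model of a caller that retries a failed transfer up to `retries` times."""
    result = -1
    for _ in range(1 + retries):
        result = aux_transfer(link, payload)
        if result >= 0:
            break  # success, no further attempts needed
    return result


working = Link("link-with-engine", aux_engine=len)  # len() plays the engine
broken = Link("link-without-engine")                # aux_engine left as None

print(transfer_with_retries(working, b"abc"))
print(transfer_with_retries(broken, b"abc"))
```

The guard only turns the crash into an error return; the link without an engine still never completes a transfer, which matches the observation that the retries do not help.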
--- Comment #15 from Duncan Roe duncan_roe@optusnet.com.au --- Further to attachment 143008: there are lots of calls to dm_dp_aux_transfer with aux=00000000f0bfdb41, but the first call with aux=0000000074cc4227 fails (because the aux_engine pointer is NULL). Then there are a few more calls with 00000000f0bfdb41, 2 more with 0000000074cc4227 and lastly a few with 00000000f0bfdb41 again. Does that pattern jog anyone's memory?
Is anyone else reproducing this bug? https://bugs.freedesktop.org/show_bug.cgi?id=108139#c5 mentions the name "Stoney" (chipset?) in case that is any help.
If no-one else is reproducing this, what would be the most helpful thing I could try next? I don't see this behaviour in a VM, so can't gdb it.
--- Comment #16 from Harry Wentland harry.wentland@amd.com --- Created attachment 143011 --> https://bugs.freedesktop.org/attachment.cgi?id=143011&action=edit [PATCH] drm/amd/display: Limit number of links to num_ddc
Can you see if this patch helps you?
--- Comment #17 from Duncan Roe duncan_roe@optusnet.com.au --- Created attachment 143055 --> https://bugs.freedesktop.org/attachment.cgi?id=143055&action=edit dmesg o/p with diags after applying attachment 143011
The line before the BUG line shows a null pointer
--- Comment #18 from Duncan Roe duncan_roe@optusnet.com.au --- Mixed results on applying this patch.
IN BRIEF: If you could eliminate this second Oops, then we can see what works and what doesn't.
In the meantime, with the patch applied to v4.20 in the stable repository, reboot *sometimes* works. Ctl-Alt-Del without logging in seems not to. Logging in as root and issuing the reboot command: no. Well, it did work for me a couple of times, but I can't seem to do it again.
Another thing: I boot to command level. If I let the VC time out (backlight goes off), then I can never wake it again no matter what keys I press. The Caps Lock light goes on and off, so the keyboard is still active. Hopefully this all gets better once there is no Oops.
Attachment 143055 pinpoints the immediate NULL pointer. Again this is a new aux_engine.
--- Comment #19 from Duncan Roe duncan_roe@optusnet.com.au --- Comment on attachment 143011 --> https://bugs.freedesktop.org/attachment.cgi?id=143011 [PATCH] drm/amd/display: Limit number of links to num_ddc
Review of attachment 143011: -----------------------------------------------------------------
Diagnostics show that this patch has no effect because the compared quantities are always equal
--- Comment #20 from Duncan Roe duncan_roe@optusnet.com.au --- Created attachment 143344 --> https://bugs.freedesktop.org/attachment.cgi?id=143344&action=edit Display connectors_num & res_cap->num_ddc before compare
--- Comment #21 from Duncan Roe duncan_roe@optusnet.com.au --- Restarting investigations at Linux 5.0.0-rc5. Modified attachment 143055 to check whether the patch would trigger. It never would. New patch is attachment 143344.
--- Comment #22 from Duncan Roe duncan_roe@optusnet.com.au --- Created attachment 143346 --> https://bugs.freedesktop.org/attachment.cgi?id=143346&action=edit dmesg o/p showing output from Attachment 143344
This is a typical BUG occurrence. Stack trace looks similar to attachment 143055. Would it help to add diagnostics as in 143055? Which variables would you want to see?
--- Comment #23 from Duncan Roe duncan_roe@optusnet.com.au --- Here are some notes for anyone trying to reproduce this problem. By "this problem" I mean failure to reboot after "reboot" command issued.
1. On my system (Slackware, no systemd) I am triggering a reboot by Ctl-Alt-Del and this line in /etc/inittab: ca::ctrlaltdel:/sbin/shutdown -t5 -r now
2. I am only booting up to the command line (no X).
3. The occurrence of BUG is intermittent. I am seeing it on about 2 reboots in 3.
4. If there is no BUG, the next reboot will be OK.
5. If a user logs in before Ctl-Alt-Del, reboot with BUG still works maybe 90% of the time.
6. With BUG present, Ctl-Alt-Del at the login prompt succeeds about 50% of the time (measured over 25 reboots). (I put a dmesg command in rc.local to do this test).
With e1cb3e4801 reverted, reboot always works and BUG never shows. Were it not for that, I would be suspecting a local hardware problem by now
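Because the symptom is intermittent, a single clean boot does not prove a commit is good, which matters when bisecting. The sketch below estimates how many consecutive clean boots are needed before a commit can reasonably be marked good. It assumes each boot shows the symptom independently with a fixed probability, which is a simplification that the notes above do not confirm; boots_needed and its parameters are names made up for this example.

```python
def boots_needed(symptom_rate, max_miss_probability=0.01):
    """Smallest number of consecutive clean boots after which the chance that
    a bad commit was simply lucky every time is at most max_miss_probability.

    symptom_rate is the assumed per-boot probability that a bad commit shows
    the symptom; boots are assumed independent (a simplification).
    """
    if not 0.0 < symptom_rate < 1.0:
        raise ValueError("symptom_rate must be strictly between 0 and 1")
    boots = 0
    miss = 1.0  # probability that a bad commit has looked clean so far
    while miss > max_miss_probability:
        boots += 1
        miss *= 1.0 - symptom_rate
    return boots


# Rates taken from the thread: about 2 in 3, about 50%, and 2 fails in 6.
for rate in (2 / 3, 0.5, 1 / 3):
    print(f"symptom rate {rate:.2f}: {boots_needed(rate)} clean boots")
```

At the lower rates a handful of test boots per bisect step is not enough, which may be one reason an earlier bisect landed on an unrelated debugfs commit.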
--- Comment #24 from Duncan Roe duncan_roe@optusnet.com.au --- (In reply to Duncan Roe from comment #23)
> Here are some notes for anyone trying to reproduce this problem. By "this problem" I mean failure to reboot after "reboot" command issued.
> 1. On my system (Slackware, no systemd) I am triggering a reboot by Ctl-Alt-Del and this line in /etc/inittab: ca::ctrlaltdel:/sbin/shutdown -t5 -r now
> 2. I am only booting up to the command line (no X).
> 3. The occurrence of BUG is intermittent. I am seeing it on about 2 reboots in 3.
> 4. If there is no BUG, the next reboot will be OK.
> 5. If a user logs in before Ctl-Alt-Del, reboot with BUG still works maybe 90% of the time.
> 6. With BUG present, Ctl-Alt-Del at the login prompt succeeds about 50% of the time (measured over 25 reboots). (I put a dmesg command in rc.local to do this test).
> With e1cb3e4 reverted, reboot always works and BUG never shows. Were it not for that, I would be suspecting a local hardware problem by now
Duncan Roe duncan_roe@optusnet.com.au changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
--- Comment #25 from Duncan Roe duncan_roe@optusnet.com.au --- Since Linux 5.1, I do not see this bug any more. So I guess it is "fixed".