https://bugzilla.kernel.org/show_bug.cgi?id=211277
Bug ID: 211277
Summary: sometimes crash at s2ram-wake (Ryzen 3500U): amdgpu, drm, commit_tail, amdgpu_dm_atomic_commit_tail
Product: Drivers
Version: 2.5
Kernel Version: 5.10.4
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: kolAflash@kolahilft.de
Regression: No
I'm currently on Debian-11-Testing (Bullseye). For a few weeks now the system has sometimes (not always) failed to wake up from suspend. Most of the time suspend works, but about 1 in 10 times it crashes.
I attached /var/log/kern.log which holds plenty of information about the crash. Looks like the crash happened in amdgpu_dm.c:7273 (amdgpu_dm_atomic_commit_tail, Linux-5.10.4).
I'm pretty sure this behavior didn't appear a few months ago. So I guess a recent change is causing it. This may either be:
1. an updated package in Debian-Testing
Indeed I'm pretty sure the problem didn't appear before Linux-5.9. So maybe this is being caused by a change between Linux-5.8 and Linux-5.9. I'll try to test going back to Linux-5.8 in the next few days.
2. a BIOS update
In November 2020 I installed the BIOS update sp110770.exe. Before that I was using sp107599.exe. You can find the BIOS history attached. I'll also see if I can test a BIOS downgrade in the next few days.
--- Comment #1 from kolAflash (kolAflash@kolahilft.de) --- Created attachment 294747 --> https://bugzilla.kernel.org/attachment.cgi?id=294747&action=edit kern.log
--- Comment #2 from kolAflash (kolAflash@kolahilft.de) --- Created attachment 294749 --> https://bugzilla.kernel.org/attachment.cgi?id=294749&action=edit BIOS update history (just in case someone has a clue whether something looks suspicious and this might not be a Linux problem)
--- Comment #3 from kolAflash (kolAflash@kolahilft.de) --- I searched through my journalctl log.
I set up the whole system in May 2020 with Linux-5.6.7. (journalctl has everything back to that date)
The bug has appeared on the following dates, starting in October with Linux-5.8. So Linux-5.8 was also affected (contradicting my original post).
I used the system nearly every day and always use s2ram (never shutting down, only rebooting when needed for updates). So this can be seen statistically.
- 2020-10-21 with Linux-5.8.14 (Debian 5.8.0-3, installed after 2020-09-26)
- 2020-12-11 with Linux-5.9.11 (Debian 5.9.0-4, installed 2020-12-04)
- 2020-12-25 with Linux-5.9.11
- 2021-01-13 with Linux-5.10.4 (Debian 5.10.0-1, installed 2021-01-10)
- 2021-01-16 with Linux-5.10.4
- 2021-01-19 with Linux-5.10.4
So the bug didn't appear with Linux <= 5.7. And the bug's frequency increased with Linux-5.10.
In parallel I'm still trying to rule out other factors (BIOS updates, other software changes, ...). One thing that might be significant: Debian used GCC-9 for Linux-5.7, and starting with Linux-5.8 GCC-10 was used.
Jerome C (me@jeromec.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |me@jeromec.com
--- Comment #4 from Jerome C (me@jeromec.com) --- I too have a Ryzen 5 3500U and see random resumes where screen updates are very slow (1 frame change every 1-2 minutes), which makes it look like it has crashed. In the kernel logs I see a bunch of "flip_done timed out" and "amdgpu_dm_atomic_commit_tail" errors.
This never happened for me between 5.4.6 and 5.9.14. I first noticed it with 5.10.4 and never suspended on 5.10.0 - 5.10.3, so my guess is that the issue was introduced sometime in 5.10.0 - 5.10.3.
Do you have the kernel parameter "init_on_free=1" set, or "CONFIG_INIT_ON_FREE_DEFAULT_ON=y" in your kernel config? If so, try setting the kernel parameter "init_on_free=0". So far (for me, and still testing) it has resumed every time.
I think it's an issue with amdgpu and the kernel parameter "init_on_free=1" or kernel config "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", which zeroes memory on free/deallocation.
The kernel parameter "init_on_alloc=1" or kernel config "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" works fine for me.
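Whether init_on_free is in effect can be checked from the running system. The following is a minimal sketch, not a definitive tool: the function name init_on_free_state is made up for illustration, it only recognises the 0/1 spellings on the command line, and it assumes the distribution installs its kernel config under /boot.

```shell
#!/bin/sh
# Sketch only: init_on_free_state is a hypothetical helper name.
# $1 = kernel command line, $2 = path to the kernel config file
init_on_free_state() {
    state=""
    # The last occurrence on the command line is the one reported.
    for word in $1; do
        case $word in
            init_on_free=1) state="on (command line)" ;;
            init_on_free=0) state="off (command line)" ;;
        esac
    done
    if [ -n "$state" ]; then
        echo "$state"
        return 0
    fi
    # No command line override: fall back to the build-time default.
    if grep -q '^CONFIG_INIT_ON_FREE_DEFAULT_ON=y' "$2" 2>/dev/null; then
        echo "on (config default)"
    else
        echo "off (config default)"
    fi
}
```

On a live system this could be called as: init_on_free_state "$(cat /proc/cmdline)" "/boot/config-$(uname -r)"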
--- Comment #5 from Jerome C (me@jeromec.com) --- Created attachment 294879 --> https://bugzilla.kernel.org/attachment.cgi?id=294879&action=edit Kernel log
Unfortunately it crashed again although I've noticed it's been crashing a lot less (4-5 days) since I set kernel parameter "init_on_free=0".
I've attached a kernel log for 5.10.10
--- Comment #6 from kolAflash (kolAflash@kolahilft.de) --- (In reply to Jerome C from comment #4)
[...] Do you have kernel parameter set "init_on_free=1" or in your kernel config "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", [...]
I'm using the Debian-11 (Testing / Bullseye) standard kernel.
$ grep -i init_on_free /boot/config-5.10.0-2-amd64
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
--- Comment #7 from Jerome C (me@jeromec.com) --- ok, you have it turned off already
A weird thing happened this morning... I woke my laptop up and the screen updates were slow again... Frustrated, I just closed my laptop lid... I noticed it suspended again... I opened my laptop again and it resumed
I looked in my kernel logs and saw the error messages from the first resume
NOTE: only copied the error messages
[drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CONNECTOR:73:eDP-1] flip_done timed out
[drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:52:plane-3] flip_done timed out
but on the second resume... no warnings or errors
I think it's a bug somewhere between suspension and resuming
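To compare a failed resume with a clean one, the "flip_done timed out" messages can be counted per display object. This is a rough sketch that assumes the log lines have the shape quoted in this comment; the function name summarize_flip_timeouts is invented for illustration.

```shell
#!/bin/sh
# Sketch only: summarize_flip_timeouts is a hypothetical helper name.
# Reads kernel log text on stdin and prints "<count> [TYPE:id:name]"
# for every CRTC, CONNECTOR or PLANE that reported a flip_done timeout.
summarize_flip_timeouts() {
    awk '/flip_done timed out/ {
        if (match($0, /\[(CRTC|CONNECTOR|PLANE):[0-9]+:[A-Za-z0-9_-]+\]/)) {
            count[substr($0, RSTART, RLENGTH)]++
        }
    }
    END { for (k in count) print count[k], k }' | LC_ALL=C sort -k2
}
```

For example, piping a saved kern.log into summarize_flip_timeouts lists each affected object once with its number of timeouts.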
--- Comment #8 from Jerome C (me@jeromec.com) --- I've tried kernel 5.11-rc5 and same issue occurs there.
For now I've downgraded kernel to 5.9.14 ( will update it to 5.9.16 ) until this issue is fixed
What I've mentioned in comment 4 isn't really helping I think
Sometimes the issue happens frequently in a day but then other times it could be a few days before it happens again
--- Comment #9 from kolAflash (kolAflash@kolahilft.de) --- I'm on Linux-5.7 now since 2021-01-26. And I woke up the notebook at least once a day since then. So it's clearly a regression in the kernel somewhere between 5.7 and 5.10 and probably between 5.7 and 5.8.
And it's definitely not a BIOS issue, because I haven't changed anything about the BIOS since the problem last appeared with Kernel-5.10.
Regards, kolAflash
Alex Deucher (alexdeucher@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com
--- Comment #10 from Alex Deucher (alexdeucher@gmail.com) --- Can you bisect? https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
--- Comment #11 from kolAflash (kolAflash@kolahilft.de) --- (In reply to Alex Deucher from comment #10)
Can you bisect? https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
I will try to.
But it will definitely need some time and may not be possible at all, because the bug cannot be reproduced deterministically.
--- Comment #12 from kolAflash (kolAflash@kolahilft.de) --- I've tried doing a bisect using this script. Unfortunately I couldn't reproduce the bug this way. So bisecting will take a lot longer.
for i in {0..19}; do
  echo -e "\n${i}"
  /usr/sbin/rtcwake --seconds 15 --mode no
  systemctl start suspend.target
  sleep 15
done
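The loop above stops only after a fixed number of iterations. A variant that also checks the kernel log after each resume and stops at the first flip_done timeout could look like the sketch below. The function names suspend_once, read_kernel_log and run_cycles are made up for illustration, reading the log via dmesg may require root, and the 15 second pause is simply carried over from the script above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any tool.

# Pause after each resume so that a timeout has time to be logged.
SETTLE_SECONDS="${SETTLE_SECONDS:-15}"

# One suspend/resume cycle, using the same commands as the script above.
suspend_once() {
    /usr/sbin/rtcwake --seconds 15 --mode no >/dev/null &&
        systemctl start suspend.target
}

# Kernel log of the current boot (may need root on some systems).
read_kernel_log() {
    dmesg
}

# run_cycles N: stop at the first resume that logged a flip_done timeout.
run_cycles() {
    i=1
    while [ "$i" -le "$1" ]; do
        suspend_once || return 2
        sleep "$SETTLE_SECONDS"
        if read_kernel_log | grep -q 'flip_done timed out'; then
            echo "failed after $i cycles"
            return 1
        fi
        i=$((i + 1))
    done
    echo "passed $1 cycles"
}
```

Nothing is suspended when the file is merely sourced; calling run_cycles 150 as root would then perform up to 150 cycles.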
--- Comment #13 from Jerome C (me@jeromec.com) --- (In reply to kolAflash from comment #12)
Hiya
I did some testing myself recently and unfortunately doing 20 tests was not enough for me. I found that it could take 50 - 100 resumes before it would fail, so I capped mine at 150 resumes; there were too many times where things looked fine for me with fewer than 50. I then tested kernels between 5.10.4 and 5.11-rc5 (I didn't use 5.10.0 to 5.10.3) and found that this commit
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
was causing the issue for me
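Because the failure is intermittent, the number of clean resumes needed before calling a kernel "good" during a bisect can be estimated. Assuming (as a simplification) that every resume fails independently with probability p, a bad kernel survives n resumes in a row with probability (1 - p)^n, so n >= ln(alpha) / ln(1 - p) cycles keep the risk of a false "good" below alpha. The helper name cycles_needed and the example numbers are illustrative only.

```shell
#!/bin/sh
# Sketch only: cycles_needed is a hypothetical helper name.
# $1 = assumed per-resume failure probability p (0 < p < 1)
# $2 = accepted risk alpha of mislabelling a bad kernel as good
cycles_needed() {
    awk -v p="$1" -v a="$2" 'BEGIN {
        n = log(a) / log(1 - p)   # natural logarithms
        c = int(n)
        if (c < n) c++            # round up to a whole cycle
        print c
    }'
}
```

With one failure per 50 resumes on average (p = 0.02) and a 5% accepted risk, cycles_needed 0.02 0.05 prints 149, which is in line with capping a run at about 150 resumes.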
--- Comment #14 from kolAflash (kolAflash@kolahilft.de) --- (In reply to Jerome C from comment #13)
I don't get how you got to your results. There's no straight path from 5.10.4 to 5.11-rc5, as they are on different branches (5.10.y and master).
Nevertheless, your result may be reasonable from the point of view of the git history. I'm not sure about the commit ID a10aad137, but it has a completely identical twin commit c6d2b0fbb (also removing AMD_PG_SUPPORT_VCN_DPG from that expression). https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... And c6d2b0fbb was applied between v5.10-rc2 and v5.10-rc3 (a10aad137 is only in master).
So if c6d2b0fbb (a.k.a. a10aad137) is responsible, this explains why I started noticing the problem when Debian-Testing went from Linux-5.9 to Linux-5.10.
I'm now running a 5.10.21 kernel where I reverted c6d2b0fbb. And I'll try using this kernel for at least one week and also run some iterative tests with it.
Regarding reproduction in general:
I really wonder what triggers this bug. I didn't go so far as to test with more than 50 tests (sleep-wake iterations). In particular, I didn't try more than 50 because the bug definitely appeared more often when it happened under "natural" (non-testing) circumstances.
Some test series I did which are hard to make sense of statistically: I tried 20 tests and nothing happened. A few minutes later I decided to try 50 more tests and it directly failed on the first one. So I had to reboot, tried again 50 tests and nothing happened. Afterwards I put my notebook into s2ram and when I woke it the next day it immediately crashed.
By the way the two times it crashed recently (see above) happened with a kernel I compiled from clean kernel.org sources. Also I never experienced the bug with a clean 5.8.18 compiled from kernel.org running with the same system for about a week. So I'm quite convinced it's nothing Debian specific.
--- Comment #15 from kolAflash (kolAflash@kolahilft.de) --- (In reply to Alex Deucher from comment #10)
Can you bisect? https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
I've done several s2ram-wakeup cycles (100 automatic and about three manual wakeups/day) with the kernel I compiled on 2021-03-07.
It's based on 5.10.21 with c6d2b0fbb reverted (as suggested by Jerome). Result: No crashes. This looks very promising!
@Alex Can I help with anything else to solve this?
I also compiled 5.10.21 without reverting c6d2b0fbb, tested it for a few hours and got three wakeup-crashes.
--- Comment #16 from kolAflash (kolAflash@kolahilft.de) --- @Alex Any progress on this?
If there's no perfect way to fix this, what about an option to turn on/off this behaviour? A module option that can be changed at runtime would be ideal. So it can be set right before suspending. But a kernel boot parameter would be fine too.
P.S. Would someone be so kind and set this bug to "confirmed"?
--- Comment #17 from Alex Deucher (alexdeucher@gmail.com) --- I don't think we've been able to reproduce it. That said, we did double check the programming sequences and I believe it may be fixed with these patches: https://gitlab.freedesktop.org/agd5f/linux/-/commit/71efc8701a47aa9e3de74bab... https://gitlab.freedesktop.org/agd5f/linux/-/commit/a8f768874aaf751738a2e035...
--- Comment #18 from Jerome C (me@jeromec.com) --- (In reply to Alex Deucher from comment #17)
I don't think we've been able to reproduce it. That said, we did double check the programming sequences and I believe it may be fixed with these patches: https://gitlab.freedesktop.org/agd5f/linux/-/commit/71efc8701a47aa9e3de74bab06020da81757893f https://gitlab.freedesktop.org/agd5f/linux/-/commit/a8f768874aaf751738a2e0350bf2e70085f93ace
I've tried these two commits and unfortunately the issue is still there
jamesz@amd.com (jamesz@amd.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamesz@amd.com
--- Comment #19 from jamesz@amd.com (jamesz@amd.com) --- Created attachment 296841 --> https://bugzilla.kernel.org/attachment.cgi?id=296841&action=edit to fix suspend/resume hung issue
Hi @kolAflash and @jeromec, Can you help check whether this patch fixes the issue? We can't reproduce it on our side. Thanks! James
--- Comment #20 from Jerome C (me@jeromec.com) --- (In reply to James Zhu from comment #19)
no, this doesn't work for me.
I'm curious how exactly you are trying to reproduce this
I start Xorg using the command "startx"
Xorg is running with LXQT
I start "Konsole" a gui terminal and execute the following
"for i in $(seq 1 150); do echo $i; sudo rtcwake -s 7 -m mem; done"
--- Comment #21 from James Zhu (jamesz@amd.com) --- Hi Jeromec, to isolate the cause, can you help run two experiments separately?
1. Run suspend/resume without launching Xorg, just in text mode.
2. Disable video acceleration (VCN IP). I need you to share the whole dmesg log after loading the amdgpu driver. I think running modprobe with ip_block_mask=0x0ff should disable the VCN IP for VCN1 (dmesg will tell you whether the VCN IP is disabled or not).
Thanks! James
--- Comment #22 from kolAflash (kolAflash@kolahilft.de) --- @James What do you mean by video acceleration? Is this about 3D / DRI acceleration like in video games? Or do you mean just "video" playback (movie, mp4, webm, h264, vp8, ...) acceleration?
And I don't completely understand what ip_block_mask=0x0ff is supposed to do. I just rebooted with that kernel parameter added and 3D acceleration (DRI) is still working.
----
I'm planning to run these kernels in the next few days:
1. Current Debian testing Linux-5.10.0-6 with ip_block_mask=0x0ff, Xorg and 3D acceleration in daily use.
2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg and with 3D acceleration in daily use.
3. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg, but without 3D acceleration** in daily use.
4. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff and without Xorg, doing some standby cycles for testing.
If I encounter any crash I'll post the whole dmesg starting with the boot output.
----
* amd-drm-next-5.14-2021-05-12 https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-drm-next-5.14-2021-05-... ae30d41eb
** Is there something special I should do to turn off acceleration? Or should I just not start any application doing 3D / DRI acceleration? (the latter might be difficult - I'd have to keep an eye on every application like Firefox, Atom, VLC, KWin/KDE window manager, ... so that it doesn't use DRI)
--- Comment #23 from James Zhu (jamesz@amd.com) --- Hi kolAflash,
VCN IP is for video acceleration (video playback). If the VCN IP doesn't handle the suspend/resume process properly, we do observe other IP blocks being affected. In your case the error is related to the display IP (dm). ip_block_mask=0xff (in grub it should be amdgpu.ip_block_mask=0x0ff) can disable the VCN IP while the amdgpu driver is loading, so this experiment can tell whether the dm error is caused by the VCN IP or not.
Sometimes /sys/kernel/debug/dri/0/amdgpu_fence_info can provide useful information if there is a chance to dump it.
These experiments can help identify which IP causes the issue, so we can find an expert in that area to continue the triage.
Your current report is case 2, so it can be replaced with:
2. amd-drm-next-5.14-2021-05-12* with ip_block_mask=0x0ff, with Xorg and without 3D acceleration in daily use.
I suggest you execute your test plan in the order 4->3->2->1.
Thanks! James
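How a mask value such as 0x0ff selects IP blocks can be illustrated with a little bit arithmetic. The sketch below assumes that bit i of the mask enables the i-th IP block the driver lists for the device, and that VCN sits at index 8 on this APU; both points are inferred from the discussion here rather than confirmed, and the dmesg output remains the authoritative indicator. The function name ip_block_enabled is invented for illustration.

```shell
#!/bin/sh
# Sketch only: ip_block_enabled is a hypothetical helper name.
# $1 = mask (for example 0x0ff), $2 = zero-based IP block index
# Assumption: bit i set means "IP block i is enabled".
ip_block_enabled() {
    if [ $(( ($1 >> $2) & 1 )) -eq 1 ]; then
        echo enabled
    else
        echo disabled
    fi
}
```

Under these assumptions, ip_block_enabled 0x0ff 7 prints "enabled" and ip_block_enabled 0x0ff 8 prints "disabled", so 0x0ff keeps the first eight blocks and drops the ninth.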
--- Comment #24 from Jerome C (me@jeromec.com) --- (In reply to James Zhu from comment #21)
Hi Jeromec, to isolate the cause, can you help run two experiments separately?
- To run suspend/resume without launching Xorg, just on text mode.
- To disable video acceleration (VCN IP). I need you share me the whole
dmesg log after loading amdgpu driver. I think basically running modprobe with ip_block_mask=0x0ff should disable vcn ip for VCN1.(you can find words in dmesg to tell you if vcn ip is disabled or not).
Thanks! James
1) In text mode, VCN enabled: suspension issues are still there
2) I see the message confirming that VCN is disabled.
   In text mode, VCN disabled: suspension issues are gone
   After starting Xorg, VCN disabled: suspension issues are gone
I'll gather those logs soon (sometime tomorrow)
--- Comment #25 from Jerome C (me@jeromec.com) --- I forgot to mention... I'm on kernel 5.13.4
--- Comment #26 from Jerome C (me@jeromec.com) --- (In reply to Jerome C from comment #25)
I forgot to mention... I'm on kernel 5.13.4
5.12.4 I mean
--- Comment #27 from James Zhu (jamesz@amd.com) --- Hi Jeromec, thanks for your feedback. Can you also add drm.debug=0x1ff as a modprobe option? I need logs for case 1: dmesg and /sys/kernel/debug/dri/0/amdgpu_fence_info (if you can). James.
--- Comment #28 from Jerome C (me@jeromec.com) --- Created attachment 296877 --> https://bugzilla.kernel.org/attachment.cgi?id=296877&action=edit AMDGPU fence info
(In reply to James Zhu from comment #27)
Hi Jeromec, thanks for your feedback, can you also add drm.debug=0x1ff modprobe? I need log: case 1 dmesg and /sys/kernel/debug/dri/0/amdgpu_fence_info (if you can). James.
I've tested text mode and gui/drm mode with "drm.debug=0x1ff" set and found no crashes... when "drm.debug=0x1ff" is unset... the crashes/timeouts are back... I think this is why you're unable to reproduce the problem...
I've never known debug option(s) to remove issue(s)... oh well
I've added the contents of the file "/sys/kernel/debug/dri/0/amdgpu_fence_info".
The file contains 4 different boot states (vcn on/off, drm debug on/off), clearly marked/separated in the attached file
I'm using 5.12.5 now but I also tried this on 5.12.4. Usually the crashes happen within 50 suspensions/resumes but today I left it to do over 2000 suspensions/resumes just to make sure...
I know you asked for a log, but I spent so much time on this (and other things too) that it wasn't on my mind, so I'll get that by Friday, if you still need it of course
thanks
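The value 0x1ff passed to drm.debug is a bitmask of log categories, which helps explain how much extra logging (and therefore timing change) it can introduce. The sketch below decodes such a mask; the category names and bit order are an assumption based on my reading of the DRM debug categories for kernels of this era and should be checked against the kernel in use. The function name drm_debug_categories is made up for illustration.

```shell
#!/bin/sh
# Sketch only: drm_debug_categories is a hypothetical helper name.
# Assumed bit order (bit 0 first); verify against the running kernel.
# $1 = drm.debug mask, for example 0x1ff
drm_debug_categories() {
    mask=$(( $1 ))
    bit=0
    out=""
    for name in core driver kms prime atomic vbl state lease dp; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="$out $name"
        fi
        bit=$((bit + 1))
    done
    # Strip the leading space before printing.
    echo "${out# }"
}
```

Under that assumption, drm_debug_categories 0x1ff lists all nine categories, while a narrower mask such as 0x6 would print only "driver kms".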
--- Comment #29 from James Zhu (jamesz@amd.com) --- Hi Jeromec, I think turning debug on changes the timing a little. A log without debug info can't help me. The amdgpu_fence_info looks good in all cases. This issue is possibly device-specific.
--- Comment #30 from kolAflash (kolAflash@kolahilft.de) --- Created attachment 296891 --> https://bugzilla.kernel.org/attachment.cgi?id=296891&action=edit all kernel messages with ip_block_mask=0x0ff (Debian kernel 5.10.0-6)
Also crashes with ip_block_mask=0x0ff Tested with the current Debian Testing kernel 5.10.0-6.
I attached all kernel messages from /var/log/messages from boot to crash. I think that should be the dmesg output.
--- Comment #31 from Jerome C (me@jeromec.com) --- (In reply to kolAflash from comment #30)
hiya, you may not know this, but you need to use "amdgpu.ip_block_mask=0x0ff" and not "ip_block_mask=0x0ff"
"ip_block_mask=0x0ff" is treated as a general kernel parameter and never reaches the amdgpu module
"amdgpu.ip_block_mask=0x0ff" applies to the amdgpu module
I can see in your kernel logs that VCN is still enabled
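A quick way to catch a missing module prefix is to look for the exact parameter name on the kernel command line. This is a minimal sketch; the function name has_kernel_param is invented for illustration.

```shell
#!/bin/sh
# Sketch only: has_kernel_param is a hypothetical helper name.
# $1 = kernel command line, $2 = exact parameter name to look for
# Succeeds only for "name" or "name=value", so a bare ip_block_mask
# does not count as amdgpu.ip_block_mask.
has_kernel_param() {
    for word in $1; do
        case $word in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}
```

On a running system: has_kernel_param "$(cat /proc/cmdline)" amdgpu.ip_block_mask && echo "prefixed parameter present". If the module exposes the parameter under /sys/module/amdgpu/parameters/, the value there can be compared as well (an assumption worth verifying on the kernel in use).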
--- Comment #32 from kolAflash (kolAflash@kolahilft.de) --- Created attachment 296901 --> https://bugzilla.kernel.org/attachment.cgi?id=296901&action=edit dmesg via SSH, running amd-drm-next-5.14-2021-05-12 without ip_block_mask=0x0ff and with Xorg
(In reply to Jerome C from comment #31)
[...] hiya, you may not know this but use in "amdgpu.ip_block_mask=0x0ff" and not "ip_block_mask=0x0ff" [...] I can see in your kernel logs that VCN is still enabled
Ooops you're right. I know someone wrote that before. But it seems I somehow missed it while editing my Grub parameters.
I'll give it another try!
----
In the meanwhile I performed test number 2.
- amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg [...]
This time the crash was very different!
After some minutes (about 3) the graphical screen actually turned back on. I'm pretty sure that didn't happen with the other kernels I tested. (never tested amd-drm-next-5.14-2021-05-12 before)
Nevertheless everything graphical is lagging extremely. If I move the mouse or do anything else it takes more than 10 seconds until something happens on the screen.
On the other hand, SSH access works smoothly, and I was able to save the dmesg output (see attachment). Unlocking the screen via SSH (loginctl) or starting graphical programs (DISPLAY=:0 xterm) works, but is extremely slow too (> 10 seconds of waiting).
--- Comment #33 from Jerome C (me@jeromec.com) --- (In reply to kolAflash from comment #32)
I experienced this lag too, although I didn't try the SSH approach (I don't have it set up)
--- Comment #34 from Jerome C (me@jeromec.com) --- Using 5.13.0 now and the issue is still here
(In reply to kolAflash from comment #32)
Do you have any updates since you corrected the kernel parameter?
--- Comment #35 from kolAflash (kolAflash@kolahilft.de) --- Created attachment 298193 --> https://bugzilla.kernel.org/attachment.cgi?id=298193&action=edit /var/log/kern.log running amd-drm-next-5.14-2021-05-12 (ae30d41eb) with Xorg
Sorry for the long delay. I've tested:
1. Current Debian-11 testing Linux-5.10.0-8 with amdgpu.ip_block_mask=0x0ff while running Xorg. Result: everything ok
2. amd-drm-next-5.14-2021-05-12* (ae30d41eb) without any special kernel options while running Xorg. Result:
- crashes
- also the screen starts flickering about every 10 seconds after the second resume
- flickering also happens when using a8f768874^ (before the first fix-commit by Alex D.)
- log attached: 5.12.0-rc7-original-ae30d41eb_crash.txt
3. Upstream Linux-5.14.0-rc4. Result: Still broken.
----
* amd-drm-next-5.14-2021-05-12 https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-drm-next-5.14-2021-05-... ae30d41eb
--- Comment #36 from Jerome C (me@jeromec.com) --- I've been watching linux-next and noticed that this commit
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/d...
was posted on linux-next sometime between 5.10 and 5.11 (I don't remember exactly), but it keeps getting pushed back and not mainlined...
I think this is why the issues are still here, and nobody from AMD has responded here since comment 29
--- Comment #37 from James Zhu (jamesz@amd.com) --- Hi Jerome and kolAflash, would you mind starting from your original test configuration and adding pci=noats to the boot parameters? For example:
linux /boot/vmlinuz-5.4.0-54-generic root=UUID=803844cc-7291-4056-bd04-f1b43b54ed97 ro pci=noats
See if this helps. Thanks! James
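To make such a parameter persistent on a GRUB-based system, it is usually appended to the default command line in /etc/default/grub. The filter below is a minimal sketch that assumes the conventional GRUB_CMDLINE_LINUX_DEFAULT="..." line with double quotes and a parameter without "/" characters; the function name add_grub_param is made up for illustration, and it only prints the modified text instead of editing the file.

```shell
#!/bin/sh
# Sketch only: add_grub_param is a hypothetical helper name.
# Reads /etc/default/grub-style text on stdin and appends $1 to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line; other lines pass through.
add_grub_param() {
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $1\"/"
}
```

For example, add_grub_param pci=noats < /etc/default/grub shows the result for review. After writing the change back, the GRUB configuration has to be regenerated (update-grub on Debian) and the system rebooted.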
--- Comment #38 from Jerome C (me@jeromec.com) --- Hi James,
With "pci=noats" set the suspension and resume works fine
I did see some errors ( something about device not added ) in the kernel log from "kfd" but I guess that's related to PCIe ATS being disabled with the kernel parameter set
Thanks
Jerome
--- Comment #40 from James Zhu (jamesz@amd.com) --- Hi Jerome, Yes, you are right. Turning off ATS will affect the IOMMU. KFD needs the IOMMU enabled. KFD supports the compute engine. It won't affect 3D and video acceleration. After I confirm whether ATS/IOMMU causes the issue, I will find the right person to fix it. Thanks! James
--- Comment #41 from kolAflash (kolAflash@kolahilft.de) --- I can confirm Jerome's result.
Bug is gone with pci=noats. (Debian-11 kernel 5.10.0-8-amd64)
I ran 50 suspend/standby rounds. Also I used the notebook for 2 days and suspended it multiple times without issues.
--- Comment #42 from James Zhu (jamesz@amd.com) --- Hi Jerome and kolAflash,
Thanks for the confirmation. I have a workaround for this issue, but I hope to find the root cause or a better workaround.
James
--- Comment #43 from kolAflash (kolAflash@kolahilft.de) --- (In reply to James Zhu from comment #42)
Hi Jerome and kolAflash,
Thanks for confirmation. I have a workaround for this issue. But I wish I can find the root cause or better workaround.
Thanks too for your help James!
For me personally the situation is quite fine with pci=noats. I'm sometimes using Qemu/KVM and VirtualBox. But no need for absolute bleeding edge VM performance. So I'll probably be fine with pci=noats.
However, I'd love to contribute to a fix for all users without kernel parameter stuff. (including a fix in longterm Linux-5.10 for Debian) So just tell me if I can help by doing more tests, sending logs, ... :-)
--- Comment #44 from James Zhu (jamesz@amd.com) --- Created attachment 298651 --> https://bugzilla.kernel.org/attachment.cgi?id=298651&action=edit A workaround for suspend/resume hung issue
The VCN block passed all ring tests; usually the VCN goes idle within 1 sec. Somehow it affected the later AMD IOMMU device resume, which is controlled by the KFD resume. This workaround gates the VCN block immediately once the ring test has passed. It can fix the suspend/resume hang issue.
Hi kolAflash, Please help check the WA in your setup. I will continue working on root cause. thanks! James
--- Comment #45 from Jerome C (me@jeromec.com) --- Unfortunately this failed after 138 suspend/resume cycles
Thanks
Jerome
--- Comment #47 from James Zhu (jamesz@amd.com) --- Hi Jerome, Thanks! I knew that with this issue it is not easy to judge whether it is fixed, since it occurs quite randomly. On my setup, this WA passed 5 runs of up to 300 suspend/resume cycles and 1 run of up to 3800 suspend/resume cycles. But I doubt that it addresses the root cause, so I treated it as a WA. It seems it is not a WA for all systems, though. James
Anthony Rabbito (ted437@gmail.com) changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |ted437@gmail.com
--- Comment #48 from Anthony Rabbito (ted437@gmail.com) --- I'm also facing consistent crashes on wake-up from the screen saver on a Radeon VII. This became more apparent with 5.14.0-rc7 and has made its way to 5.14.0. After the screens blank, waking up from sleep typically leaves artifacts on one screen, another screen will be frozen, and a third screen allows me to unlock out of SDDM. I will attach kernel logs of a trace while this happens. Please let me know if I can assist in any way.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #49 from Anthony Rabbito (ted437@gmail.com) --- Created attachment 298661 --> https://bugzilla.kernel.org/attachment.cgi?id=298661&action=edit journalctl of amdgpu trace
(In reply to Anthony Rabbito from comment #48)
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #50 from James Zhu (jamesz@amd.com) --- Hi Anthony, Can you try the suggestion from comment #37 and see if it helps? But from the log that you attached, it is a different issue: the GFX hardware has lots of ECC errors, which cause the gfx ring timeout. After that, GPU recovery is triggered, unfortunately ending in a blank screen. I think you need to create another ticket for your case. Best Regards! James
https://bugzilla.kernel.org/show_bug.cgi?id=211277
Arham Jain (arhamjain@gmail.com) changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |arhamjain@gmail.com
--- Comment #51 from Arham Jain (arhamjain@gmail.com) --- I can confirm that the issue I was having after trying to wake after suspend (Ryzen 3500u, Linux 5.14 RC7) has vanished after adding pci=noats to my boot parameters a few days ago. I've had this issue on every kernel since 5.10 (5.4 and 5.9 were fine for me for several months each, not sure what I used in between). Thank you so much James for posting this (and trying to fix it)!
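For readers who want to try the same workaround, here is a rough sketch of adding a kernel parameter on a GRUB-based system. It assumes a Debian-style `/etc/default/grub` with a double-quoted `GRUB_CMDLINE_LINUX_DEFAULT` line; the helper name `add_kernel_param` is made up for illustration, and other boot loaders need a different procedure.

```shell
#!/bin/sh
# add_kernel_param FILE PARAM
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE.
# Assumptions: the value is double-quoted and PARAM contains no "/" or
# regex metacharacters (true for pci=noats).
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on that line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*${param}" "$file"; then
        return 0
    fi
    # Rewrite via a temporary file instead of relying on "sed -i".
    sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"/" \
        "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example (not executed here, needs root):
#   add_kernel_param /etc/default/grub pci=noats
#   update-grub        # Debian/Ubuntu; regenerates the GRUB configuration
#   # after reboot: grep -o 'pci=noats' /proc/cmdline
```

Checking `/proc/cmdline` after the reboot confirms that the parameter actually reached the running kernel.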
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #52 from James Zhu (jamesz@amd.com) --- Created attachment 298691 --> https://bugzilla.kernel.org/attachment.cgi?id=298691&action=edit Fix for S3 hung issue
Hi Jerome and kolAflash,
I think the IOMMU device init is put at the wrong place during resume. I attached a patch. Please confirm whether it works. Thanks! James
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #53 from Anthony Rabbito (ted437@gmail.com) --- Thanks for chiming in, James! A few things I've observed: since adding 'pci=noats' the graphic artifacts seem to happen way less. I did observe one lockup which required me to hard shut down the computer. This was a wake from suspend scenario.
I used to deal with somewhat similar issues here -- https://bugs.freedesktop.org/show_bug.cgi?id=110674 not sure if that's of any use. Let me know if a fresh bug is warranted.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #54 from Jerome C (me@jeromec.com) --- Hi James,
After 900 suspend/resume cycles (600 on LLVM, 300 on GCC) using kernel 5.14.1 compiled by LLVM 12.0.1 (LLVM_IAS unset during compilation) and again by GCC 11.1.0, there was no crash on resume, awesome. It usually fails within 1-150 suspend/resume cycles.
BRING ON THE RYZEN 6000 SERIES APU
Thanks
Jerome
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #55 from Jerome C (me@jeromec.com) --- Created attachment 298695 --> https://bugzilla.kernel.org/attachment.cgi?id=298695&action=edit signature.asc
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #56 from Jerome C (me@jeromec.com) --- damn, sorry for the ugly message layout replies
I didn't realize my e-mail provider was doing that
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #57 from James Zhu (jamesz@amd.com) --- (In reply to Anthony Rabbito from comment #53)
Thanks for chiming in James! Few things I've observed since adding 'pci=noats' the graphic artifacts seem to happen way less. I did observe one lockup which required me to hard shut down the computer. This was a wake from suspend scenario.
I used to deal with somwhat similar issues here -- https://bugs.freedesktop.org/show_bug.cgi?id=110674 not sure if that's of any use. Let me know if a fresh bug is warranted.
Hi Anthony,
The S3 hang issue here always comes with the error: AMD-Vi: Event logged [IO_PAGE_FAULT...]. Bug 110674 doesn't have gfx ECC errors. Your case does have lots of them. Can you share the whole dmesg after you added pci=noats? Regards! James
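Whether a given hang matches this bug can be checked by looking for that AMD-Vi message in the kernel log of the failed boot. A minimal sketch, assuming the log was saved to a file first (for example from `/var/log/kern.log`, or with `journalctl -k -b -1` on systems with a persistent journal); the function name `count_iommu_faults` is hypothetical.

```shell
#!/bin/sh
# count_iommu_faults FILE
# Prints how many lines in FILE contain the AMD-Vi IO_PAGE_FAULT event
# that James describes as the signature of this S3 hang.
# Note: grep -c prints 0 and exits non-zero when nothing matches.
count_iommu_faults() {
    grep -c 'AMD-Vi: Event logged \[IO_PAGE_FAULT' "$1"
}

# Example (not executed here):
#   journalctl -k -b -1 > prev-boot.log   # kernel messages of the previous boot
#   count_iommu_faults prev-boot.log
```

A count of zero does not rule out other amdgpu problems; as noted for Anthony's log, GFX ECC errors point to a separate issue.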
https://bugzilla.kernel.org/show_bug.cgi?id=211277
youling257@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |youling257@gmail.com
--- Comment #58 from youling257@gmail.com --- The commit "drm/amdgpu: move iommu_resume before ip init/resume" causes resume from suspend-to-disk to fail on my amdgpu 3400G.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #59 from James Zhu (jamesz@amd.com) --- (In reply to youling257 from comment #58)
drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk resume failed on my amdgpu 3400g.
Can you share the whole dmesg log? Regards! James
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #60 from youling257@gmail.com --- Created attachment 298889 --> https://bugzilla.kernel.org/attachment.cgi?id=298889&action=edit dmesg5.15.txt
(In reply to James Zhu from comment #59)
(In reply to youling257 from comment #58)
drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk resume failed on my amdgpu 3400g.
Can you share whole demsg log? Regards! James
When resume fails I have to force a shutdown, so how can I output dmesg? I only have the boot log dmesg.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #61 from James Zhu (jamesz@amd.com) --- (In reply to youling257 from comment #60)
Created attachment 298889 [details] dmesg5.15.txt
(In reply to James Zhu from comment #59)
(In reply to youling257 from comment #58)
drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk resume failed on my amdgpu 3400g.
Can you share whole demsg log? Regards! James
when resume failed have to force shutdown, how to output dmesg? only has boot log dmesg.
After reboot, you can find it in /var/log/kern.log and /var/log/syslog based on the timestamp. You can just attach kern.log.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #62 from youling257@gmail.com --- (In reply to James Zhu from comment #61)
(In reply to youling257 from comment #60)
Created attachment 298889 [details] dmesg5.15.txt
(In reply to James Zhu from comment #59)
(In reply to youling257 from comment #58)
drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk resume failed on my amdgpu 3400g.
Can you share whole demsg log? Regards! James
when resume failed have to force shutdown, how to output dmesg? only has boot log dmesg.
after reboot, you can find under /var/log/kern.log and /var/log/syslog based on timestamp. you can just attach kern.log
My userspace is Android-x86: I'm running Android-x86 with Linux 5.15 and Mesa 21 on amdgpu, so there is no /var/log. I ran git bisect between Linux 5.15-rc1 and rc2; the bad commit is "drm/amdgpu: move iommu_resume before ip init/resume".
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #63 from James Zhu (jamesz@amd.com) --- (In reply to youling257 from comment #62)
(In reply to James Zhu from comment #61)
(In reply to youling257 from comment #60)
Created attachment 298889 [details] dmesg5.15.txt
(In reply to James Zhu from comment #59)
(In reply to youling257 from comment #58)
drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk resume failed on my amdgpu 3400g.
Can you share whole demsg log? Regards! James
when resume failed have to force shutdown, how to output dmesg? only has boot log dmesg.
after reboot, you can find under /var/log/kern.log and /var/log/syslog based on timestamp. you can just attach kern.log
my userspace is androidx86, running androidx86 with linux 5.15 and mesa21 on amdgpu, no /var/log. git bisect linux kernel 5.15rc1 and rc2, bad commit is drm/amdgpu: move iommu_resume before ip init/resume.
Can you check the CONFIG_HSA_AMD setting in .config? By the way, see if the link below helps you dump the error message during resume. https://stackoverflow.com/questions/9682306/android-how-to-get-kernel-logs-a...
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #64 from youling257@gmail.com --- Created attachment 298899 --> https://bugzilla.kernel.org/attachment.cgi?id=298899&action=edit config-5.15.0-rc2-android-x86_64+
CONFIG_HSA_AMD=y
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #65 from youling257@gmail.com --- (In reply to James Zhu from comment #63)
(In reply to youling257 from comment #62)
(In reply to James Zhu from comment #61)
(In reply to youling257 from comment #60)
Created attachment 298889 [details] dmesg5.15.txt
(In reply to James Zhu from comment #59)
(In reply to youling257 from comment #58)
drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk resume failed on my amdgpu 3400g.
Can you share whole demsg log? Regards! James
when resume failed have to force shutdown, how to output dmesg? only has boot log dmesg.
after reboot, you can find under /var/log/kern.log and /var/log/syslog based on timestamp. you can just attach kern.log
my userspace is androidx86, running androidx86 with linux 5.15 and mesa21 on amdgpu, no /var/log. git bisect linux kernel 5.15rc1 and rc2, bad commit is drm/amdgpu: move iommu_resume before ip init/resume.
Can you check CONFIG_HSA_AMD setting in .config? By the way , see if the below link help you dump the error message during resume. https://stackoverflow.com/questions/9682306/android-how-to-get-kernel-logs-after-kernel-panic
Did you see the kernel command line in my dmesg: "memmap=1M!5M ramoops.mem_size=1048576 ramoops.ecc=1 ramoops.mem_address=0x00500000 ramoops.console_size=16384 ramoops.ftrace_size=16384 ramoops.pmsg_size=16384 ramoops.record_size=32768"?
If the kernel panics and reboots, I can get /sys/fs/pstore/console-ramoops-0 and /sys/fs/pstore/pmsg-ramoops-0. But when resume fails, I have to press the power button to force a shutdown, and nothing is recorded.
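For reference, checking what ramoops preserved after a reboot could look like the sketch below. It assumes pstore is mounted at `/sys/fs/pstore` as in the paths above; the helper name `show_pstore` is made up. As described in this comment, a hang that ends with a forced power-off may leave no records at all.

```shell
#!/bin/sh
# show_pstore [DIR]
# Prints every regular file found in DIR (default: /sys/fs/pstore),
# e.g. console-ramoops-0 and pmsg-ramoops-0, or a note if DIR is empty.
show_pstore() {
    dir=${1:-/sys/fs/pstore}
    found=0
    for f in "$dir"/*; do
        # The glob stays unexpanded when DIR is empty, so skip non-files.
        [ -f "$f" ] || continue
        found=1
        echo "== $f"
        cat "$f"
    done
    [ "$found" -eq 1 ] || echo "no pstore records in $dir"
}

# Example (not executed here): show_pstore
```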
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #66 from youling257@gmail.com --- Video recording of the failed resume: https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-shir0pqX?usp...
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #67 from James Zhu (jamesz@amd.com) --- (In reply to youling257 from comment #66)
resume failed record video, https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-shir0pqX?usp=sharing
Can you try applying this patch? https://lore.kernel.org/all/20210920163922.313113287@linuxfoundation.org/
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #68 from youling257@gmail.com --- (In reply to James Zhu from comment #67)
(In reply to youling257 from comment #66)
resume failed record video, https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-shir0pqX?usp=sharing
Can you try apply this patch: https://lore.kernel.org/all/20210920163922.313113287@linuxfoundation.org/?
Linux kernel 5.15-rc1 is good: suspend-to-disk resume succeeds. Linux kernel 5.15-rc2 is bad: suspend-to-disk resume fails. Reverting "drm/amdgpu: move iommu_resume before ip init/resume" makes suspend-to-disk resume succeed again.
Linux kernel 5.15-rc2 already has "drm/amdkfd: separate kfd_iommu_resume from kfd_resume", so why do you suggest that I apply the patch?
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #70 from James Zhu (jamesz@amd.com) --- My mistake. Can you try adding pci=noats to the boot parameters?
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #71 from youling257@gmail.com --- (In reply to James Zhu from comment #70)
My mistake. Can you try adding pci=noats to the boot parameters?
No help, resume still fails.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #72 from Jerome C (me@jeromec.com) --- Hi James,
I noticed that the patches you asked us to try in comment 52 were also submitted to kernel 5.14.7.
I tested it, all is good for now.
Thanks
Jerome
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #73 from kolAflash (kolAflash@kolahilft.de) --- (In reply to Jerome C from comment #72)
Hi James,
I noticed the patch that you asked us to try from comment 52 were also submitted to kernel 5.14.7
tested it, all is good for now
Pleased to hear that :-) I'm just compiling 5.15.2 to run a test myself.
@James Will those patches be backported to the Linux-5.10 LTS kernel?
master and Linux-5.15 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
Linux-5.14.7 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=... https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #74 from kolAflash (kolAflash@kolahilft.de) --- @James Zhu
Tested 5.15.2 for over a week and more than 50 standby-wakeups. No problems! Thanks :-)
I would be happy about a patch for the 5.10 longterm kernel. The bug became a problem with v5.10-rc3 (see comment 14), just before Debian made 5.10-longterm the Debian-11 kernel. So it would be great if I and probably other Debian-11 users could finally use that AMD GPU without workarounds.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #75 from James Zhu (jamesz@amd.com) --- (In reply to kolAflash from comment #74)
@James Zhu
Tested 5.15.2 for over a week and more than 50 standby-wakeups. No problems! Thanks :-)
I would be happy about a patch for the 5.10 longterm kernel. The bug became a problem with v5.10-rc3 (see comment 14), just before Debian made 5.10-longterm the Debian-11 kernel. So it would be great if I and probably other Debian-11 users could finally use that AMD GPU without workarounds.
Hi @Alex Deucher, can you help with this request? Thanks! James
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #76 from Alex Deucher (alexdeucher@gmail.com) --- (In reply to James Zhu from comment #75)
(In reply to kolAflash from comment #74)
@James Zhu
Tested 5.15.2 for over a week and more than 50 standby-wakeups. No problems! Thanks :-)
I would be happy about a patch for the 5.10 longterm kernel. The bug became a problem with v5.10-rc3 (see comment 14), just before Debian made 5.10-longterm the Debian-11 kernel. So it would be great if I and probably other Debian-11 users could finally use that AMD GPU without workarounds.
Hi @Alex Deucher, Can you help on this request? thanks! James
I cc'ed stable with the patches so they should show up in 5.10 assuming they apply cleanly. If not, can you look at what it would take to backport them?
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #77 from James Zhu (jamesz@amd.com) --- Created attachment 299697 --> https://bugzilla.kernel.org/attachment.cgi?id=299697&action=edit backport patch for 5.10 stable.
Hi @kolAflash, before I send them out for public review, could you help test them? Thanks so much! James
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #78 from kolAflash (kolAflash@kolahilft.de) --- (In reply to James Zhu from comment #77)
Created attachment 299697 [details] backport patch for 5.10 stable.
Hi @kolAflash, before I send out them to public for review,. could you help take a test? Thanks so much! James
Thanks for the patch! :-)
make is currently running and I'll conduct some tests in the next days.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #79 from kolAflash (kolAflash@kolahilft.de) --- @James
Got this when compiling with Linux-5.10.81:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c: In function ‘kgd2kfd_device_init’:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:754:6: error: implicit declaration of function ‘kgd2kfd_resume_iommu’; did you mean ‘kgd2kfd_resume_mm’? [-Werror=implicit-function-declaration]
  754 |  if (kgd2kfd_resume_iommu(kfd))
      |      ^~~~~~~~~~~~~~~~~~~~
      |      kgd2kfd_resume_mm
Patching 5.10.81 worked without problems:
$ patch -p1 -i ../../backport_patch/0001-drm-amdkfd-separate-kfd_iommu_resume-from-kfd_resume.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
patching file drivers/gpu/drm/amd/amdkfd/kfd_device.c
$ patch -p1 -i ../../backport_patch/0002-drm-amdgpu-add-amdgpu_amdkfd_resume_iommu.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
$ patch -p1 -i ../../backport_patch/0003-drm-amdgpu-move-iommu_resume-before-ip-init-resume.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
$ patch -p1 -i ../../backport_patch/0004-drm-amdgpu-init-iommu-after-amdkfd-device-init.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
$ patch -p1 -i ../../backport_patch/0005-drm-amdkfd-fix-boot-failure-when-iommu-is-disabled-i.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
patching file drivers/gpu/drm/amd/amdkfd/kfd_device.c
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #80 from James Zhu (jamesz@amd.com) --- Hi @kolAflash, I applied those patches on top of https://github.com/gregkh/linux.git linux-5.10.y (f884bb85b8d877d4e0c670403754813a7901705b) and linux-5.12.y (0e6f651912bdd027a6d730b68d6d1c3f4427c0ae). I didn't see any compile issue.
Can you share your .config with me?
James
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #81 from kolAflash (kolAflash@kolahilft.de) --- Created attachment 299721 --> https://bugzilla.kernel.org/attachment.cgi?id=299721&action=edit Linux kernel make .config
@James
Compiling v5.10.80 (f884bb85b8d877d4e0c670403754813a7901705b) with the provided patch results in the same error.
I attached my Linux kernel make .config.
Compilation platform is Debian-11.1.0.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #82 from James Zhu (jamesz@amd.com) --- Hi @kolAflash,
I don't have any issue with your .config on Ubuntu 20.04.
From the source code, it should be fine.
$ grep -rn "kgd2kfd_resume_iommu" drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
309:int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
$ grep -rn "amdgpu_amdkfd.h\|kgd2kfd_resume_iommu" drivers/gpu/drm/amd/amdkfd/kfd_device.c
31:#include "amdgpu_amdkfd.h"
604: kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);
792: if (kgd2kfd_resume_iommu(kfd))
940:int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
It looks like we are using different 5.10 trees. Should we use 5.10 stable for adding these backport patches?
754 | if (kgd2kfd_resume_iommu(kfd))
| ^~~~~~~~~~~~~~~~~~~~
| kgd2kfd_resume_mm
Best Regards! James
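The grep check above can be wrapped in a small helper to confirm that a patched tree both declares and uses the new function. This is only a sketch; `check_symbol` is a hypothetical name, and it tests for plain text matches, not for a correct build. It is most useful when pointed at the exact source directory the compiler reads, which may differ from the patched one in an out-of-tree build.

```shell
#!/bin/sh
# check_symbol SYMBOL HEADER SOURCE
# Succeeds only if SYMBOL appears in both HEADER (declaration expected)
# and SOURCE (use or definition expected). Fixed-string match, no regex.
check_symbol() {
    sym=$1
    header=$2
    src=$3
    if ! grep -qF -- "$sym" "$header"; then
        echo "missing declaration of $sym in $header"
        return 1
    fi
    if ! grep -qF -- "$sym" "$src"; then
        echo "no use of $sym in $src"
        return 1
    fi
    echo "ok: $sym found in $header and $src"
}

# Example (not executed here), run from the top of the kernel tree:
#   check_symbol kgd2kfd_resume_iommu \
#       drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h \
#       drivers/gpu/drm/amd/amdkfd/kfd_device.c
```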
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #83 from kolAflash (kolAflash@kolahilft.de) --- Hi James,
(In reply to James Zhu from comment #82)
[...]
$ grep -rn "amdgpu_amdkfd.h\|kgd2kfd_resume_iommu" drivers/gpu/drm/amd/amdkfd/kfd_device.c
31:#include "amdgpu_amdkfd.h"
604: kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);
792: if (kgd2kfd_resume_iommu(kfd))
940:int kgd2kfd_resume_iommu(struct kfd_dev *kfd)
The line numbers you're quoting are for Linux v5.12.19 (0e6f651912bdd027a6d730b68d6d1c3f4427c0ae) + the attachment-299697 patch.
Looks we are using different 5.10, should we use 5.10 stable for adding this backport patches?.
754 | if (kgd2kfd_resume_iommu(kfd))
| ^~~~~~~~~~~~~~~~~~~~ | kgd2kfd_resume_mm
I'm testing with Linux v5.10.80 (f884bb85b8d877d4e0c670403754813a7901705b) + the attachment-299697 patch. And there it's line number 754.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #84 from kolAflash (kolAflash@kolahilft.de) --- @James
I was able to compile!
Looks like this was some fault of mine. (I'm usually building out of source directory and did something wrong...)
Now I'm testing the current v5.10.82 with the provided attachment 299697 patches.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #85 from kolAflash (kolAflash@kolahilft.de) --- (In reply to James Zhu from comment #77)
Created attachment 299697 [details] backport patch for 5.10 stable.
Hi @kolAflash, before I send out them to public for review,. could you help take a test? Thanks so much! James
Works excellent!
Tested with Linux-5.10.82 on Debian-11.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #86 from James Zhu (jamesz@amd.com) --- Hi @kolAflash, thanks so much for your effort on this verification! Would you mind also applying those patches on 5.12 stable to check? They should merge automatically. Thanks! James
https://bugzilla.kernel.org/show_bug.cgi?id=211277
--- Comment #87 from kolAflash (kolAflash@kolahilft.de) --- (In reply to James Zhu from comment #86)
Hi @kolAflash, thanks so much for your effort on this verification! Would you mind help apply those patches on 5.12 stable to check also? it should be automatically merged. Thanks! James
I've been testing Linux-5.12.19 with the patch from attachment 299697 since 2021-12-02. So far everything works fine.
https://bugzilla.kernel.org/show_bug.cgi?id=211277
kolAflash (kolAflash@kolahilft.de) changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution|--- |CODE_FIX
--- Comment #88 from kolAflash (kolAflash@kolahilft.de) --- Debian-11 just got a kernel security update, giving me Linux-5.10.92.
https://snapshot.debian.org/package/linux-signed-amd64/5.10.92%2B1/#linux-im...
Since rebooting into that kernel I got no more crashes after waking from s2ram. (not using pci=noats or any other workarounds)
Conclusion: Everything fixed! Thanks a lot to everyone involved :-)
dri-devel@lists.freedesktop.org