Hi Alex, hi all
In Debian we got a regression report from Dominique Dumont (CC'ed) in https://bugs.debian.org/1005005: after an update to a 5.15.15-based kernel, his machine no longer suspends correctly. The screen goes black as usual, but then it comes back. The Debian bug above contains a trace.
Dominique confirmed that this issue persisted after updating to 5.16.7. Furthermore, he bisected the issue and found
3c196f05666610912645c7c5d9107706003f67c3 is the first bad commit
commit 3c196f05666610912645c7c5d9107706003f67c3
Author: Alex Deucher alexander.deucher@amd.com
Date: Fri Nov 12 11:25:30 2021 -0500
drm/amdgpu: always reset the asic in suspend (v2)
[ Upstream commit daf8de0874ab5b74b38a38726fdd3d07ef98a7ee ]
If the platform suspend happens to fail and the power rail is not turned off, the GPU will be in an unknown state on resume, so reset the asic so that it will be in a known good state on resume even if the platform suspend failed.
v2: handle s0ix
Acked-by: Luben Tuikov luben.tuikov@amd.com
Acked-by: Evan Quan evan.quan@amd.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
Signed-off-by: Sasha Levin sashal@kernel.org
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
to be the first bad commit, see https://bugs.debian.org/1005005#34 .
Does this ring any bell? Any idea on the problem?
Regards, Salvatore
[TLDR: I'm adding the regression report below to regzbot, the Linux kernel regression tracking bot; all text you find below is compiled from a few template paragraphs you might have encountered already in similar mails.]
Hi, this is your Linux kernel regression tracker speaking.
CCing the regression mailing list, as it should be in the loop for all regressions, as explained here: https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
To be sure this issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression tracking bot:
#regzbot ^introduced 3c196f056666
#regzbot title amdgfx: suspend stopped working
#regzbot ignore-activity
#regzbot link: https://bugs.debian.org/1005005
Reminder for developers: when fixing the issue, please add a 'Link:' tag pointing to the report (the mail quoted above) using lore.kernel.org/r/, as explained in 'Documentation/process/submitting-patches.rst' and 'Documentation/process/5.Posting.rst'. This allows the bot to connect the report with any patches posted or committed to fix the issue; this again allows the bot to show the current status of regressions and automatically resolve the issue when the fix hits the right tree.
I'm sending this to everyone that got the initial report, to make them aware of the tracking. I also hope that messages like this motivate people to directly get at least the regression mailing list and ideally even regzbot involved when dealing with regressions, as messages like this wouldn't be needed then.
Don't worry, I'll send further messages wrt this regression just to the lists (with a tag in the subject so people can filter them away), if they are relevant just for regzbot. With a bit of luck no such messages will be needed anyway.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I'm getting a lot of reports on my table. I can only look briefly into most of them and lack knowledge about most of the areas they concern. I thus unfortunately will sometimes get things wrong or miss something important. I hope that's not the case here; if you think it is, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight.
On 12.02.22 19:23, Salvatore Bonaccorso wrote:
On Sat, Feb 12, 2022 at 1:23 PM Salvatore Bonaccorso carnil@debian.org wrote:
Does the system actually suspend? Putting the GPU into reset on suspend shouldn't cause any problems since the power rail will presumably be cut by the platform. Is this system S0i3 or regular S3? Does this patch help by any chance? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
Alex
Regards, Salvatore
On Monday, 14 February 2022 22:52:27 CET Alex Deucher wrote:
Does the system actually suspend?
Not really. The screen looks like it's going to suspend, but it comes back after 10s or so. The light mounted in the middle of the power button does not switch off.
Is this system S0i3 or regular S3?
I'm not sure how to check that. After a bit of reading on the Internet [1], I hope that the following information answers that question. Please get back to me if that's not the case.
Looks like my system supports both S0i3 and S3:
$ cat /sys/power/state
freeze mem disk
I get the same result running these 2 commands as root:
# echo freeze > /sys/power/state
# echo mem > /sys/power/state
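A side note on reading this output: `freeze` in /sys/power/state is suspend-to-idle, while what `mem` does depends on /sys/power/mem_sleep, which lists the available modes and marks the active one with square brackets (`s2idle`, `shallow`, `deep`, where `deep` is S3 suspend-to-RAM). Below is a minimal Python sketch of parsing that file format. The function names and the sample strings in the usage note are illustrative assumptions, not output from the machine discussed in this thread.

```python
# Sketch: interpret the content of /sys/power/mem_sleep.
# Helper names are hypothetical; only the file format comes from the kernel docs.

MODE_NAMES = {
    "s2idle": "suspend-to-idle",
    "shallow": "standby",
    "deep": "suspend-to-RAM (S3)",
}


def parse_mem_sleep(text):
    """Return (active_mode, available_modes) for a mem_sleep string.

    The active mode is the token wrapped in square brackets,
    e.g. "s2idle [deep]" -> ("deep", ["s2idle", "deep"]).
    """
    active = None
    modes = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    return active, modes


def describe(mode):
    """Map a mem_sleep mode string to a human-readable name."""
    return MODE_NAMES.get(mode, "unknown")


def read_mem_sleep(path="/sys/power/mem_sleep"):
    """Read and parse the file on a running Linux system."""
    with open(path) as f:
        return parse_mem_sleep(f.read())
```

For example, `parse_mem_sleep("s2idle [deep]")` reports `deep` as the active mode, meaning `echo mem > /sys/power/state` would request S3 on such a system.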
yes, with this patch:
- the suspend issue is solved
- kernel logs no longer show messages like "failed to send message" or "*ERROR* suspend of IP block <powerplay> failed" while suspending
All the best
[1] https://01.org/blogs/rzhang/2015/best-practice-debug-linux-suspend/hibernate-issues
On 20/02/2022 16:48, Dominique Dumont wrote:
As I have a very similar problem and also commented on the original Debian bug report (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1005005), I will add some information here about another AMD-only laptop (Renoir AMD Ryzen 7 4800H with Radeon Graphics + Radeon RX 5500/5500M / Pro 5500M).
For me, suspend works once. After the first resume I see a RIP in dmesg (I do not know whether it is in the suspend path or the resume path; see the additional info in the Debian bug), and later suspend attempts do not work: the machine only goes to the KDE login screen.
Due to network connectivity I was unable to do a full bisect, but I tested with the patch I had on my laptop:
5.10.101 works, 5.10 from Debian works
5.11 works
5.12 works
5.13 suspend works, but when resuming the PC is dead and I have to reboot
5.14 seems to work, but looking at dmesg it is full of RIP messages at various places
5.15.24 is as described
5.15 from Debian is behaving identically
5.16 from Debian is behaving identically
Is this system S0i3 or regular S3?
For me it is real S3.
The proposed patch is intended for Intel + Intel GPU + amdgpu systems, but I have dual AMD GPUs.
--eric
On Mon, Feb 21, 2022 at 3:29 AM Eric Valette eric.valette@free.fr wrote:
It doesn't really matter what the platform is; it could still potentially help on your system. It depends on the BIOS implementation for your platform and how it handles suspend. You can try the patch, but I don't think you are hitting the same issue. A bisect would be helpful in your case.
Alex
Hi, this is your Linux kernel regression tracker. Top-posting for once, to make this easily accessible to everyone.
Dominique/Salvatore/Eric, what's the status of this regression? According to the debian bug tracker the problem is solved with 5.16 and 5.17, but was 5.15 ever fixed?
Ciao, Thorsten
On 21.02.22 15:16, Alex Deucher wrote:
My problem has never been fixed. The proposed patch has been applied to 5.15. I do not remember which version, 28 maybe.
I still have a RIP in pm_suspend. I did not test the last two 5.15 versions.
I can live with 5.10 and using my own compiled kernels.
Thanks for asking.
On 21 March 2022 09:58:01, Thorsten Leemhuis regressions@leemhuis.info wrote:
On 21.03.22 13:07, Éric Valette wrote:
This thread/the Debian bug report (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1005005 ) is getting long, which makes things hard to grasp. But to me it looks a lot like the problem you are facing is different from the problem that others ran into and bisected -- but I might be totally wrong there. Have you ever tried reverting 3c196f056666 to see if it helps (sorry if that's mentioned in the bug report somewhere; as I said, it became long)? I guess a bisection from your side really would help a lot; but before you go down that route you might want to give 5.17 and the latest 5.15.y kernel a try.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
Hi
On Monday, 21 March 2022 09:57:59 CET Thorsten Leemhuis wrote:
I don't think so.
On the kernel side, the commit fixing this issue is e55a3aea418269266d84f426b3bd70794d3389c8.
According to the logs of [1], this commit landed in v5.17-rc3.
HTH
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
On 21.03.22 19:49, Dominique Dumont wrote:
And from there it got backported to 5.15.22, among others:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=l...
https://lwn.net/Articles/884107/
Another indicator that Eric's problem is something else.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
On Monday, 21 March 2022 19:49:56 CET Dominique Dumont wrote:
It was included in 5.15.22, but the newest 5.15 kernel uploaded to Debian was 5.15.15, so there is no fixed 5.15 in Debian. It was also included in 5.16.8, and the earliest version in Debian which had that commit was 5.16.10 (uploaded 2022-02-18 to Unstable). The current version in Unstable is 5.16.14. Testing/Bookworm now has 5.16.12. In Experimental, 5.17-rc3 was uploaded on 2022-02-12.
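To make the version reasoning above easy to re-check, here is a small Python sketch that compares a stable kernel version against the first fixed point release of its series. The table contains only the two stable series named in this thread (5.15.22 and 5.16.8); the function name is hypothetical, release candidates such as 5.17-rc3 are deliberately not handled, and any other series yields `None` rather than a guess.

```python
import re

# First point release carrying the fix, per stable series, as stated in this
# thread. Other series are intentionally absent (no information here).
FIRST_FIXED = {(5, 15): 22, (5, 16): 8}


def has_fix(version):
    """Return True/False for a known stable series, None otherwise.

    Accepts strings such as "5.15.15" or "5.16.10"; anything after the
    numeric x.y.z prefix (e.g. a distribution suffix) is ignored.
    """
    m = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version)
    if not m:
        return None
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    first = FIRST_FIXED.get(series)
    if first is None:
        return None  # series not covered by the thread
    return patch >= first
```

With the versions quoted above, `has_fix("5.15.15")` is false while `has_fix("5.16.10")`, `has_fix("5.16.12")` and `has_fix("5.16.14")` are true, which matches the statement that Debian has no fixed 5.15 kernel.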
HTH, Diederik
I checked the back trace posted there (below). It seems the error occurred during amdgpu_device_suspend(). That means Alex's patch should not be related (as it affects only the logic after amdgpu_device_suspend()). So we might have got a wrong regression point here.
[ 257.842851] ? vi_common_set_clockgating_state+0x229/0x2f0 [amdgpu]
[ 257.843356] amdgpu_device_ip_suspend_phase1+0x5e/0xc0 [amdgpu]
[ 257.843771] amdgpu_device_suspend+0x62/0xc0 [amdgpu]
[ 257.844184] amdgpu_pmops_suspend+0x36/0x70 [amdgpu]
[ 257.844631] pci_pm_suspend+0x71/0x160
[ 257.844643] ? pci_pm_freeze+0xb0/0xb0
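For anyone wanting to repeat this kind of check on a longer dmesg excerpt, the following Python sketch extracts the call frames from a back trace and tests whether a given function appears among them. The regular expression and the helper names are assumptions written for the line format shown above, not part of any kernel tooling; frames prefixed with `?` are kept but flagged as unreliable.

```python
import re

# Matches frames like:
#   [ 257.843771] amdgpu_device_suspend+0x62/0xc0 [amdgpu]
#   [ 257.844643] ? pci_pm_freeze+0xb0/0xb0
FRAME_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(trace):
    """Return a list of dicts, one per call frame found in the text."""
    frames = []
    for m in FRAME_RE.finditer(trace):
        frames.append({
            "time": float(m.group("time")),
            "func": m.group("func"),
            "module": m.group("module"),  # None for built-in code
            "reliable": m.group("unreliable") is None,
        })
    return frames


def in_call_chain(trace, func):
    """True if func appears as a reliable frame in the trace."""
    return any(f["reliable"] and f["func"] == func
               for f in parse_frames(trace))
```

Applied to the six lines quoted above, `in_call_chain(trace, "amdgpu_device_suspend")` is true, which is the observation the argument rests on.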
BR Evan
dri-devel@lists.freedesktop.org