[AMD Official Use Only]
-----Original Message-----
From: Salvatore Bonaccorso <salvatore.bonaccorso@gmail.com> On Behalf Of Salvatore Bonaccorso
Sent: Sunday, February 13, 2022 2:24 AM
To: Deucher, Alexander <Alexander.Deucher@amd.com>
Cc: Dominique Dumont <dod@debian.org>; 1005005@bugs.debian.org; Tuikov, Luben <Luben.Tuikov@amd.com>; Quan, Evan <Evan.Quan@amd.com>; Sasha Levin <sashal@kernel.org>; Koenig, Christian <Christian.Koenig@amd.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-kernel@vger.kernel.org
Subject: Regression from 3c196f056666 ("drm/amdgpu: always reset the asic in suspend (v2)") on suspend?

Hi Alex, hi all
In Debian we got a regression report from Dominique Dumont, CC'ed, in https://bugs.debian.org/1005005 that after an update to a 5.15.15-based kernel, his machine no longer suspends correctly: after the screen goes black as usual, it comes back. The Debian bug above contains a trace.
Dominique confirmed that this issue persisted after updating to 5.16.7. Furthermore, he bisected the issue and found

3c196f05666610912645c7c5d9107706003f67c3 is the first bad commit
commit 3c196f05666610912645c7c5d9107706003f67c3
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Fri Nov 12 11:25:30 2021 -0500

    drm/amdgpu: always reset the asic in suspend (v2)

    [ Upstream commit daf8de0874ab5b74b38a38726fdd3d07ef98a7ee ]

    If the platform suspend happens to fail and the power rail
    is not turned off, the GPU will be in an unknown state on
    resume, so reset the asic so that it will be in a known
    good state on resume even if the platform suspend failed.

    v2: handle s0ix

    Acked-by: Luben Tuikov <luben.tuikov@amd.com>
    Acked-by: Evan Quan <evan.quan@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

to be the first bad commit, see https://bugs.debian.org/1005005#34 .
I checked the backtrace posted there (below). It seems the error occurred during amdgpu_device_suspend(). That means Alex's patch should not be related, as it only affects the logic that runs after amdgpu_device_suspend(). So we might have got the wrong regression point here.

[ 257.842851] ? vi_common_set_clockgating_state+0x229/0x2f0 [amdgpu]
[ 257.843356] amdgpu_device_ip_suspend_phase1+0x5e/0xc0 [amdgpu]
[ 257.843771] amdgpu_device_suspend+0x62/0xc0 [amdgpu]
[ 257.844184] amdgpu_pmops_suspend+0x36/0x70 [amdgpu]
[ 257.844631] pci_pm_suspend+0x71/0x160
[ 257.844643] ? pci_pm_freeze+0xb0/0xb0
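
For readers following the argument above, here is a minimal sketch of how the quoted trace excerpt can be read as a call chain. It is an illustration only, not part of the original report: the helper names (parse_frames, reliable_call_chain) are hypothetical, and the regular expression assumes the line format of the excerpt quoted in this message. Entries prefixed with "?" are addresses the kernel unwinder found on the stack but could not confirm as part of the call chain, so the sketch leaves them out.

```python
import re

# Assumed line format, taken from the excerpt quoted in this message:
#   [ 257.843771] amdgpu_device_suspend+0x62/0xc0 [amdgpu]
# A leading "?" marks an entry the unwinder could not confirm.
FRAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?"
)

TRACE = """\
[ 257.842851] ? vi_common_set_clockgating_state+0x229/0x2f0 [amdgpu]
[ 257.843356] amdgpu_device_ip_suspend_phase1+0x5e/0xc0 [amdgpu]
[ 257.843771] amdgpu_device_suspend+0x62/0xc0 [amdgpu]
[ 257.844184] amdgpu_pmops_suspend+0x36/0x70 [amdgpu]
[ 257.844631] pci_pm_suspend+0x71/0x160
[ 257.844643] ? pci_pm_freeze+0xb0/0xb0
"""


def parse_frames(trace):
    """Return one dict per stack entry, in the order the trace lists them."""
    frames = []
    for match in FRAME_RE.finditer(trace):
        frames.append({
            "symbol": match.group("symbol"),
            "module": match.group("module"),  # None for built-in code
            "reliable": match.group("unreliable") is None,
        })
    return frames


def reliable_call_chain(trace):
    """Return confirmed frames, outermost caller first.

    The trace lists the innermost frame first, so the order is reversed.
    """
    frames = parse_frames(trace)
    return [f["symbol"] for f in reversed(frames) if f["reliable"]]


if __name__ == "__main__":
    print(" -> ".join(reliable_call_chain(TRACE)))
```

Read outermost first, the confirmed frames in this excerpt end inside amdgpu_device_suspend(), which is the observation the reply above relies on.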

BR,
Evan

Does this ring any bell? Any idea on the problem?
Regards,
Salvatore