https://bugs.freedesktop.org/show_bug.cgi?id=110509
Bug ID: 110509
Summary: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Product: Mesa
Version: git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: James.Dutton@gmail.com
QA Contact: dri-devel@lists.freedesktop.org
AMD Vega 56 fails to reset:
[ 188.771043] Evicting PASID 32782 queues
[ 188.782094] Restoring PASID 32782 queues
[ 214.563362] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=19285, emitted seq=19287
[ 214.563432] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ACOdyssey.exe pid 3761 thread ACOdyssey.exe pid 3761
[ 214.563439] amdgpu 0000:43:00.0: GPU reset begin!
[ 214.563445] Evicting PASID 32782 queues
[ 224.793032] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out
How do I go about diagnosing this problem?
--- Comment #1 from James.Dutton@gmail.com ---
Created attachment 144084
  --> https://bugs.freedesktop.org/attachment.cgi?id=144084&action=edit
./umr -O bits -r *.*.mmGRBM_STATUS
Output while GPU failed to reset.
--- Comment #2 from James.Dutton@gmail.com ---
Created attachment 144085
  --> https://bugs.freedesktop.org/attachment.cgi?id=144085&action=edit
/usr/src/umr/build/src/app/umr -wa
Output of the wave.
--- Comment #3 from James.Dutton@gmail.com ---
Created attachment 144086
  --> https://bugs.freedesktop.org/attachment.cgi?id=144086&action=edit
dmesg
dmesg during reset.
James.Dutton@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #144086|0                           |1
        is obsolete|                            |
--- Comment #4 from James.Dutton@gmail.com ---
Created attachment 144087
  --> https://bugs.freedesktop.org/attachment.cgi?id=144087&action=edit
dmesg
dmesg
--- Comment #5 from James.Dutton@gmail.com ---
This is a result of trying to play games in Wine with DXVK. It used to work, but the latest Mesa git fails. The games that fail are:
Assassin's Creed Odyssey
Devil May Cry 5
Both these games get through the title sequences, but fail when you reach the actual game play. The GPU hangs and tries to reset, but fails to reset.
So, there are two problems:
1) Why does it hang in the first place?
2) Why does it fail to recover and reset itself?
I can ssh into the PC.
poweroff <- Attempts to power off but never actually reaches the off state.
echo b > /proc/sysrq-trigger <- Reboots the box, and everything is then OK again, so long as one does not try to play a game.
--- Comment #6 from James.Dutton@gmail.com ---
I think I have found the problem.
[ 657.526313] amdgpu 0000:43:00.0: GPU reset begin!
[ 657.526318] Evicting PASID 32782 queues
[ 667.756000] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out
The intention is to do a GPU reset, but the implementation in the code is just to try and do a suspend. Part of the suspend does this:
Apr 29 14:29:19 thread kernel: [ 363.445607] INFO: task kworker/u258:0:55 blocked for more than 120 seconds.
Apr 29 14:29:19 thread kernel: [ 363.445612] Not tainted 5.0.10-dirty #26
Apr 29 14:29:19 thread kernel: [ 363.445613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 14:29:19 thread kernel: [ 363.445615] kworker/u258:0 D 0 55 2 0x80000000
Apr 29 14:29:19 thread kernel: [ 363.445628] Workqueue: events_unbound commit_work [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445629] Call Trace:
Apr 29 14:29:19 thread kernel: [ 363.445635] __schedule+0x2c0/0x880
Apr 29 14:29:19 thread kernel: [ 363.445637] schedule+0x2c/0x70
Apr 29 14:29:19 thread kernel: [ 363.445639] schedule_timeout+0x1db/0x360
Apr 29 14:29:19 thread kernel: [ 363.445641] ? update_load_avg+0x8b/0x590
Apr 29 14:29:19 thread kernel: [ 363.445645] dma_fence_default_wait+0x1eb/0x270
Apr 29 14:29:19 thread kernel: [ 363.445647] ? dma_fence_release+0xa0/0xa0
Apr 29 14:29:19 thread kernel: [ 363.445649] dma_fence_wait_timeout+0xfd/0x110
Apr 29 14:29:19 thread kernel: [ 363.445651] reservation_object_wait_timeout_rcu+0x17d/0x370
Apr 29 14:29:19 thread kernel: [ 363.445710] amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445767] amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445820] ? amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445828] commit_tail+0x42/0x70 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445835] commit_work+0x12/0x20 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445838] process_one_work+0x1fd/0x400
Apr 29 14:29:19 thread kernel: [ 363.445840] worker_thread+0x34/0x410
Apr 29 14:29:19 thread kernel: [ 363.445841] kthread+0x121/0x140
Apr 29 14:29:19 thread kernel: [ 363.445843] ? process_one_work+0x400/0x400
Apr 29 14:29:19 thread kernel: [ 363.445844] ? kthread_park+0x90/0x90
Apr 29 14:29:19 thread kernel: [ 363.445847] ret_from_fork+0x22/0x40
So, amdgpu_dm_do_flip() is the bit that hangs. If the GPU needs to be reset because some of it has hung, trying a "flip" is unlikely to work. It is failing/hanging when doing "suspend of IP block <dm>" in amdgpu_device_ip_suspend_phase1().
I would suggest creating code that actually tries to reset the GPU, instead of trying to suspend it while GPU is hung.
--- Comment #7 from Alex Deucher alexdeucher@gmail.com ---
(In reply to James.Dutton from comment #6)
> I would suggest creating code that actually tries to reset the GPU, instead of trying to suspend it while GPU is hung.
That is part of the GPU reset sequence. We need to attempt to stop the engines before resetting the GPU. That is what the suspend code does. Not all of the engines are necessarily hung so you need to stop and drain them properly.
--- Comment #8 from James.Dutton@gmail.com ---
Thank you for the feedback. Is there a data sheet somewhere that might help me work out a fix for this? What I would like is:
1) A way to scan all the engines and detect which ones have hung.
2) A way to intentionally halt an engine and tidy up, so that the modprobe, rmmod, modprobe scenario works.
3) Data sheet details regarding how to un-hang each engine. Specifically, in this case, the IP block <dm>.
Maybe that is not possible, and (I think you are hinting at it), one cannot reset an individual IP block. So the approach is to suspend the card, and then do a full reset of the entire card, then resume.
I think a different suspend process would be better. We have a for_each within the suspend code. The output of that code should not be a single error code, but instead an array indicating the current state of each engine (running/hung), the intended state, and a status showing whether the transition worked or failed. As the code loops through the for_each, it could compare the current state with the intended state, attempt to reach the intended state, and report an error code for each engine. The code that performs the transition can then be different depending on the current -> intended transition, i.e. code for running -> suspended can be different from code for hung -> suspended. The code already needs to know which engines are enabled/disabled (Vega 56 vs Vega 64).
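The per-engine bookkeeping proposed above could look roughly like the following. This is a minimal userspace sketch, not amdgpu code: every name in it (engine_state, engine_status, suspend_running_engine, suspend_hung_engine, suspend_all_engines) is hypothetical and invented for illustration, and the two transition helpers are stand-ins that always succeed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the proposal; none of these names exist in amdgpu. */
enum engine_state { ENGINE_RUNNING, ENGINE_HUNG, ENGINE_SUSPENDED };

struct engine_status {
    const char *name;           /* e.g. "gfx" or "dm" */
    enum engine_state current;  /* state observed before the transition */
    enum engine_state intended; /* state the caller wants to reach */
    int result;                 /* 0 on success, negative on failure */
};

/* Stand-in for the normal path: stop and drain a healthy engine. */
static int suspend_running_engine(struct engine_status *e)
{
    e->current = ENGINE_SUSPENDED;
    return 0;
}

/* Stand-in for the hung path: skip draining and force the engine down. */
static int suspend_hung_engine(struct engine_status *e)
{
    e->current = ENGINE_SUSPENDED;
    return 0;
}

/*
 * Walk every engine, choose the code path from the current -> intended
 * transition, and record a result per engine instead of one global
 * error code. Returns the number of engines whose transition failed.
 */
static size_t suspend_all_engines(struct engine_status *engines, size_t count)
{
    size_t failures = 0;

    assert(engines != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        struct engine_status *e = &engines[i];

        if (e->current == e->intended) {
            e->result = 0; /* nothing to do */
            continue;
        }
        if (e->intended != ENGINE_SUSPENDED) {
            e->result = -1; /* this sketch only models suspend */
            failures++;
            continue;
        }
        e->result = (e->current == ENGINE_HUNG)
                        ? suspend_hung_engine(e)
                        : suspend_running_engine(e);
        if (e->result != 0)
            failures++;
    }
    return failures;
}
```

A caller would fill one engine_status entry per enabled engine, call suspend_all_engines(), and then inspect each entry's result field to see which transitions failed.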
I can hang this IP block <dm> at will. I have 2 games that hang it within seconds of starting.
--- Comment #9 from Alex Deucher alexdeucher@gmail.com --- (In reply to James.Dutton from comment #8)
If the gpu scheduler for a queue on a particular engine times out, you can be pretty sure the engine has hung. At that point you can check the current busy status for the block (IP is_idle() callback).
To halt an engine and tidy up: the hw_fini() IP callback.
Each IP has a soft reset (implemented via the IP soft_reset() callback), but depending on the hang, in some cases, you may have to do a full GPU reset to recover. This is not a hw hang, it's a sw deadlock.
All asics support full GPU reset which is implemented via the SOC level amdgpu_asic_funcs reset() callback.
We don't really care if the suspend fails or not. See amdgpu_device_gpu_recover() for the full sequence.
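The escalation described in this comment (check whether the block is idle, try the per-IP soft reset, fall back to a full GPU reset) can be modelled in a few lines. This is a userspace sketch under stated assumptions: the model_* types and functions are hypothetical, only loosely named after the is_idle(), soft_reset() and reset() callbacks mentioned above, and they do not reproduce the kernel's structures or the real amdgpu_device_gpu_recover() sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; these are NOT the kernel's definitions. */
struct model_ip_block {
    const char *name;
    bool busy;             /* simulated hardware busy bit */
    bool soft_reset_works; /* does a soft reset clear this simulated hang? */
};

enum recovery_result {
    RECOVERY_NOT_NEEDED,
    RECOVERY_SOFT_RESET,
    RECOVERY_FULL_RESET
};

/* Models an is_idle()-style check on one block. */
static bool model_is_idle(const struct model_ip_block *ip)
{
    return !ip->busy;
}

/* Models a per-IP soft reset, which may or may not clear the hang. */
static void model_soft_reset(struct model_ip_block *ip)
{
    if (ip->soft_reset_works)
        ip->busy = false;
}

/* Models a full GPU reset: every block ends up idle. */
static void model_full_reset(struct model_ip_block *blocks, size_t count)
{
    for (size_t i = 0; i < count; i++)
        blocks[i].busy = false;
}

/* Escalate for the block whose queue timed out. */
static enum recovery_result model_recover(struct model_ip_block *blocks,
                                          size_t count, size_t timed_out)
{
    assert(timed_out < count);
    struct model_ip_block *ip = &blocks[timed_out];

    if (model_is_idle(ip))
        return RECOVERY_NOT_NEEDED;

    model_soft_reset(ip);
    if (model_is_idle(ip))
        return RECOVERY_SOFT_RESET;

    model_full_reset(blocks, count);
    return RECOVERY_FULL_RESET;
}
```

The point of the model is the ordering: the cheap, local recovery is attempted first, and the whole device is reset only when the block is still busy afterwards.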
> I can hang this IP block <dm> at will. I have 2 games that hang it within seconds of starting.
There was a deadlock in the dm code which has been fixed. Please try a new code base, e.g.:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip
--- Comment #10 from James.Dutton@gmail.com ---
Created attachment 144118
  --> https://bugs.freedesktop.org/attachment.cgi?id=144118&action=edit
dmesg with drm-next-5.2-wip
--- Comment #11 from James.Dutton@gmail.com --- I tried with drm-next-5.2-wip.
It does not hang any more, but I have a new error now.
It is better, in the sense that I can now reboot the system normally, and not resort to echo b >/proc/sysrq-trigger
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
After the GPU reset, the screen is corrupted. Via ssh, I can run "service gdm stop" and then "service gdm start", and I then get a working login screen (the mouse moves and I can type in the password). I cannot actually log in because X fails: the desktop fails to appear and it returns to the login greeter screen.
I will try to get more details when I have time later.
--- Comment #12 from James.Dutton@gmail.com ---
The error is from this bit of code in amdgpu_cs.c, at about line 232, in the function amdgpu_cs_parser_init():

    if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
        ret = -ECANCELED;
        goto free_all_kdata;
    }
So, I guess, somewhere in the GPU reset, those values need to be fixed up.
--- Comment #13 from Michel Dänzer michel@daenzer.net --- (In reply to James.Dutton from comment #12)
It means the VRAM contents were lost during the GPU reset, so any existing userspace contexts are invalid and need to be re-created (which at this point boils down to restarting any processes using the GPU for rendering).
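The counter comparison and its consequence for existing contexts can be illustrated with a small userspace model. This is a sketch under assumptions, not the driver's code: the model_* names are hypothetical, and the model compares the context's snapshot directly against a device-wide counter, whereas the quoted kernel check compares it against the job's copy. On Linux, ECANCELED has the value 125, which matches the -125 in the reported error message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model types; not the kernel's structures. */
struct model_device {
    unsigned int vram_lost_counter; /* bumped when a reset loses VRAM */
};

struct model_context {
    unsigned int vram_lost_counter; /* snapshot taken at context creation */
};

/* A new context remembers how many VRAM losses had happened so far. */
static void model_context_init(struct model_context *ctx,
                               const struct model_device *dev)
{
    ctx->vram_lost_counter = dev->vram_lost_counter;
}

/* A reset that loses VRAM contents invalidates all older contexts. */
static void model_gpu_reset(struct model_device *dev, int vram_lost)
{
    if (vram_lost)
        dev->vram_lost_counter++;
}

/* Submissions from a context older than the last VRAM loss are refused. */
static int model_submit(const struct model_context *ctx,
                        const struct model_device *dev)
{
    if (ctx->vram_lost_counter != dev->vram_lost_counter)
        return -ECANCELED;
    return 0;
}
```

In this model, as in the explanation above, nothing "fixes up" an old context after the reset; the only way to submit again is to create a new context, which in practice means restarting the process that owns it.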
--- Comment #14 from James.Dutton@gmail.com --- I stop gdm and kill any remaining X processes. When I start gdm and login, it works, and displays the desktop.
Previously, I was leaving one of the X processes running.
So, I think this (drm-next-5.2-wip) has fixed this bug.
Alessandro lifeisfoo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lifeisfoo@gmail.com
--- Comment #15 from Alessandro lifeisfoo@gmail.com ---
Created attachment 145050
  --> https://bugs.freedesktop.org/attachment.cgi?id=145050&action=edit
dmsg drm amdgpu

I'm facing the same issue with the 5.2.x and 5.3-rc4 kernels and a Radeon RX 580.
Alessandro lifeisfoo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #145050|dmsg drm amdgpu             |dmsg drm amdgpu linux
        description|                            |5.3-rc4 from ubuntu ppa
GitLab Migration User gitlab-migration@fdo.invalid changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |MOVED
--- Comment #16 from GitLab Migration User gitlab-migration@fdo.invalid ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1389.