[Bug 110509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout - dri-devel - freedesktop.org experimental mailing list


bugzilla-daemon＠freedesktop.org

24 Apr 2019

5:26 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

Bug ID: 110509
Summary: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout
Product: Mesa
Version: git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: James.Dutton@gmail.com
QA Contact: dri-devel@lists.freedesktop.org

AMD Vega 56 fails to reset:

[ 188.771043] Evicting PASID 32782 queues
[ 188.782094] Restoring PASID 32782 queues
[ 214.563362] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=19285, emitted seq=19287
[ 214.563432] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ACOdyssey.exe pid 3761 thread ACOdyssey.exe pid 3761
[ 214.563439] amdgpu 0000:43:00.0: GPU reset begin!
[ 214.563445] Evicting PASID 32782 queues
[ 224.793032] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out
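One quick thing to read off the timeout line is the gap between the emitted and signaled sequence numbers, which plausibly indicates how many submissions were still outstanding when the ring timed out. The following is a minimal sketch in C under that assumption; pending_fences() is a hypothetical helper written for this illustration and is not part of amdgpu or umr.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract "signaled seq=N, emitted seq=M" from an
 * amdgpu ring-timeout log line and return M - N.
 * Returns -1 if the line does not contain both fields.
 * Assumption: the difference is the number of fences emitted but not
 * yet signaled; sequence-number wraparound is ignored.
 */
long pending_fences(const char *line)
{
    unsigned long signaled, emitted;
    const char *p;

    assert(line != NULL);

    p = strstr(line, "signaled seq=");
    if (!p)
        return -1;
    if (sscanf(p, "signaled seq=%lu, emitted seq=%lu",
               &signaled, &emitted) != 2)
        return -1;

    return (long)(emitted - signaled);
}
```

For the line logged at 214.563362 above, this would report two outstanding submissions (19287 - 19285).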

How do I go about diagnosing this problem?

-- You are receiving this mail because: You are the assignee for the bug.


bugzilla-daemon＠freedesktop.org

24 Apr

5:31 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #1 from James.Dutton@gmail.com ---
Created attachment 144084
--> https://bugs.freedesktop.org/attachment.cgi?id=144084&action=edit
./umr -O bits -r *.*.mmGRBM_STATUS

Output while GPU failed to reset.


bugzilla-daemon＠freedesktop.org

5:32 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #2 from James.Dutton@gmail.com ---
Created attachment 144085
--> https://bugs.freedesktop.org/attachment.cgi?id=144085&action=edit
/usr/src/umr/build/src/app/umr -wa

Output of the wave.


bugzilla-daemon＠freedesktop.org

5:33 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #3 from James.Dutton@gmail.com ---
Created attachment 144086
--> https://bugs.freedesktop.org/attachment.cgi?id=144086&action=edit
dmesg

dmesg during reset.


bugzilla-daemon＠freedesktop.org

5:35 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

James.Dutton@gmail.com changed:

--- Comment #4 from James.Dutton@gmail.com ---
Created attachment 144087
--> https://bugs.freedesktop.org/attachment.cgi?id=144087&action=edit
dmesg

dmesg


bugzilla-daemon＠freedesktop.org

28 Apr

3:42 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #5 from James.Dutton@gmail.com ---
This is a result of trying to play games in wine and dxvk. It used to work, but the latest mesa git fails. Games that fail are:
Assassin's Creed Odyssey
Devil May Cry 5

Both these games get through the title sequences, but fail when you reach the actual game play. The GPU hangs and tries to reset, but fails to reset.

So, there are two problems:
1) Why does it hang in the first place?
2) Why does it fail to recover and reset itself?

I can ssh into the PC.
poweroff <- Attempts to power off but never actually reaches off state.
echo b > /proc/sysrq-trigger <- reboots the box, and everything is then ok again, so long as one does not try to play a game.


bugzilla-daemon＠freedesktop.org

29 Apr

1:41 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #6 from James.Dutton@gmail.com ---
I think I have found the problem.

[ 657.526313] amdgpu 0000:43:00.0: GPU reset begin!
[ 657.526318] Evicting PASID 32782 queues
[ 667.756000] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:49:crtc-0] hw_done or flip_done timed out

The intention is to do a GPU reset, but the implementation in the code is just to try and do a suspend. Part of the suspend does this:

Apr 29 14:29:19 thread kernel: [ 363.445607] INFO: task kworker/u258:0:55 blocked for more than 120 seconds.
Apr 29 14:29:19 thread kernel: [ 363.445612] Not tainted 5.0.10-dirty #26
Apr 29 14:29:19 thread kernel: [ 363.445613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 14:29:19 thread kernel: [ 363.445615] kworker/u258:0 D 0 55 2 0x80000000
Apr 29 14:29:19 thread kernel: [ 363.445628] Workqueue: events_unbound commit_work [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445629] Call Trace:
Apr 29 14:29:19 thread kernel: [ 363.445635] __schedule+0x2c0/0x880
Apr 29 14:29:19 thread kernel: [ 363.445637] schedule+0x2c/0x70
Apr 29 14:29:19 thread kernel: [ 363.445639] schedule_timeout+0x1db/0x360
Apr 29 14:29:19 thread kernel: [ 363.445641] ? update_load_avg+0x8b/0x590
Apr 29 14:29:19 thread kernel: [ 363.445645] dma_fence_default_wait+0x1eb/0x270
Apr 29 14:29:19 thread kernel: [ 363.445647] ? dma_fence_release+0xa0/0xa0
Apr 29 14:29:19 thread kernel: [ 363.445649] dma_fence_wait_timeout+0xfd/0x110
Apr 29 14:29:19 thread kernel: [ 363.445651] reservation_object_wait_timeout_rcu+0x17d/0x370
Apr 29 14:29:19 thread kernel: [ 363.445710] amdgpu_dm_do_flip+0x14a/0x4a0 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445767] amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445820] ? amdgpu_dm_atomic_commit_tail+0x7b7/0xc10 [amdgpu]
Apr 29 14:29:19 thread kernel: [ 363.445828] commit_tail+0x42/0x70 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445835] commit_work+0x12/0x20 [drm_kms_helper]
Apr 29 14:29:19 thread kernel: [ 363.445838] process_one_work+0x1fd/0x400
Apr 29 14:29:19 thread kernel: [ 363.445840] worker_thread+0x34/0x410
Apr 29 14:29:19 thread kernel: [ 363.445841] kthread+0x121/0x140
Apr 29 14:29:19 thread kernel: [ 363.445843] ? process_one_work+0x400/0x400
Apr 29 14:29:19 thread kernel: [ 363.445844] ? kthread_park+0x90/0x90
Apr 29 14:29:19 thread kernel: [ 363.445847] ret_from_fork+0x22/0x40

So, amdgpu_dm_do_flip() is the bit that hangs. If the GPU needs to be reset because some of it has hung, trying a "flip" is unlikely to work. It is failing/hanging when doing "suspend of IP block <dm>" in amdgpu_device_ip_suspend_phase1().

I would suggest creating code that actually tries to reset the GPU, instead of trying to suspend it while GPU is hung.


bugzilla-daemon＠freedesktop.org

6:30 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #7 from Alex Deucher alexdeucher@gmail.com ---
(In reply to James.Dutton from comment #6)

...

> I would suggest creating code that actually tries to reset the GPU, instead of trying to suspend it while GPU is hung.

That is part of the GPU reset sequence. We need to attempt to stop the engines before resetting the GPU. That is what the suspend code does. Not all of the engines are necessarily hung so you need to stop and drain them properly.


bugzilla-daemon＠freedesktop.org

10:41 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #8 from James.Dutton@gmail.com ---
Thank you for the feedback. Is there a data sheet somewhere that might help me work out a fix for this? What I would like is:
1) A way to scan all the engines and detect which ones have hung.
2) A way to intentionally halt an engine and tidy up, so that the modprobe, rmmod, modprobe scenario works.
3) Data sheet details regarding how to un-hang each engine. Specifically, in this case the IP block <dm>.

Maybe that is not possible, and (I think you are hinting at it), one cannot reset an individual IP block. So the approach is to suspend the card, and then do a full reset of the entire card, then resume.

I think a different suspend process would be better. We have a for_each within the suspend code. The output of that code should not be a single error code, but instead an array indicating the current state of each engine (running/hung), the intended state, and a status showing whether the intention worked or failed. As the code loops through the for_each, it could compare the current state with the intended state, attempt to reach the intended state, and report an error code for each engine. The code to achieve the transition can then be different depending on the current -> intended transition, i.e. the code for running -> suspended can be different from the code for hung -> suspended. The code already needs to know which engines are enabled/disabled (Vega 56 vs Vega 64).
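The per-engine bookkeeping proposed above could look roughly like the following. This is only a userspace sketch in C of the idea; every name in it (engine_state, engine_status, stop_running(), stop_hung(), suspend_engines()) is hypothetical and none of it is existing amdgpu code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical types for illustration; not part of amdgpu. */
enum engine_state { ENGINE_RUNNING, ENGINE_HUNG, ENGINE_SUSPENDED };

struct engine_status {
    const char *name;
    enum engine_state current;   /* state observed before the transition */
    enum engine_state intended;  /* state we want to reach */
    int error;                   /* 0 on success, negative errno on failure */
};

/* running -> suspended: a healthy engine would be drained here first. */
static int stop_running(struct engine_status *e)
{
    e->current = ENGINE_SUSPENDED;
    return 0;
}

/* hung -> suspended: a hung engine cannot drain, so skip the wait. */
static int stop_hung(struct engine_status *e)
{
    e->current = ENGINE_SUSPENDED;
    return 0;
}

/*
 * Walk all engines, pick the transition code from the (current, intended)
 * pair, and record a per-engine error instead of one global error code.
 * Returns the number of engines that did not reach their intended state.
 */
int suspend_engines(struct engine_status *engines, size_t count)
{
    int failures = 0;

    assert(engines != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        struct engine_status *e = &engines[i];

        if (e->current == e->intended)
            e->error = 0;                 /* nothing to do */
        else if (e->intended != ENGINE_SUSPENDED)
            e->error = -EINVAL;           /* transition not modelled here */
        else if (e->current == ENGINE_HUNG)
            e->error = stop_hung(e);
        else
            e->error = stop_running(e);

        if (e->error)
            failures++;
    }
    return failures;
}
```

The point of the design is that a caller can inspect each entry afterwards and see which engine failed and why, rather than receiving a single return value for the whole loop.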

I can hang this IP block <dm> at will. I have 2 games that hang it within seconds of starting.


bugzilla-daemon＠freedesktop.org

30 Apr

1:26 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #9 from Alex Deucher alexdeucher@gmail.com ---
(In reply to James.Dutton from comment #8)

...

> Thank you for the feedback. Is there a data sheet somewhere that might help me work out a fix for this. What I would like is:
> 1) A way to scan all the engines and detect which ones have hung.

If the gpu scheduler for a queue on a particular engine times out, you can be pretty sure the engine has hung. At that point you can check the current busy status for the block (IP is_idle() callback).

...

> 2) A way to intentionally halt an engine and tidy up. So that the modprobe, rmmod, modprobe scenario works.

hw_fini() IP callback.

...

> 3) data sheet details regarding how to un-hang each engine. Specifically, in this case the IP block <dm>.

Each IP has a soft reset (implemented via the IP soft_reset() callback), but depending on the hang, in some cases, you may have to do a full GPU reset to recover. This is not a hw hang, it's a sw deadlock.

...

> Maybe that is not possible, and (I think you are hinting at it), one cannot reset an individual IP block. So the approach is to suspend the card, and then do a full reset of the entire card, then resume.

All asics support full GPU reset which is implemented via the SOC level amdgpu_asic_funcs reset() callback.
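The answers above suggest an escalation order: check whether the block is still busy, try its soft reset, and fall back to a full GPU reset if that does not help. The sketch below models only that decision logic in userspace C. The struct, the mock callbacks, and recover_block() are hypothetical; they loosely borrow the callback names mentioned in this thread, and the real callbacks take different arguments. The actual sequence lives in amdgpu_device_gpu_recover().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an IP block; not the kernel's data structures. */
struct ip_block {
    const char *name;
    bool (*is_idle)(struct ip_block *ip);
    int (*soft_reset)(struct ip_block *ip);
    bool busy;              /* mock hardware state */
    bool soft_reset_works;  /* mock: does a soft reset clear the hang? */
};

enum recovery { RECOVERY_NONE, RECOVERY_SOFT_RESET, RECOVERY_FULL_RESET };

/* Mock callbacks standing in for real hardware access. */
bool mock_is_idle(struct ip_block *ip)
{
    return !ip->busy;
}

int mock_soft_reset(struct ip_block *ip)
{
    if (!ip->soft_reset_works)
        return -1;
    ip->busy = false;
    return 0;
}

/*
 * Decide which recovery step is needed for a block whose queue timed out.
 * RECOVERY_FULL_RESET means the caller has to reset the whole GPU.
 */
enum recovery recover_block(struct ip_block *ip)
{
    assert(ip != NULL && ip->is_idle != NULL);

    if (ip->is_idle(ip))
        return RECOVERY_NONE;

    if (ip->soft_reset && ip->soft_reset(ip) == 0 && ip->is_idle(ip))
        return RECOVERY_SOFT_RESET;

    return RECOVERY_FULL_RESET;
}
```

As noted in the reply, the hang in this bug was a software deadlock rather than a hardware hang, so a decision tree like this would not have helped here by itself.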

...

> I think a different suspend process would be better. We have a for_each within the suspend code. The output of that code should not be a single error code, but instead an array indicating the current state of each engine (running/hung), the intended state, and a status showing whether the intention worked or failed. As the code loops through the for_each, it could compare the current state with the intended state, attempt to reach the intended state, and report an error code for each engine. The code to achieve the transition can then be different depending on the current -> intended transition, i.e. the code for running -> suspended can be different from the code for hung -> suspended. The code already needs to know which engines are enabled/disabled (Vega 56 vs Vega 64).

We don't really care if the suspend fails or not. See amdgpu_device_gpu_recover() for the full sequence.

...

> I can hang this IP block <dm> at will. I have 2 games that hang it within seconds of starting.

There was a deadlock in the dm code which has been fixed. Please try a new code base, e.g.:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-5.2-wip


bugzilla-daemon＠freedesktop.org

10:40 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #10 from James.Dutton@gmail.com ---
Created attachment 144118
--> https://bugs.freedesktop.org/attachment.cgi?id=144118&action=edit
dmesg with drm-next-5.2-wip


bugzilla-daemon＠freedesktop.org

10:44 a.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #11 from James.Dutton@gmail.com --- I tried with drm-next-5.2-wip.

It does not hang any more, but I have a new error now.

It is better, in the sense that I can now reboot the system normally, and not resort to echo b >/proc/sysrq-trigger

[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

After the GPU reset, the screen is corrupted. Via ssh, I can run "service gdm stop" followed by "service gdm start", and I then get a working login screen (the mouse moves and I can type in the password). I cannot actually log in because X fails: the desktop fails to appear and it returns to the login greeter screen.

I will try to get more details when I have time later.


bugzilla-daemon＠freedesktop.org

2:22 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #12 from James.Dutton@gmail.com ---

The error is from this bit of code in amdgpu_cs.c, at about line 232, in function amdgpu_cs_parser_init():

    if (p->ctx->vram_lost_counter != p->job->vram_lost_counter) {
        ret = -ECANCELED;
        goto free_all_kdata;
    }

So, I guess, somewhere in the GPU reset, those values need to be fixed up.


bugzilla-daemon＠freedesktop.org

2:26 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #13 from Michel Dänzer michel@daenzer.net ---
(In reply to James.Dutton from comment #12)

...

It means the VRAM contents were lost during the GPU reset, so any existing userspace contexts are invalid and need to be re-created (which at this point boils down to restarting any processes using the GPU for rendering).
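The mechanism described here can be modelled in a few lines of C: each context remembers the "VRAM lost" count that was current when it was created, and submissions from a context with a stale count are rejected. This is a simplified userspace sketch, not the kernel code. The structs and function names are hypothetical, and the assumption that the comparison is effectively against a device-wide counter is an interpretation of the snippet quoted in comment #12. On Linux, ECANCELED is 125 on most architectures, which matches the -125 in the "Failed to initialize parser" message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model; field names echo the snippet in comment #12. */
struct gpu_device {
    unsigned int vram_lost_counter;  /* bumped when a reset loses VRAM */
};

struct gpu_context {
    unsigned int vram_lost_counter;  /* snapshot taken at creation time */
};

/* A new context records the device counter it was created under. */
void context_init(struct gpu_context *ctx, const struct gpu_device *dev)
{
    assert(ctx != NULL && dev != NULL);
    ctx->vram_lost_counter = dev->vram_lost_counter;
}

/* Model of a GPU reset during which VRAM contents were lost. */
void device_reset_lost_vram(struct gpu_device *dev)
{
    assert(dev != NULL);
    dev->vram_lost_counter++;
}

/* Reject submissions from contexts created before the last VRAM loss. */
int submit(const struct gpu_context *ctx, const struct gpu_device *dev)
{
    assert(ctx != NULL && dev != NULL);
    if (ctx->vram_lost_counter != dev->vram_lost_counter)
        return -ECANCELED;
    return 0;
}
```

In this model nothing "fixes up" the old context after a reset; the only way back to a successful submit is a fresh context, which is why the processes using the GPU have to be restarted.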


bugzilla-daemon＠freedesktop.org

2:43 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

--- Comment #14 from James.Dutton@gmail.com --- I stop gdm and kill any remaining X processes. When I start gdm and login, it works, and displays the desktop.

Previously, I was leaving one of the X processes running.

So, I think this (drm-next-5.2-wip) has fixed this bug.


bugzilla-daemon＠freedesktop.org

13 Aug

8:56 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

Alessandro lifeisfoo@gmail.com changed:

What |Removed |Added
----------------------------------------------------------------------------
CC   |        |lifeisfoo@gmail.com

--- Comment #15 from Alessandro lifeisfoo@gmail.com ---
Created attachment 145050
--> https://bugs.freedesktop.org/attachment.cgi?id=145050&action=edit
dmsg drm amdgpu

I'm facing the same issue with the 5.2.x and 5.3-rc4 kernels and a Radeon RX 580.


bugzilla-daemon＠freedesktop.org

9:20 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

Alessandro lifeisfoo@gmail.com changed:


bugzilla-daemon＠freedesktop.org

25 Sep

6:49 p.m.

https://bugs.freedesktop.org/show_bug.cgi?id=110509

GitLab Migration User gitlab-migration@fdo.invalid changed:

--- Comment #16 from GitLab Migration User gitlab-migration@fdo.invalid ---
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1389.

