https://bugs.freedesktop.org/show_bug.cgi?id=111591
Bug ID: 111591
Summary: [radeonsi/Navi] The Bard's Tale IV causes a GPU hang
Product: Mesa
Version: git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: not set
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: shtetldik@gmail.com
QA Contact: dri-devel@lists.freedesktop.org
When running The Bard's Tale IV, turning around at the beginning of the game consistently causes a GPU hang. I see this in dmesg:
[ 4246.501534] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 4251.365674] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=178390, emitted seq=178392
[ 4251.365740] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process BardsTale4-Linu pid 7251 thread BardsTale4:cs0 pid 7292
[ 4251.365742] [drm] GPU recovery disabled.
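The ring timeout line in that log carries two sequence numbers. As a minimal sketch (not part of the original report), the following Python snippet pulls them out and computes the gap. It assumes that "signaled" is the last fence the GPU completed and "emitted" is the last one submitted; the function name `parse_ring_timeout` is made up for illustration.

```python
import re

# Matches the amdgpu_job_timedout line quoted in the report.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # Assumption: emitted - signaled is the number of fences still outstanding.
    return m.group("ring"), signaled, emitted, emitted - signaled


line = (
    "[ 4251.365674] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 "
    "timeout, signaled seq=178390, emitted seq=178392"
)
print(parse_ring_timeout(line))
```

Under that assumption, the line from this report would mean two submissions on the gfx ring never completed.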
GPU: Sapphire Pulse RX 5700 XT
Kernel: 5.3.0-rc8+
OpenGL renderer string: AMD NAVI10 (DRM 3.33.0, 5.3.0-rc8+, LLVM 10.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.3.0-devel (git-87fa8d9ebc)
Game version: GOG, release 1.0.0 (version 4.20.1 / 32050).
--- Comment #1 from Timothy Arceri t_arceri@yahoo.com.au --- An apitrace of the problem would be helpful if you can get it.
--- Comment #2 from Shmerl shtetldik@gmail.com --- I'll try to make a trace. The error message looks like this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/driv...
--- Comment #3 from Shmerl shtetldik@gmail.com --- Is there any way to delay the start of tracing, to avoid a massive trace file?
--- Comment #4 from Shmerl shtetldik@gmail.com --- Uploaded the trace here (should be valid for 30 days): https://ufile.io/kvf9t1eu
Sorry for huge size, there is an unskippable cutscene in the beginning.
Compressed with pixz, so should be decompressible using all CPU cores (compatible with regular single threaded decompressing xz as well).
--- Comment #5 from Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com --- (In reply to Shmerl from comment #4)
> Uploaded the trace here (should be valid for 30 days): https://ufile.io/kvf9t1eu
> Sorry for huge size, there is an unskippable cutscene in the beginning.
> Compressed with pixz, so should be decompressible using all CPU cores (compatible with regular single threaded decompressing xz as well).
The attached trace doesn't cause a GPU hang here.
Does it hang on your machine?
--- Comment #6 from Shmerl shtetldik@gmail.com --- (In reply to Pierre-Eric Pelloux-Prayer from comment #5)
> The attached trace doesn't cause a GPU hang here.
> Does it hang on your machine?
I recorded it until the freeze happened, and then had to do Alt+SysRq+REISUB to reboot. So that's the resulting file. I'll try replaying the trace to see what happens.
--- Comment #7 from Shmerl shtetldik@gmail.com --- Just replayed the trace: it ended before the buggy part. Something must have interrupted it, or maybe it has a size cap? I'll try making it again.
--- Comment #8 from Shmerl shtetldik@gmail.com --- Here is a new trace: https://uploadfiles.io/9uykx7nh
Now it's catching the hang moment. Replaying it doesn't hang the GPU though, just produces some errors in the trace output.
--- Comment #9 from Timothy Arceri t_arceri@yahoo.com.au --- Thanks! I can reproduce the problem using the new trace.
It's strange: the problem is caused by some shaders failing to link, but the error message doesn't match what the shaders actually do. Dumping out the shaders and compiling them with our shader-db tool also results in them compiling correctly. There is clearly a bug in here somewhere, but it will take some more digging to find it.
--- Comment #10 from Timothy Arceri t_arceri@yahoo.com.au --- OK, apitrace was pointing me to the wrong shaders. I managed to find the correct ones and can confirm this is a bug in the game itself. I have reported the problem to the developers; let's see if they reply.
For completeness here is the body of the bug report:
"The games shaders use GLSL 4.30 which mean interpolation qualifiers must match across shader interfaces otherwise it is a link-time error. In GLSL 4.40 this restriction was relaxed.
There is at least one attempt in the game (maybe more?) to link a vertex shader output that sets the noperspective qualifier on an output to a fragment shader input where no interpolation qualifier is set. This results in hangs and stuttering in the game when it attempts to use the program that failed to link.
I've attached the problem shaders in a text file."
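To make the mismatch described in that report concrete, here is a minimal Python sketch that holds a reduced vertex/fragment shader pair and lists varyings whose declared interpolation qualifiers differ. The shaders are hypothetical stand-ins written for illustration (the game's actual shaders are in the attachment, not reproduced here), and the regex-based check is a toy, not how Mesa's linker works.

```python
import re

# Hypothetical reduced shaders; NOT the game's real shaders.
VERTEX_SRC = """
#version 430
noperspective out vec4 v_color;
void main() { v_color = vec4(1.0); gl_Position = vec4(0.0); }
"""

FRAGMENT_SRC = """
#version 430
in vec4 v_color;
out vec4 frag_color;
void main() { frag_color = v_color; }
"""

# Very small subset of GLSL declarations: [interp] in|out <type> <name>;
DECL = re.compile(
    r"^\s*(?:(flat|smooth|noperspective)\s+)?(in|out)\s+\w+\s+(\w+)\s*;",
    re.MULTILINE,
)


def interface(src, direction):
    """Map variable name -> declared interpolation qualifier ('' if none)."""
    return {
        name: qual
        for qual, found_dir, name in DECL.findall(src)
        if found_dir == direction
    }


def mismatches(vertex_src, fragment_src):
    """Return {name: (vs_qualifier, fs_qualifier)} where the declarations differ."""
    outs = interface(vertex_src, "out")
    ins = interface(fragment_src, "in")
    return {
        name: (outs[name], ins[name])
        for name in outs.keys() & ins.keys()
        if outs[name] != ins[name]
    }


print(mismatches(VERTEX_SRC, FRAGMENT_SRC))
```

The check compares the qualifiers as written, so an output declared `noperspective` and an input with no qualifier are reported as differing, which is the situation the report describes.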
--- Comment #11 from Shmerl shtetldik@gmail.com --- Since the game uses Unreal Engine, I wonder whether the developers control the shaders directly, or whether they are produced by the UE toolchain, which transpiles them from something else. In other words, it could be an upstream UE bug.
Just for the record, the game shows it's using Unreal Engine 4.20.1-150741.
--- Comment #12 from Timothy Arceri t_arceri@yahoo.com.au --- For now you could try using the environment variable:
allow_glsl_cross_stage_interpolation_mismatch=true
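One way to apply this from a launcher script is to copy the current environment, add the variable, and start the game with it. The Python below is a rough sketch under that assumption; the helper names and the launcher path in the comment are hypothetical, since the thread does not say how the game is started.

```python
import os
import subprocess

# Option name as given in the comment above.
OPTION = "allow_glsl_cross_stage_interpolation_mismatch"


def env_with_override(base=None):
    """Return a copy of the environment with the override set to 'true'."""
    env = dict(os.environ if base is None else base)
    env[OPTION] = "true"
    return env


def launch(command):
    """Start `command` (a list of arguments) with the override applied."""
    return subprocess.Popen(command, env=env_with_override())


# Example with a hypothetical launcher path:
# launch(["./start.sh"])
```

Setting the variable only for the child process keeps the override from leaking into other OpenGL applications started from the same session.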
--- Comment #13 from Shmerl shtetldik@gmail.com --- (In reply to Timothy Arceri from comment #12)
> For now you could try using the environment variable:
> allow_glsl_cross_stage_interpolation_mismatch=true
Thanks! I tried setting it, and it shows the message that it's overridden, but the game still hangs.
--- Comment #14 from Timothy Arceri t_arceri@yahoo.com.au --- (In reply to Shmerl from comment #13)
> (In reply to Timothy Arceri from comment #12)
> > For now you could try using the environment variable:
> > allow_glsl_cross_stage_interpolation_mismatch=true
> Thanks! I tried setting it, and it shows the message that it's overridden, but the game still hangs.
Are you sure it is hanging? There is a huge amount of stuttering due to the game compiling shaders in-game. It's really bad the first time I run the apitrace but much better the second time.
--- Comment #15 from Shmerl shtetldik@gmail.com --- (In reply to Timothy Arceri from comment #14)
> Are you sure it is hanging? There is a huge amount of stuttering due to the game compiling shaders in-game. It's really bad the first time I run the apitrace but much better the second time.
I couldn't even switch to a tty using Ctrl+Alt+F1, so I didn't check dmesg and just rebooted with SysRq. If this happens again with the override, maybe I can try accessing the machine remotely over ssh to check whether it's different from before.
--- Comment #16 from vggl vgglvyww36@khasekhemwy.net ---
"The games shaders use GLSL 4.30 which mean interpolation qualifiers must match across shader interfaces otherwise it is a link-time error. In GLSL 4.40 this restriction was relaxed."
I believe that relaxation came in version 4.30, not 4.40.
The 4.30 spec here: https://www.khronos.org/registry/OpenGL/specs/gl/GLSLangSpec.4.30.pdf
From the "4.3.4 Input Variables" section:
"The fragment shader inputs form an interface with the last active shader in the vertex processing pipeline. For this interface, the last active shader stage output variables and fragment shader input variables of the same name must match in type and qualification, with a few exceptions: The storage qualifiers must, of course, differ (one is in and one is out). Also, interpolation qualification (e.g., flat) and auxiliary qualification (e.g. centroid) may differ. These mismatches are allowed between any pair of stages. When interpolation or auxiliary qualifiers do not match, those provided in the fragment shader supersede those provided in previous stages. If any such qualifiers are completely missing in the fragment shaders, then the default is used, rather than any qualifiers that may have been declared in previous stages. That is, what matters is what is declared in the fragment shaders, not what is declared in shaders in previous stages."
That language is identical between 4.30 and 4.40. It sounds like it explicitly allows interpolation qualifiers to differ. However, the 4.20 spec language in that section was quite different and did require an interpolation qualifier match.
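The disagreement with comment #10 comes down to which GLSL version first relaxed the rule. A small Python sketch makes the consequence for a `#version 430` shader explicit. The function name and the integer version encoding are illustrative assumptions; the two thresholds are simply the two readings given in this thread, not a statement of what any driver implements.

```python
def mismatch_is_link_error(shader_version, relaxed_since):
    """True if an interpolation qualifier mismatch should fail linking,
    given the GLSL version in which the matching rule was relaxed."""
    return shader_version < relaxed_since


GAME_SHADER_VERSION = 430  # "The games shaders use GLSL 4.30" (comment #10)

readings = {
    "comment #10 (relaxed in 4.40)": 440,
    "comment #16 (relaxed in 4.30)": 430,
}

for label, relaxed_since in readings.items():
    print(label, mismatch_is_link_error(GAME_SHADER_VERSION, relaxed_since))
```

Under the first reading the game's program must fail to link; under the second the same program is valid, and the fragment shader's missing qualifier simply selects the default interpolation.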
Also, from https://www.khronos.org/opengl/wiki/Shader_Compilation#Interface_matching:
"If GLSL 4.30 or later is available, then the interpolation qualifiers (including centroid and sample) do not need to match."
--- Comment #17 from Shmerl shtetldik@gmail.com --- (In reply to Timothy Arceri from comment #14)
> Are you sure it is hanging? There is a huge amount of stuttering due to the game compiling shaders in-game. It's really bad the first time I run the apitrace but much better the second time.
It is a hang. Even with allow_glsl_cross_stage_interpolation_mismatch=true it gets stuck permanently. I was able to log into the system over ssh when that happened, and this was shown in dmesg:
[ 149.642857] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 154.762918] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=20378, emitted seq=20380
[ 154.762984] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process BardsTale4-Linu pid 2563 thread BardsTale4:cs0 pid 2597
[ 154.762986] [drm] GPU recovery disabled.
[ 363.660017] INFO: task BardsTale4-Linu:2563 blocked for more than 120 seconds.
[ 363.660021] Tainted: G E 5.3.0-rc8+ #14
[ 363.660022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.660023] BardsTale4-Linu D 0 2563 2556 0x80004002
[ 363.660026] Call Trace:
[ 363.660033] ? __schedule+0x2b9/0x6c0
[ 363.660035] schedule+0x39/0xa0
[ 363.660037] schedule_timeout+0x20f/0x300
[ 363.660040] dma_fence_default_wait+0x1c2/0x2a0
[ 363.660042] ? dma_fence_free+0x20/0x20
[ 363.660044] dma_fence_wait_timeout+0xdd/0xf0
[ 363.660106] gmc_v10_0_flush_gpu_tlb+0x159/0x1a0 [amdgpu]
[ 363.660157] amdgpu_gart_unbind+0x89/0xb0 [amdgpu]
[ 363.660206] amdgpu_ttm_backend_unbind+0x3c/0xe0 [amdgpu]
[ 363.660211] ttm_tt_unbind+0x1d/0x30 [ttm]
[ 363.660215] ttm_tt_destroy.part.0+0xe/0x50 [ttm]
[ 363.660219] ttm_bo_cleanup_memtype_use+0x2e/0x70 [ttm]
[ 363.660222] ttm_bo_put+0x24e/0x2a0 [ttm]
[ 363.660269] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[ 363.660317] amdgpu_gem_object_free+0x2e/0x50 [amdgpu]
[ 363.660328] drm_gem_object_release_handle+0x5a/0xc0 [drm]
[ 363.660339] ? drm_gem_object_handle_put_unlocked+0x90/0x90 [drm]
[ 363.660341] idr_for_each+0x5e/0xd0
[ 363.660344] ? __inode_wait_for_writeback+0x7e/0xf0
[ 363.660354] drm_gem_release+0x1c/0x30 [drm]
[ 363.660363] drm_file_free.part.0+0x2ab/0x300 [drm]
[ 363.660373] drm_release+0x4b/0x80 [drm]
[ 363.660375] __fput+0xb9/0x250
[ 363.660378] task_work_run+0x8a/0xb0
[ 363.660381] do_exit+0x2f5/0xb60
[ 363.660383] do_group_exit+0x3a/0xa0
[ 363.660385] get_signal+0x15b/0x890
[ 363.660387] do_signal+0x30/0x690
[ 363.660390] ? _copy_from_user+0x37/0x60
[ 363.660393] exit_to_usermode_loop+0x91/0xf0
[ 363.660394] do_syscall_64+0x100/0x110
[ 363.660396] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 363.660398] RIP: 0033:0x4540f22
[ 363.660403] Code: Bad RIP value.
[ 363.660404] RSP: 002b:00007fff54bf6c30 EFLAGS: 00210202
[ 363.660406] RAX: 00007fff54bf6c30 RBX: 0000000000000001 RCX: 00000000939f4000
[ 363.660406] RDX: 00007fff54bf6c88 RSI: 00007fff54bf6c98 RDI: 00007fff54bf6c80
[ 363.660407] RBP: 00007fa81869c430 R08: 000000000000021f R09: 000000000936d890
[ 363.660408] R10: 0000000000000001 R11: 0000000000200206 R12: 00007fff54bf6d90
[ 363.660408] R13: 0000000000000008 R14: 000000000768bdd8 R15: 00007fff54bf6ce0
Maybe the trace alone isn't enough to reproduce it? Did you try the actual game?
--- Comment #18 from Shmerl shtetldik@gmail.com --- Just for reference, I'm using firmware from here: https://people.freedesktop.org/~agd5f/radeon_ucode/navi10/
GitLab Migration User gitlab-migration@fdo.invalid changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|NEW                         |RESOLVED
--- Comment #19 from GitLab Migration User gitlab-migration@fdo.invalid --- -- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1427.