https://bugs.freedesktop.org/show_bug.cgi?id=111231
Bug ID: 111231
Summary: VM_L2_PROTECTION_FAULT
Product: DRI
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: major
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: ds2.bugs.freedesktop@gmail.com
When playing minetest on an AMD Ryzen 2200G with Vega integrated graphics, occasionally the system will appear to suffer a graphics lock-up during game load when the loading bar appears. When this occurs, dmesg spits out a VM_L2_PROTECTION_FAULT and then repeated errors about fence timeouts:
[ 5699.136659] amdgpu 0000:0b:00.0: [gfxhub] no-retry page fault (src_id:0 ring:155 vmid:5 pasid:32770, for process minetest pid 7127 thread minetest:cs0 pid 7133)
[ 5699.136662] amdgpu 0000:0b:00.0: in page starting at address 0x000080014034d000 from 27
[ 5699.136664] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00501136
[ 5704.343299] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5709.259775] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5709.259860] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5709.259862] [drm] GPU recovery disabled.
[ 5709.463238] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out.
[ 5719.286451] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5719.286537] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5719.286539] [drm] GPU recovery disabled.
[ 5729.312836] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5729.312921] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5729.312923] [drm] GPU recovery disabled.
[ 5739.339485] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5739.339570] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5739.339572] [drm] GPU recovery disabled.
[ 5749.366552] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5749.366637] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process minetest pid 7127 thread minetest:cs0 pid 7133
[ 5749.366640] [drm] GPU recovery disabled.
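For anyone comparing several logs of this kind, a minimal Python sketch for summarising them follows. It assumes the exact message wording seen in this excerpt; the function name `summarise_dmesg` and the shortened sample are illustrative only and are not part of any amdgpu tooling.

```python
import re

# Patterns assume the message wording seen in this report's dmesg excerpt.
FAULT_RE = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:(0x[0-9a-fA-F]+)")
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] \[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (\w+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def summarise_dmesg(text):
    """Collect fault status values and ring-timeout details from dmesg text."""
    faults = FAULT_RE.findall(text)
    timeouts = [
        {
            "time": float(stamp),
            "ring": ring,
            # Jobs emitted to the ring but not yet signalled as complete.
            "pending": int(emitted) - int(signaled),
        }
        for stamp, ring, signaled, emitted in TIMEOUT_RE.findall(text)
    ]
    return {"faults": faults, "timeouts": timeouts}


# Three lines copied from the excerpt above.
SAMPLE = """\
[ 5699.136664] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00501136
[ 5709.259775] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
[ 5719.286451] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=443165, emitted seq=443167
"""

summary = summarise_dmesg(SAMPLE)
print(summary["faults"])
for entry in summary["timeouts"]:
    print(entry)
```

On the sample, the two timeout entries report the same two pending jobs roughly ten seconds apart, which matches the repeating pattern in the full log: the ring makes no progress after the fault.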
Notably, when playing minetest normally, this doesn't always happen, but when it does the screen gets a light covering of graphical corruption "confetti" (photos to follow - had to be taken on a phone, sorry). Currently running a mesa debug build compiled from git at commit b0626c1f306 after seeing if https://bugs.freedesktop.org/show_bug.cgi?id=105251 had anything to do with it - I think this is related but not entirely a duplicate, as a fix mentioned there did stop the test program there from having an effect but did not stop this problem.
In the course of trying to reproduce this problem in a more repeatable manner, I decided to take an apitrace (will attach in following messages). Interestingly, the brief trace I took did not crash my system while recording it, but replaying it will now fairly regularly cause the same kind of lockup, more frequently than the game itself will. I ran apitrace replay in verbose mode to see whereabouts it stopped, in case this gave an approximate indication of where things started going pear-shaped. The point at which output ends is well short of the entire apitrace dump, as expected from what I saw - and additionally the stderr appears to contain an exception of some kind. See the apitrace.out.txt and apitrace.err.txt attachments (to follow separately).
I haven't yet got a dmesg output from running minetest itself, but I have got some runs (spanning from boot to either hard or soft reboot - sometimes xorg was killable, other times not) from replaying the offending api trace. These will also be attached in follow-up messages. These appear to have a lot more GPU faults before the messages about timeouts appear.
deltasquared ds2.bugs.freedesktop@gmail.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Summary |VM_L2_PROTECTION_FAULT      |VM_L2_PROTECTION_FAULT when
        |                            |loading minetest on AMD
        |                            |ryzen 2200G integrated
        |                            |graphics
deltasquared ds2.bugs.freedesktop@gmail.com changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Summary |VM_L2_PROTECTION_FAULT when |random
        |loading minetest on AMD     |VM_L2_PROTECTION_FAULTs
        |ryzen 2200G integrated      |when loading a world in
        |graphics                    |minetest on AMD ryzen 2200G
        |                            |integrated graphics
--- Comment #1 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144882
  --> https://bugs.freedesktop.org/attachment.cgi?id=144882&action=edit
API trace that can reliably cause GPU protection faults on a ryzen 2200G

The aforementioned "dodgy" apitrace trace file.
--- Comment #2 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144883
  --> https://bugs.freedesktop.org/attachment.cgi?id=144883&action=edit
apitrace replay --verbose --debug: stdout
NB: stderr attached separately. Note that it stops after a certain swap buffers call, so I can only guess something occurred leading up to that which would cause difficulty.
I note there are some attrib pointer calls in-between that and the previous swap, which from my understanding of bug 105251 was one thing that could cause crashes - however while that test program was fixed in the git build, this issue was not. I lack the knowledge to spot which particular call is the bad one though.
--- Comment #3 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144884
  --> https://bugs.freedesktop.org/attachment.cgi?id=144884&action=edit
apitrace replay --verbose --debug: stderr
stderr of the same as above. I made them separate as it helped me to have a look through them - though notably the stack trace I see at the end of this stderr output can't be placed in relation to stdout now, so if need be I can re-run the offending replay file with both redirected to the same file.
--- Comment #4 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Oh dear, it seems I'm getting in a bit of a muddle with the attachments, please bear with me.
deltasquared ds2.bugs.freedesktop@gmail.com changed:
What               |Removed |Added
----------------------------------------------------------------------------
Attachment #144883 |0       |1
is obsolete        |        |
--- Comment #5 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144885
  --> https://bugs.freedesktop.org/attachment.cgi?id=144885&action=edit
apitrace replay --verbose --debug: stdout
NB: stderr attached separately. Note that it stops after a certain swap buffers call, so I can only guess something occurred leading up to that which would cause difficulty.
I note there are some attrib pointer calls in-between that and the previous swap, which from my understanding of bug 105251 was one thing that could cause crashes - however while that test program was fixed in the git build, this issue was not. I lack the knowledge to spot which particular call is the bad one though.
deltasquared ds2.bugs.freedesktop@gmail.com changed:
What               |Removed |Added
----------------------------------------------------------------------------
Attachment #144884 |0       |1
is obsolete        |        |
--- Comment #6 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144886
  --> https://bugs.freedesktop.org/attachment.cgi?id=144886&action=edit
apitrace replay --verbose --debug: stderr
stderr of the same as above. I made them separate as it helped me to have a look through them - though notably the stack trace I see at the end of this stderr output can't be placed in relation to stdout now, so if need be I can re-run the offending replay file with both redirected to the same file.
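Capturing both streams in one file, as offered above, could be done along these lines. This is only a sketch: `run_merged` is a hypothetical helper, the demo uses a stand-in child process rather than apitrace, and the trace file name in the comment is a placeholder.

```python
import os
import subprocess
import sys
import tempfile


def run_merged(cmd, log_path):
    """Run cmd with stdout and stderr written to the same log file."""
    with open(log_path, "wb") as log:
        # stderr=subprocess.STDOUT sends both streams to the one file handle.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# For the real thing the command would be something like (placeholder file name):
#   ["apitrace", "replay", "--verbose", "--debug", "minetest.trace"]
# The demo below uses a stand-in child that writes to both streams.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('out'); print('err', file=sys.stderr)",
]
log_file = os.path.join(tempfile.mkdtemp(), "replay.log")
exit_code = run_merged(demo_cmd, log_file)

with open(log_file) as fh:
    merged = fh.read()
print(exit_code, sorted(merged.split()))
```

The relative order of the two streams in the file still depends on how the child buffers its output, so the position of the stack trace is only approximate unless the program flushes stdout regularly.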
--- Comment #7 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144887
  --> https://bugs.freedesktop.org/attachment.cgi?id=144887&action=edit
dmesg log from boot to running apitrace replay on the above apitrace trace file
Notably there are a lot more "VM_L2_PROTECTION_FAULT_STATUS: ..." messages when replaying (this file) vs the original dmesg output (when I was able to hit the bug playing the game itself) in the main bug description.
--- Comment #8 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144888
  --> https://bugs.freedesktop.org/attachment.cgi?id=144888&action=edit
dmesg output from boot to stopping dmesg when killing xorg was possible
In this case I was able to kill xorg and return to the linux console. When this happens the protection faults continue in dmesg but the pid and thread id values go to zero, not sure if this is significant. This particular dmesg output accompanies the attached apitrace stdout/stderr files from that replay run.
deltasquared ds2.bugs.freedesktop@gmail.com changed:
What    |Removed  |Added
----------------------------------------------------------------------------
Version |XOrg git |unspecified
--- Comment #9 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144889
  --> https://bugs.freedesktop.org/attachment.cgi?id=144889&action=edit
Observed graphical corruption - left hand side of monitor

Taken in two photos as my screen's a rather large one. This graphical corruption appears along the edges of some objects, and can *sometimes* occur either when running minetest directly and loading a world or when replaying the above api trace. However, I notice that sometimes no graphical corruption occurs whatsoever but everything still freezes. That's on top of the fact that the freeze itself doesn't happen all the time... suggests something highly non-deterministic at play? Sometimes it flickers to a similarly corrupted version of the minetest logo before the freeze; I haven't caught that on camera yet.
--- Comment #10 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Created attachment 144890
  --> https://bugs.freedesktop.org/attachment.cgi?id=144890&action=edit
Observed graphical corruption - right hand side of monitor
Other side of above. The sky colour is otherwise undisturbed to the top edge of the monitor, hence why the top edge was not in shot. I doubt this camera would have picked up enough detail otherwise - again it's a fairly large monitor.
--- Comment #11 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Some additional information I had neglected to mention in the initial description in the "excitement" of filing my first bug here...
Relevant hardware is as stated a ryzen 2200G running solely on integrated vega graphics - I haven't mentioned any other specs of the system as this bug has persisted across replacements of all components, even the motherboard - only things that have not changed are the PSU, nvme storage and the ryzen chip itself.
Distro is arch linux with all packages up to date at the time of writing. Kernel version 5.2.1-arch1-1-ARCH. Mesa built-from-git version mentioned in bug description. LLVM version 8.0.1.
Any other information is available on request.
Bas Nieuwenhuizen bas@basnieuwenhuizen.nl changed:
What       |Removed     |Added
----------------------------------------------------------------------------
QA Contact |            |dri-devel@lists.freedesktop.org
Component  |DRM/AMDgpu  |Drivers/Gallium/radeonsi
Version    |unspecified |git
Product    |DRI         |Mesa
--- Comment #12 from Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com ---
Thanks for the bug report.
I could reproduce the bug using the provided apitrace, both on a Ryzen platform and on a Vega Mobile laptop (can't reproduce on Navi).
Using MESA_DEBUG=flush or AMD_DEBUG=check_vm seems to make the problem go away, so my guess would be a synchronization / cache issue, but I haven't found the root cause yet.
--- Comment #13 from Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com ---
Using AMD_DEBUG=nodpbb "fixes" the problem.
--- Comment #14 from deltasquared ds2.bugs.freedesktop@gmail.com ---
The apitrace no longer causes issues on my system either if I use AMD_DEBUG=nodpbb. I also decided to try this on minetest and *so far* (bearing in mind the issue was non-deterministic in the first place, so a decisive ruling is near impossible) I have not re-incurred a crash.
Interestingly, what I have noticed is that sometimes when minetest did not lock up my system before, the loading bar would suffer mild graphical corruption (bits of the black border go white) - quite difficult to capture on camera due to being so fleeting. So far with nodpbb I have yet to observe these artefacts again.
I did try launching a minetest world with AMD_DEBUG=check_vm instead, however I somehow still managed to get a lock-up that way with similar graphical corruption as the bug description. Alas it seems my btrfs root decided to eat my dmesg log file when I had to force power off, so unable to see if it was the dreaded VM_L2_PROTECTION_FAULT again >:(
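For repeating this kind of comparison, a small wrapper that launches a command under each of the debug settings mentioned in this thread might look like the sketch below. `run_with_env` is a hypothetical helper, and the demo child only echoes the variables back; the real command (minetest or an apitrace replay) would be substituted in.

```python
import os
import subprocess
import sys


def run_with_env(cmd, overrides):
    """Run cmd with extra environment variables layered over the current ones."""
    env = dict(os.environ)
    env.update(overrides)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Settings mentioned in this thread; whether each one helps depends on the
# hardware and driver version in use.
SETTINGS = [
    {"AMD_DEBUG": "nodpbb"},
    {"AMD_DEBUG": "check_vm"},
    {"MESA_DEBUG": "flush"},
]

# Stand-in child that prints the two variables; replace with the real command
# when testing for real.
echo_cmd = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('AMD_DEBUG'), os.environ.get('MESA_DEBUG'))",
]

for setting in SETTINGS:
    result = run_with_env(echo_cmd, setting)
    print(setting, "->", result.stdout.strip())
```

Because the lock-up is intermittent, each setting would need many runs before a clean result means much.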
--- Comment #15 from Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com ---
Could you test the branch from MR https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1554 and let me know if it fixes the issue for you?
Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com changed:
What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |---     |FIXED
--- Comment #16 from Pierre-Eric Pelloux-Prayer pierre-eric.pelloux-prayer@amd.com ---
The MR has been merged.
Thanks for your help!
--- Comment #17 from deltasquared ds2.bugs.freedesktop@gmail.com ---
Apologies for being late to reply. Having run mesa built from the MR branch, I have since been unable to get the same crash when running minetest. Certainly the apitrace capture can no longer bring my system down; however, the game itself was always less deterministic than that, so the absence of the crash was hard to prove. That said, I have been playing the game again for a few days now and have not experienced the crash, so I feel reasonably comfortable it has gone.