https://bugs.freedesktop.org/show_bug.cgi?id=91278
Bug ID: 91278 Summary: Tonga GPU lock/reset fail with Unigine Valley Product: DRI Version: XOrg git Hardware: Other OS: All Status: NEW Severity: normal Priority: medium Component: DRM/AMDgpu Assignee: dri-devel@lists.freedesktop.org Reporter: adf.lists@gmail.com
R9 285 kernel agd5f amdgpu with or without patches from
https://bugs.freedesktop.org/show_bug.cgi?id=91141
mesa is agd5f with a few patches from mainline to build with current llvm.
ddx is git against older xorg.
Simpler games like openarena don't lock.
Valley settings ultra, 8xAA, fullscreen 1920x1080.
Doesn't show in this log but I've also seen some
[drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-35)
around the lock/reset on previous tests.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #1 from Andy Furniss adf.lists@gmail.com --- Created attachment 117012 --> https://bugs.freedesktop.org/attachment.cgi?id=117012&action=edit dmesg with lock
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #2 from Andy Furniss adf.lists@gmail.com --- Created attachment 117013 --> https://bugs.freedesktop.org/attachment.cgi?id=117013&action=edit xorg-log for ref (not from lock)
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #3 from Andy Furniss adf.lists@gmail.com --- I've of course tried various things sine reporting - Valley doesn't always instantly lock. Unreal 4.5 Elemental got half way before locking.
Perhaps more interesting I managed to reset/fail resume just browsing - of course I've done a lot of browsing without issue so far. The difference this time was I had a huge ffmpeg/x265 encode going - it was using all my memory (8 Gig and swap had been used a bit), so it's possible memory pressure plays a role - or maybe just a red herring :-)
I haven't managed to get a reset running timedemos on openarena or xonotic so far - will try with memory pressure as time allows.
The reset when browsing -
-rw-rw-r-- 1 andy andy 153K Jun 13 00:04 hacky-fix.jpeg [ 8052.101670] amdgpu 0000:01:00.0: GPU lockup (waiting for 0x000000000000f019 last fence id 0x000000000000f018 on ring 9) [ 8052.101672] amdgpu 0000:01:00.0: failed to sync rings (-35) [ 8052.108912] amdgpu 0000:01:00.0: Saved 9216 dwords of commands on ring 9. [ 8052.108929] amdgpu 0000:01:00.0: GPU softreset: 0x00000100 [ 8052.108930] amdgpu 0000:01:00.0: GRBM_STATUS=0x00003028 [ 8052.108932] amdgpu 0000:01:00.0: GRBM_STATUS2=0x00000008 [ 8052.108934] amdgpu 0000:01:00.0: GRBM_STATUS_SE0=0x00000006 [ 8052.108935] amdgpu 0000:01:00.0: GRBM_STATUS_SE1=0x00000006 [ 8052.108937] amdgpu 0000:01:00.0: GRBM_STATUS_SE2=0x00000006 [ 8052.108938] amdgpu 0000:01:00.0: GRBM_STATUS_SE3=0x00000006 [ 8052.108940] amdgpu 0000:01:00.0: SRBM_STATUS=0x20020240 [ 8052.108941] amdgpu 0000:01:00.0: SRBM_STATUS2=0x00000080 [ 8052.108943] amdgpu 0000:01:00.0: SDMA0_STATUS_REG = 0x76DEED57 [ 8052.108945] amdgpu 0000:01:00.0: SDMA1_STATUS_REG = 0x46DEED57 [ 8052.108946] amdgpu 0000:01:00.0: CP_STAT = 0x00000000 [ 8052.108948] amdgpu 0000:01:00.0: CP_STALLED_STAT1 = 0x00000c00 [ 8052.108949] amdgpu 0000:01:00.0: CP_STALLED_STAT2 = 0x00000000 [ 8052.108951] amdgpu 0000:01:00.0: CP_STALLED_STAT3 = 0x00000000 [ 8052.108953] amdgpu 0000:01:00.0: CP_CPF_BUSY_STAT = 0x00000000 [ 8052.108954] amdgpu 0000:01:00.0: CP_CPF_STALLED_STAT1 = 0x00000000 [ 8052.108956] amdgpu 0000:01:00.0: CP_CPF_STATUS = 0x00000000 [ 8052.108957] amdgpu 0000:01:00.0: CP_CPC_BUSY_STAT = 0x00000000 [ 8052.108959] amdgpu 0000:01:00.0: CP_CPC_STALLED_STAT1 = 0x00000000 [ 8052.108961] amdgpu 0000:01:00.0: CP_CPC_STATUS = 0x00000000 [ 8052.108962] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 8052.108964] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 8052.109078] amdgpu 0000:01:00.0: SRBM_SOFT_RESET=0x00000400 [ 8052.110233] amdgpu 0000:01:00.0: GRBM_STATUS=0x00003028 [ 8052.110235] amdgpu 0000:01:00.0: GRBM_STATUS2=0x00000008 [ 8052.110236] amdgpu 0000:01:00.0: GRBM_STATUS_SE0=0x00000006 [ 8052.110238] amdgpu 0000:01:00.0: GRBM_STATUS_SE1=0x00000006 [ 8052.110239] amdgpu 0000:01:00.0: GRBM_STATUS_SE2=0x00000006 [ 8052.110241] amdgpu 0000:01:00.0: GRBM_STATUS_SE3=0x00000006 [ 8052.110242] amdgpu 0000:01:00.0: SRBM_STATUS=0x20020040 [ 8052.110244] amdgpu 0000:01:00.0: SRBM_STATUS2=0x00000080 [ 8052.110245] amdgpu 0000:01:00.0: SDMA0_STATUS_REG = 0x76DEED57 [ 8052.110247] amdgpu 0000:01:00.0: SDMA1_STATUS_REG = 0x46DEED57 [ 8052.110248] amdgpu 0000:01:00.0: CP_STAT = 0x00000000 [ 8052.110250] amdgpu 0000:01:00.0: CP_STALLED_STAT1 = 0x00000c00 [ 8052.110252] amdgpu 0000:01:00.0: CP_STALLED_STAT2 = 0x00000000 [ 8052.110253] amdgpu 0000:01:00.0: CP_STALLED_STAT3 = 0x00000000 [ 8052.110255] amdgpu 0000:01:00.0: CP_CPF_BUSY_STAT = 0x00000000 [ 8052.110256] amdgpu 0000:01:00.0: CP_CPF_STALLED_STAT1 = 0x00000000 [ 8052.110258] amdgpu 0000:01:00.0: CP_CPF_STATUS = 0x00000000 [ 8052.110259] amdgpu 0000:01:00.0: CP_CPC_BUSY_STAT = 0x00000000 [ 8052.110261] amdgpu 0000:01:00.0: CP_CPC_STALLED_STAT1 = 0x00000000 [ 8052.110262] amdgpu 0000:01:00.0: CP_CPC_STATUS = 0x00000000 [ 8052.110282] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume [ 8052.110289] [drm] probing gen 2 caps for device 1002:5a16 = 31cd02/0 [ 8052.111446] [drm] PCIE GART of 2048M enabled (table at 0x0000000000040000). [ 8052.113940] [drm] ring test on 0 succeeded in 10 usecs [ 8053.856277] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8054.049187] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 2 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8054.242101] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 3 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8054.435020] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 4 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8054.627925] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 5 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8054.820839] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 6 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8055.013737] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 7 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8055.206669] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 8 test failed (scratch(0xC040)=0xCAFEDEAD) [ 8055.313826] [drm:sdma_v3_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD) [ 8055.319862] amdgpu 0000:01:00.0: GPU reset failed [ 8055.320787] amdgpu 0000:01:00.0: couldn't schedule ib [ 8055.320806] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22) [ 8055.320831] amdgpu 0000:01:00.0: couldn't schedule ib [ 8055.320841] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22) [ 8055.320854] amdgpu 0000:01:00.0: couldn't schedule ib [ 8055.320863] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-22)
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #4 from Andy Furniss adf.lists@gmail.com --- I tested Xonotic with load over the weekend and it did lock after about 10 minutes - but I then tested without any memory pressure and still managed to lock, it did take longer.
I got another lock ehile browsing today - there was nothing else happening at the time, but I had a few minutes earlier compiled llvm and mesa.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
Mathias Tillman master.homer@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |master.homer@gmail.com
--- Comment #5 from Mathias Tillman master.homer@gmail.com --- Created attachment 117739 --> https://bugs.freedesktop.org/attachment.cgi?id=117739&action=edit Kernel log of hang
Have been getting the same hangs, though I get it while just using the computer normally, or even while it was idle.
Using Ubuntu vivid with kernel 4.2-rc7 from Ubuntu mainline with the oibaf ppa and a self-compiled xf86-video-amdgpu module.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #6 from Michel Dänzer michel@daenzer.net --- (In reply to Mathias Tillman from comment #5)
Have been getting the same hangs, though I get it while just using the computer normally, or even while it was idle.
The symptoms may be similar, but since the circumstances differ, please file your own report.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #7 from Andy Furniss adf.lists@gmail.com --- Created attachment 117964 --> https://bugs.freedesktop.org/attachment.cgi?id=117964&action=edit hung tast with current agd5f drm-next-4.3
Valley does sometimes get further with newer gits - I have recently got all the way through the scenes. It does still lock though.
Attached is a hung task trace with current agd5f drm-next-4.3, libdrm, mesa and a recentish llvm.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
Alex Deucher alexdeucher@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |edward.ocallaghan@koparo.co | |m
--- Comment #8 from Alex Deucher alexdeucher@gmail.com --- *** Bug 92087 has been marked as a duplicate of this bug. ***
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #9 from Mathias Tillman master.homer@gmail.com --- Created attachment 118448 --> https://bugs.freedesktop.org/attachment.cgi?id=118448&action=edit apitrace of hang
Not exactly sure how useful it is, but I have attached an excerpt of an apitrace of the unigine valley demo. I had to cut out most of it, due to the size of it - I've run it several times, and the size has always been >500MB. The last thing in the trace was always glXSwapBuffers, so the excerpt consists of the contents between the last and the next to last glXSwapBuffers lines.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #10 from Michel Dänzer michel@daenzer.net --- Mathias, so you can reliably reproduce the hang by replaying that apitrace?
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #11 from Mathias Tillman master.homer@gmail.com --- (In reply to Michel Dänzer from comment #10)
Mathias, so you can reliably reproduce the hang by replaying that apitrace?
Unfortunately I haven't been able to replay the apitrace properly - I just get a black screen with a bunch of errors in the output about not supporting GLSL 1.50, program not supported etc. I will keep trying though.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #12 from Mathias Tillman master.homer@gmail.com --- (In reply to Michel Dänzer from comment #10)
Mathias, so you can reliably reproduce the hang by replaying that apitrace?
Okay, I have been able to replay the trace by compiling apitrace from source, and renaming glretrace to valley_x64. I can reproduce the hang by replaying, but not in a very useful way as the hang happens on different frames on each replay, so I wouldn't really call it reproducible at this point. I will see what different options glretrace gives me, to see if I can find some kind of common denominator between the hangs.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #13 from Mathias Tillman master.homer@gmail.com --- Sorry for triple posting, but I have some more info that may or may not be useful. When enabling verbose output from glretrace, I can see that the next to last operation before a hang is always glBindBuffer (I've run it 10 times) even though the rest of the output is very different. This, however, only happens when double buffer visuals is enabled, if I disable it using --sb the output is the same as above, ending with glXSwapBuffers. Not sure if this is a coincidence or not, but it seemed interesting to me.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #14 from Andy Furniss adf.lists@gmail.com --- FWIW I don't think the Valley code/shaders/whatever its self triggers this.
I can run valley for > an hour depending on luck/state of my box.
One way to get a long run for me is to go into mem sleep, come out and while nothing else is running run valley. Based on only a few runs like this, I haven't locked it yet. Randomly starting valley after my PC has been in use all day may lock withing 10 seconds.
I am not saying mem sleep cures all locks, just seems to make it hard. After running valley for an hour one time, I mover onto Unreal elemental, several runs, no lock. I then tried Unreal Atlantis and eventually got it to hang with that, though it ran through the scripted bit and I had to start flying around interactively before the hang.
Unreal Atlantis is a bit different/annoying. Different in that it requests more vram than I have and annoying as it makes a 1.8 gig cache file under $HOME/.config.
This mem sleep observation may be luck. I also haven't tested some variants yet like with vblank_mode=0 or cpufreq_ondemand settings. I run valley all maxed fullscreen - so yet more variables verses what others may be running it with.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #15 from Alex Deucher alexdeucher@gmail.com --- These patches may help: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150928/302380.htm... http://lists.freedesktop.org/archives/mesa-dev/2015-September/095718.html
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #16 from Mathias Tillman master.homer@gmail.com --- (In reply to Alex Deucher from comment #15)
These patches may help: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150928/302380.htm... http://lists.freedesktop.org/archives/mesa-dev/2015-September/095718.html
Afraid not, applied both of them and it still hangs. One interesting thing is that I was using mesa from the oibaf ppa which compiles against llvm 3.6. While using that I haven't been able to replay my apitrace of valley once - it always hangs before it finishes. However, I compiled mesa against llvm 3.7 (one compiled from source with the patch, and one from llvm's apt repository) and 3.8, and it gets much further now - I've been able to replay the trace three times without a hang, though it does ultimately hang unfortunately.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #17 from Andy Furniss adf.lists@gmail.com --- Haven't had time yet to hang with the patches.
Yesterday without I them I hung, rebooted, did the memsleep, then tested the rest of the day trying to lock valley and unreal but couldn't. For the whole day, the only logging I got was a few hundred -
Sep 29 18:10:47 ph4 kernel: VM fault (0x04, vmid 4) at page 1529213, read from 'TC6' (0x54433600) (72) Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0be84804 Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0017557D Sep 29 18:10:49 ph4 kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08048004 Sep 29 18:10:49 ph4 kernel: VM fault (0x04, vmid 4) at page 1529213, read from 'TC6' (0x54433600) (72)
Last thing I applied the patches to couple of days old llvm and mesa gits.
This morning ran valley from power off boot after a bit of browsing/mail (yesterday this hung).
Only a quick test which I stopped, looked OK but in dmesg I have >10k of -
[ 1792.292640] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0918c404 [ 1792.292643] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00136F23 [ 1792.292644] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x060C4004 [ 1792.292646] VM fault (0x04, vmid 3) at page 1273635, read from 'TC4' (0x54433400) (196) [ 1792.292650] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404 [ 1792.292651] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1792.292652] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1792.292654] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0) [ 1792.292658] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09188404 [ 1792.292659] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1792.292660] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1792.292661] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0) [ 1792.292666] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09180404 [ 1792.292667] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1792.292668] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1792.292669] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0) [ 1792.375515] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09188404 [ 1792.375518] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00136F23 [ 1792.375519] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06084004 [ 1792.375521] VM fault (0x04, vmid 3) at page 1273635, read from 'TC10' (0x54433130) (132) [ 1792.375526] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404 [ 1792.375527] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1792.375528] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1792.375530] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0) [ 1792.375534] amdgpu 0000:01:00.0: GPU fault detected: 146 0x0918c404 [ 1792.375535] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1792.375536] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1792.375538] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0) [ 1792.375542] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09180404 [ 1792.375543] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1792.375544] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1792.375546] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0) [ 1792.432272] amdgpu 0000:01:00.0: GPU fault detected: 146 0x09184404 [ 1792.432276] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00136F23 [ 1792.432277] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06044004 [ 1792.432280] VM fault (0x04, vmid 3) at page 1273635, read from 'TC7' (0x54433700) (68)
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #18 from Andy Furniss adf.lists@gmail.com --- So after last update I ran valley again briefly and saw a few vmfaults, then a longer run and got thousands again.
without touching anything else did echo mem >/sys/power/state and then woke up.
10 minute run of valley has produced zero faults.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #19 from Andy Furniss adf.lists@gmail.com --- (In reply to Andy Furniss from comment #18)
So after last update I ran valley again briefly and saw a few vmfaults, then a longer run and got thousands again.
without touching anything else did echo mem >/sys/power/state and then woke up.
10 minute run of valley has produced zero faults.
Further test from power off, nothing else running apart from X/fluxox short run of valley no faults. Reran valley for a bit longer and got thousands. Did memsleep ran valley no faults but after about 10 minutes it hung.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #20 from Mathias Tillman master.homer@gmail.com --- (In reply to Andy Furniss from comment #19)
(In reply to Andy Furniss from comment #18)
So after last update I ran valley again briefly and saw a few vmfaults, then a longer run and got thousands again.
without touching anything else did echo mem >/sys/power/state and then woke up.
10 minute run of valley has produced zero faults.
Further test from power off, nothing else running apart from X/fluxox short run of valley no faults. Reran valley for a bit longer and got thousands. Did memsleep ran valley no faults but after about 10 minutes it hung.
Do you get those GPU faults in the log even when there's no hang? I haven't checked dmesg while running valley myself, but I do know they always appear when a hang has happened (I'm using ssh to grab dmesg while it's hung).
Dmesg is sometimes completely filled with GPU faults, other times it's just a few. I ran it a few minutes ago and only got this:
[ 1737.984328] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08804804 [ 1737.984338] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00100110 [ 1737.984343] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A048004 [ 1737.984348] VM fault (0x04, vmid 5) at page 1048848, read from 'TC6' (0x54433600) (72) [ 1737.984355] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08804004 [ 1737.984359] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1737.984363] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1737.984366] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0) [ 1737.984374] amdgpu 0000:01:00.0: GPU fault detected: 146 0x08800804 [ 1737.984378] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000 [ 1737.984381] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000 [ 1737.984384] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #21 from Andy Furniss adf.lists@gmail.com --- (In reply to Mathias Tillman from comment #20)
Do you get those GPU faults in the log even when there's no hang?
Yes and I can also hang without getting any.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #22 from Mathias Tillman master.homer@gmail.com --- (In reply to Andy Furniss from comment #21)
(In reply to Mathias Tillman from comment #20)
Do you get those GPU faults in the log even when there's no hang?
Yes and I can also hang without getting any.
Actually, I think I've seen hangs without the GPU faults too now that I think about it. Makes you wonder if the GPU faults are related at all, or if this is something else entirely.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #23 from Andy Furniss adf.lists@gmail.com --- (In reply to Mathias Tillman from comment #22)
(In reply to Andy Furniss from comment #21)
(In reply to Mathias Tillman from comment #20)
Do you get those GPU faults in the log even when there's no hang?
Yes and I can also hang without getting any.
Actually, I think I've seen hangs without the GPU faults too now that I think about it. Makes you wonder if the GPU faults are related at all, or if this is something else entirely.
Yea, could be unrelated.
I haven't seen thousands until Today so that aspect could be to do with the patches.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #24 from Andy Furniss adf.lists@gmail.com --- (In reply to Andy Furniss from comment #23)
I haven't seen thousands until Today so that aspect could be to do with the patches.
Ignore that, more grepping of kernel log does show I have got thousands before today.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #25 from Andy Furniss adf.lists@gmail.com --- (In reply to Andy Furniss from comment #24)
(In reply to Andy Furniss from comment #23)
I haven't seen thousands until Today so that aspect could be to do with the patches.
Ignore that, more grepping of kernel log does show I have got thousands before today.
Haven't had time to test thoroughly, but I see the latest updated agd5f fixes has a commit to reduce vm faults and I haven't got thousands since changing to that.
I also see a new R600_DEBUG=check_vm in mesa. I don't know what, if any, extra info is expected from that, but testing valley with it caused it to quit when it hit a fault after a about minute of running -
Detected a VM fault, exiting...
in dmesg -
[ 261.017278] amdgpu 0000:01:00.0: GPU fault detected: 146 0x01e84804 [ 261.017290] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0015703D [ 261.017296] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048004 [ 261.017302] VM fault (0x04, vmid 3) at page 1404989, read from 'TC6' (0x54433600) (72)
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #26 from Michel Dänzer michel@daenzer.net --- (In reply to Andy Furniss from comment #25)
I also see a new R600_DEBUG=check_vm in mesa. I don't know what, if any, extra info is expected from that, but testing valley with it caused it to quit when it hit a fault after a about minute of running -
It should generate a file in ~/ddebug_dumps/ with more information about the VM fault.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #27 from Andy Furniss adf.lists@gmail.com --- Created attachment 118665 --> https://bugs.freedesktop.org/attachment.cgi?id=118665&action=edit valley vm fault dump
Aha, here it is.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #28 from Mathias Tillman master.homer@gmail.com --- (In reply to Andy Furniss from comment #27)
Created attachment 118665 [details] valley vm fault dump
Aha, here it is.
I am getting similar results using check_vm. However, since the hang usually doesn't happen after the first vm fail (it did actually for me a few times, but most of the time it didn't) I modified the mesa code to disable closing the program once a vm fault happened. Unfortunately this doesn't really provide me with any interesting information - they have all complained about the paging fault happening in a VERTEX_BUFFER with an identical, or close to identical address.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #29 from Michel Dänzer michel@daenzer.net --- That is interesting, though; the radeonsi driver seems to think there should be something mapped at the faulting address. This indicates that either the kernel driver fails to handle the mapping properly, or maybe there's a problem with communicating the buffer mapping information from userspace to the kernel driver.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #30 from Mathias Tillman master.homer@gmail.com --- Just updated to latest drm-next-4.4-wip and mesa and reran the valley test. Been running for two hours now without a hang (although there have been vm faults). Will test more over the weekend, but it's looking good so far :)
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #31 from Mathias Tillman master.homer@gmail.com --- Small update: I have been able to run it for a total of 7 hours (2 + 2 + 3) without a hang on the latest drm-next-4.4-wip. Tried the latest drm-fixes-4.3 and it hung after about 30 minutes, so I definitely think there's something in 4.4 that fixes it. I will do a bisect to see if I can figure out what, more exactly, is the fix.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #32 from Andy Furniss adf.lists@gmail.com --- (In reply to Mathias Tillman from comment #31)
Small update: I have been able to run it for a total of 7 hours (2 + 2 + 3) without a hang on the latest drm-next-4.4-wip. Tried the latest drm-fixes-4.3 and it hung after about 30 minutes, so I definitely think there's something in 4.4 that fixes it. I will do a bisect to see if I can figure out what, more exactly, is the fix.
I'll try over the weekend - no locks so far.
Bisecting would be a pain for me as I managed to very long lucky runs previously, so wouldn't be able to easily call good.
I notice that enable_scheduler is now on by default, maybe flipping that would be quicker than bisect. Long ago I did try some older kernel with that enabled and though it didn't fix IIRC it took longer to lock than was normal at that time.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #33 from Mathias Tillman master.homer@gmail.com --- (In reply to Andy Furniss from comment #32)
(In reply to Mathias Tillman from comment #31)
Small update: I have been able to run it for a total of 7 hours (2 + 2 + 3) without a hang on the latest drm-next-4.4-wip. Tried the latest drm-fixes-4.3 and it hung after about 30 minutes, so I definitely think there's something in 4.4 that fixes it. I will do a bisect to see if I can figure out what, more exactly, is the fix.
I'll try over the weekend - no locks so far.
Bisecting would be a pain for me as I managed to very long lucky runs previously, so wouldn't be able to easily call good.
I notice that enable_scheduler is now on by default, maybe flipping that would be quicker than bisect. Long ago I did try some older kernel with that enabled and though it didn't fix IIRC it took longer to lock than was normal at that time.
Just ran it again with the scheduler disabled (from code, not through the module parameter) on 4.4-wip, and sure enough, after about 30 minutes it hung. So it looks like there's something in the scheduler that either makes it not happen as often, or not at all (will need to confirm this).
https://bugs.freedesktop.org/show_bug.cgi?id=91278
Grazvydas Ignotas notasas@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |notasas@gmail.com
--- Comment #34 from Grazvydas Ignotas notasas@gmail.com --- Created attachment 118824 --> https://bugs.freedesktop.org/attachment.cgi?id=118824&action=edit test kernel patch
(In reply to Michel Dänzer from comment #29)
That is interesting, though; the radeonsi driver seems to think there should be something mapped at the faulting address. This indicates that either the kernel driver fails to handle the mapping properly, or maybe there's a problem with communicating the buffer mapping information from userspace to the kernel driver.
Judging by the symptoms it feels like some caching/buffering problem somewhere.
If I understand the code right, most of things are mapped write-combine, which means the CPU is allowed to write data it any order it likes. Looking at amdgpu/radeon code, there is surprising lack of barriers, basically it's just amdgpu_ring_commit()/radeon_ring_commit() and that's it. But mb() doesn't guarantee that the writes will arrive in program order, it just ensures that all the writes are finished after that mb() statement.
So the question is, is it ok for the hardware if in something like amdgpu_ib_schedule() the writes to the ring arrive before the writes to IB? I do admit I don't understand how the hardware works, like what triggers the hardware to start processing the ring contents, perhaps the write to the last word in the ring? If so you clearly need a wmb() before the write which triggers the hardware so that everything is ready before the GPU kicks in.
Attached is a debug kernel patch to test if my guess is correct. It's way overkill and will trash performance, but it should show if this is a problem related to CPU caching/buffering. I don't have the hardware to test this myself.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #35 from Christian König deathsimple@vodafone.de --- The ring and IB are just normal system memory made accessible to the GPU. So it's perfectly fine that it's mapped WC by the CPU.
The processing of the commands written into the ring and IB are triggered by writing the wptr register.
See radeon_ring_set_wptr().
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #36 from Mathias Tillman master.homer@gmail.com --- I have run the valley test the entire day (10 hours or so) without a single VM fault or hang with the GPU scheduler enabled on 4.4.
Re-ran the test with your (Grazvydas) patch and it's showing the same symptoms as before when not using the schedluer - that is, a bunch of VM faults, which lead to a hang after around 30 minutes or so.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #37 from Grazvydas Ignotas notasas@gmail.com --- ok then my guess is wrong.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #38 from Andy Furniss adf.lists@gmail.com --- (In reply to Andy Furniss from comment #32)
I'll try over the weekend - no locks so far.
Still good, I've also tried to hang with Unreal demos and mplayer+uvd, no hangs and no vm faults so far.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #39 from Mathias Tillman master.homer@gmail.com --- (In reply to Andy Furniss from comment #38)
(In reply to Andy Furniss from comment #32)
I'll try over the weekend - no locks so far.
Still good, I've also tried to hang with Unreal demos and mplayer+uvd, no hangs and no vm faults so far.
Assuming you tried it with the scheduler enabled? Just tried it with the scheduler enabled and semaphores disabled on drm-fixes-4.3, and it also seems to work.
Is the scheduler going to be left enabled from now on? If so we could probably say that this issue has been solved (though more testing is probably required). However, something is causing all of those vm faults when the scheduler is disabled, so it might still be worth looking into if it's going to be possible to disable it using a module parameter.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #40 from Andy Furniss adf.lists@gmail.com --- (In reply to Mathias Tillman from comment #39)
(In reply to Andy Furniss from comment #38)
(In reply to Andy Furniss from comment #32)
I'll try over the weekend - no locks so far.
Still good, I've also tried to hang with Unreal demos and mplayer+uvd, no hangs and no vm faults so far.
Assuming you tried it with the scheduler enabled?
Yea, I am running with the default for drm-next-4.4-wip = enabled.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #41 from Mathias Tillman master.homer@gmail.com --- Just tried it again with the latest drm-next-4.4 and the kernel parameters amdgpu.enable_scheduler=0 amdgpu.vm_debug=1 amdgpu.enable_semaphores=1. Can't even make it to the beginning of the test (it hangs on the loading screen, if not before that - I have OpenGL acceleration enabled in kwin) when vm_debug=1 before hanging with a bunch of VM faults. If I set vm_debug to 0 it at least starts properly. Not sure if that was the intended behaviour, but that's what happens in any case.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #42 from Alex Deucher alexdeucher@gmail.com --- (In reply to Mathias Tillman from comment #41)
Just tried it again with the latest drm-next-4.4 and the kernel parameters amdgpu.enable_scheduler=0 amdgpu.vm_debug=1 amdgpu.enable_semaphores=1.
I don't think forcing semaphores will work on tonga/fiji since there were disabled in the code a while ago due to hw bugs.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #43 from Christian König deathsimple@vodafone.de --- (In reply to Alex Deucher from comment #42)
(In reply to Mathias Tillman from comment #41)
Just tried it again with the latest drm-next-4.4 and the kernel parameters amdgpu.enable_scheduler=0 amdgpu.vm_debug=1 amdgpu.enable_semaphores=1.
I don't think forcing semaphores will work on tonga/fiji since there were disabled in the code a while ago due to hw bugs.
Yeah, forcefully enabling semaphores on Tonga will crash rather fast.
The purpose of vm_debug != 0 is to stop after the first VM fault.
So if you got a VM fault from time to time which was just ignored than setting vm_debug will certainly crash the system.
On the other hand you shouldn't get VM faults in the first place.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #44 from Mathias Tillman master.homer@gmail.com --- Just done some tests:
enable_scheduler=0, enable_semaphores=1, vm_fault_stop=0, vm_debug=1 Hang after a few minutes, normal desktop use or running the valley test. Many VM faults in dmesg.
enable_scheduler=0, enable_semaphores=0, vm_fault_stop=1, vm_debug=1 Hang after a few minutes, normal desktop use or running the valley test. Only one VM fault visible in dmesg, that would be due to vm_fault_stop=1 I'm guessing?
enable_scheduler=0, enable_semaphores=0, vm_fault_stop=0, vm_debug=1 Same as above, but with several VM faults in dmesg.
enable_scheduler=1, enable_semaphores=0, vm_fault_stop=0, vm_debug=1 All good.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #45 from Daniele Ruffini faffo665@gmail.com --- Created attachment 119679 --> https://bugs.freedesktop.org/attachment.cgi?id=119679&action=edit journalctl | grep amdgpu
Went on for a while, cut short because the error was just repeating
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #46 from Daniele Ruffini faffo665@gmail.com --- Just forgot to mention that the bug for me happens totally randomly. No Unigine Heave. No intensive graphic usage. It seems to happen when something new is displayed tough but i wasn't able to recreate it metodically.
Amd Radeon HD380 4GB.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #47 from Andy Furniss adf.lists@gmail.com --- Seems like you are using an old kernel - though what I call old may be current for what's released.
The hanging for me stopped with 4.4, but 4.3 is release.
If you are using 4.3 maybe booting with the option -
amdgpu.enable_scheduler=1
will help.
If you are used to compiling your own kernels then using one from -
http://cgit.freedesktop.org/~agd5f/linux/
like drm-next-4.4 should be stable.
Or if you want to test/use the new powerplay try amdgpu-powerplay
https://bugs.freedesktop.org/show_bug.cgi?id=91278
EoD EoD@xmw.de changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |EoD@xmw.de
--- Comment #48 from EoD EoD@xmw.de --- I don't get any lockups with kernel 4.4-rc6 and current mesa git on my R380X.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
Andy Furniss adf.lists@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |FIXED Status|NEW |RESOLVED
--- Comment #49 from Andy Furniss adf.lists@gmail.com --- I can't speak for everyone on the cc with different h/w, but on Tonga valley has been stable for a long time, so closing.
If it's still an issue for anyone you can reopen.
https://bugs.freedesktop.org/show_bug.cgi?id=91278
--- Comment #50 from Michel Dänzer michel@daenzer.net --- (In reply to Andy Furniss from comment #49)
If it's still an issue for anyone you can reopen.
This is your report, so anyone else please file their own instead.
dri-devel@lists.freedesktop.org