https://bugs.freedesktop.org/show_bug.cgi?id=91790
Bug ID: 91790
Summary: TONGA hang in amdgpu_ring_lock
Product: DRI
Version: XOrg git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: master.homer@gmail.com
Created attachment 117962 --> https://bugs.freedesktop.org/attachment.cgi?id=117962&action=edit dmesg of hang
I've been getting random hangs in amdgpu_ring_lock. They cause X to hang, so I can't use the computer at all. I can sometimes switch to a tty, but this doesn't always work either.
I'm running Ubuntu 15.04 with mesa and libdrm from the oibaf ppa, with a self-compiled xf86-video-amdgpu and a self-compiled kernel from agd5f, drm-next-4.3-wip (9066b0c318589f47b754a3def4fe8ec4688dc21a).
I haven't been able to predict when the hang will happen: sometimes I can use the machine for several hours before it hangs, other times it happens just a few minutes after booting.
--- Comment #1 from Andy Furniss adf.lists@gmail.com ---
Created attachment 117963 --> https://bugs.freedesktop.org/attachment.cgi?id=117963&action=edit mplayer X hung task
I got a similar trace yesterday on current agd5f drm-next-4.3 while trying to kill uvd with mplayer by repeatedly starting it.
I am slightly hopeful this is a different issue from the uvd one, as the trace starts with X and I got far more starts than I have recently: 360 before hitting this trace, after a couple of OK runs of 250.
I haven't locked up in normal use, but then my desktop setup is simple: just fluxbox.
--- Comment #2 from Mathias Tillman master.homer@gmail.com ---
Created attachment 117967 --> https://bugs.freedesktop.org/attachment.cgi?id=117967&action=edit dmesg with added debug output
I've done some more testing. It turns out that in certain cases it never reaches amdgpu_ring_unlock_commit, and that's what causes the hang, since the mutex is never unlocked. I added some debug output to the code: gfx/sdma0 is ring->name, 0/9 is ring->idx, and the address is the address of the ring struct. As you can see in the log, it calls amdgpu_ring_lock on ring 9 (name sdma0), and then calls it again on ring 0 (name gfx) without calling amdgpu_ring_unlock_commit in between. I will add some more debug output in the hope of finding out exactly why it's never unlocked, and whether it is fixable.
I should mention that these random lockups do not happen with the proprietary catalyst driver, so it must be something in the amdgpu driver.
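The pairing check described in this comment (every amdgpu_ring_lock should eventually be followed by an amdgpu_ring_unlock_commit on the same ring) could be approximated offline with a small filter over the debug output. This is only a sketch: the line format below (function name, ring name, ring index, address, with no timestamp prefix) is an assumption made for illustration and is not the format of the attached dmesg, and find_unmatched_locks is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only. ASSUMED (hypothetical) input format, one event per line:
#   amdgpu_ring_lock <ring-name> <ring-idx> <ring-address>
#   amdgpu_ring_unlock_commit <ring-name> <ring-idx> <ring-address>
# Prints "<ring-name> <ring-idx> <count>" for each ring that still has
# locks without a matching unlock_commit at the end of the input.
find_unmatched_locks() {
    awk '
        # Count every lock per ring, keyed by "name idx".
        $1 == "amdgpu_ring_lock" { held[$2 " " $3]++ }
        # An unlock_commit releases one outstanding lock on that ring.
        $1 == "amdgpu_ring_unlock_commit" {
            if (held[$2 " " $3] > 0) held[$2 " " $3]--
        }
        END {
            for (ring in held)
                if (held[ring] > 0)
                    print ring, held[ring]
        }
    ' "$@" | sort
}
```

Usage would be something like `find_unmatched_locks debug.log`, or piping preprocessed dmesg lines into it on stdin. A non-empty result only shows that a lock was still outstanding when the log ended; it does not say why the unlock was never reached.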
--- Comment #3 from Christian König deathsimple@vodafone.de ---
That could just be a symptom of a hardware hang which isn't detected for some reason.
Please take a look at amdgpu_fence_info as well to see if there are any outstanding submissions.
--- Comment #4 from Andy Furniss adf.lists@gmail.com ---
(In reply to Christian König from comment #3)
> That could just be a symptom of a hardware hang which isn't detected for some reason.
There's this - drm/amdgpu: disable GPU reset by default
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-4.3&id=a895c...
--- Comment #5 from Mathias Tillman master.homer@gmail.com ---
(In reply to Christian König from comment #3)
> That could just be a symptom of a hardware hang which isn't detected for some reason.
> Please take a look at amdgpu_fence_info as well to see if there are any outstanding submissions.
If it's a hardware hang, wouldn't it also happen when using catalyst? It doesn't happen there, so it should at least be possible to work around (if it is a hardware problem). I will continue investigating why this happens, but it does seem to me like this, #91278, and #91676 all are caused by the same thing, but with different log output depending on if you use drm-next-4.3 or drm-next-4.2.
--- Comment #6 from Christian König deathsimple@vodafone.de ---
No, the currently released catalyst doesn't use anything from the amdgpu module yet.
It's clearly not a hardware problem, but invalid render commands can cause the hardware to lock up.
--- Comment #7 from Mathias Tillman master.homer@gmail.com ---
Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've been running it all day without a single lockup, whereas before it used to lock up several times a day. I just want someone to confirm whether it is in fact working, or if it's just me.
--- Comment #8 from Andy Furniss adf.lists@gmail.com ---
(In reply to Mathias Tillman from comment #7)
> Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've been running it all day without a single lock up, before it used to lock up several times a day. Just wanted someone to confirm if it is in fact working, or if it's just me.
I can imagine that it's far better for desktop locks - I moved onto it when it got updated.
Initially testing with Unigine Valley I thought it was going to be good - I got further than ever before (about 4x through all the scenes having not got through once previously), but it did lock.
--- Comment #9 from Mathias Tillman master.homer@gmail.com ---
(In reply to Andy Furniss from comment #8)
> (In reply to Mathias Tillman from comment #7)
> > Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've been running it all day without a single lock up, before it used to lock up several times a day. Just wanted someone to confirm if it is in fact working, or if it's just me.
> I can imagine that it's far better for desktop locks - I moved onto it when it got updated.
> Initially testing with Unigine Valley I thought it was going to be good - I got further than ever before (about 4x through all the scenes having not got through once previously), but it did lock.
That's a shame. I'll try and see if I can find out what has caused the lockups to stop for me, maybe that could help in finding out what's still causing them for you.
--- Comment #10 from Alex Deucher alexdeucher@gmail.com ---
Created attachment 118056 --> https://bugs.freedesktop.org/attachment.cgi?id=118056&action=edit possible fix
I think this patch should fix it.
--- Comment #11 from Mathias Tillman master.homer@gmail.com ---
(In reply to Alex Deucher from comment #10)
> Created attachment 118056 possible fix
> I think this patch should fix it.
No luck here I'm afraid - I'm having a hard time reproducing it during normal desktop usage (with or without the patch), but it did lock up while running Unigine Valley.
--- Comment #12 from Christian König deathsimple@vodafone.de ---
(In reply to Mathias Tillman from comment #11)
> No luck here I'm afraid - I'm having a hard time reproducing it during normal desktop usage (with or without the patch), but it did lock up while running Unigine Valley.
Assuming you can still access the box over the network after the lockup then please provide the output of the following as root:
cat /sys/kernel/debug/dri/0/amdgpu_fence_info
hexdump -s 0x14fc -n 4 /sys/kernel/debug/dri/0/amdgpu_regs
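Since the box may only be reachable over ssh after a lockup, it can help to have the two commands above wrapped in a helper that saves their output to files. The following is a minimal sketch, not something from the thread: the function name collect_hang_info, the output directory, and the output file names are all hypothetical, and the debugfs path defaults to the one given above (card 0). It has to be run as root for the debugfs files to be readable.

```shell
#!/bin/sh
# Sketch only: save the outputs requested above into a directory.
# collect_hang_info [DRI_DEBUGFS_DIR] [OUTPUT_DIR]
#   DRI_DEBUGFS_DIR defaults to /sys/kernel/debug/dri/0 (path from the comment).
#   OUTPUT_DIR and the file names below are illustrative choices.
collect_hang_info() {
    dri_dir=${1:-/sys/kernel/debug/dri/0}
    out_dir=${2:-./amdgpu-hang-info}

    mkdir -p "$out_dir" || return 1

    # Fence state: shows whether there are outstanding submissions.
    cat "$dri_dir/amdgpu_fence_info" > "$out_dir/amdgpu_fence_info.txt" || return 1

    # The 4 bytes at offset 0x14fc of the register dump, as requested.
    if command -v hexdump >/dev/null 2>&1; then
        hexdump -s 0x14fc -n 4 "$dri_dir/amdgpu_regs" > "$out_dir/amdgpu_regs_0x14fc.txt"
    else
        echo "hexdump not found; register dump skipped" > "$out_dir/amdgpu_regs_0x14fc.txt"
    fi
}
```

For example, `collect_hang_info /sys/kernel/debug/dri/0 /root/hang-1` right after a hang, and again with a different output directory after a clean reboot, would give two sets of files to attach.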
--- Comment #13 from Andy Furniss adf.lists@gmail.com ---
(In reply to Mathias Tillman from comment #11)
> (In reply to Alex Deucher from comment #10)
> > Created attachment 118056 possible fix
> > I think this patch should fix it.
> No luck here I'm afraid - I'm having a hard time reproducing it during normal desktop usage (with or without the patch), but it did lock up while running Unigine Valley.
I see drm-next-4.3 is now ahead again, haven't tested that yet.
With patch + drm-next-4.3-wip, I haven't yet managed to lock valley - but I've only had time to do a couple of runs (45 min then 90 min) from a clean boot. Maybe later when I've been up a while doing other things I'll try harder.
Patch doesn't apply with git apply - did it by hand.
--- Comment #14 from Mathias Tillman master.homer@gmail.com ---
Created attachment 118060 --> https://bugs.freedesktop.org/attachment.cgi?id=118060&action=edit Output of amdgpu_regs and amdgpu_fence_info
I have attached the output of amdgpu_regs and amdgpu_fence_info. "Hang" was captured right after the hang happened; "Normal" was captured right after the reboot that followed the hang (for comparison).
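A side-by-side comparison of a "Hang" capture and a "Normal" capture like the ones described here could be done with a plain unified diff. The sketch below assumes the two captures were saved as same-named files in two directories; that layout and the helper name compare_captures are assumptions for illustration, not how the attachment is organised.

```shell
#!/bin/sh
# Sketch only: diff every file in HANG_DIR against the same-named file
# in NORMAL_DIR. The directory layout is a hypothetical convention.
# compare_captures NORMAL_DIR HANG_DIR
compare_captures() {
    normal_dir=$1
    hang_dir=$2
    for f in "$hang_dir"/*; do
        name=$(basename "$f")
        # Only compare files that exist in both captures.
        if [ -f "$normal_dir/$name" ]; then
            echo "=== $name ==="
            # diff exits non-zero when the files differ; that is expected here.
            diff -u "$normal_dir/$name" "$f" || true
        fi
    done
}
```

Lines prefixed with `-` come from the normal capture and lines prefixed with `+` from the hang capture, so the differences in fence state stand out without reading both dumps in full.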
--- Comment #15 from Andy Furniss adf.lists@gmail.com ---
(In reply to Andy Furniss from comment #13)
> (In reply to Mathias Tillman from comment #11)
> > (In reply to Alex Deucher from comment #10)
> > > Created attachment 118056 possible fix
> > > I think this patch should fix it.
> > No luck here I'm afraid - I'm having a hard time reproducing it during normal desktop usage (with or without the patch), but it did lock up while running Unigine Valley.
> I see drm-next-4.3 is now ahead again, haven't tested that yet.
> With patch + drm-next-4.3-wip, I haven't yet managed to lock valley - but I've only had time to do a couple of runs (45 min then 90 min) from a clean boot. Maybe later when I've been up a while doing other things I'll try harder.
> Patch doesn't apply with git apply - did it by hand.
I managed to lock it. It seems that doing "something" between runs changes things, or that first runs are lucky.
FWIW I tried running the Unreal 4.5 ElementalDemo after my long runs and I got a signal 7.
After I later locked/hung valley, I rebooted and tried elemental again from a clean boot and it ran OK, but after quitting, it now gives signal 7 again if I try to start it.
Martin Peres martin.peres@free.fr changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|NEW                         |RESOLVED
--- Comment #16 from Martin Peres martin.peres@free.fr ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/57.