https://bugs.freedesktop.org/show_bug.cgi?id=58667
Priority: medium
Bug ID: 58667
Assignee: dri-devel@lists.freedesktop.org
Summary: Random crashes on CAYMAN
Severity: normal
Classification: Unclassified
OS: All
Reporter: v10lator@myway.de
Hardware: Other
Status: NEW
Version: DRI CVS
Component: DRM/Radeon
Product: DRI
This is with newest mesa from git with kernel 3.8-rc1 (+ this patch: http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.8&id=668b... )
The screen first freezes (mouse still movable, keyboard not responding, not even to Magic SysRq), then the monitor goes off (standby) and comes back on with only garbage on the screen.
Not sure if this has anything to do with it (but it should get fixed anyway) but dmesg gets spammed with this:
[ 533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514
[ 533.928477] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 533.928483] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
where the address isn't always the same, example:
[ 533.928374] radeon 0000:03:00.0: GPU fault detected: 146 0x0033ed14
[ 533.928379] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ 533.928385] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
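For anyone triaging a similar flood, here is a minimal Python sketch that counts the "GPU fault detected" lines by device and value. It assumes only the line layout visible in the dmesg excerpts of this report; the function name `summarize_faults` and the field names `code` and `value` are placeholders chosen for illustration, not the kernel's own terminology.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514
# The leading timestamp is optional so that syslog-style lines also match.
FAULT_RE = re.compile(
    r"radeon\s+(?P<pci>[0-9a-fA-F:.]+):\s+"
    r"GPU fault detected:\s+(?P<code>\d+)\s+(?P<value>0x[0-9a-fA-F]+)"
)


def summarize_faults(lines):
    """Count fault lines per (PCI address, code, value) triple."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            key = (match.group("pci"), match.group("code"), match.group("value"))
            counts[key] += 1
    return counts


# Sample lines copied from the report; the second one is not a fault header
# and is therefore ignored by the pattern.
sample = [
    "[ 533.928472] radeon 0000:03:00.0: GPU fault detected: 146 0x00335514",
    "[ 533.928477] radeon 0000:03:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000",
    "[ 533.928374] radeon 0000:03:00.0: GPU fault detected: 146 0x0033ed14",
]

for (pci, code, value), n in summarize_faults(sample).items():
    print(pci, code, value, n)
```

Feeding it the output of dmesg (one line per list element) gives a quick view of whether the flood is one repeated value or many different ones.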
--- Comment #1 from Thomas Rohloff v10lator@myway.de --- Created attachment 72006 --> https://bugs.freedesktop.org/attachment.cgi?id=72006&action=edit Full dmesg output
--- Comment #2 from Alex Deucher agd5f@yahoo.com --- Is this a regression? Does it happen with older versions of mesa or kernel? If it's a regression can you identify which component (mesa or kernel) and bisect?
--- Comment #3 from Alex Deucher agd5f@yahoo.com --- May also be related to bug 58354.
--- Comment #4 from Thomas Rohloff v10lator@myway.de --- !Is this a regression? Does it happen with older versions of mesa or kernel?! Not that I know about. "May also be related to bug 58354." Do you have the path noted there ("drm/radeon: use DMA engine for VM page table updates on cayman/TN") ? I would loce to try to revert this patch and test it, but I'm unable to google it.
--- Comment #5 from Thomas Rohloff v10lator@myway.de --- I should really read before I click save, sorry. Here again:
"Is this a regression? Does it happen with older versions of mesa or kernel?" Not that I know about.
"May also be related to bug 58354." Do you have a link to the patch noted there ("drm/radeon: use DMA engine for VM page table updates on cayman/TN") ? I would love to try to revert this patch and test it, but I'm unable to google it.
--- Comment #6 from Thomas Rohloff v10lator@myway.de --- Created attachment 72041 --> https://bugs.freedesktop.org/attachment.cgi?id=72041&action=edit New dmesg output
Never mind, I found the patch here: http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-next&id=33e5467...
I reverted it and no crash so far (but as they are random they might still occur). On the other hand, the dmesg messages are still there. Uploading the new output just in case it is needed.
While writing this, Minecraft (which I used to trigger the crashes) crashed (right before and shortly after the crash the mouse was in slow motion and I thought the system would crash right away).
--- Comment #7 from Thomas Rohloff v10lator@myway.de --- Crashes are still there after reverting "drm/radeon: use DMA engine for VM page table updates on cayman/TN"
--- Comment #8 from Thomas Rohloff v10lator@myway.de --- But this crash was different: The image froze but the monitor neither went into standby nor came back with garbage.
--- Comment #9 from Thomas Rohloff v10lator@myway.de --- Bisected mesa.
This is a mesa bug caused by http://cgit.freedesktop.org/mesa/mesa/commit/?id=6532eb17baff6e61b427f29e076...
Can anybody move this to the right place or do I have to re-create the report (and if so: Where) ?
--- Comment #10 from Thomas Rohloff v10lator@myway.de --- I was too fast with this. While the error messages in dmesg are gone it still randomly crashes, but this time the computer just froze completely. I think this bug report is in fact at least two bugs.
--- Comment #11 from Thomas Rohloff v10lator@myway.de --- Also the error messages aren't completely gone. I went back to mesa commit cf5632094ba0c19d570ea47025cf6da75ef8457a (mesa: Allow glReadBuffer(GL_NONE) for winsys framebuffers.) and played Minecraft a bit. Suddenly everything slowed down and the screen started to corrupt. I looked into dmesg and the messages were back. I made a video from after I killed Minecraft (when the corruption slowly disappeared) and after all corruption was gone the message spam stopped again: https://www.dropbox.com/s/su1b6oaeiz028y2/out-86.ogv
I will do more bisecting but as this is really random it may take a long time. Also I hope my hardware hasn't been damaged by 6532eb17baff6e61b427f29e076883f8941ae664 (is this possible and if so: Is there any way to get my money back?)
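Because the crashes are random, a single clean run does not prove that a commit is good, which is what makes this kind of bisect slow. The Python sketch below simulates that situation. It is an illustration only, not how git bisect is implemented; `is_bad`, `bisect_flaky`, the commit count, and the crash probability are all made-up names and numbers.

```python
import random


def is_bad(commit_index, first_bad, crash_probability, rng):
    """Simulated test run: a bad commit crashes only some of the time."""
    return commit_index >= first_bad and rng.random() < crash_probability


def bisect_flaky(n_commits, run_test, trials):
    """Binary search for the first bad commit, testing each candidate `trials` times.

    Assumes commits are in linear order and the last one is bad. A commit is
    treated as good only if none of its trials crashes, so too few trials can
    mislabel a bad commit as good and push the answer later than the truth.
    """
    lo, hi = 0, n_commits - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if any(run_test(mid) for _ in range(trials)):
            hi = mid        # a crash was seen: first bad commit is mid or earlier
        else:
            lo = mid + 1    # no crash seen: assume mid is good
    return lo


rng = random.Random(0)

# Deterministic bug (always crashes): one trial per commit is enough.
print(bisect_flaky(100, lambda i: is_bad(i, 37, 1.0, rng), trials=1))

# Random bug (crashes 30% of the time): compare few trials with many.
print(bisect_flaky(100, lambda i: is_bad(i, 37, 0.3, rng), trials=1))
print(bisect_flaky(100, lambda i: is_bad(i, 37, 0.3, rng), trials=20))
```

The point of the sketch is the trade-off: more trials per commit make a wrong "good" verdict less likely, but each bisect step takes proportionally longer.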
--- Comment #12 from Thomas Rohloff v10lator@myway.de --- I did go back till http://cgit.freedesktop.org/mesa/mesa/commit/?id=6c99f2101fbd3edb7d5899c44ca... and the bug is still there (not the crashes directly, at least I couldn't trigger them, but the error messages) so either this bug is really old or it's not a mesa bug (or, but that would be really bad: It damaged the hardware).
--- Comment #13 from Thomas Rohloff v10lator@myway.de --- After going back to kernel 3.6 (3.7 not tested) I'm unable to re-trigger this bug even after doing more actions that triggered it than in every test before. So I'm pretty sure this is a kernel bug!
Is anybody able to help me bisect the kernel? I don't even know which tree (drm-next?)
--- Comment #14 from Dmitry Cherkassov dcherkassov@gmail.com --- drm-next should be fine.
--- Comment #15 from Thomas Rohloff v10lator@myway.de --- Thanks for the fast reply. Just to be sure before I spend a few hours cloning for no reason:
git clone git://people.freedesktop.org/~airlied/linux
git checkout drm-next
should be a good start, right?
--- Comment #16 from Dmitry Cherkassov dcherkassov@gmail.com --- git checkout drm-next-3.8 i guess
--- Comment #17 from Dmitry Cherkassov dcherkassov@gmail.com --- or better drm-fixes-3.8 just to be sure. (it has a few relevant commits on top)
--- Comment #18 from Thomas Rohloff v10lator@myway.de --- There is no drm-next-3.8 nor drm-fixes-3.8 at ~airlied/linux, see: http://cgit.freedesktop.org/~airlied/linux/refs/ - That's why I asked what exactly to clone as this step will take hours.
--- Comment #19 from Thomas Rohloff v10lator@myway.de --- This was a long night but I finally got it: Bad commit: http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54f...
--- Comment #20 from Thomas Rohloff v10lator@myway.de --- I'm going crazy. I just removed the bad patch from 3.8-rc1 and updated mesa to newest git version (therefore I had to stay at a9048aa6e6abcbeb498ef286630be30729aebaf3 because of a patch missing in the bisected tree) and the bug is back again.
I don't know how to find the root of it and I have a headache because of it. There seems to be something really wrong with memory management but it's way over my head.
--- Comment #21 from Thomas Rohloff v10lator@myway.de --- Here's my final summary:
If http://cgit.freedesktop.org/mesa/mesa/commit/?id=6532eb17baff6e61b427f29e076... and http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb are both missing, the bug doesn't trigger. If the first one is there, the bug is triggered extremely often, spamming dmesg. If only the second is there, the bug triggers randomly (good way to trigger: Lots of exploding TNT in Minecraft. Just build TNT pillars and ignite them till you are at bedrock).
My last hope is that some genius hacker who knows the driver has an "ah, I see the problem" moment. :(
Alexandre Demers alexandre.f.demers@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |58354
--- Comment #22 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #5)
> I should really read before I click save, sorry. Here again:
> "Is this a regression? Does it happen with older versions of mesa or kernel?" Not that I know about.
> "May also be related to bug 58354." Do you have a link to the patch noted there ("drm/radeon: use DMA engine for VM page table updates on cayman/TN") ? I would love to try to revert this patch and test it, but I'm unable to google it.
"Is this a regression? Does it happen with older versions of mesa or kernel?" Yes. Previous kernel 3.7 doesn't show this problem.
--- Comment #23 from Alex Deucher agd5f@yahoo.com --- (In reply to comment #22)
> "Is this a regression? Does it happen with older versions of mesa or kernel?" Yes. Previous kernel 3.7 doesn't show this problem.
Can you bisect? Is it the same commit Thomas landed on or another one?
--- Comment #24 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #23)
> (In reply to comment #22)
> > "Is this a regression? Does it happen with older versions of mesa or kernel?" Yes. Previous kernel 3.7 doesn't show this problem.
> Can you bisect? Is it the same commit Thomas landed on or another one?
Pretty sure it is the same problem. With kernel 3.8.0-rcx, just launching Gnome Shell starts flooding my logs with:
radeon 0000:0X:00.0: GPU fault detected: 146 0x00xxxxxx
radeon 0000:0X:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
radeon 0000:0X:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing. Having a crash here and there from time to time may be coming from something different, but the incessant flood is a big one. In a single session, I end up with kernel.log and everything.log being over 52GB each. I'm also sure this message has to be triggered if something wrong is going on. I'll let you know when I'm done bisecting to figure out what is triggering this flood.
--- Comment #25 from Thomas Rohloff v10lator@myway.de --- (In reply to comment #24)
> I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.
Maybe you should just compile http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54f... (bad) and http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=9d89d... (good). It would be faster than bisecting and if you get another result than me you can still do a full bisect afterwards.
--- Comment #26 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #25)
> (In reply to comment #24)
> > I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.
> Maybe you should just compile http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=9d89d78e3a20980205966fba6345645547e59ceb (good). It would be faster than bisecting and if you get another result than me you can still do a full bisect afterwards.
That's what I'll do, it makes sense.
--- Comment #27 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #26)
> (In reply to comment #25)
> > (In reply to comment #24)
> > > I'll bisect between 3.7 and 3.8-rc1 and see if I end up at the same thing.
> > Maybe you should just compile http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=dd54fee7d440c4a9756cce2c24a50c15e4c17ccb (bad) and http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-3.8&id=9d89d78e3a20980205966fba6345645547e59ceb (good). It would be faster than bisecting and if you get another result than me you can still do a full bisect afterwards.
> That's what I'll do, it makes sense.
It seems both are bad: crashed on logon with 9d89d and both flooded my logs.
--- Comment #28 from Alexandre Demers alexandre.f.demers@gmail.com --- The flood is caused by:
Commit: 4ac0533abaec2b83a7f2c675010eedd55664bc26
Author: Jerome Glisse jglisse@redhat.com 2012-12-13 12:08:11
Committer: Alex Deucher alexander.deucher@amd.com 2012-12-14 10:45:24
Parent: 9af20792124850369e764965690b99b20623dfc4 (drm/radeon: fix fence locking in the pageflip callback)
Branch: remotes/origin/master
Follows: v3.7-rc7
Precedes: v3.8-rc1

drm/radeon: fix htile buffer size computation for command stream checker

Fix the size computation of the htile buffer.

Signed-off-by: Jerome Glisse jglisse@redhat.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
However, I think this is not related to the lockups/crashes. So, the bug's description points actually to two different bugs: the flood and the crashes. Should I open a different bug for the flood of GPU fault detected?
Alexandre Demers alexandre.f.demers@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugs.freedesktop.org/show_bug.cgi?id=59089
--- Comment #29 from Alexandre Demers alexandre.f.demers@gmail.com --- I just created a new bug (bug 59089) for the GPU fault flood, which is not directly linked to the crashes, since the flood happens without them.
--- Comment #30 from Alex Deucher agd5f@yahoo.com --- Should be fixed with this mesa commit: http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c...
--- Comment #31 from Thomas Rohloff v10lator@myway.de --- (In reply to comment #30)
> Should be fixed with this mesa commit: http://cgit.freedesktop.org/mesa/mesa/commit/?id=4332f6fc185f968e7563e748b8c949021937c935
Sadly it isn't.
--- Comment #32 from Alexandre Demers alexandre.f.demers@gmail.com --- You're using a Cayman card, but which model exactly?
--- Comment #33 from Alex Deucher agd5f@yahoo.com --- Does a 3.8 kernel work OK if you revert mesa back to cf5632094ba0c19d570ea47025cf6da75ef8457a?
I think r600g: rework flusing and synchronization pattern v7 http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656a... may be problematic on cayman.
--- Comment #34 from Thomas Rohloff v10lator@myway.de --- (In reply to comment #33)
> Does a 3.8 kernel work OK if you revert mesa back to cf5632094ba0c19d570ea47025cf6da75ef8457a?
(In reply to comment #12)
> I did go back till http://cgit.freedesktop.org/mesa/mesa/commit/?id=6c99f2101fbd3edb7d5899c44ca9d984a3c0f8b6 and the bug is still there
> I think r600g: rework flusing and synchronization pattern v7 http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656aebc8bc20cd27a may be problematic on cayman.
I'm currently updating my kernel to 3.8-rc3, then I'll test newest mesa and cf5632094ba0c19d570ea47025cf6da75ef8457a again.
--- Comment #35 from Thomas Rohloff v10lator@myway.de --- Still there with 3.8-rc3 + mesa cf5632094ba0c19d570ea47025cf6da75ef8457a
--- Comment #36 from Jerome Glisse glisse@freedesktop.org --- Did you test with mesa reverted to before the following commit: http://cgit.freedesktop.org/mesa/mesa/commit/?id=24b1206ab2dcd506aaac3ef656a...
--- Comment #37 from Jerome Glisse glisse@freedesktop.org --- This patch might help:
http://people.freedesktop.org/~glisse/0001-drm-radeon-exclude-system-placeme...
--- Comment #38 from Thomas Rohloff v10lator@myway.de --- (In reply to comment #37)
> This patch might help:
I applied it to a 3.8-rc3 kernel and while I haven't seen the message spam so far, the GPU crashes extremely often (so often that this might be why I'm unable to see the spam). Either the image freezes or the monitor goes into standby. In both cases the keyboard doesn't react anymore (not even to Magic SysRq).
--- Comment #39 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #38)
> (In reply to comment #37)
> > This patch might help:
> I applied it to a 3.8-rc3 kernel and while I haven't seen the message spam so far, the GPU crashes extremely often (so often that this might be why I'm unable to see the spam). Either the image freezes or the monitor goes into standby. In both cases the keyboard doesn't react anymore (not even to Magic SysRq).
Does it do the same thing without the patch? I applied it yesterday and I haven't seen any difference.
--- Comment #40 from Thomas Rohloff v10lator@myway.de --- (In reply to comment #39)
> Does it do the same thing without the patch?
It has random crashes without, too, yes. But way less frequent. In fact I had to revert that patch to be able to use my desktop for more than 5 minutes again.
--- Comment #41 from Thomas Rohloff v10lator@myway.de --- I got a crash with a BUG message. I'm sorry for the bad image quality but I had no better camera available (that's why I made that many images)
http://img571.imageshack.us/img571/5517/dsc02036ws.jpg
http://img254.imageshack.us/img254/6779/dsc02037i.jpg
http://img834.imageshack.us/img834/4889/dsc02038cz.jpg
http://img835.imageshack.us/img835/5993/dsc02039s.jpg
http://img338.imageshack.us/img338/1946/dsc02040b.jpg
http://img5.imageshack.us/img5/5683/dsc02041hc.jpg
http://img69.imageshack.us/img69/8716/dsc02042vj.jpg
http://img594.imageshack.us/img594/8600/dsc02043cnt.jpg
I also have a video, so if the images aren't enough just ask.
--- Comment #42 from Thomas Rohloff v10lator@myway.de --- I updated my kernel to 3.8-rc5 and mesa to http://cgit.freedesktop.org/mesa/mesa/commit/?id=952e6e9f3b0eb179f67345f00e5... (can't go higher because of https://bugs.freedesktop.org/show_bug.cgi?id=60038 ) + disabled huge pages in the kernel and now things are different. First off, the message spam seems to be gone completely, and second, the GPU doesn't crash anymore. At one time the image froze but switching to console and back solved this.
I'll look if it continues like that and later on re-enable huge pages to see what happens then.
--- Comment #43 from Thomas Rohloff v10lator@myway.de --- And again I was too fast with this. I started another game and the dmesg spam was there again.
--- Comment #44 from Thomas Rohloff v10lator@myway.de --- And it crashed again, too. :(
--- Comment #45 from Marek Olšák maraeo@gmail.com --- Is this still an issue with the latest kernel and Mesa?
Marek Olšák maraeo@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Random crashes on CAYMAN    |VM-related crashes on CAYMAN
--- Comment #46 from Marek Olšák maraeo@gmail.com --- Also, does setting this environment variable help?
R600_DEBUG=nohyperz
--- Comment #47 from udo udovdh@xs4all.nl --- Over here, with 3.12.6 and these
$ cat /etc/environment
LIBGL_DRIVERS_PATH=/opt/xorg/lib/dri/
RADEON_VA=0
R600_DEBUG=nodma
all appears stable.
(git llvm, libclc, mesa, etc)
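To try the variables from comments #46 and #47 on a single program rather than system-wide through /etc/environment, something like the following Python sketch could be used. The variable names and values are the ones quoted in this thread; the helper name `env_with_overrides` is made up for illustration, and the child process here is just the Python interpreter echoing the variable back, standing in for whatever OpenGL application you actually want to test.

```python
import os
import subprocess
import sys


def env_with_overrides(overrides, base_env=None):
    """Return a copy of the environment with the given variables set.

    The base environment (os.environ by default) is not modified, so the
    overrides only affect processes started with the returned mapping.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(overrides)
    return env


# Stand-in child process: prints the value it received for R600_DEBUG.
child = [sys.executable, "-c", "import os; print(os.environ['R600_DEBUG'])"]

# Comment #46 suggestion.
result = subprocess.run(
    child,
    env=env_with_overrides({"R600_DEBUG": "nohyperz"}),
    capture_output=True,
    text=True,
)
print(result.stdout.strip())

# Comment #47 settings (only the two radeon-related variables).
result = subprocess.run(
    child,
    env=env_with_overrides({"RADEON_VA": "0", "R600_DEBUG": "nodma"}),
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

Setting the variables per process makes it easier to compare runs with and without a given flag without logging out or rebooting.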
Thomas Rohloff v10lator@myway.de changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |WORKSFORME
--- Comment #48 from Thomas Rohloff v10lator@myway.de --- (In reply to comment #45)
> Is this still an issue with the latest kernel and Mesa?
Sorry for the delay. It seems to be fixed.