https://bugs.freedesktop.org/show_bug.cgi?id=91268
Bug ID: 91268
Summary: R6xx freezes with kernel 3.17 and up
Product: DRI
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/Radeon
Assignee: dri-devel@lists.freedesktop.org
Reporter: kap3tan@gmail.com
Something introduced in kernel 3.17 makes my GPU freeze while playing games. When it happens, the screen freezes for a few seconds, goes blank for a few seconds, and then comes back with strange artifacts. The system is basically unresponsive at that point; the only thing you can do is a hard reset.
With kernels below 3.17 this doesn't happen. The Mesa version doesn't matter either: with everything else the same, just booting a different kernel makes the difference. 3.16 is good and 3.17 is bad (as is every kernel after 3.17). With kernel 3.16 I can play games for days or weeks without hitting the bug. With 3.17 it can happen anywhere between 15 minutes and a few hours in.
I did a bisect and it produced this:
git bisect start '--' 'drivers/gpu/drm/radeon'
# good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
# bad: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect bad bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# bad: [03f62abd112d5150b6ce8957fa85d4f6e85e357f] drm/radeon: split PT setup in more functions
git bisect bad 03f62abd112d5150b6ce8957fa85d4f6e85e357f
# bad: [391bfec33cd4e103274f197924d41ef648b849de] drm/radeon: remove visible vram size limit on bo allocation (v4)
git bisect bad 391bfec33cd4e103274f197924d41ef648b849de
# good: [da9976206c15178eeae1b4445c9266125bf35b0a] drm/radeon: enable display scaling on all connectors (v2)
git bisect good da9976206c15178eeae1b4445c9266125bf35b0a
# good: [380670aebfca998bb67b9cf05fc7f28ebeac4b18] drm/radeon: Demote 'BO allocation size too large' message to debug only
git bisect good 380670aebfca998bb67b9cf05fc7f28ebeac4b18
# bad: [02376d8282b88f07d0716da6155094c8760b1a13] drm/radeon: Allow write-combined CPU mappings of BOs in GTT (v2)
git bisect bad 02376d8282b88f07d0716da6155094c8760b1a13
# good: [77497f2735ad6e29c55475e15e9790dbfa2c2ef8] drm/radeon: Pass GART page flags to radeon_gart_set_page() explicitly
git bisect good 77497f2735ad6e29c55475e15e9790dbfa2c2ef8
# first bad commit: [02376d8282b88f07d0716da6155094c8760b1a13] drm/radeon: Allow write-combined CPU mappings of BOs in GTT (v2)
commit 02376d8282b88f07d0716da6155094c8760b1a13
Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Thu Jul 17 19:01:08 2014 +0900

    drm/radeon: Allow write-combined CPU mappings of BOs in GTT (v2)

    v2: fix rebase onto drm-fixes

    Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Currently I'm running a kernel built from the commit just before the first bad one:
$ git reset --hard 77497f2735ad6e29c55475e15e9790dbfa2c2ef8
HEAD is now at 77497f2 drm/radeon: Pass GART page flags to radeon_gart_set_page() explicitly
to test it more thoroughly and see whether the hang occurs.
--- Comment #1 from Kajzer kap3tan@gmail.com ---
Quote from another thread where this bug initially started:
(In reply to Michel Dänzer from comment #273)
> Please run a kernel built from commit 77497f2735ad6e29c55475e15e9790dbfa2c2ef8 (the commit before 02376d8282b88f07d0716da6155094c8760b1a13) for at least a few days to make sure it doesn't happen with that.
After a few days I can safely say that this kernel runs great; I had no hangs.
--- Comment #2 from Kajzer kap3tan@gmail.com ---
I made a patch from the commit using git show and applied it to the last known good kernel, 3.16.7. I guess that's one way to find out whether this commit is the real culprit.
--- Comment #3 from Kajzer kap3tan@gmail.com ---
Trouble is that the kernel won't compile now.
  CC [M]  drivers/gpu/drm/radeon/radeon_object.o
drivers/gpu/drm/radeon/radeon_object.c: In function ‘radeon_ttm_placement_from_domain’:
drivers/gpu/drm/radeon/radeon_object.c:117:20: error: ‘RADEON_GEM_GTT_UC’ undeclared (first use in this function)
   if (rbo->flags & RADEON_GEM_GTT_UC) {
                    ^
drivers/gpu/drm/radeon/radeon_object.c:117:20: note: each undeclared identifier is reported only once for each function it appears in
drivers/gpu/drm/radeon/radeon_object.c:119:28: error: ‘RADEON_GEM_GTT_WC’ undeclared (first use in this function)
   } else if ((rbo->flags & RADEON_GEM_GTT_WC) ||
                            ^
drivers/gpu/drm/radeon/radeon_object.c: In function ‘radeon_bo_create’:
drivers/gpu/drm/radeon/radeon_object.c:198:18: error: ‘RADEON_GEM_GTT_WC’ undeclared (first use in this function)
   bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
                  ^
drivers/gpu/drm/radeon/radeon_object.c:198:38: error: ‘RADEON_GEM_GTT_UC’ undeclared (first use in this function)
   bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
                                      ^
make[5]: *** [drivers/gpu/drm/radeon/radeon_object.o] Error 1
I made the patch with
git show 02376d8282b88f07d0716da6155094c8760b1a13 > badcommit.patch
and it applied fine, with no errors.
I'm out of moves now. Is there any other way to either add this commit to 3.16 or take it out of 3.17?
--- Comment #4 from Alex Deucher alexdeucher@gmail.com ---
Created attachment 117089
  --> https://bugs.freedesktop.org/attachment.cgi?id=117089&action=edit
disable uc/wc
The attached patch will disable uncached mappings.
--- Comment #5 from Kajzer kap3tan@gmail.com ---
(In reply to Alex Deucher from comment #4)
> Created attachment 117089 [details] [review]
> disable uc/wc
>
> The attached patch will disable uncached mappings.
Thanks Alex! I've patched kernel 3.18.8 and I'm running it right now. I'll see what happens; hopefully it won't hang! :)
--- Comment #6 from Michel Dänzer michel@daenzer.net ---
Please attach the output of dmesg, including all the drm/radeon initialization messages.
--- Comment #7 from Kajzer kap3tan@gmail.com ---
Created attachment 117136
  --> https://bugs.freedesktop.org/attachment.cgi?id=117136&action=edit
dmesg output
--- Comment #8 from Kajzer kap3tan@gmail.com ---
(In reply to Michel Dänzer from comment #6)
> Please attach the output of dmesg, including all the drm/radeon initialization messages.
I suspect you need one from when the hang happens. I'm trying really hard to make it hang with the patch from Alex, but it seems the patch did the trick: there are no more hangs. I'll keep trying, just to be sure, although it should have happened by now.
Anyway, if you need dmesg from when the bug happens I'll get that later; for now here's the current one with no hangs: https://bugs.freedesktop.org/attachment.cgi?id=117136
--- Comment #9 from Michel Dänzer michel@daenzer.net ---
(In reply to Kajzer from comment #8)
> I suspect you need one from when the hang happens.
No, as I said, I'm mostly interested in the initialization messages.
> I'm trying really hard to make it hang with the patch from Alex, but it seems the patch did the trick: there are no more hangs.
That's expected. Alex's patch isn't a fix; it's just meant to confirm that the problem is directly related to write-combined CPU mappings.
--- Comment #10 from Kajzer kap3tan@gmail.com ---
(In reply to Michel Dänzer from comment #9)
> That's expected. Alex's patch isn't a fix; it's just meant to confirm that the problem is directly related to write-combined CPU mappings.
Yeah, I know; that's what I was really asking for, a way to disable that commit. I can now confirm that there is indeed a bug in that commit (with R6xx chips): I had no hangs with the mappings disabled. I'm willing to test potential fixes.
Christian König deathsimple@vodafone.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #117089|0                           |1
        is obsolete|                            |
                 CC|                            |deathsimple@vodafone.de

--- Comment #11 from Christian König deathsimple@vodafone.de ---
Created attachment 117172
  --> https://bugs.freedesktop.org/attachment.cgi?id=117172&action=edit
Disable uc/wc on anything older than R7xx
Considering how old the hardware is I suggest that we just disable that feature for anything older than R7XX.
A patch doing exactly this is attached.
--- Comment #12 from Alex Deucher alexdeucher@gmail.com ---
Just to be clear, does this bug only happen when you force dpm on or all the time?
--- Comment #13 from Kajzer kap3tan@gmail.com ---
(In reply to Alex Deucher from comment #12)
> Just to be clear, does this bug only happen when you force dpm on or all the time?
If I don't set performance to high, it hangs all the time (not just in games) and I can provoke it within minutes, regardless of kernel version. This bug (CPU mappings) happens only while playing games and only with kernels above 3.16.
So, will this bug happen if I don't force performance to high? To be honest, I don't know. It's been a while since I ran anything other than high, because the other bug would certainly happen, and the two behave the same when the hang occurs. So I guess it would hang if I didn't force it, unless there were some kind of mappings in the kernel before 3.17 and the two bugs are somehow related. That I don't know.
--- Comment #14 from Alex Deucher alexdeucher@gmail.com ---
(In reply to Kajzer from comment #13)
> If I don't set performance to high, it hangs all the time (not just in games) and I can provoke it within minutes, regardless of kernel version. This bug (CPU mappings) happens only while playing games and only with kernels above 3.16.
> So, will this bug happen if I don't force performance to high? To be honest, I don't know. It's been a while since I ran anything other than high, because the other bug would certainly happen, and the two behave the same when the hang occurs. So I guess it would hang if I didn't force it, unless there were some kind of mappings in the kernel before 3.17 and the two bugs are somehow related. That I don't know.
Do you see this bug if you don't enable dpm at all (which is the default)?
--- Comment #15 from Kajzer kap3tan@gmail.com ---
(In reply to Alex Deucher from comment #14)
> Do you see this bug if you don't enable dpm at all (which is the default)?
Ah, I get you now... I don't know; there's no point in me even being on Linux without dpm, but if you think testing that would help sort things out, then I guess I can try it. I'll let you know.
--- Comment #16 from Alex Deucher alexdeucher@gmail.com ---
(In reply to Kajzer from comment #15)
> (In reply to Alex Deucher from comment #14)
> > Do you see this bug if you don't enable dpm at all (which is the default)?
> Ah, I get you now... I don't know; there's no point in me even being on Linux without dpm, but if you think testing that would help sort things out, then I guess I can try it. I'll let you know.
Yes, please test.
--- Comment #17 from Kajzer kap3tan@gmail.com ---
(In reply to Alex Deucher from comment #16)
> (In reply to Kajzer from comment #15)
> > (In reply to Alex Deucher from comment #14)
> > > Do you see this bug if you don't enable dpm at all (which is the default)?
> > Ah, I get you now... I don't know; there's no point in me even being on Linux without dpm, but if you think testing that would help sort things out, then I guess I can try it. I'll let you know.
> Yes, please test.
I just did, and it happened fast: 20 minutes after the game started. So the answer is yes, I see this bug when dpm is disabled.
--- Comment #18 from Michel Dänzer michel@daenzer.net ---
(In reply to Christian König from comment #11)
> Considering how old the hardware is I suggest that we just disable that feature for anything older than R7XX.
fglrx was already using write-combined CPU mappings with the very first PCIe GPUs (RV3xx), so I don't think it's that simple.
I was hoping that we'd find something to key off a quirk in the dmesg output, but since we can't seem to get that, maybe this is the best we can do for now. :(
--- Comment #19 from Michel Dänzer michel@daenzer.net ---
(In reply to Michel Dänzer from comment #18)
> I was hoping that we'd find something to key off a quirk in the dmesg output, but since we can't seem to get that, maybe this is the best we can do for now. :(
Oops, sorry, I totally missed that the dmesg output is here already. :) Nothing in particular jumps out at me though.
Fedja Beader fedja.beader@t-2.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fedja.beader@t-2.net

--- Comment #20 from Fedja Beader fedja.beader@t-2.net ---
This patch seems (for 1h now) to work on 4.0.8 + Gentoo + grsecurity.
For me, the screen froze with the graphics still visible. The game was still running in the background (I heard sounds and it spewed errors to the console) and I had full ssh access. In another game the screen turned black and white, plus something that looked like missing textures, but I could still interact with it.
Happened on both 3.18.9 + Gentoo + grsecurity and the above-mentioned 4.0.8.
mesa is at 10.3
lspci: VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV620/M82 [Mobility Radeon HD 3450/3470]
[ 3936.443037] radeon 0000:01:00.0: ring 0 stalled for more than 10273msec
[ 3936.443046] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000050ded last fence id 0x0000000000050df3 on ring 0)
[ 3936.450174] radeon 0000:01:00.0: Saved 185 dwords of commands on ring 0.
[ 3936.450191] radeon 0000:01:00.0: GPU softreset: 0x00000008
[ 3936.450197] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030
[ 3936.450202] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000003
[ 3936.450207] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200000C0
[ 3936.450212] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 3936.450216] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[ 3936.450221] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00020186
[ 3936.450226] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80028645
[ 3936.450231] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 3936.501715] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00004001
[ 3936.501773] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[ 3936.503883] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030
[ 3936.503888] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000003
[ 3936.503893] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200080C0
[ 3936.503898] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 3936.503903] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[ 3936.503907] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
[ 3936.503912] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80100000
[ 3936.503917] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 3936.503929] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 3936.523106] [drm] PCIE GART of 512M enabled (table at 0x0000000000254000).
[ 3936.523152] radeon 0000:01:00.0: WB enabled
[ 3936.523160] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000010000c00 and cpu addr 0xffff880074d72c00
[ 3936.524373] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x00000000000521d0 and cpu addr 0xffffc900045921d0
[ 3936.556287] [drm] ring test on 0 succeeded in 0 usecs
[ 3936.732365] [drm] ring test on 5 succeeded in 1 usecs
[ 3936.732375] [drm] UVD initialized successfully.
[ 3946.943038] radeon 0000:01:00.0: ring 0 stalled for more than 10213msec
[ 3946.943047] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000050dee last fence id 0x0000000000050df3 on ring 0)
[ 3946.956388] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35).
[ 3946.956396] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
--- Comment #21 from Fedja Beader fedja.beader@t-2.net ---
(In reply to Kajzer from comment #13)
> If I don't set performance to high then it hangs all the time
It gave me that impression, yes
--- Comment #22 from Michel Dänzer michel@daenzer.net ---
Seeing as both Kajzer and Fedja Beader are using RV6xx GPUs, maybe we could just disable WC for those for now?
--- Comment #23 from Kajzer kap3tan@gmail.com ---
Still working fine with WC disabled, not a single crash since. Also, I wasn't able to notice any difference with WC disabled, in performance or otherwise. Disabling WC on RV6xx is definitely a good thing.
--- Comment #24 from Laurento Frittella laurento.frittella@gmail.com ---
I'm trying the attached patch to disable WC on my r6xx and it seems to help here as well.
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV620/M82 [Mobility Radeon HD 3450/3470]
Linux mybox 4.2.1-custom #3 SMP PREEMPT Mon Oct 26 22:05:24 CET 2015 x86_64 GNU/Linux
Debian stretch/sid
Michel Dänzer michel@daenzer.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #25 from Michel Dänzer michel@daenzer.net ---
Fixed in https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9... , will get backported to stable kernel trees.
Michel Dänzer michel@daenzer.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dabreese00@gmail.com

--- Comment #26 from Michel Dänzer michel@daenzer.net ---
*** Bug 93911 has been marked as a duplicate of this bug. ***
--- Comment #27 from David Breese dabreese00@gmail.com --- *** Bug 93911 has been marked as a duplicate of this bug. ***