On Fri, Dec 7, 2012 at 9:44 PM, Heinz Diehl htd@fritha.org wrote:
On 07.12.2012, Daniel Vetter wrote:
I think I can reliably reproduce the hang on my machine now. I have to play some HD videos on YouTube while writing a big file with dd. The hang usually occurs within 5 minutes at most.
That sounds pretty awesome. Just to check, is this already with rc6 disabled? Also, which gpu chip?
This is with latest 3.7-git and i915.i915_enable_rc6=0. Attached is a logfile/dmesg after booting with debug options on which hopefully shows you the gpu chip.
Sure, always glad to help excellent bug reporters along. My usual kernel bisect howto is: http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-... It seems to serve rather well thus far.
ilk with rc6 disabled, and the two hangs you've attached both die on the MI_FLUSH in between a 3D primitive and a 2D blit, like all the other non-rc6 hangs we've seen thus far that indicate that 3.7 regressed. This A looks _very_ good. I'm adding lists again so that people are updated and can check whether I've analyzed the error_states correctly. For reference I've uploaded your dmesg and error_states at
http://people.freedesktop.org/~danvet/stuff/gpu-hang.tar.bz2
Thanks, I've read it and think that will be pretty easy (technically). Am I right to download Linus' tree first and compile a 3.7.0-rc1, so that if I can reproduce the bug with it, bisecting the offending patch should be a little bit shorter?
Good luck with the exams!
Thanks! :-)
Yeah, something in 3.7 seems to have blown up - we have a few reports all claiming that 3.6 is solid, while 3.7 is not :(
I'll try my very best to detect the offending patch. So stay tuned ;-)
Yeah, this would be very good information to move forward with this bug.
Thanks a lot for your hard work in helping with reproducing this bug.
Yours, Daniel
On Fri, 7 Dec 2012 22:08:13 +0100, Daniel Vetter daniel@ffwll.ch wrote:
ilk with rc6 disabled, and the two hangs you've attached both die on the MI_FLUSH in between a 3D primitive and a 2D blit, like all the other non-rc6 hangs we've seen thus far that indicate that 3.7 regressed. This A looks _very_ good. I'm adding lists again so that people are updated and can check whether I've analyzed the error_states correctly. For reference I've uploaded your dmesg and error_states at
The error states do disappear into a black hole during the execution of a 3DPRIMITIVE. The similarity between the two appears to be that the WM kernels loaded for the 3DPRIMITIVE were both recently bound, and were the last kernels to be bound in the batch. Coincidence? Maybe. The INSTDONE in both cases is again the same highly unusual condition, suggesting that the EU died. However, both error-states also suggest that a fresh surface was uploaded for the same 3DPRIMITIVE, but I'm having to guess since the error-state doesn't include the auxiliary state for me to check.

One thing you can try is SNA, which packs its batches differently with the advantage that more auxiliary state is included in the error-state. It also packs all the kernels into a single buffer which will reduce the frequency at which it is paged out/in. So if you can reproduce with SNA (use Option "AccelMethod" "SNA" in a device section of your xorg.conf snippet) I expect the error-state to be quite different and hopefully shed some more light on the issue.
-Chris
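For reference, the option mentioned above would sit in a Device section roughly like the following. This is a minimal sketch: only the AccelMethod line comes from the message. The Identifier string is an arbitrary placeholder, Driver "intel" assumes the xf86-video-intel driver is the one in use, and where the snippet lives (for example a file under /etc/X11/xorg.conf.d/) depends on the distribution.

```
Section "Device"
    Identifier "Intel Graphics"
    Driver     "intel"
    Option     "AccelMethod" "SNA"
EndSection
```

The X server has to be restarted for the change to take effect, and the X server log is the place to check which acceleration method the driver actually picked.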
On 08.12.2012, Chris Wilson wrote:
One thing you can try is SNA, which packs its batches differently with the advantage that more auxiliary state is included in the error-state. It also packs all the kernels into a single buffer which will reduce the frequency at which it is paged out/in. So if you can reproduce with SNA (use Option "AccelMethod" "SNA" in a device section of your xorg.conf snippet) I expect the error-state to be quite different and hopefully shed some more light on the issue.
I tried this with latest 3.7-rc8 git, but no matter how hard I try, I can't get the gpu to hang (with i915.i915_enable_rc6=0). Will use this as my default kernel the next few days and see if the hang occurs by chance.
Heinz
On Sat, 8 Dec 2012 15:30:53 +0100, Heinz Diehl htd@fritha.org wrote:
On 08.12.2012, Chris Wilson wrote:
One thing you can try is SNA, which packs its batches differently with the advantage that more auxiliary state is included in the error-state. It also packs all the kernels into a single buffer which will reduce the frequency at which it is paged out/in. So if you can reproduce with SNA (use Option "AccelMethod" "SNA" in a device section of your xorg.conf snippet) I expect the error-state to be quite different and hopefully shed some more light on the issue.
I tried this with latest 3.7-rc8 git, but no matter how hard I try, I can't get the gpu to hang (with i915.i915_enable_rc6=0). Will use this as my default kernel the next few days and see if the hang occurs by chance.
Can you confirm one thing: are you able to reproduce the hangs at all on 3.7-rc8, using your original setup? -Chris
On 11.12.2012, Chris Wilson wrote:
Can you confirm one thing: are you able to reproduce the hangs at all on 3.7-rc8, using your original setup?
I can reproduce the hang with both 3.7-rc8 and 3.7 final, including latest Linus-git. All with i915.i915_enable_rc6=0.
Heinz
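All of the test runs in this thread rely on i915.i915_enable_rc6=0 actually being in effect. A small sketch of how the kernel command line could be checked for it is below. The function name and the example strings are made up for illustration, and the parsing is deliberately simple: it ignores quoted values and does not cover options set through modprobe configuration.

```python
from typing import Optional


def module_param(cmdline: str, module: str, param: str) -> Optional[str]:
    """Return the value of module.param on a kernel command line, or None.

    Simplified: splits on whitespace and expects key=value tokens.
    """
    wanted = f"{module}.{param}"
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == wanted:
            return value
    return None


# Made-up command lines, for illustration only.
with_param = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 i915.i915_enable_rc6=0"
without_param = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet"

rc6_set = module_param(with_param, "i915", "i915_enable_rc6")
rc6_missing = module_param(without_param, "i915", "i915_enable_rc6")
```

On a running system the string would come from /proc/cmdline, e.g. `module_param(open("/proc/cmdline").read(), "i915", "i915_enable_rc6")`. A result of None means the exact parameter name was not found, which also catches a misspelled module prefix.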
On 07.12.2012, Daniel Vetter wrote:
[....]
I did a "git bisect" between 3.6 and 3.7-rc8 and ended up with this. Unfortunately, git can't revert this patch on top of master, so I have not been able to test whether a revert will cure the problem.
After reading on the net that Peter (Lekensteyn) had already bisected to the same patch, and that reverting it on top of 3.7-rc4 didn't work for him, I'm somewhat clueless.
What else can I do to help find the cause?
Heinz
[root@wildsau linux-git]# git bisect good
6c085a728cf000ac1865d66f8c9b52935558b328 is the first bad commit
commit 6c085a728cf000ac1865d66f8c9b52935558b328
Author: Chris Wilson chris@chris-wilson.co.uk
Date: Mon Aug 20 11:40:46 2012 +0200
drm/i915: Track unbound pages
When dealing with a working set larger than the GATT, or even the mappable aperture when touching through the GTT, we end up with evicting objects only to rebind them at a new offset again later. Moving an object into and out of the GTT requires clflushing the pages, thus causing a double-clflush penalty for rebinding.
To avoid having to clflush on rebinding, we can track the pages as they are evicted from the GTT and only relinquish those pages on memory pressure.
As usual, if it were not for the handling of out-of-memory condition and having to manually shrink our own bo caches, it would be a net reduction of code. Alas.
Note: The patch also contains a few changes to the last-hope evict_everything logic in i916_gem_execbuffer.c - we no longer try to only evict the purgeable stuff in a first try (since that's superflous and only helps in OOM corner-cases, not fragmented-gtt trashing situations).
Also, the extraction of the get_pages retry loop from bind_to_gtt (and other callsites) to get_pages should imo have been a separate patch.
v2: Ditch the newly added put_pages (for unbound objects only) in i915_gem_reset. A quick irc discussion hasn't revealed any important reason for this, so if we need this, I'd like to have a git blame'able explanation for it.
v3: Undo the s/drm_malloc_ab/kmalloc/ in get_pages that Chris noticed.
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
[danvet: Split out code movements and rant a bit in the commit message with a few Notes. Done v2]
Signed-off-by: Daniel Vetter daniel.vetter@ffwll.ch
:040000 040000 c4f02e0d05a570d0baf9d2f19a6c276c06a55142 df93a56308637e3840353c3c9425ec96c3422dcc M drivers
[root@wildsau linux-git]#
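What "git bisect" does to arrive at a first bad commit is essentially a binary search over the history between a known-good and a known-bad point. The sketch below illustrates the idea on a made-up linear history. The function name, the commit labels and the position of the culprit are all hypothetical; real git history is a graph rather than a list, and git picks its test points accordingly.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of commits tested).

    Assumes commits are ordered oldest to newest, that the known-good
    starting point precedes commits[0], and that commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[hi] is always known bad
    tested = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tested += 1
        if is_bad(commits[mid]):
            hi = mid       # first bad commit is mid or earlier
        else:
            lo = mid + 1   # first bad commit is after mid
    return commits[hi], tested


# Hypothetical history of 1000 commits where commit 137 introduces the hang.
history = [f"c{i}" for i in range(1000)]
culprit, builds = first_bad(history, lambda c: int(c[1:]) >= 137)
```

Each test halves the remaining range, so the number of kernels to build grows only with the logarithm of the range. For the same reason, starting from a range half the size saves roughly one build. With a hang that only shows up after a few minutes of load, the risky part is marking a commit good too early, since a single wrong verdict sends the search into the wrong half.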
On Sat, Dec 08, 2012 at 02:06:48PM +0100, Heinz Diehl wrote:
On 07.12.2012, Daniel Vetter wrote:
[....]
I did a "git bisect" between 3.6 and 3.7-rc8 and ended up with this. Unfortunately, git can't revert this patch on top of master, so I have not been able to test whether a revert will cure the problem.
After reading on the net that Peter (Lekensteyn) had already bisected to the same patch, and that reverting it on top of 3.7-rc4 didn't work for him, I'm somewhat clueless.
What else can I do to help find the cause?
Can you please test the patch at
https://bugs.freedesktop.org/attachment.cgi?id=70111
That one should disable all effects of the unbound tracking, since a revert of the below commit conflicts.
Thanks, Daniel
Heinz
[root@wildsau linux-git]# git bisect good
6c085a728cf000ac1865d66f8c9b52935558b328 is the first bad commit
commit 6c085a728cf000ac1865d66f8c9b52935558b328
Author: Chris Wilson chris@chris-wilson.co.uk
Date: Mon Aug 20 11:40:46 2012 +0200
drm/i915: Track unbound pages
When dealing with a working set larger than the GATT, or even the mappable aperture when touching through the GTT, we end up with evicting objects only to rebind them at a new offset again later. Moving an object into and out of the GTT requires clflushing the pages, thus causing a double-clflush penalty for rebinding.
To avoid having to clflush on rebinding, we can track the pages as they are evicted from the GTT and only relinquish those pages on memory pressure.
As usual, if it were not for the handling of out-of-memory condition and having to manually shrink our own bo caches, it would be a net reduction of code. Alas.
Note: The patch also contains a few changes to the last-hope evict_everything logic in i916_gem_execbuffer.c - we no longer try to only evict the purgeable stuff in a first try (since that's superflous and only helps in OOM corner-cases, not fragmented-gtt trashing situations).
Also, the extraction of the get_pages retry loop from bind_to_gtt (and other callsites) to get_pages should imo have been a separate patch.
v2: Ditch the newly added put_pages (for unbound objects only) in i915_gem_reset. A quick irc discussion hasn't revealed any important reason for this, so if we need this, I'd like to have a git blame'able explanation for it.
v3: Undo the s/drm_malloc_ab/kmalloc/ in get_pages that Chris noticed.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
[danvet: Split out code movements and rant a bit in the commit message with a few Notes. Done v2]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
:040000 040000 c4f02e0d05a570d0baf9d2f19a6c276c06a55142 df93a56308637e3840353c3c9425ec96c3422dcc M drivers
[root@wildsau linux-git]#
On 11.12.2012, Daniel Vetter wrote:
Can you please test the patch at
https://bugs.freedesktop.org/attachment.cgi?id=70111
That one should disable all effects of the unbound tracking, since a revert of the below commit conflicts.
I applied this patch to Linus' git from today. "Boom" after about 1 min.
The errorstate file is here:
http://www.fritha.org/i915/errorstate3.tar.bz2
Heinz
dri-devel@lists.freedesktop.org