Add more CC's
On Mon, Dec 3, 2012 at 12:39 PM, Heinz Diehl htd@fancy-poultry.org wrote:
Hi,
with latest linus-3.7 git from today, after some time, my machine gets more and more unresponsive, the fan speed increases, and this is what I see in the logs:
Dec 3 18:08:10 wildsau kernel: [35092.535757] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Dec 3 18:08:10 wildsau kernel: [35092.535768] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Dec 3 18:08:12 wildsau kernel: [35094.050918] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Dec 3 18:08:12 wildsau kernel: [35094.051081] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Dec 3 18:08:12 wildsau kernel: [35094.051086] [drm:i915_reset] *ERROR* Failed to reset chip.
I have never seen that before, up to 3.6.7.
Don't know what information would be important for you, but will provide anything you'll ask me to.
Thanks, Heinz.
On Tue, Dec 4, 2012 at 6:36 PM, Heinz Diehl htd@fancy-poultry.org wrote:
On 03.12.2012, devendra.aaru wrote:
Add more CC's
Thanks!
This is a real showstopper for me, it occurs in every session now. Booting with "i915.i915_enable_rc6=0" doesn't help
https://bugs.freedesktop.org/show_bug.cgi?id=55984
intel guys are as lost as anyone.
Dave.
On Tue, Dec 4, 2012 at 10:27 AM, Dave Airlie airlied@gmail.com wrote:
On Tue, Dec 4, 2012 at 6:36 PM, Heinz Diehl htd@fancy-poultry.org wrote:
On 03.12.2012, devendra.aaru wrote:
Add more CC's
Thanks!
This is a real showstopper for me, it occurs in every session now. Booting with "i915.i915_enable_rc6=0" doesn't help
https://bugs.freedesktop.org/show_bug.cgi?id=55984
intel guys are as lost as anyone.
Yeah, if anyone can somewhat reliably reproduce this (you need to disable rc6 on ilk to not hit another issue which seems much easier to hit) and bisect it, this would be _very_ much appreciated - we've pretty much tested all possible "disable stuff" and "revert random patch" we could think of, and we can't reproduce these hangs no matter how hard we bang our heads against this. Atm we're trying to come up with ways to dump more debug information, but with no clue whatsoever what's going on that's slow-going. -Daniel
On 04.12.2012, Daniel Vetter wrote:
Yeah, if anyone can somewhat reliably reproduce this
Ok, I see. So the beginning would be to reliably reproduce the hang. I have encountered it in every possible situation, both when watching videos on Youtube and right after booting the machine and doing absolutely nothing.
I'll try around a little bit and see if I can find something that triggers this hang.
Btw: which kernel is known to be the "last good one"?
(you need to disable rc6 on ilk to not hit another issue which seems much easier to hit)
Ilk? If this stands for "Ironlake": I'm on Sandybridge.
and bisect it, this would be _very_ much appreciated - we've pretty much tested all possible "disable stuff" and "revert random patch" we could think of, and we can't reproduce these hangs no matter how hard we bang our heads against this.
Bisecting will be a pain without being able to reproduce the hang reliably.
Atm we're trying to come up with ways to dump more debug information, but with no clue whatsoever what's going on that's slow-going.
Is there anything at the moment I can do to help you to get a grip on this problem? My machine is a Core i5-420M laptop with 4GB RAM (Asus U45-JC).
Heinz
On Tue, Dec 04, 2012 at 01:35:22PM +0100, Heinz Diehl wrote:
On 04.12.2012, Daniel Vetter wrote:
Yeah, if anyone can somewhat reliably reproduce this
Ok, I see. So the beginning would be to reliably reproduce the hang. I have encountered it in every possible situation, both when watching videos on Youtube and right after booting the machine and doing absolutely nothing.
I'll try around a little bit and see if I can find something that triggers this hang.
Btw: which kernel is known to be the "last good one"?
If it's the ilk one we only know that 3.6.x series seems to be solid, and something in 3.7-rc (probably before -rc1) broke stuff. So not too useful.
(you need to disable rc6 on ilk to not hit another issue which seems much easier to hit)
Ilk? If this stands for "Ironlake": I'm on Sandybridge.
Hm, then it could very well be something different, so I think we need to track this one as a separate bug. Can you please file a new one on bugs.freedesktop.org against DRI -> DRM (Intel) and attach dmesg when booting with drm.debug=0xe (just so we know what's in your box) plus the i915_error_state from debugfs once the gpu is hung (if you can get at that file, reboot kills it).
Thanks, Daniel
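A minimal shell sketch of the data collection Daniel asks for above, for anyone following along. It assumes debugfs is mounted at /sys/kernel/debug (the kernel message only says /debug/dri/0/i915_error_state), and collect_i915_debug is a hypothetical helper name, not an existing tool. The dmesg output is most useful when the machine was booted with drm.debug=0xe.

```shell
# Sketch: gather the two artifacts requested for the bug report.
# ASSUMPTION: debugfs is mounted at /sys/kernel/debug; pass another path
# as the second argument if it is mounted elsewhere.
collect_i915_debug() {
    outdir=${1:-./i915-report}
    state=${2:-/sys/kernel/debug/dri/0/i915_error_state}

    mkdir -p "$outdir" || return 1

    # Kernel log. Reading it may need root on some systems, so a failure
    # here is only reported as a warning.
    dmesg > "$outdir/dmesg.txt" 2>/dev/null ||
        echo "warning: could not read dmesg (try as root)" >&2

    # The error state is lost on reboot, so copy it while the GPU is hung.
    if [ -r "$state" ]; then
        cat "$state" > "$outdir/i915_error_state.txt"
    else
        echo "error: cannot read $state (debugfs mounted? root?)" >&2
        return 1
    fi
}
```

Run it as root right after the hang, e.g. `collect_i915_debug ~/i915-report`, and attach both files to the bug.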
On 04.12.2012, Daniel Vetter wrote:
Yeah, if anyone can somewhat reliably reproduce this
While writing a big file with dd and watching high resolution videos on youtube, I've managed to reproduce the hang. Unfortunately, it doesn't occur within seconds. Some playing around is necessary, and it takes between 30 sec. and 20 min.
Btw: which kernel is known to be the "last good one"?
If it's the ilk one we only know that 3.6.x series seems to be solid, and something in 3.7-rc (probably before -rc1) broke stuff. So not too useful.
I tried 3.6.9 several times over a few hours and could not trigger the hang, which clearly adds evidence to this statement. I don't want to scream out too loud, but 3.6.9 seems not to be affected. Will try some more hours to get a 3.6.9 box to hang, though.. Just in case..
Heinz
On Tuesday 04 December 2012 13:35:22 Heinz Diehl wrote:
Btw: which kernel is known to be the "last good one"?
As mentioned in the linked bug [1], I bisected it to:
commit 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
Author: Chris Wilson chris@chris-wilson.co.uk
Date: Thu Aug 23 13:12:52 2012 +0100
drm/i915: Use cpu relocations if the object is in the GTT but not mappable
(you need to disable rc6 on ilk to not hit another issue which seems much easier to hit)
Ilk? If this stands for "Ironlake": I'm on Sandybridge.
...
Bisecting will be a pain without being able to reproduce the hang reliably.
Atm we're trying to come up with ways to dump more debug information, but with no clue whatsoever what's going on that's slow-going.
Is there anything at the moment I can do to help you to get a grip on this problem? My machine is a Core i5-420M laptop with 4GB RAM (Asus U45-JC).
i5-420M is not SB, but ILK. i5-2xxx is SB. I have a i5-460M myself. i915.i915_enable_rc6=0 worked for me, if it does not work for you, then you probably hit another bug.
Peter
On Tue, Dec 4, 2012 at 9:41 PM, Lekensteyn lekensteyn@gmail.com wrote:
On Tuesday 04 December 2012 13:35:22 Heinz Diehl wrote:
Btw: which kernel is known to be the "last good one"?
As mentioned in the linked bug [1], I bisected it to:
commit 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
Author: Chris Wilson chris@chris-wilson.co.uk
Date: Thu Aug 23 13:12:52 2012 +0100
drm/i915: Use cpu relocations if the object is in the GTT but not mappable
Iirc your issue goes away with rc6=0, the residual bugs we still have all still happen with rc6 disabled, so probably something else. Hence also why we're asking everyone who can still reproduce to try a bisect, since with rc6 disabled we can't reproduce the hang any more (beforehand we could reproduce it on 3 different ilk machines). The important part is to not enable rc6 (on ironlake at least) when bisecting. -Daniel
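Since every bisect step boots a freshly built kernel, it is easy to lose the rc6 setting along the way. A small sketch of a check that the running kernel was really booted with i915.i915_enable_rc6=0; rc6_disabled_on_cmdline is a hypothetical helper name, and it only looks at the boot command line, not at what the driver actually does.

```shell
# Sketch: succeed only if the boot command line contains the exact
# parameter i915.i915_enable_rc6=0.
# The optional argument exists for testing; the default is /proc/cmdline.
rc6_disabled_on_cmdline() {
    cmdline=${1:-/proc/cmdline}
    # One parameter per line, then match the whole line as a fixed string.
    tr ' ' '\n' < "$cmdline" | grep -qxF 'i915.i915_enable_rc6=0'
}
```

Before marking a bisect step good or bad, something like `rc6_disabled_on_cmdline || echo "rc6 not disabled, result not comparable"` keeps the results consistent.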
On 04.12.2012, Daniel Vetter wrote:
The important part is to not enable rc6 (on ironlake at least) when bisecting.
A shot in the dark: could it be that all the machines which encounter this hang have nvidia's optimus? Mine has. Could that somehow be related? (I'm by no means a programmer or a kernel hacker..).
On 04.12.2012, Lekensteyn wrote:
As mentioned in the linked bug [1], I bisected it to:
commit 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
Author: Chris Wilson chris@chris-wilson.co.uk
Date: Thu Aug 23 13:12:52 2012 +0100
drm/i915: Use cpu relocations if the object is in the GTT but not mappable
Ok, but in comment 11 in the same thread you mention that reverting this patch didn't fix the issue for you:
"Reverting that commit on top of 3.7-rc4 did not fix the hang issue."
i5-420M is not SB, but ILK. i5-2xxx is SB. I have a i5-460M myself.
Yes, you're right, my bad! Don't know what I was thinking as I wrote that. I don't have any i5-420M either, but an i5-450M. It was clearly not my day..
[htd@wildsau ~]$ cat /proc/cpuinfo | grep model
model : 37
model name : Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz
[....]
i915.i915_enable_rc6=0 worked for me, if it does not work for you, then you probably hit another bug.
I have now i915.i915_enable_rc6=0 in grub.cfg and disabled the XFCE compositor. Now I'm trying to hit the bug again...
Heinz
On Tuesday 04 December 2012 22:08:45 Heinz Diehl wrote:
Ok, but in comment 11 in the same thread you mention that reverting this patch didn't fix the issue for you:
"Reverting that commit on top of 3.7-rc4 did not fix the hang issue."
The bisected commit was from between rc2 and rc3:
$ git describe 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
v3.6-rc2-88-g504c726
The fact that reverting that commit does not help implies that some commits thereafter also expose the bug.
i915.i915_enable_rc6=0 worked for me, if it does not work for you, then you probably hit another bug.
I have now i915.i915_enable_rc6=0 in grub.cfg and disabled the XFCE compositor. Now I'm trying to hit the bug again...
Do you have a reliable reproduce method? As you can see in the linked bug it was caused by relatively low memory pressure combined with high I/O (caching? delays? Who knows).
A shot in the dark: could it be that all the machines which encounter this hang have nvidia's optimus? Mine has. Could that somehow be related? (I'm by no means a programmer or a kernel hacker..).
It is unlikely that Optimus has anything to do with this.
Peter
On Tue, Dec 4, 2012 at 11:09 PM, Lekensteyn lekensteyn@gmail.com wrote:
On Tuesday 04 December 2012 22:08:45 Heinz Diehl wrote:
Ok, but in comment 11 in the same thread you mention that reverting this patch didn't fix the issue for you:
"Reverting that commit on top of 3.7-rc4 did not fix the hang issue."
The bisected commit was from between rc2 and rc3:
$ git describe 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
v3.6-rc2-88-g504c726
This just means that after -rc2 there are 88 patches until 504c72. This doesn't mean at all that this patch is included in -rc3 - git history is non-linear! In fact this commit is only part of the 3.7-rc1 release, so if you just update Linus' tree it will have shown up somewhere between the 3.6 and 3.7-rc1 tag being pushed out.
The fact that reverting that commit does not help implies that some commits thereafter also expose the bug.
Well, enabling rc6 was merged before the offending commit but still works around at least a class of bugs. So it's very likely that we're just hunting down different strawmen ...
i915.i915_enable_rc6=0 worked for me, if it does not work for you, then you probably hit another bug.
I have now i915.i915_enable_rc6=0 in grub.cfg and disabled the XFCE compositor. Now I'm trying to hit the bug again...
Do you have a reliable reproduce method? As you can see in the linked bug it was caused by relatively low memory pressure combined with high I/O (caching? delays? Who knows).
Nope, we could only reproduce quickly with rc6 enabled :( -Daniel
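Daniel's reading of the `git describe` output in the mail above can be illustrated with a small sketch. It only takes the string apart into "tag, number of commits on top of it, abbreviated commit"; explain_describe is a hypothetical helper name. The string says nothing about which release first shipped the commit.

```shell
# Sketch: split a `git describe` string of the form <tag>-<n>-g<abbrev>.
# It reports how many commits lie between the nearest reachable tag and
# the commit, which is all that the describe output encodes.
explain_describe() {
    desc=$1
    abbrev=${desc##*-g}     # text after the last "-g"
    rest=${desc%-g*}        # everything before the last "-g"
    count=${rest##*-}       # number after the last "-"
    tag=${rest%-*}          # the tag itself, which may contain "-"
    printf '%s commits after tag %s, ending at commit %s\n' \
        "$count" "$tag" "$abbrev"
}

# To ask the other question (which tags contain the commit), git offers,
# inside a kernel checkout:
#   git describe --contains <commit>
#   git tag --contains <commit>
```

For the value from the thread, `explain_describe v3.6-rc2-88-g504c726` prints "88 commits after tag v3.6-rc2, ending at commit 504c726".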
On 05.12.2012, Daniel Vetter wrote:
Nope, we could only reproduce quickly with rc6 enabled :(
Could reproduce it today this way:
dd if=/dev/zero of=deleteme bs=1M count=50000
while watching several HD videos on Youtube. Just tried once, so I'm not sure if this will work every time. Will try again now.
My "i915_error_state" is here:
http://www.fritha.org/i915/error-01.tar.bz2
Heinz
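For others trying the same recipe, here is a rough sketch of the I/O half as a repeatable loop; the video playback still has to be started by hand. write_load and gpu_hung_in_log are hypothetical helper names, the log file location is an assumption that depends on the distribution, and `bs=1M` is the GNU dd spelling used in the mail above.

```shell
# Sketch: the dd load from the mail, wrapped so it can be repeated.

# Write SIZE_MB mebibytes of zeros to FILE (default size as in the mail).
write_load() {
    file=$1
    size_mb=${2:-50000}
    dd if=/dev/zero of="$file" bs=1M count="$size_mb" 2>/dev/null
}

# Succeed if LOG contains the hangcheck message from the original report.
gpu_hung_in_log() {
    grep -qF 'Hangcheck timer elapsed' "$1"
}

# Example loop, not run here.
# ASSUMPTION: kernel messages are written to /var/log/messages.
#   until gpu_hung_in_log /var/log/messages; do
#       write_load deleteme 50000 && rm -f deleteme
#   done
```

The loop stops once the hangcheck message shows up, which makes it easier to note how long each attempt took.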
On 06.12.2012, Heinz Diehl wrote:
[....]
Here are some more error logs, incl. dmesg after booting with drm debug options turned on:
On 06.12.2012, Heinz Diehl wrote:
[....]
Ok, the last one for today. After extensive testing with heavy load and I/O while watching HD videos, I can almost safely conclude with the following:
1.) The hang does *never* occur with 3.6.9 vanilla
2.) The hang does *always* occur with 3.7-rc8+ / latest git
3.) The hang doesn't occur with 3.7/latest git when
Driver "intel"
Option "NoAccel" "True"
is set in xorg.conf (with all the drawbacks this introduces). Maybe this rings a bell for someone..
In all cases, the machine is booted with "i915.i915_enable_rc6=0".
Please contact me if you think I can help to debug this further.
Thanks, Heinz.
On 05.12.2012, Lekensteyn wrote:
I have now i915.i915_enable_rc6=0 in grub.cfg and disabled the XFCE compositor. Now I'm trying to hit the bug again...
Do you have a reliable reproduce method? As you can see in the linked bug it was caused by relatively low memory pressure combined with high I/O (caching? delays? Who knows).
No, unfortunately not. I will do my very best to find out how to trigger it. For now, I'm trying with a script which produces max. I/O. Will also try by replaying a lot of high resolution videos and similar.
It is unlikely that Optimus has anything to do with this.
Ok.
Heinz
dri-devel@lists.freedesktop.org