i915: GPU hung (F14, Intel Core i5-670)


Linus Torvalds

28 May 2012

1:26 a.m.

More i915 error reports.. Is the freedesktop bugzilla the right place for these? I'm an email kind of person, and the bugs.freedesktop.org bugzilla setup for xorg seems kind of broken.

I did fill in this, though:

https://bugs.freedesktop.org/show_bug.cgi?id=50405

but here is the info by email too... Up-to-date Fedora-14, current kernel, Google chrome with WebGL forced on and wasting time with bejeweled resulted in:

[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:intel_pipe_set_base] *ERROR* pin & fence failed

i915_error_state contents attached. I haven't done the bejeweled thing in a while, but it used to work fine, so I'm pretty sure this is new. There seems to have been a fair amount of ringbuffer and fencing work by Chris since 3.4, so..

Any other information I can give you guys?

Linus

Attachments:

i915_error_state.bin (application/octet-stream — 1.4 MB)


Chris Wilson

28 May 2012

7:06 a.m.

On Sun, 27 May 2012 18:26:05 -0700, Linus Torvalds torvalds@linux-foundation.org wrote:

...

> i915_error_state contents attached. I haven't done the bejeweled thing in a while, but it used to work fine, so I'm pretty sure this is new. There seems to have been a fair amount of ringbuffer and fencing work by Chris since 3.4, so..
>
> Any other information I can give you guys?

No, the i915_error_state had everything I needed to see. It is the old ddx bug that was hardcoding a maximum relocation address that never corresponded with an actual hw limit. As soon as we try to use memory above that value, the GPU decides not to listen to us any more.

Fixed in xf86-video-intel 2.14.901

-Chris

-- Chris Wilson, Intel Open Source Technology Centre

Adam Jackson

29 May 2012

2:45 p.m.

On Mon, 2012-05-28 at 08:06 +0100, Chris Wilson wrote:

...

> No, the i915_error_state had everything I needed to see. It is the old ddx bug that was hardcoding a maximum relocation address that never corresponded with an actual hw limit. As soon as we try to use memory above that value, the GPU decides not to listen to us any more.
>
> Fixed in xf86-video-intel 2.14.901

Which you'd need to build yourself, since I couldn't build F14 updates even if I wanted to. F15 has something much newer in updates though.

- ajax

Linus Torvalds

30 May 2012

2:41 a.m.

On Mon, May 28, 2012 at 12:06 AM, Chris Wilson chris@chris-wilson.co.uk wrote:

...

> No, the i915_error_state had everything I needed to see. It is the old ddx bug that was hardcoding a maximum relocation address that never corresponded with an actual hw limit. As soon as we try to use memory above that value, the GPU decides not to listen to us any more.
>
> Fixed in xf86-video-intel 2.14.901

I really don't think that's the case.

I have run the F14 X server for a *long* time without these issues on this machine, and today I now got a second GPU hang with the current git tree. I was in the middle of just writing an email in chrome, nothing fancy going on at all.

So please please *please* take a second look. Because I think it's triggered by the i915 changes, or you undid a workaround that used to work fine.

Linus

Chris Wilson

7:11 a.m.

On Tue, 29 May 2012 19:41:37 -0700, Linus Torvalds torvalds@linux-foundation.org wrote:

...

> On Mon, May 28, 2012 at 12:06 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
>
> ...
>> No, the i915_error_state had everything I needed to see. It is the old ddx bug that was hardcoding a maximum relocation address that never corresponded with an actual hw limit. As soon as we try to use memory above that value, the GPU decides not to listen to us any more.
>>
>> Fixed in xf86-video-intel 2.14.901
>
> I really don't think that's the case.
>
> I have run the F14 X server for a *long* time without these issues on this machine, and today I now got a second GPU hang with the current git tree. I was in the middle of just writing an email in chrome, nothing fancy going on at all.
>
> So please please *please* take a second look. Because I think it's triggered by the i915 changes, or you undid a workaround that used to work fine.

...

From the error-state:

0x0314e050: 0x61010006: STATE_BASE_ADDRESS
0x0314e054: 0x00000001: general state base address 0x00000000
0x0314e058: 0x00000001: surface state base address 0x00000000
0x0314e05c: 0x00000001: indirect state base address 0x00000000
0x0314e060: 0x00000001: instruction state base address 0x00000000
0x0314e064: 0x10000001: general state upper bound 0x10000000
0x0314e068: 0x10000001: indirect state upper bound 0x10000000
0x0314e06c: 0x10000001: instruction state upper bound 0x10000000

And if we look at some of the other auxiliary instructions buffers sent along with the batch:

0314e000 16384 0048 0000 000ab700 dirty purgeable render uncached
...
11e30000 4096 0011 0000 000ab700 purgeable render uncached
11e2b000 4096 0011 0000 000ab700 purgeable render uncached
10e43000 4096 0011 0000 000ab700 render uncached
10e44000 4096 0011 0000 000ab700 purgeable render uncached

0x10 is the instruction domain, which gives a total of about 20 instruction buffers referenced from that batch above the upper bound. In particular, this appears to have been the first batch to use addresses above 256M.

This batch fits the modus operandi of the bug that was fixed in 2.14.901, so it would seem sensible to address the known issue first.

-Chris

-- Chris Wilson, Intel Open Source Technology Centre
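The check Chris describes can be sketched in a few lines of Python. This is an illustrative reading of the excerpt only, not a tool used in the thread. It assumes that the low 12 bits of the upper-bound dword are flag bits (so 0x10000001 decodes to 0x10000000, as the dump itself shows), and that the buffer-list columns are GTT offset, size, and read domains, in that order. The names `decode_bound` and `buffers_above_bound` are hypothetical, and only the five rows quoted above are included.

```python
# Values copied from the error-state excerpt quoted in the message above.
INSTRUCTION_DOMAIN = 0x10  # read-domain bit identified in the thread as "instruction"


def decode_bound(dword):
    """Strip the low 12 bits (assumed to be flags) to get a 4 KiB-aligned address."""
    return dword & ~0xFFF


# (gtt_offset, size, read_domains) for the rows shown in the message.
# The column meaning is an assumption based on how the message reads them.
BUFFERS = [
    (0x0314E000, 16384, 0x0048),
    (0x11E30000, 4096, 0x0011),
    (0x11E2B000, 4096, 0x0011),
    (0x10E43000, 4096, 0x0011),
    (0x10E44000, 4096, 0x0011),
]


def buffers_above_bound(buffers, bound):
    """Return offsets of instruction-domain buffers that extend past the bound."""
    return [
        offset
        for offset, size, read_domains in buffers
        if read_domains & INSTRUCTION_DOMAIN and offset + size > bound
    ]


instruction_upper_bound = decode_bound(0x10000001)
offenders = buffers_above_bound(BUFFERS, instruction_upper_bound)

if __name__ == "__main__":
    print("instruction state upper bound:", hex(instruction_upper_bound))
    for offset in offenders:
        print("instruction buffer above bound:", hex(offset))
```

With the quoted rows, the four buffers marked 0011 all sit above the 256M bound, while the first buffer (read domains 0048) does not carry the instruction bit and is skipped.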

Chris Wilson

7:25 a.m.

On Tue, 29 May 2012 19:41:37 -0700, Linus Torvalds torvalds@linux-foundation.org wrote:

...

> On Mon, May 28, 2012 at 12:06 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
>
> ...
>> No, the i915_error_state had everything I needed to see. It is the old ddx bug that was hardcoding a maximum relocation address that never corresponded with an actual hw limit. As soon as we try to use memory above that value, the GPU decides not to listen to us any more.
>>
>> Fixed in xf86-video-intel 2.14.901
>
> I really don't think that's the case.
>
> I have run the F14 X server for a *long* time without these issues on this machine, and today I now got a second GPU hang with the current git tree. I was in the middle of just writing an email in chrome, nothing fancy going on at all.

You've reported this bug in the past, though maybe on a different machine: alpine.LFD.2.02.1111221437120.9111@i5.linux-foundation.org

-Chris

-- Chris Wilson, Intel Open Source Technology Centre

Linus Torvalds

4:22 p.m.

On Wed, May 30, 2012 at 12:25 AM, Chris Wilson chris@chris-wilson.co.uk wrote:

...

> You've reported this bug in the past, though maybe on a different machine:

It's quite likely the same machine - but in the past it may have happened once per six months or something. Now it happened twice in two days.

Linus

Daniel Vetter

7:42 a.m.

On Tue, May 29, 2012 at 07:41:37PM -0700, Linus Torvalds wrote:

...

> On Mon, May 28, 2012 at 12:06 AM, Chris Wilson chris@chris-wilson.co.uk wrote:
>
> ...
>> No, the i915_error_state had everything I needed to see. It is the old ddx bug that was hardcoding a maximum relocation address that never corresponded with an actual hw limit. As soon as we try to use memory above that value, the GPU decides not to listen to us any more.
>>
>> Fixed in xf86-video-intel 2.14.901
>
> I really don't think that's the case.
>
> I have run the F14 X server for a *long* time without these issues on this machine, and today I now got a second GPU hang with the current git tree. I was in the middle of just writing an email in chrome, nothing fancy going on at all.

Well, we've quite massively tuned our gpu address space handling in 3.5, so it's a bit more likely to hit this problem. Relevant commits are 3ae5378330f5814 ffc62976d2158 dabdfe021ab1e985e

...

> So please please *please* take a second look. Because I think it's triggered by the i915 changes, or you undid a workaround that used to work fine.

Nope, and you've reported this exact problem previously already, e.g.

http://comments.gmane.org/gmane.comp.video.dri.devel/63082

Really, please upgrade your userspace - this is far from the only bug fixed since then that can result in a gpu hang.

Yours, Daniel

-- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48

Linus Torvalds

4:21 p.m.

On Wed, May 30, 2012 at 12:42 AM, Daniel Vetter daniel@ffwll.ch wrote:

...

> Really, please upgrade your userspace - this is far from the only bug fixed since then that can result in a gpu hang.

I *can't* upgrade my userspace.

F14 is the last one that has a sane window manager. After that, the gnome3 shit happens, and my productivity goes to hell.

Gnome 3.4 in F17 is getting closer to usable, so I may be forced to some day, but seriously - I want to have a modern kernel with usable user space, and right now that means F14.

Linus


dri-devel@lists.freedesktop.org
