i915_driver_irq_handler: irq 42: nobody cared


Jiri Slaby

27 Mar 2012

8:40 a.m.

Hi,

I'm getting spurious interrupts leading to disabling the interrupt:
 42: 1916853 2471662 PCI-MSI-edge i915@pci:0000:00:02.0

The message:
irq 42: nobody cared (try booting with the "irqpoll" option)
Pid: 20716, comm: virtuoso-t Not tainted 3.3.0-next-20120326_64+ #1673

It is not new, but now I can reproduce it more-or-less reliably after an hour or so. It usually happens when playing a game using wine.

Do you want me to dump some registers when IRQ_NONE is returned from the ISR? As this is MSI, nobody else can sit there.

thanks, -- js suse labs


Jiri Slaby

27 Mar

8:42 a.m.

On 03/27/2012 10:40 AM, Jiri Slaby wrote:

...

Hi,

I'm getting spurious interrupts leading to disabling the interrupt: 42: 1916853 2471662 PCI-MSI-edge i915@pci:0000:00:02.0

The message: irq 42: nobody cared (try booting with the "irqpoll" option) Pid: 20716, comm: virtuoso-t Not tainted 3.3.0-next-20120326_64+ #1673

It is not new, but now I can reproduce it more-or-less reliably after an hour or so. It usually happens when playing a game using wine.

Do you want me to dump some registers when IRQ_NONE is returned from the ISR? As this is MSI, nobody else can sit there.

Also lspci:
00:02.0 VGA compatible controller [0300]: Intel Corporation 82G33/G31 Express Integrated Graphics Controller [8086:29c2] (rev 02) (prog-if 00 [VGA controller])
	Subsystem: Intel Corporation 82G33/G31 Express Integrated Graphics Controller [8086:29c2]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 42
	Region 0: Memory at feb80000 (32-bit, non-prefetchable) [size=512K]
	Region 1: I/O ports at ec00 [size=8]
	Region 2: Memory at d0000000 (32-bit, prefetchable) [size=256M]
	Region 3: Memory at fea00000 (32-bit, non-prefetchable) [size=1M]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee0300c Data: 4179
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: i915
00: 86 80 c2 29 07 04 90 00 02 00 00 03 00 00 00 00
10: 00 00 b8 fe 01 ec 00 00 08 00 00 d0 00 00 a0 fe
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 c2 29
30: 00 00 00 00 90 00 00 00 00 00 00 00 05 01 00 00
40: 09 00 0b 01 00 00 00 00 01 00 00 00 00 00 00 00
50: 00 00 30 02 c9 03 00 00 00 00 00 00 00 00 80 af
60: 00 00 02 02 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 05 d0 01 00 0c 30 e0 fe 79 41 00 00 00 00 00 00
a0: 11 11 00 00 00 00 06 03 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 01 00 22 00 00 00 00 00 00 00 00 00 00 01 02 00
e0: 00 00 00 00 00 00 00 00 00 80 00 00 00 00 00 00
f0: 10 00 00 00 00 00 00 00 90 0f 03 00 e4 e0 5b af

...

thanks,

-- js suse labs

Jiri Slaby

30 Mar

9:59 a.m.

On 03/27/2012 10:42 AM, Jiri Slaby wrote:

...

On 03/27/2012 10:40 AM, Jiri Slaby wrote:

...
Hi,

I'm getting spurious interrupts leading to disabling the interrupt: 42: 1916853 2471662 PCI-MSI-edge i915@pci:0000:00:02.0

The message: irq 42: nobody cared (try booting with the "irqpoll" option) Pid: 20716, comm: virtuoso-t Not tainted 3.3.0-next-20120326_64+ #1673

It is not new, but now I can reproduce it more-or-less reliably after an hour or so. It usually happens when playing a game using wine.

Do you want me to dump some registers when IRQ_NONE is returned from the ISR? As this is MSI, nobody else can sit there.

The handler *constantly* returns IRQ_NONE.

With this patch:
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -28,6 +28,7 @@

 #include <linux/sysrq.h>
 #include <linux/slab.h>
+#include <linux/ratelimit.h>
 #include "drmP.h"
 #include "drm.h"
 #include "i915_drm.h"
@@ -1416,6 +1417,14 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
 		iir = new_iir;
 	}

+	if (ret == IRQ_NONE && printk_ratelimit()) {
+		printk(KERN_DEBUG "%s:", __func__);
+		for_each_pipe(pipe) {
+			printk(KERN_CONT " %d=%.8x", pipe,
+					pipe_stats[pipe]);
+		}
+	}
+
 	return ret;
 }

And I get:
[ 3572.968581] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3572.977472] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.224839] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.243558] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.384912] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.403462] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.464100] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.477383] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.829016] i915_driver_irq_handler: 0=00020000 1=00000000
[ 3577.830093] i915_driver_irq_handler: 0=00020000 1=00000000
[ 3578.013015] i915_driver_irq_handler: 12 callbacks suppressed

I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

...
thanks,

-- js suse labs

Chris Wilson

10:45 a.m.

On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...

I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU. -Chris

-- Chris Wilson, Intel Open Source Technology Centre

Jiri Slaby

12:11 p.m.

On 03/30/2012 12:45 PM, Chris Wilson wrote:

...

On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

This does not make sense, the handler does something different. Even if IIR is 0, it still takes a look at pipe stats.

And this is MSI, so there can be no other source of the interrupt. (Except broken IRQ routing.) I may try to boot with MSIs off if you think it's important.

thanks, -- js suse labs

Chris Wilson

12:24 p.m.

On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby jslaby@suse.cz wrote:

...

On 03/30/2012 12:45 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

This does not make sense, the handler does something different. Even if IIR is 0, it still takes a look at pipe stats.

That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close a race where the pipestat triggered an interrupt after we processed the secondary registers and before resetting the primary.

But the basic premise that we should only enter the interrupt handler with IIR!=0 holds (presuming non-shared interrupt lines such as MSI). -Chris

-- Chris Wilson, Intel Open Source Technology Centre

Jiri Slaby

6 Apr

9:31 p.m.

New subject: i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

On 03/30/2012 02:24 PM, Chris Wilson wrote:

...

On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
On 03/30/2012 12:45 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

This does not make sense, the handler does something different. Even if IIR is 0, it still takes a look at pipe stats.

That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close a race where the pipestat triggered an interrupt after we processed the secondary registers and before resetting the primary.

But the basic premise that we should only enter the interrupt handler with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).

Ok, this behavior is definitely new. I get several "nobody cared" about this interrupt a week. This never used to happen. And something weird emerges in /proc/interrupts when this happens:
 42: 1003292 1212890 PCI-MSI-edge �s��:0000:00:02.0
instead of
 42: 1006715 1218472 PCI-MSI-edge i915@pci:0000:00:02.0

It very much looks like the generic IRQ handling code is broken. Like it frees/corrupts irq_desc and then as well calls random handlers.

Suspend/resume cycle helps in this case and "i915@pci:0000:00:02.0" is back in /proc/interrupts as can be seen above.

Running 3.3.0-next-20120326_64+ now.

thanks,

-- js suse labs

Thomas Gleixner

10:40 p.m.

New subject: i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

On Fri, 6 Apr 2012, Jiri Slaby wrote:

...

On 03/30/2012 02:24 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
On 03/30/2012 12:45 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

This does not make sense, the handler does something different. Even if IIR is 0, it still takes a look at pipe stats.

That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close a race where the pipestat triggered an interrupt after we processed the secondary registers and before resetting the primary.

But the basic premise that we should only enter the interrupt handler with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).

Ok, this behavior is definitely new. I get several "nobody cared" about this interrupt a week. This never used to happen. And something weird emerges in /proc/interrupts when this happens: 42: 1003292 1212890 PCI-MSI-edge ???s????????????:0000:00:02.0 instead of 42: 1006715 1218472 PCI-MSI-edge i915@pci:0000:00:02.0

It very much looks like the generic IRQ handling code is broken. Like it frees/corrupts irq_desc and ...

OMG, your problem analyzing skills are amazing.

If irq_desc would have been freed, then it wouldn't print the numbers and the irq type. And irq_desc is not corrupted either, otherwise the whole thing would explode in your face.

The printout of the name is done via action->name. The irq action merely holds a pointer to the device name string, which is handed over with request_irq. So you are saying that the core code corrupts the memory which was handed in via a pointer by the driver?

So now that's really an amazing core feature:

It corrupts the memory with weird characters and still maintains the PCI bus number correct. So it not only corrupts memory it also moves the PCI part of the string a few characters to the end.

If the pointer in the irq action would have been corrupted, then you would see a few weird characters and then the full string, not a random thing which is half correct and shifted by a few bytes.

The pointer which is handed in is dev->devname, which gets allocated and filled in drm_pci_set_busid().

...

... then as well calls random handlers.

Which random handlers would be called? The core code only calls handlers which are associated to a particular interrupt. And only when that particular interrupt is raised and not because the CPU pulls interrupt events out of thin air.

And it calls the stupid i915 handler and not something else, otherwise you would not observe the IIR=0 printk or whatever you put there for debugging.

...

Suspend/resume cycle helps in this case and "i915@pci:0000:00:02.0" is back in /proc/interrupts as can be seen above.

That's proving what? That the irq core code magically restores the correct string, right? And probably it stops calling random handlers as well. Brilliant deduction.

You know what? suspend calls free_irq() via i915_drm_freeze() -> drm_irq_uninstall() and the resume code calls request_irq() again. free_irq() removes the action and request_irq installs it fresh.

So now the interesting part is that free_irq() checks the dev_id cookie for a match, which is also stored in the irq action. So we are dealing with a magic corrupt only action->name and action->handler problem. Pretty realistic.

What the heck makes you assume that the irq core code is broken? Core code, which works on a gazillion of machines and different device drivers and does not corrupt anything except that i915 thingy?

Come on, you need to provide better evidence than weird ass guessing.

If you're still convinced that the irq core is messing with your device string, then simply hand in a NULL pointer when requesting the interrupt. That will make the core code explode nicely when it tries to modify that memory.

Thanks,

tglx
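
Note: the point that the irq action only stores a pointer to the name string, so whatever later happens to that buffer shows up in /proc/interrupts, can be illustrated with a small userspace model. The struct and function names below are hypothetical stand-ins and not the kernel's definitions; the buffer is overwritten in place (rather than freed) so that the example stays well-defined C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for an irq action: it keeps the caller's pointer and never
 * copies the string. */
struct action_model {
	const char *name; /* points into driver-owned memory */
	void *dev_id;     /* cookie matched again when the irq is freed */
};

void request_irq_model(struct action_model *a, const char *name, void *dev_id)
{
	assert(a != NULL && name != NULL);
	a->name = name;
	a->dev_id = dev_id;
}

/* What a /proc/interrupts style reader does: follow the pointer at the
 * moment of reading. */
int show_name_model(const struct action_model *a, char *out, size_t len)
{
	assert(a != NULL && out != NULL);
	return snprintf(out, len, "%s", a->name);
}
```

If the driver's buffer is later scribbled over or reused, the model prints the new contents while `dev_id` and the handler stay intact, which is consistent with only the name looking corrupted.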

Jesse Barnes

9 Apr

5:12 p.m.

New subject: i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

On Sat, 7 Apr 2012 00:40:28 +0200 (CEST) Thomas Gleixner tglx@linutronix.de wrote:

...

You know what? suspend calls free_irq() via i915_drm_freeze() -> drm_irq_uninstall() and the resume code calls request_irq() again. free_irq() removes the action and request_irq installs it fresh.

Yeah this is a known issue with the DRM code, I thought Dave had a fix queued a long time ago though... Dave?

-- Jesse Barnes, Intel Open Source Technology Center

Dave Airlie

5:52 p.m.

New subject: i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

...

...
You know what? suspend calls free_irq() via i915_drm_freeze() -> drm_irq_uninstall() and the resume code calls request_irq() again. free_irq() removes the action and request_irq installs it fresh.

Yeah this is a known issue with the DRM code, I thought Dave had a fix queued a long time ago though... Dave?

/me doesn't remember seeing one but maybe this one?

http://lists.freedesktop.org/archives/dri-devel/2011-August/013407.html

probably fell down a hole.

Dave.

Jiri Slaby

10 Apr

8:44 a.m.

New subject: i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

On 04/07/2012 12:40 AM, Thomas Gleixner wrote:

...

On Fri, 6 Apr 2012, Jiri Slaby wrote:

...
It very much looks like the generic IRQ handling code is broken. Like it frees/corrupts irq_desc and ...

OMG, your problem analyzing skills are amazing.

Hehe, no I did *no* analysis. I stand here as a bug reporter.

...

What the heck makes you assume that the irq core code is broken? Core code, which works on a gazillion of machines and different device drivers and does not corrupt anything except that i915 thingy?

Note that this is a -next regression. And i915 graphics used. This definitely doesn't run on a gazillion of machines.

...

If you're still convinced that the irq core is messing with your device string,

Nope, thanks for the input.

-- js suse labs

Daniel Vetter

8:50 a.m.

New subject: i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?]

On Fri, Apr 6, 2012 at 23:31, Jiri Slaby jslaby@suse.cz wrote:

...

...
That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close a race where the pipestat triggered an interrupt after we processed the secondary registers and before resetting the primary.

But the basic premise that we should only enter the interrupt handler with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).

Ok, this behavior is definitely new. I get several "nobody cared" about this interrupt a week. This never used to happen. And something weird emerges in /proc/interrupts when this happens: 42: 1003292 1212890 PCI-MSI-edge �s��:0000:00:02.0 instead of 42: 1006715 1218472 PCI-MSI-edge i915@pci:0000:00:02.0

This looks ugly. Can you try to reproduce on 3.4-rc2? That should contain everything that -next currently contains drm/i915-wise. If it still happens there, please bisect it.

Also please check whether any of the subordinate interrupt regs (pipestat) is stuck and might cause these interrupts as Jesse suggested.

Thanks, Daniel

-- Daniel Vetter daniel.vetter@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch

Jiri Slaby

8:52 a.m.

On 04/06/2012 11:31 PM, Jiri Slaby wrote:

...

On 03/30/2012 02:24 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
On 03/30/2012 12:45 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

This does not make sense, the handler does something different. Even if IIR is 0, it still takes a look at pipe stats.

That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close a race where the pipestat triggered an interrupt after we processed the secondary registers and before resetting the primary.

But the basic premise that we should only enter the interrupt handler with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).

Ok, this behavior is definitely new. I get several "nobody cared" about this interrupt a week. This never used to happen. And something weird emerges in /proc/interrupts when this happens: 42: 1003292 1212890 PCI-MSI-edge �s��:0000:00:02.0 instead of 42: 1006715 1218472 PCI-MSI-edge i915@pci:0000:00:02.0

See the difference of drm_device->devname:

Before:
20 34 32 3a 20 20 20 20 31 34 30 35 34 36 32 20  | 42:    1405462 |
20 20 20 31 37 32 38 33 30 32 20 20 20 50 43 49  |   1728302   PCI|
2d 4d 53 49 2d 65 64 67 65 20 20 20 20 20 20 69  |-MSI-edge      i|
39 31 35 40 70 63 69 3a 30 30 30 30 3a 30 30 3a  |915@pci:0000:00:|
30 32 2e 30 0a                                   |02.0.|

After:
20 34 32 3a 20 20 20 20 31 30 30 33 32 39 32 20  | 42:    1003292 |
20 20 20 31 32 31 32 38 39 30 20 20 20 50 43 49  |   1212890   PCI|
2d 4d 53 49 2d 65 64 67 65 20 20 20 20 20 20 ef  |-MSI-edge      .|
bf bd 73 ef bf bd ef bf bd ef bf bd ef bf bd 3a  |..s............:|
30 30 30 30 3a 30 30 3a 30 32 2e 30 0a           |0000:00:02.0.|

Any idea what "ef bf bd" pattern could be? And who *shifts* the "0000:00:02.0" string?

thanks,

-- js suse labs
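
Note on the "ef bf bd" pattern asked about above: those three bytes are the UTF-8 encoding of U+FFFD, the Unicode replacement character that decoders substitute for input they cannot interpret. It is therefore possible (an assumption, not something the dump proves) that the bytes in kernel memory were different and were rewritten by whatever tool captured the text, and that part of the apparent shift of "0000:00:02.0" comes from each replaced unit growing to three bytes. A minimal sketch that decodes the triple and counts its occurrences in the name field of the "After" dump; the helper names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
 * Returns the code point, or 0 if p does not start such a sequence. */
uint32_t decode_utf8_3(const unsigned char *p)
{
	assert(p != NULL);

	if ((p[0] & 0xF0) != 0xE0 ||
	    (p[1] & 0xC0) != 0x80 ||
	    (p[2] & 0xC0) != 0x80)
		return 0;

	return ((uint32_t)(p[0] & 0x0F) << 12) |
	       ((uint32_t)(p[1] & 0x3F) << 6) |
	       (uint32_t)(p[2] & 0x3F);
}

/* Count how many times the EF BF BD triple occurs in a buffer. */
size_t count_replacement_chars(const unsigned char *buf, size_t len)
{
	size_t i = 0, n = 0;

	assert(buf != NULL || len == 0);

	while (i + 3 <= len) {
		if (buf[i] == 0xEF && buf[i + 1] == 0xBF && buf[i + 2] == 0xBD) {
			n++;
			i += 3;
		} else {
			i++;
		}
	}
	return n;
}
```

Applied to the 17 bytes between "PCI-MSI-edge" and "0000:00:02.0" in the "After" dump, the count is five replacement characters plus a literal "s" and the ":" separator.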

Marcin Slusarz

4:50 p.m.

On Tue, Apr 10, 2012 at 10:52:06AM +0200, Jiri Slaby wrote:

...

On 04/06/2012 11:31 PM, Jiri Slaby wrote:

...
On 03/30/2012 02:24 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
On 03/30/2012 12:45 PM, Chris Wilson wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

This does not make sense, the handler does something different. Even if IIR is 0, it still takes a look at pipe stats.

That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close a race where the pipestat triggered an interrupt after we processed the secondary registers and before resetting the primary.

But the basic premise that we should only enter the interrupt handler with IIR!=0 holds (presuming non-shared interrupt lines such as MSI).

Ok, this behavior is definitely new. I get several "nobody cared" about this interrupt a week. This never used to happen. And something weird emerges in /proc/interrupts when this happens: 42: 1003292 1212890 PCI-MSI-edge �s��:0000:00:02.0 instead of 42: 1006715 1218472 PCI-MSI-edge i915@pci:0000:00:02.0

See the difference of drm_device->devname:

Before:
20 34 32 3a 20 20 20 20 31 34 30 35 34 36 32 20  | 42:    1405462 |
20 20 20 31 37 32 38 33 30 32 20 20 20 50 43 49  |   1728302   PCI|
2d 4d 53 49 2d 65 64 67 65 20 20 20 20 20 20 69  |-MSI-edge      i|
39 31 35 40 70 63 69 3a 30 30 30 30 3a 30 30 3a  |915@pci:0000:00:|
30 32 2e 30 0a                                   |02.0.|

After:
20 34 32 3a 20 20 20 20 31 30 30 33 32 39 32 20  | 42:    1003292 |
20 20 20 31 32 31 32 38 39 30 20 20 20 50 43 49  |   1212890   PCI|
2d 4d 53 49 2d 65 64 67 65 20 20 20 20 20 20 ef  |-MSI-edge      .|
bf bd 73 ef bf bd ef bf bd ef bf bd ef bf bd 3a  |..s............:|
30 30 30 30 3a 30 30 3a 30 32 2e 30 0a           |0000:00:02.0.|

Any idea what "ef bf bd" pattern could be? And who *shifts* the "0000:00:02.0" string?

Maybe this patch will help catch it:

---
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index cf85155..2f9717c 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -69,7 +69,7 @@ static void
 drm_unset_busid(struct drm_device *dev,
 		struct drm_master *master)
 {
-	kfree(dev->devname);
+	free_pages((unsigned long)dev->devname, 0);
 	dev->devname = NULL;

 	kfree(master->unique);
diff --git a/drivers/gpu/drm/drm_pci.c b/drivers/gpu/drm/drm_pci.c
index 13f3d93..d788b78 100644
--- a/drivers/gpu/drm/drm_pci.c
+++ b/drivers/gpu/drm/drm_pci.c
@@ -177,9 +177,7 @@ int drm_pci_set_busid(struct drm_device *dev, struct drm_master *master)
 	} else
 		master->unique_len = len;

-	dev->devname =
-		kmalloc(strlen(pdriver->name) +
-			master->unique_len + 2, GFP_KERNEL);
+	dev->devname = (void *)__get_free_pages(GFP_KERNEL, 0);

 	if (dev->devname == NULL) {
 		ret = -ENOMEM;
@@ -188,6 +186,7 @@ int drm_pci_set_busid(struct drm_device *dev, struct drm_master *master)

 	sprintf(dev->devname, "%s@%s", pdriver->name, master->unique);
+	set_memory_ro((unsigned long)dev->devname, 1);

 	return 0;
 err:
@@ -217,8 +216,7 @@ int drm_pci_set_unique(struct drm_device *dev,
 	master->unique[master->unique_len] = '\0';

 	bus_name = dev->driver->bus->get_name(dev);
-	dev->devname = kmalloc(strlen(bus_name) +
-			       strlen(master->unique) + 2, GFP_KERNEL);
+	dev->devname = (void *)__get_free_pages(GFP_KERNEL, 0);
 	if (!dev->devname) {
 		ret = -ENOMEM;
 		goto err;
@@ -226,6 +224,7 @@ int drm_pci_set_unique(struct drm_device *dev,

 	sprintf(dev->devname, "%s@%s", bus_name, master->unique);
+	set_memory_ro((unsigned long)dev->devname, 1);

 	/* Return error if the busid submitted doesn't match the device's actual
 	 * busid.
diff --git a/drivers/gpu/drm/drm_platform.c b/drivers/gpu/drm/drm_platform.c
index 82431dc..aa0acec 100644
--- a/drivers/gpu/drm/drm_platform.c
+++ b/drivers/gpu/drm/drm_platform.c
@@ -148,9 +148,7 @@ static int drm_platform_set_busid(struct drm_device *dev, struct drm_master *mas
 		goto err;
 	}

-	dev->devname =
-		kmalloc(strlen(dev->platformdev->name) +
-			master->unique_len + 2, GFP_KERNEL);
+	dev->devname = (void *)__get_free_pages(GFP_KERNEL, 0);

 	if (dev->devname == NULL) {
 		ret = -ENOMEM;
@@ -159,6 +157,8 @@ static int drm_platform_set_busid(struct drm_device *dev, struct drm_master *mas

 	sprintf(dev->devname, "%s@%s", dev->platformdev->name, master->unique);
+	set_memory_ro((unsigned long)dev->devname, 1);
+
 	return 0;
 err:
 	return ret;
diff --git a/drivers/gpu/drm/drm_stub.c b/drivers/gpu/drm/drm_stub.c
index aa454f8..4f53c0f 100644
--- a/drivers/gpu/drm/drm_stub.c
+++ b/drivers/gpu/drm/drm_stub.c
@@ -187,7 +187,7 @@ static void drm_master_destroy(struct kref *kref)
 		master->unique_len = 0;
 	}

-	kfree(dev->devname);
+	free_pages((unsigned long)dev->devname, 0);
 	dev->devname = NULL;

 	list_for_each_entry_safe(pt, next, &master->magicfree, head) {
@@ -494,7 +494,7 @@ void drm_put_dev(struct drm_device *dev)

 	list_del(&dev->driver_item);
 	if (dev->devname) {
-		kfree(dev->devname);
+		free_pages((unsigned long)dev->devname, 0);
 		dev->devname = NULL;
 	}
 	kfree(dev);
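
Note: the debugging patch above gives devname a page of its own and marks it read-only with set_memory_ro(), so that a stray CPU write to the string faults at the moment it happens instead of silently corrupting it. The same idea can be tried in userspace with mmap() and mprotect(); this is an analogue of the technique and not the kernel API. The function name is hypothetical, and the sketch assumes a POSIX system with MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

static void on_fault(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/* Returns 1 if a write to the read-only name page faulted, 0 if the
 * write went through, and -1 if the setup failed. */
int write_to_ro_name_faults(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old_segv, old_bus;
	int faulted;
	char *name;

	assert(page > 0);

	/* Give the string a page of its own, as the patch does. */
	name = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (name == MAP_FAILED)
		return -1;

	snprintf(name, (size_t)page, "%s@%s", "i915", "pci:0000:00:02.0");

	/* Userspace counterpart of set_memory_ro(). */
	if (mprotect(name, (size_t)page, PROT_READ) != 0) {
		munmap(name, (size_t)page);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_fault;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, &old_segv);
	sigaction(SIGBUS, &sa, &old_bus);

	if (sigsetjmp(fault_env, 1) == 0) {
		((volatile char *)name)[0] = 'X'; /* the stray write */
		faulted = 0;
	} else {
		faulted = 1;
	}

	sigaction(SIGSEGV, &old_segv, NULL);
	sigaction(SIGBUS, &old_bus, NULL);
	munmap(name, (size_t)page);
	return faulted;
}
```

As with the kernel patch, this only catches writes made by the CPU through that mapping. It would not catch a reader following a stale pointer to some other buffer.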

Jesse Barnes

9 Apr

5:11 p.m.

On Fri, 30 Mar 2012 11:45:43 +0100 Chris Wilson chris@chris-wilson.co.uk wrote:

...

On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

I've actually seen cases where one of the PIPE*STAT regs is stuck, and even if IIR is 0 we still get interrupts... Jiri can you verify the PIPE*STAT regs have bits set, maybe one or more we don't check for?

-- Jesse Barnes, Intel Open Source Technology Center

Jiri Slaby

10 Apr

8:47 a.m.

On 04/09/2012 07:11 PM, Jesse Barnes wrote:

...

On Fri, 30 Mar 2012 11:45:43 +0100 Chris Wilson chris@chris-wilson.co.uk wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

I've actually seen cases where one of the PIPE*STAT regs is stuck, and even if IIR is 0 we still get interrupts... Jiri can you verify the PIPE*STAT regs have bits set, maybe one or more we don't check for?

Note that I already attached their contents... This is what is in them (pipes 0 and 1):
[ 3572.968581] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3572.977472] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.224839] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.243558] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.384912] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.403462] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.464100] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.477383] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.829016] i915_driver_irq_handler: 0=00020000 1=00000000
[ 3577.830093] i915_driver_irq_handler: 0=00020000 1=00000000

I.e. the handler is called when IIR=0 and both pipe stats are 0.

The stats are dumped this way:
@@ -1416,6 +1417,14 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
 		iir = new_iir;
 	}

+	if (ret == IRQ_NONE && printk_ratelimit()) {
+		printk(KERN_DEBUG "%s:", __func__);
+		for_each_pipe(pipe) {
+			printk(KERN_CONT " %d=%.8x", pipe,
+					pipe_stats[pipe]);
+		}
+	}
+
 	return ret;
 }

thanks,

-- js suse labs

Daniel Vetter

8:58 a.m.

On Tue, Apr 10, 2012 at 10:47:49AM +0200, Jiri Slaby wrote:

...

On 04/09/2012 07:11 PM, Jesse Barnes wrote:

...
On Fri, 30 Mar 2012 11:45:43 +0100 Chris Wilson chris@chris-wilson.co.uk wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

I've actually seen cases where one of the PIPE*STAT regs is stuck, and even if IIR is 0 we still get interrupts... Jiri can you verify the PIPE*STAT regs have bits set, maybe one or more we don't check for?

Note that I already attached their contents... This is what is in them (pipes 0 and 1): [ 3572.968581] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3572.977472] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.224839] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.243558] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.384912] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.403462] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3577.464100] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3577.477383] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3577.829016] i915_driver_irq_handler: 0=00020000 1=00000000 [ 3577.830093] i915_driver_irq_handler: 0=00020000 1=00000000

I.e. the handler is called when IIR=0 and both pipe stats are 0.

Hm, can you also dump the PORT_HOTPLUG_STAT register? That's the only other subordinate interrupt source left. -Daniel

-- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48

Jiri Slaby

9:48 a.m.

On 04/10/2012 10:58 AM, Daniel Vetter wrote:

...

On Tue, Apr 10, 2012 at 10:47:49AM +0200, Jiri Slaby wrote:

...
On 04/09/2012 07:11 PM, Jesse Barnes wrote:

...
On Fri, 30 Mar 2012 11:45:43 +0100 Chris Wilson chris@chris-wilson.co.uk wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

I've actually seen cases where one of the PIPE*STAT regs is stuck, and even if IIR is 0 we still get interrupts... Jiri can you verify the PIPE*STAT regs have bits set, maybe one or more we don't check for?

Note that I already attached their contents... This is what is in them (pipes 0 and 1): [ 3572.968581] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3572.977472] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.224839] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.243558] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.384912] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3576.403462] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3577.464100] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3577.477383] i915_driver_irq_handler: 0=00000000 1=00000000 [ 3577.829016] i915_driver_irq_handler: 0=00020000 1=00000000 [ 3577.830093] i915_driver_irq_handler: 0=00020000 1=00000000

I.e. the handler is called when IIR=0 and both pipe stats are 0.

Hm, can you also dump the PORT_HOTPLUG_STAT register? That's the only other subordinate interrupt source left.

It's always 0x300:
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000
i915_driver_irq_handler: HP=00000300 0=00000000 1=00000000

thanks,

-- js suse labs

Jesse Barnes

4:26 p.m.

On Tue, 10 Apr 2012 10:47:49 +0200 Jiri Slaby jslaby@suse.cz wrote:

...

On 04/09/2012 07:11 PM, Jesse Barnes wrote:

...
On Fri, 30 Mar 2012 11:45:43 +0100 Chris Wilson chris@chris-wilson.co.uk wrote:

...
On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby jslaby@suse.cz wrote:

...
I don't know what to dump more, because iir is obviously zero too. What other sources of interrupts are on the (G33) chip?

IIR is the master interrupt, with chained secondary interrupt statuses. If IIR is 0, the interrupt wasn't raised by the GPU.

I've actually seen cases where one of the PIPE*STAT regs is stuck, and even if IIR is 0 we still get interrupts... Jiri can you verify the PIPE*STAT regs have bits set, maybe one or more we don't check for?

Note that I already attached their contents... This is what is in them (pipes 0 and 1):
[ 3572.968581] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3572.977472] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.224839] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.243558] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.384912] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3576.403462] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.464100] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.477383] i915_driver_irq_handler: 0=00000000 1=00000000
[ 3577.829016] i915_driver_irq_handler: 0=00020000 1=00000000
[ 3577.830093] i915_driver_irq_handler: 0=00020000 1=00000000

I.e. the handler is called when IIR=0 and both pipe stats are 0.

Oh sorry missed the PIPE*STAT, I thought it was IMR or something, I should have read more closely.

So port hotplug is always reporting that port C has a hotplug interrupt though... If you write 0x3 back to it does the interrupt stop?

-- Jesse Barnes, Intel Open Source Technology Center

Jiri Slaby

6:11 p.m.

On 04/10/2012 06:26 PM, Jesse Barnes wrote:

...

So port hotplug is always reporting that port C has a hotplug interrupt though... If you write 0x3 back to it does the interrupt stop?

I'm not sure I got it right. This doesn't help:

--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1416,6 +1416,17 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
                 iir = new_iir;
         }
 
+        if (ret == IRQ_NONE) {
+                u32 hp = I915_READ(PORT_HOTPLUG_STAT);
+                if (hp) {
+                        I915_WRITE(PORT_HOTPLUG_STAT, hp);
+                        I915_READ(PORT_HOTPLUG_STAT);
+                }
+
+                if (printk_ratelimit())
+                        printk(KERN_DEBUG "%s: %.8x\n", __func__, hp);
+
+        }
 
         return ret;
 }

thanks,

-- js suse labs

Jesse Barnes

6:34 p.m.

On Tue, 10 Apr 2012 20:11:29 +0200 Jiri Slaby jslaby@suse.cz wrote:

...

On 04/10/2012 06:26 PM, Jesse Barnes wrote:

...
So port hotplug is always reporting that port C has a hotplug interrupt though... If you write 0x3 back to it does the interrupt stop?

I'm not sure I got it right. This doesn't help:
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1416,6 +1416,17 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
          iir = new_iir;
  }
  if (ret == IRQ_NONE) {
          u32 hp = I915_READ(PORT_HOTPLUG_STAT);
          if (hp) {
                  I915_WRITE(PORT_HOTPLUG_STAT, hp);
                  I915_READ(PORT_HOTPLUG_STAT);
          }
          if (printk_ratelimit())
                  printk(KERN_DEBUG "%s: %.8x\n", __func__, hp);
  }

  return ret;
}

Yeah that looks right, you still get 0x300?

You could try masking hotplug interrupts altogether.

Also, just to sanity check things, can you look at the output of "lspci -s 02.0 -vvv -xxx" and see if the "INTx" field is + or -? If it's +, then the interrupt is definitely coming from an un-acked IRQ source on the gfx device. If it's INTx-, it means something in one of the upper MSI layers isn't getting handled right.

-- Jesse Barnes, Intel Open Source Technology Center
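
For reference, the "INTx" flag that lspci prints is bit 3 (Interrupt Status) of the PCI Status register at config offset 0x06, and "DisINTx" is bit 10 (Interrupt Disable) of the Command register at offset 0x04. Below is a minimal userspace sketch of reading both out of the raw bytes of an "lspci -xxx" dump. The helper names (cfg_read16, intx_pending, intx_disabled) and the macro names are made up for illustration and are not part of any driver or of pciutils.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_OFF       0x04
#define PCI_STATUS_OFF        0x06
#define PCI_CMD_INTX_DISABLE  0x0400 /* Command bit 10, shown as "DisINTx" */
#define PCI_STS_INTERRUPT     0x0008 /* Status bit 3, shown as "INTx" */

/* Read a little-endian 16-bit word from a raw config-space dump. */
static uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
	assert(cfg != NULL);
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* 1 if the device reports a pending legacy interrupt ("INTx+"). */
static int intx_pending(const uint8_t *cfg)
{
	return (cfg_read16(cfg, PCI_STATUS_OFF) & PCI_STS_INTERRUPT) != 0;
}

/* 1 if legacy INTx assertion is disabled ("DisINTx+"), as is usual with MSI. */
static int intx_disabled(const uint8_t *cfg)
{
	return (cfg_read16(cfg, PCI_COMMAND_OFF) & PCI_CMD_INTX_DISABLE) != 0;
}
```

Fed the first eight bytes of the dump from the original report (86 80 c2 29 07 04 90 00), this gives Command=0x0407 and Status=0x0090, i.e. DisINTx+ and INTx-, which agrees with the decoded lspci text.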

Jiri Slaby

7:52 p.m.

On 04/10/2012 08:34 PM, Jesse Barnes wrote:

...

On Tue, 10 Apr 2012 20:11:29 +0200 Jiri Slaby jslaby@suse.cz wrote:

...
On 04/10/2012 06:26 PM, Jesse Barnes wrote:

...
So port hotplug is always reporting that port C has a hotplug interrupt though... If you write 0x3 back to it does the interrupt stop?

I'm not sure I got it right. This doesn't help:
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1416,6 +1416,17 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
                 iir = new_iir;
         }
 
+        if (ret == IRQ_NONE) {
+                u32 hp = I915_READ(PORT_HOTPLUG_STAT);
+                if (hp) {
+                        I915_WRITE(PORT_HOTPLUG_STAT, hp);
+                        I915_READ(PORT_HOTPLUG_STAT);
+                }
+
+                if (printk_ratelimit())
+                        printk(KERN_DEBUG "%s: %.8x\n", __func__, hp);
+
+        }
 
         return ret;
 }

Yeah that looks right, you still get 0x300?

Yes.

...

You could try masking hotplug interrupts altogether.

This doesn't help:

--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2049,7 +2051,7 @@ static int i915_driver_irq_postinstall(struct drm_device *dev)
         I915_WRITE(IER, enable_mask);
         POSTING_READ(IER);
 
-        if (I915_HAS_HOTPLUG(dev)) {
+        if (0 && I915_HAS_HOTPLUG(dev)) {
                 u32 hotplug_en = I915_READ(PORT_HOTPLUG_EN);
 
                 /* Note HDMI and DP share bits */

...

Also, just to sanity check things, can you look at the output of "lspci -s 02.0 -vvv -xxx" and see if the "INTx" field is + or -? If it's +, then the interrupt is definitely coming from an un-acked IRQ source on the gfx device. If it's INTx-, it means something in one of the upper MSI layers isn't getting handled right.

Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

I tried 3.2 and 3.3. Although the spurious interrupts were always there, they occurred an order of magnitude less often (15 vs. 300 after X starts). So I bisected that and it led to a commit which fixes bad tiling for me: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710...

thanks,

-- js suse labs

Daniel Vetter

8:32 p.m.

On Tue, Apr 10, 2012 at 09:52:40PM +0200, Jiri Slaby wrote:

...

On 04/10/2012 08:34 PM, Jesse Barnes wrote:

...
On Tue, 10 Apr 2012 20:11:29 +0200 Jiri Slaby jslaby@suse.cz wrote:

...
On 04/10/2012 06:26 PM, Jesse Barnes wrote:

...
So port hotplug is always reporting that port C has a hotplug interrupt though... If you write 0x3 back to it does the interrupt stop?

I'm not sure I got it right. This doesn't help:
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1416,6 +1416,17 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
                 iir = new_iir;
         }
 
+        if (ret == IRQ_NONE) {
+                u32 hp = I915_READ(PORT_HOTPLUG_STAT);
+                if (hp) {
+                        I915_WRITE(PORT_HOTPLUG_STAT, hp);
+                        I915_READ(PORT_HOTPLUG_STAT);
+                }
+
+                if (printk_ratelimit())
+                        printk(KERN_DEBUG "%s: %.8x\n", __func__, hp);
+
+        }
 
         return ret;
 }

Yeah that looks right, you still get 0x300?
Yes.

...
You could try masking hotplug interrupts altogether.

This doesn't help:
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2049,7 +2051,7 @@ static int i915_driver_irq_postinstall(struct drm_device *dev)
         I915_WRITE(IER, enable_mask);
         POSTING_READ(IER);
 
-        if (I915_HAS_HOTPLUG(dev)) {
+        if (0 && I915_HAS_HOTPLUG(dev)) {
                 u32 hotplug_en = I915_READ(PORT_HOTPLUG_EN);
 
                 /* Note HDMI and DP share bits */
...
Also, just to sanity check things, can you look at the output of "lspci -s 02.0 -vvv -xxx" and see if the "INTx" field is + or -? If it's +, then the interrupt is definitely coming from an un-acked IRQ source on the gfx device. If it's INTx-, it means something in one of the upper MSI layers isn't getting handled right.

Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

I tried 3.2 and 3.3. Although the spurious interrupts were always there, they occurred an order of magnitude less often (15 vs. 300 after X starts). So I bisected that and it led to a commit which fixes bad tiling for me: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710...

Pipelined fencing is pretty much just broken and we'll completely rip it out in 3.5. Does this also happen with 3.4-rc2? -Daniel

-- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48

Jesse Barnes

8:34 p.m.

On Tue, 10 Apr 2012 22:32:12 +0200 Daniel Vetter daniel@ffwll.ch wrote:

...

On Tue, Apr 10, 2012 at 09:52:40PM +0200, Jiri Slaby wrote:

...
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

I tried 3.2 and 3.3. Although the spurious interrupts were always there, they occurred an order of magnitude less often (15 vs. 300 after X starts). So I bisected that and it led to a commit which fixes bad tiling for me: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710...

Pipelined fencing is pretty much just broken and we'll completely rip it out in 3.5. Does this also happen with 3.4-rc2?

Does INTx- stay that way? Or does it frequently read INTx+ if you sample it a lot? If it stays as INTx-, then something other than the GPU is getting stuck (though it's possible this could be related to pipelined fencing, if the fences are programmed to point at some funky memory space).

-- Jesse Barnes, Intel Open Source Technology Center

Daniel Vetter

11 Apr 2012

10:40 a.m.

On Tue, Apr 10, 2012 at 01:34:11PM -0700, Jesse Barnes wrote:

...

On Tue, 10 Apr 2012 22:32:12 +0200 Daniel Vetter daniel@ffwll.ch wrote:

...
On Tue, Apr 10, 2012 at 09:52:40PM +0200, Jiri Slaby wrote:

...
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

I tried 3.2 and 3.3. Although the spurious interrupts were always there, they occurred an order of magnitude less often (15 vs. 300 after X starts). So I bisected that and it led to a commit which fixes bad tiling for me: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710...

Pipelined fencing is pretty much just broken and we'll completely rip it out in 3.5. Does this also happen with 3.4-rc2?

Does INTx- stay that way? Or does it frequently read INTx+ if you sample it a lot? If it stays as INTx-, then something other than the GPU is getting stuck (though it's possible this could be related to pipelined fencing, if the fences are programmed to point at some funky memory space).

Shot in the dark, let's disable msi a bit. Can you try the below patch?

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 785f67f..249d5fe 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -2071,6 +2071,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
         else if (IS_GEN5(dev))
                 i915_ironlake_get_mem_freq(dev);
 
+#if 0
         /* On the 945G/GM, the chipset reports the MSI capability on the
          * integrated graphics even though the support isn't actually there
          * according to the published specs. It doesn't appear to function
@@ -2084,6 +2085,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
          */
         if (!IS_I945G(dev) && !IS_I945GM(dev))
                 pci_enable_msi(dev->pdev);
+#endif
 
         spin_lock_init(&dev_priv->gt_lock);
         spin_lock_init(&dev_priv->irq_lock);

-- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48

Jiri Slaby

3 May 2012

7:56 p.m.

On 04/11/2012 12:40 PM, Daniel Vetter wrote:

...

On Tue, Apr 10, 2012 at 01:34:11PM -0700, Jesse Barnes wrote:

...
On Tue, 10 Apr 2012 22:32:12 +0200 Daniel Vetter daniel@ffwll.ch wrote:

...
On Tue, Apr 10, 2012 at 09:52:40PM +0200, Jiri Slaby wrote:

...
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

I tried 3.2 and 3.3. Although the spurious interrupts were always there, they occurred an order of magnitude less often (15 vs. 300 after X starts). So I bisected that and it led to a commit which fixes bad tiling for me: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710...

Pipelined fencing is pretty much just broken and we'll completely rip it out in 3.5. Does this also happen with 3.4-rc2?

Does INTx- stay that way? Or does it frequently read INTx+ if you sample it a lot? If it stays as INTx-, then something other than the GPU is getting stuck (though it's possible this could be related to pipelined fencing, if the fences are programmed to point at some funky memory space).

Hi, and sorry for the delay. It stays INTx-. And I tested that with the patch removing fencing.

...

Shot in the dark, let's disable msi a bit. Can you try the below patch?

Yeah, no IRQ_NONE at the end of i915_driver_irq_handler now. So MSI is busted, either in the card, the chipset or the kernel. Any idea how to find out?

thanks,

-- js suse labs

Daniel Vetter

9:15 p.m.

On Thu, May 03, 2012 at 09:56:08PM +0200, Jiri Slaby wrote:

...

On 04/11/2012 12:40 PM, Daniel Vetter wrote:

...
On Tue, Apr 10, 2012 at 01:34:11PM -0700, Jesse Barnes wrote:

...
On Tue, 10 Apr 2012 22:32:12 +0200 Daniel Vetter daniel@ffwll.ch wrote:

...
On Tue, Apr 10, 2012 at 09:52:40PM +0200, Jiri Slaby wrote:

...
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

I tried 3.2 and 3.3. Although the spurious interrupts were always there, they occurred an order of magnitude less often (15 vs. 300 after X starts). So I bisected that and it led to a commit which fixes bad tiling for me: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710...

Pipelined fencing is pretty much just broken and we'll completely rip it out in 3.5. Does this also happen with 3.4-rc2?

Does INTx- stay that way? Or does it frequently read INTx+ if you sample it a lot? If it stays as INTx-, then something other than the GPU is getting stuck (though it's possible this could be related to pipelined fencing, if the fences are programmed to point at some funky memory space).

Hi, and sorry for the delay. It stays INTx-. And I tested that with the patch removing fencing.

...
Shot in the dark, let's disable msi a bit. Can you try the below patch?

Yeah, no IRQ_NONE at the end of i915_driver_irq_handler now. So MSI is busted, either in the card, the chipset or the kernel. Any idea how to find out?

Ok, so MSI is busted. Can you please paste lspci -nn for your Intel GPU? -Daniel

-- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48

Jiri Slaby

9:16 p.m.

On 05/03/2012 11:15 PM, Daniel Vetter wrote:

...

...
...
Shot in the dark, let's disable msi a bit. Can you try the below patch?

Yeah, no IRQ_NONE at the end of i915_driver_irq_handler now. So MSI is busted, either in the card, the chipset or the kernel. Any idea how to find out?

Ok, so MSI is busted. Can you please paste lspci -nn for your Intel GPU?

Sure: 00:02.0 VGA compatible controller [0300]: Intel Corporation 82G33/G31 Express Integrated Graphics Controller [8086:29c2] (rev 02)

thanks,

-- js suse labs

Jesse Barnes

9:54 p.m.

On Thu, 03 May 2012 23:16:02 +0200 Jiri Slaby jslaby@suse.cz wrote:

...

On 05/03/2012 11:15 PM, Daniel Vetter wrote:

...
...
...
Shot in the dark, let's disable msi a bit. Can you try the below patch?

Yeah, no IRQ_NONE at the end of i915_driver_irq_handler now. So MSI is busted, either in the card, the chipset or the kernel. Any idea how to find out?

Ok, so MSI is busted. Can you please paste lspci -nn for your Intel GPU?

Sure: 00:02.0 VGA compatible controller [0300]: Intel Corporation 82G33/G31 Express Integrated Graphics Controller [8086:29c2] (rev 02)

Ok, never mind about the INTx-; now I'm not sure if it means anything or not in an MSI context (the spec doesn't require it, but I thought our devices would toggle it if they were sending interrupts).

But since line level works for you I guess it's ok to blacklist your chipset until we poke some hw folks internally about this.

Thanks,

-- Jesse Barnes, Intel Open Source Technology Center

Ben Widawsky

11:15 p.m.

On Thu, 3 May 2012 14:54:22 -0700 Jesse Barnes jbarnes@virtuousgeek.org wrote:

...

On Thu, 03 May 2012 23:16:02 +0200 Jiri Slaby jslaby@suse.cz wrote:

...
On 05/03/2012 11:15 PM, Daniel Vetter wrote:

...
...
...
Shot in the dark, let's disable msi a bit. Can you try the below patch?

Yeah, no IRQ_NONE at the end of i915_driver_irq_handler now. So MSI is busted, either in the card, the chipset or the kernel. Any idea how to find out?

Ok, so MSI is busted. Can you please paste lspci -nn for your Intel GPU?

Sure: 00:02.0 VGA compatible controller [0300]: Intel Corporation 82G33/G31 Express Integrated Graphics Controller [8086:29c2] (rev 02)

Ok, never mind about the INTx-; now I'm not sure if it means anything or not in an MSI context (the spec doesn't require it, but I thought our devices would toggle it if they were sending interrupts).

But since line level works for you I guess it's ok to blacklist your chipset until we poke some hw folks internally about this.

Thanks,

I occasionally see a missed IRQ on 16 (which is my USB), but it has only started showing up in fairly recent dinq kernels (I haven't tried Linus'); I'd been using this laptop for over a year.

-- Ben Widawsky, Intel Open Source Technology Center

Michel Dänzer

11 Apr 2012

6:29 a.m.

On Tue, 2012-04-10 at 11:34 -0700, Jesse Barnes wrote:

...

On Tue, 10 Apr 2012 20:11:29 +0200 Jiri Slaby jslaby@suse.cz wrote:

...
On 04/10/2012 06:26 PM, Jesse Barnes wrote:

...
So port hotplug is always reporting that port C has a hotplug interrupt though... If you write 0x3 back to it does the interrupt stop?

I'm not sure I got it right. This doesn't help:
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1416,6 +1416,17 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
          iir = new_iir;
  }
  if (ret == IRQ_NONE) {
          u32 hp = I915_READ(PORT_HOTPLUG_STAT);
          if (hp) {
                  I915_WRITE(PORT_HOTPLUG_STAT, hp);
                  I915_READ(PORT_HOTPLUG_STAT);
          }
          if (printk_ratelimit())
                  printk(KERN_DEBUG "%s: %.8x\n", __func__, hp);
  }

  return ret;
}
Yeah that looks right, you still get 0x300?

You said 'If you write 0x3 back' above, but this code writes 0x300. Which is right?

-- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer

Jesse Barnes

4:03 p.m.

On Wed, 11 Apr 2012 08:29:22 +0200 Michel Dänzer michel@daenzer.net wrote:

...

On Tue, 2012-04-10 at 11:34 -0700, Jesse Barnes wrote:

...
On Tue, 10 Apr 2012 20:11:29 +0200 Jiri Slaby jslaby@suse.cz wrote:

...
On 04/10/2012 06:26 PM, Jesse Barnes wrote:

...
So port hotplug is always reporting that port C has a hotplug interrupt though... If you write 0x3 back to it does the interrupt stop?

I'm not sure I got it right. This doesn't help:
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1416,6 +1416,17 @@ static irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
          iir = new_iir;
  }
  if (ret == IRQ_NONE) {
          u32 hp = I915_READ(PORT_HOTPLUG_STAT);
          if (hp) {
                  I915_WRITE(PORT_HOTPLUG_STAT, hp);
                  I915_READ(PORT_HOTPLUG_STAT);
          }
          if (printk_ratelimit())
                  printk(KERN_DEBUG "%s: %.8x\n", __func__, hp);
  }

  return ret;
}
Yeah that looks right, you still get 0x300?
You said 'If you write 0x3 back' above, but this code writes 0x300. Which is right?

0x300 is right, the bits are status bits with write 1 to clear semantics. But it looks like this one is just stuck high (probably because port C isn't actually wired up fully).

-- Jesse Barnes, Intel Open Source Technology Center
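
The write-1-to-clear semantics described above can be modelled in a few lines of userspace C. This is only a sketch, not driver code: the w1c_reg type and its "stuck" field are hypothetical, added to show why writing 0x3 leaves a value of 0x300 untouched, why writing 0x300 normally clears it, and why a bit that the hardware keeps re-asserting reads back as set even after the correct write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a write-1-to-clear (W1C) status register. */
struct w1c_reg {
	uint32_t value; /* currently latched status bits */
	uint32_t stuck; /* bits the modelled hardware re-asserts immediately */
};

/* Writing a 1 clears that status bit; writing a 0 leaves it alone. */
static void w1c_write(struct w1c_reg *reg, uint32_t val)
{
	assert(reg != NULL);
	reg->value &= ~val;
	reg->value |= reg->stuck; /* stuck sources latch again right away */
}

static uint32_t w1c_read(const struct w1c_reg *reg)
{
	assert(reg != NULL);
	return reg->value;
}
```

With value=0x300 and no stuck bits, w1c_write(&r, 0x3) leaves 0x300 in place and w1c_write(&r, 0x300) brings it to 0. With stuck=0x300 the register still reads 0x300 after the same write, which is the behaviour reported in this thread.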

Daniel Vetter

27 Mar 2012

8:57 a.m.

On Tue, Mar 27, 2012 at 10:40:03AM +0200, Jiri Slaby wrote:

...

Hi,

I'm getting spurious interrupts leading to disabling the interrupt: 42: 1916853 2471662 PCI-MSI-edge i915@pci:0000:00:02.0

The message: irq 42: nobody cared (try booting with the "irqpoll" option) Pid: 20716, comm: virtuoso-t Not tainted 3.3.0-next-20120326_64+ #1673

It is not new, but now I can reproduce it more-or-less reliably after an hour or so. It usually happens when playing a game using wine.

Do you want me to dump some registers when IRQ_NONE is returned from the ISR? As this is MSI, nobody else can sit there.

Yeah, dumping some interrupt regs if the isr returns IRQ_NONE sounds good. For a start just dump everything that the isr itself reads out or writes back - I think we can ignore subordinate interrupt sources in i915 for now.

And please mind the guy with bad memory and tell us which chip you have again?

Yours, Daniel

-- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48
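
As background for the "nobody cared" message: the kernel keeps a per-IRQ count of invocations for which every handler returned IRQ_NONE, and disables the line when almost all interrupts in a window went unhandled. The following is a simplified userspace model of that accounting. The window of 100000 interrupts and the limit of 99900 unhandled are quoted from memory of kernel/irq/spurious.c and should be treated as an assumption; the struct and function names are invented for this sketch, and the real code has additional conditions (such as a time check between unhandled interrupts) that are left out here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed thresholds, see the note above. */
enum { IRQ_WINDOW = 100000, IRQ_UNHANDLED_LIMIT = 99900 };

/* Hypothetical per-IRQ bookkeeping for this model. */
struct irq_stats {
	unsigned int count;     /* interrupts seen in the current window */
	unsigned int unhandled; /* of those, how many returned IRQ_NONE */
	int disabled;           /* set once the line is declared stuck */
};

/* Account for one interrupt; handled != 0 means some handler claimed it. */
static void note_irq(struct irq_stats *s, int handled)
{
	assert(s != NULL);
	if (s->disabled)
		return;
	if (!handled)
		s->unhandled++;
	if (++s->count < IRQ_WINDOW)
		return;

	/* End of window: decide, then start counting afresh. */
	s->count = 0;
	if (s->unhandled > IRQ_UNHANDLED_LIMIT)
		s->disabled = 1; /* this is where "nobody cared" would be printed */
	s->unhandled = 0;
}
```

Under this model an occasional IRQ_NONE is harmless; only a line where nearly every interrupt in a window goes unclaimed is shut off.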

Jiri Slaby

10:54 a.m.

On 03/27/2012 10:57 AM, Daniel Vetter wrote:

...

And please mind the guy with bad memory and tell us which chip you have again?

Where's that? In xorg.log: https://bugs.freedesktop.org/attachment.cgi?id=58771 ?

(II) intel(0): Integrated Graphics Chipset: Intel(R) G33 (--) intel(0): Chipset: "G33"

thanks,

-- js suse labs

dri-devel@lists.freedesktop.org

participants (9)

Ben Widawsky
Chris Wilson
Daniel Vetter
Dave Airlie
Jesse Barnes
Jiri Slaby
Marcin Slusarz
Michel Dänzer
Thomas Gleixner