Hi all-
Somewhere between 2.6.34-fedora-whatever and 2.6.36, Nouveau became extremely broken on my hardware. It appears to be triggered by a bug in my monitor (HP LP2475w), which causes the monitor to disappear from DVI when it goes to sleep. Every time the console blanks (in X or otherwise AFAICT) the system crashes oddly but unrecoverably. This is 100% reproducible by Ctrl-Alt-F2 followed by 'echo 1 > /sys/class/graphics/fb0/blank' *from SSH* and waiting a few seconds for the monitor to go to sleep, but it also happens if I just walk away from the computer long enough for it to blank itself. This is present on F14's kernel and on 2.6.36 from kernel.org. This may or may not be related to the unreproducible crashes that I used to get rarely on 2.6.34.
The symptoms are:
- netconsole becomes very unreliable. (This makes it rather hard to get any good debugging info because I don't have a real serial port.)
- system doesn't answer pings. userspace seems dead as well.
- capslock will work intermittently
- the lockup detector doesn't say anything.
- After a few seconds, the system thinks that the tsc is massively unstable and switches clocksources. (I think this is because the clocksource watchdog fails to schedule for awhile and then somehow ends up running and thinking it detected a clocksource failure.)
- SysRq-c will give me my console back and spew (useless?) garbage. Usually it also causes a panic and I get nothing else out of the system.
The most recent time I triggered this, I got an amazing amount of console spew about unexpected NMIs. None of it made it to serial console, and the part left on the screen was so far down as to be pretty much useless. lockdep shows nothing interesting (or at least nothing interesting that stays on the screen long enough for me to read).
The best hint I have is from this patch (sorry for whitespace damage):
diff --git a/drivers/gpu/drm/nouveau/nv50_display.c b/drivers/gpu/drm/nouveau/nv50_display.c
index 612fa6d..6823a4d 100644
--- a/drivers/gpu/drm/nouveau/nv50_display.c
+++ b/drivers/gpu/drm/nouveau/nv50_display.c
@@ -1014,6 +1014,8 @@ nv50_display_irq_hotplug_bh(struct work_struct *work)
 	uint32_t unplug_mask, plug_mask, change_mask;
 	uint32_t hpd0, hpd1 = 0;
 
+	printk(KERN_ERR "in nv50_display_irq_hotplug_bh\n");
+
 	hpd0 = nv_rd32(dev, 0xe054) & nv_rd32(dev, 0xe050);
 	if (dev_priv->chipset >= 0x90)
 		hpd1 = nv_rd32(dev, 0xe074) & nv_rd32(dev, 0xe070);
@@ -1062,6 +1064,7 @@ nv50_display_irq_hotplug_bh(struct work_struct *work)
 	if (dev_priv->chipset >= 0x90)
 		nv_wr32(dev, 0xe074, nv_rd32(dev, 0xe074));
 
+	printk(KERN_ERR "about to drm_helper_hpd_irq_event\n");
 	drm_helper_hpd_irq_event(dev);
 }
 
@@ -1072,6 +1075,7 @@ nv50_display_irq_handler(struct drm_device *dev)
 	uint32_t delayed = 0;
 
 	if (nv_rd32(dev, NV50_PMC_INTR_0) & NV50_PMC_INTR_0_HOTPLUG) {
+		printk(KERN_ERR "nv50 got hpd irq\n");
 		if (!work_pending(&dev_priv->hpd_work))
 			queue_work(dev_priv->wq, &dev_priv->hpd_work);
 	}
which spews "nv50 got hpd irq" once the display blanks.
Nouveau startup says:
[ 15.646535] nouveau 0000:04:00.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
[ 15.646540] nouveau 0000:04:00.0: setting latency timer to 64
[ 15.650606] [drm] nouveau 0000:04:00.0: Detected an NV50 generation card (0x086f00a2)
[ 15.657126] [drm] nouveau 0000:04:00.0: Attempting to load BIOS image from PRAMIN
[ 15.714410] [drm] nouveau 0000:04:00.0: ... appears to be valid
[ 15.714413] [drm] nouveau 0000:04:00.0: BIT BIOS found
[ 15.714415] [drm] nouveau 0000:04:00.0: Bios version 60.86.5b.00
[ 15.714418] [drm] nouveau 0000:04:00.0: TMDS table version 2.0
[ 15.714420] [drm] nouveau 0000:04:00.0: Found Display Configuration Block version 4.0
[ 15.714423] [drm] nouveau 0000:04:00.0: Raw DCB entry 0: 02011300 00000028
[ 15.714425] [drm] nouveau 0000:04:00.0: Raw DCB entry 1: 01011302 00000010
[ 15.714427] [drm] nouveau 0000:04:00.0: Raw DCB entry 2: 01000310 00000028
[ 15.714429] [drm] nouveau 0000:04:00.0: Raw DCB entry 3: 02000312 00000010
[ 15.714430] [drm] nouveau 0000:04:00.0: Raw DCB entry 4: 0000000e 00000000
[ 15.714433] [drm] nouveau 0000:04:00.0: DCB connector table: VHER 0x40 5 14 2
[ 15.714435] [drm] nouveau 0000:04:00.0: 0: 0x00002030: type 0x30 idx 0 tag 0x08
[ 15.714438] [drm] nouveau 0000:04:00.0: 1: 0x00001130: type 0x30 idx 1 tag 0x07
[ 15.714441] [drm] nouveau 0000:04:00.0: Parsing VBIOS init table 0 at offset 0xC34B
[ 15.740011] [drm] nouveau 0000:04:00.0: Parsing VBIOS init table 1 at offset 0xC6B5
[ 15.758892] [drm] nouveau 0000:04:00.0: Parsing VBIOS init table 2 at offset 0xD2F6
[ 15.758903] [drm] nouveau 0000:04:00.0: Parsing VBIOS init table 3 at offset 0xD3E8
[ 15.760960] [drm] nouveau 0000:04:00.0: Parsing VBIOS init table 4 at offset 0xD5E2
[ 15.760965] [drm] nouveau 0000:04:00.0: Parsing VBIOS init table at offset 0xD647
[ 15.781884] [drm] nouveau 0000:04:00.0: 0xD647: Condition still not met after 20ms, skipping following opcodes
[ 15.781953] [drm] nouveau 0000:04:00.0: Detected 256MiB VRAM
[ 15.873252] [TTM] Zone kernel: Available graphics memory: 3055420 kiB.
[ 15.873256] [TTM] Zone dma32: Available graphics memory: 2097152 kiB.
[ 15.873259] [TTM] Initializing pool allocator.
[ 15.948218] [drm] nouveau 0000:04:00.0: 512 MiB GART (aperture)
[ 15.983208] [drm] nouveau 0000:04:00.0: Allocating FIFO number 1
[ 15.998872] [drm] nouveau 0000:04:00.0: nouveau_channel_alloc: initialised FIFO 1
[ 16.158101] [drm] nouveau 0000:04:00.0: allocated 1920x1200 fb: 0x40230000, bo ffff8801b48a5000
[ 16.158315] fbcon: nouveaufb (fb0) is primary device
[ 16.165464] Console: switching to colour frame buffer device 240x75
[ 16.168574] fb0: nouveaufb frame buffer device
[ 16.168576] drm: registered panic notifier
[ 16.168601] [drm] Initialized nouveau 0.0.16 20090420 for 0000:04:00.0 on minor 0
On Wed, Nov 10, 2010 at 2:28 PM, Andrew Lutomirski andy@luto.us wrote:
> Hi all-
>
> Somewhere between 2.6.34-fedora-whatever and 2.6.36, Nouveau became extremely broken on my hardware. It appears to be triggered by a bug in my monitor (HP LP2475w), which causes the monitor to disappear from DVI when it goes to sleep. Every time the console blanks (in X or otherwise AFAICT) the system crashes oddly but unrecoverably. This is 100% reproducible by Ctrl-Alt-F2 followed by 'echo 1 > /sys/class/graphics/fb0/blank' *from SSH* and waiting a few seconds for the monitor to go to sleep, but it also happens if I just walk away from the computer long enough for it to blank itself. This is present on F14's kernel and on 2.6.36 from kernel.org. This may or may not be related to the unreproducible crashes that I used to get rarely on 2.6.34.
>
> The best hint I have is from this patch (sorry for whitespace damage):
>
> which spews "nv50 got hpd irq" once the display blanks.
I tracked it down. The interrupt code in 2.6.36 is totally broken --- it acknowledges the interrupt *in the bottom half*. This might work by accident if the bottom half gets queued on a different CPU, but something probably changed (concurrency-managed workqueues?) that makes the BH end up on the same CPU. So the interrupt handler keeps re-firing, the BH never gets to run, and there goes a CPU.
Then the clocksource watchdog hits and takes the whole system down when it calls stop_machine, which also gets starved on that cpu.
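The failure mode described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not driver code: `struct fake_dev`, `handler_no_ack`, `handler_ack` and `storm_length` are hypothetical names, and the device is assumed to hold its interrupt line asserted for as long as any status bit is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device whose IRQ line stays asserted
 * while any bit in its status register is set. */
struct fake_dev {
	uint32_t status;    /* pending event bits */
	uint32_t latched;   /* bits saved by the handler for the bottom half */
	bool     bh_queued; /* bottom half has been scheduled */
};

static bool irq_asserted(const struct fake_dev *d)
{
	return d->status != 0;
}

/* Old behaviour: queue the bottom half but leave the status bits set. */
static void handler_no_ack(struct fake_dev *d)
{
	d->bh_queued = true;
}

/* Fixed behaviour: latch and clear the status bits before returning. */
static void handler_ack(struct fake_dev *d)
{
	d->latched |= d->status;
	d->status = 0;
	d->bh_queued = true;
}

/* Count how many times the handler is re-entered before the line drops,
 * giving up after `limit` iterations. On one CPU, the bottom half cannot
 * run while this loop is still spinning. */
static unsigned storm_length(struct fake_dev *d,
			     void (*handler)(struct fake_dev *),
			     unsigned limit)
{
	unsigned n = 0;

	while (irq_asserted(d) && n < limit) {
		handler(d);
		n++;
	}
	return n;
}
```

With `handler_no_ack` the loop only ends because of the cap, which is the model's version of the starved CPU; with `handler_ack` it ends after a single pass and the bottom half can pick up the latched bits.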
Patch coming.
--Andy
Nouveau takes down my system quite reliably when any hotplug event occurs. The bug happens because the IRQ handler didn't acknowledge the hotplug state until the bottom half, so the card generated a new interrupt immediately, permanently monopolizing that CPU and starving the bottom half that would have acknowledged it.
Even with this fix, a lot of the IRQ code looks rather broken.
This is tested on 2.6.36 (and makes the system stable for me), but it also applies cleanly to 2.6.37 (untested, but surely also necessary). Fedora 14's 2.6.35 kernels seem to have the same problem for me, so I suspect that 2.6.35 needs this fix as well. (All of my tests are on an NV50 card.)
Andy Lutomirski (2):
  Use existing defines for NV50 hotplug registers
  nouveau: Acknowledge HPD irq in handler, not bottom half

 drivers/gpu/drm/nouveau/nouveau_drv.h  |    5 +++++
 drivers/gpu/drm/nouveau/nouveau_irq.c  |    1 +
 drivers/gpu/drm/nouveau/nv50_display.c |   21 +++++++++++++++------
 3 files changed, 21 insertions(+), 6 deletions(-)
This doesn't change code at all, but it makes it a lot easier to understand.
Signed-off-by: Andy Lutomirski luto@mit.edu
Cc: stable@kernel.org
---
 drivers/gpu/drm/nouveau/nv50_display.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nv50_display.c b/drivers/gpu/drm/nouveau/nv50_display.c
index 612fa6d..83a7d27 100644
--- a/drivers/gpu/drm/nouveau/nv50_display.c
+++ b/drivers/gpu/drm/nouveau/nv50_display.c
@@ -453,8 +453,8 @@ static int nv50_display_disable(struct drm_device *dev)
 	nv_wr32(dev, NV50_PDISPLAY_INTR_EN, 0x00000000);
 
 	/* disable hotplug interrupts */
-	nv_wr32(dev, 0xe054, 0xffffffff);
-	nv_wr32(dev, 0xe050, 0x00000000);
+	nv_wr32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL, 0xffffffff);
+	nv_wr32(dev, NV50_PCONNECTOR_HOTPLUG_INTR, 0x00000000);
 	if (dev_priv->chipset >= 0x90) {
 		nv_wr32(dev, 0xe074, 0xffffffff);
 		nv_wr32(dev, 0xe070, 0x00000000);
@@ -1014,7 +1014,7 @@ nv50_display_irq_hotplug_bh(struct work_struct *work)
 	uint32_t unplug_mask, plug_mask, change_mask;
 	uint32_t hpd0, hpd1 = 0;
 
-	hpd0 = nv_rd32(dev, 0xe054) & nv_rd32(dev, 0xe050);
+	hpd0 = nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL) & nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_INTR);
 	if (dev_priv->chipset >= 0x90)
 		hpd1 = nv_rd32(dev, 0xe074) & nv_rd32(dev, 0xe070);
 
@@ -1058,7 +1058,7 @@ nv50_display_irq_hotplug_bh(struct work_struct *work)
 			helper->dpms(connector->encoder, DRM_MODE_DPMS_OFF);
 	}
 
-	nv_wr32(dev, 0xe054, nv_rd32(dev, 0xe054));
+	nv_wr32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL, nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL));
 	if (dev_priv->chipset >= 0x90)
 		nv_wr32(dev, 0xe074, nv_rd32(dev, 0xe074));
The old code generated an interrupt storm bad enough to completely take down my system.
This only fixes the bits that are defined in nouveau_regs.h. Newer hardware uses another register that isn't described there, and I don't have that hardware to test.
Signed-off-by: Andy Lutomirski luto@mit.edu
Cc: stable@kernel.org
---
 drivers/gpu/drm/nouveau/nouveau_drv.h  |    5 +++++
 drivers/gpu/drm/nouveau/nouveau_irq.c  |    1 +
 drivers/gpu/drm/nouveau/nv50_display.c |   17 +++++++++++++----
 3 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.h b/drivers/gpu/drm/nouveau/nouveau_drv.h
index b1be617..b6c62cc 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.h
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.h
@@ -531,6 +531,11 @@ struct drm_nouveau_private {
 	struct work_struct irq_work;
 	struct work_struct hpd_work;
 
+	struct {
+		spinlock_t lock;
+		uint32_t hpd0_bits;
+	} hpd_state;
+
 	struct list_head vbl_waiting;
 
 	struct {
diff --git a/drivers/gpu/drm/nouveau/nouveau_irq.c b/drivers/gpu/drm/nouveau/nouveau_irq.c
index 794b0ee..b62a601 100644
--- a/drivers/gpu/drm/nouveau/nouveau_irq.c
+++ b/drivers/gpu/drm/nouveau/nouveau_irq.c
@@ -52,6 +52,7 @@ nouveau_irq_preinstall(struct drm_device *dev)
 	if (dev_priv->card_type >= NV_50) {
 		INIT_WORK(&dev_priv->irq_work, nv50_display_irq_handler_bh);
 		INIT_WORK(&dev_priv->hpd_work, nv50_display_irq_hotplug_bh);
+		spin_lock_init(&dev_priv->hpd_state.lock);
 		INIT_LIST_HEAD(&dev_priv->vbl_waiting);
 	}
 }
diff --git a/drivers/gpu/drm/nouveau/nv50_display.c b/drivers/gpu/drm/nouveau/nv50_display.c
index 83a7d27..0df08e3 100644
--- a/drivers/gpu/drm/nouveau/nv50_display.c
+++ b/drivers/gpu/drm/nouveau/nv50_display.c
@@ -1014,7 +1014,12 @@ nv50_display_irq_hotplug_bh(struct work_struct *work)
 	uint32_t unplug_mask, plug_mask, change_mask;
 	uint32_t hpd0, hpd1 = 0;
 
-	hpd0 = nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL) & nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_INTR);
+	spin_lock_irq(&dev_priv->hpd_state.lock);
+	hpd0 = dev_priv->hpd_state.hpd0_bits;
+	dev_priv->hpd_state.hpd0_bits = 0;
+	spin_unlock_irq(&dev_priv->hpd_state.lock);
+
+	hpd0 &= nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_INTR);
 	if (dev_priv->chipset >= 0x90)
 		hpd1 = nv_rd32(dev, 0xe074) & nv_rd32(dev, 0xe070);
 
@@ -1058,7 +1063,6 @@ nv50_display_irq_hotplug_bh(struct work_struct *work)
 			helper->dpms(connector->encoder, DRM_MODE_DPMS_OFF);
 	}
 
-	nv_wr32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL, nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL));
 	if (dev_priv->chipset >= 0x90)
 		nv_wr32(dev, 0xe074, nv_rd32(dev, 0xe074));
 
@@ -1072,8 +1076,13 @@ nv50_display_irq_handler(struct drm_device *dev)
 	uint32_t delayed = 0;
 
 	if (nv_rd32(dev, NV50_PMC_INTR_0) & NV50_PMC_INTR_0_HOTPLUG) {
-		if (!work_pending(&dev_priv->hpd_work))
-			queue_work(dev_priv->wq, &dev_priv->hpd_work);
+		uint32_t hpd0_bits = nv_rd32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL);
+		nv_wr32(dev, NV50_PCONNECTOR_HOTPLUG_CTRL, hpd0_bits);
+		spin_lock(&dev_priv->hpd_state.lock);
+		dev_priv->hpd_state.hpd0_bits |= hpd0_bits;
+		spin_unlock(&dev_priv->hpd_state.lock);
+
+		queue_work(dev_priv->wq, &dev_priv->hpd_work);
 	}
 
 	while (nv_rd32(dev, NV50_PMC_INTR_0) & NV50_PMC_INTR_0_DISPLAY) {
On Wed, 2010-11-10 at 16:32 -0500, Andy Lutomirski wrote:
> The old code generated an interrupt storm bad enough to completely take down my system.
>
> This only fixes the bits that are defined nouveau_regs.h. Newer hardware uses another register that isn't described, and I don't have that hardware to test.
Thanks for looking at this. I'll take a closer look at the problem today and see what I can come up with too, that'll work with the newer hardware too.
Ben.
On Wed, Nov 10, 2010 at 5:10 PM, Ben Skeggs bskeggs@redhat.com wrote:
> Thanks for looking at this. I'll take a closer look at the problem today and see what I can come up with too, that'll work with the newer hardware too.
It should be as simple as adding an hpd1 field to the hpd_state and making exactly the same change. (It would be nice to put the register definitions into nouveau_regs.h as well -- I didn't really want to muck around with a bunch of magic numbers that I can't test.)
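As a rough illustration of "adding an hpd1 field and making exactly the same change", the sketch below models both register banks in plain C. It is an assumption-laden model rather than the actual follow-up patch: `struct fake_regs`, `hpd_top_half` and `hpd_bottom_half` are hypothetical names, the status registers are assumed to be acknowledged by writing back the bits that were read, and the spinlock from the patch is left out because the model is single-threaded.

```c
#include <stdint.h>

/* Hypothetical register file; the driver would use nv_rd32()/nv_wr32(). */
struct fake_regs {
	uint32_t stat0; /* stand-in for 0xe054 */
	uint32_t stat1; /* stand-in for 0xe074, chipset >= 0x90 only */
};

/* Accumulator shared between the handler and the bottom half. */
struct hpd_state {
	uint32_t hpd0_bits;
	uint32_t hpd1_bits; /* the extra field for newer chips */
};

/* Top half: read each status bank, acknowledge it, accumulate the bits. */
static void hpd_top_half(struct fake_regs *r, struct hpd_state *s, int chipset)
{
	uint32_t bits0 = r->stat0;

	r->stat0 &= ~bits0; /* models acknowledging the bits just read */
	s->hpd0_bits |= bits0;

	if (chipset >= 0x90) {
		uint32_t bits1 = r->stat1;

		r->stat1 &= ~bits1;
		s->hpd1_bits |= bits1;
	}
}

/* Bottom half: take the accumulated bits and reset the accumulator. */
static void hpd_bottom_half(struct hpd_state *s, uint32_t *hpd0, uint32_t *hpd1)
{
	*hpd0 = s->hpd0_bits;
	*hpd1 = s->hpd1_bits;
	s->hpd0_bits = 0;
	s->hpd1_bits = 0;
}
```

In the real driver both functions would need to take the `hpd_state` lock around the accumulator accesses, as the posted patch does for `hpd0_bits`.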
I tried writing 0xffffffff to the display IRQ control in the handler to explicitly acknowledge the IRQ, but either I did it wrong or it had no effect.
I imagine that this explains the unreproducible crashes I had on F13 as well.
--Andy
On Wed, 2010-11-10 at 17:25 -0500, Andrew Lutomirski wrote:
> It should be as simple as adding an hpd1 field to the hpd_state and making exactly the same change. (It would be nice to put the register definitions into nouveau_regs.h as well -- I didn't really want to muck around with a bunch of magic numbers that I can't test.)
Yes, it is. I can confirm the problem on another card, but it doesn't actually cause any crashes here. If you can rework the patch to support the newer chips too, that'd be great.
As for magic numbers, the register names for those regs are wrong anyway. The joy of reverse-engineering the support. It doesn't really matter if you want to stick to them or go back to "magic" numbers.
Ben.
On Wed, Nov 10, 2010 at 5:35 PM, Ben Skeggs bskeggs@redhat.com wrote:
> Yes, it is. I can confirm the problem on another card, but it doesn't actually cause any crashes here. If you can rework the patch to support the newer chips too, that'd be great.
>
> As for magic numbers, the register names for those regs are wrong anyway. The joy of reverse-engineering the support. It doesn't really matter if you want to stick to them or go back to "magic" numbers.
That explains why INTR and CTRL seemed backwards :) I'll leave the magic numbers for the 0xe07? stuff.
Also, I accidentally dropped the "& enabled_bits" part -- I'll put that back.
Patch to follow after I boot and test it here.
--Andy
On Wed, Nov 10, 2010 at 11:51 PM, Andrew Lutomirski luto@mit.edu wrote:
> That explains why INTR and CTRL seemed backwards :) I'll leave the magic numbers for the 0xe07? stuff.
Perhaps remove the bad definitions from the reg file, or rename them to UNKsomething?
On Wed, Nov 10, 2010 at 5:55 PM, Maarten Maathuis madman2003@gmail.com wrote:
> Perhaps remove the bad definitions from the reg file, or rename them to UNKsomething?
Well, they're known. One is hotplug detect enable (unless the code is wrong) and the other is hotplug interrupt status.
--Andy
On Wed, 2010-11-10 at 18:01 -0500, Andrew Lutomirski wrote:
> Well, they're known. One is hotplug detect enable (unless the code is wrong) and the other is hotplug interrupt status.
That's also not correct. If anything, the most accurate names so far would probably be:
#define NV_PGPIO_INTR_EN_0 0xe050
#define NV_PGPIO_INTR_0    0xe054
#define NV_PGPIO_INTR_EN_1 0xe070
#define NV_PGPIO_INTR_1    0xe074
PGPIO is a guess, and there's other stuff in that range too, but it's definitely *not* PCONNECTOR.
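To show how the enable/status pairing reads with those names, here is a minimal sketch. It assumes the thread's own description (one register per bank enables interrupts, the other holds their status) and uses hypothetical helpers: `rd32_fn` stands in for `nv_rd32(dev, reg)`, and `fake_rd32` returns made-up values purely so the example is self-contained.

```c
#include <stdint.h>

/* Names proposed in the thread; PGPIO is described there as a guess. */
#define NV_PGPIO_INTR_EN_0 0xe050
#define NV_PGPIO_INTR_0    0xe054
#define NV_PGPIO_INTR_EN_1 0xe070
#define NV_PGPIO_INTR_1    0xe074

/* Hypothetical register read callback standing in for nv_rd32(dev, reg). */
typedef uint32_t (*rd32_fn)(uint32_t reg);

/* Made-up register contents, for illustration only. */
static uint32_t fake_rd32(uint32_t reg)
{
	switch (reg) {
	case NV_PGPIO_INTR_EN_0: return 0x0000000f;
	case NV_PGPIO_INTR_0:    return 0x00000012;
	case NV_PGPIO_INTR_EN_1: return 0x00000001;
	case NV_PGPIO_INTR_1:    return 0x00000000;
	default:                 return 0;
	}
}

/* Events that are both pending and enabled, for bank 0 or bank 1. */
static uint32_t hpd_pending(rd32_fn rd32, int bank)
{
	uint32_t en   = bank ? NV_PGPIO_INTR_EN_1 : NV_PGPIO_INTR_EN_0;
	uint32_t stat = bank ? NV_PGPIO_INTR_1    : NV_PGPIO_INTR_0;

	return rd32(stat) & rd32(en);
}
```

This mirrors the `status & enable` expression that the original bottom half computed from 0xe054 and 0xe050; only the names differ.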
Anyway, this doesn't matter. Whatever change in names can happen in nouveau git and make its way to Linus from there; the fix for nouveau git is already going to be different enough from what'll apply on Linus' tree right now. My opinion is, let's just fix the bug in mainline (without register naming) and fix the naming etc. in nouveau git.
Ben.
On Wed, 2010-11-10 at 17:51 -0500, Andrew Lutomirski wrote:
> That explains why INTR and CTRL seemed backwards :) I'll leave the magic numbers for the 0xe07? stuff.
That sounds good. It'll all get a cleanup at some point and be switched to "proper" (well, our best guess, you'd have to ask NVIDIA about the real ones) names.
Ben.
dri-devel@lists.freedesktop.org