Radeon lockup on 3.8.5-201.fc18.x86_64

List overview All Threads
Download

newer

older

[PATCH] drm: Avoid forcing a...

[PATCH v4 0/2] Enhance EDID quirks...

Andy Lutomirski

5 Apr 2013 5 Apr '13

9:11 p.m.

Every day or so, I'll click something and my screens go blank for a second or two. dmesg complains about a lockup, and afterwards everything is painfully slow. (Even switching focus to other emacs windows takes a second or two.) Once this happens, if I restart X, I get a blank screen, although the mouse still works and I can switch VTs and use the console.

My userspace driver is xorg-x11-drv-ati-7.0.0-0.9.20121015gitbd9e2c064.fc18.src.rpm. I have Option "ColorTiling2D" "true", but I'm not sure this matters.

lspci says: 08:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI Caicos [Radeon HD 6450]

[ 7508.750753] radeon 0000:08:00.0: GPU lockup CP stall for more than 5102593msec [ 7508.750757] radeon 0000:08:00.0: GPU lockup (waiting for 0x0000000000000155 last fence id 0x0000000000000154) [ 7508.751832] radeon 0000:08:00.0: Saved 71 dwords of commands on ring 0. [ 7508.751837] radeon 0000:08:00.0: GPU softreset: 0x00000007 [ 7508.762035] radeon 0000:08:00.0: GRBM_STATUS = 0xA0003828 [ 7508.762037] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [ 7508.762040] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [ 7508.762042] radeon 0000:08:00.0: SRBM_STATUS = 0x200400C0 [ 7508.762045] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [ 7508.762047] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [ 7508.762050] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000004 [ 7508.762053] radeon 0000:08:00.0: R_008680_CP_STAT = 0x80008647 [ 7508.762054] radeon 0000:08:00.0: GRBM_SOFT_RESET=0x00007F6B [ 7508.762109] radeon 0000:08:00.0: GRBM_STATUS = 0x00003828 [ 7508.762112] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [ 7508.762114] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [ 7508.762117] radeon 0000:08:00.0: SRBM_STATUS = 0x200400C0 [ 7508.762120] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [ 7508.762122] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [ 7508.762125] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [ 7508.762127] radeon 0000:08:00.0: R_008680_CP_STAT = 0x00000000 [ 7508.762131] radeon 0000:08:00.0: R_00D034_DMA_STATUS_REG = 0x4448C156 [ 7508.762187] radeon 0000:08:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [ 7508.779603] radeon 0000:08:00.0: GPU reset succeeded, trying to resume [ 7508.798041] [drm] probing gen 2 caps for device 8086:1d1a = 2/0 [ 7508.798047] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0 [ 7508.805050] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000). [ 7508.805186] radeon 0000:08:00.0: WB enabled [ 7508.805189] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880442ddcc00 [ 7508.805190] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880442ddcc0c [ 7508.821692] [drm] ring test on 0 succeeded in 2 usecs [ 7508.821765] [drm] ring test on 3 succeeded in 1 usecs [ 7508.821845] [drm] ib test on ring 0 succeeded in 0 usecs [ 7508.821923] [drm] ib test on ring 3 succeeded in 1 usecs

--Andy

Show replies by date

Alex Deucher

8 Apr 8 Apr

2:01 p.m.

On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...

Every day or so, I'll click something and my screens go blank for a second or two. dmesg complains about a lockup, and afterwards everything is painfully slow. (Even switching focus to other emacs windows takes a second or two.) Once this happens, if I restart X, I get a blank screen, although the mouse still works and I can switch VTs and use the console.

Try disabling hyperZ. Set env var R600_HYPERZ=0 (mesa 9.1) or R600_DEBUG=nohyperz (mesa git).

Alex

Andy Lutomirski

18 Apr 18 Apr

9:11 p.m.

On Mon, Apr 8, 2013 at 7:01 AM, Alex Deucher alexdeucher@gmail.com wrote:

...

On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
Every day or so, I'll click something and my screens go blank for a second or two. dmesg complains about a lockup, and afterwards everything is painfully slow. (Even switching focus to other emacs windows takes a second or two.) Once this happens, if I restart X, I get a blank screen, although the mouse still works and I can switch VTs and use the console.

Try disabling hyperZ. Set env var R600_HYPERZ=0 (mesa 9.1) or R600_DEBUG=nohyperz (mesa git).

It lasted longer. I have both of those environment variables set on the Xorg process but not on clients. Do I need it everywhere?

The failure was:

[847785.776441] radeon 0000:08:00.0: GPU lockup CP stall for more than 10654msec [847785.776449] radeon 0000:08:00.0: GPU lockup (waiting for 0x00000000000090f8 last fence id 0x00000000000090f7) [847785.777542] radeon 0000:08:00.0: Saved 71 dwords of commands on ring 0. [847785.777548] radeon 0000:08:00.0: GPU softreset: 0x00000007 [847785.789844] radeon 0000:08:00.0: GRBM_STATUS = 0xA0003828 [847785.789847] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [847785.789850] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [847785.789852] radeon 0000:08:00.0: SRBM_STATUS = 0x200080C0 [847785.789855] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [847785.789857] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [847785.789860] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000004 [847785.789862] radeon 0000:08:00.0: R_008680_CP_STAT = 0x80008647 [847785.789864] radeon 0000:08:00.0: GRBM_SOFT_RESET=0x00007F6B [847785.789919] radeon 0000:08:00.0: GRBM_STATUS = 0x00003828 [847785.789922] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [847785.789924] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [847785.789927] radeon 0000:08:00.0: SRBM_STATUS = 0x200080C0 [847785.789929] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [847785.789932] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [847785.789934] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [847785.789937] radeon 0000:08:00.0: R_008680_CP_STAT = 0x00000000 [847785.789941] radeon 0000:08:00.0: R_00D034_DMA_STATUS_REG = 0x04483106 [847785.789997] radeon 0000:08:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [847785.807356] radeon 0000:08:00.0: GPU reset succeeded, trying to resume [847785.825791] [drm] probing gen 2 caps for device 8086:1d1a = 2/0 [847785.825797] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0 [847785.832851] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000). [847785.832993] radeon 0000:08:00.0: WB enabled [847785.832996] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff88044295bc00 [847785.832997] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff88044295bc0c [847785.849454] [drm] ring test on 0 succeeded in 2 usecs [847785.849523] [drm] ring test on 3 succeeded in 1 usecs [847785.849604] [drm] ib test on ring 0 succeeded in 0 usecs [847785.849681] [drm] ib test on ring 3 succeeded in 1 usecs

--Andy

Alex Deucher

9:12 p.m.

On Thu, Apr 18, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...

On Mon, Apr 8, 2013 at 7:01 AM, Alex Deucher alexdeucher@gmail.com wrote:

...
On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
Every day or so, I'll click something and my screens go blank for a second or two. dmesg complains about a lockup, and afterwards everything is painfully slow. (Even switching focus to other emacs windows takes a second or two.) Once this happens, if I restart X, I get a blank screen, although the mouse still works and I can switch VTs and use the console.

Try disabling hyperZ. Set env var R600_HYPERZ=0 (mesa 9.1) or R600_DEBUG=nohyperz (mesa git).

It lasted longer. I have both of those environment variables set on the Xorg process but not on clients. Do I need it everywhere?

For anything that uses the 3D driver.

Alex

...

The failure was:

[847785.776441] radeon 0000:08:00.0: GPU lockup CP stall for more than 10654msec [847785.776449] radeon 0000:08:00.0: GPU lockup (waiting for 0x00000000000090f8 last fence id 0x00000000000090f7) [847785.777542] radeon 0000:08:00.0: Saved 71 dwords of commands on ring 0. [847785.777548] radeon 0000:08:00.0: GPU softreset: 0x00000007 [847785.789844] radeon 0000:08:00.0: GRBM_STATUS = 0xA0003828 [847785.789847] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [847785.789850] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [847785.789852] radeon 0000:08:00.0: SRBM_STATUS = 0x200080C0 [847785.789855] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [847785.789857] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [847785.789860] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000004 [847785.789862] radeon 0000:08:00.0: R_008680_CP_STAT = 0x80008647 [847785.789864] radeon 0000:08:00.0: GRBM_SOFT_RESET=0x00007F6B [847785.789919] radeon 0000:08:00.0: GRBM_STATUS = 0x00003828 [847785.789922] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [847785.789924] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [847785.789927] radeon 0000:08:00.0: SRBM_STATUS = 0x200080C0 [847785.789929] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [847785.789932] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [847785.789934] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [847785.789937] radeon 0000:08:00.0: R_008680_CP_STAT = 0x00000000 [847785.789941] radeon 0000:08:00.0: R_00D034_DMA_STATUS_REG = 0x04483106 [847785.789997] radeon 0000:08:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57 [847785.807356] radeon 0000:08:00.0: GPU reset succeeded, trying to resume [847785.825791] [drm] probing gen 2 caps for device 8086:1d1a = 2/0 [847785.825797] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0 [847785.832851] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000). [847785.832993] radeon 0000:08:00.0: WB enabled [847785.832996] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff88044295bc00 [847785.832997] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff88044295bc0c [847785.849454] [drm] ring test on 0 succeeded in 2 usecs [847785.849523] [drm] ring test on 3 succeeded in 1 usecs [847785.849604] [drm] ib test on ring 0 succeeded in 0 usecs [847785.849681] [drm] ib test on ring 3 succeeded in 1 usecs

--Andy

Andy Lutomirski

22 Apr 22 Apr

11:19 p.m.

On Thu, Apr 18, 2013 at 2:12 PM, Alex Deucher alexdeucher@gmail.com wrote:

...

On Thu, Apr 18, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
On Mon, Apr 8, 2013 at 7:01 AM, Alex Deucher alexdeucher@gmail.com wrote:

...
On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
Every day or so, I'll click something and my screens go blank for a second or two. dmesg complains about a lockup, and afterwards everything is painfully slow. (Even switching focus to other emacs windows takes a second or two.) Once this happens, if I restart X, I get a blank screen, although the mouse still works and I can switch VTs and use the console.

Try disabling hyperZ. Set env var R600_HYPERZ=0 (mesa 9.1) or R600_DEBUG=nohyperz (mesa git).

It lasted longer. I have both of those environment variables set on the Xorg process but not on clients. Do I need it everywhere?

For anything that uses the 3D driver.

This didn't appear to fix it, although it may have fixed some graphical glitches in gmail's compose window.

[350788.530966] radeon 0000:08:00.0: GPU lockup CP stall for more than 40769msec [350788.530970] radeon 0000:08:00.0: GPU lockup (waiting for 0x000000000000178f last fence id 0x000000000000178e) [350788.532047] radeon 0000:08:00.0: Saved 103 dwords of commands on ring 0. [350788.532051] radeon 0000:08:00.0: GPU softreset: 0x00000003 [350788.547792] radeon 0000:08:00.0: GRBM_STATUS = 0xA0003828 [350788.547794] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [350788.547797] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [350788.547799] radeon 0000:08:00.0: SRBM_STATUS = 0x200000C0 [350788.547802] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [350788.547805] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [350788.547807] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000004 [350788.547810] radeon 0000:08:00.0: R_008680_CP_STAT = 0x80008647 [350788.547811] radeon 0000:08:00.0: GRBM_SOFT_RESET=0x00007F6B [350788.547866] radeon 0000:08:00.0: GRBM_STATUS = 0x00003828 [350788.547869] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [350788.547872] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [350788.547874] radeon 0000:08:00.0: SRBM_STATUS = 0x200000C0 [350788.547877] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [350788.547879] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [350788.547882] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [350788.547884] radeon 0000:08:00.0: R_008680_CP_STAT = 0x00000000 [350788.565361] radeon 0000:08:00.0: GPU reset succeeded, trying to resume [350788.583801] [drm] probing gen 2 caps for device 8086:1d1a = 2/0 [350788.583807] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0 [350788.590840] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000). [350788.590976] radeon 0000:08:00.0: WB enabled [350788.590978] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880442f58c00 [350788.590979] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880442f58c0c [350788.607480] [drm] ring test on 0 succeeded in 2 usecs [350788.607560] [drm] ring test on 3 succeeded in 1 usecs [350788.615053] [drm] ib test on ring 0 succeeded in 0 usecs [350788.615133] [drm] ib test on ring 3 succeeded in 1 usecs

I'm not convinced there's an actual hang. 40 seconds is a long time, and I've only ever seen this when clicking something, and when this happens, the screen goes blank immediately (not after a 40 second delay).

--Andy

Michel Dänzer

23 Apr 23 Apr

5:55 a.m.

On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:

...

On Thu, Apr 18, 2013 at 2:12 PM, Alex Deucher alexdeucher@gmail.com wrote:

...
On Thu, Apr 18, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
On Mon, Apr 8, 2013 at 7:01 AM, Alex Deucher alexdeucher@gmail.com wrote:

...
On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
Every day or so, I'll click something and my screens go blank for a second or two. dmesg complains about a lockup, and afterwards everything is painfully slow. (Even switching focus to other emacs windows takes a second or two.) Once this happens, if I restart X, I get a blank screen, although the mouse still works and I can switch VTs and use the console.

Try disabling hyperZ. Set env var R600_HYPERZ=0 (mesa 9.1) or R600_DEBUG=nohyperz (mesa git).

It lasted longer. I have both of those environment variables set on the Xorg process but not on clients. Do I need it everywhere?

For anything that uses the 3D driver.

This didn't appear to fix it, although it may have fixed some graphical glitches in gmail's compose window.

Seems rather unlikely that's directly related to HyperZ, but who knows.

...

[350788.530966] radeon 0000:08:00.0: GPU lockup CP stall for more than 40769msec [350788.530970] radeon 0000:08:00.0: GPU lockup (waiting for 0x000000000000178f last fence id 0x000000000000178e) [350788.532047] radeon 0000:08:00.0: Saved 103 dwords of commands on ring 0. [350788.532051] radeon 0000:08:00.0: GPU softreset: 0x00000003 [350788.547792] radeon 0000:08:00.0: GRBM_STATUS = 0xA0003828 [350788.547794] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [350788.547797] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [350788.547799] radeon 0000:08:00.0: SRBM_STATUS = 0x200000C0 [350788.547802] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [350788.547805] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [350788.547807] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000004 [350788.547810] radeon 0000:08:00.0: R_008680_CP_STAT = 0x80008647 [350788.547811] radeon 0000:08:00.0: GRBM_SOFT_RESET=0x00007F6B [350788.547866] radeon 0000:08:00.0: GRBM_STATUS = 0x00003828 [350788.547869] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [350788.547872] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [350788.547874] radeon 0000:08:00.0: SRBM_STATUS = 0x200000C0 [350788.547877] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [350788.547879] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [350788.547882] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [350788.547884] radeon 0000:08:00.0: R_008680_CP_STAT = 0x00000000 [350788.565361] radeon 0000:08:00.0: GPU reset succeeded, trying to resume [350788.583801] [drm] probing gen 2 caps for device 8086:1d1a = 2/0 [350788.583807] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0 [350788.590840] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000). [350788.590976] radeon 0000:08:00.0: WB enabled [350788.590978] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880442f58c00 [350788.590979] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880442f58c0c [350788.607480] [drm] ring test on 0 succeeded in 2 usecs [350788.607560] [drm] ring test on 3 succeeded in 1 usecs [350788.615053] [drm] ib test on ring 0 succeeded in 0 usecs [350788.615133] [drm] ib test on ring 3 succeeded in 1 usecs

I'm not convinced there's an actual hang. 40 seconds is a long time, and I've only ever seen this when clicking something, and when this happens, the screen goes blank immediately (not after a 40 second delay).

Hmm, now that you mention this, I notice in your original report it claims that the CP stalled for 'more than 5102593msec', which is clearly bogus. Looks like something's wrong with the lockup detection. Did this start after a kernel update or something like that?

-- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer

Andy Lutomirski

5:08 p.m.

On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer michel@daenzer.net wrote:

...

On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:

...
On Thu, Apr 18, 2013 at 2:12 PM, Alex Deucher alexdeucher@gmail.com wrote:

...
On Thu, Apr 18, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
On Mon, Apr 8, 2013 at 7:01 AM, Alex Deucher alexdeucher@gmail.com wrote:

...
On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski luto@amacapital.net wrote:

...
Every day or so, I'll click something and my screens go blank for a second or two. dmesg complains about a lockup, and afterwards everything is painfully slow. (Even switching focus to other emacs windows takes a second or two.) Once this happens, if I restart X, I get a blank screen, although the mouse still works and I can switch VTs and use the console.

Try disabling hyperZ. Set env var R600_HYPERZ=0 (mesa 9.1) or R600_DEBUG=nohyperz (mesa git).

It lasted longer. I have both of those environment variables set on the Xorg process but not on clients. Do I need it everywhere?

For anything that uses the 3D driver.

This didn't appear to fix it, although it may have fixed some graphical glitches in gmail's compose window.

Seems rather unlikely that's directly related to HyperZ, but who knows.

...
[350788.530966] radeon 0000:08:00.0: GPU lockup CP stall for more than 40769msec [350788.530970] radeon 0000:08:00.0: GPU lockup (waiting for 0x000000000000178f last fence id 0x000000000000178e) [350788.532047] radeon 0000:08:00.0: Saved 103 dwords of commands on ring 0. [350788.532051] radeon 0000:08:00.0: GPU softreset: 0x00000003 [350788.547792] radeon 0000:08:00.0: GRBM_STATUS = 0xA0003828 [350788.547794] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [350788.547797] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [350788.547799] radeon 0000:08:00.0: SRBM_STATUS = 0x200000C0 [350788.547802] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [350788.547805] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [350788.547807] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000004 [350788.547810] radeon 0000:08:00.0: R_008680_CP_STAT = 0x80008647 [350788.547811] radeon 0000:08:00.0: GRBM_SOFT_RESET=0x00007F6B [350788.547866] radeon 0000:08:00.0: GRBM_STATUS = 0x00003828 [350788.547869] radeon 0000:08:00.0: GRBM_STATUS_SE0 = 0x00000007 [350788.547872] radeon 0000:08:00.0: GRBM_STATUS_SE1 = 0x00000007 [350788.547874] radeon 0000:08:00.0: SRBM_STATUS = 0x200000C0 [350788.547877] radeon 0000:08:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000 [350788.547879] radeon 0000:08:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000 [350788.547882] radeon 0000:08:00.0: R_00867C_CP_BUSY_STAT = 0x00000000 [350788.547884] radeon 0000:08:00.0: R_008680_CP_STAT = 0x00000000 [350788.565361] radeon 0000:08:00.0: GPU reset succeeded, trying to resume [350788.583801] [drm] probing gen 2 caps for device 8086:1d1a = 2/0 [350788.583807] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0 [350788.590840] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000). [350788.590976] radeon 0000:08:00.0: WB enabled [350788.590978] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880442f58c00 [350788.590979] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880442f58c0c [350788.607480] [drm] ring test on 0 succeeded in 2 usecs [350788.607560] [drm] ring test on 3 succeeded in 1 usecs [350788.615053] [drm] ib test on ring 0 succeeded in 0 usecs [350788.615133] [drm] ib test on ring 3 succeeded in 1 usecs

I'm not convinced there's an actual hang. 40 seconds is a long time, and I've only ever seen this when clicking something, and when this happens, the screen goes blank immediately (not after a 40 second delay).

Hmm, now that you mention this, I notice in your original report it claims that the CP stalled for 'more than 5102593msec', which is clearly bogus. Looks like something's wrong with the lockup detection. Did this start after a kernel update or something like that?

It's recent. It may have been when F18 switched from 3.7 to 3.8.

I think there are bugs in the lockup detection and in the lockup recovery. Firefox, in particular, is *really* slow afterwards. Are interrupts possibly getting dropped or misconfigured during the reset?

--Andy

Michel Dänzer

5:15 p.m.

On Die, 2013-04-23 at 10:08 -0700, Andy Lutomirski wrote:

...

On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer michel@daenzer.net wrote:

...
On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:

...
I'm not convinced there's an actual hang. 40 seconds is a long time, and I've only ever seen this when clicking something, and when this happens, the screen goes blank immediately (not after a 40 second delay).

Hmm, now that you mention this, I notice in your original report it claims that the CP stalled for 'more than 5102593msec', which is clearly bogus. Looks like something's wrong with the lockup detection. Did this start after a kernel update or something like that?

It's recent. It may have been when F18 switched from 3.7 to 3.8.

Can you reproduce it with an upstream kernel? Can you bisect? I realize it'll probably take a long time, but unless someone has an idea which change might have introduced the problem...

...

I think there are bugs in the lockup detection and in the lockup recovery. Firefox, in particular, is *really* slow afterwards. Are interrupts possibly getting dropped or misconfigured during the reset?

Let's not get ahead of ourselves and focus on the lockup detection issue for now.

-- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer

Andy Lutomirski

7:31 p.m.

On Tue, Apr 23, 2013 at 10:15 AM, Michel Dänzer michel@daenzer.net wrote:

...

On Die, 2013-04-23 at 10:08 -0700, Andy Lutomirski wrote:

...
On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer michel@daenzer.net wrote:

...
On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:

...
I'm not convinced there's an actual hang. 40 seconds is a long time, and I've only ever seen this when clicking something, and when this happens, the screen goes blank immediately (not after a 40 second delay).

Hmm, now that you mention this, I notice in your original report it claims that the CP stalled for 'more than 5102593msec', which is clearly bogus. Looks like something's wrong with the lockup detection. Did this start after a kernel update or something like that?

It's recent. It may have been when F18 switched from 3.7 to 3.8.

Can you reproduce it with an upstream kernel? Can you bisect? I realize it'll probably take a long time, but unless someone has an idea which change might have introduced the problem...

Yuck. I can try, but it takes days to reproduce this, so it will take forever (and may end up with a wrong answer if I get lucky and don't crash).

...

...
I think there are bugs in the lockup detection and in the lockup recovery. Firefox, in particular, is *really* slow afterwards. Are interrupts possibly getting dropped or misconfigured during the reset?

Let's not get ahead of ourselves and focus on the lockup detection issue for now.

I don't understand the r600_gpu_check_soft_reset code, but could this be the sequence of events that triggers it?

1. radeon_ring_is_lockup is called just as the very last command on the ring completes, so last_rptr gets set to the rptr. 2. Nothing happens for a while (i.e. > lockup_timeout). rptr doesn't change. 3. A very slightly slow operation starts. 4. radeon_ring_is_lockup gets called before that command completes.

radeon_ring_test_lockup will not detect a jiffies wrap-around (because there wasn't one), rptr will equal last_rptr (because there hasn't been any progress since last time), and the elapsed time will be really long, because the function hasn't been called for a long time. So a lockup gets detected, even though nothing's wrong.

There's a comment above radeon_ring_test_lockup that says:

* A possible false positivie is if we get call after while and last_cp_rptr == * the current CP rptr, even if it's unlikely it might happen. To avoid this * if the elapsed time since last call is bigger than 2 second than we return * false and update the tracking information. Due to this the caller must call * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported * the fencing code should be cautious about that.

but the corresponding code doesn't appear to exist anywhere.

Also, and unrelatedly, I revoke my comment about gmail issues being fixed with hyperz off. Gmail still draws incorrectly. This may or may not have anything to do with the radeon driver.

--Andy

Andy Lutomirski

11 Jun 11 Jun

11:23 p.m.

New subject: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected.

Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description).

My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski luto@amacapital.net ---

This may be -stable material.

drivers/gpu/drm/radeon/radeon.h | 1 + drivers/gpu/drm/radeon/radeon_ring.c | 23 ++++++++++++++++------- 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 8263af3..9de5778 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -652,6 +652,7 @@ struct radeon_ring { unsigned ring_free_dw; int count_dw; unsigned long last_activity; + unsigned long last_test_lockup; unsigned last_rptr; uint64_t gpu_addr; uint32_t align_mask; diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. + * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. * * Caller should write to the ring to force CP to do something so we don't get * false positive when CP is just gived nothing to do. @@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) **/ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring) { - unsigned long cjiffies, elapsed; + unsigned long cjiffies, elapsed, last_test; uint32_t rptr;

cjiffies = jiffies; + + last_test = ring->last_test_lockup; + ring->last_test_lockup = cjiffies; + if (!time_after(cjiffies, ring->last_activity)) { /* likely a wrap around */ radeon_ring_lockup_update(ring); @@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin radeon_ring_lockup_update(ring); return false; } + if (cjiffies - last_test > 2 * HZ) { + /* Possible race -- see comment above */ + radeon_ring_lockup_update(ring); + return false; + } elapsed = jiffies_to_msecs(cjiffies - ring->last_activity); if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) { dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed);

-- 1.8.1.4

Michel Dänzer

12 Jun 12 Jun

10:26 a.m.

New subject: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:

...

If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected.

Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description).

My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski luto@amacapital.net

[...]

...

diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)

have CP rptr to a different value of jiffies wrap around which will force

initialization of the lockup tracking informations.

A possible false positivie is if we get call after while and last_cp_rptr ==

the current CP rptr, even if it's unlikely it might happen. To avoid this

if the elapsed time since last call is bigger than 2 second than we return

false and update the tracking information. Due to this the caller must call

radeon_ring_test_lockup several time in less than 2sec for lockup to be reported

the fencing code should be cautious about that.

A possible false positive is if we get called after a while and

last_cp_rptr == the current CP rptr, even if it's unlikely it might

happen. To avoid this if the elapsed time since the last call is bigger

than 2 second then we return false and update the tracking

information. Due to this the caller must call radeon_ring_test_lockup

more frequently than once every 2s when waiting.

Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected?

Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring.

-- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer

Christian König

11:20 a.m.

New subject: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

Am 12.06.2013 12:26, schrieb Michel Dänzer:

...

On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:

...
If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected.

Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description).

My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski luto@amacapital.net

[...]

...
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)

have CP rptr to a different value of jiffies wrap around which will force

initialization of the lockup tracking informations.

A possible false positivie is if we get call after while and last_cp_rptr ==

the current CP rptr, even if it's unlikely it might happen. To avoid this

if the elapsed time since last call is bigger than 2 second than we return

false and update the tracking information. Due to this the caller must call

radeon_ring_test_lockup several time in less than 2sec for lockup to be reported

the fencing code should be cautious about that.

A possible false positive is if we get called after a while and

last_cp_rptr == the current CP rptr, even if it's unlikely it might

happen. To avoid this if the elapsed time since the last call is bigger

than 2 second then we return false and update the tracking

information. Due to this the caller must call radeon_ring_test_lockup

more frequently than once every 2s when waiting.

Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected?

Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring.

If I remember correctly that's exactly what I used to do when creating radeon_ring_test_lockup, and now I'm really wondering why that stuff isn't there any more.

Anyway the problem you describe should only happen very very rarely in case of a wraparound, so I'm pretty sure that we have a different problem that's just masked by that patch.

Christian.

Jerome Glisse

1:56 p.m.

New subject: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer michel@daenzer.net wrote:

...

On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:

...
If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected.

Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description).

My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski luto@amacapital.net

[...]

...
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)

have CP rptr to a different value of jiffies wrap around which will force

initialization of the lockup tracking informations.

A possible false positivie is if we get call after while and last_cp_rptr ==

the current CP rptr, even if it's unlikely it might happen. To avoid this

if the elapsed time since last call is bigger than 2 second than we return

false and update the tracking information. Due to this the caller must call

radeon_ring_test_lockup several time in less than 2sec for lockup to be reported

the fencing code should be cautious about that.

A possible false positive is if we get called after a while and

last_cp_rptr == the current CP rptr, even if it's unlikely it might

happen. To avoid this if the elapsed time since the last call is bigger

than 2 second then we return false and update the tracking

information. Due to this the caller must call radeon_ring_test_lockup

more frequently than once every 2s when waiting.

Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected?

Yes it will if you wait for a fence, because the fence timeout wait is way smaller than 2sec so radeon_ring_is_lockup get call several time, which call radeon_ring_force_activity and then radeon_ring_test_lockup.

This also means it very very very unlikely (see below for the likely case) to have a wrap around that give last rptr same as current one.

The likely case is when you have something like a long compute, then nothing is lockup but you keep filling ring with radeon_ring_force_activity but the cp is still stuck on the ib of the compute stuff so rptr does not progress.

...

Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring.

Maybe but i still don't think it should matter.

Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking...

Cheers, Jerome

Andy Lutomirski

13 Jun 13 Jun

9:22 p.m.

New subject: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse j.glisse@gmail.com wrote:

...

On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer michel@daenzer.net wrote:

...
On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:

...
If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected.

Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description).

My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug).

Signed-off-by: Andy Lutomirski luto@amacapital.net

[...]

...
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)

have CP rptr to a different value of jiffies wrap around which will force

initialization of the lockup tracking informations.

A possible false positivie is if we get call after while and last_cp_rptr ==

the current CP rptr, even if it's unlikely it might happen. To avoid this

if the elapsed time since last call is bigger than 2 second than we return

false and update the tracking information. Due to this the caller must call

radeon_ring_test_lockup several time in less than 2sec for lockup to be reported

the fencing code should be cautious about that.

A possible false positive is if we get called after a while and

last_cp_rptr == the current CP rptr, even if it's unlikely it might

happen. To avoid this if the elapsed time since the last call is bigger

than 2 second then we return false and update the tracking

information. Due to this the caller must call radeon_ring_test_lockup

more frequently than once every 2s when waiting.

Is it guaranteed that radeon_ring_test_lockup will be called more often than every 2s when waiting? If not, this change might prevent a real lockup from being detected?

Yes it will if you wait for a fence, because the fence timeout wait is way smaller than 2sec so radeon_ring_is_lockup get call several time, which call radeon_ring_force_activity and then radeon_ring_test_lockup.

This also means it very very very unlikely (see below for the likely case) to have a wrap around that give last rptr same as current one.

The likely case is when you have something like a long compute, then nothing is lockup but you keep filling ring with radeon_ring_force_activity but the cp is still stuck on the ib of the compute stuff so rptr does not progress.

...
Either way, I wonder if there might not be a simpler solution to the problem, e.g. by updating last_activity when submitting commands to a previously empty ring.

Maybe but i still don't think it should matter.

Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking...

Testing now. I'll report back in a couple of days.

I don't think that long computes have anything to do with it. The bogus lockups happen when I look away from my computer for a while and then click something. I thing the graphics are usually completely idle when this happens.

AFAIK I've never run an OpenCL or similar application on this system.

--Andy

Andy Lutomirski

18 Jun 18 Jun

6:12 p.m.

New subject: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity

On Thu, Jun 13, 2013 at 2:22 PM, Andy Lutomirski luto@amacapital.net wrote:

...

On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse j.glisse@gmail.com wrote:

...
Andy can you test (without your patch) and see if it helps with your issue : http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking...

Testing now. I'll report back in a couple of days.

3.9.4 plus this patch has been completely stable for several days now.

Tested-by: Andy Lutomirski luto@amacapital.net

Can you send this to Linux and -stable?

Thanks, Andy

4333

Age (days ago)

4407

Last active (days ago)

dri-devel@lists.freedesktop.org

14 comments

5 participants

tags (0)

participants (5)

Alex Deucher
Andy Lutomirski
Christian König
Jerome Glisse
Michel Dänzer