New subject: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

8 Dec 2011


      Thank you for your reply.
I found that CP_RB_WPTR has changed when "ring test failed" occurs, so I
think the CP is active, but what it gets from the ring buffer is wrong. Is
there a way to check the content that the GPU fetches from the ring buffer?
BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
/sys/power/state" to hibernate, there is occasionally a "GPU reset", just
as with suspend. However, if I use "echo reboot > /sys/power/disk; echo
disk > /sys/power/state" to hibernate and wake up automatically, there is
no "GPU reset" after hundreds of tests.
What does this imply? Does the power loss cause something to break?
Best regards,
Huacai Chen
...
2011/12/7  chenhc@lemote.com:
...
When "MC timeout" happens at GPU reset, we found that bits 12 and 13
of R_000E50_SRBM_STATUS are both 1. In the kernel code these two bits
are defined like this:
#define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
#define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)
Could you please tell me what they mean? And if possible,
They refer to sub-blocks in the memory controller.  I don't really
know offhand what the names mean.
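For reference, the two extractor macros can be applied directly to the
SRBM_STATUS values printed in the soft-reset log further down the thread
(0x20023040 before the reset, 0x2002B040 after it). The following is only
a minimal user-space sketch: the macros are the ones quoted above, while
decode_srbm_mc_busy() is a hypothetical helper written for illustration
and is not part of the radeon driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Field extractors as quoted from the kernel header in this thread. */
#define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)

/*
 * Hypothetical helper (not a driver function): format the two MC
 * sub-block busy flags into buf and return how many of them are set.
 */
static int decode_srbm_mc_busy(uint32_t status, char *buf, size_t len)
{
    unsigned mcdx = (unsigned)G_000E50_MCDX_BUSY(status);
    unsigned mcdw = (unsigned)G_000E50_MCDW_BUSY(status);

    assert(buf != NULL && len > 0);
    snprintf(buf, len, "SRBM_STATUS=0x%08X MCDX_BUSY=%u MCDW_BUSY=%u",
             (unsigned)status, mcdx, mcdw);
    return (int)(mcdx + mcdw);
}
```

Calling decode_srbm_mc_busy() on either logged value reports both flags
as set, which matches the observation that the MC sub-blocks are still
busy after the soft reset.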
...
I want to know the functionalities of these 5 registers in detail:
#define R_000E60_SRBM_SOFT_RESET                       0x0E60
#define R_000E50_SRBM_STATUS                           0x0E50
#define R_008020_GRBM_SOFT_RESET                0x8020
#define R_008010_GRBM_STATUS                    0x8010
#define R_008014_GRBM_STATUS2                   0x8014
A bit more info: if I reset the MC after resetting the CP (this is what
Linux-2.6.34 does, but it was removed as of 2.6.35), then the "MC timeout"
disappears, but there is still "ring test failed".
The bits are defined in r600d.h.  As to the acronyms:
BIF - Bus InterFace
CG - clocks
DC - Display Controller
GRBM - Graphics block (3D engine)
HDP - Host Data Path (CPU access to vram via the PCI BAR)
IH, RLC - Interrupt controller
MC - Memory controller
ROM - ROM
SEM - semaphore controller
When you reset the MC, you will probably have to reset just about
everything else since most blocks depend on the MC for access to
memory.  If you do reset the MC, you should do it prior to calling
asic_init so you make sure all the hw gets re-initialized properly.
Additionally, you should probably reset the GRBM either via
SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
Alex
...
Huacai Chen
...
2011/11/8  chenhc@lemote.com:
...
And, I want to know something:
1, Does GPU use MC to access GTT?
Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
memory (vram or gart).
...
2, What can cause MC timeout?
Lots of things.  Some GPU client still active, some GPU client hung or
not properly initialized.
Alex
...
...
Hi,
Some status update.
On 29 September 2011 at 5:17 PM, Chen Jie chenj@lemote.com wrote:
...
Hi,
Add more information.
We occasionally got a "GPU lockup" after resuming from suspend (on a
mipsel platform with a mips64-compatible CPU and rs780e; the kernel is
3.1.0-rc8, 64-bit).  Related kernel messages:
/* return from STR */
[  156.152343] radeon 0000:01:05.0: WB enabled
[  156.187500] [drm] ring test succeeded in 0 usecs
[  156.187500] [drm] ib test succeeded in 0 usecs
[  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
[  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
[  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
[  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  156.597656] ata1.00: configured for UDMA/133
[  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
[  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
[  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
[  157.683593] r8169 0000:02:00.0: eth0: link up
[  165.621093] PM: resume of devices complete after 9679.556 msecs
[  165.628906] Restarting tasks ... done.
[  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than 10019msec
[  177.089843] ------------[ cut here ]------------
[  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x25c/0x33c()
[  177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
[  177.113281] Modules linked in: psmouse serio_raw
[  177.117187] Call Trace:
[  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
[  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
[  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
[  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
[  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
[  177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
[  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
[  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
[  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
[  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
[  177.179687] ---[ end trace 92f63d998efe4c6d ]---
[  177.187500] radeon 0000:01:05.0: GPU softreset
[  177.191406] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xF57C2030
[  177.195312] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00111103
[  177.203125] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x20023040
[  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
[  177.367187] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
[  177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
[  177.414062] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA0003030
[  177.417968] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
[  177.425781] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x2002B040
[  177.433593] radeon 0000:01:05.0: GPU reset succeed
[  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
[  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
[  177.804687] radeon 0000:01:05.0: WB enabled
[  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xCAFEDEAD)
After pinning the ring in VRAM, it warned of an ib test failure. It seems
something is wrong with access through GTT.
We dumped the gart table just after stopping the cp and compared it with
the one dumped just after r600_pcie_gart_enable, and did not find any
difference.
Any idea?
...
[  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
[  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(5).
[  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
[  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(6).
...
Regards,
-- Chen Jie
