Thank you for your reply.
I found that CP_RB_WPTR has changed when "ring test failed" occurs, so I think the CP is active but what it gets from the ring buffer is wrong. So I want to know whether there is a way to check the content that the GPU gets from the ring buffer.
BTW, when I use "echo shutdown > /sys/power/disk; echo disk > /sys/power/state" to hibernate, there is occasionally a "GPU reset", just like with suspend. However, if I use "echo reboot > /sys/power/disk; echo disk > /sys/power/state" to hibernate and wake up automatically, there is no "GPU reset" even after hundreds of tests. What does this imply? Does the power loss cause something to break?
Best regards,
Huacai Chen
2011/12/7 chenhc@lemote.com:
When "MC timeout" happens at GPU reset, we found that bits 12 and 13 of R_000E50_SRBM_STATUS are 1. In the kernel code these two bits are defined like this:
#define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)
Could you please tell me what they mean? And if possible,
They refer to sub-blocks in the memory controller. I don't really know offhand what the names mean.
I want to know the functionality of these 5 registers in detail:
#define R_000E60_SRBM_SOFT_RESET 0x0E60
#define R_000E50_SRBM_STATUS 0x0E50
#define R_008020_GRBM_SOFT_RESET 0x8020
#define R_008010_GRBM_STATUS 0x8010
#define R_008014_GRBM_STATUS2 0x8014
A bit more info: If I reset the MC after resetting the CP (this is what Linux 2.6.34 does, but it was removed in 2.6.35), then the "MC timeout" disappears, but there is still "ring test failed".
The bits are defined in r600d.h. As to the acronyms:
BIF - Bus InterFace
CG - clocks
DC - Display Controller
GRBM - Graphics block (3D engine)
HDP - Host Data Path (CPU access to vram via the PCI BAR)
IH, RLC - Interrupt controller
MC - Memory controller
ROM - ROM
SEM - semaphore controller
When you reset the MC, you will probably have to reset just about everything else, since most blocks depend on the MC for access to memory. If you do reset the MC, you should do it prior to calling asic_init so that all the hw gets re-initialized properly. Additionally, you should probably reset the GRBM, either via SRBM_SOFT_RESET or by resetting the individual sub-blocks via GRBM_SOFT_RESET.
Alex
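As an illustration of the two busy bits discussed above, here is a minimal sketch that applies the two macros quoted from the kernel to the SRBM_STATUS values that appear in the kernel log later in this thread (0x20023040 before the soft reset, 0x2002B040 after it). The helper names mc_subblocks_busy() and print_mc_busy() are made up for this example and are not part of the radeon driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Field extractors as quoted in this thread (kernel r600d.h). */
#define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)

/*
 * Hypothetical helper: pack the two busy flags into one value.
 * Bit 0 of the result is MCDX_BUSY, bit 1 is MCDW_BUSY.
 */
static unsigned mc_subblocks_busy(uint32_t srbm_status)
{
    return (unsigned)(G_000E50_MCDX_BUSY(srbm_status) |
                      (G_000E50_MCDW_BUSY(srbm_status) << 1));
}

/* Hypothetical helper: print both flags for one register snapshot. */
static void print_mc_busy(const char *label, uint32_t srbm_status)
{
    assert(label != NULL);
    printf("%s: SRBM_STATUS=0x%08X MCDX_BUSY=%u MCDW_BUSY=%u\n",
           label, (unsigned)srbm_status,
           (unsigned)G_000E50_MCDX_BUSY(srbm_status),
           (unsigned)G_000E50_MCDW_BUSY(srbm_status));
}
```

Calling print_mc_busy("before reset", 0x20023040u) and print_mc_busy("after reset", 0x2002B040u) shows both flags set in both snapshots, which matches the "Wait for MC idle timedout" messages that continue after the reset.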
Huacai Chen
2011/11/8 chenhc@lemote.com:
I also want to know a few things: 1. Does the GPU use the MC to access GTT?
Yes. All GPU clients (display, 3D, etc.) go through the MC to access memory (vram or gart).
2. What can cause an MC timeout?
Lots of things, for example a GPU client that is still active, or one that is hung or not properly initialized.
Alex
Hi,
Some status update. On September 29, 2011 at 5:17 PM, Chen Jie chenj@lemote.com wrote:
Hi, some more information. We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel platform with a MIPS64-compatible CPU and rs780e; the kernel is 3.1.0-rc8, 64-bit). Related kernel messages:
/* return from STR */
[ 156.152343] radeon 0000:01:05.0: WB enabled
[ 156.187500] [drm] ring test succeeded in 0 usecs
[ 156.187500] [drm] ib test succeeded in 0 usecs
[ 156.398437] ata2: SATA link down (SStatus 0 SControl 300)
[ 156.398437] ata3: SATA link down (SStatus 0 SControl 300)
[ 156.398437] ata4: SATA link down (SStatus 0 SControl 300)
[ 156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 156.597656] ata1.00: configured for UDMA/133
[ 156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
[ 157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
[ 157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
[ 157.683593] r8169 0000:02:00.0: eth0: link up
[ 165.621093] PM: resume of devices complete after 9679.556 msecs
[ 165.628906] Restarting tasks ... done.
[ 177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than 10019msec
[ 177.089843] ------------[ cut here ]------------
[ 177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x25c/0x33c()
[ 177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
[ 177.113281] Modules linked in: psmouse serio_raw
[ 177.117187] Call Trace:
[ 177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
[ 177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
[ 177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
[ 177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
[ 177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
[ 177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
[ 177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
[ 177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
[ 177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
[ 177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
[ 177.179687] ---[ end trace 92f63d998efe4c6d ]---
[ 177.187500] radeon 0000:01:05.0: GPU softreset
[ 177.191406] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xF57C2030
[ 177.195312] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00111103
[ 177.203125] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x20023040
[ 177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 177.367187] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
[ 177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
[ 177.414062] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA0003030
[ 177.417968] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
[ 177.425781] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x2002B040
[ 177.433593] radeon 0000:01:05.0: GPU reset succeed
[ 177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 177.804687] radeon 0000:01:05.0: WB enabled
[ 178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xCAFEDEAD)
After pinning the ring in VRAM, it reported an ib test failure. It seems something is wrong with access through GTT.
We dumped the GART table just after stopping the CP and compared it with a dump taken just after r600_pcie_gart_enable, and did not find any difference.
Any idea?
[ 178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
[ 178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(5).
[ 178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
[ 179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(6).
...
Regards, -- Chen Jie
On Thu, 2011-12-08 at 19:35 +0800, chenhc@lemote.com wrote:
I found that CP_RB_WPTR has changed when "ring test failed" occurs, so I think the CP is active but what it gets from the ring buffer is wrong.
CP_RB_WPTR is normally only changed by the CPU after adding commands to the ring buffer, so I'm afraid that may not be a valid conclusion.
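To make the point above concrete, here is a toy model in plain C (not driver code) of a ring buffer in which the CPU advances the write pointer and the CP advances the read pointer. All names (toy_ring, toy_cpu_submit, toy_cp_step, toy_ring_test) are invented for this sketch. The 0xCAFEDEAD value is the one printed in the failing log; 0xDEADBEEF as the value the CP is asked to write is my recollection of the driver's ring test, not something stated in this thread. Real packets, the scratch register offset and the timeout loop are all left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE    16u          /* entries, power of two (toy value) */
#define SCRATCH_INIT 0xCAFEDEADu  /* value seen in the failing log */
#define SCRATCH_DONE 0xDEADBEEFu  /* assumed value the CP should write */

struct toy_ring {
    uint32_t buf[RING_SIZE];
    uint32_t wptr;      /* advanced by the CPU after queuing commands */
    uint32_t rptr;      /* advanced by the CP as it consumes commands */
    uint32_t scratch;   /* stands in for the scratch register */
    bool cp_running;    /* false models a hung or uninitialized CP */
};

static void toy_ring_init(struct toy_ring *r, bool cp_running)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
    r->scratch = SCRATCH_INIT;
    r->cp_running = cp_running;
}

/* CPU side: queue one word and bump the write pointer. */
static void toy_cpu_submit(struct toy_ring *r, uint32_t value)
{
    r->buf[r->wptr] = value;
    r->wptr = (r->wptr + 1u) & (RING_SIZE - 1u);
}

/* CP side: consume everything up to wptr, but only if the CP is running. */
static void toy_cp_step(struct toy_ring *r)
{
    if (!r->cp_running)
        return;
    while (r->rptr != r->wptr) {
        r->scratch = r->buf[r->rptr];   /* "execute" the queued write */
        r->rptr = (r->rptr + 1u) & (RING_SIZE - 1u);
    }
}

/* Simplified ring test: succeeds only if the CP updated the scratch value. */
static bool toy_ring_test(struct toy_ring *r)
{
    r->scratch = SCRATCH_INIT;
    toy_cpu_submit(r, SCRATCH_DONE);
    toy_cp_step(r);
    return r->scratch == SCRATCH_DONE;
}
```

With cp_running set to false, toy_ring_test() fails and the scratch value stays at 0xCAFEDEAD, yet wptr has still moved because the CPU moved it. In this model the pointer that would show CP activity is rptr catching up with wptr.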
So I want to know whether there is a way to check the content that the GPU gets from the ring buffer.
See the r100_debugfs_cp_csq_fifo() function, which generates the output for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
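For completeness, a debugfs file like the one named above can be read from user space like any other text file. The following is only a sketch: dump_text_file() is a made-up helper, the path is the one quoted above, and whether that particular file exists on a given chip (it is named after r100) is something to check on the machine in question. debugfs normally has to be mounted and is usually readable only by root.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy a text file to `out`.
 * Returns the number of bytes copied, or -1 if the file cannot be opened.
 */
static long dump_text_file(const char *path, FILE *out)
{
    char buf[4096];
    size_t n;
    long total = 0;
    FILE *in;

    assert(path != NULL && out != NULL);
    in = fopen(path, "r");
    if (in == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        fwrite(buf, 1, n, out);
        total += (long)n;
    }
    fclose(in);
    return total;
}
```

A call such as dump_text_file("/sys/kernel/debug/dri/0/r100_cp_csq_fifo", stdout) would print the file if it is present, and return -1 otherwise.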
BTW, when I use "echo shutdown > /sys/power/disk; echo disk > /sys/power/state" to hibernate, there is occasionally a "GPU reset", just like with suspend. However, if I use "echo reboot > /sys/power/disk; echo disk > /sys/power/state" to hibernate and wake up automatically, there is no "GPU reset" even after hundreds of tests. What does this imply? Does the power loss cause something to break?
Yeah, it sounds like the resume code doesn't properly re-initialize something that's preserved on a warm boot but lost on a cold boot.
dri-devel@lists.freedesktop.org