On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
One good way to test the GART is to go over the GPU GART table and have the GPU write a dword at the end of each page, something like 0xCAFEDEAD or some value that is unlikely to be set already. Then go over all the pages and check that the GPU writes succeeded. Abusing the scratch register write-back feature is the easiest way to try that.
I'm planning to add a GART table check procedure at resume time, which will go over the GPU GART table (see the sketch after the list):
- read (back up) a dword at the end of each GPU page
- write a mark through the GPU and check it
- restore the original dword
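As a minimal sketch of that procedure (the helpers gpu_read_dword()/gpu_write_dword() and the name gart_table_check() are hypothetical stand-ins for dword accesses performed by the GPU itself, e.g. via the scratch register write-back trick; the attached patch is the real implementation):

/* Sketch only: walk the GART aperture and verify the GPU can write
 * the last dword of every mapped GPU page, then restore it.
 * gpu_read_dword()/gpu_write_dword() are NOT radeon APIs; they stand
 * in for GPU-side accesses (e.g. scratch register write-back). */
static int gart_table_check(struct radeon_device *rdev)
{
	u64 addr;
	u32 saved;
	unsigned i;

	for (i = 0; i < rdev->gart.num_gpu_pages; i++) {
		/* last dword of the i-th GPU page in the GART aperture */
		addr = rdev->mc.gtt_start +
		       (u64)i * RADEON_GPU_PAGE_SIZE +
		       RADEON_GPU_PAGE_SIZE - 4;
		saved = gpu_read_dword(rdev, addr);      /* back up */
		gpu_write_dword(rdev, addr, 0xDEADBEEF); /* mark via GPU */
		if (gpu_read_dword(rdev, addr) != 0xDEADBEEF) {
			DRM_ERROR("GART check failed at page %u\n", i);
			return -EINVAL;
		}
		gpu_write_dword(rdev, addr, saved);      /* restore */
	}
	return 0;
}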
The attached validateGART.patch does the job:
- It currently only works on the mips64 platform.
- To use it, apply all_in_vram.patch first, which will allocate the CP ring, IH, and IB in VRAM and hard-code no_wb=1.
The GART test routine is invoked in r600_resume. We've tried it, and found that when the lockup happened the GART table was still good before userspace restarted. The related dmesg follows:

[ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table at 9000000040040000, 32768 entries, Dummy Page[0x000000000e004000-0x000000000e007fff]
[ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768 entries(valid=8544, invalid=24224, total=32768).
...
[ 1531.156250] PM: resume of devices complete after 9396.588 msecs
[ 1532.152343] Restarting tasks ... done.
[ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than 10003msec
[ 1544.472656] ------------[ cut here ]------------
[ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243 radeon_fence_wait+0x25c/0x314()
[ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
...
[ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.062500] radeon 0000:01:05.0: WB disabled
[ 1545.097656] [drm] ring test succeeded in 0 usecs
[ 1545.105468] [drm] ib test succeeded in 0 usecs
[ 1545.109375] [drm] Enabling audio support
[ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table at 9000000040040000, 32768 entries, Dummy Page[0x000000000e004000-0x000000000e007fff]
[ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0: unexpected value 0x745aaad1(expect 0xDEADBEEF) entry=0x000000000e008067, orignal=0x745aaad1
...
/* System blocked here. */
Any idea?
I know lockups are frustrating; my only idea is that the memory controller is locking up because of some failing PCI <-> system RAM transaction.
BTW, we found the following in r600_pcie_gart_enable() (drivers/gpu/drm/radeon/r600.c):

	WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR, (u32)(rdev->dummy_page.addr >> 12));
On our platform, PAGE_SIZE is 16K; does that cause any problem?
No, this should be handled properly.
Also, in radeon_gart_unbind() and radeon_gart_restore(), the logic should change to:

for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
	radeon_gart_set_page(rdev, t, page_base);
	if (page_base != rdev->dummy_page.addr)
		page_base += RADEON_GPU_PAGE_SIZE;
}
???
No need to do so; the dummy page will be 16K too, so it's fine.
Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page is at 0x8e004000; then there are four distinct addresses in the GART: 0x8e004000, 0x8e005000, 0x8e006000 and 0x8e007000. The value written to VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
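To make the arithmetic concrete, a tiny standalone illustration of the numbers above (an illustration only, not driver code):

#include <stdio.h>
#include <stdint.h>

/* A 16K CPU dummy page at 0x8e004000 covers four 4K GPU pages, but
 * the fault-default register only names the first one (addr >> 12). */
int main(void)
{
	uint64_t dummy = 0x8e004000ULL;
	unsigned gpu_page = 0x1000;   /* 4K GPU page */
	unsigned cpu_page = 0x4000;   /* 16K CPU page */
	unsigned i;

	for (i = 0; i < cpu_page / gpu_page; i++)
		printf("GART entry %u -> 0x%08llx\n", i,
		       (unsigned long long)(dummy + i * gpu_page));

	printf("VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR = 0x%05llx\n",
	       (unsigned long long)(dummy >> 12));
	return 0;
}

This prints the four GART addresses 0x8e004000 through 0x8e007000 and the register value 0x8e004; only the first 4K sub-page lines up with what the register names.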
Cheers, Jerome
Huacai Chen
On Wed, 2012-02-29 at 12:49 +0800, chenhc@lemote.com wrote:
Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page is at 0x8e004000; then there are four distinct addresses in the GART: 0x8e004000, 0x8e005000, 0x8e006000 and 0x8e007000. The value written to VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
When radeon_gart_unbind() initializes the GART entries to point to the dummy page, it's just to have something safe in the GART table.

VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when a fault happens. It's like a sandbox for the MC. It doesn't conflict in any way to have GART table entries point to the same page.
Cheers, Jerome