On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
One good way to test the GART is to go over the GPU GART table and have the GPU write a dword at the end of each page, something like 0xCAFEDEAD or some value that is unlikely to be set already. Then go over all the pages and check that the GPU writes succeeded. Abusing the scratch register write-back feature is the easiest way to try that.
I'm planning to add a GART table check procedure at resume time, which will go over the GPU GART table (see the sketch after the list):
- read (back up) a dword at the end of each GPU page
- write a mark through the GPU and check it
- restore the original dword
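As a minimal sketch of that procedure (the helpers gpu_read_dword()/gpu_write_dword() and the name gart_table_check() are hypothetical stand-ins for dword accesses performed by the GPU itself, e.g. via the scratch register write-back trick; the attached patch is the real implementation):

/* Sketch only: walk the GART aperture and verify the GPU can write
 * the last dword of every mapped GPU page, then restore it.
 * gpu_read_dword()/gpu_write_dword() are NOT radeon APIs; they stand
 * in for GPU-side accesses (e.g. scratch register write-back). */
static int gart_table_check(struct radeon_device *rdev)
{
	u64 addr;
	u32 saved;
	unsigned i;

	for (i = 0; i < rdev->gart.num_gpu_pages; i++) {
		/* last dword of the i-th GPU page in the GART aperture */
		addr = rdev->mc.gtt_start +
		       (u64)i * RADEON_GPU_PAGE_SIZE +
		       RADEON_GPU_PAGE_SIZE - 4;
		saved = gpu_read_dword(rdev, addr);      /* back up */
		gpu_write_dword(rdev, addr, 0xDEADBEEF); /* mark via GPU */
		if (gpu_read_dword(rdev, addr) != 0xDEADBEEF) {
			DRM_ERROR("GART check failed at page %u\n", i);
			return -EINVAL;
		}
		gpu_write_dword(rdev, addr, saved);      /* restore */
	}
	return 0;
}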
The attached validateGART.patch does the job:
- It currently only works on the mips64 platform.
- To use it, apply all_in_vram.patch first, which will allocate the CP ring, IH, and IB in VRAM and hard-code no_wb=1.
The GART test routine is invoked in r600_resume. We've tried it, and found that when the lockup happened the GART table was still good before userspace restarted. The related dmesg follows:

[ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table at 9000000040040000, 32768 entries, Dummy Page[0x000000000e004000-0x000000000e007fff]
[ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768 entries(valid=8544, invalid=24224, total=32768).
...
[ 1531.156250] PM: resume of devices complete after 9396.588 msecs
[ 1532.152343] Restarting tasks ... done.
[ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than 10003msec
[ 1544.472656] ------------[ cut here ]------------
[ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243 radeon_fence_wait+0x25c/0x314()
[ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
...
[ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.062500] radeon 0000:01:05.0: WB disabled
[ 1545.097656] [drm] ring test succeeded in 0 usecs
[ 1545.105468] [drm] ib test succeeded in 0 usecs
[ 1545.109375] [drm] Enabling audio support
[ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table at 9000000040040000, 32768 entries, Dummy Page[0x000000000e004000-0x000000000e007fff]
[ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0: unexpected value 0x745aaad1(expect 0xDEADBEEF) entry=0x000000000e008067, orignal=0x745aaad1
...
/* System blocked here. */
Any idea?
I know lockups are frustrating; my only idea is that the memory controller is locking up because of some failing PCI <-> system RAM transaction.
BTW, we found the following in r600_pcie_gart_enable() (drivers/gpu/drm/radeon/r600.c):

	WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR, (u32)(rdev->dummy_page.addr >> 12));
On our platform, PAGE_SIZE is 16K; does that cause any problem?
No, this should be handled properly.
Also, in radeon_gart_unbind() and radeon_gart_restore(), the logic should change to:

for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
	radeon_gart_set_page(rdev, t, page_base);
	if (page_base != rdev->dummy_page.addr)
		page_base += RADEON_GPU_PAGE_SIZE;
}
???
No need to do so; the dummy page will be 16K too, so it's fine.
Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page is at 0x8e004000; then there are four distinct addresses in the GART: 0x8e004000, 0x8e005000, 0x8e006000 and 0x8e007000. The value written to VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
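To make the arithmetic concrete, a tiny standalone illustration of the numbers above (an illustration only, not driver code):

#include <stdio.h>
#include <stdint.h>

/* A 16K CPU dummy page at 0x8e004000 covers four 4K GPU pages, but
 * the fault-default register only names the first one (addr >> 12). */
int main(void)
{
	uint64_t dummy = 0x8e004000ULL;
	unsigned gpu_page = 0x1000;   /* 4K GPU page */
	unsigned cpu_page = 0x4000;   /* 16K CPU page */
	unsigned i;

	for (i = 0; i < cpu_page / gpu_page; i++)
		printf("GART entry %u -> 0x%08llx\n", i,
		       (unsigned long long)(dummy + i * gpu_page));

	printf("VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR = 0x%05llx\n",
	       (unsigned long long)(dummy >> 12));
	return 0;
}

This prints the four GART addresses 0x8e004000 through 0x8e007000 and the register value 0x8e004; only the first 4K sub-page lines up with what the register names.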
Cheers, Jerome
Huacai Chen
On Wed, 2012-02-29 at 12:49 +0800, chenhc@lemote.com wrote:
Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page is at 0x8e004000; then there are four distinct addresses in the GART: 0x8e004000, 0x8e005000, 0x8e006000 and 0x8e007000. The value written to VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000 >> 12). I don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
When radeon_gart_unbind() initializes the GART entries to point to the dummy page, it's just to have something safe in the GART table.

VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when a fault happens. It's like a sandbox for the MC. It doesn't conflict in any way to have GART table entries point to the same page.
Cheers, Jerome