When "MC timeout" happens at GPU reset, we found the 12th and 13th bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these two bits are like this: #define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1) #define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)
Could you please tell me what does they mean? And if possible, I want to know the functionalities of these 5 registers in detail: #define R_000E60_SRBM_SOFT_RESET 0x0E60 #define R_000E50_SRBM_STATUS 0x0E50 #define R_008020_GRBM_SOFT_RESET 0x8020 #define R_008010_GRBM_STATUS 0x8010 #define R_008014_GRBM_STATUS2 0x8014
A bit more info: If I reset the MC after resetting CP (this is what Linux-2.6.34 does, but removed since 2.6.35), then "MC timeout" will disappear, but there is still "ring test failed".
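For reference, a stand-alone sketch of how those two macros decode the dumped status value (0x20023040 is the SRBM_STATUS value from the dmesg quoted later in this mail):

#include <stdio.h>

/* The two macros quoted above, from r600d.h. */
#define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)

int main(void)
{
	/* SRBM_STATUS value dumped during the soft reset in the log below. */
	unsigned int srbm_status = 0x20023040;

	printf("MCDX_BUSY=%u MCDW_BUSY=%u\n",
	       G_000E50_MCDX_BUSY(srbm_status),
	       G_000E50_MCDW_BUSY(srbm_status));
	return 0;
}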
Huacai Chen
2011/11/8 chenhc@lemote.com:
And I want to know something: 1. Does the GPU use the MC to access the GTT?
Yes. All GPU clients (display, 3D, etc.) go through the MC to access memory (vram or gart).
2. What can cause an MC timeout?
Lots of things. Some GPU client still active, some GPU client hung or not properly initialized.
Alex
Hi,
Some status update. On 2011/9/29 at 5:17 PM, Chen Jie chenj@lemote.com wrote:
Hi. Adding more information: we occasionally get a "GPU lockup" after resuming from suspend (on a mipsel platform with a MIPS64-compatible CPU and an rs780e; the kernel is 3.1.0-rc8, 64-bit). Related kernel messages:
/* return from STR */
[ 156.152343] radeon 0000:01:05.0: WB enabled
[ 156.187500] [drm] ring test succeeded in 0 usecs
[ 156.187500] [drm] ib test succeeded in 0 usecs
[ 156.398437] ata2: SATA link down (SStatus 0 SControl 300)
[ 156.398437] ata3: SATA link down (SStatus 0 SControl 300)
[ 156.398437] ata4: SATA link down (SStatus 0 SControl 300)
[ 156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 156.597656] ata1.00: configured for UDMA/133
[ 156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
[ 157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
[ 157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
[ 157.683593] r8169 0000:02:00.0: eth0: link up
[ 165.621093] PM: resume of devices complete after 9679.556 msecs
[ 165.628906] Restarting tasks ... done.
[ 177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than 10019msec
[ 177.089843] ------------[ cut here ]------------
[ 177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x25c/0x33c()
[ 177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
[ 177.113281] Modules linked in: psmouse serio_raw
[ 177.117187] Call Trace:
[ 177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
[ 177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
[ 177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
[ 177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
[ 177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
[ 177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
[ 177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
[ 177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
[ 177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
[ 177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
[ 177.179687] ---[ end trace 92f63d998efe4c6d ]---
[ 177.187500] radeon 0000:01:05.0: GPU softreset
[ 177.191406] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xF57C2030
[ 177.195312] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00111103
[ 177.203125] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x20023040
[ 177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 177.367187] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
[ 177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
[ 177.414062] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA0003030
[ 177.417968] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
[ 177.425781] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x2002B040
[ 177.433593] radeon 0000:01:05.0: GPU reset succeed
[ 177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 177.804687] radeon 0000:01:05.0: WB enabled
[ 178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xCAFEDEAD)
After pinning the ring in VRAM, it warned of an IB test failure. It seems something is wrong with accessing memory through the GTT.
We dumped the GART table just after stopping the CP, compared it with the one dumped just after r600_pcie_gart_enable(), and didn't find any difference.
Any idea?
[ 178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
[ 178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(5).
[ 178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
[ 179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(6).
...
Regards, -- Chen Jie
2011/12/7 chenhc@lemote.com:
When "MC timeout" happens at GPU reset, we found the 12th and 13th bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these two bits are like this: #define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1) #define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)
Could you please tell me what does they mean? And if possible,
They refer to sub-blocks in the memory controller. I don't really know offhand what the names mean.
I want to know the functionality of these 5 registers in detail:
#define R_000E60_SRBM_SOFT_RESET 0x0E60
#define R_000E50_SRBM_STATUS 0x0E50
#define R_008020_GRBM_SOFT_RESET 0x8020
#define R_008010_GRBM_STATUS 0x8010
#define R_008014_GRBM_STATUS2 0x8014
A bit more info: If I reset the MC after resetting the CP (this is what Linux 2.6.34 does, but it was removed in 2.6.35), then the "MC timeout" disappears, but there is still a "ring test failed".
The bits are defined in r600d.h. As to the acronyms:
BIF - Bus InterFace
CG - clocks
DC - Display Controller
GRBM - Graphics block (3D engine)
HDP - Host Data Path (CPU access to vram via the PCI BAR)
IH, RLC - Interrupt controller
MC - Memory controller
ROM - ROM
SEM - semaphore controller
When you reset the MC, you will probably have to reset just about everything else, since most blocks depend on the MC for access to memory. If you do reset the MC, you should do it prior to calling asic_init so that all the hw gets re-initialized properly. Additionally, you should probably reset the GRBM, either via SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
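A very rough sketch of that ordering, purely for illustration (the real reset path is more involved; SOFT_RESET_MC_BIT below is a placeholder, not a name from r600d.h, while S_008020_SOFT_RESET_CP() and WREG32/RREG32 are the usual driver helpers):

/* Sketch only: put the graphics block and the MC into soft reset, release
 * them, and only then re-run ASIC init so every block that depends on the
 * MC is re-programmed. */
#include <linux/delay.h>
#include "radeon.h"
#include "r600d.h"

#define SOFT_RESET_MC_BIT	(1 << 11)	/* placeholder value */

static void example_mc_soft_reset(struct radeon_device *rdev)
{
	/* Graphics block (CP etc.) first... */
	WREG32(R_008020_GRBM_SOFT_RESET, S_008020_SOFT_RESET_CP(1));
	RREG32(R_008020_GRBM_SOFT_RESET);
	udelay(50);

	/* ...then the MC via SRBM. */
	WREG32(R_000E60_SRBM_SOFT_RESET, SOFT_RESET_MC_BIT);
	RREG32(R_000E60_SRBM_SOFT_RESET);
	udelay(50);

	/* Release the resets. */
	WREG32(R_000E60_SRBM_SOFT_RESET, 0);
	WREG32(R_008020_GRBM_SOFT_RESET, 0);
	RREG32(R_008020_GRBM_SOFT_RESET);

	/* Only after this, re-run the ASIC init tables (the atombios
	 * asic_init path on resume) so the MC and the rest of the hw are
	 * set up again. */
}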
Alex
Hi,
Status update about the problem 'Occasionally "GPU lockup" after resuming from suspend.'
First, this can happen when the system returns from either STR (suspend to RAM) or STD (suspend to disk, aka hibernation). When returning from STD, the initialization process is very similar to a normal boot. Standby is OK; it is similar to STR, except that standby does not shut down the power of the CPU, GPU, etc.
We've dumped and compared the registers, and found something:
CP_STAT
  normal value: 0x00000000
  value when this problem occurred: 0x802100C1 or 0x802300C1
CP_ME_CNTL
  normal value: 0x000000FF
  value when this problem occurred: always 0x200000FF in our test
Questions: According to the manual, CP_STAT = 0x802100C1 means:
CSF_RING_BUSY (bit 0): The Ring fetcher still has command buffer data to fetch, or the PFP still has data left to process from the reorder queue.
CSF_BUSY (bit 6): The input FIFOs have command buffers to fetch, or one or more of the fetchers are busy, or the arbiter has a request to send to the MIU.
MIU_RDREQ_BUSY (bit 7): The read path logic inside the MIU is busy.
MEQ_BUSY (bit 16): The PFP-to-ME queue has valid data in it.
SURFACE_SYNC_BUSY (bit 21): The Surface Sync unit is busy.
CP_BUSY (bit 31): Any block in the CP is busy.
What does it suggest?
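For quick decoding, a small stand-alone sketch using the bit positions listed above (taken from the manual excerpt, not from driver headers):

#include <stdio.h>

static const struct { int bit; const char *name; } cp_stat_bits[] = {
	{  0, "CSF_RING_BUSY" },
	{  6, "CSF_BUSY" },
	{  7, "MIU_RDREQ_BUSY" },
	{ 16, "MEQ_BUSY" },
	{ 21, "SURFACE_SYNC_BUSY" },
	{ 31, "CP_BUSY" },
};

int main(void)
{
	unsigned int cp_stat = 0x802100C1;	/* value seen when the problem occurs */
	unsigned int i;

	for (i = 0; i < sizeof(cp_stat_bits) / sizeof(cp_stat_bits[0]); i++)
		if (cp_stat & (1u << cp_stat_bits[i].bit))
			printf("%s is set\n", cp_stat_bits[i].name);
	return 0;
}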
What does it mean if bit 29 of CP_ME_CNTL is set?
BTW, how does the dummy page work in GART?
Regards, -- Chen Jie
On Wed, Feb 15, 2012 at 05:32:35PM +0800, Chen Jie wrote:
Hi,
Status update about the problem 'Occasionally "GPU lockup" after resuming from suspend.'
First, this can happen when the system returns from either STR (suspend to RAM) or STD (suspend to disk, aka hibernation). When returning from STD, the initialization process is very similar to a normal boot. Standby is OK; it is similar to STR, except that standby does not shut down the power of the CPU, GPU, etc.
We've dumped and compared the registers, and found something:
CP_STAT
  normal value: 0x00000000
  value when this problem occurred: 0x802100C1 or 0x802300C1
CP_ME_CNTL
  normal value: 0x000000FF
  value when this problem occurred: always 0x200000FF in our test
Questions: According to the manual, CP_STAT = 0x802100C1 means:
CSF_RING_BUSY (bit 0): The Ring fetcher still has command buffer data to fetch, or the PFP still has data left to process from the reorder queue.
CSF_BUSY (bit 6): The input FIFOs have command buffers to fetch, or one or more of the fetchers are busy, or the arbiter has a request to send to the MIU.
MIU_RDREQ_BUSY (bit 7): The read path logic inside the MIU is busy.
MEQ_BUSY (bit 16): The PFP-to-ME queue has valid data in it.
SURFACE_SYNC_BUSY (bit 21): The Surface Sync unit is busy.
CP_BUSY (bit 31): Any block in the CP is busy.
What does it suggest?
What does it mean if bit 29 of CP_ME_CNTL is set?
BTW, how does the dummy page work in GART?
Regards, -- Chen Jie
To me it looks like the CP is trying to fetch memory but the GPU memory controller fails to fulfill the CP's request. Did you check the PCI configuration before & after (when things don't work)? My best guess is that PCI bus mastering is not working properly, or that the PCIe GPU GART table has wrong data.
Maybe one needs to drop bus mastering and re-enable it to work around some bug...
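A minimal sketch of that idea using the standard PCI helpers (pdev here is assumed to be the radeon device's struct pci_dev, i.e. rdev->pdev in the driver):

#include <linux/pci.h>
#include <linux/delay.h>

/* Sketch: clear the Bus Master bit in PCI_COMMAND and set it again, to see
 * whether a fresh enable unsticks DMA. */
static void toggle_bus_master(struct pci_dev *pdev)
{
	pci_clear_master(pdev);
	msleep(1);
	pci_set_master(pdev);
}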
Cheers, Jerome
Hi,
On 2012/2/15 at 11:53 PM, Jerome Glisse j.glisse@gmail.com wrote:
To me it looks like the CP is trying to fetch memory but the GPU memory controller fails to fulfill the CP's request. Did you check the PCI configuration before & after (when things don't work)? My best guess is that PCI bus mastering is not working properly, or that the PCIe GPU GART table has wrong data.
Maybe one needs to drop bus mastering and re-enable it to work around some bug...
Thanks for your suggestion. We've tried the 'drop and re-enable bus master' trick; unfortunately it doesn't work. The PCI configuration comparison will be done later.
Some additional information: the "GPU lockup" always seems to occur after tasks are restarted -- we inserted more ring tests, and none of them failed before the tasks were restarted.
BTW, I hacked the GART table to try to simulate the problem:
1. Changed the system memory address (bus address) of ring_obj to an arbitrary value, e.g. 0 or 128M.
2. Changed the system memory address of a BO in radeon_test to an arbitrary value, e.g. 0.
Neither of the above led to a GPU lockup: point 1 rendered a black screen; with point 2, only the test itself failed.
Any idea?
Regards, -- Chen Jie
On 2012/2/16 at 5:21 PM, Chen Jie chenj@lemote.com wrote:
Hi,
On 2012/2/15 at 11:53 PM, Jerome Glisse j.glisse@gmail.com wrote:
To me it looks like the CP is trying to fetch memory but the GPU memory controller fails to fulfill the CP's request. Did you check the PCI configuration before & after (when things don't work)? My best guess is that PCI bus mastering is not working properly, or that the PCIe GPU GART table has wrong data.
Maybe one needs to drop bus mastering and re-enable it to work around some bug...
Thanks for your suggestion. We've tried the 'drop and re-enable bus master' trick; unfortunately it doesn't work. The PCI configuration comparison will be done later.
Update: We've checked the first 64 bytes of PCI configuration space before & after, and didn't find any difference.
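For reference, a sketch of how such a dump can be taken from the driver side, so snapshots from before suspend and after resume can be diffed (the output format is just illustrative):

#include <linux/pci.h>

/* Sketch: dump the first 64 bytes of PCI config space. */
static void dump_pci_config(struct pci_dev *pdev)
{
	u32 val;
	int off;

	for (off = 0; off < 64; off += 4) {
		pci_read_config_dword(pdev, off, &val);
		dev_info(&pdev->dev, "config[0x%02x] = 0x%08x\n", off, val);
	}
}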
Regards, -- Chen Jie
Hi,
Status update: today we tried to analyze the GPU instruction stream at lockup time. The lockup always occurs after tasks are restarted, so the related instructions should reside in an IB, as indicated by dmesg:
[ 2456.585937] GPU lockup (waiting for 0x0002F98B last fence id 0x0002F98A)
Print instructions in related ib:
[ 2462.492187] PM4 block 10 has 115 instructions, with fence seq 2f98b
....
[ 2462.976562] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2462.984375] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2462.988281] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2462.992187] Type3:PACKET3_SET_ALU_CONST ref_addr <not interpreted>
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
[ 2463.003906] Type3:PACKET3_SET_RESOURCE ref_addr <not interpreted>
[ 2463.007812] Type3:PACKET3_SET_CONFIG_REG ref_addr <not interpreted>
[ 2463.011718] Type3:PACKET3_INDEX_TYPE ref_addr <not interpreted>
[ 2463.015625] Type3:PACKET3_NUM_INSTANCES ref_addr <not interpreted>
[ 2463.019531] Type3:PACKET3_DRAW_INDEX_AUTO ref_addr <not interpreted>
[ 2463.027343] Type3:PACKET3_EVENT_WRITE ref_addr <not interpreted>
[ 2463.031250] Type3:PACKET3_SET_CONFIG_REG ref_addr <not interpreted>
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680
[ 2463.039062] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2463.046875] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2463.050781] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2463.054687] Type3:PACKET3_SET_BOOL_CONST ref_addr <not interpreted>
[ 2463.062500] Type3:PACKET3_SURFACE_SYNC ref_addr 10668e
CP_COHER_BASE was 0x0018C880, so the instruction which caused the lockup should be within this range:
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
...
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680
Here, only SURFACE_SYNC, SET_RESOURCE and EVENT_WRITE will access GPU memory. We guess it may be SURFACE_SYNC?
BTW, when the lockup happens, if the CP ring is placed in VRAM, ring_test passes but ib_test fails -- which suggests the ME fails to feed the CP during the lockup? Could an earlier SURFACE_SYNC block the MC?
P.S. In today's debugging we hacked the driver to place the CP ring, IB and IH in VRAM and to disable WB (radeon_no_wb=1).
Any idea?
Regards, -- Chen Jie
On Thu, Feb 16, 2012 at 05:21:10PM +0800, Chen Jie wrote:
Hi,
On 2012/2/15 at 11:53 PM, Jerome Glisse j.glisse@gmail.com wrote:
To me it looks like the CP is trying to fetch memory but the GPU memory controller fails to fulfill the CP's request. Did you check the PCI configuration before & after (when things don't work)? My best guess is that PCI bus mastering is not working properly, or that the PCIe GPU GART table has wrong data.
Maybe one needs to drop bus mastering and re-enable it to work around some bug...
Thanks for your suggestion. We've tried the 'drop and re-enable bus master' trick; unfortunately it doesn't work. The PCI configuration comparison will be done later.
Some additional information: the "GPU lockup" always seems to occur after tasks are restarted -- we inserted more ring tests, and none of them failed before the tasks were restarted.
BTW, I hacked the GART table to try to simulate the problem:
1. Changed the system memory address (bus address) of ring_obj to an arbitrary value, e.g. 0 or 128M.
2. Changed the system memory address of a BO in radeon_test to an arbitrary value, e.g. 0.
Neither of the above led to a GPU lockup: point 1 rendered a black screen; with point 2, only the test itself failed.
Any idea?
OK, let's start from the beginning. I'm convinced it's related to the GPU memory controller failing to fulfill some request that hits system memory. So in another mail you wrote:
BTW, I found that radeon_gart_bind() calls pci_map_page(), which hooks into swiotlb_map_page() on our platform; that seems to allocate and return the dma_addr_t of a new page from the bounce pool if the original page does not meet the dma_mask. This seems like a bug, since the BO is backed by one set of pages, but what is mapped into the GART is another set of pages?
Is this still the case? As this is obviously wrong, we fixed that recently. What drm code are you using? The rs780 dma mask is something like 40 bits iirc, so you should never have an issue on your system with 1G of memory, right?
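For what it's worth, a small sketch of how one could re-check whether swiotlb is bouncing pages (illustrative only; it is meaningful only on a platform with a 1:1 bus/physical mapping, and is not code from the radeon driver):

#include <linux/pci.h>
#include <linux/mm.h>

/* Sketch: detect whether pci_map_page() bounced a page through swiotlb by
 * comparing the returned bus address with the page's physical address. */
static bool page_was_bounced(struct pci_dev *pdev, struct page *page)
{
	bool bounced;
	dma_addr_t dma = pci_map_page(pdev, page, 0, PAGE_SIZE,
				      PCI_DMA_BIDIRECTIONAL);

	if (pci_dma_mapping_error(pdev, dma))
		return true;	/* mapping failed outright */

	bounced = (dma != ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT));
	pci_unmap_page(pdev, dma, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
	return bounced;
}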
If you have an iommu, what happens on resume? Are all pages previously mapped with pci_map_page() still valid?
One good way to test the GART is to go over the GPU GART table and, using the GPU, write a dword at the end of each page -- something like 0xCAFEDEAD or some value that is unlikely to be already set. Then go over all the pages and check that the GPU writes succeeded. Abusing the scratch register write-back feature is the easiest way to try that.
Cheers, Jerome
On 2012/2/17 at 12:32 AM, Jerome Glisse j.glisse@gmail.com wrote:
OK, let's start from the beginning. I'm convinced it's related to the GPU memory controller failing to fulfill some request that hits system memory. So in another mail you wrote:
BTW, I found that radeon_gart_bind() calls pci_map_page(), which hooks into swiotlb_map_page() on our platform; that seems to allocate and return the dma_addr_t of a new page from the bounce pool if the original page does not meet the dma_mask. This seems like a bug, since the BO is backed by one set of pages, but what is mapped into the GART is another set of pages?
Is this still the case? As this is obviously wrong, we fixed that recently. What drm code are you using? The rs780 dma mask is something like 40 bits iirc, so you should never have an issue on your system with 1G of memory, right?
Right.
If you have an iommu, what happens on resume? Are all pages previously mapped with pci_map_page() still valid?
The physical address is directly mapped to the bus address, so the iommu does nothing on resume; the pages should still be valid?
One good way to test the GART is to go over the GPU GART table and, using the GPU, write a dword at the end of each page -- something like 0xCAFEDEAD or some value that is unlikely to be already set. Then go over all the pages and check that the GPU writes succeeded. Abusing the scratch register write-back feature is the easiest way to try that.
I'm planning to add a GART table check procedure on resume, which will go over the GPU GART table:
1. read (back up) the dword at the end of each GPU page
2. write a mark via the GPU and check it
3. restore the original dword
Hopefully this can be of some help.
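A rough sketch of what such a check could look like (this is not the actual validateGART.patch mentioned in the follow-up mail; gpu_write_dword() is a hypothetical helper standing in for the GPU-side write, e.g. the scratch write-back trick Jerome suggested, and CPU-side cache effects are ignored):

#include <linux/mm.h>
#include "radeon.h"

#define GART_CHECK_MARK 0xDEADBEEF

static int example_gart_table_check(struct radeon_device *rdev)
{
	unsigned i;

	for (i = 0; i < rdev->gart.num_cpu_pages; i++) {
		u32 *cpu_ptr;
		u64 gpu_addr;
		u32 saved;

		if (!rdev->gart.pages[i])
			continue;	/* entry points at the dummy page */

		/* Last dword of this page, via the CPU mapping and via the
		 * GART aperture. */
		cpu_ptr = (u32 *)((u8 *)page_address(rdev->gart.pages[i]) +
				  PAGE_SIZE - 4);
		gpu_addr = rdev->mc.gtt_start + (u64)i * PAGE_SIZE +
			   PAGE_SIZE - 4;

		saved = *cpu_ptr;				/* 1. back up */
		gpu_write_dword(rdev, gpu_addr, GART_CHECK_MARK); /* 2. hypothetical GPU write */
		if (*cpu_ptr != GART_CHECK_MARK) {
			DRM_ERROR("GART check failed at page %u\n", i);
			return -EIO;
		}
		*cpu_ptr = saved;				/* 3. restore */
	}
	return 0;
}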
On 2012/2/17 at 5:27 PM, Chen Jie chenj@lemote.com wrote:
One good way to test the GART is to go over the GPU GART table and, using the GPU, write a dword at the end of each page -- something like 0xCAFEDEAD or some value that is unlikely to be already set. Then go over all the pages and check that the GPU writes succeeded. Abusing the scratch register write-back feature is the easiest way to try that.
I'm planning to add a GART table check procedure on resume, which will go over the GPU GART table:
- read (back up) the dword at the end of each GPU page
- write a mark via the GPU and check it
- restore the original dword
The attached validateGART.patch does the job:
* It currently only works on the mips64 platform.
* To use it, apply all_in_vram.patch first, which will allocate the CP ring, IH and IB in VRAM and hard-code no_wb=1.
The GART test routine is invoked in r600_resume. We've tried it, and found that when the lockup happened, the GART table was still good before userspace restarted. The related dmesg follows:
[ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table at 9000000040040000, 32768 entries, Dummy Page[0x000000000e004000-0x000000000e007fff]
[ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768 entries(valid=8544, invalid=24224, total=32768).
...
[ 1531.156250] PM: resume of devices complete after 9396.588 msecs
[ 1532.152343] Restarting tasks ... done.
[ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than 10003msec
[ 1544.472656] ------------[ cut here ]------------
[ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243 radeon_fence_wait+0x25c/0x314()
[ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
...
[ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.062500] radeon 0000:01:05.0: WB disabled
[ 1545.097656] [drm] ring test succeeded in 0 usecs
[ 1545.105468] [drm] ib test succeeded in 0 usecs
[ 1545.109375] [drm] Enabling audio support
[ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table at 9000000040040000, 32768 entries, Dummy Page[0x000000000e004000-0x000000000e007fff]
[ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0: unexpected value 0x745aaad1(expect 0xDEADBEEF) entry=0x000000000e008067, orignal=0x745aaad1
...
/* System blocked here. */
Any idea?
BTW, we found the following in r600_pcie_gart_enable() (drivers/gpu/drm/radeon/r600.c):
WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR, (u32)(rdev->dummy_page.addr >> 12));
On our platform PAGE_SIZE is 16K; does that cause any problem?
Also in radeon_gart_unbind() and radeon_gart_restore(), the logic should change to:
 for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
 	radeon_gart_set_page(rdev, t, page_base);
-	page_base += RADEON_GPU_PAGE_SIZE;
+	if (page_base != rdev->dummy_page.addr)
+		page_base += RADEON_GPU_PAGE_SIZE;
 }
???
Regards, -- Chen Jie
Hi,
For this occasional GPU lockup when returning from STR/STD, I found the following (when the problem happens):
The value of SRBM_STATUS is either 0x20002040 or 0x20003040, which means:
* HI_RQ_PENDING (there is a HI/BIF request pending in the SRBM)
* MCDW_BUSY (Memory Controller block is busy)
* BIF_BUSY (Bus Interface is busy)
* MCDX_BUSY (Memory Controller block is busy), if the value is 0x20003040
Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the relationship among GART-mapped memory, on-board video memory, and MCDX/MCDW?
CP_STAT: the CSF_RING_BUSY is always set.
There are many CP_PACKET2 (0x80000000) entries in the CP ring (more than three hundred), e.g.:
r[131800]=0x00028000
r[131801]=0xc0016800
r[131802]=0x00000140
r[131803]=0x000079c5
r[131804]=0x0000304a
r[131805] ...
r[132143]=0x80000000
r[132144]=0xffff0000
After the first reset the GPU locks up again; this time there are typically 320 dwords in the CP ring -- 319 CP_PACKET2 with 0xc0033d00 at the end. Is this normal?
BTW, is there any way for X to switch to NoAccel mode when the problem happens? That way users would have a chance to save their documents and then reboot the machine.
Regards, -- Chen Jie
On Mon, 2012-02-27 at 10:44 +0800, Chen Jie wrote:
Hi,
For this occasional GPU lockup when returning from STR/STD, I found the following (when the problem happens):
The value of SRBM_STATUS is either 0x20002040 or 0x20003040, which means:
- HI_RQ_PENDING (there is a HI/BIF request pending in the SRBM)
- MCDW_BUSY (Memory Controller block is busy)
- BIF_BUSY (Bus Interface is busy)
- MCDX_BUSY (Memory Controller block is busy), if the value is 0x20003040
Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the relationship among GART-mapped memory, on-board video memory, and MCDX/MCDW?
CP_STAT: the CSF_RING_BUSY is always set.
Once the memory controller fails to do a PCI transaction, the CP will be stuck -- at least if the ring is in system memory. If the ring is in vram, the CP might be stuck too, because everything goes through the MC anyway.
There are many CP_PACKET2 (0x80000000) entries in the CP ring (more than three hundred), e.g.:
r[131800]=0x00028000
r[131801]=0xc0016800
r[131802]=0x00000140
r[131803]=0x000079c5
r[131804]=0x0000304a
r[131805] ...
r[132143]=0x80000000
r[132144]=0xffff0000
After the first reset the GPU locks up again; this time there are typically 320 dwords in the CP ring -- 319 CP_PACKET2 with 0xc0033d00 at the end. Is this normal?
BTW, is there any way for X to switch to NoAccel mode when the problem happens? That way users would have a chance to save their documents and then reboot the machine.
I have been meaning to patch the ddx to fall back to sw after a GPU lockup. But this is useless in today's world, where everything is composited, i.e. the screen is updated using the 3D driver, for which there is no easy way to suddenly migrate to software rendering. I will still probably do the ddx patch at some point.
Cheers, Jerome
On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
Any idea?
I know lockups are frustrating; my only idea is that the memory controller is locked up because of some failing pci <-> system ram transaction.
BTW, we found the following in r600_pcie_gart_enable() (drivers/gpu/drm/radeon/r600.c):
WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR, (u32)(rdev->dummy_page.addr >> 12));
On our platform PAGE_SIZE is 16K; does that cause any problem?
No this should be handled properly.
Also in radeon_gart_unbind() and radeon_gart_restore(), the logic should change to:
 for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
 	radeon_gart_set_page(rdev, t, page_base);
-	page_base += RADEON_GPU_PAGE_SIZE;
+	if (page_base != rdev->dummy_page.addr)
+		page_base += RADEON_GPU_PAGE_SIZE;
 }
???
No need to do so; the dummy page will be 16K too, so it's fine.
Cheers, Jerome
dri-devel@lists.freedesktop.org