On February 15, 2012 at 11:53 PM, Jerome Glisse j.glisse@gmail.com wrote:
To me it looks like the CP is trying to fetch memory but the GPU memory controller fails to fulfill the CP request. Did you check the PCI configuration before and after (when things don't work)? My best guess is that PCI bus mastering is not working properly, or the PCIe GPU GART table has wrong data.
Maybe one needs to drop bus master and re-enable bus master to work around some bug...
Thanks for your suggestion. We tried the 'drop and re-enable bus master' trick; unfortunately it doesn't work. The PCI configuration comparison will be done later.
Update: We checked the first 64 bytes of PCI configuration space before and after the lockup, and found no difference.
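For reference, the before/after comparison can be sketched roughly as below. This is only an illustration, not the tool we actually used: it assumes the two 64-byte headers have already been captured into buffers (for example from the device's sysfs config file), and the helper names are made up. The only hardware facts relied on are that the command register sits at offset 0x04 and that bit 2 of it is the bus master enable bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_HEADER_SIZE     64      /* standard configuration header */
#define PCI_COMMAND_OFFSET  0x04    /* 16-bit command register */
#define PCI_COMMAND_MASTER  0x0004  /* bit 2: bus master enable */

/* Hypothetical helper: offset of the first differing byte, or -1 if the
 * two 64-byte headers are identical. */
static int pci_header_first_diff(const uint8_t *before, const uint8_t *after)
{
    assert(before != NULL && after != NULL);
    for (size_t i = 0; i < PCI_HEADER_SIZE; i++) {
        if (before[i] != after[i])
            return (int)i;
    }
    return -1;
}

/* Hypothetical helper: 1 if bus mastering is enabled in the captured
 * header, 0 otherwise. Configuration space is little-endian. */
static int pci_bus_master_enabled(const uint8_t *hdr)
{
    assert(hdr != NULL);
    uint16_t cmd = (uint16_t)(hdr[PCI_COMMAND_OFFSET] |
                              (hdr[PCI_COMMAND_OFFSET + 1] << 8));
    return (cmd & PCI_COMMAND_MASTER) != 0;
}
```

An identical header only rules out changes in the standard header (command, status, BARs); anything in the capability area beyond the first 64 bytes is not covered by this check.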
Hi,
Status update: today we tried to analyze the GPU instruction stream at the time of the lockup. The lockup always occurs after tasks restart, so the related instructions should reside in the IB pointed to by dmesg:
[ 2456.585937] GPU lockup (waiting for 0x0002F98B last fence id 0x0002F98A)
Instructions printed from the related IB:
[ 2462.492187] PM4 block 10 has 115 instructions, with fence seq 2f98b
....
[ 2462.976562] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2462.984375] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2462.988281] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2462.992187] Type3:PACKET3_SET_ALU_CONST ref_addr <not interpreted>
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
[ 2463.003906] Type3:PACKET3_SET_RESOURCE ref_addr <not interpreted>
[ 2463.007812] Type3:PACKET3_SET_CONFIG_REG ref_addr <not interpreted>
[ 2463.011718] Type3:PACKET3_INDEX_TYPE ref_addr <not interpreted>
[ 2463.015625] Type3:PACKET3_NUM_INSTANCES ref_addr <not interpreted>
[ 2463.019531] Type3:PACKET3_DRAW_INDEX_AUTO ref_addr <not interpreted>
[ 2463.027343] Type3:PACKET3_EVENT_WRITE ref_addr <not interpreted>
[ 2463.031250] Type3:PACKET3_SET_CONFIG_REG ref_addr <not interpreted>
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680
[ 2463.039062] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2463.046875] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2463.050781] Type3:PACKET3_SET_CONTEXT_REG ref_addr <not interpreted>
[ 2463.054687] Type3:PACKET3_SET_BOOL_CONST ref_addr <not interpreted>
[ 2463.062500] Type3:PACKET3_SURFACE_SYNC ref_addr 10668e
CP_COHER_BASE was 0x0018C880, so the instruction that caused the lockup should be somewhere between these two:
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
...
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680
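To make the matching step concrete, here is a rough sketch of how the SURFACE_SYNC packet carrying a given CP_COHER_BASE could be located in a dumped IB. It is not our actual dump code. The header field layout follows the radeon driver's CP_PACKET_GET_* macros; the opcode value 0x43 and the payload order (CP_COHER_CNTL, CP_COHER_SIZE, CP_COHER_BASE, poll interval, with the base stored as address >> 8) are assumptions taken from how the driver emits this packet, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PM4 header fields: type in bits 31:30, count in 29:16, opcode in 15:8. */
#define PKT_TYPE(h)     (((h) >> 30) & 0x3)
#define PKT_COUNT(h)    (((h) >> 16) & 0x3FFF)   /* payload dwords minus one */
#define PKT3_OPCODE(h)  (((h) >> 8) & 0xFF)
#define PKT3(op, n)     ((3u << 30) | (((uint32_t)(n) & 0x3FFF) << 16) | \
                         (((uint32_t)(op) & 0xFF) << 8))

#define PKT3_SURFACE_SYNC 0x43   /* assumed opcode value (r600d.h) */

/* Hypothetical helper: walk an IB of ndw dwords and return the dword index
 * of the first SURFACE_SYNC header whose CP_COHER_BASE payload equals
 * coher_base. Returns -1 if none is found, or if a non-Type3 or truncated
 * packet is hit (this sketch handles Type3 packets only). */
static long find_surface_sync(const uint32_t *ib, size_t ndw, uint32_t coher_base)
{
    assert(ib != NULL);
    size_t i = 0;
    while (i < ndw) {
        uint32_t h = ib[i];
        size_t payload = PKT_COUNT(h) + 1;

        if (PKT_TYPE(h) != 3 || i + payload >= ndw)
            return -1;
        /* Assumed payload order: CNTL, SIZE, BASE, poll interval. */
        if (PKT3_OPCODE(h) == PKT3_SURFACE_SYNC && payload >= 3 &&
            ib[i + 3] == coher_base)
            return (long)i;
        i += 1 + payload;   /* skip header plus payload */
    }
    return -1;
}

/* CP_COHER_BASE is assumed to hold the GPU address in 256-byte units. */
static uint64_t coher_base_to_addr(uint32_t coher_base)
{
    return (uint64_t)coher_base << 8;
}
```

Under that assumption, the CP_COHER_BASE value 0x0018C880 read at lockup corresponds to GPU address 0x18C88000.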
Of the packets in that range, only SURFACE_SYNC, SET_RESOURCE and EVENT_WRITE access GPU memory. Our guess is that it may be SURFACE_SYNC?
BTW, when the lockup happens and the CP ring is placed in VRAM, ring_test passes but ib_test fails, which suggests the ME fails to feed the CP during the lockup? Could an earlier SURFACE_SYNC be blocking the MC?
P.S. For today's debugging we hacked the driver to place the CP ring, IB and IH in VRAM and disabled wb (radeon_no_wb=1).
Any ideas?
Regards, -- Chen Jie