On Mon, Jun 12, 2017 at 12:24 PM, Carlo Caione carlo@endlessm.com wrote:
On Tue, May 9, 2017 at 7:03 PM, Deucher, Alexander Alexander.Deucher@amd.com wrote:
-----Original Message-----
From: Daniel Drake [mailto:drake@endlessm.com]
Sent: Tuesday, May 09, 2017 12:55 PM
To: dri-devel; amd-gfx@lists.freedesktop.org; Deucher, Alexander
Cc: Chris Chiu; Linux Upstreaming Team
Subject: amdgpu display corruption and hang on AMD A10-9620P
Hi,
We are working with new laptops that have the AMD Bristol Ridge chipset with this SoC:
AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G
I think this is the Bristol Ridge chipset.
During boot, the display becomes unusable at the point where the amdgpu driver loads. You can see at least two horizontal lines of garbage at this point. We have reproduced this on 4.8, 4.10 and linus master (early 4.12).
Photo: http://pasteboard.co/qrC9mh4p.jpg
Getting logs is tricky because the system appears to freeze at that point.
Is this a known issue? Is there anything we can do to help diagnose it?
I'm not aware of any specific issues. Please file a bug and attach your logs (https://bugs.freedesktop.org) along with information about the system.
Opened https://bugs.freedesktop.org/show_bug.cgi?id=101387 to track this bug. I have also attached there the full log we get when modprobing amdgpu. Only the trace is reported here, for the sake of documentation.
[ 80.766937] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffc0c88942
[ 80.766937]
[ 80.766408] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffc0c88942
[ 80.766408]
[ 80.766428] CPU: 1 PID: 1594 Comm: modprobe Not tainted 4.11.3+ #2
[ 80.766431] Hardware name: Acer Aspire A515-41G/Wartortle_BS, BIOS V0.09 04/19/2017
[ 80.766434] Call Trace:
[ 80.766445] dump_stack+0x63/0x90
[ 80.766451] panic+0xe8/0x236
[ 80.766526] ? amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
[ 80.766537] __stack_chk_fail+0x1b/0x20
[ 80.766571] amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
[ 80.766610] dce_v11_0_hw_init+0x3e/0x2d0 [amdgpu]
[ 80.766643] amdgpu_device_init+0xe23/0x13c0 [amdgpu]
[ 80.766647] ? kmalloc_order+0x18/0x40
[ 80.766650] ? kmalloc_order_trace+0x24/0xa0
[ 80.766683] amdgpu_driver_load_kms+0x5d/0x240 [amdgpu]
[ 80.766708] drm_dev_register+0x148/0x1e0 [drm]
[ 80.766721] drm_get_pci_dev+0xa0/0x160 [drm]
[ 80.766754] amdgpu_pci_probe+0xb9/0xf0 [amdgpu]
[ 80.766759] local_pci_probe+0x45/0xa0
[ 80.766762] pci_device_probe+0xf4/0x150
[ 80.766768] driver_probe_device+0x2c5/0x470
[ 80.766772] __driver_attach+0xdf/0xf0
[ 80.766776] ? driver_probe_device+0x470/0x470
[ 80.766780] bus_for_each_dev+0x6c/0xc0
[ 80.766784] driver_attach+0x1e/0x20
[ 80.766787] bus_add_driver+0x45/0x270
[ 80.766790] ? 0xffffffffc09a8000
[ 80.766794] driver_register+0x60/0xe0
[ 80.766796] ? 0xffffffffc09a8000
[ 80.766799] __pci_register_driver+0x4c/0x50
[ 80.766811] drm_pci_init+0xed/0x100 [drm]
[ 80.766816] ? vga_switcheroo_register_handler+0x6c/0x90
[ 80.766819] ? 0xffffffffc09a8000
[ 80.766850] amdgpu_init+0x9b/0xac [amdgpu]
[ 80.766855] do_one_initcall+0x53/0x1c0
[ 80.766860] ? __vunmap+0x81/0xd0
[ 80.766865] ? kmem_cache_alloc_trace+0xdb/0x1b0
[ 80.766868] ? kfree+0x161/0x170
[ 80.766876] do_init_module+0x60/0x202
[ 80.766881] load_module+0x2612/0x29f0
[ 80.766885] SYSC_finit_module+0xa6/0xf0
[ 80.766888] ? SYSC_finit_module+0xa6/0xf0
[ 80.766892] SyS_finit_module+0xe/0x10
[ 80.766896] entry_SYSCALL_64_fastpath+0x1e/0xad
[ 80.766899] RIP: 0033:0x7fa525e60709
[ 80.766902] RSP: 002b:00007fff2f5bbbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 80.766905] RAX: ffffffffffffffda RBX: 00007fa526129760 RCX: 00007fa525e60709
[ 80.766908] RDX: 0000000000000000 RSI: 000055f51f1c9439 RDI: 000000000000000b
[ 80.766910] RBP: 0000000000000070 R08: 0000000000000000 R09: 000055f51fcd83f0
[ 80.766913] R10: 000000000000000b R11: 0000000000000246 R12: 000055f51fcd9ff0
[ 80.766915] R13: 0000000000000007 R14: 00007fa5261297b8 R15: 0000000000002710
[ 80.766931] Kernel Offset: 0x22800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 80.766937] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffc0c88942
Trying to move this discussion here for more visibility. This is what is happening.
In amdgpu_atombios_crtc_powergate_init() we declare ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 args as the parameter space. This is 32 bits wide and is passed down to the atombios interpreter as ctx->ps.
When amdgpu_atombios_crtc_powergate_init() is called, it triggers the parsing of the command table with index == 13 [>> execute C5C0 (len 589, WS 0, PS 0)]. During the execution of this table several CALL_TABLE opcodes (op == 82) are executed. In more detail, we first jump to the table with index == 78 [>> execute F166 (len 588, WS 0, PS 8)], then to the table with index == 51 [>> execute F446 (len 465, WS 4, PS 4)] and to the table with index == 75 [>> execute F6CC (len 1330, WS 4, PS 0)], before finally reaching the EOT for table 13. At this point, when returning to amdgpu_atombios_crtc_powergate_init(), the stack is already corrupted.
The corruption happens during the execution of the code in table 75 [>> execute F6CC (len 1330, WS 4, PS 0)]. In this table a MOVE_PS is executed with a destination index == 1, which writes to ctx->ps[idx] and causes the stack corruption.
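To make the failure mode concrete, here is a minimal user-space sketch of a bounds-checked MOVE_PS-style store. It is not kernel code: struct ps_view, ps_store_checked() and example_store_to_index_1() are hypothetical names, and it assumes that ctx->ps is indexed in 32-bit units and that the caller's args cover only index 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the interpreter's view of the parameter space:
 * a pointer to 32-bit slots plus the number of slots the caller really owns. */
struct ps_view {
    uint32_t *slots;
    size_t    count;
};

/* Bounds-checked version of a "MOVE_PS dst_idx, value" style store.
 * Returns 0 on success, -1 if dst_idx lies outside the caller's buffer. */
static int ps_store_checked(struct ps_view *ps, size_t dst_idx, uint32_t value)
{
    assert(ps != NULL && ps->slots != NULL);
    if (dst_idx >= ps->count)
        return -1;  /* an unchecked store would write past the buffer here */
    ps->slots[dst_idx] = value;
    return 0;
}

/* Assumed scenario: the caller reserves a single 32-bit slot on its stack
 * and the table stores to index 1. Returns the result of that store. */
static int example_store_to_index_1(void)
{
    uint32_t args[1] = { 0 };
    struct ps_view ps = { args, sizeof(args) / sizeof(args[0]) };

    if (ps_store_checked(&ps, 0, 0x1u) != 0)  /* index 0 is in range */
        return 1;
    return ps_store_checked(&ps, 1, 0x2u);    /* index 1 is out of range */
}
```

Under these assumptions example_store_to_index_1() reports the store to index 1 as out of range. Without such a check the same store lands on whatever sits next to args on the caller's stack, which would be consistent with the stack-protector panic above.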
My first guess is that something is wrong in the atombios code. Table 75 has WS == 4 and PS == 0, and looking at the opcodes in the table I basically see only *_WS opcodes (MOVE_WS, TEST_WS, ADD_WS, etc...) and just two *_PS instructions (MOVE_PS and OR_PS) that (guess what) are the instructions causing the stack corruption. So I suspect the *_PS opcodes in the atombios are wrong and should actually be *_WS opcodes.
Another possibility is that the atombios interpreter is doing something wrong. After a CALL_TABLE, don't we need to allocate a parameter space (ctx->ps) for the command table we are about to execute, sized to match the ps size in that table's header? IIUC the code in the kernel, ctx->ps is not reallocated when we jump to a different table.
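As a rough illustration of why this matters, the sketch below estimates how many 32-bit slots the original caller's buffer would need if the interpreter does not reallocate but hands each callee a pointer into the same buffer, offset by the caller's declared PS size. That hand-off rule is an assumption for illustration and has not been checked against the kernel source. struct table_hdr, callee_ps_offset() and slots_needed() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table descriptor: only the PS size from the header, in bytes. */
struct table_hdr {
    unsigned ps_bytes;
};

/* Offset (in 32-bit slots) at which a callee's parameter space would start,
 * relative to its caller's, under the assumed hand-off rule
 * callee_ps = caller_ps + caller_ps_bytes / 4. */
static size_t callee_ps_offset(const struct table_hdr *caller)
{
    assert(caller != NULL);
    return caller->ps_bytes / 4;
}

/* Number of 32-bit slots the outermost buffer must provide so that the
 * innermost table of a nested call chain can touch slot innermost_idx.
 * chain[0] is the outermost table, chain[depth - 1] the innermost one. */
static size_t slots_needed(const struct table_hdr *chain, size_t depth,
                           size_t innermost_idx)
{
    size_t base = 0;

    for (size_t i = 0; i + 1 < depth; i++)
        base += callee_ps_offset(&chain[i]);
    return base + innermost_idx + 1;
}
```

For example, if table 13 (PS 0) called table 75 directly, a store to index 1 in table 75 would need 2 slots in the caller's buffer. If the calls were fully nested as 13 (PS 0), 78 (PS 8), 51 (PS 4), 75 (PS 0), it would need 5. The trace does not say which nesting applies, but in either case it is more than a single 32-bit slot.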
Thanks,