Re: amdgpu display corruption and hang on AMD A10-9620P

15 Jun 2017


On Mon, Jun 12, 2017 at 12:24 PM, Carlo Caione <carlo@endlessm.com> wrote:
On Tue, May 9, 2017 at 7:03 PM, Deucher, Alexander <Alexander.Deucher@amd.com> wrote:
-----Original Message-----
From: Daniel Drake [mailto:drake@endlessm.com]
Sent: Tuesday, May 09, 2017 12:55 PM
To: dri-devel; amd-gfx@lists.freedesktop.org; Deucher, Alexander
Cc: Chris Chiu; Linux Upstreaming Team
Subject: amdgpu display corruption and hang on AMD A10-9620P
Hi,
We are working with new laptops based on this AMD SoC:
AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G
I think this is the Bristol Ridge chipset.
During boot, the display becomes unusable at the point where the
amdgpu driver loads. You can see at least two horizontal lines of
garbage at this point. We have reproduced this on 4.8, 4.10 and Linus'
master (early 4.12).
Photo: http://pasteboard.co/qrC9mh4p.jpg
Getting logs is tricky because the system appears to freeze at that point.
Is this a known issue? Is there anything we can do to help diagnose it?
I'm not aware of any specific issues.  Please file a bug and attach your logs (https://bugs.freedesktop.org) along with information about the system.
I opened https://bugs.freedesktop.org/show_bug.cgi?id=101387 to track
this bug and attached the full log we get when modprobing amdgpu.
For the sake of documentation, only the trace is reproduced here (the
full log is attached to the bug opened on freedesktop).
[   80.766937] ---[ end Kernel panic - not syncing: stack-protector:
Kernel stack is corrupted in: ffffffffc0c88942
[   80.766937]
[   80.766408] Kernel panic - not syncing: stack-protector: Kernel
stack is corrupted in: ffffffffc0c88942
[   80.766408]
[   80.766428] CPU: 1 PID: 1594 Comm: modprobe Not tainted 4.11.3+ #2
[   80.766431] Hardware name: Acer Aspire A515-41G/Wartortle_BS, BIOS
V0.09 04/19/2017
[   80.766434] Call Trace:
[   80.766445]  dump_stack+0x63/0x90
[   80.766451]  panic+0xe8/0x236
[   80.766526]  ? amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
[   80.766537]  __stack_chk_fail+0x1b/0x20
[   80.766571]  amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
[   80.766610]  dce_v11_0_hw_init+0x3e/0x2d0 [amdgpu]
[   80.766643]  amdgpu_device_init+0xe23/0x13c0 [amdgpu]
[   80.766647]  ? kmalloc_order+0x18/0x40
[   80.766650]  ? kmalloc_order_trace+0x24/0xa0
[   80.766683]  amdgpu_driver_load_kms+0x5d/0x240 [amdgpu]
[   80.766708]  drm_dev_register+0x148/0x1e0 [drm]
[   80.766721]  drm_get_pci_dev+0xa0/0x160 [drm]
[   80.766754]  amdgpu_pci_probe+0xb9/0xf0 [amdgpu]
[   80.766759]  local_pci_probe+0x45/0xa0
[   80.766762]  pci_device_probe+0xf4/0x150
[   80.766768]  driver_probe_device+0x2c5/0x470
[   80.766772]  __driver_attach+0xdf/0xf0
[   80.766776]  ? driver_probe_device+0x470/0x470
[   80.766780]  bus_for_each_dev+0x6c/0xc0
[   80.766784]  driver_attach+0x1e/0x20
[   80.766787]  bus_add_driver+0x45/0x270
[   80.766790]  ? 0xffffffffc09a8000
[   80.766794]  driver_register+0x60/0xe0
[   80.766796]  ? 0xffffffffc09a8000
[   80.766799]  __pci_register_driver+0x4c/0x50
[   80.766811]  drm_pci_init+0xed/0x100 [drm]
[   80.766816]  ? vga_switcheroo_register_handler+0x6c/0x90
[   80.766819]  ? 0xffffffffc09a8000
[   80.766850]  amdgpu_init+0x9b/0xac [amdgpu]
[   80.766855]  do_one_initcall+0x53/0x1c0
[   80.766860]  ? __vunmap+0x81/0xd0
[   80.766865]  ? kmem_cache_alloc_trace+0xdb/0x1b0
[   80.766868]  ? kfree+0x161/0x170
[   80.766876]  do_init_module+0x60/0x202
[   80.766881]  load_module+0x2612/0x29f0
[   80.766885]  SYSC_finit_module+0xa6/0xf0
[   80.766888]  ? SYSC_finit_module+0xa6/0xf0
[   80.766892]  SyS_finit_module+0xe/0x10
[   80.766896]  entry_SYSCALL_64_fastpath+0x1e/0xad
[   80.766899] RIP: 0033:0x7fa525e60709
[   80.766902] RSP: 002b:00007fff2f5bbbf8 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[   80.766905] RAX: ffffffffffffffda RBX: 00007fa526129760 RCX: 00007fa525e60709
[   80.766908] RDX: 0000000000000000 RSI: 000055f51f1c9439 RDI: 000000000000000b
[   80.766910] RBP: 0000000000000070 R08: 0000000000000000 R09: 000055f51fcd83f0
[   80.766913] R10: 000000000000000b R11: 0000000000000246 R12: 000055f51fcd9ff0
[   80.766915] R13: 0000000000000007 R14: 00007fa5261297b8 R15: 0000000000002710
[   80.766931] Kernel Offset: 0x22800000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   80.766937] ---[ end Kernel panic - not syncing: stack-protector:
Kernel stack is corrupted in: ffffffffc0c88942
I am moving this discussion here for more visibility. This is what
is happening.
In amdgpu_atombios_crtc_powergate_init() we declare
ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 args as the parameter space.
This is 32 bytes wide and is passed down to the atombios interpreter
in ctx->ps.
Calling amdgpu_atombios_crtc_powergate_init() triggers the
parsing of the command table with index == 13 [>> execute C5C0 (len
589, WS 0, PS 0)]. During the execution of this table several
CALL_TABLE (op == 82) opcodes are executed. In more detail, we first
jump to the table with index == 78 [>> execute F166 (len 588, WS 0,
PS 8)], then to the table with index == 51 [>> execute F446 (len 465,
WS 4, PS 4)] and to the table with index == 75 [>> execute F6CC (len
1330, WS 4, PS 0)] before finally reaching the EOT for table 13. By
the time we return to amdgpu_atombios_crtc_powergate_init() the stack
is already corrupted.
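As a reading aid for the interpreter debug output quoted above, here is a minimal sketch in C that pulls the offset, len, WS and PS fields out of one ">> execute" line, so that tables reporting PS 0 can be picked out of a long log. struct exec_info and parse_exec_line() are hypothetical names used only for this illustration, and the line format is assumed to be exactly the one shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields printed in a ">> execute" line. */
struct exec_info {
    unsigned int offset; /* hex value printed after "execute" */
    unsigned int len;    /* value printed after "len" */
    unsigned int ws;     /* value printed after "WS" */
    unsigned int ps;     /* value printed after "PS" */
};

/* Returns true and fills *out if line contains a well-formed
 * ">> execute XXXX (len N, WS N, PS N)" record, false otherwise. */
static bool parse_exec_line(const char *line, struct exec_info *out)
{
    const char *p;

    assert(line != NULL && out != NULL);

    /* Skip any prefix such as "[" or a timestamp. */
    p = strstr(line, ">> execute");
    if (p == NULL)
        return false;

    return sscanf(p, ">> execute %x (len %u, WS %u, PS %u)",
                  &out->offset, &out->len, &out->ws, &out->ps) == 4;
}
```

Feeding it the line for table 75 fills in offset 0xF6CC, len 1330, WS 4 and PS 0; a line without the ">> execute" marker is rejected.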
The corruption happens during the execution of the code in
table 75 [>> execute F6CC (len 1330, WS 4, PS 0)]. In this table a
MOVE_PS is executed with a destination index == 1, which accesses
ctx->ps[idx] and causes the stack corruption.
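To make the failure mode concrete, here is a minimal sketch, not the kernel's code, of a parameter-space store guarded by a length check. struct ps_view, ps_store() and ps_advance() are hypothetical names, and the idea that a called table sees the caller's space advanced by some number of 32-bit slots is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a parameter space: where it starts and how many
 * 32-bit slots the owner really allocated from that point on. */
struct ps_view {
    uint32_t *base;
    size_t    len;
};

/* Store val in slot idx only if that slot lies inside the view.
 * An unchecked base[idx] = val is the kind of access that can
 * overwrite whatever follows the buffer on the stack. */
static bool ps_store(struct ps_view *ps, size_t idx, uint32_t val)
{
    assert(ps != NULL);
    if (idx >= ps->len)
        return false;
    ps->base[idx] = val;
    return true;
}

/* View handed to a called table, assuming (for illustration only)
 * that it starts `shift` slots after the caller's view. The remaining
 * length shrinks accordingly and never goes below zero. */
static struct ps_view ps_advance(struct ps_view ps, size_t shift)
{
    if (shift > ps.len)
        shift = ps.len;
    ps.base += shift;
    ps.len  -= shift;
    return ps;
}
```

With a view that owns a single slot, a store to index 0 succeeds and a store to index 1 is rejected instead of landing on whatever follows the buffer.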
My first guess here is that something is wrong in the atombios code.
Table 75 has WS == 4 and PS == 0, and looking at the opcodes in the
table I basically have only *_WS opcodes (MOVE_WS, TEST_WS, ADD_WS,
etc...) and just two *_PS instructions (MOVE_PS and OR_PS) that (guess
what) are the instructions causing the stack corruption. So I suspect
that the *_PS opcodes in the atombios are wrong and that they should
actually be *_WS opcodes.
Another possibility is that the atombios interpreter is doing
something wrong. After a CALL_TABLE, don't we need to size the ps
allocation (ctx->ps) for the command table we are about to execute so
that it matches the ps size in that table's header? If I understand
the kernel code correctly, ctx->ps is not reallocated when we jump to
a different table.
Thanks,
-- 
Carlo Caione  |  +39.340.80.30.096  |  Endless
