On Sat, Sep 07, 2019 at 12:05:34PM +0300, Alexander Kapshuk wrote:
To Whom It May Concern
Every kernel I have built since 5.3.0-rc2-next-20190730 and up to 5.3.0-rc7-next-20190903 has resulted in the kernel panic described below.
The panic occurs early on in the boot process, so no records of it get written on disk. I resourted to taking photos and videos to get the info for debugging.
[Kernel panic] Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
Kernel panic - Not syncing: Attempted to kill init! exitcode=0x0000000b.
Top of call stack: __drm_fb_helper_initial_config_and_unlock drm_fb_helper_initial_config
<scripts/decodecode <~/tmp/panic_code.txt Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff All code ======== 0: 00 48 83 add %cl,-0x7d(%rax) 3: bb f0 00 00 00 mov $0xf0,%ebx 8: 00 74 16 48 add %dh,0x48(%rsi,%rdx,1) c: 83 c3 18 add $0x18,%ebx f: b9 17 00 00 00 mov $0x17,%ecx 14: 31 c0 xor %eax,%eax 16: 48 89 df mov %rbx,%rdi 19: f3 48 ab rep stos %rax,%es:(%rdi) 1c: 5b pop %rbx 1d: 41 5c pop %r12 1f: 5d pop %rbp 20: c3 retq 21: 4c 89 a3 f0 00 00 00 mov %r12,0xf0(%rbx) 28: eb e1 jmp 0xb 2a:* 0f 0b ud2 <-- trapping instruction 2c: 0f 1f 40 00 nopl 0x0(%rax) 30: 55 push %rbp 31: 48 89 e5 mov %rsp,%rbp 34: 41 54 push %r12 36: 49 89 d4 mov %rdx,%r12 39: 53 push %rbx 3a: 48 89 f3 mov %rsi,%rbx 3d: e8 .byte 0xe8 3e: 7e ff jle 0x3f
Code starting with the faulting instruction
0: 0f 0b ud2 2: 0f 1f 40 00 nopl 0x0(%rax) 6: 55 push %rbp 7: 48 89 e5 mov %rsp,%rbp a: 41 54 push %r12 c: 49 89 d4 mov %rdx,%r12 f: 53 push %rbx 10: 48 89 f3 mov %rsi,%rbx 13: e8 .byte 0xe8 14: 7e ff jle 0x15
The panic occurs after the 'Driver supports precise vblank timestamp query.' line gets printed to console: [ 2.858970] Linux agpgart interface v0.103 [ 2.859308] nouveau 0000:01:00.0: NVIDIA G84 (084300a2) [ 2.968950] nouveau 0000:01:00.0: bios: version 60.84.68.00.19 [ 2.989923] nouveau 0000:01:00.0: bios: M0203T not found [ 2.990010] nouveau 0000:01:00.0: bios: M0203E not matched! [ 2.990096] nouveau 0000:01:00.0: fb: 512 MiB DDR2 [ 3.062362] [TTM] Zone kernel: Available graphics memory: 2015014 KiB [ 3.062494] [TTM] Initializing pool allocator [ 3.062581] [TTM] Initializing DMA pool allocator [ 3.062683] nouveau 0000:01:00.0: DRM: VRAM: 512 MiB [ 3.062769] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB [ 3.062859] nouveau 0000:01:00.0: DRM: TMDS table version 2.0 [ 3.062944] nouveau 0000:01:00.0: DRM: DCB version 4.0 [ 3.063030] nouveau 0000:01:00.0: DRM: DCB outp 00: 02000300 00000028 [ 3.063117] nouveau 0000:01:00.0: DRM: DCB outp 01: 01000302 00000030 [ 3.063203] nouveau 0000:01:00.0: DRM: DCB outp 02: 04011310 00000028 [ 3.063290] nouveau 0000:01:00.0: DRM: DCB outp 03: 02011312 00c000b0 [ 3.063377] nouveau 0000:01:00.0: DRM: DCB conn 00: 1030 [ 3.063462] nouveau 0000:01:00.0: DRM: DCB conn 01: 2130 [ 3.065982] nouveau 0000:01:00.0: DRM: MM: using CRYPT for buffer copies [ 3.066622] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). [ 3.066754] [drm] Driver supports precise vblank timestamp query.
I was not able to capture the value of RIP for this crash.
With drm_kms_helper.fbdev_emulation=0 enabled, as documented in the commentary to function drm_fb_helper_initial_config defined in drivers/gpu/drm/drm_fb_helper.c, I get the following output:
RIP: 0010: _raw_spin_lock+0x7/0x20 Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
<scripts/decodecode <~/tmp/panic_code.txt Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f All code ======== 0: ba ff 00 00 00 mov $0xff,%edx 5: f0 0f b1 17 lock cmpxchg %edx,(%rdi) 9: 75 01 jne 0xc b: c3 retq c: 55 push %rbp d: 48 89 e5 mov %rsp,%rbp 10: e8 23 a2 6d ff callq 0xffffffffff6da238 15: 5d pop %rbp 16: c3 retq 17: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1) 1e: 00 00 00 00 22: 90 nop 23: 31 c0 xor %eax,%eax 25: ba 01 00 00 00 mov $0x1,%edx 2a:* f0 0f b1 17 lock cmpxchg %edx,(%rdi) <-- trapping instruction 2e: 75 01 jne 0x31 30: c3 retq 31: 55 push %rbp 32: 89 c6 mov %eax,%esi 34: 40 89 e5 rex mov %esp,%ebp 37: e8 e7 8f 6d ff callq 0xffffffffff6d9023 3c: 5d pop %rbp 3d: c3 retq 3e: 0f .byte 0xf 3f: 1f (bad)
Code starting with the faulting instruction
0: f0 0f b1 17 lock cmpxchg %edx,(%rdi) 4: 75 01 jne 0x7 6: c3 retq 7: 55 push %rbp 8: 89 c6 mov %eax,%esi a: 40 89 e5 rex mov %esp,%ebp d: e8 e7 8f 6d ff callq 0xffffffffff6d8ff9 12: 5d pop %rbp 13: c3 retq 14: 0f .byte 0xf 15: 1f (bad)
(gdb) list *(_raw_spin_lock+0x7) 0xffffffff81a13b27 is in _raw_spin_lock (./arch/x86/include/asm/atomic.h:200). 195 } 196 197 #define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg 198 static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new) 199 { 200 return try_cmpxchg(&v->counter, old, new); 201 } 202 203 static inline int arch_atomic_xchg(atomic_t *v, int new) 204 {
(gdb) disassemble _raw_spin_lock+0x7 Dump of assembler code for function _raw_spin_lock: 0xffffffff81a13b20 <+0>: xor %eax,%eax 0xffffffff81a13b22 <+2>: mov $0x1,%edx 0xffffffff81a13b27 <+7>: lock cmpxchg %edx,(%rdi) 0xffffffff81a13b2b <+11>: jne 0xffffffff81a13b2e <_raw_spin_lock+14> 0xffffffff81a13b2d <+13>: retq 0xffffffff81a13b2e <+14>: push %rbp 0xffffffff81a13b2f <+15>: mov %eax,%esi 0xffffffff81a13b31 <+17>: mov %rsp,%rbp 0xffffffff81a13b34 <+20>: callq 0xffffffff810ecb20 <queued_spin_lock_slowpath> 0xffffffff81a13b39 <+25>: pop %rbp 0xffffffff81a13b3a <+26>: retq End of assembler dump.
Any pointers on how to proceed with this would be appreciated.
'Git bisect' has identified the following commits as being 'bad'.
b96f3e7c8069b749a40ca3a33c97835d57dd45d2 is the first bad commit commit b96f3e7c8069b749a40ca3a33c97835d57dd45d2 Author: Gerd Hoffmann kraxel@redhat.com Date: Mon Aug 5 16:01:10 2019 +0200
drm/ttm: use gem vma_node
Drop vma_node from ttm_buffer_object, use the gem struct (base.vma_node) instead.
Signed-off-by: Gerd Hoffmann kraxel@redhat.com Reviewed-by: Christian König christian.koenig@amd.com Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-9-kraxel@re...
drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 2 +- drivers/gpu/drm/drm_gem_vram_helper.c | 2 +- drivers/gpu/drm/nouveau/nouveau_display.c | 2 +- drivers/gpu/drm/nouveau/nouveau_gem.c | 2 +- drivers/gpu/drm/qxl/qxl_object.h | 2 +- drivers/gpu/drm/radeon/radeon_object.h | 2 +- drivers/gpu/drm/ttm/ttm_bo.c | 8 ++++---- drivers/gpu/drm/ttm/ttm_bo_util.c | 2 +- drivers/gpu/drm/ttm/ttm_bo_vm.c | 9 +++++---- drivers/gpu/drm/virtio/virtgpu_drv.h | 2 +- drivers/gpu/drm/virtio/virtgpu_prime.c | 3 --- drivers/gpu/drm/vmwgfx/vmwgfx_bo.c | 4 ++-- drivers/gpu/drm/vmwgfx/vmwgfx_surface.c | 4 ++-- include/drm/ttm/ttm_bo_api.h | 4 ---- 14 files changed, 21 insertions(+), 27 deletions(-)
I nominated commit '[1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715] drm/ttm: use gem reservation object' as being 'good' initially, based on the fact that kernel 5.3.0-rc1-00364-g1e053b10ba60 did boot. But the GUI applications displayed black artifacts across the screen.
I then edited the git-bisect log file where I nominated commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as being 'bad' and ran 'git bisect replay' on it. This blamed commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as the first bad commit.
1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 is the first bad commit commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 Author: Gerd Hoffmann kraxel@redhat.com Date: Mon Aug 5 16:01:09 2019 +0200
drm/ttm: use gem reservation object
Drop ttm_resv from ttm_buffer_object, use the gem reservation object (base._resv) instead.
Signed-off-by: Gerd Hoffmann kraxel@redhat.com Reviewed-by: Christian König christian.koenig@amd.com Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-8-kraxel@re...
drivers/gpu/drm/ttm/ttm_bo.c | 39 +++++++++++++++++++++++---------------- drivers/gpu/drm/ttm/ttm_bo_util.c | 2 +- include/drm/ttm/ttm_bo_api.h | 1 - 3 files changed, 24 insertions(+), 18 deletions(-)
In the process of bisection, I nominated the following kernels as being 'bad'. They also booted fine, but the xserver would fail to start. I have attached the error messages generated by xorg.
# kernel boots; Xorg won't start. See Xorg_err.log attached. 5.3.0-rc3-01537-g6a3068065fa4 5.3.0-rc3-00782-gb0383c0653c4 5.3.0-rc1-00391-g54fc01b775fe 5.3.0-rc1-00366-g2e3c9ec4d151 5.3.0-rc1-00365-gb96f3e7c8069
Today, I upgraded the kernel to 5.3.0-next-20190919, which booted fine with no Xorg regressions to report.
Just wondering if the earlier kernels would not boot for me because of the changes introduced by the 'bad' commits being perhaps incomplete?
Thanks to all of you for the tips on how proceed with bisection.