Re: Kernel panic during drm/nouveau init 5.3.0-rc7-next-20190903

20 Sep 2019


      On Sat, Sep 07, 2019 at 12:05:34PM +0300, Alexander Kapshuk wrote:
...
To Whom It May Concern
Every kernel I have built since 5.3.0-rc2-next-20190730 and up to
5.3.0-rc7-next-20190903 has resulted in the kernel panic described below.
The panic occurs early in the boot process, so no record of it gets
written to disk. I resorted to taking photos and videos to capture the
information for debugging.
[Kernel panic]
Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
Kernel panic - Not syncing: Attempted to kill init! exitcode=0x0000000b.
Top of call stack:
__drm_fb_helper_initial_config_and_unlock
drm_fb_helper_initial_config
<scripts/decodecode <~/tmp/panic_code.txt
Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
All code
========
   0:	00 48 83             	add    %cl,-0x7d(%rax)
   3:	bb f0 00 00 00       	mov    $0xf0,%ebx
   8:	00 74 16 48          	add    %dh,0x48(%rsi,%rdx,1)
   c:	83 c3 18             	add    $0x18,%ebx
   f:	b9 17 00 00 00       	mov    $0x17,%ecx
  14:	31 c0                	xor    %eax,%eax
  16:	48 89 df             	mov    %rbx,%rdi
  19:	f3 48 ab             	rep stos %rax,%es:(%rdi)
  1c:	5b                   	pop    %rbx
  1d:	41 5c                	pop    %r12
  1f:	5d                   	pop    %rbp
  20:	c3                   	retq   
  21:	4c 89 a3 f0 00 00 00 	mov    %r12,0xf0(%rbx)
  28:	eb e1                	jmp    0xb
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	55                   	push   %rbp
  31:	48 89 e5             	mov    %rsp,%rbp
  34:	41 54                	push   %r12
  36:	49 89 d4             	mov    %rdx,%r12
  39:	53                   	push   %rbx
  3a:	48 89 f3             	mov    %rsi,%rbx
  3d:	e8                   	.byte 0xe8
  3e:	7e ff                	jle    0x3f
Code starting with the faulting instruction
   0:	0f 0b                	ud2    
   2:	0f 1f 40 00          	nopl   0x0(%rax)
   6:	55                   	push   %rbp
   7:	48 89 e5             	mov    %rsp,%rbp
   a:	41 54                	push   %r12
   c:	49 89 d4             	mov    %rdx,%r12
   f:	53                   	push   %rbx
  10:	48 89 f3             	mov    %rsi,%rbx
  13:	e8                   	.byte 0xe8
  14:	7e ff                	jle    0x15
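As a cross-check on the decodecode output above, the byte wrapped in angle brackets on the Code: line marks where the trap occurred. The following is a minimal sketch, not part of the kernel tree, that pulls out that byte and the one after it; the helper name trap_bytes is hypothetical, and the sample input is a shortened copy of the Code: line above.

```shell
# trap_bytes (hypothetical helper): read an oops "Code:" line on stdin and
# print the byte marked with <..> followed by the next byte.
trap_bytes() {
    sed -n 's/.*<\([0-9a-f]\{2\}\)> \([0-9a-f]\{2\}\).*/\1 \2/p'
}

# Shortened sample taken from the Code: line above.
echo 'Code: eb e1 <0f> 0b 0f 1f' | trap_bytes
```

The pair 0f 0b is the x86 encoding of ud2, which matches the trapping instruction reported by decodecode. x86 kernels typically emit ud2 for BUG() and WARN() sites, so this may point at an assertion in the code rather than a stray memory access.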
The panic occurs after the 'Driver supports precise vblank timestamp
query.' line gets printed to console:
[    2.858970] Linux agpgart interface v0.103
[    2.859308] nouveau 0000:01:00.0: NVIDIA G84 (084300a2)
[    2.968950] nouveau 0000:01:00.0: bios: version 60.84.68.00.19
[    2.989923] nouveau 0000:01:00.0: bios: M0203T not found
[    2.990010] nouveau 0000:01:00.0: bios: M0203E not matched!
[    2.990096] nouveau 0000:01:00.0: fb: 512 MiB DDR2
[    3.062362] [TTM] Zone  kernel: Available graphics memory: 2015014 KiB
[    3.062494] [TTM] Initializing pool allocator
[    3.062581] [TTM] Initializing DMA pool allocator
[    3.062683] nouveau 0000:01:00.0: DRM: VRAM: 512 MiB
[    3.062769] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
[    3.062859] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
[    3.062944] nouveau 0000:01:00.0: DRM: DCB version 4.0
[    3.063030] nouveau 0000:01:00.0: DRM: DCB outp 00: 02000300 00000028
[    3.063117] nouveau 0000:01:00.0: DRM: DCB outp 01: 01000302 00000030
[    3.063203] nouveau 0000:01:00.0: DRM: DCB outp 02: 04011310 00000028
[    3.063290] nouveau 0000:01:00.0: DRM: DCB outp 03: 02011312 00c000b0
[    3.063377] nouveau 0000:01:00.0: DRM: DCB conn 00: 1030
[    3.063462] nouveau 0000:01:00.0: DRM: DCB conn 01: 2130
[    3.065982] nouveau 0000:01:00.0: DRM: MM: using CRYPT for buffer copies
[    3.066622] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    3.066754] [drm] Driver supports precise vblank timestamp query.
I was not able to capture the value of RIP for this crash.
With drm_kms_helper.fbdev_emulation=0 set, as documented in the
comment on drm_fb_helper_initial_config() in
drivers/gpu/drm/drm_fb_helper.c, I get the following output:
RIP: 0010: _raw_spin_lock+0x7/0x20
Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
<scripts/decodecode <~/tmp/panic_code.txt
Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
All code
========
   0:	ba ff 00 00 00       	mov    $0xff,%edx
   5:	f0 0f b1 17          	lock cmpxchg %edx,(%rdi)
   9:	75 01                	jne    0xc
   b:	c3                   	retq   
   c:	55                   	push   %rbp
   d:	48 89 e5             	mov    %rsp,%rbp
  10:	e8 23 a2 6d ff       	callq  0xffffffffff6da238
  15:	5d                   	pop    %rbp
  16:	c3                   	retq   
  17:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
  1e:	00 00 00 00 
  22:	90                   	nop
  23:	31 c0                	xor    %eax,%eax
  25:	ba 01 00 00 00       	mov    $0x1,%edx
  2a:*	f0 0f b1 17          	lock cmpxchg %edx,(%rdi)		<-- trapping instruction
  2e:	75 01                	jne    0x31
  30:	c3                   	retq   
  31:	55                   	push   %rbp
  32:	89 c6                	mov    %eax,%esi
  34:	40 89 e5             	rex mov %esp,%ebp
  37:	e8 e7 8f 6d ff       	callq  0xffffffffff6d9023
  3c:	5d                   	pop    %rbp
  3d:	c3                   	retq   
  3e:	0f                   	.byte 0xf
  3f:	1f                   	(bad)
Code starting with the faulting instruction
   0:	f0 0f b1 17          	lock cmpxchg %edx,(%rdi)
   4:	75 01                	jne    0x7
   6:	c3                   	retq   
   7:	55                   	push   %rbp
   8:	89 c6                	mov    %eax,%esi
   a:	40 89 e5             	rex mov %esp,%ebp
   d:	e8 e7 8f 6d ff       	callq  0xffffffffff6d8ff9
  12:	5d                   	pop    %rbp
  13:	c3                   	retq   
  14:	0f                   	.byte 0xf
  15:	1f                   	(bad)
(gdb) list *(_raw_spin_lock+0x7)
0xffffffff81a13b27 is in _raw_spin_lock (./arch/x86/include/asm/atomic.h:200).
195	}
196	
197	#define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
198	static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
199	{
200		return try_cmpxchg(&v->counter, old, new);
201	}
202	
203	static inline int arch_atomic_xchg(atomic_t *v, int new)
204	{
(gdb) disassemble _raw_spin_lock+0x7
Dump of assembler code for function _raw_spin_lock:
   0xffffffff81a13b20 <+0>:	xor    %eax,%eax
   0xffffffff81a13b22 <+2>:	mov    $0x1,%edx
   0xffffffff81a13b27 <+7>:	lock cmpxchg %edx,(%rdi)
   0xffffffff81a13b2b <+11>:	jne    0xffffffff81a13b2e <_raw_spin_lock+14>
   0xffffffff81a13b2d <+13>:	retq   
   0xffffffff81a13b2e <+14>:	push   %rbp
   0xffffffff81a13b2f <+15>:	mov    %eax,%esi
   0xffffffff81a13b31 <+17>:	mov    %rsp,%rbp
   0xffffffff81a13b34 <+20>:	callq  0xffffffff810ecb20 <queued_spin_lock_slowpath>
   0xffffffff81a13b39 <+25>:	pop    %rbp
   0xffffffff81a13b3a <+26>:	retq   
End of assembler dump.
Any pointers on how to proceed with this would be appreciated.
'Git bisect' has identified the following commits as being 'bad'.
b96f3e7c8069b749a40ca3a33c97835d57dd45d2 is the first bad commit
commit b96f3e7c8069b749a40ca3a33c97835d57dd45d2
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Mon Aug 5 16:01:10 2019 +0200

    drm/ttm: use gem vma_node

    Drop vma_node from ttm_buffer_object, use the gem struct
    (base.vma_node) instead.

    Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-9-kraxel@re...

 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 2 +-
 drivers/gpu/drm/drm_gem_vram_helper.c      | 2 +-
 drivers/gpu/drm/nouveau/nouveau_display.c  | 2 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c      | 2 +-
 drivers/gpu/drm/qxl/qxl_object.h           | 2 +-
 drivers/gpu/drm/radeon/radeon_object.h     | 2 +-
 drivers/gpu/drm/ttm/ttm_bo.c               | 8 ++++----
 drivers/gpu/drm/ttm/ttm_bo_util.c          | 2 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c            | 9 +++++----
 drivers/gpu/drm/virtio/virtgpu_drv.h       | 2 +-
 drivers/gpu/drm/virtio/virtgpu_prime.c     | 3 ---
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c         | 4 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c    | 4 ++--
 include/drm/ttm/ttm_bo_api.h               | 4 ----
 14 files changed, 21 insertions(+), 27 deletions(-)
I initially marked commit '[1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715]
drm/ttm: use gem reservation object' as 'good', because kernel
5.3.0-rc1-00364-g1e053b10ba60 did boot. However, GUI applications
displayed black artifacts across the screen.
I then edited the git-bisect log file to mark commit
1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as 'bad' instead and ran
'git bisect replay' on it. This blamed commit
1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as the first bad commit.
1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 is the first bad commit
commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Mon Aug 5 16:01:09 2019 +0200

    drm/ttm: use gem reservation object

    Drop ttm_resv from ttm_buffer_object, use the gem reservation object
    (base._resv) instead.

    Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-8-kraxel@re...

 drivers/gpu/drm/ttm/ttm_bo.c      | 39 +++++++++++++++++++++++----------------
 drivers/gpu/drm/ttm/ttm_bo_util.c |  2 +-
 include/drm/ttm/ttm_bo_api.h      |  1 -
 3 files changed, 24 insertions(+), 18 deletions(-)
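The log-editing step described above could be sketched roughly as follows. This is only an illustration, assuming the log was saved with 'git bisect log' and records verdicts as lines of the form "git bisect good <commit>". The helper name flip_to_bad and the path ~/tmp/bisect.log are hypothetical.

```shell
# flip_to_bad LOGFILE COMMIT (hypothetical helper): in a log saved with
# 'git bisect log', turn the "good" verdict for COMMIT into "bad".
flip_to_bad() {
    log=$1
    commit=$2
    # Only an exact "git bisect good COMMIT" line is rewritten; comment
    # lines and verdicts for other commits are left untouched.
    sed "s/^git bisect good $commit\$/git bisect bad $commit/" "$log" > "$log.tmp" &&
        mv "$log.tmp" "$log"
}

# Intended use inside the kernel tree (not run here):
#   git bisect log > ~/tmp/bisect.log
#   flip_to_bad ~/tmp/bisect.log 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715
#   git bisect reset
#   git bisect replay ~/tmp/bisect.log
```

Entries recorded after the flipped line were chosen under the old verdict, so it may be safer to delete them from the log before replaying.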
During bisection, I also marked the following kernels as 'bad'. They
booted fine, but the X server failed to start. I have attached the
error messages generated by Xorg.
# kernel boots; Xorg won't start. See Xorg_err.log attached.
5.3.0-rc3-01537-g6a3068065fa4
5.3.0-rc3-00782-gb0383c0653c4
5.3.0-rc1-00391-g54fc01b775fe
5.3.0-rc1-00366-g2e3c9ec4d151
5.3.0-rc1-00365-gb96f3e7c8069
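The kernel release strings listed above appear to follow the git-describe convention, in which the suffix after "-g" is the abbreviated hash of the commit the kernel was built from. A minimal sketch for mapping a release string back to its commit follows; the helper name commit_of is hypothetical.

```shell
# commit_of RELEASE (hypothetical helper): print the abbreviated commit
# hash that follows the final "-g" in a git-describe style release string.
commit_of() {
    printf '%s\n' "$1" | sed -n 's/.*-g\([0-9a-f]\{7,\}\)$/\1/p'
}

commit_of 5.3.0-rc1-00365-gb96f3e7c8069
```

For the last kernel in the list this yields b96f3e7c8069, the abbreviated form of the first bad commit reported by the initial bisection.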
Today, I upgraded the kernel to 5.3.0-next-20190919, which booted fine
with no Xorg regressions to report.
I wonder whether the earlier kernels failed to boot for me because the
changes introduced by the 'bad' commits were incomplete at that point.
Thanks to all of you for the tips on how to proceed with bisection.
