To Whom It May Concern
Every kernel I have built from 5.3.0-rc2-next-20190730 through 5.3.0-rc7-next-20190903 has resulted in the kernel panic described below.
The panic occurs early in the boot process, so no record of it gets written to disk. I resorted to taking photos and videos to capture the information for debugging.
[Kernel panic] Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
Kernel panic - Not syncing: Attempted to kill init! exitcode=0x0000000b.
Top of call stack: __drm_fb_helper_initial_config_and_unlock drm_fb_helper_initial_config
<scripts/decodecode <~/tmp/panic_code.txt
Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
All code
========
   0:   00 48 83                add    %cl,-0x7d(%rax)
   3:   bb f0 00 00 00          mov    $0xf0,%ebx
   8:   00 74 16 48             add    %dh,0x48(%rsi,%rdx,1)
   c:   83 c3 18                add    $0x18,%ebx
   f:   b9 17 00 00 00          mov    $0x17,%ecx
  14:   31 c0                   xor    %eax,%eax
  16:   48 89 df                mov    %rbx,%rdi
  19:   f3 48 ab                rep stos %rax,%es:(%rdi)
  1c:   5b                      pop    %rbx
  1d:   41 5c                   pop    %r12
  1f:   5d                      pop    %rbp
  20:   c3                      retq
  21:   4c 89 a3 f0 00 00 00    mov    %r12,0xf0(%rbx)
  28:   eb e1                   jmp    0xb
  2a:*  0f 0b                   ud2             <-- trapping instruction
  2c:   0f 1f 40 00             nopl   0x0(%rax)
  30:   55                      push   %rbp
  31:   48 89 e5                mov    %rsp,%rbp
  34:   41 54                   push   %r12
  36:   49 89 d4                mov    %rdx,%r12
  39:   53                      push   %rbx
  3a:   48 89 f3                mov    %rsi,%rbx
  3d:   e8                      .byte 0xe8
  3e:   7e ff                   jle    0x3f
Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   0f 1f 40 00             nopl   0x0(%rax)
   6:   55                      push   %rbp
   7:   48 89 e5                mov    %rsp,%rbp
   a:   41 54                   push   %r12
   c:   49 89 d4                mov    %rdx,%r12
   f:   53                      push   %rbx
  10:   48 89 f3                mov    %rsi,%rbx
  13:   e8                      .byte 0xe8
  14:   7e ff                   jle    0x15
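As a rough cross-check on the decodecode listing, the position of the trapping byte can be counted directly from the Code: line, where it is the one wrapped in angle brackets. The helper below is only a sketch: the name code_trap_offset is made up for illustration and it is not part of scripts/decodecode.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kernel tree): print the hex offset and
# value of the byte wrapped in <..> in a "Code:" line read from stdin.
code_trap_offset() {
    awk '{
        n = 0                                   # hex bytes seen so far
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^<[0-9a-f][0-9a-f]>$/) {  # the marked (trapping) byte
                printf "0x%x %s\n", n, substr($i, 2, 2)
                exit
            }
            if ($i ~ /^[0-9a-f][0-9a-f]$/)      # an ordinary byte
                n++
        }
    }'
}

# The Code: line from the first panic in this report.
code='Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff'
result=$(printf '%s\n' "$code" | code_trap_offset)
echo "$result"
```

For the first panic this prints 0x2a 0f, which agrees with the 2a:* line in the listing.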
The panic occurs after the 'Driver supports precise vblank timestamp query.' line gets printed to console:
[ 2.858970] Linux agpgart interface v0.103
[ 2.859308] nouveau 0000:01:00.0: NVIDIA G84 (084300a2)
[ 2.968950] nouveau 0000:01:00.0: bios: version 60.84.68.00.19
[ 2.989923] nouveau 0000:01:00.0: bios: M0203T not found
[ 2.990010] nouveau 0000:01:00.0: bios: M0203E not matched!
[ 2.990096] nouveau 0000:01:00.0: fb: 512 MiB DDR2
[ 3.062362] [TTM] Zone kernel: Available graphics memory: 2015014 KiB
[ 3.062494] [TTM] Initializing pool allocator
[ 3.062581] [TTM] Initializing DMA pool allocator
[ 3.062683] nouveau 0000:01:00.0: DRM: VRAM: 512 MiB
[ 3.062769] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
[ 3.062859] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
[ 3.062944] nouveau 0000:01:00.0: DRM: DCB version 4.0
[ 3.063030] nouveau 0000:01:00.0: DRM: DCB outp 00: 02000300 00000028
[ 3.063117] nouveau 0000:01:00.0: DRM: DCB outp 01: 01000302 00000030
[ 3.063203] nouveau 0000:01:00.0: DRM: DCB outp 02: 04011310 00000028
[ 3.063290] nouveau 0000:01:00.0: DRM: DCB outp 03: 02011312 00c000b0
[ 3.063377] nouveau 0000:01:00.0: DRM: DCB conn 00: 1030
[ 3.063462] nouveau 0000:01:00.0: DRM: DCB conn 01: 2130
[ 3.065982] nouveau 0000:01:00.0: DRM: MM: using CRYPT for buffer copies
[ 3.066622] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 3.066754] [drm] Driver supports precise vblank timestamp query.
I was not able to capture the value of RIP for this crash.
With drm_kms_helper.fbdev_emulation=0 set, as documented in the comment above drm_fb_helper_initial_config() in drivers/gpu/drm/drm_fb_helper.c, I get the following output:
RIP: 0010: _raw_spin_lock+0x7/0x20
Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
<scripts/decodecode <~/tmp/panic_code.txt
Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
All code
========
   0:   ba ff 00 00 00          mov    $0xff,%edx
   5:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
   9:   75 01                   jne    0xc
   b:   c3                      retq
   c:   55                      push   %rbp
   d:   48 89 e5                mov    %rsp,%rbp
  10:   e8 23 a2 6d ff          callq  0xffffffffff6da238
  15:   5d                      pop    %rbp
  16:   c3                      retq
  17:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1e:   00 00 00 00
  22:   90                      nop
  23:   31 c0                   xor    %eax,%eax
  25:   ba 01 00 00 00          mov    $0x1,%edx
  2a:*  f0 0f b1 17             lock cmpxchg %edx,(%rdi)                <-- trapping instruction
  2e:   75 01                   jne    0x31
  30:   c3                      retq
  31:   55                      push   %rbp
  32:   89 c6                   mov    %eax,%esi
  34:   40 89 e5                rex mov %esp,%ebp
  37:   e8 e7 8f 6d ff          callq  0xffffffffff6d9023
  3c:   5d                      pop    %rbp
  3d:   c3                      retq
  3e:   0f                      .byte 0xf
  3f:   1f                      (bad)
Code starting with the faulting instruction
===========================================
   0:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
   4:   75 01                   jne    0x7
   6:   c3                      retq
   7:   55                      push   %rbp
   8:   89 c6                   mov    %eax,%esi
   a:   40 89 e5                rex mov %esp,%ebp
   d:   e8 e7 8f 6d ff          callq  0xffffffffff6d8ff9
  12:   5d                      pop    %rbp
  13:   c3                      retq
  14:   0f                      .byte 0xf
  15:   1f                      (bad)
(gdb) list *(_raw_spin_lock+0x7)
0xffffffff81a13b27 is in _raw_spin_lock (./arch/x86/include/asm/atomic.h:200).
195     }
196
197     #define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
198     static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
199     {
200             return try_cmpxchg(&v->counter, old, new);
201     }
202
203     static inline int arch_atomic_xchg(atomic_t *v, int new)
204     {
(gdb) disassemble _raw_spin_lock+0x7
Dump of assembler code for function _raw_spin_lock:
   0xffffffff81a13b20 <+0>:     xor    %eax,%eax
   0xffffffff81a13b22 <+2>:     mov    $0x1,%edx
   0xffffffff81a13b27 <+7>:     lock cmpxchg %edx,(%rdi)
   0xffffffff81a13b2b <+11>:    jne    0xffffffff81a13b2e <_raw_spin_lock+14>
   0xffffffff81a13b2d <+13>:    retq
   0xffffffff81a13b2e <+14>:    push   %rbp
   0xffffffff81a13b2f <+15>:    mov    %eax,%esi
   0xffffffff81a13b31 <+17>:    mov    %rsp,%rbp
   0xffffffff81a13b34 <+20>:    callq  0xffffffff810ecb20 <queued_spin_lock_slowpath>
   0xffffffff81a13b39 <+25>:    pop    %rbp
   0xffffffff81a13b3a <+26>:    retq
End of assembler dump.
Any pointers on how to proceed with this would be appreciated. See the files attached for copies of my .config and output of ver_linux.
On Sat, Sep 7, 2019 at 11:05 AM Alexander Kapshuk alexander.kapshuk@gmail.com wrote:
Any pointers on how to proceed with this would be appreciated. See the files attached for copies of my .config and output of ver_linux.
Lotta people are travelling to Plumbers now for next week, so might be slow responses. One option would be to try to bisect where things broke. You've collected a lot of data here painstakingly, but from looking at it I'm not really connecting the dots. Knowing which change broke your machine can help enormously.
-Daniel
On Sat, Sep 07, 2019 at 12:32:41PM +0200, Daniel Vetter wrote:
Lotta people are travelling to Plumbers now for next week, so might be slow responses. One option would be to try to bisect where things broke. You've collected a lot of data here painstakingly, but from looking at it I'm not really connecting the dots. Knowing which change broke your machine can help enormously.
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Thanks for your prompt reply, Daniel. This is my first bisect. Here's what I've tried so far; judging by the output I got, I seem to be heading in the opposite direction.
git bisect start
git bisect bad # 7dc4585e0378:next-20190903
git bisect good next-20190730 #70f4b4ac1655
Instead of being shown commits between 70f4b4ac1655 and 7dc4585e0378, I got this:
Bisecting: a merge base must be tested
[5d21595b17f693dce313b58f9617a8ea2a45b1b1] pinctrl: Ingenic: Add pinctrl driver for X1500.
git show 5d21595b17f6 | grep -i date
Date: Sun Jul 14 11:53:56 2019 +0800
I went ahead and actually built that kernel, which ended up being 5.3.0-rc1-00010-g5d21595b17f6, which booted just fine. So, I said:
git bisect good 5d21595b17f6
And got:
Bisecting: a merge base must be tested
[8c4407de3be44c2a0ec3e316cd3e4a711bc2aaba] pinctrl: aspeed: Make aspeed_pinmux_ips static
git show 8c4407de3be4 | grep -i date
Date: Thu Jul 11 22:24:57 2019 +0800
Clearly, I'm doing something wrong there. I'd appreciate being pointed in the right direction.
Hi Alexander,
On Sun, 8 Sep 2019 17:13:07 +0300 Alexander Kapshuk alexander.kapshuk@gmail.com wrote:
This is my first bisect. Here's what I've tried so far and based on the output I got, I seem to be taken in the opposite direction.
git bisect start
git bisect bad # 7dc4585e0378:next-20190903
git bisect good next-20190730 #70f4b4ac1655
If you are bisecting linux-next, I would suggest bisecting between the stable branch on linux-next (which is just Linus' tree when I started that day) and the top of the first linux-next that fails (assuming that the stable branch is good).
so
git bisect start
git bisect good stable
git bisect bad next-20190903
and go from there. It will (unfortunately) be quite a few commits to test :-(
Hi,
On Mon, 9 Sep 2019 20:11:59 +1000 Stephen Rothwell sfr@canb.auug.org.au wrote:
If you are bisecting linux-next, I would suggest bisecting between the stable branch on linux-next (which is just Linus' tree when I started that day) and the top of the first linux-next that fails (assuming that the stable branch is good).
Actually (since you won't be bisecting the latest linux-next), you probably want to use
git merge-base stable next-20190903 (or whatever linux-next you are bisecting)
as your first good commit (assuming it is good :-)).
On Mon, Sep 9, 2019 at 1:21 PM Stephen Rothwell sfr@canb.auug.org.au wrote:
Actually (since you won't be bisecting the latest linux-next), you probably want to use
git merge-base stable next-20190903 (or whatever linux-next you are bisecting)
as your first good commit (assuming it is good :-)).
-- Cheers, Stephen Rothwell
Hi Stephen,
Thanks very much for the tips. I'll go ahead and give those a try.
On Mon, Sep 9, 2019 at 12:25 PM Alexander Kapshuk alexander.kapshuk@gmail.com wrote:
Hi Stephen,
Thanks very much for the tips. I'll go ahead and give those a try.
Yeah this should help, but in general, due to our non-linear history, git bisect can jump around quite a bit, especially if you only look at dates. Also due to our non-linear history, it sometimes needs you to test a merge base, to figure out which part of the history it needs to chase for the bad commit. So all normal, but the hint above should help.
Also, you don't need to restart git bisect, you can just tell it about any good/bad commit you tested with
$ git bisect good|bad $sha1
The more git knows, the quicker it should find the bad commit.
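To illustrate the last point, the sketch below counts how many commits could still be the first bad one before and after an extra known-good commit is recorded. It runs on a made-up linear toy history; the remaining helper and all commit contents are assumptions for the demo. In the thread's case, the 5d21595b17f6 build that booted fine is the kind of result that could be fed in this way.

```shell
#!/bin/sh
# Toy linear history; names and contents are invented for the demo.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo 0 > f
git add f
g commit -q -m base
base=$(git rev-parse HEAD)

i=1
while [ "$i" -le 16 ]; do
    echo "$i" > f
    g commit -q -a -m "change $i"
    # Pretend change 12 was already tested separately and found good.
    if [ "$i" -eq 12 ]; then known_good=$(git rev-parse HEAD); fi
    i=$((i + 1))
done

# Hypothetical helper: commits reachable from the bad commit but from none
# of the commits marked good, i.e. the remaining suspects.
remaining() {
    git rev-list --count refs/bisect/bad --not \
        $(git for-each-ref --format='%(objectname)' 'refs/bisect/good-*')
}

git bisect start
git bisect bad HEAD
git bisect good "$base"
before=$(remaining)

# Record the extra result without restarting the bisection.
git bisect good "$known_good"
after=$(remaining)

git bisect reset
echo "suspects: $before -> $after"
```

Here the extra verdict cuts the suspects from 16 to 4, so fewer builds are needed to finish.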
Cheers, Daniel
On Sat, Sep 07, 2019 at 12:05:34PM +0300, Alexander Kapshuk wrote:
Any pointers on how to proceed with this would be appreciated.
'Git bisect' has identified the following commits as being 'bad'.
b96f3e7c8069b749a40ca3a33c97835d57dd45d2 is the first bad commit
commit b96f3e7c8069b749a40ca3a33c97835d57dd45d2
Author: Gerd Hoffmann kraxel@redhat.com
Date: Mon Aug 5 16:01:10 2019 +0200

    drm/ttm: use gem vma_node

    Drop vma_node from ttm_buffer_object, use the gem struct (base.vma_node) instead.

    Signed-off-by: Gerd Hoffmann kraxel@redhat.com
    Reviewed-by: Christian König christian.koenig@amd.com
    Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-9-kraxel@re...

 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 2 +-
 drivers/gpu/drm/drm_gem_vram_helper.c      | 2 +-
 drivers/gpu/drm/nouveau/nouveau_display.c  | 2 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c      | 2 +-
 drivers/gpu/drm/qxl/qxl_object.h           | 2 +-
 drivers/gpu/drm/radeon/radeon_object.h     | 2 +-
 drivers/gpu/drm/ttm/ttm_bo.c               | 8 ++++----
 drivers/gpu/drm/ttm/ttm_bo_util.c          | 2 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c            | 9 +++++----
 drivers/gpu/drm/virtio/virtgpu_drv.h       | 2 +-
 drivers/gpu/drm/virtio/virtgpu_prime.c     | 3 ---
 drivers/gpu/drm/vmwgfx/vmwgfx_bo.c         | 4 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_surface.c    | 4 ++--
 include/drm/ttm/ttm_bo_api.h               | 4 ----
 14 files changed, 21 insertions(+), 27 deletions(-)
I initially marked commit '[1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715] drm/ttm: use gem reservation object' as 'good' because kernel 5.3.0-rc1-00364-g1e053b10ba60 did boot, even though GUI applications displayed black artifacts across the screen.
I then edited the git bisect log to mark commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as 'bad' instead and ran 'git bisect replay' on it. This time commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 was blamed as the first bad commit.
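The log-edit-and-replay step can be rehearsed on a toy history. This is a sketch only: the repository, the state labels (ok, glitch, broken) and the check script are invented, and git bisect run with a grep stands in for the manual boot tests. The first pass treats the glitch commit as good, as a kernel that boots but shows artifacts might be; the second pass flips that one verdict in the saved log and replays it.

```shell
#!/bin/sh
# Toy linear history; names and state labels are invented for the demo.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

echo ok > state
git add state
g commit -q -m base
base=$(git rev-parse HEAD)

i=1
while [ "$i" -le 8 ]; do
    if [ "$i" -eq 3 ]; then echo glitch > state; fi   # works, but misbehaves
    if [ "$i" -eq 4 ]; then echo broken > state; fi   # clearly broken from here
    echo "$i" > counter
    git add -A
    g commit -q -m "change $i"
    if [ "$i" -eq 3 ]; then c3=$(git rev-parse HEAD); fi
    if [ "$i" -eq 4 ]; then c4=$(git rev-parse HEAD); fi
    i=$((i + 1))
done

# First pass: only "broken" counts as bad, so the glitch commit passes.
check=$(mktemp)
echo '! grep -q broken state' > "$check"
git bisect start
git bisect bad HEAD
git bisect good "$base"
git bisect run sh "$check"
first_pass=$(git rev-parse refs/bisect/bad)

# Save the log, flip the verdict on the glitch commit, and replay it.
log=$(mktemp)
git bisect log > "$log"
git bisect reset
sed "s/^git bisect good $c3/git bisect bad $c3/" "$log" > "$log.edited"
git bisect replay "$log.edited"
second_pass=$(git rev-parse refs/bisect/bad)
git bisect reset

echo "first pass blamed:  $first_pass"
echo "second pass blamed: $second_pass"
```

The first pass blames the commit after the glitch; after the edit, the replay blames the glitch commit itself, which mirrors the change in result described above.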
1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 is the first bad commit
commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715
Author: Gerd Hoffmann kraxel@redhat.com
Date: Mon Aug 5 16:01:09 2019 +0200

    drm/ttm: use gem reservation object

    Drop ttm_resv from ttm_buffer_object, use the gem reservation object (base._resv) instead.

    Signed-off-by: Gerd Hoffmann kraxel@redhat.com
    Reviewed-by: Christian König christian.koenig@amd.com
    Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-8-kraxel@re...

 drivers/gpu/drm/ttm/ttm_bo.c      | 39 +++++++++++++++++++++++----------------
 drivers/gpu/drm/ttm/ttm_bo_util.c | 2 +-
 include/drm/ttm/ttm_bo_api.h      | 1 -
 3 files changed, 24 insertions(+), 18 deletions(-)
During bisection, I also marked the following kernels as 'bad'. They booted fine, but the X server would fail to start. I have attached the error messages generated by Xorg.

# kernel boots; Xorg won't start. See Xorg_err.log attached.
5.3.0-rc3-01537-g6a3068065fa4
5.3.0-rc3-00782-gb0383c0653c4
5.3.0-rc1-00391-g54fc01b775fe
5.3.0-rc1-00366-g2e3c9ec4d151
5.3.0-rc1-00365-gb96f3e7c8069
Today, I upgraded the kernel to 5.3.0-next-20190919, which booted fine with no Xorg regressions to report.
Just wondering if the earlier kernels would not boot for me because of the changes introduced by the 'bad' commits being perhaps incomplete?
Thanks to all of you for the tips on how to proceed with bisection.
Adding Gerd and Christian. -Daniel
On Fri, Sep 20, 2019 at 9:44 PM Alexander Kapshuk alexander.kapshuk@gmail.com wrote:
On Sat, Sep 07, 2019 at 12:05:34PM +0300, Alexander Kapshuk wrote:
I was not able to capture the value of RIP for this crash.
With drm_kms_helper.fbdev_emulation=0 set, as documented in the comment for drm_fb_helper_initial_config() in drivers/gpu/drm/drm_fb_helper.c, I get the following output:
RIP: 0010: _raw_spin_lock+0x7/0x20
Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f

<scripts/decodecode <~/tmp/panic_code.txt
Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
All code
========
   0:   ba ff 00 00 00          mov    $0xff,%edx
   5:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
   9:   75 01                   jne    0xc
   b:   c3                      retq
   c:   55                      push   %rbp
   d:   48 89 e5                mov    %rsp,%rbp
  10:   e8 23 a2 6d ff          callq  0xffffffffff6da238
  15:   5d                      pop    %rbp
  16:   c3                      retq
  17:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
  1e:   00 00 00 00
  22:   90                      nop
  23:   31 c0                   xor    %eax,%eax
  25:   ba 01 00 00 00          mov    $0x1,%edx
  2a:*  f0 0f b1 17             lock cmpxchg %edx,(%rdi)     <-- trapping instruction
  2e:   75 01                   jne    0x31
  30:   c3                      retq
  31:   55                      push   %rbp
  32:   89 c6                   mov    %eax,%esi
  34:   40 89 e5                rex mov %esp,%ebp
  37:   e8 e7 8f 6d ff          callq  0xffffffffff6d9023
  3c:   5d                      pop    %rbp
  3d:   c3                      retq
  3e:   0f                      .byte 0xf
  3f:   1f                      (bad)

Code starting with the faulting instruction
===========================================
   0:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
   4:   75 01                   jne    0x7
   6:   c3                      retq
   7:   55                      push   %rbp
   8:   89 c6                   mov    %eax,%esi
   a:   40 89 e5                rex mov %esp,%ebp
   d:   e8 e7 8f 6d ff          callq  0xffffffffff6d8ff9
  12:   5d                      pop    %rbp
  13:   c3                      retq
  14:   0f                      .byte 0xf
  15:   1f                      (bad)

(gdb) list *(_raw_spin_lock+0x7)
0xffffffff81a13b27 is in _raw_spin_lock (./arch/x86/include/asm/atomic.h:200).
195     }
196
197     #define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
198     static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
199     {
200             return try_cmpxchg(&v->counter, old, new);
201     }
202
203     static inline int arch_atomic_xchg(atomic_t *v, int new)
204     {

(gdb) disassemble _raw_spin_lock+0x7
Dump of assembler code for function _raw_spin_lock:
   0xffffffff81a13b20 <+0>:     xor    %eax,%eax
   0xffffffff81a13b22 <+2>:     mov    $0x1,%edx
   0xffffffff81a13b27 <+7>:     lock cmpxchg %edx,(%rdi)
   0xffffffff81a13b2b <+11>:    jne    0xffffffff81a13b2e <_raw_spin_lock+14>
   0xffffffff81a13b2d <+13>:    retq
   0xffffffff81a13b2e <+14>:    push   %rbp
   0xffffffff81a13b2f <+15>:    mov    %eax,%esi
   0xffffffff81a13b31 <+17>:    mov    %rsp,%rbp
   0xffffffff81a13b34 <+20>:    callq  0xffffffff810ecb20 <queued_spin_lock_slowpath>
   0xffffffff81a13b39 <+25>:    pop    %rbp
   0xffffffff81a13b3a <+26>:    retq
End of assembler dump.
Any pointers on how to proceed with this would be appreciated.
'Git bisect' has identified the following commits as being 'bad'.
b96f3e7c8069b749a40ca3a33c97835d57dd45d2 is the first bad commit
commit b96f3e7c8069b749a40ca3a33c97835d57dd45d2
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Mon Aug 5 16:01:10 2019 +0200

    drm/ttm: use gem vma_node

    Drop vma_node from ttm_buffer_object, use the gem struct
    (base.vma_node) instead.

    Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-9-kraxel@redhat.com
Today, I upgraded the kernel to 5.3.0-next-20190919, which booted fine with no Xorg regressions to report.
Just wondering if the earlier kernels would not boot for me because of the changes introduced by the 'bad' commits being perhaps incomplete?
Yes, we had a regression in nouveau, fixed by this patch (in drm-misc-next):
commit 019cbd4a4feb3aa3a917d78e7110e3011bbff6d5
Author: Thierry Reding <treding@nvidia.com>
Date:   Wed Aug 14 11:00:48 2019 +0200

    drm/nouveau: Initialize GEM object before TTM object

    TTM assumes that drivers initialize the embedded GEM object before
    calling the ttm_bo_init() function. This is not currently the case
    in the Nouveau driver. Fix this by splitting up nouveau_bo_new()
    into nouveau_bo_alloc() and nouveau_bo_init() so that the GEM can
    be initialized before TTM BO initialization when necessary.

    Fixes: b96f3e7c8069 ("drm/ttm: use gem vma_node")
    Acked-by: Gerd Hoffmann <kraxel@redhat.com>
    Acked-by: Ben Skeggs <bskeggs@redhat.com>
    Signed-off-by: Thierry Reding <treding@nvidia.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190814093524.GA31345@ulmo
HTH, Gerd
On Mon, Sep 23, 2019 at 9:38 AM Gerd Hoffmann kraxel@redhat.com wrote:
Terrific. Thanks for the info.
dri-devel@lists.freedesktop.org