https://bugs.freedesktop.org/show_bug.cgi?id=107065
Bug ID: 107065
Summary: "BUG: unable to handle kernel paging request at 0000000000002000" at amdgpu_vm_cpu_set_ptes at S3 resume
Product: DRI
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: major
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: jb5sgc1n.nya@20mm.eu
When I resume from S3 using the 4.17.2-1-ARCH kernel with amdgpu.vm_update_mode=3 (for reasons explained in https://bugs.freedesktop.org/show_bug.cgi?id=102322), first the amdgpu driver crashes and shortly thereafter the whole system, with the following kernel messages:
Jun 28 21:14:25 ryzen kernel: ACPI: Low-level resume complete
Jun 28 21:14:25 ryzen kernel: PM: Restoring platform NVS memory
Jun 28 21:14:25 ryzen kernel: Enabling non-boot CPUs ...
...
Jun 28 21:14:25 ryzen kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400040000).
Jun 28 21:14:25 ryzen kernel: [drm] UVD and UVD ENC initialized successfully.
Jun 28 21:14:25 ryzen kernel: [drm] VCE initialized successfully.
Jun 28 21:14:25 ryzen kernel: OOM killer enabled.
Jun 28 21:14:25 ryzen kernel: Restarting tasks ... done.
Jun 28 21:14:25 ryzen kernel: PM: suspend exit
Jun 28 21:14:25 ryzen kernel: BUG: unable to handle kernel paging request at 0000000000002000
Jun 28 21:14:25 ryzen kernel: PGD 0 P4D 0
Jun 28 21:14:25 ryzen kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Jun 28 21:14:25 ryzen kernel: Modules linked in: arc4 md4 sha512_ssse3 sha512_generic nls_utf8 cifs ccm dns_resolver fscache>
Jun 28 21:14:25 ryzen kernel: bluetooth snd_hwdep snd_pcm eeepc_wmi snd_timer asus_wmi snd sparse_keymap mxm_wmi wmi_bmof i>
Jun 28 21:14:25 ryzen kernel: dm_crypt dm_mod i2c_dev
Jun 28 21:14:25 ryzen kernel: CPU: 3 PID: 882 Comm: amdgpu_cs:0 Tainted: G W O 4.17.2-1-ARCH #1
Jun 28 21:14:25 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
Jun 28 21:14:25 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jun 28 21:14:25 ryzen kernel: RSP: 0018:ffffb8b8c3fa7a70 EFLAGS: 00010202
Jun 28 21:14:25 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000f400956001
Jun 28 21:14:25 ryzen kernel: RDX: 0000000000002000 RSI: 0000000000002000 RDI: ffff9edab48a0000
Jun 28 21:14:25 ryzen kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
Jun 28 21:14:25 ryzen kernel: R10: ffffffffc03e4c50 R11: ffff9edab30d0800 R12: 0000000000002000
Jun 28 21:14:25 ryzen kernel: R13: 0000000000000001 R14: ffffb8b8c3fa7ae8 R15: 000000f400956000
Jun 28 21:14:25 ryzen kernel: FS: 00007f622bb59700(0000) GS:ffff9edadecc0000(0000) knlGS:0000000000000000
Jun 28 21:14:25 ryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 28 21:14:25 ryzen kernel: CR2: 0000000000002000 CR3: 00000007e03f8000 CR4: 00000000003406e0
Jun 28 21:14:25 ryzen kernel: Call Trace:
Jun 28 21:14:25 ryzen kernel: amdgpu_vm_cpu_set_ptes+0x76/0xf0 [amdgpu]
Jun 28 21:14:25 ryzen kernel: amdgpu_vm_update_directories+0x1ca/0x3c0 [amdgpu]
Jun 28 21:14:25 ryzen kernel: ? amdgpu_vm_do_copy_ptes+0xc0/0xc0 [amdgpu]
Jun 28 21:14:25 ryzen kernel: amdgpu_cs_ioctl+0x1169/0x1a70 [amdgpu]
Jun 28 21:14:25 ryzen kernel: ? dequeue_entity+0x156/0x950
Jun 28 21:14:25 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jun 28 21:14:25 ryzen kernel: drm_ioctl_kernel+0x5b/0xb0 [drm]
Jun 28 21:14:25 ryzen kernel: drm_ioctl+0x1b7/0x370 [drm]
Jun 28 21:14:25 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jun 28 21:14:25 ryzen kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Jun 28 21:14:25 ryzen kernel: do_vfs_ioctl+0xa4/0x610
Jun 28 21:14:25 ryzen kernel: ksys_ioctl+0x60/0x90
Jun 28 21:14:25 ryzen kernel: __x64_sys_ioctl+0x16/0x20
Jun 28 21:14:25 ryzen kernel: do_syscall_64+0x5b/0x170
Jun 28 21:14:25 ryzen kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 28 21:14:25 ryzen kernel: RIP: 0033:0x7f623b586667
Jun 28 21:14:25 ryzen kernel: RSP: 002b:00007f622bb58a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 28 21:14:25 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f622bb58b88 RCX: 00007f623b586667
Jun 28 21:14:25 ryzen kernel: RDX: 00007f622bb58b00 RSI: 00000000c0186444 RDI: 000000000000000b
Jun 28 21:14:25 ryzen kernel: RBP: 00007f622bb58b00 R08: 00007f622bb58bb0 R09: 0000000000000010
Jun 28 21:14:25 ryzen kernel: R10: 00007f622bb58bb0 R11: 0000000000000246 R12: 00000000c0186444
Jun 28 21:14:25 ryzen kernel: R13: 000000000000000b R14: 000000000000000a R15: 0000000000000000
Jun 28 21:14:25 ryzen kernel: Code: 8b 80 d8 00 00 00 e9 85 ed 5c c2 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 0>
Jun 28 21:14:25 ryzen kernel: RIP: gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu] RSP: ffffb8b8c3fa7a70
Jun 28 21:14:25 ryzen kernel: CR2: 0000000000002000
Jun 28 21:14:25 ryzen kernel: ---[ end trace 6fce4be2faa5be7e ]---
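For readers decoding the log above: the value after "Oops:" is the x86 page-fault error code. The following is a minimal sketch in Python, not part of the original report, that decodes its low bits. The names PF_BITS and decode_oops_code are made up for illustration.

```python
# Low bits of the x86 page-fault error code (the hex value after "Oops:").
# Each entry: (bit mask, meaning if set, meaning if clear or None).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page tables", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return human-readable properties for an Oops error code string."""
    code = int(code_hex, 16)
    properties = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            properties.append(text)
    return properties


if __name__ == "__main__":
    print(decode_oops_code("0002"))
```

For the 0002 reported here this gives a write access from kernel mode to a page that is not present, which fits a store through a bad pointer at the CR2 address.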
--- Comment #1 from dwagner jb5sgc1n.nya@20mm.eu --- Created attachment 140383 --> https://bugs.freedesktop.org/attachment.cgi?id=140383&action=edit dmesg of the system boot and before and at the crash at S3 resume
--- Comment #2 from dwagner jb5sgc1n.nya@20mm.eu --- (Just for reference: This bug report is for a different kind of S3-resume-crash than reported in https://bugs.freedesktop.org/show_bug.cgi?id=103277 )
--- Comment #3 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- Can you use addr2line or gdb with 'list' command to give the line number matching amdgpu_vm_cpu_set_ptes+0x76/0xf0 ?
--- Comment #4 from dwagner jb5sgc1n.nya@20mm.eu --- (In reply to Andrey Grodzovsky from comment #3)
> Can you use addr2line or gdb with 'list' command to give the line number matching amdgpu_vm_cpu_set_ptes+0x76/0xf0 ?
That would have been easy had I used my self-compiled kernel - but it seems there is no debuginfo file available for the Arch Linux supplied kernels, which I ran in this case.
So I can only provide a disassembled listing of that function, with offset 0x76 aka +118 inside:
Dump of assembler code for function amdgpu_vm_cpu_set_ptes:
0x0000000000027c80 <+0>: callq 0x27c85 <amdgpu_vm_cpu_set_ptes+5>
0x0000000000027c85 <+5>: push %r15
0x0000000000027c87 <+7>: mov %rcx,%r15
0x0000000000027c8a <+10>: push %r14
0x0000000000027c8c <+12>: mov %rdi,%r14
0x0000000000027c8f <+15>: mov %rsi,%rdi
0x0000000000027c92 <+18>: push %r13
0x0000000000027c94 <+20>: mov %r8d,%r13d
0x0000000000027c97 <+23>: push %r12
0x0000000000027c99 <+25>: mov %rdx,%r12
0x0000000000027c9c <+28>: push %rbp
0x0000000000027c9d <+29>: mov %r9d,%ebp
0x0000000000027ca0 <+32>: push %rbx
0x0000000000027ca1 <+33>: callq 0x27ca6 <amdgpu_vm_cpu_set_ptes+38>
0x0000000000027ca6 <+38>: add %rax,%r12
0x0000000000027ca9 <+41>: nopl 0x0(%rax,%rax,1)
0x0000000000027cae <+46>: xor %ebx,%ebx
0x0000000000027cb0 <+48>: test %r13d,%r13d
0x0000000000027cb3 <+51>: je 0x27cfb <amdgpu_vm_cpu_set_ptes+123>
0x0000000000027cb5 <+53>: mov 0x28(%r14),%rax
0x0000000000027cb9 <+57>: mov %r15,%rcx
0x0000000000027cbc <+60>: test %rax,%rax
0x0000000000027cbf <+63>: je 0x27cd3 <amdgpu_vm_cpu_set_ptes+83>
0x0000000000027cc1 <+65>: mov %r15,%rdx
0x0000000000027cc4 <+68>: mov $0xfffffffffffff000,%rcx
0x0000000000027ccb <+75>: shr $0xc,%rdx
0x0000000000027ccf <+79>: and (%rax,%rdx,8),%rcx
0x0000000000027cd3 <+83>: mov (%r14),%rdi
0x0000000000027cd6 <+86>: mov %ebx,%edx
0x0000000000027cd8 <+88>: add $0x1,%ebx
0x0000000000027cdb <+91>: mov 0x38(%rsp),%r8
0x0000000000027ce0 <+96>: mov %r12,%rsi
0x0000000000027ce3 <+99>: add %rbp,%r15
0x0000000000027ce6 <+102>: mov 0x968(%rdi),%rax
0x0000000000027ced <+109>: mov 0x18(%rax),%rax
0x0000000000027cf1 <+113>: callq 0x27cf6 <amdgpu_vm_cpu_set_ptes+118>
0x0000000000027cf6 <+118>: cmp %ebx,%r13d
0x0000000000027cf9 <+121>: jne 0x27cb5 <amdgpu_vm_cpu_set_ptes+53>
0x0000000000027cfb <+123>: pop %rbx
0x0000000000027cfc <+124>: pop %rbp
0x0000000000027cfd <+125>: pop %r12
0x0000000000027cff <+127>: pop %r13
0x0000000000027d01 <+129>: pop %r14
0x0000000000027d03 <+131>: pop %r15
0x0000000000027d05 <+133>: retq
0x0000000000027d06 <+134>: mov %gs:0x0(%rip),%eax # 0x27d0d <amdgpu_vm_cpu_set_ptes+141>
0x0000000000027d0d <+141>: mov %eax,%eax
0x0000000000027d0f <+143>: bt %rax,0x0(%rip) # 0x27d17 <amdgpu_vm_cpu_set_ptes+151>
0x0000000000027d17 <+151>: jae 0x27cae <amdgpu_vm_cpu_set_ptes+46>
0x0000000000027d19 <+153>: incl %gs:0x0(%rip) # 0x27d20 <amdgpu_vm_cpu_set_ptes+160>
0x0000000000027d20 <+160>: mov 0x0(%rip),%rbx # 0x27d27 <amdgpu_vm_cpu_set_ptes+167>
0x0000000000027d27 <+167>: test %rbx,%rbx
0x0000000000027d2a <+170>: je 0x27d55 <amdgpu_vm_cpu_set_ptes+213>
0x0000000000027d2c <+172>: mov (%rbx),%rax
0x0000000000027d2f <+175>: mov 0x8(%rbx),%rdi
0x0000000000027d33 <+179>: add $0x18,%rbx
0x0000000000027d37 <+183>: mov 0x38(%rsp),%r9
0x0000000000027d3c <+188>: mov %ebp,%r8d
0x0000000000027d3f <+191>: mov %r13d,%ecx
0x0000000000027d42 <+194>: mov %r15,%rdx
0x0000000000027d45 <+197>: mov %r12,%rsi
0x0000000000027d48 <+200>: callq 0x27d4d <amdgpu_vm_cpu_set_ptes+205>
0x0000000000027d4d <+205>: mov (%rbx),%rax
0x0000000000027d50 <+208>: test %rax,%rax
0x0000000000027d53 <+211>: jne 0x27d2f <amdgpu_vm_cpu_set_ptes+175>
0x0000000000027d55 <+213>: decl %gs:0x0(%rip) # 0x27d5c <amdgpu_vm_cpu_set_ptes+220>
0x0000000000027d5c <+220>: jne 0x27cae <amdgpu_vm_cpu_set_ptes+46>
0x0000000000027d62 <+226>: callq 0x27d67 <amdgpu_vm_cpu_set_ptes+231>
0x0000000000027d67 <+231>: jmpq 0x27cae <amdgpu_vm_cpu_set_ptes+46>
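As an illustration of the offset arithmetic used above (a sketch, not part of the original exchange; the helper names are hypothetical): a frame such as amdgpu_vm_cpu_set_ptes+0x76/0xf0 means offset 0x76 (decimal 118) into a function of size 0xf0. The snippet parses such a frame, adds the offset to the function start address 0x27c80 taken from the listing, and builds the gdb "list *(symbol+offset)" expression that could be used on a module built with debug info.

```python
import re


def parse_frame(frame):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size)."""
    match = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", frame.strip())
    if match is None:
        raise ValueError("unrecognized frame: %r" % frame)
    return match.group(1), int(match.group(2), 16), int(match.group(3), 16)


def gdb_list_command(frame):
    """Build the gdb command that lists the source around the frame."""
    symbol, offset, _size = parse_frame(frame)
    return "list *(%s+0x%x)" % (symbol, offset)


if __name__ == "__main__":
    frame = "amdgpu_vm_cpu_set_ptes+0x76/0xf0"
    symbol, offset, size = parse_frame(frame)
    base = 0x27C80  # function start address, taken from the listing above
    print(offset)              # decimal offset into the function
    print(hex(base + offset))  # address of that offset within the module
    print(gdb_list_command(frame))
```

Since call-trace entries are return addresses, the instruction of interest is the callq at <+113> immediately before <+118>.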
--- Comment #5 from dwagner jb5sgc1n.nya@20mm.eu --- Interesting: With amd-staging-drm-next, I see the same crash at https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/am... with the same backtrace with vm_update_mode=3 immediately upon starting X11 - not only after S3 resume. Here with symbols translated to source lines:
Jun 29 01:49:05 ryzen kernel: amdgpu_vm_cpu_set_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:921 (discriminator 2)) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_vm_update_directories (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:989 /drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1096) amdgpu
Jun 29 01:49:05 ryzen kernel: ? amdgpu_vm_do_copy_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:913) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_gem_va_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:542 /drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:674) amdgpu
Jun 29 01:49:05 ryzen kernel: ? __alloc_pages_nodemask (/mm/page_alloc.c:4355)
Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
Jun 29 01:49:05 ryzen kernel: drm_ioctl_kernel+0xa7/0xf0 drm
Jun 29 01:49:05 ryzen kernel: drm_ioctl+0x2f1/0x3c0 drm
Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
Jun 29 01:49:05 ryzen kernel: amdgpu_drm_ioctl (/./include/linux/pm_runtime.h:108 /drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:842) amdgpu
Jun 29 01:49:05 ryzen kernel: do_vfs_ioctl (/fs/ioctl.c:46 /fs/ioctl.c:500 /fs/ioctl.c:684)
Jun 29 01:49:05 ryzen kernel: ? handle_mm_fault (/mm/memory.c:4133)
Jun 29 01:49:05 ryzen kernel: ksys_ioctl (/./include/linux/file.h:39 /fs/ioctl.c:702)
Jun 29 01:49:05 ryzen kernel: __x64_sys_ioctl (/fs/ioctl.c:708 /fs/ioctl.c:706 /fs/ioctl.c:706)
Jun 29 01:49:05 ryzen kernel: do_syscall_64 (/arch/x86/entry/common.c:290)
Jun 29 01:49:05 ryzen kernel: entry_SYSCALL_64_after_hwframe (/./include/trace/events/initcall.h:10 /./include/trace/events/initcall.h:10)
--- Comment #6 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #5)
> Interesting: With amd-staging-drm-next, I see the same crash at https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c?h=amd-staging-drm-next#n921 with the same backtrace with vm_update_mode=3 immediately upon starting X11 - not only after S3 resume. Here with symbols translated to source lines:
> Jun 29 01:49:05 ryzen kernel: amdgpu_vm_cpu_set_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:921 (discriminator 2)) amdgpu
> Jun 29 01:49:05 ryzen kernel: amdgpu_vm_update_directories (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:989 /drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1096) amdgpu
> Jun 29 01:49:05 ryzen kernel: ? amdgpu_vm_do_copy_ptes (/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:913) amdgpu
> Jun 29 01:49:05 ryzen kernel: amdgpu_gem_va_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:542 /drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:674) amdgpu
> Jun 29 01:49:05 ryzen kernel: ? __alloc_pages_nodemask (/mm/page_alloc.c:4355)
> Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
> Jun 29 01:49:05 ryzen kernel: drm_ioctl_kernel+0xa7/0xf0 drm
> Jun 29 01:49:05 ryzen kernel: drm_ioctl+0x2f1/0x3c0 drm
> Jun 29 01:49:05 ryzen kernel: ? amdgpu_gem_metadata_ioctl (/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c:548) amdgpu
> Jun 29 01:49:05 ryzen kernel: amdgpu_drm_ioctl (/./include/linux/pm_runtime.h:108 /drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:842) amdgpu
> Jun 29 01:49:05 ryzen kernel: do_vfs_ioctl (/fs/ioctl.c:46 /fs/ioctl.c:500 /fs/ioctl.c:684)
> Jun 29 01:49:05 ryzen kernel: ? handle_mm_fault (/mm/memory.c:4133)
> Jun 29 01:49:05 ryzen kernel: ksys_ioctl (/./include/linux/file.h:39 /fs/ioctl.c:702)
> Jun 29 01:49:05 ryzen kernel: __x64_sys_ioctl (/fs/ioctl.c:708 /fs/ioctl.c:706 /fs/ioctl.c:706)
> Jun 29 01:49:05 ryzen kernel: do_syscall_64 (/arch/x86/entry/common.c:290)
> Jun 29 01:49:05 ryzen kernel: entry_SYSCALL_64_after_hwframe (/./include/trace/events/initcall.h:10 /./include/trace/events/initcall.h:10)
So with Arch Linux kernel it happens only during S3 but with amd-staging-drm-next it happens once you start X ?
--- Comment #7 from dwagner jb5sgc1n.nya@20mm.eu --- (In reply to Andrey Grodzovsky from comment #6)
> So with Arch Linux kernel it happens only during S3 but with amd-staging-drm-next it happens once you start X ?
Yes. I know it sounds strange, but it's currently 100% reproducible to me:
Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=0: X11 starts fine, but system crashes after minutes of firefox browsing
Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=3: X11 starts fine, system does not crash (for at least hours of use) but crashes as above if resumed from S3 sleep
Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=0: X11 starts fine, but system crashes after minutes of firefox browsing
Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=3: X11 does not start, crashes immediately with the same above pasted kernel BUG message and backtrace
So something with CPU-based vm_update_mode is broken, but in a different way than the SDMA-based method.
I will change the subject of this report to reflect that this crash is not necessarily S3-resume-related.
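To make the four configurations above easier to compare, here is a small sketch of how the amdgpu.vm_update_mode value can be read as a bitmask. The bit meanings (bit 0 = CPU updates for graphics VMs, bit 1 = CPU updates for compute VMs, SDMA otherwise) are my reading of the module parameter description and are an assumption, not something stated in this report. The constant and function names are illustrative only.

```python
# Assumed bit layout of amdgpu.vm_update_mode (not confirmed by this report):
# bit 0 selects CPU page-table updates for graphics VMs,
# bit 1 selects CPU page-table updates for compute VMs.
USE_CPU_FOR_GFX = 1 << 0
USE_CPU_FOR_COMPUTE = 1 << 1


def describe_vm_update_mode(mode):
    """Return which engine would update page tables for each VM type."""
    return {
        "gfx": "CPU" if mode & USE_CPU_FOR_GFX else "SDMA",
        "compute": "CPU" if mode & USE_CPU_FOR_COMPUTE else "SDMA",
    }


if __name__ == "__main__":
    for mode in (0, 3):  # the two values tested in this report
        print(mode, describe_vm_update_mode(mode))
```

Under that assumption, vm_update_mode=0 exercises only the SDMA path and vm_update_mode=3 only the CPU path, which matches the two distinct failure patterns described above.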
dwagner jb5sgc1n.nya@20mm.eu changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|"BUG: unable to handle      |"BUG: unable to handle
                   |kernel paging request at    |kernel paging request at
                   |0000000000002000" at        |0000000000002000" in
                   |amdgpu_vm_cpu_set_ptes at   |amdgpu_vm_cpu_set_ptes at
                   |S3 resume                   |amdgpu_vm.c:921
--- Comment #8 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #7)
> (In reply to Andrey Grodzovsky from comment #6)
> > So with Arch Linux kernel it happens only during S3 but with amd-staging-drm-next it happens once you start X ?
> Yes. I know it sounds strange, but it's currently 100% reproducible to me:
> Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=0: X11 starts fine, but system crashes after minutes of firefox browsing
> Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=3: X11 starts fine, system does not crash (for at least hours of use) but crashes as above if resumed from S3 sleep
> Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=0: X11 starts fine, but system crashes after minutes of firefox browsing
> Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=3: X11 does not start, crashes immediately with the same above pasted kernel BUG message and backtrace
> So something with CPU-based vm_update_mode is broken, but in a different way than the SDMA-based method.
> I will change the subject of this report to reflect that this crash is not necessarily S3-resume-related.
I am going to try to reproduce the crash with CPU update mode here. Please describe exactly which ASIC you are using.
--- Comment #9 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to Andrey Grodzovsky from comment #8)
> (In reply to dwagner from comment #7)
> > (In reply to Andrey Grodzovsky from comment #6)
> > > So with Arch Linux kernel it happens only during S3 but with amd-staging-drm-next it happens once you start X ?
> > Yes. I know it sounds strange, but it's currently 100% reproducible to me:
> > Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=0: X11 starts fine, but system crashes after minutes of firefox browsing
> > Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=3: X11 starts fine, system does not crash (for at least hours of use) but crashes as above if resumed from S3 sleep
> > Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=0: X11 starts fine, but system crashes after minutes of firefox browsing
> > Booting linux compiled from amd-staging-drm-next, as of commit 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with amdgpu.vm_update_mode=3: X11 does not start, crashes immediately with the same above pasted kernel BUG message and backtrace
> > So something with CPU-based vm_update_mode is broken, but in a different way than the SDMA-based method.
> > I will change the subject of this report to reflect that this crash is not necessarily S3-resume-related.
> I am going to try and reproduce the crash with CPU update mode here, please describe exactly what ASIC are you using ?
Got it already.
--- Comment #10 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- Created attachment 140418 --> https://bugs.freedesktop.org/attachment.cgi?id=140418&action=edit drm/amdgpu: Verify root PD is mapped into kernel address space.
dwagner, please try this patch. Fixes the issue for me and I observed no suspend/resume issues.
Christian, please take a look at the patch. The problem was that in amdgpu_vm_update_directories the parent BO didn't have a kernel mapping, so later inside amdgpu_vm_cpu_set_ptes the statement pe += (unsigned long)amdgpu_bo_kptr(bo); would leave pe equal to 0000000000002000, since amdgpu_bo_kptr for the parent would return NULL. The parent was the root PD.
This was still working at commit 67b8d5c (Linux 4.17-rc5, tag v4.17-rc5, by Linus Torvalds, 7 weeks ago), but I wasn't able to pinpoint exactly which change broke it. I am not sure my fix is the right one, so please advise.
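The address arithmetic in the explanation above can be sketched as follows. This is only a model in Python, not driver code: effective_pe() is a hypothetical helper, and the 0x2000 byte offset is an assumption read off the register dump in the first report (RSI, RDX and R12 all hold 0x2000).

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep results within a 64-bit address


def effective_pe(kptr, pe_offset):
    """Model `pe += (unsigned long)amdgpu_bo_kptr(bo)`.

    kptr is the kernel mapping of the page-table BO (0 models NULL),
    pe_offset is the byte offset of the entry inside that BO.
    """
    return (kptr + pe_offset) & MASK64


if __name__ == "__main__":
    # With a NULL kernel mapping, the CPU write targets the raw offset.
    print(hex(effective_pe(0, 0x2000)))
    # With a valid mapping (made-up address), it lands inside the mapping.
    print(hex(effective_pe(0xFFFFB8B8C0000000, 0x2000)))
```

With kptr modelled as 0, the computed address equals the offset itself, which is the same value as the CR2 fault address 0000000000002000 in the first oops.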
--- Comment #11 from Christian König ckoenig.leichtzumerken@gmail.com --- (In reply to Andrey Grodzovsky from comment #10)
> Created attachment 140418 drm/amdgpu: Verify root PD is mapped into kernel address space.
> dwagner, please try this patch. Fixes the issue for me and I observed no suspend/resume issues.
> Christian, please take a look at the patch, problem was that in amdgpu_vm_update_directories the parent BO didn't have a kernel mapping and so later inside amdgpu_vm_cpu_set_ptes pe += (unsigned long)amdgpu_bo_kptr(bo); would equal to 0000000000002000 since parent amdgpu_bo_kptr would return NULL. The parent was the root PD.
> This was still working in 67b8d5c Linus Torvalds 7 weeks ago Linux 4.17-rc5 (tag: v4.17-rc5) but I wasn't able to exactly pinpoint which change broke it. I am not sure my fix is the right one so please advise.
No idea when that broke either; CPU-based updates are not something we usually test.
Anyway it's a good catch, but I would rather add that to amdgpu_vm_bo_base_init() (with the appropriate checks).
That would also allow us to remove the duplicated code from amdgpu_vm_alloc_levels().
--- Comment #12 from dwagner jb5sgc1n.nya@20mm.eu --- (In reply to Andrey Grodzovsky from comment #10)
> Created attachment 140418 drm/amdgpu: Verify root PD is mapped into kernel address space.
> dwagner, please try this patch. Fixes the issue for me and I observed no suspend/resume issues.
While I can start X11 with this patch applied to current amd-staging-drm-next, attempts to resume from S3 fail consistently.
The following related output is emitted right before the suspend:
Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend to debug)
Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...
(I wonder if that "[TTM] Buffer eviction failed" is a bad sign, as I have seen it at other times in conjunction with heavy use of the amdgpu driver.)
Then, upon resume, the following messages are emitted:
Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete
Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 148 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 189 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 306 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 5e ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 18a ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 148 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0
Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22
Jul 02 21:31:33 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume async: error -22
Jul 02 21:31:33 ryzen kernel: OOM killer enabled.
Jul 02 21:31:33 ryzen kernel: Restarting tasks ... done.
Jul 02 21:31:33 ryzen kernel: PM: suspend exit
Jul 02 21:31:33 ryzen kernel: BUG: unable to handle kernel paging request at 0000000000001000
Jul 02 21:31:33 ryzen kernel: PGD 0 P4D 0
Jul 02 21:31:33 ryzen kernel: Oops: 0002 [#1] SMP
Jul 02 21:31:33 ryzen kernel: CPU: 14 PID: 791 Comm: amdgpu_cs:0 Tainted: G W O 4.18.0-rc1-amd+ #45
Jul 02 21:31:33 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1
Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000
Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000 R12: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000
Jul 02 21:31:33 ryzen kernel: FS: 00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000
Jul 02 21:31:33 ryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0
Jul 02 21:31:33 ryzen kernel: Call Trace:
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_cpu_set_ptes+0x76/0xe0 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_frag_ptes+0xae/0x130 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_bo_update_mapping+0xed/0x410 [amdgpu]
Jul 02 21:31:33 ryzen kernel: ? amdgpu_vm_do_copy_ptes+0xa0/0xa0 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_vm_bo_update+0x310/0x680 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_cs_ioctl+0x1092/0x1a50 [amdgpu]
Jul 02 21:31:33 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jul 02 21:31:33 ryzen kernel: drm_ioctl_kernel+0xa7/0xf0 [drm]
Jul 02 21:31:33 ryzen kernel: drm_ioctl+0x2f1/0x3c0 [drm]
Jul 02 21:31:33 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
Jul 02 21:31:33 ryzen kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Jul 02 21:31:33 ryzen kernel: do_vfs_ioctl+0xa4/0x620
Jul 02 21:31:33 ryzen kernel: ? __se_sys_futex+0x138/0x180
Jul 02 21:31:33 ryzen kernel: ksys_ioctl+0x60/0x90
Jul 02 21:31:33 ryzen kernel: __x64_sys_ioctl+0x16/0x20
Jul 02 21:31:33 ryzen kernel: do_syscall_64+0x48/0xf0
Jul 02 21:31:33 ryzen kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 02 21:31:33 ryzen kernel: RIP: 0033:0x7f8b66c92667
Jul 02 21:31:33 ryzen kernel: Code: 00 00 90 48 8b 05 e9 67 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 8>
Jul 02 21:31:33 ryzen kernel: RSP: 002b:00007f8b57265a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jul 02 21:31:33 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f8b57265b88 RCX: 00007f8b66c92667
Jul 02 21:31:33 ryzen kernel: RDX: 00007f8b57265b00 RSI: 00000000c0186444 RDI: 000000000000000b
Jul 02 21:31:33 ryzen kernel: RBP: 00007f8b57265b00 R08: 00007f8b57265bb0 R09: 0000000000000010
Jul 02 21:31:33 ryzen kernel: R10: 00007f8b57265bb0 R11: 0000000000000246 R12: 00000000c0186444
Jul 02 21:31:33 ryzen kernel: R13: 000000000000000b R14: 0000000000000002 R15: 0000000000000000
Jul 02 21:31:33 ryzen kernel: Modules linked in: it87(O) joydev mousedev hid_generic hidp hid ipt_REJECT nf_reject_ipv4 nf_l>
Jul 02 21:31:33 ryzen kernel: serio_raw crc32_pclmul atkbd ghash_clmulni_intel libps2 pcbc ahci libahci xhci_pci libata aes>
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000
Jul 02 21:31:33 ryzen kernel: ---[ end trace 517a8a72887251f0 ]---
Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu]
Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1
Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000
Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000 R12: 0000000000001000
Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000
Jul 02 21:31:33 ryzen kernel: FS: 00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000
Jul 02 21:31:33 ryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0
(At this point, the machine is just dead and does not react to anything.)
So something is still wrong at amdgpu_vm_cpu_set_ptes+0x76
--- Comment #13 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #12)
> (In reply to Andrey Grodzovsky from comment #10)
> > Created attachment 140418 drm/amdgpu: Verify root PD is mapped into kernel address space.
> > dwagner, please try this patch. Fixes the issue for me and I observed no suspend/resume issues.
> While I can start X11 with this patch applied to current amd-staging-drm-next, attempts to resume from S3 fail consistently.
> The following related output is emitted right before the suspend:
> Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
> Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend to debug)
> Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
> Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
> Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
> Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
> Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...
> (I wonder if that "[TTM] Buffer eviction failed" is a bad sign - as I have seen it some other times in conjunction with heavy uses of the amdgpu driver.)
> Then, upon resume, the following messages are emitted:
Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000). Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 148 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 189 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 306 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 5e ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 18a ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 148 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was 
failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 145 ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] last message was failed ret is 0 Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] failed to send message 146 ret is 0 Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xC> Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22 Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22). Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22 Jul 02 21:31:33 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume async: error -22 Jul 02 21:31:33 ryzen kernel: OOM killer enabled. Jul 02 21:31:33 ryzen kernel: Restarting tasks ... done. Jul 02 21:31:33 ryzen kernel: PM: suspend exit Jul 02 21:31:33 ryzen kernel: BUG: unable to handle kernel paging request at 0000000000001000 Jul 02 21:31:33 ryzen kernel: PGD 0 P4D 0 Jul 02 21:31:33 ryzen kernel: Oops: 0002 [#1] SMP Jul 02 21:31:33 ryzen kernel: CPU: 14 PID: 791 Comm: amdgpu_cs:0 Tainted: G W O 4.18.0-rc1-amd+ #45 Jul 02 21:31:33 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018 Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu] Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0> Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202 Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1 Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000 Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000 Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: 
ffff8807daf78000 R12: 0000000000001000 Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000 Jul 02 21:31:33 ryzen kernel: FS: 00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000 Jul 02 21:31:33 ryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0 Jul 02 21:31:33 ryzen kernel: Call Trace: Jul 02 21:31:33 ryzen kernel: amdgpu_vm_cpu_set_ptes+0x76/0xe0 [amdgpu] Jul 02 21:31:33 ryzen kernel: amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu] Jul 02 21:31:33 ryzen kernel: amdgpu_vm_frag_ptes+0xae/0x130 [amdgpu] Jul 02 21:31:33 ryzen kernel: amdgpu_vm_bo_update_mapping+0xed/0x410 [amdgpu] Jul 02 21:31:33 ryzen kernel: ? amdgpu_vm_do_copy_ptes+0xa0/0xa0 [amdgpu] Jul 02 21:31:33 ryzen kernel: amdgpu_vm_bo_update+0x310/0x680 [amdgpu] Jul 02 21:31:33 ryzen kernel: amdgpu_cs_ioctl+0x1092/0x1a50 [amdgpu] Jul 02 21:31:33 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu] Jul 02 21:31:33 ryzen kernel: drm_ioctl_kernel+0xa7/0xf0 [drm] Jul 02 21:31:33 ryzen kernel: drm_ioctl+0x2f1/0x3c0 [drm] Jul 02 21:31:33 ryzen kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu] Jul 02 21:31:33 ryzen kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu] Jul 02 21:31:33 ryzen kernel: do_vfs_ioctl+0xa4/0x620 Jul 02 21:31:33 ryzen kernel: ? 
__se_sys_futex+0x138/0x180 Jul 02 21:31:33 ryzen kernel: ksys_ioctl+0x60/0x90 Jul 02 21:31:33 ryzen kernel: __x64_sys_ioctl+0x16/0x20 Jul 02 21:31:33 ryzen kernel: do_syscall_64+0x48/0xf0 Jul 02 21:31:33 ryzen kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 Jul 02 21:31:33 ryzen kernel: RIP: 0033:0x7f8b66c92667 Jul 02 21:31:33 ryzen kernel: Code: 00 00 90 48 8b 05 e9 67 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 8> Jul 02 21:31:33 ryzen kernel: RSP: 002b:00007f8b57265a98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 Jul 02 21:31:33 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f8b57265b88 RCX: 00007f8b66c92667 Jul 02 21:31:33 ryzen kernel: RDX: 00007f8b57265b00 RSI: 00000000c0186444 RDI: 000000000000000b Jul 02 21:31:33 ryzen kernel: RBP: 00007f8b57265b00 R08: 00007f8b57265bb0 R09: 0000000000000010 Jul 02 21:31:33 ryzen kernel: R10: 00007f8b57265bb0 R11: 0000000000000246 R12: 00000000c0186444 Jul 02 21:31:33 ryzen kernel: R13: 000000000000000b R14: 0000000000000002 R15: 0000000000000000 Jul 02 21:31:33 ryzen kernel: Modules linked in: it87(O) joydev mousedev hid_generic hidp hid ipt_REJECT nf_reject_ipv4 nf_l> Jul 02 21:31:33 ryzen kernel: serio_raw crc32_pclmul atkbd ghash_clmulni_intel libps2 pcbc ahci libahci xhci_pci libata aes> Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 Jul 02 21:31:33 ryzen kernel: ---[ end trace 517a8a72887251f0 ]--- Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30 [amdgpu] Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0> Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202 Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001 RCX: 000000000fe004f1 Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000 RDI: ffff8807e2f70000 Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1 R09: 0000000000001000 Jul 02 21:31:33 ryzen kernel: R10: 
ffffffffa03ac7e0 R11: ffff8807daf78000 R12: 0000000000001000 Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18 R15: 000000000fe01000 Jul 02 21:31:33 ryzen kernel: FS: 00007f8b57266700(0000) GS:ffff88081ef80000(0000) knlGS:0000000000000000 Jul 02 21:31:33 ryzen kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000 CR4: 00000000003406e0
(At this point the machine is completely dead and no longer reacts to anything.)
So something is still wrong at amdgpu_vm_cpu_set_ptes+0x76.
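For readers following the trace: the "Oops: 0002" value is the x86 page-fault error code, and CR2 (0000000000001000) is the faulting address, which here equals the values in RDX, RSI and R12. The sketch below decodes the error code using the standard x86 bit layout; the function and table names are made up for illustration and are not part of any kernel tool.

```python
# Minimal sketch: decode the hexadecimal value printed after "Oops:" on x86.
# Bit layout follows the x86 page-fault error code; names are illustrative.
PF_BITS = {
    0: ("protection violation", "page not present"),
    1: ("write access", "read access"),
    2: ("user mode", "kernel mode"),
    3: ("reserved bit set in page table", None),
    4: ("instruction fetch", None),
}


def decode_oops_code(code_hex: str) -> list[str]:
    """Return a description for each relevant bit of the error code."""
    code = int(code_hex, 16)
    described = []
    for bit, (if_set, if_clear) in PF_BITS.items():
        text = if_set if code & (1 << bit) else if_clear
        if text:
            described.append(text)
    return described


print(decode_oops_code("0002"))
# ['page not present', 'write access', 'kernel mode']
```

So the kernel faulted on a write to an unmapped page at a very low address, which would be consistent with (though it does not prove) a store through a page table pointer that was never set up again after resume.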
My guess is that on resume from S3 the root PD needs to be mapped into the CPU address space again. Maybe changing the patch according to Christian's advice will be enough; I will take a look tomorrow. Or it has to do with the resume failure you are experiencing. What ASIC are you using? I also tested with a gfx8 ASIC and haven't observed any issues with resume. Did you update the firmware for this ASIC to the latest version?
--- Comment #14 from dwagner jb5sgc1n.nya@20mm.eu --- (In reply to Andrey Grodzovsky from comment #13)
The GPU is an RX460 "POLARIS11 0x1002:0x67EF 0x1682:0x9460 0xCF", with the latest firmware from the kernel git. You can see the details in the attachment uploaded earlier: https://bugs.freedesktop.org/attachment.cgi?id=140383
--- Comment #15 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #14)
We have only minor differences, but I can't reproduce it. Maybe the resume failure is indeed due to the eviction failure during suspend. Does the S3 failure happen only when you switch to CPU update mode?
--- Comment #16 from dwagner jb5sgc1n.nya@20mm.eu --- (In reply to Andrey Grodzovsky from comment #15)
No. When I boot amd-staging-drm-next with amdgpu.vm_update_mode=0 and suspend to S3, resuming also crashes, but with different messages - _not_ with "BUG: unable to handle kernel paging request at 0000000000002000" as in the vm_update_mode=3 case.
In the journal, I can see the following after a vm_update_mode=0 S3 resume attempt:
Jul 05 00:41:59 ryzen kernel: [TTM] Buffer eviction failed
Jul 05 00:41:59 ryzen kernel: ACPI: Preparing to enter system sleep state S3
...
Jul 05 00:42:00 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -22
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-22).
Jul 05 00:42:00 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -22
Jul 05 00:42:00 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume async: error -22
...
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu 0000:0a:00.0: couldn't schedule ib on ring <sdma0>
... many more of these ... but no kernel BUG or Oops.
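To compare resume attempts across kernels and settings, it can help to reduce a journal excerpt to the handful of signatures seen in this thread. The following is only a sketch: the signature names and the `summarize` helper are invented for illustration, and the patterns are simply the message texts quoted in this report. The value -22 that recurs in the log is the negated errno EINVAL.

```python
import errno
import re

# Message fragments taken from the logs in this report; the keys are
# illustrative labels, not kernel terminology.
SIGNATURES = {
    "eviction_failed": re.compile(r"\[TTM\] Buffer eviction failed"),
    "ring_test_failed": re.compile(r"ring (\d+) test failed"),
    "ip_resume_failed": re.compile(r"resume of IP block <(\w+)> failed (-?\d+)"),
    "oops": re.compile(r"BUG: unable to handle kernel paging request at ([0-9a-f]+)"),
}


def summarize(lines):
    """Map each signature found to the groups captured at its first match."""
    hits = {}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match and name not in hits:
                hits[name] = match.groups()
    return hits


def errno_name(code: int) -> str:
    """Translate a (possibly negated) kernel error code into its errno name."""
    return errno.errorcode.get(abs(code), "unknown")


sample = [
    "Jul 05 00:41:59 ryzen kernel: [TTM] Buffer eviction failed",
    "Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] "
    "*ERROR* resume of IP block <gfx_v8_0> failed -22",
]
found = summarize(sample)
print(sorted(found))
block, code = found["ip_resume_failed"]
print(block, errno_name(int(code)))
```

Run against the vm_update_mode=0 excerpt, such a scan would report the eviction and IP block resume failures but no "oops" entry, matching the observation that this mode fails without a kernel BUG.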
--- Comment #17 from dwagner jb5sgc1n.nya@20mm.eu --- Interesting observation: If I first switch from the X11 display to the console display (with Alt-F2) and then enter "echo mem >/sys/power/state" on the console, the crashes described above do not occur upon S3 resume, and I do not see "[TTM] Buffer eviction failed" in the kernel log, with either vm_update_mode=0 or vm_update_mode=3.
Switching back to the X11 display after a successful S3 resume to the console also works fine.
What could be the relevant difference here?
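As background for the two settings compared here: as far as I understand the amdgpu module parameter description, vm_update_mode is a small bitmask selecting which kinds of VM have their page tables written by the CPU rather than by the GPU. The sketch below decodes it under that assumption; the constant names, the helper functions and the sysfs path are illustrative assumptions, not something confirmed in this thread.

```python
from pathlib import Path
from typing import Optional

# Assumed bit meanings, based on the amdgpu module parameter description.
CPU_FOR_GFX = 1 << 0      # CPU writes page tables of graphics VMs
CPU_FOR_COMPUTE = 1 << 1  # CPU writes page tables of compute VMs

# Assumed sysfs location; only readable while the amdgpu module is loaded.
PARAM_PATH = Path("/sys/module/amdgpu/parameters/vm_update_mode")


def describe_vm_update_mode(mode: int) -> str:
    """Turn a vm_update_mode value into a human-readable description."""
    if mode < 0:
        return "driver default"
    users = []
    if mode & CPU_FOR_GFX:
        users.append("graphics")
    if mode & CPU_FOR_COMPUTE:
        users.append("compute")
    if not users:
        return "CPU never updates page tables"
    return "CPU updates page tables for " + " and ".join(users)


def read_vm_update_mode(path: Path = PARAM_PATH) -> Optional[int]:
    """Return the current setting, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


for mode in (0, 3):
    print(mode, "->", describe_vm_update_mode(mode))
```

Under this reading, mode 3 puts the CPU in the page table update path (where the amdgpu_vm_cpu_set_ptes fault occurs), while mode 0 leaves updates to the GPU, which may be why that case shows sdma0 scheduling errors instead of a CPU page fault.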
--- Comment #18 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #17)
Well, there is no acceleration involved in console mode, so maybe this has something to do with it.
Anyway, I am sidetracked a bit by an internal requirement, but once I finish I will get back to this issue, especially because I got another report with the same failure you describe.
--- Comment #19 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to Andrey Grodzovsky from comment #18)
I was able to reproduce this instantly, without even using CPU update mode for the page tables. It looks like a regression, since S3 was working fine for a long time. Were you able to find a regression point for this?
--- Comment #20 from dwagner jb5sgc1n.nya@20mm.eu --- (In reply to Andrey Grodzovsky from comment #19)
Not for the exact symptom described in this report, but for an older S3 resume issue that was partially resolved - https://bugs.freedesktop.org/show_bug.cgi?id=103277 - I did once find that the regression was caused by the "drm/amd/display: Match actual state during S3 resume" commit.
Unfortunately, the many changes that followed no longer allow bisecting that symptom to one specific commit. But given that it still occurs if I use the option "drm.edid_firmware=edid/LG_EG9609_edid.bin", I think there is still some bug in the order of operations during re-initialization upon S3 resume, and setting a fixed EDID seems to expose it as a crash.
--- Comment #21 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #20)
I found the offending patch: "drm: Stop updating plane->crtc/fb/old_fb on atomic drivers". I'm not sure yet what's going on there, and I'm not sure reverting it will fix your issue with the amdgpu_vm_cpu_set_ptes page fault after S3, since I haven't observed that here. It is still worth a try on your side to revert it and see what happens.
--- Comment #22 from dwagner jb5sgc1n.nya@20mm.eu --- (In reply to Andrey Grodzovsky from comment #21)
For me, reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic drivers" changes only one thing: after S3 resume, the very picture that was visible before S3 sleep is displayed again. The kernel crash at "amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as frozen as the system is dead.
--- Comment #23 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #22)
Can you attach the dmesg output from the system with the patch reverted?
--- Comment #24 from dwagner jb5sgc1n.nya@20mm.eu ---
Sure, will do
--- Comment #25 from dwagner jb5sgc1n.nya@20mm.eu --- Created attachment 140634 --> https://bugs.freedesktop.org/attachment.cgi?id=140634&action=edit dmesg before and after S3 sleep with commit "updating plane ..." reverted
--- Comment #26 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to dwagner from comment #25)
Reverting the patch makes the TTM eviction failure and the subsequent driver resume failure go away, so that is one issue. Another issue is that you still experience a fault related to page table updates during S3. I can't reproduce that issue.
I am currently looking into how this patch broke S3; this is the more burning issue, as other people experience it too. Later I will try to give you a debug printk patch to sort out your page fault issue.
Andrey Grodzovsky andrey.grodzovsky@amd.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|dri-devel@lists.freedesktop |andrey.grodzovsky@amd.com
                   |.org                        |
--- Comment #27 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- Created attachment 140715 --> https://bugs.freedesktop.org/attachment.cgi?id=140715&action=edit 0001-drm-amdgpu-Fix-S3-resume-failre.patch
Please try the attached patch for the S3 issue. It might not be the final fix yet, but it is still worth testing. It's not a fix for your CPU page table update fault.