Hi folks,
today I joined the testing of kernel 5.11 and saw that the kernel log was flooded with BUG messages:

BUG: sleeping function called from invalid context at mm/vmalloc.c:1756
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 266, name: kswapd0
INFO: lockdep is turned off.
CPU: 15 PID: 266 Comm: kswapd0 Tainted: G W --------- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.119.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 vm_unmap_aliases+0x21/0x40
 change_page_attr_set_clr+0x9e/0x190
 set_memory_wb+0x2f/0x80
 ttm_pool_free_page+0x28/0x90 [ttm]
 ttm_pool_shrink+0x45/0xb0 [ttm]
 ttm_pool_shrinker_scan+0xa/0x20 [ttm]
 do_shrink_slab+0x177/0x3a0
 shrink_slab+0x9c/0x290
 shrink_node+0x2e6/0x700
 balance_pgdat+0x2f5/0x650
 kswapd+0x21d/0x4d0
 ? do_wait_intr_irq+0xd0/0xd0
 ? balance_pgdat+0x650/0x650
 kthread+0x13a/0x150
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
But the most unpleasant thing is that after a while the monitor turns off and does not come back on until a restart. This is accompanied by an entry in the kernel log:

amdgpu 0000:0b:00.0: amdgpu: 00000000ff7d8b94 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
$ grep "Failed to pin framebuffer with error" -Rn .
./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:5816: DRM_ERROR("Failed to pin framebuffer with error %d\n", r);

$ git blame -L 5811,5821 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
Blaming lines: 0% (11/9167), done.
5d43be0ccbc2f (Christian König 2017-10-26 18:06:23 +0200 5811) domain = AMDGPU_GEM_DOMAIN_VRAM;
e7b07ceef2a65 (Harry Wentland 2017-08-10 13:29:07 -0400 5812)
7b7c6c81b3a37 (Junwei Zhang 2018-06-25 12:51:14 +0800 5813) r = amdgpu_bo_pin(rbo, domain);
e7b07ceef2a65 (Harry Wentland 2017-08-10 13:29:07 -0400 5814) if (unlikely(r != 0)) {
30b7c6147d18d (Harry Wentland 2017-10-26 15:35:14 -0400 5815)     if (r != -ERESTARTSYS)
30b7c6147d18d (Harry Wentland 2017-10-26 15:35:14 -0400 5816)         DRM_ERROR("Failed to pin framebuffer with error %d\n", r);
0f257b09531b4 (Chunming Zhou 2019-05-07 19:45:31 +0800 5817)     ttm_eu_backoff_reservation(&ticket, &list);
e7b07ceef2a65 (Harry Wentland 2017-08-10 13:29:07 -0400 5818)     return r;
e7b07ceef2a65 (Harry Wentland 2017-08-10 13:29:07 -0400 5819) }
e7b07ceef2a65 (Harry Wentland 2017-08-10 13:29:07 -0400 5820)
bb812f1ea87dd (Junwei Zhang 2018-06-25 13:32:24 +0800 5821) r = amdgpu_ttm_alloc_gart(&rbo->tbo);
Who knows how to fix it?
Full kernel logs are here:
[1] https://pastebin.com/fLasjDHX
[2] https://pastebin.com/g3wR2r9e
-- Best Regards, Mike Gavrilov.
Hi Mikhail
On 10.01.21 at 23:26, Mikhail Gavrilov wrote:
Hi folks, today I joined the testing of kernel 5.11 and saw that the kernel log was flooded with BUG messages: BUG: sleeping function called from invalid context at mm/vmalloc.c:1756
[SNIP]
I'm probably responsible for this. Need to double check why we try to allocate memory while freeing some.
But the most unpleasant thing is that after a while the monitor turns off and does not come back on until a restart. This is accompanied by an entry in the kernel log:
amdgpu 0000:0b:00.0: amdgpu: 00000000ff7d8b94 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
-12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by the problem above, maybe something completely unrelated.
I will take a look.
Thanks, Christian.
On 11.01.21 at 10:03, Christian König wrote:
Hi Mikhail
On 10.01.21 at 23:26, Mikhail Gavrilov wrote:
Hi folks, today I joined the testing of kernel 5.11 and saw that the kernel log was flooded with BUG messages: BUG: sleeping function called from invalid context at mm/vmalloc.c:1756
[SNIP]
I'm probably responsible for this. Need to double check why we try to allocate memory while freeing some.
Changing the page table attributes while releasing memory might sleep. So we can't use a spinlock here.
Thanks for the report, a patch to fix this is on the mailing list now.
But the most unpleasant thing is that after a while the monitor turns off and does not come back on until a restart. This is accompanied by an entry in the kernel log:
amdgpu 0000:0b:00.0: amdgpu: 00000000ff7d8b94 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
-12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by the problem above, maybe something completely unrelated.
I will take a look.
This looks like a completely unrelated memory leak to me.
Probably best if you open up a bug report for this.
Thanks, Christian.
On Mon, 11 Jan 2021 at 19:01, Christian König christian.koenig@amd.com wrote:
Changing the page table attributes while releasing memory might sleep. So we can't use a spinlock here.
Thanks for the report, a patch to fix this is on the mailing list now.
Can you also look at the first trace? It has the same error message, "sleeping function called from invalid context", and a lot of [amdgpu] code.

BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 501, name: systemd-udevd
1 lock held by systemd-udevd/501:
 #0: ffff978e0278d258 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x3b/0xb0
CPU: 25 PID: 501 Comm: systemd-udevd Not tainted 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 ? dcn30_clock_source_create+0x34/0xb0 [amdgpu]
 kmem_cache_alloc_trace+0x204/0x230
 dcn30_clock_source_create+0x34/0xb0 [amdgpu]
 dcn30_create_resource_pool+0x1d9/0x13a0 [amdgpu]
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? trace_kmalloc+0xb2/0xe0
 ? __kmalloc+0x191/0x280
 ? dc_create_resource_pool+0x110/0x1d0 [amdgpu]
 dc_create_resource_pool+0x110/0x1d0 [amdgpu]
 dc_create+0x205/0x790 [amdgpu]
 ? trace_kmalloc+0xb2/0xe0
 ? kmem_cache_alloc_trace+0x174/0x230
 amdgpu_dm_init.isra.0+0x1b9/0x250 [amdgpu]
 ? dev_vprintk_emit+0x171/0x195
 ? dev_printk_emit+0x3e/0x40
 dm_hw_init+0xe/0x20 [amdgpu]
 amdgpu_device_init.cold+0x179f/0x1afd [amdgpu]
 ? pci_conf1_read+0xa4/0x100
 amdgpu_driver_load_kms+0x68/0x280 [amdgpu]
 amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
 local_pci_probe+0x42/0x80
 pci_device_probe+0xd9/0x1a0
 really_probe+0x205/0x460
 driver_probe_device+0xe1/0x150
 device_driver_attach+0xa8/0xb0
 __driver_attach+0x8c/0x150
 ? device_driver_attach+0xb0/0xb0
 ? device_driver_attach+0xb0/0xb0
 bus_for_each_dev+0x67/0x90
 bus_add_driver+0x12e/0x1f0
 driver_register+0x8f/0xe0
 ? 0xffffffffc0d9c000
 do_one_initcall+0x67/0x320
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? trace_kmalloc+0xb2/0xe0
 ? kmem_cache_alloc_trace+0x174/0x230
 do_init_module+0x5c/0x270
 __do_sys_init_module+0x130/0x190
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f363661deee
Code: 48 8b 0d 85 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffeb7191588 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 0000561b94563170 RCX: 00007f363661deee
RDX: 0000561b94579df0 RSI: 0000000000b8a356 RDI: 00007f3633b9e010
RBP: 00007f3633b9e010 R08: 0000561b94565240 R09: 00007ffeb718d786
R10: 0000561ef5ef1595 R11: 0000000000000246 R12: 0000561b94579df0
R13: 0000561b9457a3e0 R14: 0000000000000000 R15: 0000561b94576530
[drm] Display Core initialized with v3.2.116!
[drm] DMUB hardware initialized: version=0x02000001
usb 1-3.2: new high-speed USB device number 5 using xhci_hcd
[drm] REG_WAIT timeout 1us * 100000 tries - mpc2_assert_idle_mpcc line:480
-12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by the problem above, maybe something completely unrelated.
I will take a look.
This looks like a completely unrelated memory leak to me.
Probably best if you open up a bug report for this.
Yes, the monitor still turns off after applying the patch "make the pool shrinker lock a mutex". Still, the patch fixed the flood of "BUG: sleeping function called from invalid context at mm/vmalloc.c:1756" messages, so the kernel log became cleaner. Now the monitor turning off looks like this in the logs:

DMA-API: cacheline tracking ENOMEM, dma-debug disabled
amdgpu 0000:0b:00.0: amdgpu: 000000006b791523 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
BUG: kernel NULL pointer dereference, address: 0000000000000060
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: G W --------- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]
Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0 0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b 45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65
RSP: 0018:ffffa7400532b9c0 EFLAGS: 00010286
RAX: ffff978e2ae25800 RBX: ffff97910ec12058 RCX: ffff978e12caac70
RDX: 0000000080000010 RSI: 0000000000000000 RDI: ffff97912c3d99c0
RBP: ffff97912c3d99c0 R08: 0000000000000000 R09: 0000000070b3a000
R10: 0000000000000002 R11: 0000000000000000 R12: ffff97912c3d99c0
R13: 0000000000000000 R14: ffffa7400532ba90 R15: ffff978e182c6350
FS: 00007f070bb1b640(0000) GS:ffff979509200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000060 CR3: 00000001f0cd2000 CR4: 0000000000350ee0
Call Trace:
 ttm_tt_populate+0xa9/0xe0 [ttm]
 ttm_bo_handle_move_mem+0x142/0x180 [ttm]
 ttm_bo_validate+0x12e/0x1c0 [ttm]
 amdgpu_cs_bo_validate+0x82/0x190 [amdgpu]
 amdgpu_cs_list_validate+0x105/0x150 [amdgpu]
 amdgpu_cs_ioctl+0x80a/0x1f10 [amdgpu]
 ? trace_hardirqs_off_caller+0x21/0xd0
 ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
 drm_ioctl_kernel+0x8c/0xe0 [drm]
 drm_ioctl+0x20f/0x3c0 [drm]
 ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
 ? selinux_file_ioctl+0x147/0x200
 ? lock_acquired+0x1fa/0x380
 ? lock_release+0x1e9/0x400
 ? trace_hardirqs_on+0x1b/0xe0
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 __x64_sys_ioctl+0x82/0xb0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f0725633f8b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 be 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f070bb19ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f070bb19f40 RCX: 00007f0725633f8b
RDX: 00007f070bb19f40 RSI: 00000000c0186444 RDI: 000000000000001b
RBP: 00000000c0186444 R08: 00007f070bb1a540 R09: 00007f070bb19f20
R10: 0000000000000000 R11: 0000000000000246 R12: 00002b89a7bdb088
R13: 000000000000001b R14: 0000000000000000 R15: 00000000fffffffd
Modules linked in: snd_seq_dummy snd_hrtimer uinput rfcomm nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep zstd sunrpc vfat fat uas usb_storage hid_logitech_hidpp hid_logitech_dj mt76x2u mt76x2_common mt76x02_usb mt76_usb mt76x02_lib gspca_zc3xx mt76 gspca_main snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg intel_rapl_msr soundwire_intel joydev intel_rapl_common soundwire_generic_allocation iwlmvm snd_soc_core uvcvideo edac_mce_amd videobuf2_vmalloc videobuf2_memops snd_compress snd_usb_audio kvm_amd videobuf2_v4l2 snd_pcm_dmaengine snd_usbmidi_lib soundwire_cadence videobuf2_common btusb mac80211 snd_rawmidi snd_hda_codec videodev kvm snd_hda_core ac97_bus snd_hwdep btrtl libarc4 snd_seq btbcm btintel snd_seq_device irqbypass xpad bluetooth mc snd_pcm iwlwifi rapl ff_memless eeepc_wmi asus_wmi snd_timer sparse_keymap ecdh_generic video wmi_bmof ecc pcspkr snd sp5100_tco cfg80211 k10temp soundcore i2c_piix4 rfkill acpi_cpufreq binfmt_misc ip_tables amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel cec drm ghash_clmulni_intel ccp igb nvme dca i2c_algo_bit xhci_pci nvme_core xhci_pci_renesas wmi pinctrl_amd fuse
CR2: 0000000000000060
---[ end trace b0dd767146d85401 ]---
RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]
Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0 0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b 45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65
RSP: 0018:ffffa7400532b9c0 EFLAGS: 00010286
RAX: ffff978e2ae25800 RBX: ffff97910ec12058 RCX: ffff978e12caac70
RDX: 0000000080000010 RSI: 0000000000000000 RDI: ffff97912c3d99c0
RBP: ffff97912c3d99c0 R08: 0000000000000000 R09: 0000000070b3a000
R10: 0000000000000002 R11: 0000000000000000 R12: ffff97912c3d99c0
R13: 0000000000000000 R14: ffffa7400532ba90 R15: ffff978e182c6350
FS: 00007f070bb1b640(0000) GS:ffff979509200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000060 CR3: 00000001f0cd2000 CR4: 0000000000350ee0
BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 3780, name: brave:cs0
INFO: lockdep is turned off.
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8c0d9abb>] copy_process+0x8fb/0x1de0
softirqs last enabled at (0): [<ffffffff8c0d9abb>] copy_process+0x8fb/0x1de0
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: G D W --------- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 exit_signals+0x1c/0x2d0
 do_exit+0xcd/0xc20
 ? __x64_sys_ioctl+0x82/0xb0
 rewind_stack_do_exit+0x17/0x20
RIP: 0033:0x7f0725633f8b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 be 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f070bb19ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f070bb19f40 RCX: 00007f0725633f8b
RDX: 00007f070bb19f40 RSI: 00000000c0186444 RDI: 000000000000001b
RBP: 00000000c0186444 R08: 00007f070bb1a540 R09: 00007f070bb19f20
R10: 0000000000000000 R11: 0000000000000246 R12: 00002b89a7bdb088
R13: 000000000000001b R14: 0000000000000000 R15: 00000000fffffffd
GpuWatchdog[3635]: segfault at 0 ip 000055a8db6e3429 sp 00007fc593e4d420 error 6 in gitkraken[55a8d7d97000+5cb7000]
Code: 00 79 09 48 8b 7d c0 e8 85 f6 bd fe c7 45 c0 aa aa aa aa 0f ae f0 41 8b 84 24 e0 00 00 00 89 45 c0 48 8d 7d c0 e8 e7 96 6b fc <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
You said that I need to open a bug report. Do you mean the site https://bugzilla.kernel.org? I thought the mailing lists were better, because bug reports on bugzilla.kernel.org are usually left open for several years without attention.
Full kernel logs are here: [1] https://pastebin.com/w64H4b8w
-- Best Regards, Mike Gavrilov.
Hi Mike,
On 11.01.21 at 20:23, Mikhail Gavrilov wrote:
On Mon, 11 Jan 2021 at 19:01, Christian König christian.koenig@amd.com wrote:
Changing the page table attributes while releasing memory might sleep. So we can't use a spinlock here.
Thanks for the report, a patch to fix this is on the mailing list now.
Can you also look at the first trace?
Unfortunately not, that's DC stuff. The easiest way is to file this in a bug tracker and assign it to our DC team.
It has the same error message, "sleeping function called from invalid context", and a lot of [amdgpu] code.
[SNIP]
-12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by the problem above, maybe something completely unrelated.
I will take a look.
This looks like a completely unrelated memory leak to me.
Probably best if you open up a bug report for this.
Yes, the monitor still turns off after applying the patch "make the pool shrinker lock a mutex". Still, the patch fixed the flood of "BUG: sleeping function called from invalid context at mm/vmalloc.c:1756" messages, so the kernel log became cleaner.
At least some progress. Any objections to me adding your e-mail address in a Tested-by tag?
Now the monitor turning off looks like this in the logs:
DMA-API: cacheline tracking ENOMEM, dma-debug disabled
amdgpu 0000:0b:00.0: amdgpu: 000000006b791523 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
BUG: kernel NULL pointer dereference, address: 0000000000000060
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: G W --------- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]
Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0 0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b 45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65
RSP: 0018:ffffa7400532b9c0 EFLAGS: 00010286
RAX: ffff978e2ae25800 RBX: ffff97910ec12058 RCX: ffff978e12caac70
RDX: 0000000080000010 RSI: 0000000000000000 RDI: ffff97912c3d99c0
RBP: ffff97912c3d99c0 R08: 0000000000000000 R09: 0000000070b3a000
R10: 0000000000000002 R11: 0000000000000000 R12: ffff97912c3d99c0
R13: 0000000000000000 R14: ffffa7400532ba90 R15: ffff978e182c6350
FS: 00007f070bb1b640(0000) GS:ffff979509200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000060 CR3: 00000001f0cd2000 CR4: 0000000000350ee0
Call Trace:
 ttm_tt_populate+0xa9/0xe0 [ttm]
 ttm_bo_handle_move_mem+0x142/0x180 [ttm]
 ttm_bo_validate+0x12e/0x1c0 [ttm]
I can take a look at this one here. Looks like some missing error handling when allocating memory.
Can you decode which line number ttm_tt_swapin+0x34 points to?
[SNIP]
You said that I need to open a bug report. Do you mean the site https://bugzilla.kernel.org? I thought the mailing lists were better, because bug reports on bugzilla.kernel.org are usually left open for several years without attention.
Please use this one here: https://gitlab.freedesktop.org/drm/amd/-/issues/new
If you can't find the DC guys offhand in the assignee list, just assign it to me and I will forward it.
But what you have in your logs so far are only unrelated symptoms; the root of the problem is that somebody is leaking memory.
What you could do as well is enable kmemleak and maybe try some bleeding-edge branch like drm-misc-fixes or Alex's amd-staging-drm-next branch.
Thanks for the help, Christian.
Hi Christian,
On Tue, 12 Jan 2021 at 01:45, Christian König christian.koenig@amd.com wrote:
Hi Mike,
Unfortunately not, that's DC stuff. The easiest way is to file this in a bug tracker and assign it to our DC team.
Ok
At least some progress. Any objections to me adding your e-mail address in a Tested-by tag?
No objections, feel free to add me.
I can take a look at this one here. Looks like some missing error handling when allocating memory. Can you decode which line number ttm_tt_swapin+0x34 points to?
$ /usr/src/kernels/`uname -r`/scripts/faddr2line /lib/debug/lib/modules/`uname -r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_tt_swapin+0x34
ttm_tt_swapin+0x34/0xd0:
mapping_gfp_mask at /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/./include/linux/pagemap.h:105 (discriminator 2)
(inlined by) ttm_tt_swapin at /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c:210 (discriminator 2)

$ cat -s -n /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c | head -220 | tail -20
   201      struct page *from_page;
   202      struct page *to_page;
   203      gfp_t gfp_mask;
   204      int i, ret;
   205
   206      swap_storage = ttm->swap_storage;
   207      BUG_ON(swap_storage == NULL);
   208
   209      swap_space = swap_storage->f_mapping;
   210      gfp_mask = mapping_gfp_mask(swap_space);
   211
   212      for (i = 0; i < ttm->num_pages; ++i) {
   213              from_page = shmem_read_mapping_page_gfp(swap_space, i,
   214                                                      gfp_mask);
   215              if (IS_ERR(from_page)) {
   216                      ret = PTR_ERR(from_page);
   217                      goto out_err;
   218              }
   219              to_page = ttm->pages[i];
   220              if (unlikely(to_page == NULL)) {
Please use this one here: https://gitlab.freedesktop.org/drm/amd/-/issues/new
If you can't find the DC guys offhand in the assignee list, just assign it to me and I will forward it.
https://gitlab.freedesktop.org/drm/amd/-/issues/1439
Ok, let's continue there.
-- Best Regards, Mike Gavrilov.
On Tue, 12 Jan 2021 at 01:45, Christian König christian.koenig@amd.com wrote:
But what you have in your logs so far are only unrelated symptoms, the root of the problem is that somebody is leaking memory.
What you could do as well is to try to enable kmemleak
I captured some memleaks. Do they contain any useful information?
[1] https://pastebin.com/n0FE7Hsu [2] https://pastebin.com/MUX55L1k [3] https://pastebin.com/a3FT7DVG [4] https://pastebin.com/1ALvJKz7
-- Best Regards, Mike Gavrilov.
On 14.01.21 at 01:22, Mikhail Gavrilov wrote:
On Tue, 12 Jan 2021 at 01:45, Christian König christian.koenig@amd.com wrote:
But what you have in your logs so far are only unrelated symptoms, the root of the problem is that somebody is leaking memory.
What you could do as well is to try to enable kmemleak
I captured some memleaks. Do they contain any useful information?
Unfortunately not offhand.
I also don't see any bug reports from other people and can't reproduce the last backtrace you sent out from TTM here.
Do you have any local modifications or special setup in your system? Like bpf scripts or something like that?
Christian.
amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Thu, Jan 14, 2021 at 2:56 PM Christian König christian.koenig@amd.com wrote:
On 14.01.21 at 01:22, Mikhail Gavrilov wrote:
On Tue, 12 Jan 2021 at 01:45, Christian König christian.koenig@amd.com wrote:
But what you have in your logs so far are only unrelated symptoms, the root of the problem is that somebody is leaking memory.
What you could do as well is to try to enable kmemleak
I captured some memleaks. Do they contain any useful information?
Unfortunately not offhand.
I also don't see any bug reports from other people and can't reproduce the last backtrace you sent out from TTM here.
Do you have any local modifications or special setup in your system? Like bpf scripts or something like that?
There's another bug report (for rcar-du, bisected to a switch to using more CMA helpers) about leaking mmaps, which keeps too many fbs alive, so maybe we have gained a refcount leak somewhere recently. But it could also be totally unrelated.
-Daniel
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Thu, 14 Jan 2021 at 18:56, Christian König christian.koenig@amd.com wrote:
Unfortunately not offhand.
I also don't see any bug reports from other people and can't reproduce the last backtrace you sent out from TTM here.
Because only the most desperate will install kernels with enabled debug flags and then load the system by opening a huge number of programs and tabs. So you shouldn't be surprised that I'm the only one here. This is what my desktop looks like every day: https://imgur.com/a/Kxlmrem
Do you have any local modifications or special setup in your system? Like bpf scripts or something like that?
No, I didn't write any bpf scripts, but it looks like my distribution, Fedora Rawhide, uses some bpf scripts by default out of the box:

# bpftool prog
20: cgroup_device tag 40ddf486530245f5 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 504B jited 309B memlock 4096B
21: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 64B jited 54B memlock 4096B
22: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 64B jited 54B memlock 4096B
23: cgroup_device tag ca8e50a3c7fb034b gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 496B jited 307B memlock 4096B
24: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 64B jited 54B memlock 4096B
25: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 64B jited 54B memlock 4096B
26: cgroup_device tag be31ae23198a0378 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 464B jited 288B memlock 4096B
27: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 416B jited 255B memlock 4096B
28: cgroup_device tag 438c5618576e5b0c gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 568B jited 354B memlock 4096B
29: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
30: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
31: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
32: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B
33: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
34: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
35: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 416B jited 255B memlock 4096B
38: cgroup_device tag 3a0ef5414c2f6fca gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 744B jited 447B memlock 4096B
39: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
40: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B
41: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 416B jited 255B memlock 4096B
42: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 64B jited 54B memlock 4096B
43: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 64B jited 54B memlock 4096B
I caught a couple more leaks, but nothing new: https://pastebin.com/2EgvYJdz
[1] do_detailed_mode+0x7c1/0x13d0 [drm]
[2] drm_mode_duplicate+0x45/0x220 [drm]
[3] do_seccomp+0x215/0x2280
[4] __vmalloc_node_range+0x464/0x7b0
[5] bpf_prog_alloc_no_stats+0xa2/0x2b0
[6] bpf_prog_store_orig_filter+0x7b/0x1c0
[7] kmemdup+0x1a/0x40
Did the following trace message confuse anyone?
==================================================================
BUG: KASAN: slab-out-of-bounds in kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
Read of size 1 at addr ffff88812a6b4181 by task systemd-udevd/491
CPU: 20 PID: 491 Comm: systemd-udevd Not tainted 5.11.0-0.rc3.20210114git65f0d2414b70.125.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0xae/0xe5
 print_address_description.constprop.0+0x18/0x160
 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 kasan_report.cold+0x7f/0x10e
 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 ? kfd_create_crat_image_acpi+0x340/0x340 [amdgpu]
 ? __raw_spin_lock_init+0x39/0x110
 kfd_topology_init+0x2ac/0x400 [amdgpu]
 ? kfd_create_topology_device+0x320/0x320 [amdgpu]
 ? __class_register+0x2ad/0x430
 ? __class_create+0xc5/0x130
 kgd2kfd_init+0x95/0xf0 [amdgpu]
 amdgpu_amdkfd_init+0x7f/0xb0 [amdgpu]
 ? smuio_v11_0_update_rom_clock_gating+0x1d0/0x1d0 [amdgpu]
 ? record_print_text.cold+0x11/0x11
 ? kmem_cache_create_usercopy+0x25c/0x310
 amdgpu_init+0x59/0x1000 [amdgpu]
 ? 0xffffffffc1f12000
 do_one_initcall+0xfb/0x530
 ? perf_trace_initcall_level+0x3d0/0x3d0
 ? __memset+0x29/0x30
 ? unpoison_range+0x3a/0x60
 do_init_module+0x1ce/0x7a0
 load_module+0x9841/0xa380
 ? module_frob_arch_sections+0x20/0x20
 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
 ? sched_clock_cpu+0x18/0x170
 ? irqtime_account_irq+0x44/0x1e0
 ? sched_clock+0x5/0x10
 ? lock_acquire+0x2dd/0x7a0
 ? sched_clock+0x5/0x10
 ? lock_is_held_type+0xb8/0xf0
 ? __do_sys_init_module+0x18b/0x220
 __do_sys_init_module+0x18b/0x220
 ? load_module+0xa380/0xa380
 ? ktime_get_coarse_real_ts64+0x12f/0x160
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fc22aecaeee
Code: 48 8b 0d 85 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc62d60e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 0000560489080060 RCX: 00007fc22aecaeee
RDX: 0000560489080f70 RSI: 0000000001e2e8f6 RDI: 00007fc226471010
RBP: 00007fc226471010 R08: 000056048907d470 R09: 00007ffc62d5d606
R10: 00005601e94f449d R11: 0000000000000246 R12: 0000560489080f70
R13: 000056048907c9b0 R14: 0000000000000000 R15: 00005604890814e0
Allocated by task 491:
 kasan_save_stack+0x1b/0x40
 ____kasan_kmalloc.constprop.0+0x84/0xa0
 kfd_create_crat_image_virtual+0x13b/0x1380 [amdgpu]
 kfd_topology_init+0x2ac/0x400 [amdgpu]
 kgd2kfd_init+0x95/0xf0 [amdgpu]
 amdgpu_amdkfd_init+0x7f/0xb0 [amdgpu]
 amdgpu_init+0x59/0x1000 [amdgpu]
 do_one_initcall+0xfb/0x530
 do_init_module+0x1ce/0x7a0
 load_module+0x9841/0xa380
 __do_sys_init_module+0x18b/0x220
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff88812a6b4100
 which belongs to the cache kmalloc-128 of size 128
The buggy address is located 1 bytes to the right of
 128-byte region [ffff88812a6b4100, ffff88812a6b4180)
The buggy address belongs to the page:
page:00000000edb67e0c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12a6b4
flags: 0x17ffffc0000200(slab)
raw: 0017ffffc0000200 ffffea000406a140 0000000500000005 ffff888100041640
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88812a6b4080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88812a6b4100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88812a6b4180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                   ^
 ffff88812a6b4200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88812a6b4280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
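The address arithmetic in the KASAN report can be checked by hand. The sketch below is only an illustration: the addresses are copied from the report, while the 8-byte shadow granule and the 16 shadow bytes per dump row are assumptions (generic KASAN on x86_64, consistent with the 0x80 stride between the dumped rows). The helper name `locate` is made up for this example.

```python
# Values copied from the KASAN report above.
OBJECT_START = 0xFFFF88812A6B4100   # start of the kmalloc-128 object
OBJECT_SIZE = 128                   # bytes
BUGGY_ADDR = 0xFFFF88812A6B4181     # address of the 1-byte read

# Assumptions: generic KASAN (one shadow byte per 8 bytes of memory) and
# 16 shadow bytes per dump row, matching the 0x80 stride between rows.
GRANULE = 8
SHADOW_BYTES_PER_ROW = 16


def locate(object_start, object_size, addr):
    """Return (object_end, bytes_past_end, row_start, shadow_column)."""
    object_end = object_start + object_size      # exclusive end of the object
    bytes_past_end = addr - object_end           # distance from that exclusive end
    row_bytes = GRANULE * SHADOW_BYTES_PER_ROW   # memory covered by one dump row
    row_start = addr - (addr % row_bytes)        # row that contains the address
    shadow_column = (addr - row_start) // GRANULE
    return object_end, bytes_past_end, row_start, shadow_column


if __name__ == "__main__":
    end, past, row, col = locate(OBJECT_START, OBJECT_SIZE, BUGGY_ADDR)
    print(f"object end (exclusive): {end:#x}")
    print(f"bytes past the end:     {past}")
    print(f"dump row:               {row:#x}, shadow byte index {col}")
```

Under these assumptions the read lands in the first shadow byte of the ffff88812a6b4180 row, which is where the caret points, and that row is marked fc (redzone) while the object's own row is fully accessible (00).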
Full kernel log: https://pastebin.com/bUiXRVYw
Kernel build options: https://pastebin.com/v3zsC03i
-- Best Regards, Mike Gavrilov.
On Fri, 15 Jan 2021 at 03:43, Mikhail Gavrilov mikhail.v.gavrilov@gmail.com wrote:
In rc4, the number of warnings has dropped dramatically. There are no more "kasan slab-out-of-bounds" errors and no "DMA-API device driver failed to check map error". But "sleeping function called from invalid context at include/linux/sched/mm.h:196" and "BUG: key ffff88810b0d9148 has not been registered!" are still not fixed. The second issue is Navi specific, because it started to happen on the 5.10 kernel after I replaced the Radeon VII with a 6900XT.
1. BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 500, name: systemd-udevd 1 lock held by systemd-udevd/500: #0: ffff888107690258 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0xa3/0x250 CPU: 9 PID: 500 Comm: systemd-udevd Not tainted 5.11.0-0.rc4.129.fc34.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 Call Trace: dump_stack+0xae/0xe5 ___might_sleep.cold+0x150/0x17e ? dcn30_clock_source_create+0x53/0x110 [amdgpu] kmem_cache_alloc_trace+0x23f/0x270 dcn30_clock_source_create+0x53/0x110 [amdgpu] dcn30_create_resource_pool+0x998/0x4890 [amdgpu] ? dcn30_calc_max_scaled_time+0x40/0x40 [amdgpu] ? lock_is_held_type+0xb8/0xf0 ? unpoison_range+0x3a/0x60 ? ____kasan_kmalloc.constprop.0+0x84/0xa0 ? dc_create_resource_pool+0x26e/0x5e0 [amdgpu] dc_create_resource_pool+0x26e/0x5e0 [amdgpu] dc_create+0x636/0x1bc0 [amdgpu] ? lock_acquire+0x2dd/0x7a0 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0x18/0x170 ? find_held_lock+0x33/0x110 ? dc_create_state+0xa0/0xa0 [amdgpu] ? lock_downgrade+0x6b0/0x6b0 ? module_assert_mutex_or_preempt+0x3e/0x70 ? lock_is_held_type+0xb8/0xf0 ? unpoison_range+0x3a/0x60 ? ____kasan_kmalloc.constprop.0+0x84/0xa0 amdgpu_dm_init.isra.0+0x479/0x640 [amdgpu] ? vprintk_emit+0x1c0/0x460 ? dev_vprintk_emit+0x2d8/0x31a ? sched_clock+0x5/0x10 ? dm_resume+0x13b0/0x13b0 [amdgpu] ? dev_attr_show.cold+0x35/0x35 ? lock_downgrade+0x6b0/0x6b0 ? dev_printk_emit+0x8c/0xa8 ? dev_vprintk_emit+0x31a/0x31a ? wait_for_completion_io+0x240/0x240 ? __dev_printk+0x71/0xdf ? smu_hw_init.cold+0x16b/0x18a [amdgpu] ? smu_suspend+0x240/0x240 [amdgpu] ? navi10_ih_irq_init+0xea3/0x2420 [amdgpu] dm_hw_init+0xe/0x20 [amdgpu] amdgpu_device_init.cold+0x3031/0x4940 [amdgpu] ? amdgpu_device_cache_pci_state+0xf0/0xf0 [amdgpu] ? pci_bus_read_config_byte+0x140/0x140 ? do_pci_enable_device+0x1f8/0x260 ? pci_find_saved_ext_cap+0x110/0x110 ? 
pci_enable_bridge+0xf9/0x1e0 ? pci_dev_check_d3cold+0x107/0x250 ? pci_enable_device_flags+0x201/0x340 amdgpu_driver_load_kms+0x167/0x8a0 [amdgpu] amdgpu_pci_probe+0x235/0x360 [amdgpu] ? amdgpu_pci_remove+0xd0/0xd0 [amdgpu] local_pci_probe+0xd8/0x170 pci_device_probe+0x318/0x5c0 ? kernfs_create_link+0x16c/0x230 ? pci_device_remove+0x1d0/0x1d0 really_probe+0x224/0xc40 driver_probe_device+0x1f2/0x380 device_driver_attach+0x1df/0x250 __driver_attach+0xf6/0x260 ? device_driver_attach+0x250/0x250 bus_for_each_dev+0x114/0x180 ? subsys_dev_iter_exit+0x10/0x10 bus_add_driver+0x352/0x570 driver_register+0x20f/0x390 ? __pci_register_driver+0x13a/0x210 ? 0xffffffffc1d8d000 do_one_initcall+0xfb/0x530 ? perf_trace_initcall_level+0x3d0/0x3d0 ? __memset+0x2b/0x30 ? unpoison_range+0x3a/0x60 do_init_module+0x1ce/0x7a0 load_module+0x9841/0xa380 ? module_frob_arch_sections+0x20/0x20 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0 ? sched_clock_cpu+0x18/0x170 ? sched_clock+0x5/0x10 ? lock_acquire+0x2dd/0x7a0 ? sched_clock+0x5/0x10 ? lock_is_held_type+0xb8/0xf0 ? __do_sys_init_module+0x18b/0x220 __do_sys_init_module+0x18b/0x220 ? load_module+0xa380/0xa380 ? ktime_get_coarse_real_ts64+0x12f/0x160 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f2c109da07e Code: 48 8b 0d f5 1d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c2 1d 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc84d33f88 EFLAGS: 00000246 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 000055b87f8260a0 RCX: 00007f2c109da07e RDX: 000055b87f834060 RSI: 0000000001e2cbf6 RDI: 00007f2c0b7e0010 RBP: 00007f2c0b7e0010 R08: 000055b87f8281e0 R09: 00007ffc84d30a26 R10: 000055bd2404cc18 R11: 0000000000000246 R12: 000055b87f834060 R13: 000055b87f831ca0 R14: 0000000000000000 R15: 000055b87f832640 [drm] Display Core initialized with v3.2.116! 
[drm] DMUB hardware initialized: version=0x02000001 usb 1-3.2: Device not responding to setup address. usb 1-3.2: device not accepting address 5, error -71 [drm] REG_WAIT timeout 1us * 100000 tries - mpc2_assert_idle_mpcc line:480
2. BUG: key ffff88810b0d9148 has not been registered! ------------[ cut here ]------------ DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 25 PID: 500 at kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x592/0x770 Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 gpu_sched drm_kms_helper cec crct10dif_pclmul crc32_pclmul crc32c_intel drm ghash_clmulni_intel ccp igb nvme dca nvme_core i2c_algo_bit xhci_pci xhci_pci_renesas wmi pinctrl_amd fuse CPU: 25 PID: 500 Comm: systemd-udevd Tainted: G W --------- --- 5.11.0-0.rc4.129.fc34.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 RIP: 0010:lockdep_init_map_waits+0x592/0x770 Code: 08 84 d2 0f 85 d8 01 00 00 8b 3d e1 02 38 04 85 ff 0f 85 7e fc ff ff 48 c7 c6 e0 04 ca 8e 48 c7 c7 40 fd c9 8e e8 01 8e 23 02 <0f> 0b e9 64 fc ff ff 48 89 df 44 89 4c 24 0c 44 89 44 24 08 48 89 RSP: 0018:ffffc900029bef88 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000 RDX: 0000000000000027 RSI: 0000000000000004 RDI: fffff52000537de7 RBP: 0000000000000000 R08: 0000000000000001 R09: ffff8886f9fe72ab R10: ffffed10df3fce55 R11: 0000000000000001 R12: ffff88810b0d9148 R13: 0000000000000000 R14: ffffffff8edbda60 R15: ffff88810b0db690 FS: 00007f2c0fdda140(0000) GS:ffff8886f9e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055b8800aec68 CR3: 0000000127fd0000 CR4: 0000000000350ee0 Call Trace: ? lockdep_hardirqs_on+0x75/0xf0 __kernfs_create_file+0x102/0x2f0 sysfs_add_file_mode_ns+0x1af/0x500 sysfs_create_bin_file+0x100/0x160 ? lock_is_held_type+0xb8/0xf0 ? sysfs_add_file_to_group+0x150/0x150 ? static_obj+0x8a/0xc0 ? lockdep_init_map_waits+0x2a2/0x770 hdcp_create_workqueue+0x879/0xb50 [amdgpu] amdgpu_dm_init.isra.0.cold+0x7f2/0x374c [amdgpu] ? vprintk_emit+0x140/0x460 ? dev_vprintk_emit+0x2d8/0x31a ? sched_clock+0x5/0x10 ? dm_resume+0x13b0/0x13b0 [amdgpu] ? dev_attr_show.cold+0x35/0x35 ? 
psp_set_srm+0x250/0x250 [amdgpu] ? hdcp_update_display+0x5b0/0x5b0 [amdgpu] ? lock_downgrade+0x6b0/0x6b0 ? dev_printk_emit+0x8c/0xa8 ? dev_vprintk_emit+0x31a/0x31a ? wait_for_completion_io+0x240/0x240 ? __dev_printk+0x71/0xdf ? smu_hw_init.cold+0x16b/0x18a [amdgpu] ? smu_suspend+0x240/0x240 [amdgpu] ? navi10_ih_irq_init+0xea3/0x2420 [amdgpu] dm_hw_init+0xe/0x20 [amdgpu] amdgpu_device_init.cold+0x3031/0x4940 [amdgpu] ? amdgpu_device_cache_pci_state+0xf0/0xf0 [amdgpu] ? pci_bus_read_config_byte+0x140/0x140 ? do_pci_enable_device+0x1f8/0x260 ? pci_find_saved_ext_cap+0x110/0x110 ? pci_enable_bridge+0xf9/0x1e0 ? pci_dev_check_d3cold+0x107/0x250 ? pci_enable_device_flags+0x201/0x340 amdgpu_driver_load_kms+0x167/0x8a0 [amdgpu] amdgpu_pci_probe+0x235/0x360 [amdgpu] ? amdgpu_pci_remove+0xd0/0xd0 [amdgpu] local_pci_probe+0xd8/0x170 pci_device_probe+0x318/0x5c0 ? kernfs_create_link+0x16c/0x230 ? pci_device_remove+0x1d0/0x1d0 really_probe+0x224/0xc40 driver_probe_device+0x1f2/0x380 device_driver_attach+0x1df/0x250 __driver_attach+0xf6/0x260 ? device_driver_attach+0x250/0x250 bus_for_each_dev+0x114/0x180 ? subsys_dev_iter_exit+0x10/0x10 bus_add_driver+0x352/0x570 driver_register+0x20f/0x390 ? __pci_register_driver+0x13a/0x210 ? 0xffffffffc1d8d000 do_one_initcall+0xfb/0x530 ? perf_trace_initcall_level+0x3d0/0x3d0 ? __memset+0x2b/0x30 ? unpoison_range+0x3a/0x60 do_init_module+0x1ce/0x7a0 load_module+0x9841/0xa380 ? module_frob_arch_sections+0x20/0x20 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0 ? sched_clock_cpu+0x18/0x170 ? sched_clock+0x5/0x10 ? lock_acquire+0x2dd/0x7a0 ? sched_clock+0x5/0x10 ? lock_is_held_type+0xb8/0xf0 ? __do_sys_init_module+0x18b/0x220 __do_sys_init_module+0x18b/0x220 ? load_module+0xa380/0xa380 ? 
ktime_get_coarse_real_ts64+0x12f/0x160 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f2c109da07e Code: 48 8b 0d f5 1d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c2 1d 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc84d33f88 EFLAGS: 00000246 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 000055b87f8260a0 RCX: 00007f2c109da07e RDX: 000055b87f834060 RSI: 0000000001e2cbf6 RDI: 00007f2c0b7e0010 RBP: 00007f2c0b7e0010 R08: 000055b87f8281e0 R09: 00007ffc84d30a26 R10: 000055bd2404cc18 R11: 0000000000000246 R12: 000055b87f834060 R13: 000055b87f831ca0 R14: 0000000000000000 R15: 000055b87f832640 irq event stamp: 593331 hardirqs last enabled at (593331): [<ffffffff8c3602f0>] console_unlock+0x7c0/0x9a0 hardirqs last disabled at (593330): [<ffffffff8c3601e8>] console_unlock+0x6b8/0x9a0 softirqs last enabled at (593162): [<ffffffff8e801112>] asm_call_irq_on_stack+0x12/0x20 softirqs last disabled at (593157): [<ffffffff8e801112>] asm_call_irq_on_stack+0x12/0x20 ---[ end trace 37dc3a4a3aa1704a ]---
The issue with the monitor switching off still happens too, but the messages in the log have become more detailed:
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -4!
amdgpu 0000:0b:00.0: amdgpu: 0000000087613007 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -4!
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -4!
I hope "[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -4!" gives an idea of what happened.
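For readers decoding the numbers in these messages: the kernel reports errors as negated errno values, so -12 and -4 can be looked up in the errno table. The snippet below is a small sketch that uses Python's standard `errno` module; it assumes the host's errno numbering matches the kernel that produced the log (true for these two generic codes on Linux), and the helper name `decode_kernel_error` is made up for this example.

```python
import errno
import os


def decode_kernel_error(ret):
    """Map a negative kernel return value to (symbolic name, description)."""
    code = -ret                                    # kernel returns -errno
    name = errno.errorcode.get(code, "UNKNOWN")    # e.g. 12 -> "ENOMEM"
    return name, os.strerror(code)


if __name__ == "__main__":
    # The two values seen in the log lines above.
    for ret in (-12, -4):
        name, text = decode_kernel_error(ret)
        print(f"{ret}: {name} ({text})")
```

With that mapping, the framebuffer pin failure is an out-of-memory error (ENOMEM), and the buffer list failure in amdgpu_cs_ioctl is an interrupted call (EINTR). This only names the codes; it does not by itself explain why the pin runs out of memory.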
Full kernel log is here: https://pastebin.com/nX69zgvf
I still have no idea what's going on here.
The KASAN messages from the DC code are completely unrelated.
Please add the full dmesg to your bug report.
Christian.
On 20.01.21 at 01:59, Mikhail Gavrilov wrote:
On Thu, 21 Jan 2021 at 18:27, Christian König christian.koenig@amd.com wrote:
I still have no idea what's going on here.
The KASAN messages from the DC code are completely unrelated.
Please add the full dmesg to your bug report.
I did it. https://gitlab.freedesktop.org/drm/amd/-/issues/1439#note_776267
dri-devel@lists.freedesktop.org