https://bugs.freedesktop.org/show_bug.cgi?id=104825
Bug ID: 104825 Summary: [amdgpu] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) when unbinding Product: DRI Version: XOrg git Hardware: x86-64 (AMD64) OS: Linux (All) Status: NEW Severity: normal Priority: medium Component: DRM/AMDgpu Assignee: dri-devel@lists.freedesktop.org Reporter: mlen@mlen.pl
I use two amdgpu rx480 cards. During boot one of them is rebound to vfio-pci driver using the following script:
pci_ids=("0000:03:00.0" "0000:03:00.1")
for id in "${pci_ids[@]}"; do vendor="$(cat "/sys/bus/pci/devices/$id/vendor")" device="$(cat "/sys/bus/pci/devices/$id/device")"
if [ -e "/sys/bus/pci/devices/$id/driver/unbind" ]; then echo "$id" >"/sys/bus/pci/devices/$id/driver/unbind" fi
echo "$vendor $device" >/sys/bus/pci/drivers/vfio-pci/new_id done
Starting from Linux 4.15 with amdgpu DC enabled (I wanted to use it for HDMI audio), unbind operation causes general protection failure:
[ 68.011473] [drm] amdgpu: finishing device. [ 68.377945] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 68.575193] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 68.770107] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 68.971775] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 69.164265] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 69.350089] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 69.538302] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 69.729260] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 69.729733] general protection fault: 0000 [#1] PREEMPT SMP PTI [ 69.730901] Modules linked in: [ 69.731936] CPU: 2 PID: 3934 Comm: openrc-run.sh Not tainted 4.15.0-gentoo #2 [ 69.733009] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017 [ 69.734240] RIP: 0010:dm_read_reg_func.isra.0+0x3/0xc [ 69.735314] RSP: 0018:ffffa80d8bd4fc40 EFLAGS: 00010282 [ 69.736353] RAX: ccea607dac10c354 RBX: ffff95af35bfca80 RCX: 0000000180200008 [ 69.737408] RDX: 0000000180200009 RSI: 0000000000005c02 RDI: ffff95af362b91c0 [ 69.738494] RBP: ffff95af35b75c90 R08: 0000000000000001 R09: ffffffffb30fced2 [ 69.739561] R10: ffff95af2f770a40 R11: 00000000ffffff80 R12: ffff95af35f30100 [ 69.740638] R13: 0000000000000000 R14: ffff95af358a3c20 R15: ffff95bf2d5f9c20 [ 69.742007] FS: 00007fe7530af740(0000) GS:ffff95af3da00000(0000) knlGS:0000000000000000 [ 69.743083] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 69.744134] CR2: 000000c420870008 CR3: 0000001027426003 CR4: 00000000003606e0 [ 69.745190] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 69.746246] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 69.747320] Call Trace: [ 69.748359] destroy+0x21/0x9c [ 69.749395] dal_i2caux_destruct+0x6f/0xab [ 69.750459] destroy+0x15/0x27 [ 69.751495] dal_i2caux_destroy+0x26/0x2f [ 69.752537] destruct+0x86/0xfd [ 69.753571] dc_destroy+0x11/0x22 [ 69.754626] dm_hw_fini+0x1e/0x22 [ 69.755632] amdgpu_fini+0xf3/0x2d6 [ 69.756616] amdgpu_device_fini+0x5c/0x158 [ 69.757597] amdgpu_driver_unload_kms+0x6b/0x7e [ 69.758601] drm_dev_unregister+0x4c/0xc6 [ 69.759584] amdgpu_pci_remove+0x19/0x37 [ 69.760576] pci_device_remove+0x3b/0x8b [ 69.761563] device_release_driver_internal+0x125/0x1f9 [ 69.762580] unbind_store+0x60/0x90 [ 69.763568] kernfs_fop_write+0x111/0x159 [ 69.764557] __vfs_write+0x33/0xd7 [ 69.765543] ? preempt_count_sub+0x8b/0x94 [ 69.766552] ? __sb_start_write+0xc0/0x180 [ 69.767525] vfs_write+0xa5/0xe2 [ 69.768490] SyS_write+0x5f/0xa3 [ 69.769439] do_syscall_64+0x72/0x81 [ 69.770419] entry_SYSCALL64_slow_path+0x25/0x25 [ 69.771373] RIP: 0033:0x7fe7529a1408 [ 69.772324] RSP: 002b:00007ffe1885c200 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 69.773300] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fe7529a1408 [ 69.774290] RDX: 000000000000000d RSI: 0000558994641890 RDI: 0000000000000001 [ 69.775290] RBP: 0000558994641890 R08: 000000000000000a R09: 00005589946475f0 [ 69.776260] R10: 000000000000009b R11: 0000000000000246 R12: 000000000000000d [ 69.777248] R13: 0000000000000001 R14: 00007fe752c6e740 R15: 000000000000000d [ 69.778244] Code: f7 74 24 04 89 43 5c 48 8b 4c 24 08 65 48 33 0c 25 28 00 00 00 4c 89 e0 74 05 e8 39 58 9e ff 48 83 c4 10 5b 5d 41 5c c3 48 8b 07 <48> 8b 40 30 e9 03 4a 70 00 0f 1f 44 00 00 48 8b 47 30 8b 70 04 [ 69.779406] RIP: dm_read_reg_func.isra.0+0x3/0xc RSP: ffffa80d8bd4fc40 [ 69.780491] ---[ end trace 6aa4681ba3a43ec3 ]--- [ 71.815503] [drm:amdgpu_fill_buffer] *ERROR* Trying to clear memory with ring turned off. [ 71.899258] amdgpu 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none [ 71.899263] amdgpu 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none [ 72.217918] [drm:amdgpu_fill_buffer] *ERROR* Trying to clear memory with ring turned off.
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #1 from Harry Wentland harry.wentland@amd.com --- This patch https://patchwork.freedesktop.org/patch/198719/ should fix it, but there could be some other issues as well.
amd-staging-drm-next has fixes for a whole bunch of driver unload issues, including what you're seeing. It's hosted at https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
Can you try the patch and/or amd-staging-drm-next?
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #2 from mlen mlen@mlen.pl --- I'll try running amd-staging-drm-next later today
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #3 from mlen mlen@mlen.pl --- I tested amd-staging-drm-next with HEAD at f1367d12f5fabb04789c7772594887434c8d9e8b. This time the unbind succeeded, but there are still some errors logged and kernel reports locking problem in amdgpu:
[ 77.098923] [drm] amdgpu: finishing device. [ 77.458614] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 77.481247] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready [ 77.653815] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 77.845085] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 77.855055] IPv6: ADDRCONF(NETDEV_CHANGE): virbr10: link becomes ready [ 78.036695] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.233244] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.425058] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.616635] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.808323] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.810659] amdgpu 0000:03:00.0: 00000000a667dd57 unpin not necessary [ 78.810672] amdgpu 0000:03:00.0: 00000000a7594a2b unpin not necessary
[ 78.811733] ===================================== [ 78.813109] WARNING: bad unlock balance detected! [ 78.813947] 4.15.0-rc4+ #2 Not tainted [ 78.814835] ------------------------------------- [ 78.815731] openrc-run.sh/3931 is trying to release lock (&(&mgr->lock)->rlock) at: [ 78.816646] [<000000006fd39549>] amdgpu_gtt_mgr_fini+0x22/0x37 [ 78.817531] but there are no more locks to release! [ 78.818446] other info that might help us debug this: [ 78.820208] 5 locks held by openrc-run.sh/3931: [ 78.821127] #0: (sb_writers#6){....}, at: [<00000000322e5044>] vfs_write+0x87/0xe2 [ 78.822051] #1: (&of->mutex){....}, at: [<00000000660270c4>] kernfs_fop_write+0xca/0x156 [ 78.823007] #2: (kn->count#211){....}, at: [<000000000634dafb>] kernfs_fop_write+0xd2/0x156 [ 78.823936] #3: (&dev->mutex){....}, at: [<00000000c386f49f>] unbind_store+0x58/0x90 [ 78.824912] #4: (&dev->mutex){....}, at: [<00000000eefcc37f>] device_release_driver_internal+0x2f/0x1f3 [ 78.825861] stack backtrace: [ 78.827764] CPU: 7 PID: 3931 Comm: openrc-run.sh Not tainted 4.15.0-rc4+ #2 [ 78.828747] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017 [ 78.829718] Call Trace: [ 78.830717] dump_stack+0x67/0x8e [ 78.831689] ? amdgpu_gtt_mgr_fini+0x22/0x37 [ 78.832687] print_unlock_imbalance_bug+0xcc/0xd3 [ 78.833657] lock_release+0x134/0x267 [ 78.834646] ? _raw_spin_unlock+0x2e/0x40 [ 78.835605] _raw_spin_unlock+0x1c/0x40 [ 78.836586] amdgpu_gtt_mgr_fini+0x22/0x37 [ 78.837549] ttm_bo_clean_mm+0x79/0xab [ 78.838544] amdgpu_ttm_fini+0x75/0x11c [ 78.839507] amdgpu_bo_fini+0xe/0x2d [ 78.840495] gmc_v8_0_sw_fini+0x2e/0x49 [ 78.841454] amdgpu_device_ip_fini+0x21f/0x2d3 [ 78.842439] amdgpu_device_fini+0x4c/0x125 [ 78.843394] amdgpu_driver_unload_kms+0x63/0x76 [ 78.844373] drm_dev_unregister+0x49/0xc3 [ 78.845318] amdgpu_pci_remove+0x19/0x37 [ 78.846244] pci_device_remove+0x36/0x86 [ 78.847190] device_release_driver_internal+0x122/0x1f3 [ 78.848120] unbind_store+0x60/0x90 [ 78.849069] kernfs_fop_write+0x10e/0x156 [ 78.849997] __vfs_write+0x31/0xcc [ 78.850937] ? preempt_count_sub+0x8b/0x94 [ 78.851871] ? __sb_start_write+0xc0/0x180 [ 78.852828] vfs_write+0xa5/0xe2 [ 78.853755] SyS_write+0x5f/0xa3 [ 78.854708] do_syscall_64+0x6c/0x7b [ 78.855630] entry_SYSCALL64_slow_path+0x25/0x25 [ 78.856583] RIP: 0033:0x7fc5804b6408 [ 78.857511] RSP: 002b:00007ffd95228060 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 78.858484] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fc5804b6408 [ 78.859438] RDX: 000000000000000d RSI: 000055b229e9e890 RDI: 0000000000000001 [ 78.860419] RBP: 000055b229e9e890 R08: 000000000000000a R09: 000055b229ea45f0 [ 78.861379] R10: 000000000000009b R11: 0000000000000246 R12: 000000000000000d [ 78.862364] R13: 0000000000000001 R14: 00007fc580783740 R15: 000000000000000d [ 78.863411] [drm] amdgpu: ttm finalized
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #4 from Harry Wentland harry.wentland@amd.com --- Good to hear DC issues are gone. Not sure about the unlock balance myself.
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #5 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- Hi, can you please provide full dmesg for the unbind sequence ? So the issue happens when you unbind the card to from the host driver and before binding it to VFIO ?
Andrey
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #6 from mlen mlen@mlen.pl --- Currently I don't have access to that machine, I should be able to test next week. I'll extend the script to be more verbose and writing some delimiters to /dev/kmsg
https://bugs.freedesktop.org/show_bug.cgi?id=104825
Andrey Grodzovsky andrey.grodzovsky@amd.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |andrey.grodzovsky@amd.com
--- Comment #7 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- On Carrizo unbinding the driver will indeed show [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disabled failed (scratch(0xC040)=0xCAFEDEAD), I don't have LOCKDEP related kflags enabled in kernel for some reason so that probably why I don't see the locking imbalance warning, will rebuild and check.
What card model are you using ?
Andrey
https://bugs.freedesktop.org/show_bug.cgi?id=104825
Andrey Grodzovsky andrey.grodzovsky@amd.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Assignee|dri-devel@lists.freedesktop |andrey.grodzovsky@amd.com |.org |
https://bugs.freedesktop.org/show_bug.cgi?id=104825
Andrey Grodzovsky andrey.grodzovsky@amd.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Assignee|andrey.grodzovsky@amd.com |dri-devel@lists.freedesktop | |.org
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #8 from mlen mlen@mlen.pl --- I'm using two RX 480 cards, but I'm rebinding only one of them
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #9 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- Created attachment 137652 --> https://bugs.freedesktop.org/attachment.cgi?id=137652&action=edit DAL warnings during driver unbind
Harry, with DAL enabled I observe warnings and unbind can't complete - to reproduce under root do
cd /sys/bus/pci/drivers/amdgpu/ echo 'pci_id' > unbind where pci_id is seen when doing ls -l on this folder and seeing soft link like this 0000:2a:00.0 -> ../../../../devices/pci0000:00/0000:00:03.1/0000:2a:00.0/
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #10 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- P.S I am using CZ or Ellsmire with tip of amd-staging-drm-next. Going to disable DAL for now to debug the KCQ ring failure.
P.S After enabling lockdep i still don't see any locking related warnings.
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #11 from Harry Wentland harry.wentland@amd.com --- Created attachment 137662 --> https://bugs.freedesktop.org/attachment.cgi?id=137662&action=edit [PATCH] drm/amd/display: Use atomic crtc_disable for DC on shutdown
Andrey, can you see if this fixes the warning for you?
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #12 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to Harry Wentland from comment #11)
Created attachment 137662 [details] [review] [PATCH] drm/amd/display: Use atomic crtc_disable for DC on shutdown
Andrey, can you see if this fixes the warning for you?
Get use after free now
[ 82.400097 < 0.000387>] BUG: KASAN: use-after-free in amdgpu_dm_set_pflip_irq_state+0x3d/0xa0 [amdgpu] [ 82.400185 < 0.000088>] Read of size 4 at addr ffff88008f53ee94 by task bash/1178
[ 82.400302 < 0.000117>] CPU: 0 PID: 1178 Comm: bash Tainted: G W OE 4.16.0-rc1.main+ #14 [ 82.400308 < 0.000006>] Hardware name: AMD Gardenia/Gardenia, BIOS RGA1101C 07/20/2015 [ 82.400312 < 0.000004>] Call Trace: [ 82.400329 < 0.000017>] dump_stack+0x5c/0x78 [ 82.400342 < 0.000013>] print_address_description+0xd1/0x270 [ 82.400618 < 0.000276>] ? amdgpu_dm_set_pflip_irq_state+0x3d/0xa0 [amdgpu] [ 82.400627 < 0.000009>] kasan_report+0x260/0x360 [ 82.400913 < 0.000286>] amdgpu_dm_set_pflip_irq_state+0x3d/0xa0 [amdgpu] [ 82.401189 < 0.000276>] amdgpu_irq_disable_all+0x111/0x190 [amdgpu] [ 82.401452 < 0.000263>] amdgpu_device_ip_fini+0x1b7/0x610 [amdgpu] [ 82.401718 < 0.000266>] amdgpu_device_fini+0xa1/0x320 [amdgpu] [ 82.401973 < 0.000255>] amdgpu_driver_unload_kms+0x6a/0xd0 [amdgpu] [ 82.402059 < 0.000086>] drm_dev_unregister+0x79/0x180 [drm] [ 82.402315 < 0.000256>] amdgpu_pci_remove+0x2a/0x60 [amdgpu] [ 82.402331 < 0.000016>] pci_device_remove+0x5b/0x100 [ 82.402347 < 0.000016>] device_release_driver_internal+0x1da/0x300 [ 82.402363 < 0.000016>] unbind_store+0x143/0x190 [ 82.402376 < 0.000013>] ? sysfs_file_ops+0xa0/0xa0 [ 82.402386 < 0.000010>] kernfs_fop_write+0x186/0x220 [ 82.402402 < 0.000016>] __vfs_write+0xb9/0x2e0 [ 82.402412 < 0.000010>] ? locks_remove_posix+0x87/0x220 [ 82.402421 < 0.000009>] ? kernel_read+0xa0/0xa0 [ 82.402430 < 0.000009>] ? find_held_lock+0xfb/0x130 [ 82.402441 < 0.000011>] ? __lock_acquire.isra.30+0x414/0xb00 [ 82.402465 < 0.000024>] ? vfs_write+0x227/0x250 [ 82.402485 < 0.000020>] ? __sb_start_write+0xc3/0x1a0 [ 82.402492 < 0.000007>] ? vfs_write+0x227/0x250 [ 82.402506 < 0.000014>] vfs_write+0xe6/0x250 [ 82.402522 < 0.000016>] SyS_write+0xa1/0x120 [ 82.402532 < 0.000010>] ? SyS_read+0x120/0x120 [ 82.402547 < 0.000015>] ? vtime_user_exit+0xc8/0xe0 [ 82.402558 < 0.000011>] ? SyS_read+0x120/0x120 [ 82.402570 < 0.000012>] do_syscall_64+0xf0/0x270 [ 82.402588 < 0.000018>] entry_SYSCALL_64_after_hwframe+0x21/0x86 [ 82.402597 < 0.000009>] RIP: 0033:0x7fd050bc32c0 [ 82.402603 < 0.000006>] RSP: 002b:00007ffc06b7f3b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 82.402615 < 0.000012>] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fd050bc32c0 [ 82.402620 < 0.000005>] RDX: 000000000000000d RSI: 0000000001a6e408 RDI: 0000000000000001 [ 82.402626 < 0.000006>] RBP: 0000000001a6e408 R08: 00007fd050e92780 R09: 00007fd0514d9700 [ 82.402632 < 0.000006>] R10: 000000000000000c R11: 0000000000000246 R12: 000000000000000d [ 82.402637 < 0.000005>] R13: 0000000000000001 R14: 00007fd050e91620 R15: 0000000000000000
[ 82.402711 < 0.000074>] Allocated by task 1084: [ 82.402771 < 0.000060>] kasan_kmalloc+0xa6/0xd0 [ 82.402780 < 0.000009>] kmem_cache_alloc_trace+0x13a/0x270 [ 82.403079 < 0.000299>] dm_hw_init+0x898/0x1660 [amdgpu] [ 82.403338 < 0.000259>] amdgpu_device_init+0x1a97/0x2100 [amdgpu] [ 82.403596 < 0.000258>] amdgpu_driver_load_kms+0xa8/0x3a0 [amdgpu] [ 82.403673 < 0.000077>] drm_dev_register+0x1d5/0x2f0 [drm] [ 82.403931 < 0.000258>] amdgpu_pci_probe+0x1bf/0x290 [amdgpu] [ 82.403941 < 0.000010>] local_pci_probe+0x74/0xe0 [ 82.403951 < 0.000010>] pci_device_probe+0x1dc/0x2d0 [ 82.403970 < 0.000019>] driver_probe_device+0x40e/0x6b0 [ 82.403977 < 0.000007>] __driver_attach+0x11d/0x130 [ 82.403984 < 0.000007>] bus_for_each_dev+0xd8/0x140 [ 82.403990 < 0.000006>] bus_add_driver+0x31d/0x3a0 [ 82.403998 < 0.000008>] driver_register+0xc6/0x170 [ 82.404006 < 0.000008>] do_one_initcall+0x82/0x1d0 [ 82.404012 < 0.000006>] do_init_module+0xe7/0x333 [ 82.404020 < 0.000008>] load_module+0x41b3/0x4c40 [ 82.404028 < 0.000008>] SYSC_finit_module+0x14d/0x180 [ 82.404036 < 0.000008>] do_syscall_64+0xf0/0x270 [ 82.404044 < 0.000008>] entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 82.404095 < 0.000051>] Freed by task 1178: [ 82.404152 < 0.000057>] __kasan_slab_free+0x124/0x170 [ 82.404159 < 0.000007>] kfree+0xd4/0x200 [ 82.404239 < 0.000080>] drm_mode_config_cleanup+0x241/0x450 [drm] [ 82.404536 < 0.000297>] amdgpu_dm_fini+0x29/0xb0 [amdgpu] [ 82.404834 < 0.000298>] dm_hw_fini+0x1e/0x30 [amdgpu] [ 82.405091 < 0.000257>] amdgpu_device_ip_fini+0x157/0x610 [amdgpu] [ 82.405349 < 0.000258>] amdgpu_device_fini+0xa1/0x320 [amdgpu] [ 82.405607 < 0.000258>] amdgpu_driver_unload_kms+0x6a/0xd0 [amdgpu] [ 82.405684 < 0.000077>] drm_dev_unregister+0x79/0x180 [drm] [ 82.405941 < 0.000257>] amdgpu_pci_remove+0x2a/0x60 [amdgpu] [ 82.405949 < 0.000008>] pci_device_remove+0x5b/0x100 [ 82.405957 < 0.000008>] device_release_driver_internal+0x1da/0x300 [ 82.405963 < 0.000006>] unbind_store+0x143/0x190 [ 82.405971 < 0.000008>] kernfs_fop_write+0x186/0x220 [ 82.405978 < 0.000007>] __vfs_write+0xb9/0x2e0 [ 82.405985 < 0.000007>] vfs_write+0xe6/0x250 [ 82.405991 < 0.000006>] SyS_write+0xa1/0x120 [ 82.405998 < 0.000007>] do_syscall_64+0xf0/0x270 [ 82.406007 < 0.000009>] entry_SYSCALL_64_after_hwframe+0x21/0x86
[ 82.406057 < 0.000050>] The buggy address belongs to the object at ffff88008f53e600 which belongs to the cache kmalloc-4096 of size 4096 [ 82.406163 < 0.000106>] The buggy address is located 2196 bytes inside of 4096-byte region [ffff88008f53e600, ffff88008f53f600) [ 82.406262 < 0.000099>] The buggy address belongs to the page: [ 82.406326 < 0.000064>] page:ffffea00023d4e00 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 [ 82.406424 < 0.000098>] flags: 0x1ffff0000008100(slab|head) [ 82.406488 < 0.000064>] raw: 01ffff0000008100 0000000000000000 0000000000000000 0000000100070007 [ 82.406571 < 0.000083>] raw: dead000000000100 dead000000000200 ffff880102802600 0000000000000000 [ 82.406649 < 0.000078>] page dumped because: kasan: bad access detected
[ 82.406754 < 0.000105>] Memory state around the buggy address: [ 82.406816 < 0.000062>] ffff88008f53ed80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 82.406893 < 0.000077>] ffff88008f53ee00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 82.406968 < 0.000075>] >ffff88008f53ee80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 82.407036 < 0.000068>] ^ [ 82.407087 < 0.000051>] ffff88008f53ef00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 82.407157 < 0.000070>] ffff88008f53ef80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 82.407226 < 0.000069>] ==================================================================
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #13 from Harry Wentland harry.wentland@amd.com --- This should be fixed in drm-next-4.17-wip and amd-staging-drm-next. Can someone test if this resolves this ticket satisfactorily?
Both branches can be found on https://cgit.freedesktop.org/~agd5f/linux
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #14 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to Harry Wentland from comment #13)
This should be fixed in drm-next-4.17-wip and amd-staging-drm-next. Can someone test if this resolves this ticket satisfactorily?
Both branches can be found on https://cgit.freedesktop.org/~agd5f/linux
Yes, DAL issues are gone, also gone KIQ ring error on unbind. Still there is a KIQ ring error on rebind which I am investigating now + there is SDMA ring IB test failure which I will get to after KIQ is resloved.
Andrey
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #15 from Andrey Grodzovsky andrey.grodzovsky@amd.com --- (In reply to mlen from comment #3)
I tested amd-staging-drm-next with HEAD at f1367d12f5fabb04789c7772594887434c8d9e8b. This time the unbind succeeded, but there are still some errors logged and kernel reports locking problem in amdgpu:
[ 77.098923] [drm] amdgpu: finishing device. [ 77.458614] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 77.481247] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready [ 77.653815] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 77.845085] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 77.855055] IPv6: ADDRCONF(NETDEV_CHANGE): virbr10: link becomes ready [ 78.036695] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.233244] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.425058] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.616635] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000) [ 78.808323] [drm:gfx_v8_0_hw_fini] *ERROR* KCQ disabled failed (scratch(0xC040)=0x00000000)
Can you retest with latest https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.17-wip to see if KCQ related errors are gone ?
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #16 from mlen mlen@mlen.pl --- I'll test it on saturday, I don't have access to that machine at the moment
https://bugs.freedesktop.org/show_bug.cgi?id=104825
--- Comment #17 from mlen mlen@mlen.pl --- No issues on drm-next-4.17-wip, unbinding works without any lockdep issues.
Do you plan to backport these changes to 4.16 or maybe even 4.15?
For the record, suspend fails on drm-next-4.17-wip. This could be possibly related, but I don't know how to debug this and most likely this is out of scope of this bug.
[ 96.222095] PM: suspend entry (deep) [ 96.222099] PM: Syncing filesystems ... done. [ 96.230020] INFO: trying to register non-static key. [ 96.230024] the code is fine but needs lockdep annotation. [ 96.230026] turning off the locking correctness validator. [ 96.230029] CPU: 39 PID: 4506 Comm: pm-suspend Not tainted 4.16.0-rc7+ #2 [ 96.230031] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017 [ 96.230033] Call Trace: [ 96.230040] dump_stack+0x46/0x59 [ 96.230046] register_lock_class+0x192/0x361 [ 96.230050] ? cycles_2_ns+0x55/0x75 [ 96.230054] __lock_acquire.isra.30+0x97/0x595 [ 96.230057] lock_acquire+0x105/0x12e [ 96.230060] ? devres_for_each_res+0x41/0xc4 [ 96.230064] ? __fw_entry_found+0x3b/0x3b [ 96.230068] _raw_spin_lock_irqsave+0x3d/0x74 [ 96.230071] ? devres_for_each_res+0x41/0xc4 [ 96.230073] ? kref_get+0xa/0xa [ 96.230075] ? alloc_fw_cache_entry+0x4e/0x4e [ 96.230077] devres_for_each_res+0x41/0xc4 [ 96.230096] dev_cache_fw_image+0x59/0x11d [ 96.230098] ? fw_pm_notify+0xd1/0xd1 [ 96.230102] dpm_for_each_dev+0x41/0x58 [ 96.230104] fw_pm_notify+0xac/0xd1 [ 96.230108] notifier_call_chain+0x39/0x5a [ 96.230127] __blocking_notifier_call_chain+0x4e/0x65 [ 96.230130] __pm_notifier_call_chain+0x1b/0x2f [ 96.230133] pm_suspend+0x15b/0x2c1 [ 96.230135] state_store+0x4b/0x7e [ 96.230140] kernfs_fop_write+0x114/0x15c [ 96.230145] __vfs_write+0x33/0xd7 [ 96.230148] ? __sb_start_write+0x94/0x180 [ 96.230150] ? __sb_start_write+0xc0/0x180 [ 96.230153] vfs_write+0xa5/0xe2 [ 96.230156] SyS_write+0x5f/0xa3 [ 96.230160] do_syscall_64+0x79/0x88 [ 96.230164] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 96.230166] RIP: 0033:0x7fac7a914468 [ 96.230168] RSP: 002b:00007ffea25554e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 96.230171] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fac7a914468 [ 96.230172] RDX: 0000000000000004 RSI: 0000558679eb4500 RDI: 0000000000000001 [ 96.230174] RBP: 0000558679eb4500 R08: 000000000000000a R09: 0000558679eeb6b0 [ 96.230175] R10: 00007fac7a9a5b20 R11: 0000000000000246 R12: 0000000000000004 [ 96.230177] R13: 0000000000000001 R14: 00007fac7abe1740 R15: 0000000000000004 [ 96.230183] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 96.230189] IP: devres_for_each_res+0x59/0xc4 [ 96.230191] PGD 0 P4D 0 [ 96.230196] Oops: 0000 [#1] PREEMPT SMP PTI [ 96.230199] Modules linked in: [ 96.230203] CPU: 39 PID: 4506 Comm: pm-suspend Not tainted 4.16.0-rc7+ #2 [ 96.230205] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 WS/Z10PE-D16 WS, BIOS 3407 03/10/2017 [ 96.230208] RIP: 0010:devres_for_each_res+0x59/0xc4 [ 96.230210] RSP: 0018:ffffa63689fabc70 EFLAGS: 00010086 [ 96.230214] RAX: 0000000000000000 RBX: ffff9fd574d4ad98 RCX: ffff9fd574d4b1f0 [ 96.230216] RDX: ffff9fd576ea0918 RSI: ffff9fd574d4b1c0 RDI: ffff9fd576ea0918 [ 96.230218] RBP: ffffffffa8951b3a R08: 0000000000000000 R09: 0000000000000000 [ 96.230221] R10: ffff9fd576ea0000 R11: ffffffffab22aa07 R12: ffffffffa8951f44 [ 96.230223] R13: dead000000000100 R14: ffffffffa8951c08 R15: ffff9fd574d4b1a8 [ 96.230226] FS: 00007fac7b021740(0000) GS:ffff9fe57f400000(0000) knlGS:0000000000000000 [ 96.230229] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 96.230231] CR2: 0000000000000008 CR3: 000000202a4a8002 CR4: 00000000003606e0 [ 96.230233] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 96.230236] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 96.230238] Call Trace: [ 96.230243] dev_cache_fw_image+0x59/0x11d [ 96.230247] ? fw_pm_notify+0xd1/0xd1 [ 96.230250] dpm_for_each_dev+0x41/0x58 [ 96.230253] fw_pm_notify+0xac/0xd1 [ 96.230256] notifier_call_chain+0x39/0x5a [ 96.230261] __blocking_notifier_call_chain+0x4e/0x65 [ 96.230264] __pm_notifier_call_chain+0x1b/0x2f [ 96.230267] pm_suspend+0x15b/0x2c1 [ 96.230271] state_store+0x4b/0x7e [ 96.230275] kernfs_fop_write+0x114/0x15c [ 96.230279] __vfs_write+0x33/0xd7 [ 96.230284] ? __sb_start_write+0x94/0x180 [ 96.230286] ? __sb_start_write+0xc0/0x180 [ 96.230290] vfs_write+0xa5/0xe2 [ 96.230295] SyS_write+0x5f/0xa3 [ 96.230299] do_syscall_64+0x79/0x88 [ 96.230303] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 96.230305] RIP: 0033:0x7fac7a914468 [ 96.230308] RSP: 002b:00007ffea25554e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 96.230311] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fac7a914468 [ 96.230313] RDX: 0000000000000004 RSI: 0000558679eb4500 RDI: 0000000000000001 [ 96.230316] RBP: 0000558679eb4500 R08: 000000000000000a R09: 0000558679eeb6b0 [ 96.230318] R10: 00007fac7a9a5b20 R11: 0000000000000246 R12: 0000000000000004 [ 96.230320] R13: 0000000000000001 R14: 00007fac7abe1740 R15: 0000000000000004 [ 96.230326] Code: 48 83 ec 28 48 89 4c 24 18 4c 89 4c 24 20 e8 ca 7d 4b 00 48 8d 8b 58 04 00 00 48 89 44 24 08 48 8b 83 60 04 00 00 48 89 4c 24 10 <4c> 8b 68 08 48 3b 44 24 10 74 44 4c 3b 70 10 75 35 48 83 c0 28 [ 96.230390] RIP: devres_for_each_res+0x59/0xc4 RSP: ffffa63689fabc70 [ 96.230392] CR2: 0000000000000008 [ 96.230395] ---[ end trace 43f33fa700a0efa9 ]--- [ 96.233392] note: pm-suspend[4506] exited with preempt_count 1
https://bugs.freedesktop.org/show_bug.cgi?id=104825
Harry Wentland harry.wentland@amd.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |FIXED Status|NEW |RESOLVED
--- Comment #18 from Harry Wentland harry.wentland@amd.com --- Thanks for your bug report, and constant testing and feedback. Since the original issue is fixed in drm-next-4.17-wip I'll mark this resolved. If I'm jumping the gun here and you still notice problems feel free to reopen.
The suspend/resume issue you mention seems to be something else and might be best tracked with a new bug report, possibly on kernel.org as I don't see an indication that this is graphics driver related.
As for back-porting the changes, AFAIK this bug report uncovered quite a few issues which were all fixed with separate commits. I don't think all of them have been pulled into 4.16.
dri-devel@lists.freedesktop.org