https://bugs.freedesktop.org/show_bug.cgi?id=104345
Bug ID: 104345 Summary: Playing video hangs X-Server with showing scrambled picture, sound still playing. Product: DRI Version: unspecified Hardware: x86-64 (AMD64) OS: Linux (All) Status: NEW Severity: normal Priority: medium Component: DRM/AMDgpu Assignee: dri-devel@lists.freedesktop.org Reporter: bernhardu@mailbox.org
Created attachment 136312 --> https://bugs.freedesktop.org/attachment.cgi?id=136312&action=edit kernel output captured from the serial line.
Playing video hangs X-Server with showing scrambled picture, sound still playing.
System hangs after some time playing mostly when playing a recorded DVB-S-SD stream via VLC. Happened also with other videos and players.
This left the system showing a distorted version of the last shown frame from the video. Most of the time it is a chess board like pattern. Sound is still playing regularly. A ssh connection is still possible.
Playing the exact same video again later works without problem.
Kernel output shows at this time following:
[ 4353.316286] amdgpu 0000:08:00.0: GPU fault detected: 146 0x0570480c [ 4353.328836] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001004AE [ 4353.343791] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04800C [ 4353.358755] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 5) at page 1049774, read from 'TC0' (0x54433000) (72) [ 4353.378234] amdgpu 0000:08:00.0: GPU fault detected: 146 0x0570440c [ 4353.390777] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010049C [ 4353.405735] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A04400C [ 4353.420692] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 5) at page 1049756, read from 'TC1' (0x54433100) (68)
Followed by this:
[ 4592.681352] INFO: task amdgpu_cs:0:1007 blocked for more than 120 seconds. [ 4592.695116] Tainted: G O 4.13.0-1-amd64 #1 Debian 4.13.13-1 [ 4592.709587] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4592.725255] amdgpu_cs:0 D 0 1007 1000 0x00400000 [ 4592.736245] Call Trace: [ 4592.741145] ? __schedule+0x3c8/0x860 [ 4592.748486] ? schedule+0x32/0x80 [ 4592.755120] ? schedule_timeout+0x1da/0x350 [ 4592.763537] ? amdgpu_vm_validate_level.isra.9+0x80/0x80 [amdgpu] [ 4592.775725] ? dma_fence_default_wait+0x239/0x260 [ 4592.785138] ? dma_fence_default_wait+0x239/0x260 [ 4592.794548] ? dma_fence_free+0x20/0x20 [ 4592.802220] ? dma_fence_wait_timeout+0x33/0xe0 [ 4592.811317] ? amdgpu_ctx_add_fence+0x61/0xf0 [amdgpu] [ 4592.821632] ? amdgpu_cs_ioctl+0x156b/0x1920 [amdgpu] [ 4592.831784] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu] [ 4592.842432] ? drm_ioctl_kernel+0x65/0xb0 [drm] [ 4592.851510] ? drm_ioctl+0x2e3/0x3a0 [drm] [ 4592.859737] ? amdgpu_cs_find_mapping+0x90/0x90 [amdgpu] [ 4592.870373] ? do_futex+0x2df/0xa90 [ 4592.877376] ? amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [ 4592.886971] ? do_vfs_ioctl+0x9f/0x600 [ 4592.894471] ? SyS_futex+0x7a/0x170 [ 4592.901453] ? SyS_ioctl+0x74/0x80 [ 4592.908264] ? system_call_fast_compare_end+0xc/0x97
Attached is a log captured on the serial port from boot to the crash follwed by a Magic-Sys-Req-T.
This are the system specs:
Graphics: Card: Advanced Micro Devices [AMD/ATI] Baffin [Radeon RX 460] (XFX) Display Server: x11 (X.Org 1.19.5 ) drivers: amdgpu (unloaded: modesetting,fbdev,vesa) Resolution: 1920x1080@60.00hz OpenGL: renderer: AMD Radeon RX 460 Graphics (AMD POLARIS11 / DRM 3.18.0 / 4.13.0-1-amd64, LLVM 5.0.0) version: 4.5 Mesa 17.2.5 System: Kernel: 4.13.0-1-amd64 x86_64 bits: 64 Desktop: KDE Plasma 5.10.5 Machine: Device: desktop Mobo: ASUSTeK model: PRIME B350M-A v: Rev X.0x serial: N/A UEFI: American Megatrends v: 3203 date: 11/09/2017 CPU: Octa core AMD Ryzen 7 1700 Eight-Core (-HT-MCP-) speed/max: 1550/3000 MHz
Error happened with Debian Stretch with linux 4.9 and later. Currently it is running a Debian testing.
Is this the right product/component for this report?
https://bugs.freedesktop.org/show_bug.cgi?id=104345
Michel Dänzer michel@daenzer.net changed:
What |Removed |Added ---------------------------------------------------------------------------- QA Contact| |dri-devel@lists.freedesktop | |.org Product|DRI |Mesa Component|DRM/AMDgpu |Drivers/Gallium/radeonsi
--- Comment #1 from Michel Dänzer michel@daenzer.net --- It's more likely a Mesa or maybe LLVM issue. Did this start happening after you upgraded Mesa and/or LLVM, or maybe another component such as the kernel?
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #2 from bernhardu bernhardu@mailbox.org --- (In reply to Michel Dänzer from comment #1)
It's more likely a Mesa or maybe LLVM issue. Did this start happening after you upgraded Mesa and/or LLVM, or maybe another component such as the kernel?
I get such hangs since I am using this system since april. I waited to report it because at that time I wondered if this might be a firmware issue, as the hardware was very new. But after some bios updates these hangs did not resolve. (Just mainboard firmware update, found not yet an update for the graphics card.) Also thought that maybe the versions in Debian Stretch would be too old.
The earliest hanging logs I have are from mid June with kernel 4.9.0-3-amd64. There the output was a little different as there was Xorg shown as blocking: INFO: task Xorg:1024 blocked for more than 120 seconds. I tried every following kernel up to 4.12 via backports and upgraded at begin of december to testing with 4.13.
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #3 from Michel Dänzer michel@daenzer.net --- Please attach the output of glxinfo and vdpauinfo.
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #4 from bernhardu bernhardu@mailbox.org --- Created attachment 136335 --> https://bugs.freedesktop.org/attachment.cgi?id=136335&action=edit glxinfo
Sorry for the delay.
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #5 from bernhardu bernhardu@mailbox.org --- Created attachment 136336 --> https://bugs.freedesktop.org/attachment.cgi?id=136336&action=edit vdpauinfo
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #6 from bernhardu bernhardu@mailbox.org --- Created attachment 136709 --> https://bugs.freedesktop.org/attachment.cgi?id=136709&action=edit kernel output captured from the serial line. (4.14.10+lockdep)
Attaching another example with a vanilla kernel 4.14.10 with lockdep enabled.
Shows again such "GPU fault detected:" followed by a "task amdgpu_cs:0:1014 blocked for more than 120 seconds.
I might add that it may be related to one or more suspend and resume cycles.
Is there anything I can attach or test to make identification of the issue easier?
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #7 from bernhardu bernhardu@mailbox.org --- Created attachment 137161 --> https://bugs.freedesktop.org/attachment.cgi?id=137161&action=edit kernel output captured from the serial line. (4.14.14 vanilla with debian config)
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #8 from bernhardu bernhardu@mailbox.org --- Created attachment 137162 --> https://bugs.freedesktop.org/attachment.cgi?id=137162&action=edit video output with the chess board like pattern
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #9 from bernhardu bernhardu@mailbox.org --- This time with latest firmware, with "rcu_nocbs=0-16" for #196683 Happened without any hibernation cycle. Just the "GPU fault detected: 146" appeared, no "blocked for more than" message.
Somehow it looks like when this "GPU fault" happens some locks are never unlocked and therefore blocking the X server. At least the possibility for a clean shutdown would be good.
Is there any other information I should collect next time, any other configuration or version I can test?
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #10 from bernhardu bernhardu@mailbox.org --- Same happens with vanilla kernel 4.15.2.
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #11 from bernhardu bernhardu@mailbox.org --- Happened again today. Got no such hang since 04.04.2018
From Debian testing:
Linux kernel: 4.16.5-1 Mesa: 18.0.3-1 xserver-xorg-video-amdgpu: 18.0.1-1 libdrm2: 2.4.91-2 (Nearly) current firmware: PRIME-B350M-A-ASUS-4008.CAP
[ 1085.521781] amdgpu 0000:08:00.0: GPU fault detected: 146 0x0bf8440c [ 1085.521789] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010037F [ 1085.521792] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0204400C [ 1085.521798] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 1) at page 1049471, read from 'TC1' (0x54433100) (68) [ 1095.659133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=39661, last emitted seq=39663 [ 1095.659146] [drm] IP block:gfx_v8_0 is hung! [ 1095.659194] [drm] GPU recovery disabled.
Is there anything else I could collect next time or try?
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #12 from bernhardu bernhardu@mailbox.org --- Created attachment 142867 --> https://bugs.freedesktop.org/attachment.cgi?id=142867&action=edit 2018-12-20-crash.txt
The last months I observed this less. But in December it started to get more visible again. (I am following debian testing closely.)
It shows now nearly always when e.g. browsing with firefox - even in simple sites e.g. bugtrackers.
The chess board like pattern is still the same.
The blocked kernel threads stack changed with some kernel upgrade to:
Call Trace: ? __schedule+0x2b7/0x880 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 schedule+0x28/0x80 schedule_timeout+0x1ee/0x380 ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu] ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu] dma_fence_default_wait+0x1fd/0x280 ? dma_fence_release+0x90/0x90 dma_fence_wait_timeout+0x39/0xf0 reservation_object_wait_timeout_rcu+0x10a/0x290 amdgpu_dm_do_flip+0x112/0x340 [amdgpu] amdgpu_dm_atomic_commit_tail+0xbac/0xdb0 [amdgpu] ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 ? __switch_to+0x16f/0x450 commit_tail+0x3d/0x70 [drm_kms_helper] process_one_work+0x195/0x380 worker_thread+0x30/0x390 ? process_one_work+0x380/0x380 kthread+0x113/0x130 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x22/0x40
What can I do to get this bug forward?
https://bugs.freedesktop.org/show_bug.cgi?id=104345
bernhardu bernhardu@mailbox.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Summary|Playing video hangs |X-Server hangs with showing |X-Server with showing |scrambled picture, sound |scrambled picture, sound |still playing. |still playing. |
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #13 from bernhardu bernhardu@mailbox.org --- Happened again. A way to avoid that crash may be to not stay int first cold boot. I am not sure but I think I never saw this when system was running from a rebooted state (warm boot).
[ 0.000000] Linux version 4.19.0-1-amd64 (debian-kernel@lists.debian.org) (gcc version 8.2.0 (Debian 8.2.0-13)) #1 SMP Debian 4.19.12-1 (2018-12-22) ... [32345.334084] amdgpu 0000:08:00.0: GPU fault detected: 146 0x06407704 for process plasmashell pid 1422 thread plasmashel:cs0 pid 1720 [32345.334092] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001002C8 [32345.334096] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02077004 [32345.334101] amdgpu 0000:08:00.0: VM fault (0x04, vmid 1, pasid 32775) at page 1049288, read from 'SDM0' (0x53444d30) (119) [32345.334114] amdgpu 0000:08:00.0: GPU fault detected: 146 0x06407704 for process plasmashell pid 1422 thread plasmashel:cs0 pid 1720 [32345.334118] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001002C8 [32345.334121] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02077004 [32345.334125] amdgpu 0000:08:00.0: VM fault (0x04, vmid 1, pasid 32775) at page 1049288, read from 'SDM0' (0x53444d30) (119) [32345.334523] amdgpu 0000:08:00.0: GPU fault detected: 146 0x06d8770c for process plasmashell pid 1422 thread plasmashel:cs0 pid 1720 [32345.334528] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001002DB [32345.334532] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0207700C [32345.334538] amdgpu 0000:08:00.0: VM fault (0x0c, vmid 1, pasid 32775) at page 1049307, read from 'SDM0' (0x53444d30) (119) [32355.357958] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=947938, emitted seq=947940 [32355.357965] [drm] GPU recovery disabled. [32505.119425] INFO: task kworker/u32:2:12454 blocked for more than 120 seconds. [32505.119433] Tainted: G OE 4.19.0-1-amd64 #1 Debian 4.19.12-1 [32505.119435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [32505.119438] kworker/u32:2 D 0 12454 2 0x80000000 [32505.119460] Workqueue: events_unbound commit_work [drm_kms_helper] [32505.119464] Call Trace: [32505.119474] ? __schedule+0x2a2/0x870 [32505.119480] ? __switch_to_asm+0x40/0x70 [32505.119484] schedule+0x28/0x80 [32505.119488] schedule_timeout+0x26d/0x390 [32505.119586] ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu] [32505.119682] ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu] [32505.119687] dma_fence_default_wait+0x238/0x2a0 [32505.119691] ? dma_fence_release+0x90/0x90 [32505.119695] dma_fence_wait_timeout+0xdd/0x100 [32505.119699] reservation_object_wait_timeout_rcu+0x173/0x280 [32505.119800] amdgpu_dm_do_flip+0x112/0x340 [amdgpu] [32505.119903] amdgpu_dm_atomic_commit_tail+0x750/0xdb0 [amdgpu] [32505.119909] ? wait_for_completion_timeout+0x3b/0x1a0 [32505.119913] ? __switch_to_asm+0x34/0x70 [32505.119917] ? __switch_to_asm+0x40/0x70 [32505.119921] ? __switch_to_asm+0x34/0x70 [32505.119924] ? __switch_to_asm+0x40/0x70 [32505.119938] commit_tail+0x3d/0x70 [drm_kms_helper] [32505.119946] process_one_work+0x1a7/0x3a0 [32505.119951] worker_thread+0x30/0x390 [32505.119955] ? pwq_unbound_release_workfn+0xd0/0xd0 [32505.119959] kthread+0x112/0x130 [32505.119963] ? kthread_bind+0x30/0x30 [32505.119967] ret_from_fork+0x22/0x40 [32528.537326] sysrq: SysRq : Keyboard mode set to system default
https://bugs.freedesktop.org/show_bug.cgi?id=104345
--- Comment #14 from bernhardu bernhardu@mailbox.org --- This sample happend while just browsing in firefox inside a plasma desktop, running in amd64 debian buster.
[138335.303379] amdgpu 0000:08:00.0: GPU fault detected: 147 0x0a704402 for process Xorg pid 1129 thread Xorg:cs0 pid 1161 [138335.303387] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0012014E [138335.303391] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E044002 [138335.303396] amdgpu 0000:08:00.0: VM fault (0x02, vmid 7, pasid 32768) at page 1179982, read from 'TC1' (0x54433100) (68) [138335.313378] amdgpu 0000:08:00.0: GPU fault detected: 147 0x08904802 for process kwin_x11 pid 1350 thread kwin_x11:cs0 pid 1568 [138335.313384] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00120712 [138335.313388] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048002 [138335.313393] amdgpu 0000:08:00.0: VM fault (0x02, vmid 1, pasid 32773) at page 1181458, read from 'TC0' (0x54433000) (72) [139983.820484] amdgpu 0000:08:00.0: GPU fault detected: 147 0x00004802 for process Xorg pid 1129 thread Xorg:cs0 pid 1161 [139983.820491] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000800 [139983.820495] amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C048002 [139983.820500] amdgpu 0000:08:00.0: VM fault (0x02, vmid 6, pasid 32768) at page 2048, read from 'TC0' (0x54433000) (72) [139994.156833] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=14610386, emitted seq=14610389 [139994.156840] [drm] GPU recovery disabled. [140167.215046] INFO: task kworker/u32:3:15564 blocked for more than 120 seconds. [140167.215053] Tainted: G OE 4.19.0-4-amd64 #1 Debian 4.19.28-2 [140167.215055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [140167.215058] kworker/u32:3 D 0 15564 2 0x80000000 [140167.215080] Workqueue: events_unbound commit_work [drm_kms_helper] [140167.215083] Call Trace: [140167.215093] ? __schedule+0x2a2/0x870 [140167.215098] ? __switch_to_asm+0x40/0x70 [140167.215102] schedule+0x28/0x80 [140167.215106] schedule_timeout+0x26d/0x390 [140167.215204] ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu] [140167.215300] ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu] [140167.215305] dma_fence_default_wait+0x238/0x2a0 [140167.215310] ? dma_fence_release+0x90/0x90 [140167.215314] dma_fence_wait_timeout+0xdd/0x100 [140167.215319] reservation_object_wait_timeout_rcu+0x173/0x280 [140167.215420] amdgpu_dm_do_flip+0x112/0x340 [amdgpu] [140167.215523] amdgpu_dm_atomic_commit_tail+0x750/0xdb0 [amdgpu] [140167.215528] ? wait_for_completion_timeout+0x3b/0x1a0 [140167.215531] ? __switch_to_asm+0x34/0x70 [140167.215535] ? __switch_to_asm+0x40/0x70 [140167.215538] ? __switch_to_asm+0x34/0x70 [140167.215541] ? __switch_to_asm+0x40/0x70 [140167.215555] commit_tail+0x3d/0x70 [drm_kms_helper] [140167.215562] process_one_work+0x1a7/0x3a0 [140167.215566] worker_thread+0x30/0x390 [140167.215571] ? create_worker+0x1a0/0x1a0 [140167.215573] kthread+0x112/0x130 [140167.215577] ? kthread_bind+0x30/0x30 [140167.215581] ret_from_fork+0x22/0x40 [140241.805748] sysrq: SysRq : Keyboard mode set to system default
What can I do to get this bug forward?
https://bugs.freedesktop.org/show_bug.cgi?id=104345
GitLab Migration User gitlab-migration@fdo.invalid changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |MOVED Status|NEW |RESOLVED
--- Comment #15 from GitLab Migration User gitlab-migration@fdo.invalid --- -- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1293.
dri-devel@lists.freedesktop.org