https://bugs.freedesktop.org/show_bug.cgi?id=106430
Bug ID: 106430
Summary: GPU hang when played video with acceleration (vaapi)
Product: DRI
Version: XOrg git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: mikhail.v.gavrilov@gmail.com
Created attachment 139407 --> https://bugs.freedesktop.org/attachment.cgi?id=139407&action=edit dmesg
* Fedora 29 (Rawhide)
* Latest system updates:
  - kernel 4.17.0-0.rc3.git4.1
  - drm 3.25.0
  - mesa 18.1.0-rc2
  - llvm 6.0.0
Steps to reproduce:
1) # dnf install gstreamer1-vaapi
2) Play an H.264-encoded video in the Totem player

Symptoms:
1. The system stops responding.
Kernel output after the GPU hang:
[ 89.056879] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=2638, last emitted seq=2640
[ 89.056926] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, last signaled seq=80, last emitted seq=82
[ 89.056932] [drm] No hardware hang detected. Did some blocks stall?
[ 89.056948] [drm] No hardware hang detected. Did some blocks stall?
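As a reading aid, the two timeout lines can be compared mechanically: the gap between "last emitted" and "last signaled" is the number of fences each ring never completed. A minimal sketch, assuming the message format stays exactly as quoted in this report; the helper name `pending_fences` is made up for illustration and is not part of amdgpu or any tool:

```python
import re

# Matches amdgpu ring-timeout lines in the form quoted in this report.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\w+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def pending_fences(log_text):
    """Return {ring name: emitted - signaled} for each timeout line found."""
    result = {}
    for match in RING_TIMEOUT.finditer(log_text):
        gap = int(match.group("emitted")) - int(match.group("signaled"))
        result[match.group("ring")] = gap
    return result


# Sample copied from the kernel output in this report.
sample = (
    "[ 89.056879] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=2638, last emitted seq=2640\n"
    "[ 89.056926] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, "
    "last signaled seq=80, last emitted seq=82\n"
)

print(pending_fences(sample))
```

For the lines above this reports a gap of 2 on both the gfx and the uvd ring, i.e. both rings stalled with work still queued.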
--- Comment #1 from mikhail.v.gavrilov@gmail.com ---
If I do not restart the computer and leave it in the hung state, then after a while the fan starts spinning at full speed and all the LEDs on the video card go out.
I was even frightened. Rebooting with the reset button did not help: the fan kept running loudly and the LEDs on the video card did not light up.
Only after powering the computer off completely was it possible to get the video card working again.
Here is what was logged in dmesg at that time:
[247125.285043] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=20977028, last emitted seq=20977030
[247125.285083] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd timeout, last signaled seq=30, last emitted seq=31
[247125.285085] [drm] No hardware hang detected. Did some blocks stall?
[247125.285087] [drm] No hardware hang detected. Did some blocks stall?
[247359.270184] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.
[247359.270188] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1
[247359.270190] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[247359.270191] amdgpu_cs:0 D12728 21382 21309 0x00000000
[247359.270196] Call Trace:
[247359.270203] ? __schedule+0x2ba/0xaf0
[247359.270220] ? dma_fence_default_wait+0x231/0x370
[247359.270222] schedule+0x2f/0x90
[247359.270235] schedule_timeout+0x35c/0x520
[247359.270238] ? dma_fence_default_wait+0x72/0x370
[247359.270242] ? dma_fence_default_wait+0x231/0x370
[247359.270245] dma_fence_default_wait+0x25d/0x370
[247359.270247] ? dma_fence_release+0x160/0x160
[247359.270251] dma_fence_wait_timeout+0x4f/0x270
[247359.270300] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu]
[247359.270325] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu]
[247359.270356] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247359.270368] drm_ioctl_kernel+0x5b/0xb0 [drm]
[247359.270375] drm_ioctl+0x1b3/0x370 [drm]
[247359.270397] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[247359.270420] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[247359.270424] do_vfs_ioctl+0xa5/0x6d0
[247359.270428] ksys_ioctl+0x60/0x90
[247359.270431] __x64_sys_ioctl+0x16/0x20
[247359.270434] do_syscall_64+0x60/0x1f0
[247359.270438] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[247359.270545] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds.
[247359.270546] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247359.270548] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247359.270550] amdgpu_cs:0 D13400 12186 12133 0x00000000 [247359.270554] Call Trace: [247359.270557] ? __schedule+0x2ba/0xaf0 [247359.270561] ? dma_fence_default_wait+0x231/0x370 [247359.270564] schedule+0x2f/0x90 [247359.270566] schedule_timeout+0x35c/0x520 [247359.270569] ? dma_fence_default_wait+0x72/0x370 [247359.270573] ? dma_fence_default_wait+0x231/0x370 [247359.270575] dma_fence_default_wait+0x25d/0x370 [247359.270577] ? dma_fence_release+0x160/0x160 [247359.270580] dma_fence_wait_timeout+0x4f/0x270 [247359.270604] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247359.270626] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247359.270656] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247359.270665] drm_ioctl_kernel+0x5b/0xb0 [drm] [247359.270672] drm_ioctl+0x1b3/0x370 [drm] [247359.270692] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247359.270713] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247359.270717] do_vfs_ioctl+0xa5/0x6d0 [247359.270721] ksys_ioctl+0x60/0x90 [247359.270724] __x64_sys_ioctl+0x16/0x20 [247359.270727] do_syscall_64+0x60/0x1f0 [247359.270730] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247359.270886] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds. [247359.270887] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247359.270889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247359.270890] kworker/u16:1 D10936 16581 2 0x80000000 [247359.270905] Workqueue: events_unbound commit_work [drm_kms_helper] [247359.270907] Call Trace: [247359.270910] ? __schedule+0x2ba/0xaf0 [247359.270914] ? dma_fence_default_wait+0x231/0x370 [247359.270916] schedule+0x2f/0x90 [247359.270919] schedule_timeout+0x35c/0x520 [247359.270922] ? dma_fence_default_wait+0x72/0x370 [247359.270925] ? 
dma_fence_default_wait+0x231/0x370 [247359.270927] dma_fence_default_wait+0x25d/0x370 [247359.270929] ? dma_fence_release+0x160/0x160 [247359.270932] dma_fence_wait_timeout+0x4f/0x270 [247359.270935] reservation_object_wait_timeout_rcu+0x236/0x4e0 [247359.270967] amdgpu_dm_do_flip+0x112/0x350 [amdgpu] [247359.271003] amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu] [247359.271008] ? wait_for_completion_timeout+0x73/0x1a0 [247359.271019] commit_tail+0x3d/0x70 [drm_kms_helper] [247359.271025] process_one_work+0x261/0x630 [247359.271030] worker_thread+0x3a/0x390 [247359.271033] ? process_one_work+0x630/0x630 [247359.271036] kthread+0x120/0x140 [247359.271039] ? kthread_create_worker_on_cpu+0x70/0x70 [247359.271041] ret_from_fork+0x3a/0x50 [247359.271056] INFO: lockdep is turned off. [247482.151777] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds. [247482.151781] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247482.151782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247482.151784] amdgpu_cs:0 D12728 21382 21309 0x00000000 [247482.151788] Call Trace: [247482.151796] ? __schedule+0x2ba/0xaf0 [247482.151799] ? dma_fence_default_wait+0x231/0x370 [247482.151802] schedule+0x2f/0x90 [247482.151804] schedule_timeout+0x35c/0x520 [247482.151807] ? dma_fence_default_wait+0x72/0x370 [247482.151810] ? dma_fence_default_wait+0x231/0x370 [247482.151812] dma_fence_default_wait+0x25d/0x370 [247482.151814] ? dma_fence_release+0x160/0x160 [247482.151817] dma_fence_wait_timeout+0x4f/0x270 [247482.151863] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247482.151884] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247482.151912] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.151924] drm_ioctl_kernel+0x5b/0xb0 [drm] [247482.151932] drm_ioctl+0x1b3/0x370 [drm] [247482.151952] ? 
amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.151973] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247482.151977] do_vfs_ioctl+0xa5/0x6d0 [247482.151982] ksys_ioctl+0x60/0x90 [247482.151986] __x64_sys_ioctl+0x16/0x20 [247482.151989] do_syscall_64+0x60/0x1f0 [247482.151993] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247482.152121] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds. [247482.152123] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247482.152124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247482.152126] amdgpu_cs:0 D13400 12186 12133 0x00000000 [247482.152130] Call Trace: [247482.152143] ? __schedule+0x2ba/0xaf0 [247482.152146] ? dma_fence_default_wait+0x231/0x370 [247482.152148] schedule+0x2f/0x90 [247482.152150] schedule_timeout+0x35c/0x520 [247482.152153] ? dma_fence_default_wait+0x72/0x370 [247482.152156] ? dma_fence_default_wait+0x231/0x370 [247482.152169] dma_fence_default_wait+0x25d/0x370 [247482.152171] ? dma_fence_release+0x160/0x160 [247482.152174] dma_fence_wait_timeout+0x4f/0x270 [247482.152203] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247482.152233] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247482.152281] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.152299] drm_ioctl_kernel+0x5b/0xb0 [drm] [247482.152316] drm_ioctl+0x1b3/0x370 [drm] [247482.152335] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247482.152375] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247482.152379] do_vfs_ioctl+0xa5/0x6d0 [247482.152382] ksys_ioctl+0x60/0x90 [247482.152385] __x64_sys_ioctl+0x16/0x20 [247482.152387] do_syscall_64+0x60/0x1f0 [247482.152390] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247482.152554] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds. [247482.152556] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247482.152558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[247482.152560] kworker/u16:1 D10936 16581 2 0x80000000 [247482.152571] Workqueue: events_unbound commit_work [drm_kms_helper] [247482.152574] Call Trace: [247482.152579] ? __schedule+0x2ba/0xaf0 [247482.152584] ? dma_fence_default_wait+0x231/0x370 [247482.152587] schedule+0x2f/0x90 [247482.152590] schedule_timeout+0x35c/0x520 [247482.152594] ? dma_fence_default_wait+0x72/0x370 [247482.152599] ? dma_fence_default_wait+0x231/0x370 [247482.152603] dma_fence_default_wait+0x25d/0x370 [247482.152606] ? dma_fence_release+0x160/0x160 [247482.152610] dma_fence_wait_timeout+0x4f/0x270 [247482.152615] reservation_object_wait_timeout_rcu+0x236/0x4e0 [247482.152651] amdgpu_dm_do_flip+0x112/0x350 [amdgpu] [247482.152691] amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu] [247482.152713] ? wait_for_completion_timeout+0x73/0x1a0 [247482.152721] commit_tail+0x3d/0x70 [drm_kms_helper] [247482.152725] process_one_work+0x261/0x630 [247482.152732] worker_thread+0x3a/0x390 [247482.152735] ? process_one_work+0x630/0x630 [247482.152737] kthread+0x120/0x140 [247482.152740] ? kthread_create_worker_on_cpu+0x70/0x70 [247482.152742] ret_from_fork+0x3a/0x50 [247482.152751] INFO: lockdep is turned off. [247605.031356] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds. [247605.031360] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247605.031362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247605.031364] amdgpu_cs:0 D12728 21382 21309 0x00000000 [247605.031369] Call Trace: [247605.031376] ? __schedule+0x2ba/0xaf0 [247605.031381] ? dma_fence_default_wait+0x231/0x370 [247605.031383] schedule+0x2f/0x90 [247605.031386] schedule_timeout+0x35c/0x520 [247605.031389] ? dma_fence_default_wait+0x72/0x370 [247605.031393] ? dma_fence_default_wait+0x231/0x370 [247605.031396] dma_fence_default_wait+0x25d/0x370 [247605.031398] ? 
dma_fence_release+0x160/0x160 [247605.031401] dma_fence_wait_timeout+0x4f/0x270 [247605.031439] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247605.031467] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247605.031512] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031525] drm_ioctl_kernel+0x5b/0xb0 [drm] [247605.031543] drm_ioctl+0x1b3/0x370 [drm] [247605.031566] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031590] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247605.031596] do_vfs_ioctl+0xa5/0x6d0 [247605.031600] ksys_ioctl+0x60/0x90 [247605.031603] __x64_sys_ioctl+0x16/0x20 [247605.031606] do_syscall_64+0x60/0x1f0 [247605.031611] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [247605.031715] INFO: task amdgpu_cs:0:12186 blocked for more than 120 seconds. [247605.031717] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247605.031718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247605.031720] amdgpu_cs:0 D13400 12186 12133 0x00000000 [247605.031725] Call Trace: [247605.031729] ? __schedule+0x2ba/0xaf0 [247605.031733] ? dma_fence_default_wait+0x231/0x370 [247605.031735] schedule+0x2f/0x90 [247605.031738] schedule_timeout+0x35c/0x520 [247605.031741] ? dma_fence_default_wait+0x72/0x370 [247605.031744] ? dma_fence_default_wait+0x231/0x370 [247605.031746] dma_fence_default_wait+0x25d/0x370 [247605.031749] ? dma_fence_release+0x160/0x160 [247605.031752] dma_fence_wait_timeout+0x4f/0x270 [247605.031775] amdgpu_ctx_wait_prev_fence+0x4c/0x80 [amdgpu] [247605.031798] amdgpu_cs_ioctl+0x9d/0x1d10 [amdgpu] [247605.031828] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031838] drm_ioctl_kernel+0x5b/0xb0 [drm] [247605.031846] drm_ioctl+0x1b3/0x370 [drm] [247605.031866] ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu] [247605.031887] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [247605.031892] do_vfs_ioctl+0xa5/0x6d0 [247605.031896] ksys_ioctl+0x60/0x90 [247605.031899] __x64_sys_ioctl+0x16/0x20 [247605.031902] do_syscall_64+0x60/0x1f0 [247605.031906] ? 
entry_SYSCALL_64_after_hwframe+0x49/0xbe [247605.032047] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds. [247605.032049] Not tainted 4.17.0-0.rc3.git4.1.fc29.x86_64 #1 [247605.032050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [247605.032052] kworker/u16:1 D10936 16581 2 0x80000000 [247605.032063] Workqueue: events_unbound commit_work [drm_kms_helper] [247605.032065] Call Trace: [247605.032069] ? __schedule+0x2ba/0xaf0 [247605.032073] ? dma_fence_default_wait+0x231/0x370 [247605.032075] schedule+0x2f/0x90 [247605.032078] schedule_timeout+0x35c/0x520 [247605.032081] ? dma_fence_default_wait+0x72/0x370 [247605.032085] ? dma_fence_default_wait+0x231/0x370 [247605.032087] dma_fence_default_wait+0x25d/0x370 [247605.032089] ? dma_fence_release+0x160/0x160 [247605.032092] dma_fence_wait_timeout+0x4f/0x270 [247605.032095] reservation_object_wait_timeout_rcu+0x236/0x4e0 [247605.032127] amdgpu_dm_do_flip+0x112/0x350 [amdgpu] [247605.032162] amdgpu_dm_atomic_commit_tail+0xa76/0xd00 [amdgpu] [247605.032166] ? wait_for_completion_timeout+0x73/0x1a0 [247605.032175] commit_tail+0x3d/0x70 [drm_kms_helper] [247605.032180] process_one_work+0x261/0x630 [247605.032185] worker_thread+0x3a/0x390 [247605.032188] ? process_one_work+0x630/0x630 [247605.032191] kthread+0x120/0x140 [247605.032194] ? kthread_create_worker_on_cpu+0x70/0x70 [247605.032197] ret_from_fork+0x3a/0x50 [247605.032208] INFO: lockdep is turned off. [247640.263559] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247640.663689] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247641.416206] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247641.512251] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247641.773087] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.121791] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247642.220684] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.481411] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.612305] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.900084] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.935635] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247642.999194] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247643.552447] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247643.668968] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247643.690139] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.099977] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.232435] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.292521] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.358833] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.376341] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.390073] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.514553] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.529169] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.581504] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.688219] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.787111] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.812531] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.873729] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247644.928613] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.939548] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247644.961052] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.056869] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.198003] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.280336] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.360668] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.434358] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.441931] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.565895] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.639253] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.711531] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.729971] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.744137] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247645.952694] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.140934] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.259925] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.319308] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.363976] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.389526] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.457577] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.513275] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247646.544150] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.637789] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.651337] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.710404] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.785978] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.928178] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247646.955859] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.016425] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.134880] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.159276] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.249781] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.315185] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.325523] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.361488] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.383235] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.439095] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.460806] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.485170] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.502436] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.548979] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.594343] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.621786] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247647.649303] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.670292] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.701090] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.735796] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.774236] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.816521] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.840603] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.869076] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.948394] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247647.977194] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.008216] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.041878] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.102950] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.123688] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.161477] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.210530] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.248898] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.273809] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.308455] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.357214] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.393870] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.418454] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247648.429277] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.508805] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.529862] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.581775] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.595466] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.679402] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.714558] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.767368] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.784370] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.805855] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.872980] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.933891] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.944161] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247648.979727] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.036203] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.094332] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.138191] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.175616] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.279457] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.313344] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.483680] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.519062] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247649.554865] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.601461] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.655004] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.760903] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.784816] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.870742] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247649.923269] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.003330] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.129582] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.206246] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.330698] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.481865] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.513212] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.564055] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.773681] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.780123] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.821904] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.841934] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.877117] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.901374] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247650.985498] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.026897] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247651.068131] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.109751] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.126539] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.355831] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.791237] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.829065] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247651.928932] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.077168] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.083449] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.211548] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.288786] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.302159] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.496320] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.614161] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.655070] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.745940] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247652.808084] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.117247] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.141879] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.166410] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.193642] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.338192] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! 
[247653.560506] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247653.898569] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.135093] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.283233] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.445210] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.465085] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.865339] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247654.987101] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247655.933191] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247655.993198] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247656.465146] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0! [247669.543630] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=6004078, last emitted seq=6004080 [247669.543635] [drm] No hardware hang detected. Did some blocks stall?
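A log this long is easier to read as a summary. The following is only a sketch, assuming the hung-task and powerplay messages keep the wording quoted above; `summarize` is a hypothetical helper written for this illustration, not an existing tool:

```python
import re
from collections import Counter

# "INFO: task <name>:<pid> blocked for more than N seconds" as quoted above.
# The task name itself may contain colons (e.g. "amdgpu_cs:0"), so the pid
# is taken as the last colon-separated number before " blocked".
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than \d+ seconds"
)
OVER_TEMP = "GPU over temperature range detected"


def summarize(log_text):
    """Count hung-task reports per (task, pid) and over-temperature warnings."""
    hung = Counter(
        (m.group("name"), int(m.group("pid"))) for m in HUNG_TASK.finditer(log_text)
    )
    return hung, log_text.count(OVER_TEMP)


# A few lines copied from the dmesg output in this comment.
sample = (
    "[247359.270184] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.\n"
    "[247359.270886] INFO: task kworker/u16:1:16581 blocked for more than 120 seconds.\n"
    "[247482.151777] INFO: task amdgpu_cs:0:21382 blocked for more than 120 seconds.\n"
    "[247640.263559] amdgpu: [powerplay] GPU over temperature range detected on PCIe 0:0.0!\n"
)

hung, over_temp = summarize(sample)
for (name, pid), count in hung.items():
    print(f"{name} (pid {pid}): reported hung {count} time(s)")
print(f"over-temperature warnings: {over_temp}")
```

Fed the full dmesg attachment instead of the short sample, this would show at a glance which tasks stay blocked across the repeated reports and how many thermal warnings preceded the sdma0 timeout.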
--- Comment #2 from mikhail.v.gavrilov@gmail.com --- Created attachment 139579 --> https://bugs.freedesktop.org/attachment.cgi?id=139579&action=edit dmesg
--- Comment #3 from mikhail.v.gavrilov@gmail.com ---
A very strange coincidence: every time I reproduce the GPU hang described in this bug while playing a video with VAAPI acceleration, the following messages appear in the kernel log after reboot:
[ 0.059000] mce: [Hardware Error]: Machine check events logged
[ 0.059000] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: fe00000000800400
[ 0.059000] mce: [Hardware Error]: TSC 0 ADDR ffffffffc07f31d5 MISC ffffffffc07f31d5
[ 0.059000] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 0 microcode 24
[ 0.059000] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver.
[ 0.059000] ... version: 3
[ 0.059000] ... bit width: 48
[ 0.059000] ... generic registers: 4
[ 0.059000] ... value mask: 0000ffffffffffff
[ 0.059000] ... max period: 00007fffffffffff
[ 0.059000] ... fixed-purpose events: 3
[ 0.059000] ... event mask: 000000070000000f
[ 0.059000] Hierarchical SRCU implementation.
[ 0.059692] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 0.059740] smp: Bringing up secondary CPUs ...
[ 0.060031] x86: Booting SMP configuration:
[ 0.060035] .... node #0, CPUs: #1
[ 0.061563] mce: [Hardware Error]: Machine check events logged
[ 0.061567] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
[ 0.061583] mce: [Hardware Error]: TSC 0 ADDR ffffffff957932bb MISC ffffffff957932bb
[ 0.061602] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 2 microcode 24
[ 0.061684] #2
[ 0.063341] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: be00000000800400
[ 0.063341] mce: [Hardware Error]: TSC 0 ADDR ffffffffc02bd4e1 MISC ffffffffc02bd4e1
[ 0.063341] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 4 microcode 24
[ 0.063471] #3
[ 0.065119] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 3: be00000000800400
[ 0.065125] mce: [Hardware Error]: TSC 0 ADDR ffffffffc07f31d5 MISC ffffffffc07f31d5
[ 0.065144] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527195880 SOCKET 0 APIC 6 microcode 24
[ 0.065255] #4 #5 #6 #7
--- Comment #4 from mikhail.v.gavrilov@gmail.com --- Created attachment 139764 --> https://bugs.freedesktop.org/attachment.cgi?id=139764&action=edit dmesg
--- Comment #5 from mikhail.v.gavrilov@gmail.com ---
After updating the kernel to 4.17.0-0.rc6.git1.1, the strange mce error messages after reboot disappeared, but the GPU still hangs.
--- Comment #6 from mikhail.v.gavrilov@gmail.com ---
The strange mce messages came back with kernel 4.17.0-0.rc6.git3.1.fc29.x86_64:
$ dmesg | grep mce
[ 0.027300] mce: CPU supports 9 MCE banks
[ 0.058829] mce: [Hardware Error]: Machine check events logged
[ 0.058834] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: b200000000800400
[ 0.058856] mce: [Hardware Error]: TSC 0
[ 0.058867] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 0 microcode 24
[ 0.058883] mce: [Hardware Error]: Machine check events logged
[ 0.058885] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: fe00000000800400
[ 0.058898] mce: [Hardware Error]: TSC 0 ADDR ffffffffc055a2f6 MISC ffffffffc055a2f6
[ 0.058916] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 0 microcode 24
[ 0.061682] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
[ 0.061700] mce: [Hardware Error]: TSC 0 ADDR ffffffffc0bd9215 MISC ffffffffc0bd9215
[ 0.061719] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 2 microcode 24
[ 0.063495] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: be00000000800400
[ 0.063495] mce: [Hardware Error]: TSC 0 ADDR ffffffffc04404e1 MISC ffffffffc04404e1
[ 0.063495] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 4 microcode 24
[ 0.065271] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 3: be00000000800400
[ 0.065271] mce: [Hardware Error]: TSC 0 ADDR ffffffffc055a2f6 MISC ffffffffc055a2f6
[ 0.065271] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1527435635 SOCKET 0 APIC 6 microcode 24
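The 64-bit status words in these mce lines can be unpacked by hand. A rough sketch, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, model-specific error code in bits 31:16, MCA error code in bits 15:0); the helper name `decode_mci_status` is made up for this illustration and is not part of the kernel or mcelog:

```python
# Flag bits of IA32_MCi_STATUS (assumed layout, per the Intel SDM
# machine-check architecture chapter).
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status_hex):
    """Split a hex status word into flag names and the two error-code fields."""
    status = int(status_hex, 16)
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# The three distinct status words that appear in this report.
for word in ("fe00000000800400", "be00000000800400", "b200000000800400"):
    decoded = decode_mci_status(word)
    print(
        word,
        ",".join(decoded["flags"]),
        f"mca=0x{decoded['mca_error_code']:04x}",
        f"model=0x{decoded['model_specific_code']:04x}",
    )
```

Under that assumed layout all three words carry MCA error code 0x0400, which the SDM lists as an internal timer error, and all have UC and PCC set. The b2... word has neither MISCV nor ADDRV set, which is consistent with its bare "TSC 0" line showing no ADDR or MISC fields.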
--- Comment #7 from mikhail.v.gavrilov@gmail.com --- Created attachment 139798 --> https://bugs.freedesktop.org/attachment.cgi?id=139798&action=edit dmesg
--- Comment #8 from mikhail.v.gavrilov@gmail.com --- Created attachment 140078 --> https://bugs.freedesktop.org/attachment.cgi?id=140078&action=edit fresh dmesg from kernel 4.18.0-0.rc0.git2.1
--- Comment #9 from Benjamin Xiao ben.r.xiao@gmail.com --- I am seeing the same thing with VLC when setting Hardware-accelerated decoding from Automatic to VA-API.
Fedora 28
RX Vega 64
Kernel 4.17.2
mesa 18.0.5
llvm 6
--- Comment #10 from mikhail.v.gavrilov@gmail.com ---
Benjamin, I see that in Fedora 29 (Rawhide) the problem is gone with kernel 4.18.0-0.rc0.git9.1.fc29.
However, kernel 4.18.0-0.rc0.git10.1.fc29.x86_64 brought yet another problem: video output (at least over DP) stopped working.
Martin Peres martin.peres@free.fr changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |MOVED
--- Comment #11 from Martin Peres martin.peres@free.fr ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/374.