https://bugs.freedesktop.org/show_bug.cgi?id=99312
Bug ID: 99312
Summary: Long-running OpenCL kernels cause ring stalls and GPU lockups on Kabini
Product: Mesa
Version: 13.0
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: vedran@miletic.net
QA Contact: dri-devel@lists.freedesktop.org
Running long-running OpenCL kernels (e.g. GROMACS with a system of many atoms) using kernel 4.8.15, Mesa git, and LLVM git on a Kabini APU:
vendor_id  : AuthenticAMD
cpu family : 22
model      : 0
model name : AMD Athlon(tm) 5350 APU with Radeon(tm) R3
stepping   : 1
microcode  : 0x700010b
with GPU:
00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Kabini [Radeon HD 8400 / R3 Series] [1002:9830]
causes GPU lockups like:
[338584.980657] radeon 0000:00:01.0: ring 0 stalled for more than 10351msec
[338584.980811] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.484633] radeon 0000:00:01.0: ring 0 stalled for more than 10855msec
[338585.484789] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
[338585.988632] radeon 0000:00:01.0: ring 0 stalled for more than 11359msec
[338585.988787] radeon 0000:00:01.0: GPU lockup (current fence id 0x00000000000827c1 last fence id 0x00000000000827c2 on ring 0)
The machine does not hang. This is reliably reproducible. Is there any other info I can provide?
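For anyone triaging similar reports, the stall durations can be extracted from kernel log lines like the ones quoted in this report. This is only a minimal sketch: the regular expression and the function name `stall_durations` are illustrative assumptions based on the message format shown here, not part of any radeon tooling.

```python
import re

# Matches the "ring N stalled for more than Mmsec" messages quoted in this report.
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")


def stall_durations(log_text):
    """Return a list of (ring, milliseconds) pairs, one per stall message."""
    return [(int(ring), int(ms)) for ring, ms in STALL_RE.findall(log_text)]


# Sample lines copied from the report.
SAMPLE = """\
[338584.980657] radeon 0000:00:01.0: ring 0 stalled for more than 10351msec
[338585.484633] radeon 0000:00:01.0: ring 0 stalled for more than 10855msec
[338585.988632] radeon 0000:00:01.0: ring 0 stalled for more than 11359msec
"""

if __name__ == "__main__":
    for ring, ms in stall_durations(SAMPLE):
        print(f"ring {ring}: stalled for more than {ms} ms")
```

Piping real `dmesg` output into `stall_durations` in place of `SAMPLE` should work the same way, provided the messages keep this format.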
Vedran Miletić vedran@miletic.net changed:
What     |Removed |Added
----------------------------------------------------------------------------
Hardware |Other   |x86-64 (AMD64)
Severity |normal  |major
Version  |13.0    |git
OS       |All     |Linux (All)
--- Comment #1 from John Bridgman john.bridgman@amd.com ---
If you have not already done so, try disabling the watchdog timer:
MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000 = 10 seconds, 0 = disable)");
module_param_named(lockup_timeout, radeon_lockup_timeout, int, 0444);
As part of HSA/ROC development we dropped the priority of compute work relative to graphics which improved interactivity and *almost* eliminated timeouts without having to disable the timer - when I get back in the office I'll dig up the changes. In the meantime, I think disabling the timer will do what you need although you will still have sluggish graphics while long-running kernels are active.
Lowering the priority of compute waves across the board won't be a fully general solution, because there will be some cases (e.g. Valve's recent work on using high-priority compute to improve VR smoothness) where compute needs to be *higher* priority than graphics. It should, however, cover most cases other than "simultaneously running GROMACS and VR".
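Because the parameter is registered with mode 0444, it is readable but not writable through sysfs, so it has to be set when the module loads (for example `radeon.lockup_timeout=0` on the kernel command line, or `options radeon lockup_timeout=0` in a modprobe.d file). To confirm which value is in effect, something like the following may help. It is a minimal sketch that assumes the standard sysfs location for module parameters; the helper names `read_lockup_timeout` and `describe` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard sysfs location for a parameter of the "radeon" module.
# Readable because the parameter is registered with mode 0444.
PARAM_PATH = Path("/sys/module/radeon/parameters/lockup_timeout")


def read_lockup_timeout(path: Path = PARAM_PATH) -> Optional[int]:
    """Return the timeout in ms, or None if the file does not exist
    (for example because the radeon module is not loaded)."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


def describe(timeout_ms: Optional[int]) -> str:
    """Turn the raw value into a short human-readable status."""
    if timeout_ms is None:
        return "radeon module not loaded or parameter unavailable"
    if timeout_ms == 0:
        return "lockup timeout disabled"
    return f"lockup timeout is {timeout_ms} ms"


if __name__ == "__main__":
    print(describe(read_lockup_timeout()))
```

A value of 0 corresponds to the "0 = disable" case in the parameter description quoted in this comment.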
Vedran Miletić vedran@miletic.net changed:
What    |Removed                     |Added
----------------------------------------------------------------------------
Summary |Long-running OpenCL kernels |Long-running OpenCL kernels
        |cause ring stalls and GPU   |cause ring stalls and GPU
        |lockups on Kabini           |lockups on Kabini when
        |                            |radeon.lockup_timeout is
        |                            |enabled
--- Comment #2 from Vedran Miletić vedran@miletic.net ---
(In reply to John Bridgman from comment #1)
> If you have not already done so, try disabling the watchdog timer:
> MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default 10000 = 10 seconds, 0 = disable)");
> module_param_named(lockup_timeout, radeon_lockup_timeout, int, 0444);
Yup, that works around the problem.
> As part of HSA/ROC development we dropped the priority of compute work relative to graphics which improved interactivity and *almost* eliminated timeouts without having to disable the timer - when I get back in the office I'll dig up the changes. In the meantime, I think disabling the timer will do what you need although you will still have sluggish graphics while long-running kernels are active.
Eager to hear the details.
Jan Vesely jv356@scarletmail.rutgers.edu changed:
What   |Removed |Added
----------------------------------------------------------------------------
Blocks |        |99553
Referenced Bugs:
https://bugs.freedesktop.org/show_bug.cgi?id=99553
[Bug 99553] Tracker bug for running OpenCL applications on Clover
GitLab Migration User gitlab-migration@fdo.invalid changed:
What       |Removed |Added
----------------------------------------------------------------------------
Resolution |---     |MOVED
Status     |NEW     |RESOLVED
--- Comment #3 from GitLab Migration User gitlab-migration@fdo.invalid ---
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed to further activity.
You can subscribe to and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1246.