On 12/10/2021 10:58 PM, John.C.Harrison@Intel.com wrote:
From: John Harrison John.C.Harrison@Intel.com
A workaround was added to the driver to allow OpenCL workloads to run 'forever' by disabling pre-emption on the RCS engine for Gen12. It is not totally unbound as the heartbeat will kick in eventually and cause a reset of the hung engine.
However, this does not work well in GuC submission mode. In GuC mode, the pre-emption timeout is how GuC detects hung contexts and triggers a per engine reset. Thus, disabling the timeout means also losing all per engine reset ability. A full GT reset will still occur when the heartbeat finally expires, but that is a much more destructive and undesirable mechanism.
The purpose of the workaround is actually to give OpenCL tasks longer to reach a pre-emption point after a pre-emption request has been issued. This is necessary because Gen12 does not support mid-thread pre-emption and OpenCL can have long running threads.
So, rather than disabling the timeout completely, just set it to a 'long' value. Likewise, bump the heartbeat interval. That gives the OpenCL thread sufficient time to reach a pre-emption point without being killed off either by the GuC or by the heartbeat.
Signed-off-by: John Harrison John.C.Harrison@Intel.com
On the approach and the code:
Reviewed-by: Daniele Ceraolo Spurio daniele.ceraolospurio@intel.com
Please get an ack from the interested parties on the actual numbers used for the timeouts (they look big enough to me, but I'm not familiar with the compute use-case).
Daniele
drivers/gpu/drm/i915/gt/intel_engine_cs.c | 42 +++++++++++++++++++++-- 1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 352254e001b4..26af8d60fe2b 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -382,9 +382,45 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id, engine->props.timeslice_duration_ms = CONFIG_DRM_I915_TIMESLICE_DURATION;
- /* Override to uninterruptible for OpenCL workloads. */
- if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
engine->props.preempt_timeout_ms = 0;
/*
* Mid-thread pre-emption is not available in Gen12. Unfortunately,
* some OpenCL workloads run quite long threads. That means they get
* reset due to not pre-empting in a timely manner.
* The execlist solution was to disable pre-emption completely.
* However, pre-emption timeouts are the way GuC detects hung contexts
* and triggers engine resets. Thus, without pre-emption, there is no
* per engine reset. And full GT reset is much more intrusive. So keep
* the timeout for GuC submission platforms and just bump it to be
* much larger. Also bump the heartbeat timeout to match, otherwise
* the heartbeat can expire before the pre-emption can timeout and
* thus trigger a full GT reset anyway.
*/
if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS) {
if (intel_uc_wants_guc_submission(>->uc)) {
const unsigned long min_preempt = 7500;
const unsigned long min_beat = 5000;
if (engine->props.preempt_timeout_ms &&
engine->props.preempt_timeout_ms < min_preempt) {
drm_info(>->i915->drm, "Bumping pre-emption timeout from %ld to %ld on %s to allow slow compute pre-emption\n",
engine->props.preempt_timeout_ms,
min_preempt, engine->name);
engine->props.preempt_timeout_ms = min_preempt;
}
if (engine->props.heartbeat_interval_ms &&
engine->props.heartbeat_interval_ms < min_beat) {
drm_info(>->i915->drm, "Bumping heartbeat interval from %ld to %ld on %s to allow slow compute pre-emption\n",
engine->props.heartbeat_interval_ms,
min_beat, engine->name);
engine->props.heartbeat_interval_ms = min_beat;
}
} else {
engine->props.preempt_timeout_ms = 0;
}
}
engine->defaults = engine->props; /* never to change again */