After a small fix to error capture code, we now can flush G2H during a GT reset which simplifies code and seals some extreme corner case races.
v2: (CI) - Don't trigger GT reset from G2H handler
Signed-off-by: Matthew Brost matthew.brost@intel.com
Matthew Brost (3): drm/i915: Allocate intel_engine_coredump_alloc with ALLOW_FAIL drm/i915/guc: Add work queue to trigger a GT reset drm/i915/guc: Flush G2H handler during a GT reset
drivers/gpu/drm/i915/gt/uc/intel_guc.h | 5 +++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 41 +++++++++---------- drivers/gpu/drm/i915/i915_gpu_error.c | 2 +- 3 files changed, 26 insertions(+), 22 deletions(-)
Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than GFP_KERNEL do fully decouple the error capture from fence signalling.
Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")
Signed-off-by: Matthew Brost matthew.brost@intel.com --- drivers/gpu/drm/i915/i915_gpu_error.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 67f3515f07e7a..aee42eae4729f 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine, struct i915_request *rq = NULL; unsigned long flags;
- ee = intel_engine_coredump_alloc(engine, GFP_KERNEL); + ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL); if (!ee) return NULL;
On 1/18/2022 13:43, Matthew Brost wrote:
Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than GFP_KERNEL do fully decouple the error capture from fence signalling.
s/do/to/
Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")
Does this really count as a bug fix over that patch? Isn't it more of a changing in policy now that we do DRM fence signalling and that other changes related to error capture behaviour have been implemented.
Signed-off-by: Matthew Brost matthew.brost@intel.com
drivers/gpu/drm/i915/i915_gpu_error.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 67f3515f07e7a..aee42eae4729f 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine, struct i915_request *rq = NULL; unsigned long flags;
- ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
- ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
This still makes me nervous that we will fail to allocate engine captures in stress test scenarios, which are exactly the kind of situations where we need valid error captures.
There is also still a GFP_KERNEL in __i915_error_grow(). Doesn't that need updating as well?
John.
if (!ee) return NULL;
On Tue, Jan 18, 2022 at 05:29:54PM -0800, John Harrison wrote:
On 1/18/2022 13:43, Matthew Brost wrote:
Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than GFP_KERNEL do fully decouple the error capture from fence signalling.
s/do/to/
Yep.
Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")
Does this really count as a bug fix over that patch? Isn't it more of a changing in policy now that we do DRM fence signalling and that other changes related to error capture behaviour have been implemented.
That patch was supposed to allow signalling annotations to be added, without this change I think these annotations would be broken. So I think the Fixes is correct.
Signed-off-by: Matthew Brost matthew.brost@intel.com
drivers/gpu/drm/i915/i915_gpu_error.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 67f3515f07e7a..aee42eae4729f 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine, struct i915_request *rq = NULL; unsigned long flags;
- ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
- ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
This still makes me nervous that we will fail to allocate engine captures in stress test scenarios, which are exactly the kind of situations where we need valid error captures.
Me too, but this whole file has been changed to the ALLOW_FAIL. Thomas and Daniel seem to think this is correct. For what it's worth this allocation is less than a page, so it should be pretty safe to do with ALLOW_FAIL.
There is also still a GFP_KERNEL in __i915_error_grow(). Doesn't that need updating as well?
Probably just should be deleted. If look it tries with ALLOW_FAIL first, then falls back to GFP_KERNEL. I didn't want to make that update in this series yet but that is something to keep an eye on.
Matt
John.
if (!ee) return NULL;
On 1/19/2022 12:47, Matthew Brost wrote:
On Tue, Jan 18, 2022 at 05:29:54PM -0800, John Harrison wrote:
On 1/18/2022 13:43, Matthew Brost wrote:
Allocate intel_engine_coredump_alloc with ALLOW_FAIL rather than GFP_KERNEL do fully decouple the error capture from fence signalling.
s/do/to/
Yep.
Fixes: 8b91cdd4f8649 ("drm/i915: Use __GFP_KSWAPD_RECLAIM in the capture code")
Does this really count as a bug fix over that patch? Isn't it more of a changing in policy now that we do DRM fence signalling and that other changes related to error capture behaviour have been implemented.
That patch was supposed to allow signalling annotations to be added, without this change I think these annotations would be broken. So I think the Fixes is correct.
Signed-off-by: Matthew Brost matthew.brost@intel.com
drivers/gpu/drm/i915/i915_gpu_error.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 67f3515f07e7a..aee42eae4729f 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1516,7 +1516,7 @@ capture_engine(struct intel_engine_cs *engine, struct i915_request *rq = NULL; unsigned long flags;
- ee = intel_engine_coredump_alloc(engine, GFP_KERNEL);
- ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL);
This still makes me nervous that we will fail to allocate engine captures in stress test scenarios, which are exactly the kind of situations where we need valid error captures.
Me too, but this whole file has been changed to the ALLOW_FAIL. Thomas and Daniel seem to think this is correct. For what it's worth this allocation is less than a page, so it should be pretty safe to do with ALLOW_FAIL.
There is also still a GFP_KERNEL in __i915_error_grow(). Doesn't that need updating as well?
Probably just should be deleted. If look it tries with ALLOW_FAIL first, then falls back to GFP_KERNEL. I didn't want to make that update in this series yet but that is something to keep an eye on.
Matt
Okay. Makes sense. With the description typo fixed: Reviewed-by: John Harrison John.C.Harrison@Intel.com
John.
if (!ee) return NULL;
The G2H handler needs to be flushed during a GT reset but a G2H indicating engine reset failure can trigger a GT reset. Add a worker to trigger the GT when a engine reset failure is received to break this circular dependency.
Signed-off-by: Matthew Brost matthew.brost@intel.com --- drivers/gpu/drm/i915/gt/uc/intel_guc.h | 5 ++++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++---- 2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 9d26a86fe557a..60ea8deef5392 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -119,6 +119,11 @@ struct intel_guc { * function as it might be in an atomic context (no sleeping) */ struct work_struct destroyed_worker; + /** + * @reset_worker: worker to trigger a GT reset after an engine + * reset fails + */ + struct work_struct reset_worker; } submission_state;
/** diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 23a40f10d376d..cdd8d691251ff 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc) }
static void destroyed_worker_func(struct work_struct *w); +static void reset_worker_func(struct work_struct *w);
/* * Set up the memory resources to be shared with the GuC (via the GGTT) @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc) INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts); INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func); + INIT_WORK(&guc->submission_state.reset_worker, + reset_worker_func);
guc->submission_state.guc_ids_bitmap = bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL); @@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance) return gt->engine_class[engine_class][instance]; }
+static void reset_worker_func(struct work_struct *w) +{ + struct intel_guc *guc = container_of(w, struct intel_guc, + submission_state.reset_worker); + struct intel_gt *gt = guc_to_gt(guc); + + intel_gt_handle_error(gt, ALL_ENGINES, + I915_ERROR_CAPTURE, + "GuC failed to reset a engine\n"); +} + int intel_guc_engine_failure_process_msg(struct intel_guc *guc, const u32 *msg, u32 len) { @@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc, drm_err(>->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X", guc_class, instance, engine->name, reason);
- intel_gt_handle_error(gt, engine->mask, - I915_ERROR_CAPTURE, - "GuC failed to reset %s (reason=0x%08x)\n", - engine->name, reason); + /* + * A GT reset flushes this worker queue (G2H handler) so we must use + * another worker to trigger a GT reset. + */ + queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
return 0; }
On 1/18/2022 13:43, Matthew Brost wrote:
The G2H handler needs to be flushed during a GT reset but a G2H indicating engine reset failure can trigger a GT reset. Add a worker to trigger the GT when a engine reset failure is received to break this
s/a/an/
circular dependency.
Signed-off-by: Matthew Brost matthew.brost@intel.com
drivers/gpu/drm/i915/gt/uc/intel_guc.h | 5 ++++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++---- 2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 9d26a86fe557a..60ea8deef5392 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -119,6 +119,11 @@ struct intel_guc { * function as it might be in an atomic context (no sleeping) */ struct work_struct destroyed_worker;
/**
* @reset_worker: worker to trigger a GT reset after an engine
* reset fails
*/
struct work_struct reset_worker;
} submission_state;
/**
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 23a40f10d376d..cdd8d691251ff 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc) }
static void destroyed_worker_func(struct work_struct *w); +static void reset_worker_func(struct work_struct *w);
/*
- Set up the memory resources to be shared with the GuC (via the GGTT)
@@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc) INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts); INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
INIT_WORK(&guc->submission_state.reset_worker,
reset_worker_func);
guc->submission_state.guc_ids_bitmap = bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);
@@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance) return gt->engine_class[engine_class][instance]; }
+static void reset_worker_func(struct work_struct *w) +{
- struct intel_guc *guc = container_of(w, struct intel_guc,
submission_state.reset_worker);
- struct intel_gt *gt = guc_to_gt(guc);
- intel_gt_handle_error(gt, ALL_ENGINES,
I915_ERROR_CAPTURE,
"GuC failed to reset a engine\n");
s/a/an/
+}
- int intel_guc_engine_failure_process_msg(struct intel_guc *guc, const u32 *msg, u32 len) {
@@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc, drm_err(>->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X", guc_class, instance, engine->name, reason);
- intel_gt_handle_error(gt, engine->mask,
I915_ERROR_CAPTURE,
"GuC failed to reset %s (reason=0x%08x)\n",
engine->name, reason);
The engine name and reason code are lost from the error capture? I guess we still get it in the drm_err above, though. So probably not an issue. We shouldn't be getting these from end users and any internal CI run is only likely to give us the dmesg, not the error capture anyway! However, still seems like it is work saving engine->mask in the submission_state structure (ORing in, in case there are multiple resets). Clearing it should be safe because once a GT reset has happened, we aren't getting any more G2Hs. And we can't have multiple message handlers running concurrently, right? So no need to protect the OR either.
John.
/*
* A GT reset flushes this worker queue (G2H handler) so we must use
* another worker to trigger a GT reset.
*/
queue_work(system_unbound_wq, &guc->submission_state.reset_worker);
return 0; }
On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
On 1/18/2022 13:43, Matthew Brost wrote:
The G2H handler needs to be flushed during a GT reset but a G2H indicating engine reset failure can trigger a GT reset. Add a worker to trigger the GT when a engine reset failure is received to break this
s/a/an/
Yep.
circular dependency.
Signed-off-by: Matthew Brost matthew.brost@intel.com
drivers/gpu/drm/i915/gt/uc/intel_guc.h | 5 ++++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++---- 2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 9d26a86fe557a..60ea8deef5392 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -119,6 +119,11 @@ struct intel_guc { * function as it might be in an atomic context (no sleeping) */ struct work_struct destroyed_worker;
/**
* @reset_worker: worker to trigger a GT reset after an engine
* reset fails
*/
} submission_state; /**struct work_struct reset_worker;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 23a40f10d376d..cdd8d691251ff 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc) } static void destroyed_worker_func(struct work_struct *w); +static void reset_worker_func(struct work_struct *w); /*
- Set up the memory resources to be shared with the GuC (via the GGTT)
@@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc) INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts); INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
- INIT_WORK(&guc->submission_state.reset_worker,
guc->submission_state.guc_ids_bitmap = bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);reset_worker_func);
@@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance) return gt->engine_class[engine_class][instance]; } +static void reset_worker_func(struct work_struct *w) +{
- struct intel_guc *guc = container_of(w, struct intel_guc,
submission_state.reset_worker);
- struct intel_gt *gt = guc_to_gt(guc);
- intel_gt_handle_error(gt, ALL_ENGINES,
I915_ERROR_CAPTURE,
"GuC failed to reset a engine\n");
s/a/an/
Yep.
+}
- int intel_guc_engine_failure_process_msg(struct intel_guc *guc, const u32 *msg, u32 len) {
@@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc, drm_err(>->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X", guc_class, instance, engine->name, reason);
- intel_gt_handle_error(gt, engine->mask,
I915_ERROR_CAPTURE,
"GuC failed to reset %s (reason=0x%08x)\n",
engine->name, reason);
The engine name and reason code are lost from the error capture? I guess we still get it in the drm_err above, though. So probably not an issue. We shouldn't be getting these from end users and any internal CI run is only likely to give us the dmesg, not the error capture anyway! However, still
That was my reasoning on the msg too.
seems like it is work saving engine->mask in the submission_state structure (ORing in, in case there are multiple resets). Clearing it should be safe because once a GT reset has happened, we aren't getting any more G2Hs. And we can't have multiple message handlers running concurrently, right? So no need to protect the OR either.
I could do that but the engine->mask is really only used for the error capture with GuC submission as any i915 based reset with GuC submission is a GT reset. Going from engine->mask to ALL_ENGINES will just capture all engine state before doing a GT reset which probably isn't a bad thing, right?
I can update the commit message explaining this if that helps.
Matt
John.
- /*
* A GT reset flushes this worker queue (G2H handler) so we must use
* another worker to trigger a GT reset.
*/
- queue_work(system_unbound_wq, &guc->submission_state.reset_worker); return 0; }
On 1/19/2022 12:54, Matthew Brost wrote:
On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
On 1/18/2022 13:43, Matthew Brost wrote:
The G2H handler needs to be flushed during a GT reset but a G2H indicating engine reset failure can trigger a GT reset. Add a worker to trigger the GT when a engine reset failure is received to break this
s/a/an/
Yep.
circular dependency.
Signed-off-by: Matthew Brost matthew.brost@intel.com
drivers/gpu/drm/i915/gt/uc/intel_guc.h | 5 ++++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++---- 2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 9d26a86fe557a..60ea8deef5392 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -119,6 +119,11 @@ struct intel_guc { * function as it might be in an atomic context (no sleeping) */ struct work_struct destroyed_worker;
/**
* @reset_worker: worker to trigger a GT reset after an engine
* reset fails
*/
} submission_state; /**struct work_struct reset_worker;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 23a40f10d376d..cdd8d691251ff 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc) } static void destroyed_worker_func(struct work_struct *w); +static void reset_worker_func(struct work_struct *w); /* * Set up the memory resources to be shared with the GuC (via the GGTT) @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc) INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts); INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
- INIT_WORK(&guc->submission_state.reset_worker,
guc->submission_state.guc_ids_bitmap = bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);reset_worker_func);
@@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance) return gt->engine_class[engine_class][instance]; } +static void reset_worker_func(struct work_struct *w) +{
- struct intel_guc *guc = container_of(w, struct intel_guc,
submission_state.reset_worker);
- struct intel_gt *gt = guc_to_gt(guc);
- intel_gt_handle_error(gt, ALL_ENGINES,
I915_ERROR_CAPTURE,
"GuC failed to reset a engine\n");
s/a/an/
Yep.
+}
- int intel_guc_engine_failure_process_msg(struct intel_guc *guc, const u32 *msg, u32 len) {
@@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc, drm_err(>->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X", guc_class, instance, engine->name, reason);
- intel_gt_handle_error(gt, engine->mask,
I915_ERROR_CAPTURE,
"GuC failed to reset %s (reason=0x%08x)\n",
engine->name, reason);
The engine name and reason code are lost from the error capture? I guess we still get it in the drm_err above, though. So probably not an issue. We shouldn't be getting these from end users and any internal CI run is only likely to give us the dmesg, not the error capture anyway! However, still
That was my reasoning on the msg too.
seems like it is work saving engine->mask in the submission_state structure (ORing in, in case there are multiple resets). Clearing it should be safe because once a GT reset has happened, we aren't getting any more G2Hs. And we can't have multiple message handlers running concurrently, right? So no need to protect the OR either.
I could do that but the engine->mask is really only used for the error capture with GuC submission as any i915 based reset with GuC submission is a GT reset. Going from engine->mask to ALL_ENGINES will just capture all engine state before doing a GT reset which probably isn't a bad thing, right?
I can update the commit message explaining this if that helps.
Except that a failure to reset is notionally a hardware bug. As recently demonstrated, it could be a software bug due to timeouts being broken. But officially, it is something that should never happen. So in the rare case where one does show up, we would want to know as much as possible about the issue. Most especially - which engine it was that failed. And if all we get is a customer bug report with an error capture but no dmesg then we will have no idea which. It just seems wrong to be throwing away potentially important information for no real reason.
John.
Matt
John.
- /*
* A GT reset flushes this worker queue (G2H handler) so we must use
* another worker to trigger a GT reset.
*/
- queue_work(system_unbound_wq, &guc->submission_state.reset_worker); return 0; }
On Wed, Jan 19, 2022 at 01:07:22PM -0800, John Harrison wrote:
On 1/19/2022 12:54, Matthew Brost wrote:
On Tue, Jan 18, 2022 at 05:37:01PM -0800, John Harrison wrote:
On 1/18/2022 13:43, Matthew Brost wrote:
The G2H handler needs to be flushed during a GT reset but a G2H indicating engine reset failure can trigger a GT reset. Add a worker to trigger the GT when a engine reset failure is received to break this
s/a/an/
Yep.
circular dependency.
Signed-off-by: Matthew Brost matthew.brost@intel.com
drivers/gpu/drm/i915/gt/uc/intel_guc.h | 5 ++++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 +++++++++++++++---- 2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 9d26a86fe557a..60ea8deef5392 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -119,6 +119,11 @@ struct intel_guc { * function as it might be in an atomic context (no sleeping) */ struct work_struct destroyed_worker;
/**
* @reset_worker: worker to trigger a GT reset after an engine
* reset fails
*/
} submission_state; /**struct work_struct reset_worker;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 23a40f10d376d..cdd8d691251ff 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1746,6 +1746,7 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc) } static void destroyed_worker_func(struct work_struct *w); +static void reset_worker_func(struct work_struct *w); /* * Set up the memory resources to be shared with the GuC (via the GGTT) @@ -1776,6 +1777,8 @@ int intel_guc_submission_init(struct intel_guc *guc) INIT_LIST_HEAD(&guc->submission_state.destroyed_contexts); INIT_WORK(&guc->submission_state.destroyed_worker, destroyed_worker_func);
- INIT_WORK(&guc->submission_state.reset_worker,
guc->submission_state.guc_ids_bitmap = bitmap_zalloc(NUMBER_MULTI_LRC_GUC_ID(guc), GFP_KERNEL);reset_worker_func);
@@ -4052,6 +4055,17 @@ guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance) return gt->engine_class[engine_class][instance]; } +static void reset_worker_func(struct work_struct *w) +{
- struct intel_guc *guc = container_of(w, struct intel_guc,
submission_state.reset_worker);
- struct intel_gt *gt = guc_to_gt(guc);
- intel_gt_handle_error(gt, ALL_ENGINES,
I915_ERROR_CAPTURE,
"GuC failed to reset a engine\n");
s/a/an/
Yep.
+}
- int intel_guc_engine_failure_process_msg(struct intel_guc *guc, const u32 *msg, u32 len) {
@@ -4083,10 +4097,11 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc, drm_err(>->i915->drm, "GuC engine reset request failed on %d:%d (%s) because 0x%08X", guc_class, instance, engine->name, reason);
- intel_gt_handle_error(gt, engine->mask,
I915_ERROR_CAPTURE,
"GuC failed to reset %s (reason=0x%08x)\n",
engine->name, reason);
The engine name and reason code are lost from the error capture? I guess we still get it in the drm_err above, though. So probably not an issue. We shouldn't be getting these from end users and any internal CI run is only likely to give us the dmesg, not the error capture anyway! However, still
That was my reasoning on the msg too.
seems like it is work saving engine->mask in the submission_state structure (ORing in, in case there are multiple resets). Clearing it should be safe because once a GT reset has happened, we aren't getting any more G2Hs. And we can't have multiple message handlers running concurrently, right? So no need to protect the OR either.
I could do that but the engine->mask is really only used for the error capture with GuC submission as any i915 based reset with GuC submission is a GT reset. Going from engine->mask to ALL_ENGINES will just capture all engine state before doing a GT reset which probably isn't a bad thing, right?
I can update the commit message explaining this if that helps.
Except that a failure to reset is notionally a hardware bug. As recently demonstrated, it could be a software bug due to timeouts being broken. But officially, it is something that should never happen. So in the rare case where one does show up, we would want to know as much as possible about the issue. Most especially - which engine it was that failed. And if all we get is a customer bug report with an error capture but no dmesg then we will have no idea which. It just seems wrong to be throwing away potentially important information for no real reason.
Ok, will add a engine->mask that gets OR'd on every engine reset failure and cleared on every GT reset in the worker. Probably to be really safe I should protect this field by the submission state lock too.
Matt
John.
Matt
John.
- /*
* A GT reset flushes this worker queue (G2H handler) so we must use
* another worker to trigger a GT reset.
*/
- queue_work(system_unbound_wq, &guc->submission_state.reset_worker); return 0; }
Now that the error capture is fully decoupled from fence signalling (request retirement to free memory, which is turn depends on resets) we can safely flush the G2H handler during a GT reset. This is eliminates corner cases where GuC generated G2H (e.g. engine resets) race with a GT reset.
Signed-off-by: Matthew Brost mattthew.brost@intel.com --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index cdd8d691251ff..1a11e8986948b 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
void intel_guc_submission_reset_prepare(struct intel_guc *guc) { - int i; - if (unlikely(!guc_submission_initialized(guc))) { /* Reset called during driver load? GuC not yet initialised! */ return; @@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
guc_flush_submissions(guc); guc_flush_destroyed_contexts(guc); - - /* - * Handle any outstanding G2Hs before reset. Call IRQ handler directly - * each pass as interrupt have been disabled. We always scrub for - * outstanding G2H as it is possible for outstanding_submission_g2h to - * be incremented after the context state update. - */ - for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) { - intel_guc_to_host_event_handler(guc); -#define wait_for_reset(guc, wait_var) \ - intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20)) - do { - wait_for_reset(guc, &guc->outstanding_submission_g2h); - } while (!list_empty(&guc->ct.requests.incoming)); - } + flush_work(&guc->ct.requests.worker);
scrub_guc_desc_for_outstanding_g2h(guc); }
On 1/18/2022 13:43, Matthew Brost wrote:
Now that the error capture is fully decoupled from fence signalling (request retirement to free memory, which is turn depends on resets) we
s/is/in/
With that fixed: Reviewed-by: John Harrison John.C.Harrison@Intel.com
John.
can safely flush the G2H handler during a GT reset. This is eliminates corner cases where GuC generated G2H (e.g. engine resets) race with a GT reset.
Signed-off-by: Matthew Brost mattthew.brost@intel.com
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index cdd8d691251ff..1a11e8986948b 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1396,8 +1396,6 @@ static void guc_flush_destroyed_contexts(struct intel_guc *guc);
void intel_guc_submission_reset_prepare(struct intel_guc *guc) {
- int i;
- if (unlikely(!guc_submission_initialized(guc))) { /* Reset called during driver load? GuC not yet initialised! */ return;
@@ -1414,21 +1412,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc)
guc_flush_submissions(guc); guc_flush_destroyed_contexts(guc);
- /*
* Handle any outstanding G2Hs before reset. Call IRQ handler directly
* each pass as interrupt have been disabled. We always scrub for
* outstanding G2H as it is possible for outstanding_submission_g2h to
* be incremented after the context state update.
*/
- for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) {
intel_guc_to_host_event_handler(guc);
-#define wait_for_reset(guc, wait_var) \
intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20))
do {
wait_for_reset(guc, &guc->outstanding_submission_g2h);
} while (!list_empty(&guc->ct.requests.incoming));
- }
flush_work(&guc->ct.requests.worker);
scrub_guc_desc_for_outstanding_g2h(guc); }
dri-devel@lists.freedesktop.org