i915 selftest hangcheck is causing the i915 driver timeouts, as reported by Intel CI bot:
http://gfx-ci.fi.intel.com/cibuglog-ng/issuefilterassoc/24297?query_key=42a9...
When such test runs, the only output is:
[ 68.811639] i915: Performing live selftests with st_random_seed=0xe138eac7 st_timeout=500 [ 68.811792] i915: Running hangcheck [ 68.811859] i915: Running intel_hangcheck_live_selftests/igt_hang_sanitycheck [ 68.816910] i915 0000:00:02.0: [drm] Cannot find any crtc or sizes [ 68.841597] i915: Running intel_hangcheck_live_selftests/igt_reset_nop [ 69.346347] igt_reset_nop: 80 resets [ 69.362695] i915: Running intel_hangcheck_live_selftests/igt_reset_nop_engine [ 69.863559] igt_reset_nop_engine(rcs0): 709 resets [ 70.364924] igt_reset_nop_engine(bcs0): 903 resets [ 70.866005] igt_reset_nop_engine(vcs0): 659 resets [ 71.367934] igt_reset_nop_engine(vcs1): 549 resets [ 71.869259] igt_reset_nop_engine(vecs0): 553 resets [ 71.882592] i915: Running intel_hangcheck_live_selftests/igt_reset_idle_engine [ 72.383554] rcs0: Completed 16605 idle resets [ 72.884599] bcs0: Completed 18641 idle resets [ 73.385592] vcs0: Completed 17517 idle resets [ 73.886658] vcs1: Completed 15474 idle resets [ 74.387600] vecs0: Completed 17983 idle resets [ 74.387667] i915: Running intel_hangcheck_live_selftests/igt_reset_active_engine [ 74.889017] rcs0: Completed 747 active resets [ 75.174240] intel_engine_reset(bcs0) failed, err:-110 [ 75.174301] bcs0: Completed 525 active resets
After that, the machine just silently hangs.
Bisecting the issue, the patch that introduced the regression is:
7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Reverting it fix the issues, but introduce other problems, as TLB won't be invalidated anymore. So, instead, let's fix the root cause.
It turns that the TLB flush logic ends conflicting with i915 reset, which is called during selftest hangcheck. So, the TLB cache should be serialized, but other TLB fix patches are required for this one to work.
Tested on an Intel NUC5i7RYB with an i7-5557U Broadwell CPU.
Chris Wilson (6): drm/i915/gt: Ignore TLB invalidations on idle engines drm/i915/gt: Invalidate TLB of the OA unit at TLB invalidations drm/i915/gt: Skip TLB invalidations once wedged drm/i915/gt: Only invalidate TLBs exposed to user manipulation drm/i915/gt: Serialize GRDOM access between multiple engine resets drm/i915/gt: Serialize TLB invalidates with GT resets
drivers/gpu/drm/i915/gem/i915_gem_pages.c | 10 +++--- drivers/gpu/drm/i915/gt/intel_gt.c | 43 +++++++++++++++++++---- drivers/gpu/drm/i915/gt/intel_gt_pm.h | 3 ++ drivers/gpu/drm/i915/gt/intel_reset.c | 37 ++++++++++++++----- drivers/gpu/drm/i915/i915_vma.c | 3 +- 5 files changed, 75 insertions(+), 21 deletions(-)
From: Chris Wilson chris.p.wilson@intel.com
As an extension of the current skip TLB invalidations, check if the device is powered down prior to any engine activity,
as, on such cases, all the TLBs were already invalidated, so an explicit TLB invalidation is not needed.
This becomes more significant with GuC, as it can only do so when the connection to the GuC is awake.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org ---
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gem/i915_gem_pages.c | 10 +++++---- drivers/gpu/drm/i915/gt/intel_gt.c | 26 +++++++++++++++++------ drivers/gpu/drm/i915/gt/intel_gt_pm.h | 3 +++ 3 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c index 97c820eee115..6835279943df 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c @@ -6,14 +6,15 @@
#include <drm/drm_cache.h>
+#include "gt/intel_gt.h" +#include "gt/intel_gt_pm.h" + #include "i915_drv.h" #include "i915_gem_object.h" #include "i915_scatterlist.h" #include "i915_gem_lmem.h" #include "i915_gem_mman.h"
-#include "gt/intel_gt.h" - void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj, struct sg_table *pages, unsigned int sg_page_sizes) @@ -217,10 +218,11 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
if (test_and_clear_bit(I915_BO_WAS_BOUND_BIT, &obj->flags)) { struct drm_i915_private *i915 = to_i915(obj->base.dev); + struct intel_gt *gt = to_gt(i915); intel_wakeref_t wakeref;
- with_intel_runtime_pm_if_active(&i915->runtime_pm, wakeref) - intel_gt_invalidate_tlbs(to_gt(i915)); + with_intel_gt_pm_if_awake(gt, wakeref) + intel_gt_invalidate_tlbs(gt); }
return pages; diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index f33290358c51..d5ed6a6ac67c 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -11,6 +11,7 @@
#include "i915_drv.h" #include "intel_context.h" +#include "intel_engine_pm.h" #include "intel_engine_regs.h" #include "intel_gt.h" #include "intel_gt_buffer_pool.h" @@ -1216,6 +1217,7 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) struct drm_i915_private *i915 = gt->i915; struct intel_uncore *uncore = gt->uncore; struct intel_engine_cs *engine; + intel_engine_mask_t awake, tmp; enum intel_engine_id id; const i915_reg_t *regs; unsigned int num = 0; @@ -1239,12 +1241,27 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
GEM_TRACE("\n");
- assert_rpm_wakelock_held(&i915->runtime_pm); - mutex_lock(>->tlb_invalidate_lock); intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);
+ awake = 0; for_each_engine(engine, gt, id) { + struct reg_and_bit rb; + + if (!intel_engine_pm_is_awake(engine)) + continue; + + rb = get_reg_and_bit(engine, regs == gen8_regs, regs, num); + if (!i915_mmio_reg_offset(rb.reg)) + continue; + + intel_uncore_write_fw(uncore, rb.reg, rb.bit); + awake |= engine->mask; + } + + for_each_engine_masked(engine, gt, awake, tmp) { + struct reg_and_bit rb; + /* * HW architecture suggest typical invalidation time at 40us, * with pessimistic cases up to 100us and a recommendation to @@ -1252,13 +1269,8 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) */ const unsigned int timeout_us = 100; const unsigned int timeout_ms = 4; - struct reg_and_bit rb;
rb = get_reg_and_bit(engine, regs == gen8_regs, regs, num); - if (!i915_mmio_reg_offset(rb.reg)) - continue; - - intel_uncore_write_fw(uncore, rb.reg, rb.bit); if (__intel_wait_for_register_fw(uncore, rb.reg, rb.bit, 0, timeout_us, timeout_ms, diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h index bc898df7a48c..a334787a4939 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h @@ -55,6 +55,9 @@ static inline void intel_gt_pm_might_put(struct intel_gt *gt) for (tmp = 1, intel_gt_pm_get(gt); tmp; \ intel_gt_pm_put(gt), tmp = 0)
+#define with_intel_gt_pm_if_awake(gt, wf) \ + for (wf = intel_gt_pm_get_if_awake(gt); wf; intel_gt_pm_put_async(gt), wf = 0) + static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt) { return intel_wakeref_wait_for_idle(>->wakeref);
On 15/06/2022 16:27, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
As an extension of the current skip TLB invalidations, check if the device is powered down prior to any engine activity,
as, on such cases, all the TLBs were already invalidated, so an explicit TLB invalidation is not needed.
This becomes more significant with GuC, as it can only do so when the connection to the GuC is awake.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Hmmm is this a fix or "an extension" as the commit text mentions both options?! GuC angle does not appear relevant for upstream yet so is cc: stable really required is the question.
Regards,
Tvrtko
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gem/i915_gem_pages.c | 10 +++++---- drivers/gpu/drm/i915/gt/intel_gt.c | 26 +++++++++++++++++------ drivers/gpu/drm/i915/gt/intel_gt_pm.h | 3 +++ 3 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c index 97c820eee115..6835279943df 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c @@ -6,14 +6,15 @@
#include <drm/drm_cache.h>
+#include "gt/intel_gt.h" +#include "gt/intel_gt_pm.h"
- #include "i915_drv.h" #include "i915_gem_object.h" #include "i915_scatterlist.h" #include "i915_gem_lmem.h" #include "i915_gem_mman.h"
-#include "gt/intel_gt.h"
- void __i915_gem_object_set_pages(struct drm_i915_gem_object *obj, struct sg_table *pages, unsigned int sg_page_sizes)
@@ -217,10 +218,11 @@ __i915_gem_object_unset_pages(struct drm_i915_gem_object *obj)
if (test_and_clear_bit(I915_BO_WAS_BOUND_BIT, &obj->flags)) { struct drm_i915_private *i915 = to_i915(obj->base.dev);
intel_wakeref_t wakeref;struct intel_gt *gt = to_gt(i915);
with_intel_runtime_pm_if_active(&i915->runtime_pm, wakeref)
intel_gt_invalidate_tlbs(to_gt(i915));
with_intel_gt_pm_if_awake(gt, wakeref)
intel_gt_invalidate_tlbs(gt);
}
return pages;
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index f33290358c51..d5ed6a6ac67c 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -11,6 +11,7 @@
#include "i915_drv.h" #include "intel_context.h" +#include "intel_engine_pm.h" #include "intel_engine_regs.h" #include "intel_gt.h" #include "intel_gt_buffer_pool.h" @@ -1216,6 +1217,7 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) struct drm_i915_private *i915 = gt->i915; struct intel_uncore *uncore = gt->uncore; struct intel_engine_cs *engine;
- intel_engine_mask_t awake, tmp; enum intel_engine_id id; const i915_reg_t *regs; unsigned int num = 0;
@@ -1239,12 +1241,27 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
GEM_TRACE("\n");
- assert_rpm_wakelock_held(&i915->runtime_pm);
- mutex_lock(>->tlb_invalidate_lock); intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);
- awake = 0; for_each_engine(engine, gt, id) {
struct reg_and_bit rb;
if (!intel_engine_pm_is_awake(engine))
continue;
rb = get_reg_and_bit(engine, regs == gen8_regs, regs, num);
if (!i915_mmio_reg_offset(rb.reg))
continue;
intel_uncore_write_fw(uncore, rb.reg, rb.bit);
awake |= engine->mask;
- }
- for_each_engine_masked(engine, gt, awake, tmp) {
struct reg_and_bit rb;
- /*
- HW architecture suggest typical invalidation time at 40us,
- with pessimistic cases up to 100us and a recommendation to
@@ -1252,13 +1269,8 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) */ const unsigned int timeout_us = 100; const unsigned int timeout_ms = 4;
struct reg_and_bit rb;
rb = get_reg_and_bit(engine, regs == gen8_regs, regs, num);
if (!i915_mmio_reg_offset(rb.reg))
continue;
intel_uncore_write_fw(uncore, rb.reg, rb.bit);
if (__intel_wait_for_register_fw(uncore, rb.reg, rb.bit, 0, timeout_us, timeout_ms,
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h index bc898df7a48c..a334787a4939 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h @@ -55,6 +55,9 @@ static inline void intel_gt_pm_might_put(struct intel_gt *gt) for (tmp = 1, intel_gt_pm_get(gt); tmp; \ intel_gt_pm_put(gt), tmp = 0)
+#define with_intel_gt_pm_if_awake(gt, wf) \
- for (wf = intel_gt_pm_get_if_awake(gt); wf; intel_gt_pm_put_async(gt), wf = 0)
- static inline int intel_gt_pm_wait_for_idle(struct intel_gt *gt) { return intel_wakeref_wait_for_idle(>->wakeref);
Hi Mauro,
On Wed, Jun 15, 2022 at 04:27:35PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
As an extension of the current skip TLB invalidations, check if the device is powered down prior to any engine activity,
as, on such cases, all the TLBs were already invalidated, so an explicit TLB invalidation is not needed.
This becomes more significant with GuC, as it can only do so when the connection to the GuC is awake.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
Reviewed-by: Andi Shyti andi.shyti@linux.intel.com
Andi
From: Chris Wilson chris.p.wilson@intel.com
On gen12 HW, ensure that the TLB of the OA unit is also invalidated as just invalidating the TLB of an engine is not enough.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org ---
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_gt.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index d5ed6a6ac67c..61b7ec5118f9 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -10,6 +10,7 @@ #include "pxp/intel_pxp.h"
#include "i915_drv.h" +#include "i915_perf_oa_regs.h" #include "intel_context.h" #include "intel_engine_pm.h" #include "intel_engine_regs.h" @@ -1259,6 +1260,15 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) awake |= engine->mask; }
+ /* Wa_2207587034:tgl,dg1,rkl,adl-s,adl-p */ + if (awake && + (IS_TIGERLAKE(i915) || + IS_DG1(i915) || + IS_ROCKETLAKE(i915) || + IS_ALDERLAKE_S(i915) || + IS_ALDERLAKE_P(i915))) + intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1); + for_each_engine_masked(engine, gt, awake, tmp) { struct reg_and_bit rb;
On Wed, Jun 15, 2022 at 04:27:36PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
On gen12 HW, ensure that the TLB of the OA unit is also invalidated as just invalidating the TLB of an engine is not enough.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_gt.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index d5ed6a6ac67c..61b7ec5118f9 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -10,6 +10,7 @@ #include "pxp/intel_pxp.h"
#include "i915_drv.h" +#include "i915_perf_oa_regs.h" #include "intel_context.h" #include "intel_engine_pm.h" #include "intel_engine_regs.h" @@ -1259,6 +1260,15 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) awake |= engine->mask; }
- /* Wa_2207587034:tgl,dg1,rkl,adl-s,adl-p */
- if (awake &&
(IS_TIGERLAKE(i915) ||
IS_DG1(i915) ||
IS_ROCKETLAKE(i915) ||
IS_ALDERLAKE_S(i915) ||
IS_ALDERLAKE_P(i915)))
intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
This patch can be dropped since this is being done in i915/i915_perf.c -> gen12_oa_disable and is synchronized with OA use cases.
Regards, Umesh
for_each_engine_masked(engine, gt, awake, tmp) { struct reg_and_bit rb;
-- 2.36.1
Hi Mauro,
On Wed, Jun 15, 2022 at 04:27:36PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
On gen12 HW, ensure that the TLB of the OA unit is also invalidated as just invalidating the TLB of an engine is not enough.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
Reviewed-by: Andi Shyti andi.shyti@linux.intel.com
Thanks, Andi
From: Chris Wilson chris.p.wilson@intel.com
Skip all further TLB invalidations once the device is wedged and had been reset, as, on such cases, it can no longer process instructions on the GPU and the user no longer has access to the TLB's in each engine.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org ---
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_gt.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 61b7ec5118f9..fb4fd5273ca4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -1226,6 +1226,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) if (I915_SELFTEST_ONLY(gt->awake == -ENODEV)) return;
+ if (intel_gt_is_wedged(gt)) + return; + if (GRAPHICS_VER(i915) == 12) { regs = gen12_regs; num = ARRAY_SIZE(gen12_regs);
On 15/06/2022 16:27, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Skip all further TLB invalidations once the device is wedged and had been reset, as, on such cases, it can no longer process instructions on the GPU and the user no longer has access to the TLB's in each engine.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Are there any real problems fixed or it's just a logical thing to do? Not much harm tagging it as fixes in terms of process since it is tiny but, again, wanting a clear picture.
Regards,
Tvrtko
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_gt.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 61b7ec5118f9..fb4fd5273ca4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -1226,6 +1226,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) if (I915_SELFTEST_ONLY(gt->awake == -ENODEV)) return;
- if (intel_gt_is_wedged(gt))
return;
- if (GRAPHICS_VER(i915) == 12) { regs = gen12_regs; num = ARRAY_SIZE(gen12_regs);
Hi Mauro,
On Wed, Jun 15, 2022 at 04:27:37PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Skip all further TLB invalidations once the device is wedged and had been reset, as, on such cases, it can no longer process instructions on the GPU and the user no longer has access to the TLB's in each engine.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_gt.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 61b7ec5118f9..fb4fd5273ca4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -1226,6 +1226,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) if (I915_SELFTEST_ONLY(gt->awake == -ENODEV)) return;
- if (intel_gt_is_wedged(gt))
return;
This looks familiar :)
Reviewed-by: Andi Shyti andi.shyti@linux.intel.com
Thanks, Andi
From: Chris Wilson chris.p.wilson@intel.com
Don't flush TLBs when the buffer is only used in the GGTT under full control of the kernel, as there's no risk of of concurrent access and stale access from prefetch.
We only need to invalidate the TLB if they are accessible by the user.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org ---
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/i915_vma.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index 0bffb70b3c5f..7989986161e8 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -537,7 +537,8 @@ int i915_vma_bind(struct i915_vma *vma, bind_flags); }
- set_bit(I915_BO_WAS_BOUND_BIT, &vma->obj->flags); + if (bind_flags & I915_VMA_LOCAL_BIND) + set_bit(I915_BO_WAS_BOUND_BIT, &vma->obj->flags);
atomic_or(bind_flags, &vma->flags); return 0;
On 15/06/2022 16:27, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Don't flush TLBs when the buffer is only used in the GGTT under full control of the kernel, as there's no risk of of concurrent access and stale access from prefetch.
We only need to invalidate the TLB if they are accessible by the user.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Same question as against the other patch - fix or optimisation?
Regards,
Tvrtko
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/i915_vma.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index 0bffb70b3c5f..7989986161e8 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -537,7 +537,8 @@ int i915_vma_bind(struct i915_vma *vma, bind_flags); }
- set_bit(I915_BO_WAS_BOUND_BIT, &vma->obj->flags);
if (bind_flags & I915_VMA_LOCAL_BIND)
set_bit(I915_BO_WAS_BOUND_BIT, &vma->obj->flags);
atomic_or(bind_flags, &vma->flags); return 0;
Hi Mauro,
On Wed, Jun 15, 2022 at 04:27:38PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Don't flush TLBs when the buffer is only used in the GGTT under full control of the kernel, as there's no risk of of concurrent access and stale access from prefetch.
We only need to invalidate the TLB if they are accessible by the user.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Fei Yang fei.yang@intel.com Cc: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
Reviewed-by: Andi Shyti andi.shyti@linux.intel.com
Thanks, Andi
From: Chris Wilson chris.p.wilson@intel.com
Don't allow two engines to be reset in parallel, as they would both try to select a reset bit (and send requests to common registers) and wait on that register, at the same time. Serialize control of the reset requests/acks using the uncore->lock, which will also ensure that no other GT state changes at the same time as the actual reset.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Reported-by: Mika Kuoppala mika.kuoppala@linux.intel.com Signed-off-by: Chris Wilson chris@chris-wilson.co.uk Cc: Mika Kuoppala mika.kuoppala@linux.intel.com Cc: Andi Shyti andi.shyti@intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org ---
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_reset.c | 37 ++++++++++++++++++++------- 1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index a5338c3fde7a..c68d36fb5bbd 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -300,9 +300,9 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask) return err; }
-static int gen6_reset_engines(struct intel_gt *gt, - intel_engine_mask_t engine_mask, - unsigned int retry) +static int __gen6_reset_engines(struct intel_gt *gt, + intel_engine_mask_t engine_mask, + unsigned int retry) { struct intel_engine_cs *engine; u32 hw_mask; @@ -321,6 +321,20 @@ static int gen6_reset_engines(struct intel_gt *gt, return gen6_hw_domain_reset(gt, hw_mask); }
+static int gen6_reset_engines(struct intel_gt *gt, + intel_engine_mask_t engine_mask, + unsigned int retry) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(>->uncore->lock, flags); + ret = __gen6_reset_engines(gt, engine_mask, retry); + spin_unlock_irqrestore(>->uncore->lock, flags); + + return ret; +} + static struct intel_engine_cs *find_sfc_paired_vecs_engine(struct intel_engine_cs *engine) { int vecs_id; @@ -487,9 +501,9 @@ static void gen11_unlock_sfc(struct intel_engine_cs *engine) rmw_clear_fw(uncore, sfc_lock.lock_reg, sfc_lock.lock_bit); }
-static int gen11_reset_engines(struct intel_gt *gt, - intel_engine_mask_t engine_mask, - unsigned int retry) +static int __gen11_reset_engines(struct intel_gt *gt, + intel_engine_mask_t engine_mask, + unsigned int retry) { struct intel_engine_cs *engine; intel_engine_mask_t tmp; @@ -583,8 +597,11 @@ static int gen8_reset_engines(struct intel_gt *gt, struct intel_engine_cs *engine; const bool reset_non_ready = retry >= 1; intel_engine_mask_t tmp; + unsigned long flags; int ret;
+ spin_lock_irqsave(>->uncore->lock, flags); + for_each_engine_masked(engine, gt, engine_mask, tmp) { ret = gen8_engine_reset_prepare(engine); if (ret && !reset_non_ready) @@ -612,17 +629,19 @@ static int gen8_reset_engines(struct intel_gt *gt, * This is best effort, so ignore any error from the initial reset. */ if (IS_DG2(gt->i915) && engine_mask == ALL_ENGINES) - gen11_reset_engines(gt, gt->info.engine_mask, 0); + __gen11_reset_engines(gt, gt->info.engine_mask, 0);
if (GRAPHICS_VER(gt->i915) >= 11) - ret = gen11_reset_engines(gt, engine_mask, retry); + ret = __gen11_reset_engines(gt, engine_mask, retry); else - ret = gen6_reset_engines(gt, engine_mask, retry); + ret = __gen6_reset_engines(gt, engine_mask, retry);
skip_reset: for_each_engine_masked(engine, gt, engine_mask, tmp) gen8_engine_reset_cancel(engine);
+ spin_unlock_irqrestore(>->uncore->lock, flags); + return ret; }
On 15/06/2022 16:27, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Don't allow two engines to be reset in parallel, as they would both try to select a reset bit (and send requests to common registers) and wait on that register, at the same time. Serialize control of the reset requests/acks using the uncore->lock, which will also ensure that no other GT state changes at the same time as the actual reset.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Ah okay I get it, the fixes tag was applied indiscriminately to the whole series. :) It definitely does not belong in this patch.
Otherwise LGTM:
Reviewed-by: Tvrtko Ursulin tvrtko.ursulin@intel.com
Regards,
Tvrtko
Reported-by: Mika Kuoppala mika.kuoppala@linux.intel.com Signed-off-by: Chris Wilson chris@chris-wilson.co.uk Cc: Mika Kuoppala mika.kuoppala@linux.intel.com Cc: Andi Shyti andi.shyti@intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_reset.c | 37 ++++++++++++++++++++------- 1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index a5338c3fde7a..c68d36fb5bbd 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -300,9 +300,9 @@ static int gen6_hw_domain_reset(struct intel_gt *gt, u32 hw_domain_mask) return err; }
-static int gen6_reset_engines(struct intel_gt *gt,
intel_engine_mask_t engine_mask,
unsigned int retry)
+static int __gen6_reset_engines(struct intel_gt *gt,
intel_engine_mask_t engine_mask,
{ struct intel_engine_cs *engine; u32 hw_mask;unsigned int retry)
@@ -321,6 +321,20 @@ static int gen6_reset_engines(struct intel_gt *gt, return gen6_hw_domain_reset(gt, hw_mask); }
+static int gen6_reset_engines(struct intel_gt *gt,
intel_engine_mask_t engine_mask,
unsigned int retry)
+{
- unsigned long flags;
- int ret;
- spin_lock_irqsave(>->uncore->lock, flags);
- ret = __gen6_reset_engines(gt, engine_mask, retry);
- spin_unlock_irqrestore(>->uncore->lock, flags);
- return ret;
+}
- static struct intel_engine_cs *find_sfc_paired_vecs_engine(struct intel_engine_cs *engine) { int vecs_id;
@@ -487,9 +501,9 @@ static void gen11_unlock_sfc(struct intel_engine_cs *engine) rmw_clear_fw(uncore, sfc_lock.lock_reg, sfc_lock.lock_bit); }
-static int gen11_reset_engines(struct intel_gt *gt,
intel_engine_mask_t engine_mask,
unsigned int retry)
+static int __gen11_reset_engines(struct intel_gt *gt,
intel_engine_mask_t engine_mask,
{ struct intel_engine_cs *engine; intel_engine_mask_t tmp;unsigned int retry)
@@ -583,8 +597,11 @@ static int gen8_reset_engines(struct intel_gt *gt, struct intel_engine_cs *engine; const bool reset_non_ready = retry >= 1; intel_engine_mask_t tmp;
unsigned long flags; int ret;
spin_lock_irqsave(>->uncore->lock, flags);
for_each_engine_masked(engine, gt, engine_mask, tmp) { ret = gen8_engine_reset_prepare(engine); if (ret && !reset_non_ready)
@@ -612,17 +629,19 @@ static int gen8_reset_engines(struct intel_gt *gt, * This is best effort, so ignore any error from the initial reset. */ if (IS_DG2(gt->i915) && engine_mask == ALL_ENGINES)
gen11_reset_engines(gt, gt->info.engine_mask, 0);
__gen11_reset_engines(gt, gt->info.engine_mask, 0);
if (GRAPHICS_VER(gt->i915) >= 11)
ret = gen11_reset_engines(gt, engine_mask, retry);
elseret = __gen11_reset_engines(gt, engine_mask, retry);
ret = gen6_reset_engines(gt, engine_mask, retry);
ret = __gen6_reset_engines(gt, engine_mask, retry);
skip_reset: for_each_engine_masked(engine, gt, engine_mask, tmp) gen8_engine_reset_cancel(engine);
spin_unlock_irqrestore(>->uncore->lock, flags);
return ret; }
Hi Mauro,
On Wed, Jun 15, 2022 at 04:27:39PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Don't allow two engines to be reset in parallel, as they would both try to select a reset bit (and send requests to common registers) and wait on that register, at the same time. Serialize control of the reset requests/acks using the uncore->lock, which will also ensure that no other GT state changes at the same time as the actual reset.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Reported-by: Mika Kuoppala mika.kuoppala@linux.intel.com Signed-off-by: Chris Wilson chris@chris-wilson.co.uk Cc: Mika Kuoppala mika.kuoppala@linux.intel.com Cc: Andi Shyti andi.shyti@intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
Reviewed-by: Andi Shyti andi.shyti@linux.intel.com
Thanks, Andi
On 23/06/2022 12:17, Andi Shyti wrote:
Hi Mauro,
On Wed, Jun 15, 2022 at 04:27:39PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Don't allow two engines to be reset in parallel, as they would both try to select a reset bit (and send requests to common registers) and wait on that register, at the same time. Serialize control of the reset requests/acks using the uncore->lock, which will also ensure that no other GT state changes at the same time as the actual reset.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Reported-by: Mika Kuoppala mika.kuoppala@linux.intel.com Signed-off-by: Chris Wilson chris@chris-wilson.co.uk Cc: Mika Kuoppala mika.kuoppala@linux.intel.com Cc: Andi Shyti andi.shyti@intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
Reviewed-by: Andi Shyti andi.shyti@linux.intel.com
Notice I had a bunch of questions and asks in this series so please do not merge until those are addressed.
In this particular patch (and some others) for instance Fixes: tag, at least against that sha, shouldn't be there.
Regards,
Tvrtko
From: Chris Wilson chris.p.wilson@intel.com
Avoid trying to invalidate the TLB in the middle of performing an engine reset, as this may result in the reset timing out. Currently, the TLB invalidate is only serialised by its own mutex, forgoing the uncore lock, but we can take the uncore->lock as well to serialise the mmio access, thereby serialising with the GDRST.
Tested on a NUC5i7RYB, BIOS RYBDWi35.86A.0380.2019.0517.1530 with i915 selftest/hangcheck.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Reported-by: Mauro Carvalho Chehab mchehab@kernel.org Tested-by: Mauro Carvalho Chehab mchehab@kernel.org Reviewed-by: Mauro Carvalho Chehab mchehab@kernel.org Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Tvrtko Ursulin tvrtko.ursulin@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org ---
See [PATCH 0/6] at: https://lore.kernel.org/all/cover.1655306128.git.mchehab@kernel.org/
drivers/gpu/drm/i915/gt/intel_gt.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index fb4fd5273ca4..33eb93586858 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -1248,6 +1248,8 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) mutex_lock(>->tlb_invalidate_lock); intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL);
+ spin_lock_irq(&uncore->lock); /* seralise invalidate with GT reset */ + awake = 0; for_each_engine(engine, gt, id) { struct reg_and_bit rb; @@ -1272,6 +1274,8 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt) IS_ALDERLAKE_P(i915))) intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
+ spin_unlock_irq(&uncore->lock); + for_each_engine_masked(engine, gt, awake, tmp) { struct reg_and_bit rb;
Hi Mauro,
On Wed, Jun 15, 2022 at 04:27:40PM +0100, Mauro Carvalho Chehab wrote:
From: Chris Wilson chris.p.wilson@intel.com
Avoid trying to invalidate the TLB in the middle of performing an engine reset, as this may result in the reset timing out. Currently, the TLB invalidate is only serialised by its own mutex, forgoing the uncore lock, but we can take the uncore->lock as well to serialise the mmio access, thereby serialising with the GDRST.
Tested on a NUC5i7RYB, BIOS RYBDWi35.86A.0380.2019.0517.1530 with i915 selftest/hangcheck.
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Reported-by: Mauro Carvalho Chehab mchehab@kernel.org Tested-by: Mauro Carvalho Chehab mchehab@kernel.org Reviewed-by: Mauro Carvalho Chehab mchehab@kernel.org Signed-off-by: Chris Wilson chris.p.wilson@intel.com Cc: Tvrtko Ursulin tvrtko.ursulin@linux.intel.com Cc: stable@vger.kernel.org Acked-by: Thomas Hellström thomas.hellstrom@linux.intel.com Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
Reviewed-by: Andi Shyti andi.shyti@linux.intel.com
Thanks, Andi
dri-devel@lists.freedesktop.org