A couple of recent commits introduced lockdep warnings, breaking some DG1 BAT tests.
Two fixes for those and one HAX patch making CI behave better.
Kai Vehmanen (1):
  HAX: component: do not leave master devres group open after bind

Thomas Hellström (2):
  drm/i915/gem: Fix a lockdep warning the __i915_gem_is_lmem() function
  drm/i915/ttm: Fix lockdep warning in __i915_gem_free_object()

 drivers/base/component.c                 | 5 +++--
 drivers/gpu/drm/i915/gem/i915_gem_lmem.c | 2 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c  | 4 ++++
 3 files changed, 8 insertions(+), 3 deletions(-)
Somehow we managed to invert the test for i915_gem_object_evictable(), which causes a warning in DG1 BAT, igt@debugfs_test@read_all_entries.
Fix the lock check to only warn if the object *is* indeed evictable and not protected from eviction by fences.
Cc: Matthew Brost matthew.brost@intel.com
Fixes: 91160c839824 ("drm/i915: Take pinning into account in __i915_gem_object_is_lmem")
Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com
---
 drivers/gpu/drm/i915/gem/i915_gem_lmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_lmem.c b/drivers/gpu/drm/i915/gem/i915_gem_lmem.c
index d659239fcbcc..444f8268b9c5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_lmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_lmem.c
@@ -67,7 +67,7 @@ bool __i915_gem_object_is_lmem(struct drm_i915_gem_object *obj)
 #ifdef CONFIG_LOCKDEP
 	GEM_WARN_ON(dma_resv_test_signaled(obj->base.resv, true) &&
-		    !i915_gem_object_evictable(obj));
+		    i915_gem_object_evictable(obj));
 #endif
 	return mr && (mr->type == INTEL_MEMORY_LOCAL ||
 		      mr->type == INTEL_MEMORY_STOLEN_LOCAL);
On Wed, 22 Sept 2021 at 09:38, Thomas Hellström thomas.hellstrom@linux.intel.com wrote:
> Somehow we managed to invert the test for i915_gem_object_evictable(), which causes a warning in DG1 BAT, igt@debugfs_test@read_all_entries.
>
> Fix the lock check to only warn if the object *is* indeed evictable and not protected from eviction by fences.
>
> Cc: Matthew Brost matthew.brost@intel.com
> Fixes: 91160c839824 ("drm/i915: Take pinning into account in __i915_gem_object_is_lmem")
> Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com
Reviewed-by: Matthew Auld matthew.auld@intel.com
In the mman selftest, some tests make ttm_bo_init_reserved() fail, which may trigger a call to i915_ttm_bo_destroy(). However, at this point the gem object refcount is still set to 1, which triggers a lockdep warning in __i915_gem_free_object() and a corresponding failure in DG1 BAT, i915_selftest@live@mman.
Fix this by clearing the gem object refcount if called from that failure path.
Fixes: f9b23c157a78 ("drm/i915: Move __i915_gem_free_object to ttm_bo_destroy")
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com
---
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index b94497989995..b1f561543ff3 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -900,6 +900,10 @@ void i915_ttm_bo_destroy(struct ttm_buffer_object *bo)
 	i915_ttm_backup_free(obj);

+	/* Failure during ttm_bo_init_reserved leaves the refcount set to 1. */
+	if (IS_ENABLED(CONFIG_LOCKDEP) && !obj->ttm.created)
+		refcount_set(&obj->base.refcount.refcount, 0);
+
 	/* This releases all gem object bindings to the backend. */
 	__i915_gem_free_object(obj);
On Wed, 22 Sept 2021 at 09:38, Thomas Hellström thomas.hellstrom@linux.intel.com wrote:
> 	i915_ttm_backup_free(obj);
>
> +	/* Failure during ttm_bo_init_reserved leaves the refcount set to 1. */
> +	if (IS_ENABLED(CONFIG_LOCKDEP) && !obj->ttm.created)
> +		refcount_set(&obj->base.refcount.refcount, 0);
> +
> 	/* This releases all gem object bindings to the backend. */
> 	__i915_gem_free_object(obj);
The __i915_gem_free_object is also nuking stuff like mm.placements, which is still owned by the caller AFAIK, or at least it is until we have successfully initialised the object, so smells like potential double free? Can we easily move that under the ttm.created check? Otherwise maybe we are meant to move the mm.placements handling into the RCU callback?
On 9/22/21 12:55 PM, Matthew Auld wrote:
> The __i915_gem_free_object is also nuking stuff like mm.placements, which is still owned by the caller AFAIK, or at least it is until we have successfully initialised the object, so smells like potential double free? Can we easily move that under the ttm.created check? Otherwise maybe we are meant to move the mm.placements handling into the RCU callback?
Yes, it indeed sounds like a closer look is needed for the error handling here. Perhaps it makes sense to initialize the TTM part and then the GEM part while still having the lock. Meanwhile I'll put it under the ttm.created check.
Thanks,
Thomas
From: Kai Vehmanen kai.vehmanen@linux.intel.com
In the current code, the devres group for the aggregate master is left open after the call to component_master_add_*(). This leads to problems when the master does further managed allocations on its own: when any participating driver calls component_del(), those resources are released immediately.

This came up when investigating a page fault occurring on i915 DRM driver unbind with a 5.15-rc1 kernel. The following sequence occurs:

 i915_pci_remove()
   -> intel_display_driver_unregister()
     -> i915_audio_component_cleanup()
       -> component_del()
         -> component.c:take_down_master()
           -> hdac_component_master_unbind() [via master->ops->unbind()]
           -> devres_release_group(master->parent, NULL)

With older kernels this has not caused issues, but with the audio driver moving to use managed interfaces for more of its allocations, this no longer works. The devres log shows the following:
component_master_add_with_match()
[  126.886032] snd_hda_intel 0000:00:1f.3: DEVRES ADD 00000000323ccdc5 devm_component_match_release (24 bytes)
[  126.886045] snd_hda_intel 0000:00:1f.3: DEVRES ADD 00000000865cdb29 grp< (0 bytes)
[  126.886049] snd_hda_intel 0000:00:1f.3: DEVRES ADD 000000001b480725 grp< (0 bytes)

audio driver completes its PCI probe()
[  126.892238] snd_hda_intel 0000:00:1f.3: DEVRES ADD 000000001b480725 pcim_iomap_release (48 bytes)

component_del() called at DRM/i915 unbind()
[  137.579422] i915 0000:00:02.0: DEVRES REL 00000000ef44c293 grp< (0 bytes)
[  137.579445] snd_hda_intel 0000:00:1f.3: DEVRES REL 00000000865cdb29 grp< (0 bytes)
[  137.579458] snd_hda_intel 0000:00:1f.3: DEVRES REL 000000001b480725 pcim_iomap_release (48 bytes)
So the "devres_release_group(master->parent, NULL)" call ends up freeing the pcim_iomap allocation. Upon the next runtime resume, the audio driver will cause a page fault as the iomap allocation was released without the driver knowing about it.

Fix this issue by using the "struct master" pointer as the identifier for the devres group, and by closing the devres group after the master->ops->bind() call is done. This allows devres allocations done by the driver acting as master to be isolated from the binding state of the aggregate driver. This modifies the logic originally introduced in commit 9e1ccb4a7700 ("drivers/base: fix devres handling for master device").
BugLink: https://gitlab.freedesktop.org/drm/intel/-/issues/4136
Signed-off-by: Kai Vehmanen kai.vehmanen@linux.intel.com
Acked-by: Imre Deak imre.deak@intel.com
Acked-by: Russell King (Oracle) rmk+kernel@armlinux.org.uk
---
 drivers/base/component.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/base/component.c b/drivers/base/component.c
index 5e79299f6c3f..870485cbbb87 100644
--- a/drivers/base/component.c
+++ b/drivers/base/component.c
@@ -246,7 +246,7 @@ static int try_to_bring_up_master(struct master *master,
 		return 0;
 	}

-	if (!devres_open_group(master->parent, NULL, GFP_KERNEL))
+	if (!devres_open_group(master->parent, master, GFP_KERNEL))
 		return -ENOMEM;

 	/* Found all components */
@@ -258,6 +258,7 @@ static int try_to_bring_up_master(struct master *master,
 		return ret;
 	}

+	devres_close_group(master->parent, NULL);
 	master->bound = true;
 	return 1;
 }
@@ -282,7 +283,7 @@ static void take_down_master(struct master *master)
 {
 	if (master->bound) {
 		master->ops->unbind(master->parent);
-		devres_release_group(master->parent, NULL);
+		devres_release_group(master->parent, master);
 		master->bound = false;
 	}
 }
dri-devel@lists.freedesktop.org