On Fri, 16 Jul 2021 at 18:39, Jason Ekstrand jason@jlekstrand.net wrote:
On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld matthew.william.auld@gmail.com wrote:
On Fri, 16 Jul 2021 at 16:52, Matthew Auld matthew.william.auld@gmail.com wrote:
On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand jason@jlekstrand.net wrote:
On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld matthew.william.auld@gmail.com wrote:
On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand jason@jlekstrand.net wrote:
Whenever we had a user object (n_placements > 0), we were ignoring obj->mm.region and always putting obj->placements[0] as the requested region. For LMEM+SMEM objects, this was causing them to get shoved into LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say, i915_gem_object_migrate().
i915_ttm_migrate calls i915_ttm_place_from_region() directly with the requested region, so there shouldn't be an issue with migration right? Do you have some more details?
With i915_ttm_migrate directly, no. But, in the last patch in the series, we're trying to migrate LMEM+SMEM buffers into SMEM on attach() and pin it there. This blows up in a very unexpected (IMO) way. The flow goes something like this:
- Client attempts a dma-buf import from another device
- In attach() we call i915_gem_object_migrate() which calls
i915_ttm_migrate() which migrates as requested.
- Once the migration is complete, we call i915_gem_object_pin_pages()
which calls i915_ttm_get_pages() which depends on i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
The mm.pages must be NULL here, otherwise it would just increment the pages_pin_count?
Given that the test is using the ____four_underscores version, it doesn't have that check. However, this executes after we've done the dma-buf import which pinned pages. So we should definitely have pages.
We shouldn't call ____four_underscores() if we might already have pages though. Under non-TTM that would leak the pages, and in TTM we might hit the WARN_ON(mm->pages) in __i915_ttm_get_pages(), if for example nothing was moved. I take it we can't just call pin_pages()? Four scary underscores usually means "don't call this in normal code".
Maybe the problem here is actually that our TTM code isn't respecting obj->mm.pages_pin_count?
I think if the resource is moved, we always nuke the mm.pages after being notified of the move. Also TTM is also not allowed to move pinned buffers.
I guess if we are evicted/swapped, so assuming we are not holding the object lock, and it's not pinned, the future call to get_pages() will see mm.pages = NULL, even though the ttm_resource is still there, and because we prioritise the placements[0], instead of mm.region we end up moving it for no good reason. But in your case you are holding the lock, or it's pinned? Also is this just with the selftest, or something real?
Or at least in the selftest I see ____i915_gem_object_get_pages() which doesn't even consider the mm.pages AFAIK.
The bogus migration is happening as part of the __i915_gem_object_get_pages() (2 __underscores) call in i915_gem_dmabuf_attach (see last patch). That code is attempting to migrate the BO to SMEM and then pin it there using the obvious calls to do so. However, in the pin_pages call, it gets implicitly migrated back to LMEM thanks to i915_ttm_get_pages(). Why is _get_pages() migrating things at all?
Not sure yet, but __two_underscores() checks if i915_gem_object_has_pages() before actually calling into i915_ttm_get_pages(), so the mm.pages would have to be NULL here for some reason, so best guess is something to do with move_notify().
--Jason
In case you can't tell, I really have no clue what I'm doing here. I'm really stumbling around in the dark finding things that make my bug go away. I'm happy for the feedback.
--Jason
Signed-off-by: Jason Ekstrand jason@jlekstrand.net Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c index d30f274c329c7..5985e994d56cf 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj, unsigned int i;
placement->num_placement = 1;
i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
obj->mm.region, requested, flags);
i915_ttm_place_from_region(obj->mm.region, requested, flags); /* Cache this on object? */ placement->num_busy_placement = num_allowed;
-- 2.31.1