This series adds a context option to use DMA_RESV_USAGE_BOOKKEEP for userspace submissions, based on Christian's TTM work.
Disabling implicit sync is something we've wanted in radv for a while to resolve some corner cases. A more immediate benefit is that it also avoids a bunch of implicit sync on GPU map/unmap operations, which helps with stutter around sparse maps/unmaps.
I have experimental userspace in radv, but it isn't 100% ready yet. There are still issues with some games that I'm looking at, but in the meantime I'm looking for early feedback on the idea.
Besides debugging, an open question is whether it is worth adding an option to wait on additional explicit syncobjs in the VM map/unmap operations. My current radv code waits on the wait syncobj in userspace on a thread before doing the operation, which leads to some corner cases because we can't provide a binary syncobj at submission time (impacting the usual sync file exports). However, adding these fences risks head-of-line blocking: all VM operations get executed on the same ring, so all later operations get blocked by waiting on the fences as well.
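For context, the current userspace workaround looks roughly like the sketch below. Only drmSyncobjWait() is the real libdrm call; struct radv_sparse_bind and radv_queue_sparse_bind() are made-up placeholders for the actual radv code:

/* Sketch of the workaround described above: block on the binary wait
 * syncobj on a worker thread, then do the GPU VA map/unmap. */
#include <stdint.h>
#include <xf86drm.h>

struct radv_sparse_bind;
void radv_queue_sparse_bind(int drm_fd, struct radv_sparse_bind *bind);

static void sparse_bind_worker(int drm_fd, uint32_t wait_syncobj,
                               struct radv_sparse_bind *bind)
{
        /* Wait for the binary syncobj to signal, with no timeout. */
        drmSyncobjWait(drm_fd, &wait_syncobj, 1, INT64_MAX,
                       DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);

        /* Only now submit the map/unmap operation. */
        radv_queue_sparse_bind(drm_fd, bind);
}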
I'm looking to get more implementation experience with different games to see if we need this; if we do, it would be a somewhat separate addition to the UAPI.
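To make the userspace side concrete, the per-context switch would be flipped through the existing amdgpu context ioctl, roughly as sketched below. DRM_IOCTL_AMDGPU_CTX and union drm_amdgpu_ctx already exist; AMDGPU_CTX_OP_SET_IMPLICIT_SYNC and its value are placeholders for whatever op this series ends up defining:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <amdgpu_drm.h>

#define AMDGPU_CTX_OP_SET_IMPLICIT_SYNC 5 /* placeholder value */

static int amdgpu_ctx_set_implicit_sync(int drm_fd, uint32_t ctx_id, bool on)
{
        union drm_amdgpu_ctx args;

        memset(&args, 0, sizeof(args));
        args.in.op = AMDGPU_CTX_OP_SET_IMPLICIT_SYNC;
        args.in.ctx_id = ctx_id;
        args.in.flags = on ? 1 : 0;

        return ioctl(drm_fd, DRM_IOCTL_AMDGPU_CTX, &args);
}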
Bas Nieuwenhuizen (5):
  drm/ttm: Refactor num_shared into usage.
  drm/amdgpu: Add separate mode for syncing DMA_RESV_USAGE_BOOKKEEP.
  drm/amdgpu: Allow explicit sync for VM ops.
  drm/amdgpu: Refactor amdgpu_vm_get_pd_bo.
  drm/amdgpu: Add option to disable implicit sync for a context.
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 21 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 19 ++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c       |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 32 +++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h       |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       | 10 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c    | 11 ++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c      | 11 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h      |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
 drivers/gpu/drm/qxl/qxl_release.c             |  2 +-
 drivers/gpu/drm/radeon/radeon_cs.c            |  5 +--
 drivers/gpu/drm/radeon/radeon_gem.c           |  2 +-
 drivers/gpu/drm/radeon/radeon_vm.c            |  4 +--
 drivers/gpu/drm/ttm/ttm_execbuf_util.c        |  5 ++-
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c      | 10 +++---
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c    |  2 +-
 include/drm/ttm/ttm_execbuf_util.h            |  3 +-
 include/uapi/drm/amdgpu_drm.h                 |  3 ++
 28 files changed, 112 insertions(+), 63 deletions(-)
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 10 +++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  8 +++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c           |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c           |  6 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c          |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  3 +--
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c              |  2 +-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
 drivers/gpu/drm/qxl/qxl_release.c                 |  2 +-
 drivers/gpu/drm/radeon/radeon_cs.c                |  5 +++--
 drivers/gpu/drm/radeon/radeon_gem.c               |  2 +-
 drivers/gpu/drm/radeon/radeon_vm.c                |  4 ++--
 drivers/gpu/drm/ttm/ttm_execbuf_util.c            |  5 ++---
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c          | 10 +++++-----
 drivers/gpu/drm/vmwgfx/vmwgfx_validation.c        |  2 +-
 include/drm/ttm/ttm_execbuf_util.h                |  3 ++-
 16 files changed, 33 insertions(+), 35 deletions(-)
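As an aside on where this is heading (an illustration, not part of the diff below): once ttm_validate_buffer carries a dma_resv usage, a later patch can choose the usage per submission instead of a fence count. With a hypothetical wants_implicit_sync flag the CS path would look roughly like:

        /* Sketch only: choose how the CS fence is seen by implicit sync.
         * "wants_implicit_sync" is a made-up per-context flag. */
        amdgpu_bo_list_for_each_entry(e, p->bo_list) {
                if (wants_implicit_sync)
                        e->tv.usage = DMA_RESV_USAGE_WRITE;    /* implicit sync writer */
                else
                        e->tv.usage = DMA_RESV_USAGE_BOOKKEEP; /* explicit sync only */
        }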
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a4955ef76cfc..a790a089e829 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem,
 	struct amdgpu_bo *bo = mem->bo;
 
 	INIT_LIST_HEAD(&entry->head);
-	entry->num_shared = 1;
+	entry->usage = DMA_RESV_USAGE_READ;
 	entry->bo = &bo->tbo;
 	mutex_lock(&process_info->lock);
 	if (userptr)
@@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
 
 	ctx->kfd_bo.priority = 0;
 	ctx->kfd_bo.tv.bo = &bo->tbo;
-	ctx->kfd_bo.tv.num_shared = 1;
+	ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&ctx->kfd_bo.tv.head, &ctx->list);
 
 	amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
@@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
 
 	ctx->kfd_bo.priority = 0;
 	ctx->kfd_bo.tv.bo = &bo->tbo;
-	ctx->kfd_bo.tv.num_shared = 1;
+	ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&ctx->kfd_bo.tv.head, &ctx->list);
 
 	i = 0;
@@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
 			    validate_list.head) {
 		list_add_tail(&mem->resv_list.head, &resv_list);
 		mem->resv_list.bo = mem->validate_list.bo;
-		mem->resv_list.num_shared = mem->validate_list.num_shared;
+		mem->resv_list.usage = mem->validate_list.usage;
 	}
 
 	/* Reserve all BOs and page tables for validation */
@@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
 
 		list_add_tail(&mem->resv_list.head, &ctx.list);
 		mem->resv_list.bo = mem->validate_list.bo;
-		mem->resv_list.num_shared = mem->validate_list.num_shared;
+		mem->resv_list.usage = mem->validate_list.usage;
 	}
 
 	ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 60ca14afb879..2ae1c0d9d33a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
 	bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj));
 	p->uf_entry.priority = 0;
 	p->uf_entry.tv.bo = &bo->tbo;
-	/* One for TTM and two for the CS job */
-	p->uf_entry.tv.num_shared = 3;
+	p->uf_entry.tv.usage = DMA_RESV_USAGE_READ;
 
 	drm_gem_object_put(gobj);
 
@@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 			return r;
 	}
 
-	/* One for TTM and one for the CS job */
 	amdgpu_bo_list_for_each_entry(e, p->bo_list)
-		e->tv.num_shared = 2;
+		e->tv.usage = DMA_RESV_USAGE_READ;
 
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
 
@@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 
 	/* Make sure all BOs are remembered as writers */
 	amdgpu_bo_list_for_each_entry(e, p->bo_list)
-		e->tv.num_shared = 0;
+		e->tv.usage = DMA_RESV_USAGE_WRITE;
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
 	mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index c6d4d41c4393..71277257d94d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	INIT_LIST_HEAD(&list);
 	INIT_LIST_HEAD(&csa_tv.head);
 	csa_tv.bo = &bo->tbo;
-	csa_tv.num_shared = 1;
+	csa_tv.usage = DMA_RESV_USAGE_READ;
 
 	list_add(&csa_tv.head, &list);
 	amdgpu_vm_get_pd_bo(vm, &list, &pd);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 84a53758e18e..7483411229f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
 	INIT_LIST_HEAD(&duplicates);
 
 	tv.bo = &bo->tbo;
-	tv.num_shared = 2;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
@@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		abo = gem_to_amdgpu_bo(gobj);
 		tv.bo = &abo->tbo;
 		if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
-			tv.num_shared = 1;
+			tv.usage = DMA_RESV_USAGE_READ;
 		else
-			tv.num_shared = 0;
+			tv.usage = DMA_RESV_USAGE_WRITE;
 		list_add(&tv.head, &list);
 	} else {
 		gobj = NULL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 5224d9a39737..f670d8473993 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane,
 	INIT_LIST_HEAD(&list);
 
 	tv.bo = &rbo->tbo;
-	tv.num_shared = 1;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 15184153e2b9..515be19ab279 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm,
 {
 	entry->priority = 0;
 	entry->tv.bo = &vm->root.bo->tbo;
-	/* Two for VM updates, one for TTM and one for the CS job */
-	entry->tv.num_shared = 4;
+	entry->tv.usage = DMA_RESV_USAGE_READ;
 	entry->user_pages = NULL;
 	list_add(&entry->tv.head, validated);
 }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index b3fc3e958227..af844b636778 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx)
 		vm = drm_priv_to_vm(pdd->drm_priv);
 
 		ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
-		ctx->tv[gpuidx].num_shared = 4;
+		ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ;
 		list_add(&ctx->tv[gpuidx].head, &ctx->validate_list);
 	}
 
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 73423b805b54..851b7844b084 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane,
 	INIT_LIST_HEAD(&list);
 
 	tv.bo = &rbo->tbo;
-	tv.num_shared = 1;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 368d26da0d6a..689e35192070 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
 
 	qxl_bo_ref(bo);
 	entry->tv.bo = &bo->tbo;
-	entry->tv.num_shared = 0;
+	entry->tv.usage = DMA_RESV_USAGE_WRITE;
 	list_add_tail(&entry->tv.head, &release->bos);
 	return 0;
 }
diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 446f7bae54c4..30afe0c62dd9 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
 		}
 
 		p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
-		p->relocs[i].tv.num_shared = !r->write_domain;
+		p->relocs[i].tv.usage =
+			r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ;
 
 		radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head,
 				      priority);
@@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
 
 		resv = reloc->robj->tbo.base.resv;
 		r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
-				     reloc->tv.num_shared);
+				     reloc->tv.usage != DMA_RESV_USAGE_WRITE);
 		if (r)
 			return r;
 	}
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 8c01a7f0e027..eae47c709f5d 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev,
 	INIT_LIST_HEAD(&list);
 
 	tv.bo = &bo_va->bo->tbo;
-	tv.num_shared = 1;
+	tv.usage = DMA_RESV_USAGE_READ;
 	list_add(&tv.head, &list);
 
 	vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
index 987cabbf1318..702627b48dae 100644
--- a/drivers/gpu/drm/radeon/radeon_vm.c
+++ b/drivers/gpu/drm/radeon/radeon_vm.c
@@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
 	list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
 	list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
 	list[0].tv.bo = &vm->page_directory->tbo;
-	list[0].tv.num_shared = 1;
+	list[0].tv.usage = DMA_RESV_USAGE_READ;
 	list[0].tiling_flags = 0;
 	list_add(&list[0].tv.head, head);
 
@@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
 		list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
 		list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
 		list[idx].tv.bo = &list[idx].robj->tbo;
-		list[idx].tv.num_shared = 1;
+		list[idx].tv.usage = DMA_RESV_USAGE_READ;
 		list[idx].tiling_flags = 0;
 		list_add(&list[idx++].tv.head, head);
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
index 0eb995d25df1..c39d8e5ac271 100644
--- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c
+++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
@@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket,
 			continue;
 		}
 
-		num_fences = max(entry->num_shared, 1u);
+		num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
 		if (!ret) {
 			ret = dma_resv_reserve_fences(bo->base.resv,
 						      num_fences);
@@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket,
 	list_for_each_entry(entry, list, head) {
 		struct ttm_buffer_object *bo = entry->bo;
 
-		dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
-				   DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
+		dma_resv_add_fence(bo->base.resv, fence, entry->usage);
 		ttm_bo_move_to_lru_tail_unlocked(bo);
 		dma_resv_unlock(bo->base.resv);
 	}
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index c6d02c98a19a..58dfff7d6c76 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref)
 		struct ttm_validate_buffer val_buf;
 
 		val_buf.bo = bo;
-		val_buf.num_shared = 0;
+		val_buf.usage = DMA_RESV_USAGE_WRITE;
 		res->func->unbind(res, false, &val_buf);
 	}
 	res->backup_dirty = false;
@@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket,
 	INIT_LIST_HEAD(&val_list);
 	ttm_bo_get(&res->backup->base);
 	val_buf->bo = &res->backup->base;
-	val_buf->num_shared = 0;
+	val_buf->usage = DMA_RESV_USAGE_WRITE;
 	list_add_tail(&val_buf->head, &val_list);
 	ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL);
 	if (unlikely(ret != 0))
@@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
 	BUG_ON(!func->may_evict);
 
 	val_buf.bo = NULL;
-	val_buf.num_shared = 0;
+	val_buf.usage = DMA_RESV_USAGE_WRITE;
 	ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf);
 	if (unlikely(ret != 0))
 		return ret;
@@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr,
 		return 0;
 
 	val_buf.bo = NULL;
-	val_buf.num_shared = 0;
+	val_buf.usage = DMA_RESV_USAGE_WRITE;
 	if (res->backup)
 		val_buf.bo = &res->backup->base;
 	do {
@@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
 {
 	struct ttm_validate_buffer val_buf = {
 		.bo = &vbo->base,
-		.num_shared = 0
+		.usage = DMA_RESV_USAGE_WRITE
 	};
 
 	dma_resv_assert_held(vbo->base.base.resv);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
index f46891012be3..0476ba498321 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
@@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
 	val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
 	if (!val_buf->bo)
 		return -ESRCH;
-	val_buf->num_shared = 0;
+	val_buf->usage = DMA_RESV_USAGE_WRITE;
 	list_add_tail(&val_buf->head, &ctx->bo_list);
 	bo_node->as_mob = as_mob;
 	bo_node->cpu_blit = cpu_blit;
diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h
index a99d7fdf2964..851961a06c27 100644
--- a/include/drm/ttm/ttm_execbuf_util.h
+++ b/include/drm/ttm/ttm_execbuf_util.h
@@ -31,6 +31,7 @@
 #ifndef _TTM_EXECBUF_UTIL_H_
 #define _TTM_EXECBUF_UTIL_H_
 
+#include <linux/dma-resv.h>
 #include <linux/list.h>
 
 #include "ttm_bo_api.h"
@@ -46,7 +47,7 @@ struct ttm_validate_buffer {
 	struct list_head head;
 	struct ttm_buffer_object *bo;
-	unsigned int num_shared;
+	enum dma_resv_usage usage;
 };
 
 /**
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Well first of all NAK to that one since it will totally break cases which need to reserve more than one fence slot.
Also as discussed with Daniel we don't want to use BOOKKEEP for implicit sync. We should instead use READ for that.
BOOKKEEP is for stuff userspace should never be aware of, e.g. like page table updates and KFD eviction fences.
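For reference, these are the usage levels dma-resv defines in include/linux/dma-resv.h; the comments are a rough summary rather than the kerneldoc:

enum dma_resv_usage {
        DMA_RESV_USAGE_KERNEL,   /* kernel memory management work, e.g. copies and clears */
        DMA_RESV_USAGE_WRITE,    /* implicit sync writers */
        DMA_RESV_USAGE_READ,     /* implicit sync readers */
        DMA_RESV_USAGE_BOOKKEEP, /* invisible to implicit sync: page tables, eviction fences, ... */
};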
Regards, Christian.
On Wed, Jun 1, 2022 at 10:02 AM Christian König <christian.koenig@amd.com> wrote:
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Well first of all NAK to that one since it will totally break cases which need to reserve more than one fence slot.
TTM already didn't do that? From ttm_execbuf_util.c:

	num_fences = max(entry->num_shared, 1u);
	num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
Also as discussed with Daniel we don't want to use BOOKKEEP for implicit sync. We should instead use READ for that.
That is the plan, and it is what we do later in the series: use BOOKKEEP for submissions that don't want to participate in implicit sync.
This refactor sets everything to READ or WRITE based on the previous num_shared value, to make sure this patch by itself is not a functional change.
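Spelling that mapping out (this just restates what the diff does, it is not new behaviour):

/* Old semantics: num_shared == 0 meant "add the fence as a writer",
 * anything else meant "add it as a reader". */
static enum dma_resv_usage usage_from_num_shared(unsigned int num_shared)
{
        return num_shared ? DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE;
}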
BOOKKEEP is for stuff userspace should never be aware of, e.g. like page table updates and KFD eviction fences.
Regards, Christian.
On 01.06.22 at 10:11, Bas Nieuwenhuizen wrote:
On Wed, Jun 1, 2022 at 10:02 AM Christian König <christian.koenig@amd.com> wrote:
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Well first of all NAK to that one since it will totally break cases which need to reserve more than one fence slot.
TTM already didn't do that? From ttm_execbuf_util.c:

	num_fences = max(entry->num_shared, 1u);
	num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
That's doing a max(entry->num_shared, 1u). In other words, even when the driver requested to reserve no fence we still reserve at least one.
But if the driver requested to reserve more than one then we do reserve more than one. That's rather important because both radeon and amdgpu need that for their VM updates.
This patch here completely breaks that.
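To spell out the consequence (a sketch under the assumption that the capped reservation above went in as-is): a driver that adds more than one fence per BO, like amdgpu does on the page directory for its VM updates plus the CS fence, would have to reserve the extra slots itself before each dma_resv_add_fence(). dma_resv_reserve_fences() is the existing API for that; the helper below is hypothetical:

static int reserve_extra_fences(struct ttm_buffer_object *bo, unsigned int num)
{
        dma_resv_assert_held(bo->base.resv);

        /* Make room for @num additional fences on this BO's reservation. */
        return dma_resv_reserve_fences(bo->base.resv, num);
}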
There is already a drm_exec patch set from me on the dri-devel mailing list which untangles all of this and deprecates the whole ttm_execbuf_util handling.
Regards, Christian.
Also as discussed with Daniel we don't want to use BOOKKEEP for implicit sync. We should instead use READ for that.
That is the plan, and it is what we do later in the series: use BOOKKEEP for submissions that don't want to participate in implicit sync.
This refactor sets everything to READ or WRITE based on the previous num_shared value, to make sure this patch by itself is not a functional change.
BOOKKEEP is for stuff userspace should never be aware of, e.g. like page table updates and KFD eviction fences.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 +++++----- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 8 +++----- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 6 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 +-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +- drivers/gpu/drm/qxl/qxl_release.c | 2 +- drivers/gpu/drm/radeon/radeon_cs.c | 5 +++-- drivers/gpu/drm/radeon/radeon_gem.c | 2 +- drivers/gpu/drm/radeon/radeon_vm.c | 4 ++-- drivers/gpu/drm/ttm/ttm_execbuf_util.c | 5 ++--- drivers/gpu/drm/vmwgfx/vmwgfx_resource.c | 10 +++++----- drivers/gpu/drm/vmwgfx/vmwgfx_validation.c | 2 +- include/drm/ttm/ttm_execbuf_util.h | 3 ++- 16 files changed, 33 insertions(+), 35 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index a4955ef76cfc..a790a089e829 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem, struct amdgpu_bo *bo = mem->bo;
INIT_LIST_HEAD(&entry->head);
entry->num_shared = 1;
entry->usage = DMA_RESV_USAGE_READ; entry->bo = &bo->tbo; mutex_lock(&process_info->lock); if (userptr)
@@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
ctx->kfd_bo.priority = 0; ctx->kfd_bo.tv.bo = &bo->tbo;
ctx->kfd_bo.tv.num_shared = 1;
ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ; list_add(&ctx->kfd_bo.tv.head, &ctx->list); amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
@@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
ctx->kfd_bo.priority = 0; ctx->kfd_bo.tv.bo = &bo->tbo;
ctx->kfd_bo.tv.num_shared = 1;
ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ; list_add(&ctx->kfd_bo.tv.head, &ctx->list); i = 0;
@@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info) validate_list.head) { list_add_tail(&mem->resv_list.head, &resv_list); mem->resv_list.bo = mem->validate_list.bo;
mem->resv_list.num_shared = mem->validate_list.num_shared;
mem->resv_list.usage = mem->validate_list.usage; } /* Reserve all BOs and page tables for validation */
@@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
list_add_tail(&mem->resv_list.head, &ctx.list); mem->resv_list.bo = mem->validate_list.bo;
mem->resv_list.num_shared = mem->validate_list.num_shared;
mem->resv_list.usage = mem->validate_list.usage; } ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 60ca14afb879..2ae1c0d9d33a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p, bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj)); p->uf_entry.priority = 0; p->uf_entry.tv.bo = &bo->tbo;
/* One for TTM and two for the CS job */
p->uf_entry.tv.num_shared = 3;
p->uf_entry.tv.usage = DMA_RESV_USAGE_READ; drm_gem_object_put(gobj);
@@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, return r; }
/* One for TTM and one for the CS job */ amdgpu_bo_list_for_each_entry(e, p->bo_list)
e->tv.num_shared = 2;
e->tv.usage = DMA_RESV_USAGE_READ; amdgpu_bo_list_get_list(p->bo_list, &p->validated);
@@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
/* Make sure all BOs are remembered as writers */ amdgpu_bo_list_for_each_entry(e, p->bo_list)
e->tv.num_shared = 0;
e->tv.usage = DMA_RESV_USAGE_WRITE; ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c index c6d4d41c4393..71277257d94d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c @@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm, INIT_LIST_HEAD(&list); INIT_LIST_HEAD(&csa_tv.head); csa_tv.bo = &bo->tbo;
csa_tv.num_shared = 1;
csa_tv.usage = DMA_RESV_USAGE_READ; list_add(&csa_tv.head, &list); amdgpu_vm_get_pd_bo(vm, &list, &pd);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 84a53758e18e..7483411229f4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj, INIT_LIST_HEAD(&duplicates);
tv.bo = &bo->tbo;
tv.num_shared = 2;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
@@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data, abo = gem_to_amdgpu_bo(gobj); tv.bo = &abo->tbo; if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; else
tv.num_shared = 0;
tv.usage = DMA_RESV_USAGE_WRITE; list_add(&tv.head, &list); } else { gobj = NULL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c index 5224d9a39737..f670d8473993 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c @@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane, INIT_LIST_HEAD(&list);
tv.bo = &rbo->tbo;
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 15184153e2b9..515be19ab279 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm, { entry->priority = 0; entry->tv.bo = &vm->root.bo->tbo;
/* Two for VM updates, one for TTM and one for the CS job */
entry->tv.num_shared = 4;
}entry->tv.usage = DMA_RESV_USAGE_READ; entry->user_pages = NULL; list_add(&entry->tv.head, validated);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index b3fc3e958227..af844b636778 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx) vm = drm_priv_to_vm(pdd->drm_priv);
ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
ctx->tv[gpuidx].num_shared = 4;
ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ; list_add(&ctx->tv[gpuidx].head, &ctx->validate_list); }
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 73423b805b54..851b7844b084 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane, INIT_LIST_HEAD(&list);
tv.bo = &rbo->tbo;
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c index 368d26da0d6a..689e35192070 100644 --- a/drivers/gpu/drm/qxl/qxl_release.c +++ b/drivers/gpu/drm/qxl/qxl_release.c @@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
qxl_bo_ref(bo); entry->tv.bo = &bo->tbo;
entry->tv.num_shared = 0;
}entry->tv.usage = DMA_RESV_USAGE_WRITE; list_add_tail(&entry->tv.head, &release->bos); return 0;
diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c index 446f7bae54c4..30afe0c62dd9 100644 --- a/drivers/gpu/drm/radeon/radeon_cs.c +++ b/drivers/gpu/drm/radeon/radeon_cs.c @@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p) }
p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
p->relocs[i].tv.num_shared = !r->write_domain;
p->relocs[i].tv.usage =
r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ; radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head, priority);
@@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
resv = reloc->robj->tbo.base.resv; r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
reloc->tv.num_shared);
reloc->tv.usage != DMA_RESV_USAGE_WRITE); if (r) return r; }
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c index 8c01a7f0e027..eae47c709f5d 100644 --- a/drivers/gpu/drm/radeon/radeon_gem.c +++ b/drivers/gpu/drm/radeon/radeon_gem.c @@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev, INIT_LIST_HEAD(&list);
tv.bo = &bo_va->bo->tbo;
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
index 987cabbf1318..702627b48dae 100644
--- a/drivers/gpu/drm/radeon/radeon_vm.c
+++ b/drivers/gpu/drm/radeon/radeon_vm.c
@@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
         list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
         list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
         list[0].tv.bo = &vm->page_directory->tbo;
-        list[0].tv.num_shared = 1;
+        list[0].tv.usage = DMA_RESV_USAGE_READ;
         list[0].tiling_flags = 0;
         list_add(&list[0].tv.head, head);
 
@@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev,
                 list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM;
                 list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM;
                 list[idx].tv.bo = &list[idx].robj->tbo;
-                list[idx].tv.num_shared = 1;
+                list[idx].tv.usage = DMA_RESV_USAGE_READ;
                 list[idx].tiling_flags = 0;
                 list_add(&list[idx++].tv.head, head);
         }
diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
index 0eb995d25df1..c39d8e5ac271 100644
--- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c
+++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c
@@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket,
                         continue;
                 }
 
-                num_fences = min(entry->num_shared, 1u);
+                num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
                 if (!ret) {
                         ret = dma_resv_reserve_fences(bo->base.resv,
                                                       num_fences);
@@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket,
         list_for_each_entry(entry, list, head) {
                 struct ttm_buffer_object *bo = entry->bo;
 
-                dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
-                                   DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
+                dma_resv_add_fence(bo->base.resv, fence, entry->usage);
                 ttm_bo_move_to_lru_tail_unlocked(bo);
                 dma_resv_unlock(bo->base.resv);
         }
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index c6d02c98a19a..58dfff7d6c76 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref)
                         struct ttm_validate_buffer val_buf;
 
                         val_buf.bo = bo;
-                        val_buf.num_shared = 0;
+                        val_buf.usage = DMA_RESV_USAGE_WRITE;
                         res->func->unbind(res, false, &val_buf);
                 }
                 res->backup_dirty = false;
@@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket,
         INIT_LIST_HEAD(&val_list);
         ttm_bo_get(&res->backup->base);
         val_buf->bo = &res->backup->base;
-        val_buf->num_shared = 0;
+        val_buf->usage = DMA_RESV_USAGE_WRITE;
         list_add_tail(&val_buf->head, &val_list);
         ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL);
         if (unlikely(ret != 0))
@@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket,
         BUG_ON(!func->may_evict);
 
         val_buf.bo = NULL;
-        val_buf.num_shared = 0;
+        val_buf.usage = DMA_RESV_USAGE_WRITE;
         ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf);
         if (unlikely(ret != 0))
                 return ret;
@@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr,
                 return 0;
 
         val_buf.bo = NULL;
-        val_buf.num_shared = 0;
+        val_buf.usage = DMA_RESV_USAGE_WRITE;
         if (res->backup)
                 val_buf.bo = &res->backup->base;
         do {
@@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo)
 {
         struct ttm_validate_buffer val_buf = {
                 .bo = &vbo->base,
-                .num_shared = 0
+                .usage = DMA_RESV_USAGE_WRITE
         };
 
         dma_resv_assert_held(vbo->base.base.resv);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
index f46891012be3..0476ba498321 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c
@@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx,
                 val_buf->bo = ttm_bo_get_unless_zero(&vbo->base);
                 if (!val_buf->bo)
                         return -ESRCH;
-                val_buf->num_shared = 0;
+                val_buf->usage = DMA_RESV_USAGE_WRITE;
                 list_add_tail(&val_buf->head, &ctx->bo_list);
                 bo_node->as_mob = as_mob;
                 bo_node->cpu_blit = cpu_blit;
diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h
index a99d7fdf2964..851961a06c27 100644
--- a/include/drm/ttm/ttm_execbuf_util.h
+++ b/include/drm/ttm/ttm_execbuf_util.h
@@ -31,6 +31,7 @@
 #ifndef _TTM_EXECBUF_UTIL_H_
 #define _TTM_EXECBUF_UTIL_H_
 
+#include <linux/dma-resv.h>
 #include <linux/list.h>
 
 #include "ttm_bo_api.h"
@@ -46,7 +47,7 @@
 struct ttm_validate_buffer {
         struct list_head head;
         struct ttm_buffer_object *bo;
-        unsigned int num_shared;
+        enum dma_resv_usage usage;
 };
 
 /**
On Wed, Jun 1, 2022 at 10:29 AM Christian König christian.koenig@amd.com wrote:
On 01.06.22 at 10:11, Bas Nieuwenhuizen wrote:
On Wed, Jun 1, 2022 at 10:02 AM Christian König christian.koenig@amd.com wrote:
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Well first of all NAK to that one since it will totally break cases which need to reserve more than one fence slot.
TTM already didn't do that? From ttm_execbuf_util.c:

-                num_fences = min(entry->num_shared, 1u);
+                num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
That's doing a min(entry->num_shared, 1u). In other words, even when the driver requested to reserve no fence we still reserve at least one.
That would be the case if it was a max, not a min. However, since it is a min, it only ever resulted in 0 or 1, behavior that we mimic based on DMA_RESV_USAGE_*.
Nowhere else do we actually use the specific number assigned to num_shared.
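To make that concrete, here is a tiny standalone illustration (plain C, demo only, not kernel code) of what the min() on that branch evaluates to for a few num_shared values:

    /* demo only: min(num_shared, 1u) can never exceed 1 */
    #include <stdio.h>

    #define min(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
            unsigned int num_shared[] = { 0u, 1u, 4u };

            for (int i = 0; i < 3; i++)
                    printf("num_shared=%u -> reserved slots=%u\n",
                           num_shared[i], min(num_shared[i], 1u));
            /* prints 0, 1 and 1: a request for 4 slots is clamped to 1 */
            return 0;
    }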
But if the driver requested to reserve more than one then we do reserve more than one. That's rather important because both radeon and amdgpu need that for their VM updates.
This patch here completely breaks that.
There is already a drm_exec patch set from me on the dri-devel mailing list which untangles all of this and deprecates the whole ttm_execbuf_util handling.
I can take a look at your patch, but I believe that in the pre-patch state this is a correct non-functional change.
Regards, Christian.
Also as discussed with Daniel we don't want to use BOOKKEEP for implicit sync. We should instead use READ for that.
That is the plan, and it is what we do later in the series: use BOOKKEEP for submissions that don't want to participate in implicit sync.
This refactor sets everything to READ or WRITE based on the previous num_shared value, to make sure this patch by itself is not a functional change.
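To summarize that conversion rule, a hypothetical helper (illustration only, not part of the patch, and the name is made up) would look like this:

    #include <linux/dma-resv.h>

    /* Sketch of the mapping applied at every call site in the patch:
     * num_shared == 0 used to mean "add the fence as exclusive/write",
     * any non-zero value meant "add it as shared/read". */
    static inline enum dma_resv_usage
    tv_usage_from_num_shared(unsigned int num_shared)
    {
            return num_shared ? DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE;
    }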
BOOKKEEP is for stuff userspace should never be aware of, e.g. page table updates and KFD eviction fences.
Regards, Christian.
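For readers following along, the usage levels being discussed come from enum dma_resv_usage in include/linux/dma-resv.h. A rough sketch of that enum (comments paraphrased, see the header for the authoritative definitions):

    enum dma_resv_usage {
            DMA_RESV_USAGE_KERNEL,   /* kernel memory management, e.g. buffer moves */
            DMA_RESV_USAGE_WRITE,    /* implicit-sync writers */
            DMA_RESV_USAGE_READ,     /* implicit-sync readers */
            DMA_RESV_USAGE_BOOKKEEP, /* not visible to implicit sync: VM updates,
                                      * KFD eviction fences, explicitly synced work */
    };

Queries for a given usage also return all fences of the stricter (lower) levels, which is why BOOKKEEP fences are effectively invisible to implicit sync.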
On 01.06.22 at 10:39, Bas Nieuwenhuizen wrote:
On Wed, Jun 1, 2022 at 10:29 AM Christian König christian.koenig@amd.com wrote:
On 01.06.22 at 10:11, Bas Nieuwenhuizen wrote:
On Wed, Jun 1, 2022 at 10:02 AM Christian König christian.koenig@amd.com wrote:
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Well first of all NAK to that one since it will totally break cases which need to reserve more than one fence slot.
TTM already didn't do that? From ttm_execbuf_util.c:

-                num_fences = min(entry->num_shared, 1u);
+                num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u;
That's doing a min(entry->num_shared, 1u). In other words, even when the driver requested to reserve no fence we still reserve at least one.
That would be the case if it was a max, not a min. However, since it is a min, it only ever resulted in 0 or 1, behavior that we mimic based on DMA_RESV_USAGE_*.
Ah! You are working on a broken branch; that was fixed with:

commit d72dcbe9fce505228dae43bef9da8f2b707d1b3d
Author: Christian König christian.koenig@amd.com
Date:   Mon Apr 11 15:21:59 2022 +0200
drm/ttm: fix logic inversion in ttm_eu_reserve_buffers
That should have been max, not min.
Without that fix your branch can cause rare, hard-to-debug memory corruptions.
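For reference, with that commit applied the reservation in ttm_eu_reserve_buffers() reads:

    /* always reserve at least one fence slot, and as many more as the
     * driver asked for via num_shared */
    num_fences = max(entry->num_shared, 1u);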
Regards, Christian.
On Wed, 1 Jun 2022 at 10:02, Christian König christian.koenig@amd.com wrote:
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Well first of all NAK to that one since it will totally break cases which need to reserve more than one fence slot.
Quick reminder, we talked about this in the past. For many folks (not you) NAK means "fuck off" and not "this won't work for the reasons I just explained". Looks like the conversation is all on a good track in the further replies, just figured I'll drop this again as a reminder :-)
Maybe do an autocomplete in your mail editor which replaces NAK with NAK (note: this means "fuck off" for many folks) so you can decide whether that's really the message you want to send out to start the morning. And in some rare cases I do agree that just dropping a polite "fuck off" is the right thing to make it clear what's up ...
Cheers, Daniel
On 01.06.22 at 10:41, Daniel Vetter wrote:
On Wed, 1 Jun 2022 at 10:02, Christian König christian.koenig@amd.com wrote:
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
So that the driver can set some BOOKKEEP for explicit sync. Maybe some of the existing places would already make sense for that, but I targeted this for no functional changes.
Well first of all NAK to that one since it will totally break cases which need to reserve more than one fence slot.
Quick reminder, we talked about this in the past. For many folks (not you) NAK means "fuck off" and not "this wont work for the reasons I just explained". Looks like the conversation is all on a good track in the further replies, just figured I'll drop this again as a reminder :-)
Yeah, that came to my mind as well.
But I still prefer NAK for what it means in computer science, i.e. "Not AcKnowledged": please restart from scratch.
We do need a clear indicator that the whole approach taken in a patch needs to be dropped and restarted from scratch, and a NAK seems to fit that.
If I wanted to tell somebody to fuck off, I would clearly write that.
Christian.
Maybe do an autocomplete in your mail editor which replaces NAK with NAK (note: this means "fuck off" for many folks) so you can decide whether that's really the message you want to send out to start the morning. And in some rare case I do agree that just dropping a polite "fuck off" is the right thing to make it clear what's up ...
Cheers, Daniel
Also as discussed with Daniel we don't want to use BOOKKEEP for implicit sync. We should instead use READ for that.
BOOKKEEP is for stuff userspace should never be aware of, e.g. like page table updates and KFD eviction fences.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 10 +++++----- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 8 +++----- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 6 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 +-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +- drivers/gpu/drm/qxl/qxl_release.c | 2 +- drivers/gpu/drm/radeon/radeon_cs.c | 5 +++-- drivers/gpu/drm/radeon/radeon_gem.c | 2 +- drivers/gpu/drm/radeon/radeon_vm.c | 4 ++-- drivers/gpu/drm/ttm/ttm_execbuf_util.c | 5 ++--- drivers/gpu/drm/vmwgfx/vmwgfx_resource.c | 10 +++++----- drivers/gpu/drm/vmwgfx/vmwgfx_validation.c | 2 +- include/drm/ttm/ttm_execbuf_util.h | 3 ++- 16 files changed, 33 insertions(+), 35 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index a4955ef76cfc..a790a089e829 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -774,7 +774,7 @@ static void add_kgd_mem_to_kfd_bo_list(struct kgd_mem *mem, struct amdgpu_bo *bo = mem->bo;
INIT_LIST_HEAD(&entry->head);
entry->num_shared = 1;
entry->usage = DMA_RESV_USAGE_READ; entry->bo = &bo->tbo; mutex_lock(&process_info->lock); if (userptr)
@@ -918,7 +918,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem,
ctx->kfd_bo.priority = 0; ctx->kfd_bo.tv.bo = &bo->tbo;
ctx->kfd_bo.tv.num_shared = 1;
ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ; list_add(&ctx->kfd_bo.tv.head, &ctx->list); amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]);
@@ -981,7 +981,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem,
ctx->kfd_bo.priority = 0; ctx->kfd_bo.tv.bo = &bo->tbo;
ctx->kfd_bo.tv.num_shared = 1;
ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ; list_add(&ctx->kfd_bo.tv.head, &ctx->list); i = 0;
@@ -2218,7 +2218,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info) validate_list.head) { list_add_tail(&mem->resv_list.head, &resv_list); mem->resv_list.bo = mem->validate_list.bo;
mem->resv_list.num_shared = mem->validate_list.num_shared;
mem->resv_list.usage = mem->validate_list.usage; } /* Reserve all BOs and page tables for validation */
@@ -2417,7 +2417,7 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef)
list_add_tail(&mem->resv_list.head, &ctx.list); mem->resv_list.bo = mem->validate_list.bo;
mem->resv_list.num_shared = mem->validate_list.num_shared;
mem->resv_list.usage = mem->validate_list.usage; } ret = ttm_eu_reserve_buffers(&ctx.ticket, &ctx.list,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 60ca14afb879..2ae1c0d9d33a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -55,8 +55,7 @@ static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p, bo = amdgpu_bo_ref(gem_to_amdgpu_bo(gobj)); p->uf_entry.priority = 0; p->uf_entry.tv.bo = &bo->tbo;
/* One for TTM and two for the CS job */
p->uf_entry.tv.num_shared = 3;
p->uf_entry.tv.usage = DMA_RESV_USAGE_READ; drm_gem_object_put(gobj);
@@ -519,9 +518,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, return r; }
/* One for TTM and one for the CS job */ amdgpu_bo_list_for_each_entry(e, p->bo_list)
e->tv.num_shared = 2;
e->tv.usage = DMA_RESV_USAGE_READ; amdgpu_bo_list_get_list(p->bo_list, &p->validated);
@@ -1261,7 +1259,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
/* Make sure all BOs are remembered as writers */ amdgpu_bo_list_for_each_entry(e, p->bo_list)
e->tv.num_shared = 0;
e->tv.usage = DMA_RESV_USAGE_WRITE; ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c index c6d4d41c4393..71277257d94d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c @@ -74,7 +74,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm, INIT_LIST_HEAD(&list); INIT_LIST_HEAD(&csa_tv.head); csa_tv.bo = &bo->tbo;
csa_tv.num_shared = 1;
csa_tv.usage = DMA_RESV_USAGE_READ; list_add(&csa_tv.head, &list); amdgpu_vm_get_pd_bo(vm, &list, &pd);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 84a53758e18e..7483411229f4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -207,7 +207,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj, INIT_LIST_HEAD(&duplicates);
tv.bo = &bo->tbo;
tv.num_shared = 2;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); amdgpu_vm_get_pd_bo(vm, &list, &vm_pd);
@@ -731,9 +731,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data, abo = gem_to_amdgpu_bo(gobj); tv.bo = &abo->tbo; if (abo->flags & AMDGPU_GEM_CREATE_VM_ALWAYS_VALID)
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; else
tv.num_shared = 0;
tv.usage = DMA_RESV_USAGE_WRITE; list_add(&tv.head, &list); } else { gobj = NULL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c index 5224d9a39737..f670d8473993 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c @@ -319,7 +319,7 @@ static int amdgpu_vkms_prepare_fb(struct drm_plane *plane, INIT_LIST_HEAD(&list);
tv.bo = &rbo->tbo;
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 15184153e2b9..515be19ab279 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -633,8 +633,7 @@ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm, { entry->priority = 0; entry->tv.bo = &vm->root.bo->tbo;
/* Two for VM updates, one for TTM and one for the CS job */
entry->tv.num_shared = 4;
entry->tv.usage = DMA_RESV_USAGE_READ; entry->user_pages = NULL; list_add(&entry->tv.head, validated); }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index b3fc3e958227..af844b636778 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1395,7 +1395,7 @@ static int svm_range_reserve_bos(struct svm_validate_context *ctx) vm = drm_priv_to_vm(pdd->drm_priv);
ctx->tv[gpuidx].bo = &vm->root.bo->tbo;
ctx->tv[gpuidx].num_shared = 4;
ctx->tv[gpuidx].usage = DMA_RESV_USAGE_READ; list_add(&ctx->tv[gpuidx].head, &ctx->validate_list); }
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 73423b805b54..851b7844b084 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -7601,7 +7601,7 @@ static int dm_plane_helper_prepare_fb(struct drm_plane *plane, INIT_LIST_HEAD(&list);
tv.bo = &rbo->tbo;
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); r = ttm_eu_reserve_buffers(&ticket, &list, false, NULL);
diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c index 368d26da0d6a..689e35192070 100644 --- a/drivers/gpu/drm/qxl/qxl_release.c +++ b/drivers/gpu/drm/qxl/qxl_release.c @@ -183,7 +183,7 @@ int qxl_release_list_add(struct qxl_release *release, struct qxl_bo *bo)
qxl_bo_ref(bo); entry->tv.bo = &bo->tbo;
entry->tv.num_shared = 0;
entry->tv.usage = DMA_RESV_USAGE_WRITE; list_add_tail(&entry->tv.head, &release->bos); return 0; }
diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c index 446f7bae54c4..30afe0c62dd9 100644 --- a/drivers/gpu/drm/radeon/radeon_cs.c +++ b/drivers/gpu/drm/radeon/radeon_cs.c @@ -183,7 +183,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p) }
p->relocs[i].tv.bo = &p->relocs[i].robj->tbo;
p->relocs[i].tv.num_shared = !r->write_domain;
p->relocs[i].tv.usage =
r->write_domain ? DMA_RESV_USAGE_WRITE : DMA_RESV_USAGE_READ; radeon_cs_buckets_add(&buckets, &p->relocs[i].tv.head, priority);
@@ -258,7 +259,7 @@ static int radeon_cs_sync_rings(struct radeon_cs_parser *p)
resv = reloc->robj->tbo.base.resv; r = radeon_sync_resv(p->rdev, &p->ib.sync, resv,
reloc->tv.num_shared);
reloc->tv.usage != DMA_RESV_USAGE_WRITE); if (r) return r; }
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c index 8c01a7f0e027..eae47c709f5d 100644 --- a/drivers/gpu/drm/radeon/radeon_gem.c +++ b/drivers/gpu/drm/radeon/radeon_gem.c @@ -635,7 +635,7 @@ static void radeon_gem_va_update_vm(struct radeon_device *rdev, INIT_LIST_HEAD(&list);
tv.bo = &bo_va->bo->tbo;
tv.num_shared = 1;
tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list); vm_bos = radeon_vm_get_bos(rdev, bo_va->vm, &list);
diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c index 987cabbf1318..702627b48dae 100644 --- a/drivers/gpu/drm/radeon/radeon_vm.c +++ b/drivers/gpu/drm/radeon/radeon_vm.c @@ -143,7 +143,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev, list[0].preferred_domains = RADEON_GEM_DOMAIN_VRAM; list[0].allowed_domains = RADEON_GEM_DOMAIN_VRAM; list[0].tv.bo = &vm->page_directory->tbo;
list[0].tv.num_shared = 1;
list[0].tv.usage = DMA_RESV_USAGE_READ; list[0].tiling_flags = 0; list_add(&list[0].tv.head, head);
@@ -155,7 +155,7 @@ struct radeon_bo_list *radeon_vm_get_bos(struct radeon_device *rdev, list[idx].preferred_domains = RADEON_GEM_DOMAIN_VRAM; list[idx].allowed_domains = RADEON_GEM_DOMAIN_VRAM; list[idx].tv.bo = &list[idx].robj->tbo;
list[idx].tv.num_shared = 1;
list[idx].tv.usage = DMA_RESV_USAGE_READ; list[idx].tiling_flags = 0; list_add(&list[idx++].tv.head, head); }
diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c b/drivers/gpu/drm/ttm/ttm_execbuf_util.c index 0eb995d25df1..c39d8e5ac271 100644 --- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c +++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c @@ -101,7 +101,7 @@ int ttm_eu_reserve_buffers(struct ww_acquire_ctx *ticket, continue; }
num_fences = min(entry->num_shared, 1u);
num_fences = entry->usage <= DMA_RESV_USAGE_WRITE ? 0u : 1u; if (!ret) { ret = dma_resv_reserve_fences(bo->base.resv, num_fences);
@@ -154,8 +154,7 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx *ticket, list_for_each_entry(entry, list, head) { struct ttm_buffer_object *bo = entry->bo;
dma_resv_add_fence(bo->base.resv, fence, entry->num_shared ?
DMA_RESV_USAGE_READ : DMA_RESV_USAGE_WRITE);
dma_resv_add_fence(bo->base.resv, fence, entry->usage); ttm_bo_move_to_lru_tail_unlocked(bo); dma_resv_unlock(bo->base.resv); }
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c index c6d02c98a19a..58dfff7d6c76 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c @@ -130,7 +130,7 @@ static void vmw_resource_release(struct kref *kref) struct ttm_validate_buffer val_buf;
val_buf.bo = bo;
val_buf.num_shared = 0;
val_buf.usage = DMA_RESV_USAGE_WRITE; res->func->unbind(res, false, &val_buf); } res->backup_dirty = false;
@@ -552,7 +552,7 @@ vmw_resource_check_buffer(struct ww_acquire_ctx *ticket, INIT_LIST_HEAD(&val_list); ttm_bo_get(&res->backup->base); val_buf->bo = &res->backup->base;
val_buf->num_shared = 0;
val_buf->usage = DMA_RESV_USAGE_WRITE; list_add_tail(&val_buf->head, &val_list); ret = ttm_eu_reserve_buffers(ticket, &val_list, interruptible, NULL); if (unlikely(ret != 0))
@@ -657,7 +657,7 @@ static int vmw_resource_do_evict(struct ww_acquire_ctx *ticket, BUG_ON(!func->may_evict);
val_buf.bo = NULL;
val_buf.num_shared = 0;
val_buf.usage = DMA_RESV_USAGE_WRITE; ret = vmw_resource_check_buffer(ticket, res, interruptible, &val_buf); if (unlikely(ret != 0)) return ret;
@@ -708,7 +708,7 @@ int vmw_resource_validate(struct vmw_resource *res, bool intr, return 0;
val_buf.bo = NULL;
val_buf.num_shared = 0;
val_buf.usage = DMA_RESV_USAGE_WRITE; if (res->backup) val_buf.bo = &res->backup->base; do {
@@ -777,7 +777,7 @@ void vmw_resource_unbind_list(struct vmw_buffer_object *vbo) { struct ttm_validate_buffer val_buf = { .bo = &vbo->base,
.num_shared = 0
.usage = DMA_RESV_USAGE_WRITE }; dma_resv_assert_held(vbo->base.base.resv);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c index f46891012be3..0476ba498321 100644 --- a/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_validation.c @@ -288,7 +288,7 @@ int vmw_validation_add_bo(struct vmw_validation_context *ctx, val_buf->bo = ttm_bo_get_unless_zero(&vbo->base); if (!val_buf->bo) return -ESRCH;
val_buf->num_shared = 0;
val_buf->usage = DMA_RESV_USAGE_WRITE; list_add_tail(&val_buf->head, &ctx->bo_list); bo_node->as_mob = as_mob; bo_node->cpu_blit = cpu_blit;
diff --git a/include/drm/ttm/ttm_execbuf_util.h b/include/drm/ttm/ttm_execbuf_util.h index a99d7fdf2964..851961a06c27 100644 --- a/include/drm/ttm/ttm_execbuf_util.h +++ b/include/drm/ttm/ttm_execbuf_util.h @@ -31,6 +31,7 @@ #ifndef _TTM_EXECBUF_UTIL_H_ #define _TTM_EXECBUF_UTIL_H_
+#include <linux/dma-resv.h> #include <linux/list.h>
#include "ttm_bo_api.h" @@ -46,7 +47,7 @@ struct ttm_validate_buffer { struct list_head head; struct ttm_buffer_object *bo;
unsigned int num_shared;
enum dma_resv_usage usage;
};
/**
To prep for allowing different sync modes in a follow-up patch.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 11 +++++++---- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 3 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 ++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 10 files changed, 25 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index a790a089e829..92a1b08b3bbc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -1157,7 +1157,7 @@ static int process_sync_pds_resv(struct amdkfd_process_info *process_info, struct amdgpu_bo *pd = peer_vm->root.bo;
ret = amdgpu_sync_resv(NULL, sync, pd->tbo.base.resv, - AMDGPU_SYNC_NE_OWNER, + AMDGPU_SYNC_NE_OWNER, AMDGPU_SYNC_NE_OWNER, AMDGPU_FENCE_OWNER_KFD); if (ret) return ret; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 2ae1c0d9d33a..0318a6d46a41 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -654,7 +654,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p) sync_mode = amdgpu_bo_explicit_sync(bo) ? AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER; r = amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, - &fpriv->vm); + AMDGPU_SYNC_EXPLICIT, &fpriv->vm); if (r) return r; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 91b99eb7dc35..63e6f7b8b522 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -1407,7 +1407,8 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence, * * @adev: amdgpu device pointer * @resv: reservation object to sync to - * @sync_mode: synchronization mode + * @implicit_sync_mode: synchronization mode for usage <= DMA_RESV_USAGE_READ + * @explicit_sync_mode: synchronization mode for usage DMA_RESV_USAGE_BOOKKEEP * @owner: fence owner * @intr: Whether the wait is interruptible * @@ -1417,14 +1418,15 @@ void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence, * 0 on success, errno otherwise. */ int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv, - enum amdgpu_sync_mode sync_mode, void *owner, + enum amdgpu_sync_mode implicit_sync_mode, + enum amdgpu_sync_mode explicit_sync_mode, void *owner, bool intr) { struct amdgpu_sync sync; int r;
amdgpu_sync_create(&sync); - amdgpu_sync_resv(adev, &sync, resv, sync_mode, owner); + amdgpu_sync_resv(adev, &sync, resv, implicit_sync_mode, explicit_sync_mode, owner); r = amdgpu_sync_wait(&sync, intr); amdgpu_sync_free(&sync); return r; @@ -1445,7 +1447,8 @@ int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr) struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
return amdgpu_bo_sync_wait_resv(adev, bo->tbo.base.resv, - AMDGPU_SYNC_NE_OWNER, owner, intr); + AMDGPU_SYNC_NE_OWNER, AMDGPU_SYNC_EXPLICIT, + owner, intr); }
/** diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index 4c9cbdc66995..9540ee1102ad 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -321,7 +321,8 @@ vm_fault_t amdgpu_bo_fault_reserve_notify(struct ttm_buffer_object *bo); void amdgpu_bo_fence(struct amdgpu_bo *bo, struct dma_fence *fence, bool shared); int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv, - enum amdgpu_sync_mode sync_mode, void *owner, + enum amdgpu_sync_mode implicit_sync_mode, + enum amdgpu_sync_mode explicit_sync_mode, void *owner, bool intr); int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr); u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index 11c46b3e4c60..b40cd4eff6a3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -243,14 +243,15 @@ static bool amdgpu_sync_test_fence(struct amdgpu_device *adev, * @adev: amdgpu device * @sync: sync object to add fences from reservation object to * @resv: reservation object with embedded fence - * @mode: how owner affects which fences we sync to + * @implicit_mode: how owner affects which fences with usage <= DMA_RESV_USAGE_READ we sync to + * @explicit_mode: how owner affects which fences with usage DMA_RESV_USAGE_BOOKKEEP we sync to * @owner: owner of the planned job submission * * Sync to the fence */ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync, - struct dma_resv *resv, enum amdgpu_sync_mode mode, - void *owner) + struct dma_resv *resv, enum amdgpu_sync_mode implicit_mode, + enum amdgpu_sync_mode explicit_mode, void *owner) { struct dma_resv_iter cursor; struct dma_fence *f; @@ -263,6 +264,10 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync, dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_BOOKKEEP, f) { dma_fence_chain_for_each(f, f) { struct dma_fence *tmp = dma_fence_chain_contained(f); + enum amdgpu_sync_mode mode = implicit_mode; + + if (dma_resv_iter_usage(&cursor) >= DMA_RESV_USAGE_BOOKKEEP) + mode = explicit_mode;
if (amdgpu_sync_test_fence(adev, mode, owner, tmp)) { r = amdgpu_sync_fence(sync, f); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h index 7c0fe20c470d..f786e30eb0a3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h @@ -50,8 +50,8 @@ void amdgpu_sync_create(struct amdgpu_sync *sync); int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f); int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence); int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync, - struct dma_resv *resv, enum amdgpu_sync_mode mode, - void *owner); + struct dma_resv *resv, enum amdgpu_sync_mode implicit_mode, + enum amdgpu_sync_mode explicit_mode, void *owner); struct dma_fence *amdgpu_sync_peek_fence(struct amdgpu_sync *sync, struct amdgpu_ring *ring); struct dma_fence *amdgpu_sync_get_fence(struct amdgpu_sync *sync); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 48a635864a92..00a749016b6d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -1971,6 +1971,7 @@ static int amdgpu_ttm_prepare_job(struct amdgpu_device *adev, if (resv) { r = amdgpu_sync_resv(adev, &(*job)->sync, resv, AMDGPU_SYNC_ALWAYS, + AMDGPU_SYNC_EXPLICIT, AMDGPU_FENCE_OWNER_UNDEFINED); if (r) { DRM_ERROR("sync failed (%d).\n", r); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c index 6eac649499d3..de08bab400d5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c @@ -1176,7 +1176,7 @@ static int amdgpu_uvd_send_msg(struct amdgpu_ring *ring, struct amdgpu_bo *bo, goto err_free; } else { r = amdgpu_sync_resv(adev, &job->sync, bo->tbo.base.resv, - AMDGPU_SYNC_ALWAYS, + AMDGPU_SYNC_ALWAYS, AMDGPU_SYNC_ALWAYS, AMDGPU_FENCE_OWNER_UNDEFINED); if (r) goto err_free; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index 31913ae86de6..f10332e1c6c0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
- return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, p->vm, true); + return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true); }
/** diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index bdb44cee19d3..63b484dc76c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
- return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, p->vm); + return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm); }
/**
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
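For reference (not part of this patch, helper name made up for illustration): the dma_resv usage levels order as KERNEL < WRITE < READ < BOOKKEEP, and iterating a reservation object at a given level only returns fences at that level or a more important one. A minimal sketch of that filtering, showing why KERNEL move fences are always seen while BOOKKEEP fences from explicitly synced submissions are skipped:

#include <linux/dma-resv.h>

/* Sketch only: count the fences a waiter at DMA_RESV_USAGE_READ would
 * see. KERNEL and WRITE fences (e.g. TTM moves) are included, fences
 * added with DMA_RESV_USAGE_BOOKKEEP are not. The caller must hold the
 * reservation lock since this uses the locked iterator. */
static unsigned int count_implicit_fences(struct dma_resv *resv)
{
        struct dma_resv_iter cursor;
        struct dma_fence *fence;
        unsigned int count = 0;

        dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_READ, fence)
                ++count;

        return count;
}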
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
- return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true); + return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true); }
/** diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
- return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm); + return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm); }
/**
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
Well NAK again. This allows access to freed up memory and is a complete no-go.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
- return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true); }
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
- return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm); }
/**
On Wed, Jun 1, 2022 at 10:03 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
Well NAK again. This allows access to freed up memory and is a complete no-go.
How does this allow access to freed memory? Worst I can see is that the unmap happens earlier if the app/drivers gets the waits wrong, which wouldn't give access after the underlying BO is freed?
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
}
/**
Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
On Wed, Jun 1, 2022 at 10:03 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
Well NAK again. This allows access to freed up memory and is a complete no-go.
How does this allow access to freed memory? Worst I can see is that the unmap happens earlier if the app/drivers gets the waits wrong, which wouldn't give access after the underlying BO is freed?
To free up memory we need to update the PTEs and then flush those out by invalidating the TLB.
On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can only be done while the VMID is idle.
Only gfx9 can reliably invalidate the TLB while it is in use, and even there it comes with quite some performance penalty (a TLB invalidation can take multiple seconds).
Because of this what we do in the kernel driver is to sync to everything when we unmap entries:
if (!(flags & AMDGPU_PTE_VALID))
        sync_mode = AMDGPU_SYNC_EQ_OWNER;
else
        sync_mode = AMDGPU_SYNC_EXPLICIT;
This acts as a barrier for freeing the memory. In other words we intentionally add a bubble which syncs everything.
I have been working for months on a concept for how to do all this without causing the stalls you observe, but so far I haven't come to much of a conclusion.
Regards, Christian.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
}
/**
On Wed, Jun 1, 2022 at 10:40 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
On Wed, Jun 1, 2022 at 10:03 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
Well NAK again. This allows access to freed up memory and is a complete no-go.
How does this allow access to freed memory? Worst I can see is that the unmap happens earlier if the app/drivers gets the waits wrong, which wouldn't give access after the underlying BO is freed?
To free up memory we need to update the PTEs and then flush those out by invalidating the TLB.
On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can only be done while the VMID is idle.
Only gfx9 can reliably invalidate the TLB while it is in use, and even there it comes with quite some performance penalty (a TLB invalidation can take multiple seconds).
Because of this what we do in the kernel driver is to sync to everything when we unmap entries:
if (!(flags & AMDGPU_PTE_VALID))
        sync_mode = AMDGPU_SYNC_EQ_OWNER;
else
        sync_mode = AMDGPU_SYNC_EXPLICIT;
This acts as a barrier for freeing the memory. In other words we intentionally add a bubble which syncs everything.
I have been working for months on a concept for how to do all this without causing the stalls you observe, but so far I haven't come to much of a conclusion.
That might cause an unmap operation too early, but for freeing up the actual backing memory we still wait for all fences on the BO to finish first, no? In that case, since BOOKKEEP fences are still added for explicit sync, that should not be a problem, no?
(If not, that sounds like the obvious fix for making this work?)
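To illustrate the point (a sketch, not code from the series; the helper name is made up): a fence added with DMA_RESV_USAGE_BOOKKEEP is still returned to anyone who waits on the reservation object at BOOKKEEP level, which is what a wait before releasing the backing store would use:

#include <linux/dma-resv.h>
#include <linux/sched.h>

/* Sketch: block for every fence on the object, including BOOKKEEP
 * fences added by explicitly synced submissions. */
static long wait_for_all_users(struct dma_resv *resv)
{
        return dma_resv_wait_timeout(resv, DMA_RESV_USAGE_BOOKKEEP,
                                     false, MAX_SCHEDULE_TIMEOUT);
}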
Regards, Christian.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
}
/**
On Wed, Jun 1, 2022, 10:48 Bas Nieuwenhuizen bas@basnieuwenhuizen.nl wrote:
On Wed, Jun 1, 2022 at 10:40 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
On Wed, Jun 1, 2022 at 10:03 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
Well NAK again. This allows access to freed up memory and is a
complete
no-go.
How does this allow access to freed memory? Worst I can see is that the unmap happens earlier if the app/drivers gets the waits wrong, which wouldn't give access after the underlying BO is freed?
To free up memory we need to update the PTEs and then flush those out by invalidating the TLB.
On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can only be done while the VMID is idle.
Only gfx9 can reliably invalidate the TLB while it is in use, and even there it comes with quite some performance penalty (a TLB invalidation can take multiple seconds).
Because of this what we do in the kernel driver is to sync to everything when we unmap entries:
if (!(flags & AMDGPU_PTE_VALID))
        sync_mode = AMDGPU_SYNC_EQ_OWNER;
else
        sync_mode = AMDGPU_SYNC_EXPLICIT;
This acts as a barrier for freeing the memory. In other words we intentionally add a bubble which syncs everything.
I have been working for months on a concept for how to do all this without causing the stalls you observe, but so far I haven't come to much of a conclusion.
That might cause an unmap operation too early, but for freeing up the actual backing memory we still wait for all fences on the BO to finish first, no? In that case, since BOOKKEEP fences are still added for explicit sync, that should not be a problem, no?
(If not, that sounds like the obvious fix for making this work?)
As an aside this is the same hole/issue as when an app forgets a bo in the bo list on submission.
Regards, Christian.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct
amdgpu_vm_update_params *p,
if (!resv) return 0;
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode,
sync_mode, p->vm, true);
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode,
AMDGPU_SYNC_EXPLICIT, p->vm, true);
}
/** diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct
amdgpu_vm_update_params *p,
if (!resv) return 0;
return amdgpu_sync_resv(p->adev, &p->job->sync, resv,
sync_mode, sync_mode, p->vm);
return amdgpu_sync_resv(p->adev, &p->job->sync, resv,
sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
}
/**
Am 01.06.22 um 10:48 schrieb Bas Nieuwenhuizen:
On Wed, Jun 1, 2022 at 10:40 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
On Wed, Jun 1, 2022 at 10:03 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
Well NAK again. This allows access to freed up memory and is a complete no-go.
How does this allow access to freed memory? Worst I can see is that the unmap happens earlier if the app/drivers gets the waits wrong, which wouldn't give access after the underlying BO is freed?
To free up memory we need to update the PTEs and then flush those out by invalidating the TLB.
On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can only be done while the VMID is idle.
Only gfx9 can reliably invalidate the TLB while it is in use, and even there it comes with quite some performance penalty (a TLB invalidation can take multiple seconds).
Because of this what we do in the kernel driver is to sync to everything when we unmap entries:
if (!(flags & AMDGPU_PTE_VALID))
        sync_mode = AMDGPU_SYNC_EQ_OWNER;
else
        sync_mode = AMDGPU_SYNC_EXPLICIT;
This acts as a barrier for freeing the memory. In other words we intentionally add a bubble which syncs everything.
I have been working for months on a concept for how to do all this without causing the stalls you observe, but so far I haven't come to much of a conclusion.
That might cause an unmap operation too early, but for freeing up the actual backing memory we still wait for all fences on the BO to finish first, no? In that case, since BOOKKEEP fences are still added for explicit sync, that should not be a problem, no?
(If not, that sounds like the obvious fix for making this work?)
The problem is we need to wait on fences *not* added to the buffer object.
E.g. what we currently do here while freeing memory is:
1. Update the PTEs and make that update wait for everything!
2. Add the fence of that update to the freed up BO so that this BO isn't freed before the next CS.
We might be able to fix this by adding the fences to the BO before freeing it manually, but I'm not 100% sure we can actually allocate memory for the fences in that moment.
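A sketch of the current flow in the two steps above, roughly what amdgpu_gem_object_close ends up doing today (locking and error handling trimmed, function name made up, illustrative only):

#include "amdgpu.h"

static void clear_and_fence_example(struct amdgpu_device *adev,
                                    struct amdgpu_vm *vm,
                                    struct amdgpu_bo *bo)
{
        struct dma_fence *fence = NULL;

        /* 1. Update the PTEs; this VM update syncs to everything. */
        if (amdgpu_vm_clear_freed(adev, vm, &fence) || !fence)
                return;

        /* 2. Attach the update fence to the BO so its backing store is
         *    not released before the PTE clear has completed. */
        amdgpu_bo_fence(bo, fence, true);
        dma_fence_put(fence);
}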
Regards, Christian.
Regards, Christian.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
}
/**
On Wed, Jun 1, 2022 at 11:01 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 10:48 schrieb Bas Nieuwenhuizen:
On Wed, Jun 1, 2022 at 10:40 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
On Wed, Jun 1, 2022 at 10:03 AM Christian König christian.koenig@amd.com wrote:
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations.
(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around)
Well NAK again. This allows access to freed up memory and is a complete no-go.
How does this allow access to freed memory? Worst I can see is that the unmap happens earlier if the app/drivers gets the waits wrong, which wouldn't give access after the underlying BO is freed?
To free up memory we need to update the PTEs and then flush those out by invalidating the TLB.
On gfx6, gfx7 and gfx8 and some broken gfx10 hw invalidating the TLB can only be done while the VMID is idle.
Only gfx9 can reliably invalidate the TLB while it is in use, and even there it comes with quite some performance penalty (a TLB invalidation can take multiple seconds).
Because of this what we do in the kernel driver is to sync to everything when we unmap entries:
if (!(flags & AMDGPU_PTE_VALID))
        sync_mode = AMDGPU_SYNC_EQ_OWNER;
else
        sync_mode = AMDGPU_SYNC_EXPLICIT;
This acts as a barrier for freeing the memory. In other words we intentionally add a bubble which syncs everything.
I have been working for months on a concept for how to do all this without causing the stalls you observe, but so far I haven't come to much of a conclusion.
That might cause an unmap operation too early, but for freeing up the actual backing memory we still wait for all fences on the BO to finish first, no? In that case, since BOOKKEEP fences are still added for explicit sync, that should not be a problem, no?
(If not, that sounds like the obvious fix for making this work?)
The problem is we need to wait on fences *not* added to the buffer object.
What fences wouldn't be added to the buffer object that we need here?
E.g. what we currently do here while freeing memory is:
- Update the PTEs and make that update wait for everything!
- Add the fence of that update to the freed up BO so that this BO isn't
freed before the next CS.
We might be able to fix this by adding the fences to the BO before freeing it manually, but I'm not 100% sure we can actually allocate memory for the fences in that moment.
I think we don't need to be able to. We're already adding the unmap fence to the BO in the gem close ioctl, and that has the fallback that if we can't allocate space for the fence in the BO, we wait on the fence manually on the CPU. I think that is a reasonable fallback for this as well?
For the TTM move path amdgpu_copy_buffer will wait on the BO resv, and then following submissions will trigger VM updates that wait on the amdgpu_copy_buffer jobs (and hence, transitively, on the work). AFAICT amdgpu_bo_move does not trigger any VM updates by itself, and amdgpu_bo_move_notify is way after the move (and after the ttm_bo_move_accel_cleanup which would free the old resource), so any VM changes triggered by that would see the TTM copy and sync to it.
I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Regards, Christian.
Regards, Christian.
Regards, Christian.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0;
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
}
/**
Am 03.06.22 um 03:21 schrieb Bas Nieuwenhuizen:
[SNIP]
The problem is we need to wait on fences *not* added to the buffer object.
What fences wouldn't be added to the buffer object that we need here?
Basically all still running submissions from the VM which could potentially access the BO.
That's why we have the AMDGPU_SYNC_EQ_OWNER in amdgpu_vm_update_range().
E.g. what we currently do here while freeing memory is:
- Update the PTEs and make that update wait for everything!
- Add the fence of that update to the freed up BO so that this BO isn't
freed before the next CS.
We might be able to fix this by adding the fences to the BO before freeing it manually, but I'm not 100% sure we can actually allocate memory for the fences in that moment.
I think we don't need to be able to. We're already adding the unmap fence to the BO in the gem close ioctl, and that has the fallback that if we can't allocate space for the fence in the BO, we wait on the fence manually on the CPU. I think that is a reasonable fallback for this as well?
Yes, just blocking might work in an OOM situation as well.
For the TTM move path amdgpu_copy_buffer will wait on the BO resv, and then following submissions will trigger VM updates that wait on the amdgpu_copy_buffer jobs (and hence, transitively, on the work). AFAICT amdgpu_bo_move does not trigger any VM updates by itself, and amdgpu_bo_move_notify is way after the move (and after the ttm_bo_move_accel_cleanup which would free the old resource), so any VM changes triggered by that would see the TTM copy and sync to it.
I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Nope, not even remotely.
What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM combination.
5. When the bo_va structure is destroyed because of closing the handle, move the last clear operation over to the VM as implicit sync.
Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
Regards, Christian.
On Fri, Jun 3, 2022 at 10:11 AM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 03:21 schrieb Bas Nieuwenhuizen:
[SNIP]
The problem is we need to wait on fences *not* added to the buffer object.
What fences wouldn't be added to the buffer object that we need here?
Basically all still running submissions from the VM which could potentially access the BO.
That's why we have the AMDGPU_SYNC_EQ_OWNER in amdgpu_vm_update_range().
E.g. what we currently do here while freeing memory is:
- Update the PTEs and make that update wait for everything!
- Add the fence of that update to the freed up BO so that this BO isn't
freed before the next CS.
We might be able to fix this by adding the fences to the BO before freeing it manually, but I'm not 100% sure we can actually allocate memory for the fences in that moment.
I think we don't need to be able to. We're already adding the unmap fence to the BO in the gem close ioctl, and that has the fallback that if we can't allocate space for the fence in the BO, we wait on the fence manually on the CPU. I think that is a reasonable fallback for this as well?
Yes, just blocking might work in an OOM situation as well.
For the TTM move path amdgpu_copy_buffer will wait on the BO resv, and then following submissions will trigger VM updates that wait on the amdgpu_copy_buffer jobs (and hence, transitively, on the work). AFAICT amdgpu_bo_move does not trigger any VM updates by itself, and amdgpu_bo_move_notify is way after the move (and after the ttm_bo_move_accel_cleanup which would free the old resource), so any VM changes triggered by that would see the TTM copy and sync to it.
I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Nope, not even remotely.
What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM combination.
5. When the bo_va structure is destroyed because of closing the handle, move the last clear operation over to the VM as implicit sync.
Hi Christian, isn't that a different problem though (that we're also trying to solve, but in your series)?
What this patch tries to achieve:
(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) (t+1) a VM operation on a BO/VM accessed by the CS.
to run concurrently. What it *doesn't* try is
(t+0) a VM operation on a BO/VM accessed by the CS. (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
to run concurrently. When you write
Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
isn't that all about the second problem?
Regards, Christian.
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
[SNIP]
I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Nope, not even remotely.
What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM combination.
5. When the bo_va structure is destroyed because of closing the handle, move the last clear operation over to the VM as implicit sync.
Hi Christian, isn't that a different problem though (that we're also trying to solve, but in your series)?
What this patch tries to achieve:
(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) (t+1) a VM operation on a BO/VM accessed by the CS.
to run concurrently. What it *doesn't* try is
(t+0) a VM operation on a BO/VM accessed by the CS. (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
to run concurrently. When you write
Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
isn't that all about the second problem?
No, it's the same.
See what we do in the VM code is to artificially insert a bubble so that all VM clear operations wait for all CS operations and then use the clear fence to indicate when the backing store of the BO can be freed.
When you want to remove this bubble (which is certainly a good idea) you need to first come up with a different approach to handle the clear operations.
Regards, Christian.
Regards, Christian.
On Fri, Jun 3, 2022 at 12:16 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
[SNIP]
I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Nope, not even remotely.
What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM combination.
5. When the bo_va structure is destroyed because of closing the handle, move the last clear operation over to the VM as implicit sync.
Hi Christian, isn't that a different problem though (that we're also trying to solve, but in your series)?
What this patch tries to achieve:
(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) (t+1) a VM operation on a BO/VM accessed by the CS.
to run concurrently. What it *doesn't* try is
(t+0) a VM operation on a BO/VM accessed by the CS. (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
to run concurrently. When you write
Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
isn't that all about the second problem?
No, it's the same.
See what we do in the VM code is to artificially insert a bubble so that all VM clear operations wait for all CS operations and then use the clear fence to indicate when the backing store of the BO can be freed.
Isn't that remediated with something like the code below? At least the gem_close case should be handled with this, and the move case was already handled by the copy operation.
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
        return 0;
 }
 
+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
+{
+       struct dma_resv_iter cursor;
+       struct dma_fence *f;
+       int r;
+       unsigned num_fences = 0;
+
+       if (src == dst)
+               return;
+
+       /* We assume the later loops get the same fences as the caller should
+        * lock the resv. */
+       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+               ++num_fences;
+               dma_fence_put(f);
+       }
+
+       r = dma_resv_reserve_fences(dst, num_fences);
+       if (r) {
+               /* As last resort on OOM we block for the fence */
+               dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+                       dma_fence_wait(f, false);
+                       dma_fence_put(f);
+               }
+       }
+
+       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
+               dma_fence_put(f);
+       }
+}
+
+
 static void amdgpu_gem_object_close(struct drm_gem_object *obj,
                                    struct drm_file *file_priv)
 {
@@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
        amdgpu_bo_fence(bo, fence, true);
        dma_fence_put(fence);
 
+       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
+
 out_unlock:
        if (unlikely(r < 0))
                dev_err(adev->dev, "failed to clear page "
When you want to remove this bubble (which is certainly a good idea) you need to first come up with a different approach to handle the clear operations.
Regards, Christian.
Regards, Christian.
Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 12:16 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
[SNIP]
I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Nope, not even remotely.
What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM combination.
5. When the bo_va structure is destroyed because of closing the handle, move the last clear operation over to the VM as implicit sync.
Hi Christian, isn't that a different problem though (that we're also trying to solve, but in your series)?
What this patch tries to achieve:
(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) (t+1) a VM operation on a BO/VM accessed by the CS.
to run concurrently. What it *doesn't* try is
(t+0) a VM operation on a BO/VM accessed by the CS. (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
to run concurrently. When you write
Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
isn't that all about the second problem?
No, it's the same.
See what we do in the VM code is to artificially insert a bubble so that all VM clear operations wait for all CS operations and then use the clear fence to indicate when the backing store of the BO can be freed.
Isn't that remediated with something like the code below? At least the gem_close case should be handled with this, and the move case was already handled by the copy operation.
That is one necessary puzzle piece, yes. But you need more than that.
Especially the explicit unmap operation needs to be converted into an implicit unmap to get the TLB flush right.
I think I know all the necessary steps now, it's just tons of work to do.
Regards, Christian.
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj, return 0; }
+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst) +{
struct dma_resv_iter cursor;
struct dma_fence *f;
int r;
unsigned num_fences = 0;
if (src == dst)
return;
/* We assume the later loops get the same fences as the caller should
* lock the resv. */
dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
++num_fences;
dma_fence_put(f);
}
r = dma_resv_reserve_fences(dst, num_fences);
if (r) {
/* As last resort on OOM we block for the fence */
dma_resv_for_each_fence(&cursor, src,
DMA_RESV_USAGE_BOOKKEEP, f) {
dma_fence_wait(f, false);
dma_fence_put(f);
}
}
dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
dma_fence_put(f);
}
+}
static void amdgpu_gem_object_close(struct drm_gem_object *obj, struct drm_file *file_priv) { @@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj, amdgpu_bo_fence(bo, fence, true); dma_fence_put(fence);
dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
out_unlock: if (unlikely(r < 0)) dev_err(adev->dev, "failed to clear page "
When you want to remove this bubble (which is certainly a good idea) you need to first come up with a different approach to handle the clear operations.
Regards, Christian.
Regards, Christian.
On Fri, Jun 3, 2022 at 2:08 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 12:16 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
[SNIP]
I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Nope, not even remotely.
What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM combination.
5. When the bo_va structure is destroyed because of closing the handle, move the last clear operation over to the VM as implicit sync.
Hi Christian, isn't that a different problem though (that we're also trying to solve, but in your series)?
What this patch tries to achieve:
(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) (t+1) a VM operation on a BO/VM accessed by the CS.
to run concurrently. What it *doesn't* try is
(t+0) a VM operation on a BO/VM accessed by the CS. (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
to run concurrently. When you write
Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
isn't that all about the second problem?
No, it's the same.
See what we do in the VM code is to artificially insert a bubble so that all VM clear operations wait for all CS operations and then use the clear fence to indicate when the backing store of the BO can be freed.
Isn't that remediated with something like the code below? At least the gem_close case should be handled with this, and the move case was already handled by the copy operation.
That is one necessary puzzle piece, yes. But you need more than that.
Especially the explicit unmap operation needs to be converted into an implicit unmap to get the TLB flush right.
This doesn't change anything about the TLB flush though? Since all unmap -> later jobs dependencies are still implicit.
So the worst that could happen (if e.g. userspace gets the waits/dependencies wrong) is:
1) non-implicit CS gets submitted that touches a BO
2) VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
3) A TLB flush happens for a new CS
4) All CS submissions here see the TLB flush and hence the unmap
So the main problem would be the CS from step 1, but (a) if that VM faults that is the app's own fault and (b) because we don't free the memory until (1) finishes it is not a security issue kernel-wise.
I think I know all the necessary steps now, it's just tons of work to do.
Regards, Christian.
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
         return 0;
 }
 
+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
+{
+        struct dma_resv_iter cursor;
+        struct dma_fence *f;
+        int r;
+        unsigned num_fences = 0;
+
+        if (src == dst)
+                return;
+
+        /* We assume the later loops get the same fences as the caller should
+         * lock the resv. */
+        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+                ++num_fences;
+                dma_fence_put(f);
+        }
+
+        r = dma_resv_reserve_fences(dst, num_fences);
+        if (r) {
+                /* As last resort on OOM we block for the fence */
+                dma_resv_for_each_fence(&cursor, src,
+                                        DMA_RESV_USAGE_BOOKKEEP, f) {
+                        dma_fence_wait(f, false);
+                        dma_fence_put(f);
+                }
+        }
+
+        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+                dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
+                dma_fence_put(f);
+        }
+}
+
 static void amdgpu_gem_object_close(struct drm_gem_object *obj,
                                     struct drm_file *file_priv)
 {
@@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
         amdgpu_bo_fence(bo, fence, true);
         dma_fence_put(fence);
 
+        dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
+
 out_unlock:
         if (unlikely(r < 0))
                 dev_err(adev->dev, "failed to clear page "
When you want to remove this bubble (which is certainly a good idea) you need to first come up with a different approach to handle the clear operations.
Regards, Christian.
Regards, Christian.
Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 2:08 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 12:16 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
[SNIP]
> I do have to fix some stuff indeed, especially for the GEM close but with that we should be able to keep the same basic approach?
Nope, not even remotely.
What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM combination.
5. When the bo_va structure is destroyed because of closing the handle, move the last clear operation over to the VM as implicit sync.
Hi Christian, isn't that a different problem though (that we're also trying to solve, but in your series)?
What this patch tries to achieve:
(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) (t+1) a VM operation on a BO/VM accessed by the CS.
to run concurrently. What it *doesn't* try is
(t+0) a VM operation on a BO/VM accessed by the CS. (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
to run concurrently. When you write
Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
isn't that all about the second problem?
No, it's the same.
See what we do in the VM code is to artificially insert a bubble so that all VM clear operations wait for all CS operations and then use the clear fence to indicate when the backing store of the BO can be freed.
Isn't that remediated with something like the code below? At least the gem_close case should be handled with this, and the move case was already handled by the copy operation.
That is one necessary puzzle piece, yes. But you need more than that.
Especially the explicit unmap operation needs to be converted into an implicit unmap to get the TLB flush right.
This doesn't change anything about the TLB flush though? Since all unmap -> later jobs dependencies are still implicit.
So the worst that could happen (if e.g. userspace gets the waits/dependencies wrong) is:
1) non-implicit CS gets submitted that touches a BO
2) VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
When we want to do a TLB flush the unmap operation must already be completed. Otherwise the flush is rather pointless since any access could reload the not yet updated PTEs.
And this means that we need to artificially add a dependency on every command submission after 2 to wait until the unmap operation is completed.
Christian.
3) A TLB flush happens for a new CS
4) All CS submissions here see the TLB flush and hence the unmap
So the main problem would be the CS from step 1, but (a) if that VM faults that is the app's own fault and (b) because we don't free the memory until (1) finishes it is not a security issue kernel-wise.
I think I know all the necessary steps now, it's just tons of work to do.
Regards, Christian.
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
         return 0;
 }
 
+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
+{
+        struct dma_resv_iter cursor;
+        struct dma_fence *f;
+        int r;
+        unsigned num_fences = 0;
+
+        if (src == dst)
+                return;
+
+        /* We assume the later loops get the same fences as the caller should
+         * lock the resv. */
+        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+                ++num_fences;
+                dma_fence_put(f);
+        }
+
+        r = dma_resv_reserve_fences(dst, num_fences);
+        if (r) {
+                /* As last resort on OOM we block for the fence */
+                dma_resv_for_each_fence(&cursor, src,
+                                        DMA_RESV_USAGE_BOOKKEEP, f) {
+                        dma_fence_wait(f, false);
+                        dma_fence_put(f);
+                }
+        }
+
+        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+                dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
+                dma_fence_put(f);
+        }
+}
+
 static void amdgpu_gem_object_close(struct drm_gem_object *obj,
                                     struct drm_file *file_priv)
 {
@@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
         amdgpu_bo_fence(bo, fence, true);
         dma_fence_put(fence);
 
+        dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
+
 out_unlock:
         if (unlikely(r < 0))
                 dev_err(adev->dev, "failed to clear page "
When you want to remove this bubble (which is certainly a good idea) you need to first come up with a different approach to handle the clear operations.
Regards, Christian.
Regards, Christian.
On Fri, Jun 3, 2022 at 2:49 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 2:08 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 12:16 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
[SNIP] >> I do have to fix some stuff indeed, especially for the GEM close but >> with that we should be able to keep the same basic approach? > Nope, not even remotely. > > What we need is the following: > 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed. > 2. When we get a VM operation we not only lock the VM page tables, but > also all buffers we potentially need to unmap. > 3. Nuking the freed list in the amdgpu_vm structure by updating freed > areas directly when they are unmapped. > 4. Tracking those updates inside the bo_va structure for the BO+VM > combination. > 5. When the bo_va structure is destroy because of closing the handle > move the last clear operation over to the VM as implicit sync. > Hi Christian, isn't that a different problem though (that we're also trying to solve, but in your series)?
What this patch tries to achieve:
(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) (t+1) a VM operation on a BO/VM accessed by the CS.
to run concurrently. What it *doesn't* try is
(t+0) a VM operation on a BO/VM accessed by the CS. (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
to run concurrently. When you write
> Only when all this is done we then can resolve the dependency that the CS currently must wait for any clear operation on the VM.
isn't that all about the second problem?
No, it's the same.
See what we do in the VM code is to artificially insert a bubble so that all VM clear operations wait for all CS operations and then use the clear fence to indicate when the backing store of the BO can be freed.
Isn't that remediated with something like the code below? At least the gem_close case should be handled with this, and the move case was already handled by the copy operation.
That is one necessary puzzle piece, yes. But you need more than that.
Especially the explicit unmap operation needs to be converted into an implicit unmap to get the TLB flush right.
This doesn't change anything about the TLB flush though? Since all unmap -> later jobs dependencies are still implicit.
So the worst that could happen (if e.g. userspace gets the waits/dependencies wrong) is:
1) non-implicit CS gets submitted that touches a BO
2) VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really. To clarify there are two sides for getting GPU bubbles and no overlap:
(1) VM operations implicitly wait for earlier CS submissions (2) CS submissions implicitly wait for earlier VM operations
Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular when doing a non-sparse map of a new BO, that tends to need to be waited on for the next CS anyway for API semantics. Due to VM operations happening on a single timeline that means this high priority map can end up being blocked by earlier sparse maps and hence the bubble in that case still exists.
So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less/no issue. In particular, by doing all dependency waits in userspace, we can make almost all VM operations start pretty much immediately (with a bunch of exceptions, like other VM work that takes time, radeonsi still submitting implicitly synced stuff etc.).
So I think (2) is valuable, just not what this series tries to focus on or touch at all.
(And then the cherry on top would be having two timelines for VM operations, a high priority one for non-sparse bindings and a low priority one for sparse bindings, but that is very complex and not super high value on top of eliminating (1) + (2), so I'd punt that for "maybe later". See e.g. the discussion wrt Intel at https://patchwork.freedesktop.org/patch/486604/#comment_879193)
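As an illustration of the userspace-side waits described above, here is a minimal sketch of what the binding thread could do before issuing a VM map. This is not the actual radv code: the function name, parameters and the missing error/timeout handling are made up for the example, and it only assumes the existing libdrm drmSyncobjWait() and amdgpu_bo_va_op_raw() entry points.

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>
#include <xf86drm.h>

/* Illustrative only: resolve the explicit dependency in userspace first,
 * so the kernel-side VM update can start right away instead of implicitly
 * syncing to earlier CS fences. */
static int example_deferred_map(int drm_fd, amdgpu_device_handle dev,
                                amdgpu_bo_handle bo, uint32_t wait_syncobj,
                                uint64_t offset, uint64_t size, uint64_t va)
{
        int r;

        /* Wait for the app-provided syncobj on this worker thread. */
        r = drmSyncobjWait(drm_fd, &wait_syncobj, 1, INT64_MAX,
                           DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
        if (r)
                return r;

        /* Then submit the map; with this series the VM update no longer
         * implicitly waits on explicitly synced CS fences either. */
        return amdgpu_bo_va_op_raw(dev, bo, offset, size, va,
                                   AMDGPU_VM_PAGE_READABLE |
                                   AMDGPU_VM_PAGE_WRITEABLE,
                                   AMDGPU_VA_OP_MAP);
}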
When we want to do a TLB flush the unmap operation must already be completed. Otherwise the flush is rather pointless since any access could reloads the not yet updated PTEs.
And this means that we need to artificially add a dependency on every command submission after 2 to wait until the unmap operation is completed.
Christian.
3) A TLB flush happens for a new CS
4) All CS submissions here see the TLB flush and hence the unmap
So the main problem would be the CS from step 1, but (a) if that VM faults that is the app's own fault and (b) because we don't free the memory until (1) finishes it is not a security issue kernel-wise.
I think I know all the necessary steps now, it's just tons of work to do.
Regards, Christian.
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
         return 0;
 }
 
+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
+{
+        struct dma_resv_iter cursor;
+        struct dma_fence *f;
+        int r;
+        unsigned num_fences = 0;
+
+        if (src == dst)
+                return;
+
+        /* We assume the later loops get the same fences as the caller should
+         * lock the resv. */
+        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+                ++num_fences;
+                dma_fence_put(f);
+        }
+
+        r = dma_resv_reserve_fences(dst, num_fences);
+        if (r) {
+                /* As last resort on OOM we block for the fence */
+                dma_resv_for_each_fence(&cursor, src,
+                                        DMA_RESV_USAGE_BOOKKEEP, f) {
+                        dma_fence_wait(f, false);
+                        dma_fence_put(f);
+                }
+        }
+
+        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+                dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
+                dma_fence_put(f);
+        }
+}
+
 static void amdgpu_gem_object_close(struct drm_gem_object *obj,
                                     struct drm_file *file_priv)
 {
@@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
         amdgpu_bo_fence(bo, fence, true);
         dma_fence_put(fence);
 
+        dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
+
 out_unlock:
         if (unlikely(r < 0))
                 dev_err(adev->dev, "failed to clear page "
When you want to remove this bubble (which is certainly a good idea) you need to first come up with a different approach to handle the clear operations.
Regards, Christian.
> Regards,
> Christian.
Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 2:49 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 2:08 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 12:16 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen: > [SNIP] >>> I do have to fix some stuff indeed, especially for the GEM close but >>> with that we should be able to keep the same basic approach? >> Nope, not even remotely. >> >> What we need is the following: >> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed. >> 2. When we get a VM operation we not only lock the VM page tables, but >> also all buffers we potentially need to unmap. >> 3. Nuking the freed list in the amdgpu_vm structure by updating freed >> areas directly when they are unmapped. >> 4. Tracking those updates inside the bo_va structure for the BO+VM >> combination. >> 5. When the bo_va structure is destroy because of closing the handle >> move the last clear operation over to the VM as implicit sync. >> > Hi Christian, isn't that a different problem though (that we're also > trying to solve, but in your series)? > > What this patch tries to achieve: > > (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) > (t+1) a VM operation on a BO/VM accessed by the CS. > > to run concurrently. What it *doesn't* try is > > (t+0) a VM operation on a BO/VM accessed by the CS. > (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync) > > to run concurrently. When you write > >> Only when all this is done we then can resolve the dependency that the >> CS currently must wait for any clear operation on the VM. > isn't that all about the second problem? No, it's the same.
See what we do in the VM code is to artificially insert a bubble so that all VM clear operations wait for all CS operations and then use the clear fence to indicate when the backing store of the BO can be freed.
Isn't that remediated with something like the code below? At least the gem_close case should be handled with this, and the move case was already handled by the copy operation.
That is one necessary puzzle piece, yes. But you need more than that.
Especially the explicit unmap operation needs to be converted into an implicit unmap to get the TLB flush right.
This doesn't change anything about the TLB flush though? Since all unmap -> later jobs dependencies are still implicit.
So the worst that could happen (if e.g. userspace gets the waits/dependencies wrong) is:
1) non-implicit CS gets submitted that touches a BO
2) VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really. To clarify there are two sides for getting GPU bubbles and no overlap:
(1) VM operations implicitly wait for earlier CS submissions (2) CS submissions implicitly wait for earlier VM operations
Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular when doing a non-sparse map of a new BO, that tends to need to be waited on for the next CS anyway for API semantics. Due to VM operations happening on a single timeline that means this high priority map can end up being blocked by earlier sparse maps and hence the bubble in that case still exists.
So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less/no issue.
Ok, once more since you don't seem to understand what I want to say: It isn't possible to fix #1 before you have fixed #2.
The VM unmap operation here is a barrier which divides the CS operations in a before and after. This is intentional design.
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, i.e. the necessity of the barrier.
I've been working on this for a couple of years now and I'm really running out of ideas for how to explain this restriction.
Regards, Christian.
On Fri, Jun 3, 2022 at 7:42 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 2:49 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 2:08 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 12:16 PM Christian König christian.koenig@amd.com wrote: > Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen: >> [SNIP] >>>> I do have to fix some stuff indeed, especially for the GEM close but >>>> with that we should be able to keep the same basic approach? >>> Nope, not even remotely. >>> >>> What we need is the following: >>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed. >>> 2. When we get a VM operation we not only lock the VM page tables, but >>> also all buffers we potentially need to unmap. >>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed >>> areas directly when they are unmapped. >>> 4. Tracking those updates inside the bo_va structure for the BO+VM >>> combination. >>> 5. When the bo_va structure is destroy because of closing the handle >>> move the last clear operation over to the VM as implicit sync. >>> >> Hi Christian, isn't that a different problem though (that we're also >> trying to solve, but in your series)? >> >> What this patch tries to achieve: >> >> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) >> (t+1) a VM operation on a BO/VM accessed by the CS. >> >> to run concurrently. What it *doesn't* try is >> >> (t+0) a VM operation on a BO/VM accessed by the CS. >> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync) >> >> to run concurrently. When you write >> >>> Only when all this is done we then can resolve the dependency that the >>> CS currently must wait for any clear operation on the VM. >> isn't that all about the second problem? > No, it's the same. > > See what we do in the VM code is to artificially insert a bubble so that > all VM clear operations wait for all CS operations and then use the > clear fence to indicate when the backing store of the BO can be freed. Isn't that remediated with something like the code below? At least the gem_close case should be handled with this, and the move case was already handled by the copy operation.
That is one necessary puzzle piece, yes. But you need more than that.
Especially the explicit unmap operation needs to be converted into an implicit unmap to get the TLB flush right.
This doesn't change anything about the TLB flush though? Since all unmap -> later jobs dependencies are still implicit.
So the worst that could happen (if e.g. userspace gets the waits/dependencies wrong) is:
1) non-implicit CS gets submitted that touches a BO
2) VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really. To clarify there are two sides for getting GPU bubbles and no overlap:
(1) VM operations implicitly wait for earlier CS submissions (2) CS submissions implicitly wait for earlier VM operations
Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular when doing a non-sparse map of a new BO, that tends to need to be waited on for the next CS anyway for API semantics. Due to VM operations happening on a single timeline that means this high priority map can end up being blocked by earlier sparse maps and hence the bubble in that case still exists.
So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less/no issue.
Ok, once more since you don't seem to understand what I want to say: It isn't possible to fix #1 before you have fixed #2.
The VM unmap operation here is a barrier which divides the CS operations in a before and after. This is intentional design.
Why is that barrier needed? The two barriers I got and understood and I think we can deal with:
1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.
But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, e.g. the necessity of the barrier.
I'm working on this for a couple of years now and I'm really running out of idea how to explain this restriction.
Regards, Christian.
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
[SNIP]
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really. To clarify there are two sides for getting GPU bubbles and no overlap:
(1) VM operations implicitly wait for earlier CS submissions (2) CS submissions implicitly wait for earlier VM operations
Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular when doing a non-sparse map of a new BO, that tends to need to be waited on for the next CS anyway for API semantics. Due to VM operations happening on a single timeline that means this high priority map can end up being blocked by earlier sparse maps and hence the bubble in that case still exists.
So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less/no issue.
Ok, once more since you don't seem to understand what I want to say: It isn't possible to fix #1 before you have fixed #2.
The VM unmap operation here is a barrier which divides the CS operations in a before and after. This is intentional design.
Why is that barrier needed? The two barriers I got and understood and I think we can deal with:
1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.
But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
Exactly because of the two reasons you mentioned.
#1 Is rather easy to fix, you just need to copy all dma_fences from the page table dma_resv object over to the BOs dma_resv object in the gem close handler. E.g. exactly what you suggested with the dma_resv_copy function.
#2 is a nightmare.
We can't move the TLB flush to the end of the unmap operation because async TLB flushes are either a bit complicated (double flushes etc.) or don't even work at all because of hw bugs. So to have a reliable TLB flush we must make sure that nothing else is ongoing, and that means a CS->VM->CS barrier.
We try very hard to circumvent that already on maps by (for example) using a completely new VMID for CS after the VM map operation.
But for the unmap operation we would need some kind of special dma_fence implementation which would not only wait for all existing dma_fences but also for the ones added until the unmap operation is completed. Because otherwise the operation we do at #1 would simply not catch all dma_fences which have access to the memory.
That's certainly doable, but I think just using the drm_exec stuff I already came up with is easier.
When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() goes away and we can keep track of the unmap operations in the bo_va structure.
With that done you can make the explicit sync you noted in the bo_va structure and implicit sync when the bo_va structure goes away.
Then the only reason I can see why we would need a CS->VM dependency is implicit synchronization, and that's what we are trying to avoid here in the first place.
Regards, Christian.
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, e.g. the necessity of the barrier.
I'm working on this for a couple of years now and I'm really running out of idea how to explain this restriction.
Regards, Christian.
On Fri, Jun 3, 2022 at 8:41 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
[SNIP]
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really. To clarify there are two sides for getting GPU bubbles and no overlap:
(1) VM operations implicitly wait for earlier CS submissions (2) CS submissions implicitly wait for earlier VM operations
Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular when doing a non-sparse map of a new BO, that tends to need to be waited on for the next CS anyway for API semantics. Due to VM operations happening on a single timeline that means this high priority map can end up being blocked by earlier sparse maps and hence the bubble in that case still exists.
So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less/no issue.
Ok, once more since you don't seem to understand what I want to say: It isn't possible to fix #1 before you have fixed #2.
The VM unmap operation here is a barrier which divides the CS operations in a before and after. This is intentional design.
Why is that barrier needed? The two barriers I got and understood and I think we can deal with:
1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.
But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
Exactly because of the two reasons you mentioned.
This is the part I'm not seeing. I get that removing #2 is a nightmare, which is why I did something that doesn't violate that constraint.
Like if an explicit CS that was running before the VM operation keeps running until after the VM operation (and hence possibly until after the TLB flush, or otherwise with the TLB flush not applying to it due to lack of async TLB flush support), that is not an issue. It might see the state from before the unmap, or after the unmap, or some intermediate state and all of those would be okay.
We still get the constraint that the TLB flush happens between the VM unmap and later CS and hence the unmap is certainly visible to them.
#1 Is rather easy to fix, you just need to copy all dma_fences from the page table dma_resv object over to the BOs dma_resv object in the gem close handler. E.g. exactly what you suggested with the dma_resv_copy function.
#2 is a nightmare.
We can't move the TLB flush at the end of the unmap operation because on async TLB flushes are either a bit complicated (double flushes etc..) or don't even work at all because of hw bugs. So to have a reliable TLB flush we must make sure that nothing else is ongoing and that means CS->VM->CS barrier.
We try very hard to circumvent that already on maps by (for example) using a completely new VMID for CS after the VM map operation.
But for the unmap operation we would need some kind special dma_fence implementation which would not only wait for all existing dma_fence but also for the one added until the unmap operation is completed. Cause otherwise our operation we do at #1 would simply not catch all dma_fences which have access to the memory.
That's certainly doable, but I think just using the drm_exec stuff I already came up with is easier.
When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() goes away and we can keep track of the unmap operations in the bo_va structure.
With that done you can make the explicit sync you noted in the bo_va structure and implicit sync when the bo_va structure goes away.
Then the only reason I can see why we would need a CS->VM dependency is implicit synchronization, and that's what we are trying to avoid here in the first place.
Regards, Christian.
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, e.g. the necessity of the barrier.
I'm working on this for a couple of years now and I'm really running out of idea how to explain this restriction.
Regards, Christian.
Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 8:41 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
[SNIP]
Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really. To clarify there are two sides for getting GPU bubbles and no overlap:
(1) VM operations implicitly wait for earlier CS submissions (2) CS submissions implicitly wait for earlier VM operations
Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular when doing a non-sparse map of a new BO, that tends to need to be waited on for the next CS anyway for API semantics. Due to VM operations happening on a single timeline that means this high priority map can end up being blocked by earlier sparse maps and hence the bubble in that case still exists.
So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less/no issue.
Ok, once more since you don't seem to understand what I want to say: It isn't possible to fix #1 before you have fixed #2.
The VM unmap operation here is a barrier which divides the CS operations in a before and after. This is intentional design.
Why is that barrier needed? The two barriers I got and understood and I think we can deal with:
1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.
But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
Exactly because of the two reasons you mentioned.
This is the part I'm not seeing. I get that removing #2 is a nightmare, which is why I did something that doesn't violate that constraint.
Like if an explicit CS that was running before the VM operation runs till after the VM operation (and hence possibly till after the TLB flush, or otherwise have the TLB flush not apply due to lack of async TLB flush support), that is not an issue. It might see the state from before the unmap, or after the unmap, or some intermediate state and all of those would be okay.
We still get the constraint that the TLB flush happens between the VM unmap and later CS and hence the unmap is certainly visible to them.
So you basically just want to set the sync mode in amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
That should be doable, but then you don't need any of the other changes.
Regards, Christian.
#1 Is rather easy to fix, you just need to copy all dma_fences from the page table dma_resv object over to the BOs dma_resv object in the gem close handler. E.g. exactly what you suggested with the dma_resv_copy function.
#2 is a nightmare.
We can't move the TLB flush at the end of the unmap operation because on async TLB flushes are either a bit complicated (double flushes etc..) or don't even work at all because of hw bugs. So to have a reliable TLB flush we must make sure that nothing else is ongoing and that means CS->VM->CS barrier.
We try very hard to circumvent that already on maps by (for example) using a completely new VMID for CS after the VM map operation.
But for the unmap operation we would need some kind special dma_fence implementation which would not only wait for all existing dma_fence but also for the one added until the unmap operation is completed. Cause otherwise our operation we do at #1 would simply not catch all dma_fences which have access to the memory.
That's certainly doable, but I think just using the drm_exec stuff I already came up with is easier.
When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() goes away and we can keep track of the unmap operations in the bo_va structure.
With that done you can make the explicit sync you noted in the bo_va structure and implicit sync when the bo_va structure goes away.
Then the only reason I can see why we would need a CS->VM dependency is implicit synchronization, and that's what we are trying to avoid here in the first place.
Regards, Christian.
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, e.g. the necessity of the barrier.
I'm working on this for a couple of years now and I'm really running out of idea how to explain this restriction.
Regards, Christian.
On Mon, Jun 6, 2022 at 12:15 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 8:41 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
[SNIP]
> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
For this series, not really. To clarify there are two sides for getting GPU bubbles and no overlap:
(1) VM operations implicitly wait for earlier CS submissions (2) CS submissions implicitly wait for earlier VM operations
Together, these combine to ensure that you get a (potentially small) bubble any time VM work happens.
Your series (and further ideas) tackles (2), and is a worthwhile thing to do. However, while writing the userspace for this I noticed this isn't enough to get rid of all our GPU bubbles. In particular when doing a non-sparse map of a new BO, that tends to need to be waited on for the next CS anyway for API semantics. Due to VM operations happening on a single timeline that means this high priority map can end up being blocked by earlier sparse maps and hence the bubble in that case still exists.
So in this series I try to tackle (1) instead. Since GPU work typically lags behind CPU submissions and VM operations aren't that slow, we can typically execute VM operations early enough that any implicit syncs from (2) are less/no issue.
Ok, once more since you don't seem to understand what I want to say: It isn't possible to fix #1 before you have fixed #2.
The VM unmap operation here is a barrier which divides the CS operations in a before and after. This is intentional design.
Why is that barrier needed? The two barriers I got and understood and I think we can deal with:
1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.
But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
Exactly because of the two reasons you mentioned.
This is the part I'm not seeing. I get that removing #2 is a nightmare, which is why I did something that doesn't violate that constraint.
Like if an explicit CS that was running before the VM operation runs till after the VM operation (and hence possibly till after the TLB flush, or otherwise have the TLB flush not apply due to lack of async TLB flush support), that is not an issue. It might see the state from before the unmap, or after the unmap, or some intermediate state and all of those would be okay.
We still get the constraint that the TLB flush happens between the VM unmap and later CS and hence the unmap is certainly visible to them.
So you basically just want to set the sync mode in amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
Yes, with the caveat that I want to do that only for DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit sync we get the old implicit behavior, if we submit a CS with explicit sync we get the new explicit behavior). The rest of the series is basically just for enabling explicit sync submissions.
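For illustration only, the intended effect could be sketched like the helper below. This is not the actual patch (the real series adds a separate sync mode for DMA_RESV_USAGE_BOOKKEEP that gets threaded through the existing sync code rather than a bool), and the helper name is made up.

/* Sketch only: when collecting the fences a VM update must wait on,
 * choose how deep to look into the BO's reservation object.  Iterating
 * with DMA_RESV_USAGE_BOOKKEEP returns every fence (old behaviour: the
 * unmap also waits on explicitly synced CS), while DMA_RESV_USAGE_READ
 * skips the BOOKKEEP fences an explicit-sync CS would add but still
 * honours implicitly synced CS fences. */
static void example_sync_vm_update(struct amdgpu_sync *sync,
                                   struct dma_resv *resv,
                                   bool wait_on_explicit_cs)
{
        enum dma_resv_usage usage = wait_on_explicit_cs ?
                DMA_RESV_USAGE_BOOKKEEP : DMA_RESV_USAGE_READ;
        struct dma_resv_iter cursor;
        struct dma_fence *f;

        /* resv must be locked by the caller; error handling omitted. */
        dma_resv_for_each_fence(&cursor, resv, usage, f)
                amdgpu_sync_fence(sync, f);
}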
That should be doable, but then you don't need any of the other changes.
Regards, Christian.
#1 Is rather easy to fix, you just need to copy all dma_fences from the page table dma_resv object over to the BOs dma_resv object in the gem close handler. E.g. exactly what you suggested with the dma_resv_copy function.
#2 is a nightmare.
We can't move the TLB flush at the end of the unmap operation because on async TLB flushes are either a bit complicated (double flushes etc..) or don't even work at all because of hw bugs. So to have a reliable TLB flush we must make sure that nothing else is ongoing and that means CS->VM->CS barrier.
We try very hard to circumvent that already on maps by (for example) using a completely new VMID for CS after the VM map operation.
But for the unmap operation we would need some kind special dma_fence implementation which would not only wait for all existing dma_fence but also for the one added until the unmap operation is completed. Cause otherwise our operation we do at #1 would simply not catch all dma_fences which have access to the memory.
That's certainly doable, but I think just using the drm_exec stuff I already came up with is easier.
When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() goes away and we can keep track of the unmap operations in the bo_va structure.
With that done you can make the explicit sync you noted in the bo_va structure and implicit sync when the bo_va structure goes away.
Then the only reason I can see why we would need a CS->VM dependency is implicit synchronization, and that's what we are trying to avoid here in the first place.
Regards, Christian.
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, e.g. the necessity of the barrier.
I'm working on this for a couple of years now and I'm really running out of idea how to explain this restriction.
Regards, Christian.
Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
On Mon, Jun 6, 2022 at 12:15 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 8:41 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
[SNIP]
>> Yeah, but that's exactly the bubble we try to avoid. Isn't it? > For this series, not really. To clarify there are two sides for > getting GPU bubbles and no overlap: > > (1) VM operations implicitly wait for earlier CS submissions > (2) CS submissions implicitly wait for earlier VM operations > > Together, these combine to ensure that you get a (potentially small) > bubble any time VM work happens. > > Your series (and further ideas) tackles (2), and is a worthwhile thing > to do. However, while writing the userspace for this I noticed this > isn't enough to get rid of all our GPU bubbles. In particular when > doing a non-sparse map of a new BO, that tends to need to be waited on > for the next CS anyway for API semantics. Due to VM operations > happening on a single timeline that means this high priority map can > end up being blocked by earlier sparse maps and hence the bubble in > that case still exists. > > So in this series I try to tackle (1) instead. Since GPU work > typically lags behind CPU submissions and VM operations aren't that > slow, we can typically execute VM operations early enough that any > implicit syncs from (2) are less/no issue. Ok, once more since you don't seem to understand what I want to say: It isn't possible to fix #1 before you have fixed #2.
The VM unmap operation here is a barrier which divides the CS operations in a before and after. This is intentional design.
Why is that barrier needed? The two barriers I got and understood and I think we can deal with:
1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.
But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
Exactly because of the two reasons you mentioned.
This is the part I'm not seeing. I get that removing #2 is a nightmare, which is why I did something that doesn't violate that constraint.
Like if an explicit CS that was running before the VM operation runs till after the VM operation (and hence possibly till after the TLB flush, or otherwise have the TLB flush not apply due to lack of async TLB flush support), that is not an issue. It might see the state from before the unmap, or after the unmap, or some intermediate state and all of those would be okay.
We still get the constraint that the TLB flush happens between the VM unmap and later CS and hence the unmap is certainly visible to them.
So you basically just want to set the sync mode in amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
Yes, with the caveat that I want to do that only for DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit sync we get the old implicit behavior, if we submit a CS with explicit sync we get the new explicit behavior). The rest of the series is basically just for enabling explicit sync submissions.
That part won't work at all and would cause additional synchronization problems.
First of all, for implicitly synced CS we should use READ, not BOOKKEEP, because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've fixed the memory corruption this used to cause, but it is still nice to avoid.
BOOKKEEP can only be used by VM updates themselves. So that they don't interfere with CS.
Then the second problem is that the VM IOCTL has absolutely no idea what the CS IOCTL would be doing. That's why we have added the EXPLICIT sync flag on the BO.
Regards, Christian.
That should be doable, but then you don't need any of the other changes.
Regards, Christian.
#1 Is rather easy to fix, you just need to copy all dma_fences from the page table dma_resv object over to the BOs dma_resv object in the gem close handler. E.g. exactly what you suggested with the dma_resv_copy function.
#2 is a nightmare.
We can't move the TLB flush at the end of the unmap operation because on async TLB flushes are either a bit complicated (double flushes etc..) or don't even work at all because of hw bugs. So to have a reliable TLB flush we must make sure that nothing else is ongoing and that means CS->VM->CS barrier.
We try very hard to circumvent that already on maps by (for example) using a completely new VMID for CS after the VM map operation.
But for the unmap operation we would need some kind special dma_fence implementation which would not only wait for all existing dma_fence but also for the one added until the unmap operation is completed. Cause otherwise our operation we do at #1 would simply not catch all dma_fences which have access to the memory.
That's certainly doable, but I think just using the drm_exec stuff I already came up with is easier.
When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() goes away and we can keep track of the unmap operations in the bo_va structure.
With that done you can make the explicit sync you noted in the bo_va structure and implicit sync when the bo_va structure goes away.
Then the only reason I can see why we would need a CS->VM dependency is implicit synchronization, and that's what we are trying to avoid here in the first place.
Regards, Christian.
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, e.g. the necessity of the barrier.
I'm working on this for a couple of years now and I'm really running out of idea how to explain this restriction.
Regards, Christian.
On Mon, Jun 6, 2022 at 12:35 PM Christian König christian.koenig@amd.com wrote:
Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
On Mon, Jun 6, 2022 at 12:15 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
On Fri, Jun 3, 2022 at 8:41 PM Christian König christian.koenig@amd.com wrote:
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
[SNIP] >>> Yeah, but that's exactly the bubble we try to avoid. Isn't it? >> For this series, not really. To clarify there are two sides for >> getting GPU bubbles and no overlap: >> >> (1) VM operations implicitly wait for earlier CS submissions >> (2) CS submissions implicitly wait for earlier VM operations >> >> Together, these combine to ensure that you get a (potentially small) >> bubble any time VM work happens. >> >> Your series (and further ideas) tackles (2), and is a worthwhile thing >> to do. However, while writing the userspace for this I noticed this >> isn't enough to get rid of all our GPU bubbles. In particular when >> doing a non-sparse map of a new BO, that tends to need to be waited on >> for the next CS anyway for API semantics. Due to VM operations >> happening on a single timeline that means this high priority map can >> end up being blocked by earlier sparse maps and hence the bubble in >> that case still exists. >> >> So in this series I try to tackle (1) instead. Since GPU work >> typically lags behind CPU submissions and VM operations aren't that >> slow, we can typically execute VM operations early enough that any >> implicit syncs from (2) are less/no issue. > Ok, once more since you don't seem to understand what I want to say: It > isn't possible to fix #1 before you have fixed #2. > > The VM unmap operation here is a barrier which divides the CS operations > in a before and after. This is intentional design. Why is that barrier needed? The two barriers I got and understood and I think we can deal with:
1) the VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.
But why do we need the VM unmap to be a strict barrier between prior CS and later CS?
Exactly because of the two reasons you mentioned.
This is the part I'm not seeing. I get that removing #2 is a nightmare, which is why I did something that doesn't violate that constraint.
Like if an explicit CS that was running before the VM operation runs till after the VM operation (and hence possibly till after the TLB flush, or otherwise have the TLB flush not apply due to lack of async TLB flush support), that is not an issue. It might see the state from before the unmap, or after the unmap, or some intermediate state and all of those would be okay.
We still get the constraint that the TLB flush happens between the VM unmap and later CS and hence the unmap is certainly visible to them.
So you basically just want to set the sync mode in amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
Yes, with the caveat that I want to do that only for DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit sync we get the old implicit behavior, if we submit a CS with explicit sync we get the new explicit behavior). The rest of the series is basically just for enabling explicit sync submissions.
That part won't work at all and would cause additional synchronization problems.
First of all for implicit synced CS we should use READ, not BOOKKEEP. Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've fixed that this causes memory corruption, but it is still nice to avoid.
Yes, what I'm saying is that on implicit sync CS submission should add READ fences to the dma resv and on explicit sync CS submission should add BOOKKEEP fences.
BOOKKEEP can only be used by VM updates themselves. So that they don't interfere with CS.
That is the point why we would go BOOKKEEP for explicit sync CS submissions, no? Explicit submission shouldn't interfere with any other CS submissions. That includes being totally ignored by GL importers (if we want to have synchronization there between an explicit submission and GL, userspace is expected to use Jason's dmabuf fence import/export IOCTLs)
Then the second problem is that the VM IOCTL has absolutely no idea what the CS IOCTL would be doing. That's why we have added the EXPLICIT sync flag on the BO.
It doesn't need to? We just use a different sync_mode for BOOKKEEP fences vs others: https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2
(the nice thing about doing it this way is that it is independent of the IOCTL, i.e. also works for the delayed mapping changes we trigger on CS submit)
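As a rough sketch of that READ vs. BOOKKEEP split (illustrative only; the helper and the explicit_sync flag are made up, the linked patch wires this through the actual CS code):

/* Sketch only: which usage a CS fence gets in the BO's dma_resv.
 * BOOKKEEP fences are ignored by implicit-sync consumers (e.g. a GL
 * importer waiting with DMA_RESV_USAGE_READ), READ fences are not.
 * Caller must hold the resv lock and have reserved a fence slot. */
static void example_add_cs_fence(struct dma_resv *resv,
                                 struct dma_fence *fence,
                                 bool explicit_sync)
{
        dma_resv_add_fence(resv, fence,
                           explicit_sync ? DMA_RESV_USAGE_BOOKKEEP :
                                           DMA_RESV_USAGE_READ);
}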
Regards, Christian.
That should be doable, but then you don't need any of the other changes.
Regards, Christian.
#1 Is rather easy to fix, you just need to copy all dma_fences from the page table dma_resv object over to the BOs dma_resv object in the gem close handler. E.g. exactly what you suggested with the dma_resv_copy function.
#2 is a nightmare.
We can't move the TLB flush at the end of the unmap operation because on async TLB flushes are either a bit complicated (double flushes etc..) or don't even work at all because of hw bugs. So to have a reliable TLB flush we must make sure that nothing else is ongoing and that means CS->VM->CS barrier.
We try very hard to circumvent that already on maps by (for example) using a completely new VMID for CS after the VM map operation.
But for the unmap operation we would need some kind special dma_fence implementation which would not only wait for all existing dma_fence but also for the one added until the unmap operation is completed. Cause otherwise our operation we do at #1 would simply not catch all dma_fences which have access to the memory.
That's certainly doable, but I think just using the drm_exec stuff I already came up with is easier.
When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() goes away and we can keep track of the unmap operations in the bo_va structure.
With that done you can make the explicit sync you noted in the bo_va structure and implicit sync when the bo_va structure goes away.
Then the only reason I can see why we would need a CS->VM dependency is implicit synchronization, and that's what we are trying to avoid here in the first place.
Regards, Christian.
To get rid of this barrier you must first fix the part where CS submissions wait for the VM operation to complete, i.e. the necessity of the barrier.
I've been working on this for a couple of years now and I'm really running out of ideas for how to explain this restriction.
Regards, Christian.
Hi Christian,
Friendly ping on the comments here. Are you okay with me re-spinning the series with a fixed patch 1 or do we need further discussion on the approach here?
Thanks, Bas
[SNIP]
Hi Bas,
sorry I totally missed your reply. Just tried to answer your original questions.
Regards, Christian.
On 15.06.22 at 02:40, Bas Nieuwenhuizen wrote:
[SNIP]
On 06.06.22 at 13:00, Bas Nieuwenhuizen wrote:
On Mon, Jun 6, 2022 at 12:35 PM Christian König christian.koenig@amd.com wrote:
[SNIP] That part won't work at all and would cause additional synchronization problems.
First of all, for implicitly synced CS we should use READ, not BOOKKEEP, because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've fixed the memory corruption this used to cause, but it is still something we want to avoid.
Yes, what I'm saying is that an implicit-sync CS submission should add READ fences to the dma_resv and an explicit-sync CS submission should add BOOKKEEP fences.
No, exactly that is wrong.
Implicit CS submissions should add WRITE fences.
Explicit CS submissions should add READ fences.
Only VM updates should add BOOKKEEP fences.
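To make those three rules concrete, a minimal sketch at the dma_resv level (illustration only, not code from the series; sketch_publish_fence() and its flags are made-up names):

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* Sketch of the placement rules stated above: implicitly synced CS adds
 * WRITE fences, explicitly synced CS adds READ fences, VM updates add
 * BOOKKEEP fences. Must be called with the reservation lock held.
 */
static int sketch_publish_fence(struct dma_resv *resv,
				struct dma_fence *fence,
				bool is_vm_update, bool implicit_sync)
{
	enum dma_resv_usage usage;
	int r;

	if (is_vm_update)
		usage = DMA_RESV_USAGE_BOOKKEEP; /* invisible to implicit sync and CS */
	else if (implicit_sync)
		usage = DMA_RESV_USAGE_WRITE;    /* later implicit readers and writers wait */
	else
		usage = DMA_RESV_USAGE_READ;     /* later implicit writers still wait */

	r = dma_resv_reserve_fences(resv, 1);
	if (r)
		return r;

	dma_resv_add_fence(resv, fence, usage);
	return 0;
}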
BOOKKEEP can only be used by VM updates themselves. So that they don't interfere with CS.
That is the point why we would go BOOKKEEP for explicit sync CS submissions, no? Explicit submission shouldn't interfere with any other CS submissions. That includes being totally ignored by GL importers (if we want to have synchronization there between an explicit submission and GL, userspace is expected to use Jason's dmabuf fence import/export IOCTLs)
No, that would break existing DMA-buf rules.
Explicit CS submissions are still a dependency for implicit submissions.
Then the second problem is that the VM IOCTL has absolutely no idea what the CS IOCTL would be doing. That's why we have added the EXPLICIT sync flag on the BO.

It doesn't need to? We just use a different sync_mode for BOOKKEEP fences vs others: https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2
No, exactly that's completely broken.
Regards, Christian.
On Wed, Jun 15, 2022 at 9:00 AM Christian König christian.koenig@amd.com wrote:
On 06.06.22 at 13:00, Bas Nieuwenhuizen wrote:
On Mon, Jun 6, 2022 at 12:35 PM Christian König christian.koenig@amd.com wrote:
[SNIP]
BOOKKEEP can only be used by VM updates themselves. So that they don't interfere with CS.
That is the point why we would go BOOKKEEP for explicit sync CS submissions, no? Explicit submission shouldn't interfere with any other CS submissions. That includes being totally ignored by GL importers (if we want to have synchronization there between an explicit submission and GL, userspace is expected to use Jason's dmabuf fence import/export IOCTLs)
No, that would break existing DMA-buf rules.
Explicit CS submissions are still a dependency for implicit submissions.
This is explicitly what we don't want for explicit submissions, and why I waited to send this series until the DMA_RESV_USAGE series landed. We wish to opt out of implicit sync completely, and just use the IOCTLs Jason wrote for back-compat with windowing systems that need it.
If BOOKKEEP isn't for that, should we add a new USAGE?
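For reference, this is roughly what the userspace bridge referred to here looks like (a sketch only; it assumes the sync-file export/import dma-buf ioctls from Jason Ekstrand's series, which had not landed at the time of this mail, so the names and struct layouts should be treated as assumptions):

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Before an explicitly synced submission writes to a shared buffer:
 * pull the buffer's implicit fences out as a sync_file so they can be
 * waited on explicitly (e.g. after importing into a drm syncobj).
 */
static int sketch_export_implicit_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,	/* we intend to write, wait for everything */
		.fd = -1,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
		return -1;
	return args.fd;	/* sync_file fd */
}

/* After the submission: make the render fence visible to implicit-sync
 * consumers (e.g. a GL compositor) by importing it back into the dma-buf.
 */
static int sketch_import_render_fence(int dmabuf_fd, int syncfile_fd)
{
	struct dma_buf_import_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,	/* add it as a write fence */
		.fd = syncfile_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}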
Then the second problem is that the VM IOCTL has absolutely no idea what the CS IOCTL would be doing. That's why we have added the EXPLICIT sync flag on the BO. It doesn't need to? We just use a different sync_mode for BOOKKEEP fences vs others: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork....
No, exactly that's completely broken.
Regards, Christian.
(the nice thing about doing it this way is that it is independent of the IOCTL, i.e. also works for the delayed mapping changes we trigger on CS submit)
Regards, Christian.
That should be doable, but then you don't need any of the other changes.
Regards, Christian.
> #1 Is rather easy to fix, you just need to copy all dma_fences from the > page table dma_resv object over to the BOs dma_resv object in the gem > close handler. E.g. exactly what you suggested with the dma_resv_copy > function. > > #2 is a nightmare. > > We can't move the TLB flush at the end of the unmap operation because on > async TLB flushes are either a bit complicated (double flushes etc..) or > don't even work at all because of hw bugs. So to have a reliable TLB > flush we must make sure that nothing else is ongoing and that means > CS->VM->CS barrier. > > We try very hard to circumvent that already on maps by (for example) > using a completely new VMID for CS after the VM map operation. > > But for the unmap operation we would need some kind special dma_fence > implementation which would not only wait for all existing dma_fence but > also for the one added until the unmap operation is completed. Cause > otherwise our operation we do at #1 would simply not catch all > dma_fences which have access to the memory. > > That's certainly doable, but I think just using the drm_exec stuff I > already came up with is easier. > > When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() > goes away and we can keep track of the unmap operations in the bo_va > structure. > > With that done you can make the explicit sync you noted in the bo_va > structure and implicit sync when the bo_va structure goes away. > > Then the only reason I can see why we would need a CS->VM dependency is > implicit synchronization, and that's what we are trying to avoid here in > the first place. > > Regards, > Christian. > >>> To get rid of this barrier you must first fix the part where CS >>> submissions wait for the VM operation to complete, e.g. the necessity of >>> the barrier. >>> >>> I'm working on this for a couple of years now and I'm really running out >>> of idea how to explain this restriction. >>> >>> Regards, >>> Christian. >>>
On 17.06.22 at 15:03, Bas Nieuwenhuizen wrote:
[SNIP]
BOOKKEEP can only be used by VM updates themselves. So that they don't interfere with CS.
That is the point why we would go BOOKKEEP for explicit sync CS submissions, no? Explicit submission shouldn't interfere with any other CS submissions. That includes being totally ignored by GL importers (if we want to have synchronization there between an explicit submission and GL, userspace is expected to use Jason's dmabuf fence import/export IOCTLs)
No, that would break existing DMA-buf rules.
Explicit CS submissions are still a dependency for implicit submissions.
This is explicitly what we don't want for explicit submissions and why I waited with this series until the DMA_RESV_USAGE series landed. We wish to opt out from implicit sync completely, and just use the IOCTLs Jason wrote for back-compat with windowing systems that need it.
If BOOKKEEP isn't for that, should we add a new USAGE?
BOOKKEEP is exactly for that, but as discussed with Daniel that's not what we want in the kernel.
When you mix implicit with explicit synchronization (OpenGL with RADV, for example) it should be mandatory for OpenGL to wait for any RADV submission before issuing an operation.
What you want to do is intentionally not supported.
Regards, Christian.
We want to be able to use BOOKKEEP usage for the page-directory BO of contexts that are not implicitly synced, so let callers pass the dma_resv usage to amdgpu_vm_get_pd_bo() instead of hard-coding DMA_RESV_USAGE_READ.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++++---- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 ++++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 3 ++- 6 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 92a1b08b3bbc..c47695b37a1c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -921,7 +921,7 @@ static int reserve_bo_and_vm(struct kgd_mem *mem, ctx->kfd_bo.tv.usage = DMA_RESV_USAGE_READ; list_add(&ctx->kfd_bo.tv.head, &ctx->list);
- amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0]); + amdgpu_vm_get_pd_bo(vm, &ctx->list, &ctx->vm_pd[0], DMA_RESV_USAGE_READ);
ret = ttm_eu_reserve_buffers(&ctx->ticket, &ctx->list, false, &ctx->duplicates); @@ -992,7 +992,7 @@ static int reserve_bo_and_cond_vms(struct kgd_mem *mem, continue;
amdgpu_vm_get_pd_bo(entry->bo_va->base.vm, &ctx->list, - &ctx->vm_pd[i]); + &ctx->vm_pd[i], DMA_RESV_USAGE_READ); i++; }
@@ -2212,7 +2212,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info) list_for_each_entry(peer_vm, &process_info->vm_list_head, vm_list_node) amdgpu_vm_get_pd_bo(peer_vm, &resv_list, - &pd_bo_list_entries[i++]); + &pd_bo_list_entries[i++], DMA_RESV_USAGE_READ); /* Add the userptr_inval_list entries to resv_list */ list_for_each_entry(mem, &process_info->userptr_inval_list, validate_list.head) { @@ -2407,7 +2407,8 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) mutex_lock(&process_info->lock); list_for_each_entry(peer_vm, &process_info->vm_list_head, vm_list_node) - amdgpu_vm_get_pd_bo(peer_vm, &ctx.list, &pd_bo_list[i++]); + amdgpu_vm_get_pd_bo(peer_vm, &ctx.list, &pd_bo_list[i++], + DMA_RESV_USAGE_READ);
/* Reserve all BOs and page tables/directory. Add all BOs from * kfd_bo_list to ctx.list diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 0318a6d46a41..64419f55606f 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -524,7 +524,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, amdgpu_bo_list_get_list(p->bo_list, &p->validated);
INIT_LIST_HEAD(&duplicates); - amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd); + amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd, DMA_RESV_USAGE_READ);
if (p->uf_entry.tv.bo && !ttm_to_amdgpu_bo(p->uf_entry.tv.bo)->parent) list_add(&p->uf_entry.tv.head, &p->validated); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c index 71277257d94d..f091fe6bb985 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c @@ -77,7 +77,7 @@ int amdgpu_map_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm, csa_tv.usage = DMA_RESV_USAGE_READ;
list_add(&csa_tv.head, &list); - amdgpu_vm_get_pd_bo(vm, &list, &pd); + amdgpu_vm_get_pd_bo(vm, &list, &pd, DMA_RESV_USAGE_READ);
r = ttm_eu_reserve_buffers(&ticket, &list, true, NULL); if (r) { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 7483411229f4..a1194a0986bf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -210,7 +210,7 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj, tv.usage = DMA_RESV_USAGE_READ; list_add(&tv.head, &list);
- amdgpu_vm_get_pd_bo(vm, &list, &vm_pd); + amdgpu_vm_get_pd_bo(vm, &list, &vm_pd, DMA_RESV_USAGE_READ);
r = ttm_eu_reserve_buffers(&ticket, &list, false, &duplicates); if (r) { @@ -740,7 +740,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data, abo = NULL; }
- amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd); + amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd, DMA_RESV_USAGE_READ);
r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); if (r) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 515be19ab279..da04072a3ea6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -623,17 +623,19 @@ static void amdgpu_vm_pt_next_dfs(struct amdgpu_device *adev, * @vm: vm providing the BOs * @validated: head of validation list * @entry: entry to add + * @resv_usage: resv usage for the synchronization * * Add the page directory to the list of BOs to * validate for command submission. */ void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm, struct list_head *validated, - struct amdgpu_bo_list_entry *entry) + struct amdgpu_bo_list_entry *entry, + enum dma_resv_usage resv_usage) { entry->priority = 0; entry->tv.bo = &vm->root.bo->tbo; - entry->tv.usage = DMA_RESV_USAGE_READ; + entry->tv.usage = resv_usage; entry->user_pages = NULL; list_add(&entry->tv.head, validated); } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index a40a6a993bb0..a14cd9716f44 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h @@ -384,7 +384,8 @@ void amdgpu_vm_release_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm) void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm); void amdgpu_vm_get_pd_bo(struct amdgpu_vm *vm, struct list_head *validated, - struct amdgpu_bo_list_entry *entry); + struct amdgpu_bo_list_entry *entry, + enum dma_resv_usage resv_usage); bool amdgpu_vm_ready(struct amdgpu_vm *vm); int amdgpu_vm_validate_pt_bos(struct amdgpu_device *adev, struct amdgpu_vm *vm, int (*callback)(void *p, struct amdgpu_bo *bo),
This changes all BO fence usages in a submit from READ/WRITE to BOOKKEEP when the context has implicit sync disabled, which effectively takes these submits out of implicit synchronization.
This is configured per context through the existing context IOCTL, using a new AMDGPU_CTX_OP_SET_IMPLICIT_SYNC op.
Signed-off-by: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 13 ++++++---- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 32 +++++++++++++++++++++++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 + include/uapi/drm/amdgpu_drm.h | 3 +++ 4 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 64419f55606f..944028d0ed6d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -498,6 +498,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, struct amdgpu_bo *gws; struct amdgpu_bo *oa; int r; + enum dma_resv_usage resv_usage;
INIT_LIST_HEAD(&p->validated);
@@ -518,13 +519,16 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, return r; }
+ resv_usage = p->ctx->disable_implicit_sync ? DMA_RESV_USAGE_BOOKKEEP : + DMA_RESV_USAGE_READ; + amdgpu_bo_list_for_each_entry(e, p->bo_list) - e->tv.usage = DMA_RESV_USAGE_READ; + e->tv.usage = resv_usage;
amdgpu_bo_list_get_list(p->bo_list, &p->validated);
INIT_LIST_HEAD(&duplicates); - amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd, DMA_RESV_USAGE_READ); + amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd, resv_usage);
if (p->uf_entry.tv.bo && !ttm_to_amdgpu_bo(p->uf_entry.tv.bo)->parent) list_add(&p->uf_entry.tv.head, &p->validated); @@ -651,7 +655,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p) struct dma_resv *resv = bo->tbo.base.resv; enum amdgpu_sync_mode sync_mode;
- sync_mode = amdgpu_bo_explicit_sync(bo) ? + sync_mode = (amdgpu_bo_explicit_sync(bo) || p->ctx->disable_implicit_sync) ? AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER; r = amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, &fpriv->vm); @@ -1259,7 +1263,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
/* Make sure all BOs are remembered as writers */ amdgpu_bo_list_for_each_entry(e, p->bo_list) - e->tv.usage = DMA_RESV_USAGE_WRITE; + e->tv.usage = p->ctx->disable_implicit_sync ? DMA_RESV_USAGE_BOOKKEEP + : DMA_RESV_USAGE_WRITE;
ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); mutex_unlock(&p->adev->notifier_lock); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c index c317078d1afd..5fd3ad630194 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c @@ -559,8 +559,6 @@ static int amdgpu_ctx_query2(struct amdgpu_device *adev, return 0; }
- - static int amdgpu_ctx_stable_pstate(struct amdgpu_device *adev, struct amdgpu_fpriv *fpriv, uint32_t id, bool set, u32 *stable_pstate) @@ -589,6 +587,30 @@ static int amdgpu_ctx_stable_pstate(struct amdgpu_device *adev, return r; }
+static int amdgpu_ctx_set_implicit_sync(struct amdgpu_device *adev, + struct amdgpu_fpriv *fpriv, uint32_t id, + bool enable) +{ + struct amdgpu_ctx *ctx; + struct amdgpu_ctx_mgr *mgr; + + if (!fpriv) + return -EINVAL; + + mgr = &fpriv->ctx_mgr; + mutex_lock(&mgr->lock); + ctx = idr_find(&mgr->ctx_handles, id); + if (!ctx) { + mutex_unlock(&mgr->lock); + return -EINVAL; + } + + ctx->disable_implicit_sync = !enable; + + mutex_unlock(&mgr->lock); + return 0; +} + int amdgpu_ctx_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) { @@ -637,6 +659,12 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data, return -EINVAL; r = amdgpu_ctx_stable_pstate(adev, fpriv, id, true, &stable_pstate); break; + case AMDGPU_CTX_OP_SET_IMPLICIT_SYNC: + if ((args->in.flags & ~AMDGPU_CTX_IMPLICIT_SYNC_ENABLED) || args->in.priority) + return -EINVAL; + r = amdgpu_ctx_set_implicit_sync(adev, fpriv, id, + args->in.flags & AMDGPU_CTX_IMPLICIT_SYNC_ENABLED); + break; default: return -EINVAL; } diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h index 142f2f87d44c..7675838d1640 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h @@ -54,6 +54,7 @@ struct amdgpu_ctx { unsigned long ras_counter_ce; unsigned long ras_counter_ue; uint32_t stable_pstate; + bool disable_implicit_sync; };
struct amdgpu_ctx_mgr { diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h index 1d65c1fbc4ec..09d9388e35a7 100644 --- a/include/uapi/drm/amdgpu_drm.h +++ b/include/uapi/drm/amdgpu_drm.h @@ -208,6 +208,7 @@ union drm_amdgpu_bo_list { #define AMDGPU_CTX_OP_QUERY_STATE2 4 #define AMDGPU_CTX_OP_GET_STABLE_PSTATE 5 #define AMDGPU_CTX_OP_SET_STABLE_PSTATE 6 +#define AMDGPU_CTX_OP_SET_IMPLICIT_SYNC 7
/* GPU reset status */ #define AMDGPU_CTX_NO_RESET 0 @@ -248,6 +249,8 @@ union drm_amdgpu_bo_list { #define AMDGPU_CTX_STABLE_PSTATE_MIN_MCLK 3 #define AMDGPU_CTX_STABLE_PSTATE_PEAK 4
+#define AMDGPU_CTX_IMPLICIT_SYNC_ENABLED 1 + struct drm_amdgpu_ctx_in { /** AMDGPU_CTX_OP_* */ __u32 op;
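For completeness, a sketch of how userspace would use the new op; it assumes the AMDGPU_CTX_OP_SET_IMPLICIT_SYNC op and the AMDGPU_CTX_IMPLICIT_SYNC_ENABLED flag from this patch and goes through the generic libdrm command ioctl rather than a dedicated helper:

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

/* Disable implicit sync for an existing amdgpu context (sketch only). */
static int sketch_ctx_disable_implicit_sync(int fd, uint32_t ctx_id)
{
	union drm_amdgpu_ctx args;

	memset(&args, 0, sizeof(args));
	args.in.op = AMDGPU_CTX_OP_SET_IMPLICIT_SYNC;
	args.in.ctx_id = ctx_id;
	/* flags = 0 disables implicit sync;
	 * AMDGPU_CTX_IMPLICIT_SYNC_ENABLED would turn it back on.
	 */
	args.in.flags = 0;

	return drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));
}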