Hi guys,
with this patch set I want to look into how much additional work it would be to support implicit sync compared to only explicit sync.
Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks.
For this prototype the patch set doesn't implement any user fence synchronization at all, but just assumes that faulting in user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even stricter; the only use case I could find which blocks this is the radeon driver, and that should be handleable.
This of course doesn't give you the same semantics as classic implicit sync, which guarantee that you have exclusive access to a buffer, but this is also not necessary.
So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general.
Regards, Christian.
This is a RFC/WIP patch which just adds the interface and lockdep annotation without any actual implementation.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-resv.c | 18 ++++++++++++++++++
 include/linux/dma-resv.h   |  1 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 6ddbeb5dfbf6..e0305424957b 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -681,3 +681,21 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
         return ret;
 }
 EXPORT_SYMBOL_GPL(dma_resv_test_signaled_rcu);
+
+/**
+ * dma_resv_sync_user_fence - block for user fences to signal
+ *
+ * @obj: The DMA resv object with the user fence attached
+ *
+ * To make sure we have proper synchronization between accesses block for user
+ * fences before starting a dma_fence based operation on the buffer.
+ */
+int dma_resv_sync_user_fence(struct dma_resv *obj)
+{
+        might_fault();
+
+        /* TODO: Actually come up with an implementation for this! */
+
+        return 0;
+}
+EXPORT_SYMBOL_GPL(dma_resv_sync_user_fence);
diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h
index d44a77e8a7e3..c525a36be900 100644
--- a/include/linux/dma-resv.h
+++ b/include/linux/dma-resv.h
@@ -289,5 +289,6 @@ long dma_resv_wait_timeout_rcu(struct dma_resv *obj, bool wait_all,
                                bool intr, unsigned long timeout);
 bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all);
+int dma_resv_sync_user_fence(struct dma_resv *obj);
 
 #endif /* _LINUX_RESERVATION_H */
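
The actual wait is intentionally left as a TODO here. Purely to illustrate the idea, a blocking implementation could look roughly like the sketch below, assuming a hypothetical struct dma_resv_user_fence carrying a kernel CPU mapping of the fence value (cpu_addr) and the value user space promises to write once its submission work is done (wait_value); none of these fields or helpers exist yet:

/* Illustrative sketch only, not part of the patch. obj->user_fence,
 * cpu_addr and wait_value are assumptions for the sake of the example. */
int dma_resv_sync_user_fence(struct dma_resv *obj)
{
        struct dma_resv_user_fence *fence;      /* hypothetical */

        might_fault();

        fence = READ_ONCE(obj->user_fence);     /* hypothetical field */
        if (!fence)
                return 0;

        /* Poll the CPU-visible fence value until user space has caught up. */
        while (READ_ONCE(*fence->cpu_addr) < fence->wait_value) {
                if (signal_pending(current))
                        return -ERESTARTSYS;
                cond_resched();
        }

        return 0;
}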
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/infiniband/hw/mlx5/odp.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index b103555b1f5d..6b4d980c02e8 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -804,6 +804,10 @@ static int pagefault_dmabuf_mr(struct mlx5_ib_mr *mr, size_t bcnt,
         if (flags & MLX5_PF_FLAGS_ENABLE)
                 xlt_flags |= MLX5_IB_UPD_XLT_ENABLE;
 
+        err = dma_resv_sync_user_fence(umem_dmabuf->attach->dmabuf->resv);
+        if (err)
+                return err;
+
         dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
         err = ib_umem_dmabuf_map_pages(umem_dmabuf);
         if (err) {
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      | 7 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_display.c | 6 ++++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index b5c766998045..afd58c6d88a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -534,6 +534,13 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
                 return r;
         }
 
+        /* Sync to user fences */
+        amdgpu_bo_list_for_each_entry(e, p->bo_list) {
+                r = dma_resv_sync_user_fence(e->tv.bo->base.resv);
+                if (r)
+                        return r;
+        }
+
         /* One for TTM and one for the CS job */
         amdgpu_bo_list_for_each_entry(e, p->bo_list)
                 e->tv.num_shared = 2;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
index 9a2f811450ed..3edd6dbae71f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_display.c
@@ -181,6 +181,12 @@ int amdgpu_display_crtc_page_flip_target(struct drm_crtc *crtc,
         obj = fb->obj[0];
         new_abo = gem_to_amdgpu_bo(obj);
 
+        r = dma_resv_sync_user_fence(obj->resv);
+        if (unlikely(r)) {
+                DRM_ERROR("failed to wait for user fence before flip\n");
+                goto cleanup;
+        }
+
         /* pin the new buffer */
         r = amdgpu_bo_reserve(new_abo, false);
         if (unlikely(r != 0)) {
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/drm_gem_atomic_helper.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/drm_gem_atomic_helper.c b/drivers/gpu/drm/drm_gem_atomic_helper.c
index a005c5a0ba46..fe0d18486643 100644
--- a/drivers/gpu/drm/drm_gem_atomic_helper.c
+++ b/drivers/gpu/drm/drm_gem_atomic_helper.c
@@ -142,11 +142,15 @@ int drm_gem_plane_helper_prepare_fb(struct drm_plane *plane, struct drm_plane_st
 {
         struct drm_gem_object *obj;
         struct dma_fence *fence;
+        int ret;
 
         if (!state->fb)
                 return 0;
 
         obj = drm_gem_fb_get_obj(state->fb, 0);
+        ret = dma_resv_sync_user_fence(obj->resv);
+        if (ret)
+                return ret;
         fence = dma_resv_get_excl_rcu(obj->resv);
         drm_atomic_set_fence_for_plane(state, fence);
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c | 23 ++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
index d05c35994579..2e440674ca5b 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c
@@ -108,6 +108,21 @@ static int submit_lookup_objects(struct etnaviv_gem_submit *submit,
         return ret;
 }
 
+static int submit_sync_user(struct etnaviv_gem_submit *submit)
+{
+        unsigned int i;
+        int ret;
+
+        for (i = 0; i < submit->nr_bos; i++) {
+                struct drm_gem_object *obj = &submit->bos[i].obj->base;
+
+                ret = dma_resv_sync_user_fence(obj->resv);
+                if (ret)
+                        return ret;
+        }
+        return 0;
+}
+
 static void submit_unlock_object(struct etnaviv_gem_submit *submit, int i)
 {
         if (submit->bos[i].flags & BO_LOCKED) {
@@ -518,8 +533,6 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
                 }
         }
 
-        ww_acquire_init(&ticket, &reservation_ww_class);
-
         submit = submit_create(dev, gpu, args->nr_bos, args->nr_pmrs);
         if (!submit) {
                 ret = -ENOMEM;
@@ -541,6 +554,12 @@ int etnaviv_ioctl_gem_submit(struct drm_device *dev, void *data,
         if (ret)
                 goto err_submit_objects;
 
+        ret = submit_sync_user(submit);
+        if (ret)
+                goto err_submit_objects;
+
+        ww_acquire_init(&ticket, &reservation_ww_class);
+
         if ((priv->mmu_global->version != ETNAVIV_IOMMU_V2) &&
             !etnaviv_cmd_validate_one(gpu, stream, args->stream_size / 4,
                                       relocs, args->nr_relocs)) {
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 5964e67c7d36..24c575d762db 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -872,6 +872,12 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
                         goto err;
                 }
 
+                err = dma_resv_sync_user_fence(vma->obj->base.resv);
+                if (unlikely(err)) {
+                        i915_vma_put(vma);
+                        goto err;
+                }
+
                 eb_add_vma(eb, i, batch, vma);
 
                 if (i915_gem_object_is_userptr(vma->obj)) {
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/lima/lima_gem.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_gem.c b/drivers/gpu/drm/lima/lima_gem.c
index de62966243cd..d3d68218568d 100644
--- a/drivers/gpu/drm/lima/lima_gem.c
+++ b/drivers/gpu/drm/lima/lima_gem.c
@@ -321,6 +321,12 @@ int lima_gem_submit(struct drm_file *file, struct lima_submit *submit)
                         goto err_out0;
                 }
 
+                err = dma_resv_sync_user_fence(obj->resv);
+                if (err) {
+                        drm_gem_object_put(obj);
+                        goto err_out0;
+                }
+
                 bo = to_lima_bo(obj);
 
                 /* increase refcnt of gpu va map to prevent unmapped when executing,
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/msm/msm_gem_submit.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 5480852bdeda..a77389ce23d0 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -285,6 +285,20 @@ static int submit_lock_objects(struct msm_gem_submit *submit)
         return ret;
 }
 
+static int submit_sync_user_fence(struct msm_gem_submit *submit)
+{
+        int i, ret;
+
+        for (i = 0; i < submit->nr_bos; i++) {
+                struct msm_gem_object *msm_obj = submit->bos[i].obj;
+
+                ret = dma_resv_sync_user_fence(msm_obj->base.resv);
+                if (ret)
+                        return ret;
+        }
+        return 0;
+}
+
 static int submit_fence_sync(struct msm_gem_submit *submit, bool no_implicit)
 {
         int i, ret = 0;
@@ -769,6 +783,10 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
          */
         pm_runtime_get_sync(&gpu->pdev->dev);
 
+        ret = submit_sync_user_fence(submit);
+        if (ret)
+                goto out;
+
         /* copy_*_user while holding a ww ticket upsets lockdep */
         ww_acquire_init(&submit->ticket, &reservation_ww_class);
         has_ww_ticket = true;
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index a70e82413fa7..e349a8b32549 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -552,6 +552,7 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
                              struct validate_op *op, bool *apply_relocs)
 {
         struct nouveau_cli *cli = nouveau_cli(file_priv);
+        unsigned int i;
         int ret;
 
         INIT_LIST_HEAD(&op->list);
@@ -559,6 +560,17 @@ nouveau_gem_pushbuf_validate(struct nouveau_channel *chan,
         if (nr_buffers == 0)
                 return 0;
 
+        for (i = 0; i < nr_buffers; i++) {
+                struct drm_nouveau_gem_pushbuf_bo *b = &pbbo[i];
+                struct drm_gem_object *gem;
+
+                gem = drm_gem_object_lookup(file_priv, b->handle);
+                if (!gem)
+                        return -ENOENT;
+                dma_resv_sync_user_fence(gem->resv);
+                drm_gem_object_put(gem);
+        }
+
         ret = validate_init(chan, file_priv, pbbo, nr_buffers, op);
         if (unlikely(ret)) {
                 if (ret != -ERESTARTSYS)
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 6003cfeb1322..9174ceb1d16d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -216,6 +216,18 @@ static void panfrost_attach_object_fences(struct drm_gem_object **bos,
                 dma_resv_add_excl_fence(bos[i]->resv, fence);
 }
 
+static int panfrost_sync_user_fences(struct drm_gem_object **bos, int bo_count)
+{
+        int i, ret;
+
+        for (i = 0; i < bo_count; i++) {
+                ret = dma_resv_sync_user_fence(bos[i]->resv);
+                if (ret)
+                        return ret;
+        }
+        return 0;
+}
+
 int panfrost_job_push(struct panfrost_job *job)
 {
         struct panfrost_device *pfdev = job->pfdev;
@@ -224,6 +236,10 @@ int panfrost_job_push(struct panfrost_job *job)
         struct ww_acquire_ctx acquire_ctx;
         int ret = 0;
 
+        ret = panfrost_sync_user_fences(job->bos, job->bo_count);
+        if (ret)
+                return ret;
+
         mutex_lock(&pfdev->sched_lock);
 
         ret = drm_gem_lock_reservations(job->bos, job->bo_count,
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/radeon/radeon_cs.c      | 6 ++++++
 drivers/gpu/drm/radeon/radeon_display.c | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 059431689c2d..fb0e238535f3 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -189,6 +189,12 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
                                           priority);
         }
 
+        for (i = 0; i < p->nrelocs; i++) {
+                r = dma_resv_sync_user_fence(p->relocs[i].tv.bo->base.resv);
+                if (r)
+                        return r;
+        }
+
         radeon_cs_buckets_get_list(&buckets, &p->validated);
 
         if (p->cs_flags & RADEON_CS_USE_VM)
diff --git a/drivers/gpu/drm/radeon/radeon_display.c b/drivers/gpu/drm/radeon/radeon_display.c
index 652af7a134bd..75ebd2338809 100644
--- a/drivers/gpu/drm/radeon/radeon_display.c
+++ b/drivers/gpu/drm/radeon/radeon_display.c
@@ -519,6 +519,10 @@ static int radeon_crtc_page_flip_target(struct drm_crtc *crtc,
         DRM_DEBUG_DRIVER("flip-ioctl() cur_rbo = %p, new_rbo = %p\n",
                          work->old_rbo, new_rbo);
 
+        r = dma_resv_sync_user_fence(new_rbo->tbo.base.resv);
+        if (unlikely(r != 0))
+                goto cleanup;
+
         r = radeon_bo_reserve(new_rbo, false);
         if (unlikely(r != 0)) {
                 DRM_ERROR("failed to reserve new rbo buffer before flip\n");
Just add the call before taking locks.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/v3d/v3d_gem.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_gem.c b/drivers/gpu/drm/v3d/v3d_gem.c
index 4eb354226972..7c45292c641c 100644
--- a/drivers/gpu/drm/v3d/v3d_gem.c
+++ b/drivers/gpu/drm/v3d/v3d_gem.c
@@ -345,6 +345,12 @@ v3d_lookup_bos(struct drm_device *dev,
         }
         spin_unlock(&file_priv->table_lock);
 
+        for (i = 0; i < job->bo_count; i++) {
+                ret = dma_resv_sync_user_fence(job->bo[i]->resv);
+                if (ret)
+                        break;
+        }
+
 fail:
         kvfree(handles);
         return ret;
Hi Christian,
On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote:
Hi guys,
with this patch set I want to look into how much additional work it would be to support implicit sync compared to only explicit sync.
Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks.
It's a lot more I think:
- sync_file/drm_syncobj still need to be supported somehow
- we need userspace to handle the stall in a submit thread at least
- there's nothing here that sets the sync object
- implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything
- we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that.
- ... at this point I stopped pondering but there's definitely more
Imo the only way we'll even get the complete picture is if we do the following:
1. roll out implicit sync with userspace fences on a driver-by-driver basis
1a. including all the winsys/modeset stuff
2. roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers
2a. including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already.
3. only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense.
Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future. That never worked :-)
For this prototype the patch set doesn't implement any user fence synchronization at all, but just assumes that faulting in user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even stricter; the only use case I could find which blocks this is the radeon driver, and that should be handleable.
This of course doesn't give you the same semantics as classic implicit sync, which guarantee that you have exclusive access to a buffer, but this is also not necessary.
So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general.
I'm not sure what exactly you're proving here aside from "it's possible to roll out a function with ill-defined semantics to all drivers". This really is a lot harder than just this one function and just this one patch set. -Daniel
Hi Daniel,
Am 04.05.21 um 16:15 schrieb Daniel Vetter:
Hi Christian,
On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote:
Hi guys,
with this patch set I want to look into how much additional work it would be to support implicit sync compared to only explicit sync.
Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks.
It's a lot more I think
- sync_file/drm_syncobj still need to be supported somehow
You need that with explicit fences as well.
I'm just concentrating on what extra burden implicit sync would get us.
- we need userspace to handle the stall in a submit thread at least
- there's nothing here that sets the sync object
- implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything
Good point, but that should be relatively easy to add as well.
- we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that.
Again as far as I can see we run into the same problem with explicit sync.
So the question is where could we block for atomic modeset for user fences in general?
- ... at this point I stopped pondering but there's definitely more
Imo the only way we'll even get the complete picture is if we do the following:
- roll out implicit sync with userspace fences on a driver-by-driver basis
1a. including all the winsys/modeset stuff
Completely agree, that's why I've split that up into individual patches.
I'm also fine if drivers can just opt out of user fence based synchronization and we return an error from dma_buf_dynamic_attach() if some driver says it can't handle that.
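
As a purely illustrative sketch of that opt-out, assuming a hypothetical dma_resv_has_user_fence() helper and a hypothetical supports_user_fences flag in dma_buf_attach_ops (neither exists today), dma_buf_dynamic_attach() could reject incompatible importers like this:

/* Illustrative only: refuse the attachment when the buffer is synchronized
 * with user fences but the importer has not declared support for them. */
static bool attach_allowed(struct dma_buf *dmabuf,
                           const struct dma_buf_attach_ops *importer_ops)
{
        if (!dma_resv_has_user_fence(dmabuf->resv))     /* assumed helper */
                return true;

        return importer_ops && importer_ops->supports_user_fences; /* assumed flag */
}

/* ... in dma_buf_dynamic_attach(): */
        if (!attach_allowed(dmabuf, importer_ops))
                return ERR_PTR(-EOPNOTSUPP);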
- roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers
2a. including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already.
Correct, but again we need this for explicit fencing as well.
- only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense.
Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future. That never worked :-)
For this prototype the patch set doesn't implement any user fence synchronization at all, but just assumes that faulting in user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even stricter; the only use case I could find which blocks this is the radeon driver, and that should be handleable.
This of course doesn't give you the same semantics as classic implicit sync, which guarantee that you have exclusive access to a buffer, but this is also not necessary.
So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general.
I'm not sure what exactly you're proving here aside from "it's possible to roll out a function with ill-defined semantics to all drivers". This really is a lot harder than just this one function and just this one patch set.
No it isn't. The hard part is getting the user sync stuff up in general.
Adding implicit synchronization on top of that is then rather trivial.
Christian.
-Daniel
On Tue, May 04, 2021 at 04:26:42PM +0200, Christian König wrote:
Hi Daniel,
Am 04.05.21 um 16:15 schrieb Daniel Vetter:
Hi Christian,
On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote:
Hi guys,
with this patch set I want to look into how much additional work it would be to support implicit sync compared to only explicit sync.
Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks.
It's a lot more I think
- sync_file/drm_syncobj still need to be supported somehow
You need that with explicit fences as well.
I'm just concentrating on what extra burden implicit sync would get us.
It's not just implicit sync. Currently the best approach we have for explicit sync is hiding them in drm_syncobj. Because for that all the work with intentional stall points and userspace submit thread already exists.
None of this work has been done for sync_file. And looking at how much work it was to get drm_syncobj going, that will be anything but easy.
- we need userspace to handle the stall in a submit thread at least
- there's nothing here that sets the sync object
- implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything
Good point, but that should be relatively easy to add as well.
- we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that.
Again as far as I can see we run into the same problem with explicit sync.
So the question is where could we block for atomic modeset for user fences in general?
Nah, I have an idea. But it only works if userspace is aware, because the rules are essentially:
- when you supply a userspace in-fence, then you only get a userspace out-fence
- mixing in fences between dma-fence and user fence is ok
- mixing out fences isn't
And we currently do have sync_file out fence. So it's not possible to support implicit user fence in atomic in a way which doesn't break the uapi somewhere.
Doing the explicit user fence support first will make that very obvious.
And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.
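
Just to make that rule concrete, a check along these lines would be needed somewhere in the atomic commit path; both helpers below are made up purely for illustration and don't exist:

/* Illustrative only: a userspace in-fence must never produce a dma_fence
 * based out-fence (e.g. an OUT_FENCE_PTR sync_file). */
static int check_user_fence_mixing(struct drm_atomic_state *state)
{
        if (!state_has_user_in_fence(state))            /* assumed helper */
                return 0;

        if (state_wants_dma_fence_out_fence(state))     /* assumed helper */
                return -EINVAL;

        return 0;
}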
Another one that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example).
- ... at this point I stopped pondering but there's definitely more
Imo the only way we'll even get the complete picture is if we do the following:
- roll out implicit sync with userspace fences on a driver-by-driver basis
s/implicit/explicit/
But I think you got that.
1a. including all the winsys/modeset stuff
Completely agree, that's why I've split that up into individual patches.
I'm also fine if drivers can just opt out of user fence based synchronization and we return an error from dma_buf_dynamic_attach() if some driver says it can't handle that.
Yeah, but that boils down to us just breaking those use-cases. Which is exactly what you're trying to avoid by rolling out implicit user fence I think.
- roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers
2a. including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already.
Correct, but again we need this for explicit fencing as well.
- only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense.
Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future. That never worked :-)
For this prototype the patch set doesn't implement any user fence synchronization at all, but just assumes that faulting in user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even stricter; the only use case I could find which blocks this is the radeon driver, and that should be handleable.
This of course doesn't give you the same semantics as classic implicit sync, which guarantee that you have exclusive access to a buffer, but this is also not necessary.
So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general.
I'm not sure what exactly you're proving here aside from "it's possible to roll out a function with ill-defined semantics to all drivers". This really is a lot harder than just this one function and just this one patch set.
No it isn't. The hard part is getting the user sync stuff up in general.
Adding implicit synchronization on top of that is then rather trivial.
Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.
Like here at intel we have internal code for compute, and we're starting to hit some interesting cases with interop with media already, but that's it. Nothing even close to desktop/winsys/kms, and that's where I expect all the pain will be.
Cheers, Daniel
Am 04.05.21 um 17:11 schrieb Daniel Vetter:
On Tue, May 04, 2021 at 04:26:42PM +0200, Christian König wrote:
Hi Daniel,
Am 04.05.21 um 16:15 schrieb Daniel Vetter:
Hi Christian,
On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote:
Hi guys,
with this patch set I want to look into how much additional work it would be to support implicit sync compared to only explicit sync.
Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks.
It's a lot more I think
- sync_file/drm_syncobj still need to be supported somehow
You need that with explicit fences as well.
I'm just concentrating on what extra burden implicit sync would get us.
It's not just implicit sync. Currently the best approach we have for explicit sync is hiding them in drm_syncobj. Because for that all the work with intentional stall points and userspace submit thread already exists.
None of this work has been done for sync_file. And looking at how much work it was to get drm_syncobj going, that will be anything but easy.
I don't think we will want this for sync_file in the first place.
- we need userspace to handle the stall in a submit thread at least
- there's nothing here that sets the sync object
- implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything
Good point, but that should be relatively easy to add as well.
- we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that.
Again as far as I can see we run into the same problem with explicit sync.
So the question is where could we block for atomic modeset for user fences in general?
Nah, I have an idea. But it only works if userspace is aware, because the rules are essentially:
- when you supply a userspace in-fence, then you only get a userspace out-fence
- mixing in fences between dma-fence and user fence is ok
- mixing out fences isn't
And we currently do have sync_file out fence. So it's not possible to support implicit user fence in atomic in a way which doesn't break the uapi somewhere.
Doing the explicit user fence support first will make that very obvious.
And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.
I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead.
What we should support is drm_syncobj, but that also only as an in-fence since that's what our hardware supports.
Another one that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example).
Yeah, we are certainly going to get that. But only inside the same driver, so not much of a problem.
- ... at this point I stopped pondering but there's definitely more
Imo the only way we'll even get the complete picture is if we do the following:
- roll out implicit sync with userspace fences on a driver-by-driver basis
s/implicit/explicit/
But I think you got that.
1a. including all the winsys/modeset stuff
Completely agree, that's why I've split that up into individual patches.
I'm also fine if drivers can just opt out of user fence based synchronization and we return an error from dma_buf_dynamic_attach() if some driver says it can't handle that.
Yeah, but that boils down to us just breaking those use-cases. Which is exactly what you're trying to avoid by rolling out implicit user fence I think.
But we can add support to all drivers as necessary.
- roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers
2a. including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already.
Correct, but again we need this for explicit fencing as well.
- only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense.
Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future. That never worked :-)
For this prototype the patch set doesn't implement any user fence synchronization at all, but just assumes that faulting in user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even stricter; the only use case I could find which blocks this is the radeon driver, and that should be handleable.
This of course doesn't give you the same semantics as classic implicit sync, which guarantee that you have exclusive access to a buffer, but this is also not necessary.
So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general.
I'm not sure what exactly you're proving here aside from "it's possible to roll out a function with ill-defined semantics to all drivers". This really is a lot harder than just this one function and just this one patch set.
No it isn't. The hard part is getting the user sync stuff up in general.
Adding implicit synchronization on top of that is then rather trivial.
Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.
But how should we then ever support user fences with the atomic IOCTL?
We can't wait in user space since that will disable the support for waiting in the hardware.
Regards, Christian.
Like here at intel we have internal code for compute, and we're starting to hit some interesting cases with interop with media already, but that's it. Nothing even close to desktop/winsys/kms, and that's where I expect all the pain will be.
Cheers, Daniel
On Mon, May 10, 2021 at 08:12:31PM +0200, Christian König wrote:
Am 04.05.21 um 17:11 schrieb Daniel Vetter:
On Tue, May 04, 2021 at 04:26:42PM +0200, Christian König wrote:
Hi Daniel,
Am 04.05.21 um 16:15 schrieb Daniel Vetter:
Hi Christian,
On Tue, May 04, 2021 at 03:27:17PM +0200, Christian König wrote:
Hi guys,
with this patch set I want to look into how much additional work it would be to support implicit sync compared to only explicit sync.
Turned out that this is much simpler than expected since the only addition is that before a command submission or flip the kernel and classic drivers would need to wait for the user fence to signal before taking any locks.
It's a lot more I think
- sync_file/drm_syncobj still need to be supported somehow
You need that with explicit fences as well.
I'm just concentrating on what extra burden implicit sync would get us.
It's not just implicit sync. Currently the best approach we have for explicit sync is hiding them in drm_syncobj. Because for that all the work with intentional stall points and userspace submit thread already exists.
None of this work has been done for sync_file. And looking at how much work it was to get drm_syncobj going, that will be anything but easy.
I don't think we will want this for sync_file in the first place.
- we need userspace to handle the stall in a submit thread at least
- there's nothing here that sets the sync object
- implicit sync isn't just execbuf, it's everything. E.g. the various wait_bo ioctl also need to keep working, including timeout and everything
Good point, but that should be relatively easy to add as well.
- we can't stall in atomic kms where you're currently stalling, that's for sure. The uapi says "we're not stalling for fences in there", and you're breaking that.
Again as far as I can see we run into the same problem with explicit sync.
So the question is where could we block for atomic modeset for user fences in general?
Nah, I have an idea. But it only works if userspace is aware, because the rules are essentially:
- when you supply a userspace in-fence, then you only get a userspace out-fence
- mixing in fences between dma-fence and user fence is ok
- mixing out fences isn't
And we currently do have sync_file out fence. So it's not possible to support implicit user fence in atomic in a way which doesn't break the uapi somewhere.
Doing the explicit user fence support first will make that very obvious.
And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.
I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead.
Uh that's rather bold. Android is built on it. Currently atomic kms is built on it.
What we should support is drm_syncobj, but that also only as an in-fence since that's what our hardware supports.
Convince Android folks, minimally. Probably a lot more. Yes with hindsight we should have just gone for drm_syncobj instead of the sync_file thing, but hindsight and all that.
This is kinda why I don't think trying to support the existing uapi with userspace fences underneath with some magic tricks is a good idea. It's just a pile of work, plus it's not really architecturally clean.
Another one that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example).
Yeah, we are certainly going to get that. But only inside the same driver, so not much of a problem.
Why is this not much of a problem if it's just within one driver?
- ... at this point I stopped pondering but there's definitely more
Imo the only way we'll even get the complete picture is if we do the following:
- roll out implicit sync with userspace fences on a driver-by-driver basis
s/implicit/explicit/
But I think you got that.
1a. including all the winsys/modeset stuff
Completely agree, that's why I've split that up into individual patches.
I'm also fine if drivers can just opt out of user fence based synchronization and we return an error from dma_buf_dynamic_attach() if some driver says it can't handle that.
Yeah, but that boils down to us just breaking those use-cases. Which is exactly what you're trying to avoid by rolling out implicit user fence I think.
But we can add support to all drivers as necessary.
- roll out support for userspace fences to drm_syncobj timeline for interop, both across process/userspace and across drivers
2a. including all the winsys/modeset stuff, but hopefully that's largely solved with 1. already.
Correct, but again we need this for explicit fencing as well.
- only then try to figure out how to retroshoehorn this into implicit sync, and whether that even makes sense.
Because doing 3 before we've done 1&2 for at least 2 drivers (2 because interop fun across drivers) is just praying that this time around we're not collectively idiots and can correctly predict the future. That never worked :-)
For this prototype the patch set doesn't implement any user fence synchronization at all, but just assumes that faulting in user pages is sufficient to make sure that we can wait for user space to finish submitting the work. If necessary this can be made even stricter; the only use case I could find which blocks this is the radeon driver, and that should be handleable.
This of course doesn't give you the same semantics as classic implicit sync, which guarantee that you have exclusive access to a buffer, but this is also not necessary.
So I think the conclusion should be that we don't need to concentrate on implicit vs. explicit sync, but rather how to get the synchronization and timeout signalling figured out in general.
I'm not sure what exactly you're proving here aside from "it's possible to roll out a function with ill-defined semantics to all drivers". This really is a lot harder than just this one function and just this one patch set.
No it isn't. The hard part is getting the user sync stuff up in general.
Adding implicit synchronization on top of that is then rather trivial.
Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.
But how should we then ever support user fences with the atomic IOCTL?
We can't wait in user space since that will disable the support for waiting in the hardware.
Well, figure it out :-)
This is exactly why I'm not seeing anything solved with just rolling a function call to a bunch of places, because it's pretending all things are solved when clearly that's not the case.
I really think what we need is to first figure out how to support userspace fences as explicit entities across the stack, maybe with something like this order:
1. enable them purely within a single userspace driver (like vk with winsys disabled, or something else like that except not amd because there's this amdkfd split for "real" compute)
1a. including atomic ioctl, e.g. for vk direct display support this can be used without cross-process sharing, new winsys protocols and all that fun
2. figure out how to transport these userspace fences with something like drm_syncobj
2a. figure out the compat story for drivers which don't do userspace fences
2b. figure out how to absorb the overhead if the winsys/compositor doesn't support explicit sync
3. maybe figure out how to make this all happen magically with implicit sync, if we really, really care
If we do 3 before we've nailed all these problems, we're just guaranteeing we'll get the wrong solutions and so we'll then have 3 ways of doing userspace fences:
- the butchered implicit one that didn't quite work
- the explicit one
- the not-so-butchered implicit one with the lessons from the properly done explicit one
The thing is, if you have no idea how to integrate userspace fences explicitly into atomic ioctl, then you definitely have no idea how to do it implicitly :-)
And "just block" might be good enough for a quick demo, it still breaks the contract. Same holds for a bunch of the winsys problems we'll have to deal with here. -Daniel
Regards, Christian.
Like here at intel we have internal code for compute, and we're starting to hit some interesting cases with interop with media already, but that's it. Nothing even close to desktop/winsys/kms, and that's where I expect all the pain will be.
Cheers, Daniel
Am 11.05.21 um 09:31 schrieb Daniel Vetter:
[SNIP]
And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.
I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead.
Uh that's rather bold. Android is built on it. Currently atomic kms is built on it.
To be honest I don't think we care about Android at all.
What we should support is drm_syncobj, but that also only as an in-fence since that's what our hardware supports.
Convince Android folks, minimally. Probably a lot more. Yes with hindsight we should have just gone for drm_syncobj instead of the sync_file thing, but hindsight and all that.
This is kinda why I don't think trying to support the existing uapi with userspace fences underneath with some magic tricks is a good idea. It's just a pile of work, plus it's not really architecturally clean.
Another one that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example).
Yeah, we are certainly going to get that. But only inside the same driver, so not much of a problem.
Why is this not much of a problem if it's just within one driver?
Because inside the same driver I can easily add the waits before submitting the MM work as necessary.
Adding implicit synchronization on top of that is then rather trivial.
Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.
But how should we then ever support user fences with the atomic IOCTL?
We can't wait in user space since that will disable the support for waiting in the hardware.
Well, figure it out :-)
This is exactly why I'm not seeing anything solved with just rolling a function call to a bunch of places, because it's pretending all things are solved when clearly that's not the case.
I really think what we need is to first figure out how to support userspace fences as explicit entities across the stack, maybe with something like this order:
1. enable them purely within a single userspace driver (like vk with winsys disabled, or something else like that except not amd because there's this amdkfd split for "real" compute)
1a. including atomic ioctl, e.g. for vk direct display support this can be used without cross-process sharing, new winsys protocols and all that fun
2. figure out how to transport these userspace fences with something like drm_syncobj
2a. figure out the compat story for drivers which don't do userspace fences
2b. figure out how to absorb the overhead if the winsys/compositor doesn't support explicit sync
3. maybe figure out how to make this all happen magically with implicit sync, if we really, really care
If we do 3 before we've nailed all these problems, we're just guaranteeing we'll get the wrong solutions and so we'll then have 3 ways of doing userspace fences
- the butchered implicit one that didn't quite work
- the explicit one
- the not-so-butchered implicit one with the lessons from the properly done explicit one
The thing is, if you have no idea how to integrate userspace fences explicitly into atomic ioctl, then you definitely have no idea how to do it implicitly :-)
Well I agree on that. But the question is still how would you do explicit with atomic?
Transporting fences between processes is not the fundamental problem here, but rather the question how we represent all this in the kernel?
In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking what the kernel needs to support this you take a look at userspace and the requirements there.
Regards, Christian.
And "just block" might be good enough for a quick demo, it still breaks the contract. Same holds for a bunch of the winsys problems we'll have to deal with here. -Daniel
Regards, Christian.
Like here at intel we have internal code for compute, and we're starting to hit some interesting cases with interop with media already, but that's it. Nothing even close to desktop/winsys/kms, and that's where I expect all the pain will be.
Cheers, Daniel
On Tue, May 11, 2021 at 09:47:56AM +0200, Christian König wrote:
Am 11.05.21 um 09:31 schrieb Daniel Vetter:
[SNIP]
And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.
I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead.
Uh that's rather bold. Android is built on it. Currently atomic kms is built on it.
To be honest I don't think we care about Android at all.
we = amd or we = upstream here?
What we should support is drm_syncobj, but that also only as an in-fence since that's what our hardware supports.
Convince Android folks, minimally. Probably a lot more. Yes with hindsight we should have just gone for drm_syncobj instead of the sync_file thing, but hindsight and all that.
This is kinda why I don't think trying to support the existing uapi with userspace fences underneath with some magic tricks is a good idea. It's just a pile of work, plus it's not really architecturally clean.
Another one that looks very sketchy right now is buffer sharing between different userspace drivers, like compute <-> media (if you have some fancy AI pipeline in your media workload, as an example).
Yeah, we are certainly going to get that. But only inside the same driver, so not much of a problem.
Why is this not much of a problem if it's just within one driver?
Because inside the same driver I can easily add the waits before submitting the MM work as necessary.
What is MM work here now?
Adding implicit synchronization on top of that is then rather trivial.
Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.
But how should we then ever support user fences with the atomic IOCTL?
We can't wait in user space since that will disable the support for waiting in the hardware.
Well, figure it out :-)
This is exactly why I'm not seeing anything solved with just rolling a function call to a bunch of places, because it's pretending all things are solved when clearly that's not the case.
I really think what we need is to first figure out how to support userspace fences as explicit entities across the stack, maybe with something like this order:
1. enable them purely within a single userspace driver (like vk with winsys disabled, or something else like that except not amd because there's this amdkfd split for "real" compute)
1a. including atomic ioctl, e.g. for vk direct display support this can be used without cross-process sharing, new winsys protocols and all that fun
2. figure out how to transport these userspace fences with something like drm_syncobj
2a. figure out the compat story for drivers which don't do userspace fences
2b. figure out how to absorb the overhead if the winsys/compositor doesn't support explicit sync
3. maybe figure out how to make this all happen magically with implicit sync, if we really, really care
If we do 3 before we've nailed all these problems, we're just guaranteeing we'll get the wrong solutions and so we'll then have 3 ways of doing userspace fences
- the butchered implicit one that didn't quite work
- the explicit one
- the not-so-butchered implicit one with the lessons from the properly done explicit one
The thing is, if you have no idea how to integrate userspace fences explicitly into atomic ioctl, then you definitely have no idea how to do it implicitly :-)
Well I agree on that. But the question is still how would you do explicit with atomic?
If you supply a userspace fence (is that what we call them now) as in-fence, then you're only allowed to get a userspace fence as out-fence. That way we
- don't block anywhere we shouldn't
- don't create a dma_fence out of a userspace fence
The problem is this completely breaks your "magically make implicit fencing with userspace fences" plan.
So I have a plan here, what was yours?
Transporting fences between processes is not the fundamental problem here, but rather the question how we represent all this in the kernel?
In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking what the kernel needs to support this you take a look at userspace and the requirements there.
Uh ... that was my idea here? That's why I put "build userspace fences in userspace only" as the very first thing. Then extend to winsys and atomic/display and all these cases where things get more tricky.
I agree that transporting the fences is easy, which is why it's not interesting trying to solve that problem first. Which is kinda what you're trying to do here by adding implicit userspace fences (well not even that, just a bunch of function calls without any semantics attached to them).
So if there's more here, you need to flesh it out more or I just dont get what you're actually trying to demonstrate. -Daniel
Regards, Christian.
And "just block" might be good enough for a quick demo, it still breaks the contract. Same holds for a bunch of the winsys problems we'll have to deal with here. -Daniel
Regards, Christian.
Like here at intel we have internal code for compute, and we're starting to hit some interesting cases with interop with media already, but that's it. Nothing even close to desktop/winsys/kms, and that's where I expect all the pain will be.
Cheers, Daniel
Am 11.05.21 um 16:23 schrieb Daniel Vetter:
On Tue, May 11, 2021 at 09:47:56AM +0200, Christian König wrote:
Am 11.05.21 um 09:31 schrieb Daniel Vetter:
[SNIP]
And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.
I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead.
Uh that's rather bold. Android is built on it. Currently atomic kms is built on it.
To be honest I don't think we care about Android at all.
we = amd or we = upstream here?
we = amd, for everybody else that is certainly a different topic.
But for now AMD is the only one running into this problem.
Could be that Nouveau sees this as well with the next hw generation, but who knows?
Why is this not much of a problem if it's just within one driver?
Because inside the same driver I can easily add the waits before submitting the MM work as necessary.
What is MM work here now?
MM=multimedia, e.g. UVD, VCE, VCN engines on AMD hardware.
Adding implicit synchronization on top of that is then rather trivial.
Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.
But how should we then ever support user fences with the atomic IOCTL?
We can't wait in user space since that will disable the support for waiting in the hardware.
Well, figure it out :-)
This is exactly why I'm not seeing anything solved with just rolling a function call to a bunch of places, because it's pretending all things are solved when clearly that's not the case.
I really think what we need is to first figure out how to support userspace fences as explicit entities across the stack, maybe with something like this order:
1. enable them purely within a single userspace driver (like vk with winsys disabled, or something else like that except not amd because there's this amdkfd split for "real" compute)
1a. including atomic ioctl, e.g. for vk direct display support this can be used without cross-process sharing, new winsys protocols and all that fun
2. figure out how to transport these userspace fences with something like drm_syncobj
2a. figure out the compat story for drivers which don't do userspace fences
2b. figure out how to absorb the overhead if the winsys/compositor doesn't support explicit sync
3. maybe figure out how to make this all happen magically with implicit sync, if we really, really care
If we do 3 before we've nailed all these problems, we're just guaranteeing we'll get the wrong solutions and so we'll then have 3 ways of doing userspace fences
- the butchered implicit one that didn't quite work
- the explicit one
- the not-so-butchered implicit one with the lessons from the properly done explicit one
The thing is, if you have no idea how to integrate userspace fences explicitly into atomic ioctl, then you definitely have no idea how to do it implicitly :-)
Well I agree on that. But the question is still how would you do explicit with atomic?
If you supply a userspace fence (is that what we call them now) as in-fence, then you're only allowed to get a userspace fence as out-fence.
Yeah, that part makes perfect sense. But I don't see the problem with that?
That way we
- don't block anywhere we shouldn't
- don't create a dma_fence out of a userspace fence
The problem is this completely breaks your "magically make implicit fencing with userspace fences" plan.
Why?
So I have a plan here, what was yours?
As far as I see that should still work perfectly fine and I have the strong feeling I'm missing something here.
Transporting fences between processes is not the fundamental problem here, but rather the question how we represent all this in the kernel?
In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking what the kernel needs to support this you take a look at userspace and the requirements there.
Uh ... that was my idea here? That's why I put "build userspace fences in userspace only" as the very first thing. Then extend to winsys and atomic/display and all these cases where things get more tricky.
I agree that transporting the fences is easy, which is why it's not interesting trying to solve that problem first. Which is kinda what you're trying to do here by adding implicit userspace fences (well not even that, just a bunch of function calls without any semantics attached to them).
So if there's more here, you need to flesh it out more or I just dont get what you're actually trying to demonstrate.
Well I'm trying to figure out why you see it as such a problem to keep implicit sync around.
As far as I can tell implicit/explicit and dma_fence/user_fence are completely orthogonal.
It's just a different implementation inside the kernel.
Christian.
-Daniel
Regards, Christian.
And "just block" might be good enough for a quick demo, it still breaks the contract. Same holds for a bunch of the winsys problems we'll have to deal with here. -Daniel
Regards, Christian.
Like here at intel we have internal code for compute, and we're starting to hit some interesting cases with interop with media already, but that's it. Nothing even close to desktop/winsys/kms, and that's where I expect all the pain will be.
Cheers, Daniel
On Tue, May 11, 2021 at 05:32:29PM +0200, Christian König wrote:
Am 11.05.21 um 16:23 schrieb Daniel Vetter:
On Tue, May 11, 2021 at 09:47:56AM +0200, Christian König wrote:
Am 11.05.21 um 09:31 schrieb Daniel Vetter:
[SNIP]
And that's just the one ioctl I know is big trouble, I'm sure we'll find more funny corner cases when we roll out explicit user fencing.
I think we can just ignore sync_file. As far as it concerns me that UAPI is pretty much dead.
Uh that's rather bold. Android is built on it. Currently atomic kms is built on it.
To be honest I don't think we care about Android at all.
we = amd or we = upstream here?
we = amd, for everybody else that is certainly a different topic.
But for now AMD is the only one running into this problem.
Could be that Nouveau sees this as well with the next hw generation, but who knows?
Why is this not much of a problem if it's just within one driver?
Because inside the same driver I can easily add the waits before submitting the MM work as necessary.
What is MM work here now?
MM=multimedia, e.g. UVD, VCE, VCN engines on AMD hardware.
Adding implicit synchronization on top of that is then rather trivial.
Well that's what I disagree with, since I already see some problems that I don't think we can overcome (the atomic ioctl is one). And that's with us only having a fairly theoretical understanding of the overall situation.
But how should we then ever support user fences with the atomic IOCTL?
We can't wait in user space since that will disable the support for waiting in the hardware.
Well, figure it out :-)
This is exactly why I'm not seeing anything solved with just rolling out a function call to a bunch of places, because it's pretending all things are solved when clearly that's not the case.
I really think what we need is to first figure out how to support userspace fences as explicit entities across the stack, maybe with something like this order:
1. enable them purely within a single userspace driver (like vk with winsys disabled, or something else like that, except not amd because there's this amdkfd split for "real" compute)
1a. including atomic ioctl, e.g. for vk direct display support this can be used without cross-process sharing, new winsys protocols and all that fun
2. figure out how to transport these userspace fences with something like drm_syncobj
2a. figure out the compat story for drivers which don't do userspace fences
2b. figure out how to absorb the overhead if the winsys/compositor doesn't support explicit sync
3. maybe figure out how to make this all happen magically with implicit sync, if we really, really care
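[Aside, since step 1 already assumes them: a "userspace fence" in this discussion is essentially just a 64-bit sequence number in GPU-visible memory, and waiting happens in userspace itself rather than through a kernel dma_fence. A minimal sketch, with all names invented:]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct user_fence {
	_Atomic uint64_t *seqno;	/* mapped, GPU-visible shared memory */
	uint64_t wait_value;		/* signaled once *seqno >= wait_value */
};

static bool user_fence_signaled(const struct user_fence *f)
{
	return atomic_load_explicit(f->seqno, memory_order_acquire) >= f->wait_value;
}

/* Waiting is a plain userspace poll; a real implementation would use a futex
 * or a hardware wait instruction instead of spinning. */
static void user_fence_wait(const struct user_fence *f)
{
	while (!user_fence_signaled(f))
		;
}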
If we do 3 before we've nailed all these problems, we're just guaranteeing we'll get the wrong solutions and so we'll then have 3 ways of doing userspace fences
- the butchered implicit one that didn't quite work
- the explicit one
- the not-so-butchered implicit one with the lessons from the properly done explicit one
The thing is, if you have no idea how to integrate userspace fences explicitly into atomic ioctl, then you definitely have no idea how to do it implicitly :-)
Well I agree on that. But the question is still how would you do explicit with atomic?
If you supply a userspace fence (is that what we call them now?) as in-fence, then you're only allowed to get a userspace fence as out-fence.
Yeah, that part makes perfect sense. But I don't see the problem with that?
That way we
- don't block anywhere we shouldn't
- don't create a dma_fence out of a userspace fence
The problem is this completely breaks your "magically make implicit fencing with userspace fences" plan.
Why?
If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out fence
Which is not allowed. So there's really no way to make this work, except if you stall in the ioctl, which also doesn't work.
So you have to do an uapi change here. At that point we might as well do it right.
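[To make that rule concrete, a sketch of the check an atomic commit path would roughly need; the two flags and the helper are invented, only the constraint itself is from the discussion:]

#include <linux/errno.h>
#include <linux/types.h>

/*
 * A sync_file out-fence is backed by a dma_fence and therefore must signal
 * in bounded time. If any in-fence is a userspace fence that cannot be
 * guaranteed, so the combination has to be rejected up front instead of
 * stalling in the ioctl.
 */
static int check_fence_types(bool in_fence_is_user_fence,
			     bool out_fence_is_sync_file)
{
	if (in_fence_is_user_fence && out_fence_is_sync_file)
		return -EINVAL;

	return 0;
}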
Of course if you only care about some specific compositors (or maybe only the -amdgpu Xorg driver even) then this isn't a concern, but atomic is cross-driver so we can't do that. Or at least I don't see a way to do this without causing endless amounts of fun down the road.
So I have a plan here, what was yours?
As far as I see that should still work perfectly fine and I have the strong feeling I'm missing something here.
Transporting fences between processes is not the fundamental problem here, but rather the question of how we represent all this in the kernel.
In other words I think what you outlined above is just approaching it from the wrong side again. Instead of looking at what the kernel needs to support this, you take a look at userspace and the requirements there.
Uh ... that was my idea here? That's why I put "build userspace fences in userspace only" as the very first thing. Then extend to winsys and atomic/display and all these cases where things get more tricky.
I agree that transporting the fences is easy, which is why it's not interesting trying to solve that problem first. Which is kinda what you're trying to do here by adding implicit userspace fences (well not even that, just a bunch of function calls without any semantics attached to them).
So if there's more here, you need to flesh it out more or I just don't get what you're actually trying to demonstrate.
Well I'm trying to figure out why you see it as such a problem to keep implicit sync around.
As far as I can tell the implicit/explicit and dma_fence/user_fence questions are completely orthogonal.
It's just a different implementation inside the kernel.
See above. It falls apart with the atomic ioctl. -Daniel
On 11.05.21 at 18:48, Daniel Vetter wrote:
[SNIP]
Why?
If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out fence
Which is not allowed. So there's really no way to make this work, except if you stall in the ioctl, which also doesn't work.
Ok, wait a second. I really don't understand what's going on here.
The out fence is just to let userspace know when the frame is displayed. Or rather when the old frame is no longer displayed so that it can be reused, right?
Then why does that need to be a dma_fence? We don't use that for memory management anywhere, do we?
So you have to do an uapi change here. At that point we might as well do it right.
I mean in the worst case we might need to allow user fences with sync_files as well when that is really used outside of Android.
But I still don't see the fundamental problem here.
Regards, Christian.
On Tue, May 11, 2021 at 09:34:11PM +0200, Christian König wrote:
On 11.05.21 at 18:48, Daniel Vetter wrote:
[SNIP]
Why?
If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out fence
Which is not allowed. So there's really no way to make this work, except if you stall in the ioctl, which also doesn't work.
Ok, wait a second. I really don't understand what's going on here.
The out fence is just to let userspace know when the frame is displayed. Or rather when the old frame is no longer displayed so that it can be reused, right?
Then why does that need to be a dma_fence? We don't use that for memory management anywhere, do we?
It can be a sync_file, so you can queue up the next rendering targeting the old backbuffer right away. On memory constrained devices where triple-buffering is prohibitive this is apparently a pretty cool trick, or something like that.
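[Concretely, this is roughly how userspace consumes that out-fence with libdrm today; the CRTC's OUT_FENCE_PTR property id is assumed to have been looked up already and error handling is trimmed:]

#include <poll.h>
#include <stdint.h>
#include <unistd.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int commit_and_reuse_backbuffer(int drm_fd, uint32_t crtc_id,
				uint32_t out_fence_ptr_prop_id)
{
	drmModeAtomicReq *req = drmModeAtomicAlloc();
	int32_t out_fence_fd = -1;
	struct pollfd pfd;
	int ret;

	/* The kernel writes a sync_file fd into out_fence_fd on commit. */
	drmModeAtomicAddProperty(req, crtc_id, out_fence_ptr_prop_id,
				 (uint64_t)(uintptr_t)&out_fence_fd);
	/* ... add the new framebuffer and the other plane/CRTC properties ... */

	ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
	drmModeAtomicFree(req);
	if (ret)
		return ret;

	/* Queue the next frame's rendering right away; only block on the
	 * fence right before actually touching the old backbuffer again. */
	pfd.fd = out_fence_fd;
	pfd.events = POLLIN;
	poll(&pfd, 1, -1);
	close(out_fence_fd);
	return 0;
}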
So you have to do an uapi change here. At that point we might as well do it right.
I mean in the worst case we might need to allow user fences with sync_files as well when that is really used outside of Android.
But I still don't see the fundamental problem here.
Making sync_file use implicit is more pain, it becomes sprawling pretty quickly.
Anyway I think we're just turning in circles. My take is that your patch series here demonstrates nothing beyond "adding function calls that do nothing is easy", the real deal is in making it work. And I think it's easier to make this all work with explicit userspace fencing first and then worry about how we maybe make this work implicitly. Instead of what you propose, which is to rig up a lot of scaffolding without having any idea whether it's even in the right place. That seems very backwards to me.
Plus I really don't like retro-fitting userspace fences into implicit sync and sync_file and everything. But that's not the main issue I have here. -Daniel
On 12.05.21 at 10:13, Daniel Vetter wrote:
On Tue, May 11, 2021 at 09:34:11PM +0200, Christian König wrote:
On 11.05.21 at 18:48, Daniel Vetter wrote:
[SNIP]
Why?
If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out fence
Which is not allowed. So there's really no way to make this work, except if you stall in the ioctl, which also doesn't work.
Ok, wait a second. I really don't understand what's going on here.
The out fence is just to let userspace know when the frame is displayed. Or rather when the old frame is no longer displayed so that it can be reused, right?
Then why does that need to be a dma_fence? We don't use that for memory management anywhere, do we?
It can be a sync_file, so you can queue up the next rendering targeting the old backbuffer right away. On memory constrained devices where triple-buffering is prohibitive this is apparently a pretty cool trick, or something like that.
Yeah, I'm aware of that.
But we don't really need that for the devices where we want to interop with userspace queues, do we?
So you have to do an uapi change here. At that point we might as well do it right.
I mean in the worst case we might need to allow user fences with sync_files as well when that is really used outside of Android.
But I still don't see the fundamental problem here.
Making sync_file use implicit is more pain, it becomes sprawling pretty quickly.
Agreed, but I don't see supporting sync_file and out fences as something necessary.
As far as I can see we don't need to take on that burden for the use cases we have for implicit sync.
And even if we have to, it is possible to implement.
Anyway I think we're just turning in circles. My take is that your patch series here demonstrates nothing beyond "adding function calls that do nothing is easy", the real deal is in making it work. And I think it's easier to make this all work with explicit userspace fencing first and then worry about how we maybe make this work implicitly. Instead of what you propose, which is rig up a lot of scaffolding without having any idea whether it's even in the right place. That seems very backwards to me.
And that's what I disagree on.
Supporting implicit sync in the kernel for the functionality we need for amdgpu is relatively easy.
Changing all of userspace to not rely on implicit sync any more is really hard compared to that.
Dropping implicit sync in userspace has a lot of advantages and should be pushed for, but not because it is a prerequisite for user fences.
Regards, Christian.
On Wed, May 12, 2021 at 10:23:04AM +0200, Christian König wrote:
On 12.05.21 at 10:13, Daniel Vetter wrote:
On Tue, May 11, 2021 at 09:34:11PM +0200, Christian König wrote:
On 11.05.21 at 18:48, Daniel Vetter wrote:
[SNIP]
Why?
If you allow implicit fencing then you can end up with
- an implicit userspace fence as the in-fence
- but an explicit dma_fence as the out fence
Which is not allowed. So there's really no way to make this work, except if you stall in the ioctl, which also doesn't work.
Ok, wait a second. I really don't understand what's going on here.
The out fence is just to let userspace know when the frame is displayed. Or rather when the old frame is no longer displayed so that it can be reused, right?
Then why does that need to be a dma_fence? We don't use that for memory management anywhere, do we?
It can be a sync_file, so you can queue up the next rendering targeting the old backbuffer right away. On memory constrained devices where triple-buffering is prohibitive this is apparently a pretty cool trick, or something like that.
Yeah, I'm aware of that.
But we don't really need that for the devices where we want to interop with userspace queues, do we?
So you have to do an uapi change here. At that point we might as well do it right.
I mean in the worst case we might need to allow user fences with sync_files as well when that is really used outside of Android.
But I still don't see the fundamental problem here.
Making sync_file use implicit is more pain, it becomes sprawling pretty quickly.
Agreed, but I don't see supporting sync_file and out fences as something necessary.
As far as I can see we don't need to take on that burden for the use cases we have for implicit sync.
And even if we have to, it is possible to implement.
Anyway I think we're just turning in circles. My take is that your patch series here demonstrates nothing beyond "adding function calls that do nothing is easy", the real deal is in making it work. And I think it's easier to make this all work with explicit userspace fencing first and then worry about how we maybe make this work implicitly. Instead of what you propose, which is to rig up a lot of scaffolding without having any idea whether it's even in the right place. That seems very backwards to me.
And that's what I disagree on.
Supporting implicit sync in the kernel for the functionality we need for amdgpu is relatively easy.
Changing all of userspace to not rely on implicit sync any more is really hard compared to that.
Dropping implicit sync in userspace has a lot of advantages and should be pushed for, but not because it is a prerequisite for user fences.
Yeah but if we support userspace fences with implicit sync only, because that's right now the only thing amdgpu needs, then we make really, really sure the transition to explicit sync will never happen.
Also if the goal really is to only do the minimal thing to keep amdgpu working on existing compositors, then imo the right solution is to just do the kernel-augmented userspace submit I've proposed. Not this.
The kernel-augmented userspace submit keeps all dma_fence stuff working _exactly_ like it currently does, including drm/scheduler sorting dependencies and everything. So from a "will it work" point of view it's a much more solid solution.
And it doesn't require us to create an extremely warty uapi where, if you happen to use implicit userspace fences, a lot of things kinda stop working for no reason.
So if your overall goal is to just not touch any compositor/winsys protocol at all, there _is_ a very clean solution imo. -Daniel
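[One possible reading of that "kernel-augmented userspace submit" idea, as a very rough sketch: submission still goes through a kernel ioctl and drm/scheduler, which hands out an ordinary dma_fence, and the userspace-prepared queue is only kicked once all dependencies have signaled. Only the run_job callback shape is real, the rest is invented:]

#include <linux/dma-fence.h>
#include <linux/kernel.h>
#include <drm/gpu_scheduler.h>

struct augmented_submit {
	struct drm_sched_job base;	/* gets dependency sorting for free */
	u64 user_queue_doorbell;	/* invented: what to ring when it's safe */
};

/* invented driver helper that kicks the userspace-built queue and returns
 * the hardware fence for the resulting work */
struct dma_fence *my_driver_kick_user_queue(u64 doorbell);

/* drm_sched_backend_ops.run_job: the scheduler calls this only after all
 * dma_fence dependencies of the job have signaled. */
static struct dma_fence *augmented_submit_run(struct drm_sched_job *sched_job)
{
	struct augmented_submit *s =
		container_of(sched_job, struct augmented_submit, base);

	return my_driver_kick_user_queue(s->user_queue_doorbell);
}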