Hello,
I've been playing with Vulkan lately and struggled quite a bit to implement VkQueueSubmit with the submit ioctl we have. There are several limiting factors that can be worked around if we really have to, but I think it'd be much easier and future-proof if we introduce a new ioctl that addresses the current limitations:
1/ There can only be one out_sync, but Vulkan might ask us to signal several VkSemaphores and possibly one VkFence too, both of those being based on sync objects in my PoC. Making out_sync an array of syncobjs to attach the render_done fence to would make that possible. The other option would be to collect syncobj updates in userspace in a separate thread and propagate those updates to all semaphores+fences waiting on those events (I think the v3dv driver does something like that, but I didn't spend enough time studying the code to be sure, so I might be wrong).
2/ Queued jobs might be executed out-of-order (unless they have explicit/implicit deps between them), and Vulkan asks that the out fence be signaled when all jobs are done. Timeline syncobjs are a good match for that use case. All we need to do is pass the same fence syncobj to all jobs being attached to a single QueueSubmit request, but a different point on the timeline. The syncobj timeline wait does the rest and guarantees that we've reached a given timeline point (IOW, all jobs before that point are done) before declaring the fence as signaled. One alternative would be to have dummy 'synchronization' jobs that don't actually execute anything on the GPU but declare a dependency on all other jobs that are part of the QueueSubmit request, and signal the out fence (the scheduler would do most of the work for us, all we have to do is support NULL job heads and signal the fence directly when that happens instead of queueing the job).
3/ The current implementation lacks information about BO access, so we serialize all jobs accessing the same set of BOs, even if those jobs might just be reading from them (which can happen concurrently). Other drivers pass an access type to the list of referenced BOs to address that. Another option would be to disable implicit deps (deps based on BOs) and force the driver to pass all deps explicitly (interestingly, some drivers have both the no-implicit-dep and r/w flags, probably to support sub-resource access, so we might want to add that one too). I don't see any userspace workaround to that problem, so that one alone would justify extending the existing ioctl or adding a new one.
4/ There's also the fact that submitting one job at a time adds an overhead when QueueSubmit is being passed more than one CommandBuffer. That one is less problematic, but if we're adding a new ioctl we'd better design it to limit the userspace -> kernel transition overhead.
Right now I'm just trying to collect feedback. I don't intend to get those patches merged until we have a userspace user, but I thought starting the discussion early would be a good thing.
Feel free to suggest other approaches.
Regards,
Boris
Boris Brezillon (7):
  drm/panfrost: Pass a job to panfrost_{acquire,attach_object_fences}()
  drm/panfrost: Collect implicit and explicit deps in an XArray
  drm/panfrost: Move the mappings collection out of panfrost_lookup_bos()
  drm/panfrost: Add BO access flags to relax dependencies between jobs
  drm/panfrost: Add a new ioctl to submit batches
  drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature
  drm/panfrost: Bump minor version to reflect the feature additions
 drivers/gpu/drm/panfrost/panfrost_drv.c | 408 +++++++++++++++++++++---
 drivers/gpu/drm/panfrost/panfrost_job.c |  80 +++--
 drivers/gpu/drm/panfrost/panfrost_job.h |   8 +-
 include/uapi/drm/panfrost_drm.h         |  83 +++++
 4 files changed, 483 insertions(+), 96 deletions(-)
So we don't have to change the prototype if we extend the function.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c index 6003cfeb1322..564054a96ca9 100644 --- a/drivers/gpu/drm/panfrost/panfrost_job.c +++ b/drivers/gpu/drm/panfrost/panfrost_job.c @@ -196,24 +196,20 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js) job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START); }
-static void panfrost_acquire_object_fences(struct drm_gem_object **bos, - int bo_count, - struct dma_fence **implicit_fences) +static void panfrost_acquire_object_fences(struct panfrost_job *job) { int i;
- for (i = 0; i < bo_count; i++) - implicit_fences[i] = dma_resv_get_excl_rcu(bos[i]->resv); + for (i = 0; i < job->bo_count; i++) + job->implicit_fences[i] = dma_resv_get_excl_rcu(job->bos[i]->resv); }
-static void panfrost_attach_object_fences(struct drm_gem_object **bos, - int bo_count, - struct dma_fence *fence) +static void panfrost_attach_object_fences(struct panfrost_job *job) { int i;
- for (i = 0; i < bo_count; i++) - dma_resv_add_excl_fence(bos[i]->resv, fence); + for (i = 0; i < job->bo_count; i++) + dma_resv_add_excl_fence(job->bos[i]->resv, job->render_done_fence); }
int panfrost_job_push(struct panfrost_job *job) @@ -243,15 +239,13 @@ int panfrost_job_push(struct panfrost_job *job)
kref_get(&job->refcount); /* put by scheduler job completion */
- panfrost_acquire_object_fences(job->bos, job->bo_count, - job->implicit_fences); + panfrost_acquire_object_fences(job);
drm_sched_entity_push_job(&job->base, entity);
mutex_unlock(&pfdev->sched_lock);
- panfrost_attach_object_fences(job->bos, job->bo_count, - job->render_done_fence); + panfrost_attach_object_fences(job);
unlock: drm_gem_unlock_reservations(job->bos, job->bo_count, &acquire_ctx);
This way we can re-use the standard drm_gem_fence_array_add_implicit() helper and simplify the panfrost_job_dependency() logic.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 42 +++++++-----------
 drivers/gpu/drm/panfrost/panfrost_job.c | 57 +++++++++++--------------
 drivers/gpu/drm/panfrost/panfrost_job.h |  7 +--
 3 files changed, 42 insertions(+), 64 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index 83a461bdeea8..86c1347c6f0e 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -137,12 +137,6 @@ panfrost_lookup_bos(struct drm_device *dev, if (!job->bo_count) return 0;
- job->implicit_fences = kvmalloc_array(job->bo_count, - sizeof(struct dma_fence *), - GFP_KERNEL | __GFP_ZERO); - if (!job->implicit_fences) - return -ENOMEM; - ret = drm_gem_objects_lookup(file_priv, (void __user *)(uintptr_t)args->bo_handles, job->bo_count, &job->bos); @@ -173,8 +167,7 @@ panfrost_lookup_bos(struct drm_device *dev, }
/** - * panfrost_copy_in_sync() - Sets up job->in_fences[] with the sync objects - * referenced by the job. + * panfrost_copy_in_sync() - Add in_syncs to the job->deps array. * @dev: DRM device * @file_priv: DRM file for this fd * @args: IOCTL args @@ -187,28 +180,18 @@ panfrost_lookup_bos(struct drm_device *dev, */ static int panfrost_copy_in_sync(struct drm_device *dev, - struct drm_file *file_priv, - struct drm_panfrost_submit *args, - struct panfrost_job *job) + struct drm_file *file_priv, + struct drm_panfrost_submit *args, + struct panfrost_job *job) { u32 *handles; int ret = 0; int i;
- job->in_fence_count = args->in_sync_count; - - if (!job->in_fence_count) + if (!args->in_sync_count) return 0;
- job->in_fences = kvmalloc_array(job->in_fence_count, - sizeof(struct dma_fence *), - GFP_KERNEL | __GFP_ZERO); - if (!job->in_fences) { - DRM_DEBUG("Failed to allocate job in fences\n"); - return -ENOMEM; - } - - handles = kvmalloc_array(job->in_fence_count, sizeof(u32), GFP_KERNEL); + handles = kvmalloc_array(args->in_sync_count, sizeof(u32), GFP_KERNEL); if (!handles) { ret = -ENOMEM; DRM_DEBUG("Failed to allocate incoming syncobj handles\n"); @@ -217,17 +200,23 @@ panfrost_copy_in_sync(struct drm_device *dev,
if (copy_from_user(handles, (void __user *)(uintptr_t)args->in_syncs, - job->in_fence_count * sizeof(u32))) { + args->in_sync_count * sizeof(u32))) { ret = -EFAULT; DRM_DEBUG("Failed to copy in syncobj handles\n"); goto fail; }
- for (i = 0; i < job->in_fence_count; i++) { + for (i = 0; i < args->in_sync_count; i++) { + struct dma_fence *fence; + ret = drm_syncobj_find_fence(file_priv, handles[i], 0, 0, - &job->in_fences[i]); + &fence); if (ret == -EINVAL) goto fail; + + ret = drm_gem_fence_array_add(&job->deps, fence); + if (ret) + goto fail; }
fail: @@ -269,6 +258,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, void *data, job->requirements = args->requirements; job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev); job->file_priv = file->driver_priv; + xa_init_flags(&job->deps, XA_FLAGS_ALLOC);
ret = panfrost_copy_in_sync(dev, file, args, job); if (ret) diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c index 564054a96ca9..87bfbd77d753 100644 --- a/drivers/gpu/drm/panfrost/panfrost_job.c +++ b/drivers/gpu/drm/panfrost/panfrost_job.c @@ -196,20 +196,31 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js) job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START); }
-static void panfrost_acquire_object_fences(struct panfrost_job *job) +static int panfrost_acquire_object_fences(struct panfrost_job *job) { int i;
- for (i = 0; i < job->bo_count; i++) - job->implicit_fences[i] = dma_resv_get_excl_rcu(job->bos[i]->resv); + for (i = 0; i < job->bo_count; i++) { + int ret; + + ret = drm_gem_fence_array_add_implicit(&job->deps, + job->bos[i], true); + if (ret) + return ret; + } + + return 0; }
static void panfrost_attach_object_fences(struct panfrost_job *job) { int i;
- for (i = 0; i < job->bo_count; i++) - dma_resv_add_excl_fence(job->bos[i]->resv, job->render_done_fence); + for (i = 0; i < job->bo_count; i++) { + struct dma_resv *robj = job->bos[i]->resv; + + dma_resv_add_excl_fence(robj, job->render_done_fence); + } }
int panfrost_job_push(struct panfrost_job *job) @@ -257,18 +268,15 @@ static void panfrost_job_cleanup(struct kref *ref) { struct panfrost_job *job = container_of(ref, struct panfrost_job, refcount); + struct dma_fence *fence; + unsigned long dep_idx; unsigned int i;
- if (job->in_fences) { - for (i = 0; i < job->in_fence_count; i++) - dma_fence_put(job->in_fences[i]); - kvfree(job->in_fences); - } - if (job->implicit_fences) { - for (i = 0; i < job->bo_count; i++) - dma_fence_put(job->implicit_fences[i]); - kvfree(job->implicit_fences); + xa_for_each(&job->deps, dep_idx, fence) { + dma_fence_put(fence); } + xa_destroy(&job->deps); + dma_fence_put(job->done_fence); dma_fence_put(job->render_done_fence);
@@ -311,26 +319,9 @@ static struct dma_fence *panfrost_job_dependency(struct drm_sched_job *sched_job struct drm_sched_entity *s_entity) { struct panfrost_job *job = to_panfrost_job(sched_job); - struct dma_fence *fence; - unsigned int i;
- /* Explicit fences */ - for (i = 0; i < job->in_fence_count; i++) { - if (job->in_fences[i]) { - fence = job->in_fences[i]; - job->in_fences[i] = NULL; - return fence; - } - } - - /* Implicit fences, max. one per BO */ - for (i = 0; i < job->bo_count; i++) { - if (job->implicit_fences[i]) { - fence = job->implicit_fences[i]; - job->implicit_fences[i] = NULL; - return fence; - } - } + if (!xa_empty(&job->deps)) + return xa_erase(&job->deps, job->last_dep++);
return NULL; } diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h index bbd3ba97ff67..b10b5be57dd2 100644 --- a/drivers/gpu/drm/panfrost/panfrost_job.h +++ b/drivers/gpu/drm/panfrost/panfrost_job.h @@ -19,9 +19,8 @@ struct panfrost_job { struct panfrost_device *pfdev; struct panfrost_file_priv *file_priv;
- /* Optional fences userspace can pass in for the job to depend on. */ - struct dma_fence **in_fences; - u32 in_fence_count; + struct xarray deps; + unsigned long last_dep;
/* Fence to be signaled by IRQ handler when the job is complete. */ struct dma_fence *done_fence; @@ -30,8 +29,6 @@ struct panfrost_job { __u32 requirements; __u32 flush_id;
- /* Exclusive fences we have taken from the BOs to wait for */ - struct dma_fence **implicit_fences; struct panfrost_gem_mapping **mappings; struct drm_gem_object **bos; u32 bo_count;
So we can re-use it from elsewhere.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 52 ++++++++++++++-----------
 1 file changed, 29 insertions(+), 23 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index 86c1347c6f0e..32e822e00a08 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -108,6 +108,34 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data, return 0; }
+static int +panfrost_get_job_mappings(struct drm_file *file_priv, struct panfrost_job *job) +{ + struct panfrost_file_priv *priv = file_priv->driver_priv; + unsigned int i; + + job->mappings = kvmalloc_array(job->bo_count, + sizeof(*job->mappings), + GFP_KERNEL | __GFP_ZERO); + if (!job->mappings) + return -ENOMEM; + + for (i = 0; i < job->bo_count; i++) { + struct panfrost_gem_mapping *mapping; + struct panfrost_gem_object *bo; + + bo = to_panfrost_bo(job->bos[i]); + mapping = panfrost_gem_mapping_get(bo, priv); + if (!mapping) + return -EINVAL; + + atomic_inc(&bo->gpu_usecount); + job->mappings[i] = mapping; + } + + return 0; +} + /** * panfrost_lookup_bos() - Sets up job->bo[] with the GEM objects * referenced by the job. @@ -127,8 +155,6 @@ panfrost_lookup_bos(struct drm_device *dev, struct drm_panfrost_submit *args, struct panfrost_job *job) { - struct panfrost_file_priv *priv = file_priv->driver_priv; - struct panfrost_gem_object *bo; unsigned int i; int ret;
@@ -143,27 +169,7 @@ panfrost_lookup_bos(struct drm_device *dev, if (ret) return ret;
- job->mappings = kvmalloc_array(job->bo_count, - sizeof(struct panfrost_gem_mapping *), - GFP_KERNEL | __GFP_ZERO); - if (!job->mappings) - return -ENOMEM; - - for (i = 0; i < job->bo_count; i++) { - struct panfrost_gem_mapping *mapping; - - bo = to_panfrost_bo(job->bos[i]); - mapping = panfrost_gem_mapping_get(bo, priv); - if (!mapping) { - ret = -EINVAL; - break; - } - - atomic_inc(&bo->gpu_usecount); - job->mappings[i] = mapping; - } - - return ret; + return panfrost_get_job_mappings(file_priv, job); }
/**
Jobs reading from the same BO should not be serialized. Add access flags so we can relax the implicit dependencies in that case. We force RW access for now to keep the behavior unchanged, but a new SUBMIT ioctl taking explicit access flags will be introduced.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c |  9 +++++++++
 drivers/gpu/drm/panfrost/panfrost_job.c | 15 +++++++++++++--
 drivers/gpu/drm/panfrost/panfrost_job.h |  1 +
 include/uapi/drm/panfrost_drm.h         |  4 ++++
 4 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index 32e822e00a08..be45efee47a2 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -163,6 +163,15 @@ panfrost_lookup_bos(struct drm_device *dev, if (!job->bo_count) return 0;
+ job->bo_flags = kvmalloc_array(job->bo_count, + sizeof(*job->bo_flags), + GFP_KERNEL | __GFP_ZERO); + if (!job->bo_flags) + return -ENOMEM; + + for (i = 0; i < job->bo_count; i++) + job->bo_flags[i] = PANFROST_BO_REF_RW; + ret = drm_gem_objects_lookup(file_priv, (void __user *)(uintptr_t)args->bo_handles, job->bo_count, &job->bos); diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c index 87bfbd77d753..170c3b0c8641 100644 --- a/drivers/gpu/drm/panfrost/panfrost_job.c +++ b/drivers/gpu/drm/panfrost/panfrost_job.c @@ -201,10 +201,17 @@ static int panfrost_acquire_object_fences(struct panfrost_job *job) int i;
for (i = 0; i < job->bo_count; i++) { + bool write = job->bo_flags[i] & PANFROST_BO_REF_WRITE; int ret;
+ if (!(job->bo_flags[i] & PANFROST_BO_REF_WRITE)) { + ret = dma_resv_reserve_shared(job->bos[i]->resv, 1); + if (ret) + return ret; + } + ret = drm_gem_fence_array_add_implicit(&job->deps, - job->bos[i], true); + job->bos[i], write); if (ret) return ret; } @@ -219,7 +226,10 @@ static void panfrost_attach_object_fences(struct panfrost_job *job) for (i = 0; i < job->bo_count; i++) { struct dma_resv *robj = job->bos[i]->resv;
- dma_resv_add_excl_fence(robj, job->render_done_fence); + if (job->bo_flags[i] & PANFROST_BO_REF_WRITE) + dma_resv_add_excl_fence(robj, job->render_done_fence); + else + dma_resv_add_shared_fence(robj, job->render_done_fence); } }
@@ -298,6 +308,7 @@ static void panfrost_job_cleanup(struct kref *ref) kvfree(job->bos); }
+ kvfree(job->bo_flags); kfree(job); }
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h index b10b5be57dd2..c3f662771ed9 100644 --- a/drivers/gpu/drm/panfrost/panfrost_job.h +++ b/drivers/gpu/drm/panfrost/panfrost_job.h @@ -31,6 +31,7 @@ struct panfrost_job {
struct panfrost_gem_mapping **mappings; struct drm_gem_object **bos; + u32 *bo_flags; u32 bo_count;
/* Fence to be signaled by drm-sched once its done with the job */ diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h index ec19db1eead8..eab96d7e0e0e 100644 --- a/include/uapi/drm/panfrost_drm.h +++ b/include/uapi/drm/panfrost_drm.h @@ -223,6 +223,10 @@ struct drm_panfrost_madvise { __u32 retained; /* out, whether backing store still exists */ };
+#define PANFROST_BO_REF_READ 0x1 +#define PANFROST_BO_REF_WRITE 0x2 +#define PANFROST_BO_REF_RW (PANFROST_BO_REF_READ | PANFROST_BO_REF_WRITE) + #if defined(__cplusplus) } #endif
This should help limit the number of ioctls when submitting multiple jobs. The new ioctl also supports syncobj timelines and BO access flags.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 303 ++++++++++++++++++++++++
 include/uapi/drm/panfrost_drm.h         |  79 ++++++
 2 files changed, 382 insertions(+)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index be45efee47a2..3f8addab8bab 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -446,6 +446,308 @@ static int panfrost_ioctl_madvise(struct drm_device *dev, void *data, return ret; }
+static int +panfrost_get_job_in_syncs(struct drm_file *file_priv, + u64 refs, u32 ref_stride, + u32 count, struct panfrost_job *job) +{ + const void __user *in = u64_to_user_ptr(refs); + unsigned int i; + int ret; + + if (!count) + return 0; + + for (i = 0; i < count; i++) { + struct drm_panfrost_syncobj_ref ref = { }; + struct dma_fence *fence; + + ret = copy_struct_from_user(&ref, sizeof(ref), + in + (i * ref_stride), + ref_stride); + if (ret) + return ret; + + if (ref.pad) + return -EINVAL; + + ret = drm_syncobj_find_fence(file_priv, ref.handle, ref.point, + 0, &fence); + if (ret) + return ret; + + ret = drm_gem_fence_array_add(&job->deps, fence); + if (ret) + return ret; + } + + return 0; +} + +struct panfrost_job_out_sync { + struct drm_syncobj *syncobj; + struct dma_fence_chain *chain; + u64 point; +}; + +static void +panfrost_put_job_out_syncs(struct panfrost_job_out_sync *out_syncs, u32 count) +{ + unsigned int i; + + for (i = 0; i < count; i++) { + if (!out_syncs[i].syncobj) + break; + + drm_syncobj_put(out_syncs[i].syncobj); + kvfree(out_syncs[i].chain); + } + + kvfree(out_syncs); +}
+static struct panfrost_job_out_sync * +panfrost_get_job_out_syncs(struct drm_file *file_priv, + u64 refs, u32 ref_stride, + u32 count) +{ + void __user *in = u64_to_user_ptr(refs); + struct panfrost_job_out_sync *out_syncs; + unsigned int i; + int ret; + + if (!count) + return NULL; + + out_syncs = kvmalloc_array(count, sizeof(*out_syncs), + GFP_KERNEL | __GFP_ZERO); + if (!out_syncs) + return ERR_PTR(-ENOMEM); + + for (i = 0; i < count; i++) { + struct drm_panfrost_syncobj_ref ref = { }; + + ret = copy_struct_from_user(&ref, sizeof(ref), + in + (i * ref_stride), + ref_stride); + if (ret) + goto err_free_out_syncs; + + if (ref.pad) { + ret = -EINVAL; + goto err_free_out_syncs; + } + + out_syncs[i].syncobj = drm_syncobj_find(file_priv, ref.handle); + if (!out_syncs[i].syncobj) { + ret = -EINVAL; + goto err_free_out_syncs; + } + + out_syncs[i].point = ref.point; + if (!out_syncs[i].point) + continue; + + out_syncs[i].chain = kmalloc(sizeof(*out_syncs[i].chain), + GFP_KERNEL); + if (!out_syncs[i].chain) { + ret = -ENOMEM; + goto err_free_out_syncs; + } + } + + return out_syncs; + +err_free_out_syncs: + panfrost_put_job_out_syncs(out_syncs, count); + return ERR_PTR(ret); +}
+static void +panfrost_set_job_out_fence(struct panfrost_job_out_sync *out_syncs, + unsigned int count, struct dma_fence *fence) +{ + unsigned int i; + + for (i = 0; i < count; i++) { + if (out_syncs[i].chain) { + drm_syncobj_add_point(out_syncs[i].syncobj, + out_syncs[i].chain, + fence, out_syncs[i].point); + out_syncs[i].chain = NULL; + } else { + drm_syncobj_replace_fence(out_syncs[i].syncobj, + fence); + } + } +} + +#define PANFROST_BO_REF_ALLOWED_FLAGS PANFROST_BO_REF_RW + +static int +panfrost_get_job_bos(struct drm_file *file_priv, + u64 refs, u32 ref_stride, u32 count, + struct panfrost_job *job) +{ + void __user *in = u64_to_user_ptr(refs); + unsigned int i; + + job->bo_count = count; + + if (!count) + return 0; + + job->bos = kvmalloc_array(job->bo_count, sizeof(*job->bos), + GFP_KERNEL | __GFP_ZERO); + job->bo_flags = kvmalloc_array(job->bo_count, + sizeof(*job->bo_flags), + GFP_KERNEL | __GFP_ZERO); + if (!job->bos || !job->bo_flags) + return -ENOMEM; + + for (i = 0; i < count; i++) { + struct drm_panfrost_bo_ref ref = { }; + int ret; + + ret = copy_struct_from_user(&ref, sizeof(ref), + in + (i * ref_stride), + ref_stride); + if (ret) + return ret; + + if ((ref.flags & ~PANFROST_BO_REF_ALLOWED_FLAGS)) + return -EINVAL; + + /* We need to know the access type. */ + if (!(ref.flags & PANFROST_BO_REF_RW)) + return -EINVAL; + + job->bos[i] = drm_gem_object_lookup(file_priv, ref.handle); + if (!job->bos[i]) + return -EINVAL; + + job->bo_flags[i] = ref.flags; + } + + return 0; +}
+#define PANFROST_JD_ALLOWED_REQS PANFROST_JD_REQ_FS + +static int +panfrost_submit_job(struct drm_device *dev, struct drm_file *file_priv, + const struct drm_panfrost_job *args, + u32 bo_stride, u32 syncobj_stride) +{ + struct panfrost_device *pfdev = dev->dev_private; + struct panfrost_job_out_sync *out_syncs; + struct panfrost_job *job; + int ret; + + if (!args->head) + return -EINVAL; + + if (args->requirements & ~PANFROST_JD_ALLOWED_REQS) + return -EINVAL; + + job = kzalloc(sizeof(*job), GFP_KERNEL); + if (!job) + return -ENOMEM; + + kref_init(&job->refcount); + + job->pfdev = pfdev; + job->jc = args->head; + job->requirements = args->requirements; + job->flush_id = panfrost_gpu_get_latest_flush_id(pfdev); + job->file_priv = file_priv->driver_priv; + xa_init_flags(&job->deps, XA_FLAGS_ALLOC); + + ret = panfrost_get_job_in_syncs(file_priv, + args->in_syncs, + syncobj_stride, + args->in_sync_count, + job); + if (ret) + goto err_put_job; + + out_syncs = panfrost_get_job_out_syncs(file_priv, + args->out_syncs, + syncobj_stride, + args->out_sync_count); + if (IS_ERR(out_syncs)) { + ret = PTR_ERR(out_syncs); + goto err_put_job; + } + + ret = panfrost_get_job_bos(file_priv, args->bos, bo_stride, + args->bo_count, job); + if (ret) + goto err_put_job; + + ret = panfrost_get_job_mappings(file_priv, job); + if (ret) + goto err_put_job; + + ret = panfrost_job_push(job); + if (ret) { + panfrost_put_job_out_syncs(out_syncs, args->out_sync_count); + goto err_put_job; + } + + panfrost_set_job_out_fence(out_syncs, args->out_sync_count, + job->render_done_fence); + panfrost_put_job_out_syncs(out_syncs, args->out_sync_count); + return 0; + +err_put_job: + panfrost_job_put(job); + return ret; +}
+static int +panfrost_ioctl_batch_submit(struct drm_device *dev, void *data, + struct drm_file *file_priv) +{ + struct drm_panfrost_batch_submit *args = data; + void __user *jobs_args = u64_to_user_ptr(args->jobs); + unsigned int i; + int ret; + + if (args->pad) + return -EINVAL; + + /* Relax this test if new fields are added to + * drm_panfrost_{bo_ref,syncobj_ref,job}. + */ + if (args->job_stride < sizeof(struct drm_panfrost_job) || + args->bo_ref_stride < sizeof(struct drm_panfrost_bo_ref) || + args->syncobj_ref_stride < sizeof(struct drm_panfrost_syncobj_ref)) + return -EINVAL; + + for (i = 0; i < args->job_count; i++) { + struct drm_panfrost_job job_args = { }; + + ret = copy_struct_from_user(&job_args, sizeof(job_args), + jobs_args + (i * args->job_stride), + args->job_stride); + if (ret) { + args->fail_idx = i; + return ret; + } + + ret = panfrost_submit_job(dev, file_priv, &job_args, + args->bo_ref_stride, + args->syncobj_ref_stride); + if (ret) { + args->fail_idx = i; + return ret; + } + } + + return 0; +} + int panfrost_unstable_ioctl_check(void) { if (!unstable_ioctls) @@ -544,6 +846,7 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = { PANFROST_IOCTL(PERFCNT_ENABLE, perfcnt_enable, DRM_RENDER_ALLOW), PANFROST_IOCTL(PERFCNT_DUMP, perfcnt_dump, DRM_RENDER_ALLOW), PANFROST_IOCTL(MADVISE, madvise, DRM_RENDER_ALLOW), + PANFROST_IOCTL(BATCH_SUBMIT, batch_submit, DRM_RENDER_ALLOW), };
DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops); diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h index eab96d7e0e0e..34a10dac7ade 100644 --- a/include/uapi/drm/panfrost_drm.h +++ b/include/uapi/drm/panfrost_drm.h @@ -21,6 +21,7 @@ extern "C" { #define DRM_PANFROST_PERFCNT_ENABLE 0x06 #define DRM_PANFROST_PERFCNT_DUMP 0x07 #define DRM_PANFROST_MADVISE 0x08 +#define DRM_PANFROST_BATCH_SUBMIT 0x09
#define DRM_IOCTL_PANFROST_SUBMIT DRM_IOW(DRM_COMMAND_BASE + DRM_PANFROST_SUBMIT, struct drm_panfrost_submit) #define DRM_IOCTL_PANFROST_WAIT_BO DRM_IOW(DRM_COMMAND_BASE + DRM_PANFROST_WAIT_BO, struct drm_panfrost_wait_bo) @@ -29,6 +30,7 @@ extern "C" { #define DRM_IOCTL_PANFROST_GET_PARAM DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_GET_PARAM, struct drm_panfrost_get_param) #define DRM_IOCTL_PANFROST_GET_BO_OFFSET DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_GET_BO_OFFSET, struct drm_panfrost_get_bo_offset) #define DRM_IOCTL_PANFROST_MADVISE DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_MADVISE, struct drm_panfrost_madvise) +#define DRM_IOCTL_PANFROST_BATCH_SUBMIT DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_BATCH_SUBMIT, struct drm_panfrost_batch_submit)
/* * Unstable ioctl(s): only exposed when the unsafe unstable_ioctls module @@ -223,10 +225,87 @@ struct drm_panfrost_madvise { __u32 retained; /* out, whether backing store still exists */ };
+/* Syncobj reference passed at job submission time to encode explicit + * input/output fences. + */ +struct drm_panfrost_syncobj_ref { + __u32 handle; + __u32 pad; + __u64 point; +}; + #define PANFROST_BO_REF_READ 0x1 #define PANFROST_BO_REF_WRITE 0x2 #define PANFROST_BO_REF_RW (PANFROST_BO_REF_READ | PANFROST_BO_REF_WRITE)
+/* Describes a BO referenced by a job and the type of access. */ +struct drm_panfrost_bo_ref { + __u32 handle; + __u32 flags; +}; + +/* Describes a GPU job and the resources attached to it. */ +struct drm_panfrost_job { + /** GPU pointer to the head of the job chain. */ + __u64 head; + + /** + * Array of drm_panfrost_bo_ref objects describing the BOs referenced + * by this job. + */ + __u64 bos; + + /** + * Arrays of drm_panfrost_syncobj_ref objects describing the input + * and output fences. + */ + __u64 in_syncs; + __u64 out_syncs; + + /** Syncobj reference array sizes. */ + __u32 in_sync_count; + __u32 out_sync_count; + + /** BO reference array size. */ + __u32 bo_count; + + /** Combination of PANFROST_JD_REQ_* flags. */ + __u32 requirements; +}; + +/* Used to submit multiple jobs in one call */ +struct drm_panfrost_batch_submit { + /** + * Stride of the jobs array (needed to ease extension of the + * BATCH_SUBMIT ioctl). Should be set to + * sizeof(struct drm_panfrost_job). + */ + __u32 job_stride; + + /** Number of jobs to submit. */ + __u32 job_count; + + /* Pointer to a job array. */ + __u64 jobs; + + /** + * Stride of the BO and syncobj reference arrays (needed to ease + * extension of the BATCH_SUBMIT ioctl). Should be set to + * sizeof(struct drm_panfrost_bo_ref). + */ + __u32 bo_ref_stride; + __u32 syncobj_ref_stride; + + /** + * If the submission fails, this encodes the index of the job + * failed. + */ + __u32 fail_idx; + + /** Always 0. */ + __u32 pad; +}; + #if defined(__cplusplus) } #endif
Now that we have a new SUBMIT ioctl dealing with timeline syncobjs we can advertise the feature.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index 3f8addab8bab..b3cc6fecb18d 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -857,7 +857,8 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops); * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO */ static const struct drm_driver panfrost_drm_driver = { - .driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ, + .driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ | + DRIVER_SYNCOBJ_TIMELINE, .open = panfrost_open, .postclose = panfrost_postclose, .ioctls = panfrost_drm_driver_ioctls,
We now have a new ioctl that allows submitting multiple jobs at once (among other things) and we support timeline syncobjs. Bump the minor version number to reflect those changes.
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index b3cc6fecb18d..7fb9b20ba27f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -855,6 +855,7 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops); * Panfrost driver version: * - 1.0 - initial interface * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO + * - 1.2 - adds BATCH_SUBMIT ioctl and SYNCOBJ_TIMELINE feature */ static const struct drm_driver panfrost_drm_driver = { .driver_features = DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ | @@ -868,7 +869,7 @@ static const struct drm_driver panfrost_drm_driver = { .desc = "panfrost DRM", .date = "20180908", .major = 1, - .minor = 1, + .minor = 2,
.gem_create_object = panfrost_gem_create_object, .prime_handle_to_fd = drm_gem_prime_handle_to_fd,
On 11/03/2021 09:25, Boris Brezillon wrote:
> Hello,
>
> I've been playing with Vulkan lately and struggled quite a bit to implement VkQueueSubmit with the submit ioctl we have. There are several limiting factors that can be worked around if we really have to, but I think it'd be much easier and future-proof if we introduce a new ioctl that addresses the current limitations:
Hi Boris,
I think what you've proposed is quite reasonable, some detailed comments to your points below.
> 1/ There can only be one out_sync, but Vulkan might ask us to signal several VkSemaphores and possibly one VkFence too, both of those being based on sync objects in my PoC. Making out_sync an array of syncobjs to attach the render_done fence to would make that possible. The other option would be to collect syncobj updates in userspace in a separate thread and propagate those updates to all semaphores+fences waiting on those events (I think the v3dv driver does something like that, but I didn't spend enough time studying the code to be sure, so I might be wrong).
You should be able to avoid the separate propagation thread by having a proxy object in user space that maps between the one out_sync of the job and the possibly many Vulkan objects. But I've had this argument before with the DDK... and the upshot of it was that the Vulkan API is unnecessarily complex here and makes this really hard to do in practise. So I agree adding this capability to the kernel is likely the best approach.
2/ Queued jobs might be executed out-of-order (unless they have explicit/implicit deps between them), and Vulkan asks that the out fence be signaled when all jobs are done. Timeline syncobjs are a good match for that use case. All we need to do is pass the same fence syncobj to all jobs being attached to a single QueueSubmit request, but a different point on the timeline. The syncobj timeline wait does the rest and guarantees that we've reached a given timeline point (IOW, all jobs before that point are done) before declaring the fence as signaled. One alternative would be to have dummy 'synchronization' jobs that don't actually execute anything on the GPU but declare a dependency on all other jobs that are part of the QueueSubmit request, and signal the out fence (the scheduler would do most of the work for us, all we have to do is support NULL job heads and signal the fence directly when that happens instead of queueing the job).
I have to admit to being rather hazy on the details of timeline syncobjs, but I thought there was a requirement that the timeline moves monotonically. I.e. if you have multiple jobs signalling the same syncobj just with different points, then AFAIU the API requires that the points are triggered in order.
So I'm not sure that you've actually fixed this point - you either need to force an order (in which case the last job can signal the Vulkan fence) or you still need a dummy job to do the many-to-one dependency.
Or I may have completely misunderstood timeline syncobjs - definitely a possibility :)
3/ The current implementation lacks information about BO access, so we serialize all jobs accessing the same set of BOs, even if those jobs might just be reading from them (which can happen concurrently). Other drivers pass an access type to the list of referenced BOs to address that. Another option would be to disable implicit deps (deps based on BOs) and force the driver to pass all deps explicitly (interestingly, some drivers have both the no-implicit-dep and r/w flags, probably to support sub-resource access, so we might want to add that one too). I don't see any userspace workaround to that problem, so that one alone would justify extending the existing ioctl or adding a new one.
Yeah - I think we need this. My only comment is that I think the read/write terminology may come back to bite. Better to use 'shared' and 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
Also the current code completely ignores PANFROST_BO_REF_READ. So either that should be defined as 0, or even better we support 3 modes:
- Exclusive ('write' access)
- Shared ('read' access)
- No fence - ensures the BO is mapped, but doesn't add any implicit fences.
The last may make sense when doing explicit fences and e.g. doing front-buffer rendering with a display driver which does implicit fencing.
4/ There's also the fact that submitting one job at a time adds an overhead when QueueSubmit is being passed more than one CommandBuffer. That one is less problematic, but if we're adding a new ioctl we'd better design it to limit the userspace -> kernel transition overhead.
I've no objection - but I doubt the performance effect is significant. I was pleased to see the handling of stride which makes the interface extendable. In particular I suspect at some point we're going to want a priority field in some form.
Right now I'm just trying to collect feedback. I don't intend to get those patches merged until we have a userspace user, but I thought starting the discussion early would be a good thing.
Feel free to suggest other approaches.
Other than the above I didn't see any obvious issues, and I know the Vulkan API is problematic in terms of synchronisation primitives - so if this makes it easier to implement then it seems like a good idea to me.
Thanks,
Steve
Hi Steven,
On Thu, 11 Mar 2021 12:16:33 +0000 Steven Price <steven.price@arm.com> wrote:
On 11/03/2021 09:25, Boris Brezillon wrote:
Hello,
I've been playing with Vulkan lately and struggled quite a bit to implement VkQueueSubmit with the submit ioctl we have. There are several limiting factors that can be worked around if we really have to, but I think it'd be much easier and future-proof if we introduce a new ioctl that addresses the current limitations:
Hi Boris,
I think what you've proposed is quite reasonable, some detailed comments to your points below.
1/ There can only be one out_sync, but Vulkan might ask us to signal several VkSemaphores and possibly one VkFence too, both of those being based on sync objects in my PoC. Making out_sync an array of syncobjs to attach the render_done fence to would make that possible. The other option would be to collect syncobj updates in userspace in a separate thread and propagate those updates to all semaphores+fences waiting on those events (I think the v3dv driver does something like that, but I didn't spend enough time studying the code to be sure, so I might be wrong).
You should be able to avoid the separate thread to propagate by having a proxy object in user space that maps between the one out_sync of the job and the possibly many Vulkan objects. But I've had this argument before with the DDK... and the upshot of it was that the Vulkan API is unnecessarily complex here and makes this really hard to do in practice. So I agree adding this capability to the kernel is likely the best approach.
2/ Queued jobs might be executed out-of-order (unless they have explicit/implicit deps between them), and Vulkan asks that the out fence be signaled when all jobs are done. Timeline syncobjs are a good match for that use case. All we need to do is pass the same fence syncobj to all jobs being attached to a single QueueSubmit request, but a different point on the timeline. The syncobj timeline wait does the rest and guarantees that we've reached a given timeline point (IOW, all jobs before that point are done) before declaring the fence as signaled. One alternative would be to have dummy 'synchronization' jobs that don't actually execute anything on the GPU but declare a dependency on all other jobs that are part of the QueueSubmit request, and signal the out fence (the scheduler would do most of the work for us, all we have to do is support NULL job heads and signal the fence directly when that happens instead of queueing the job).
I have to admit to being rather hazy on the details of timeline syncobjs, but I thought there was a requirement that the timeline moves monotonically. I.e. if you have multiple jobs signalling the same syncobj just with different points, then AFAIU the API requires that the points are triggered in order.
I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I might be wrong, but my understanding is that queuing fences (addition of new points in the timeline) should happen in order, but signaling can happen in any order. When checking for a signaled fence the fence-chain logic starts from the last point (or from an explicit point if you use the timeline wait flavor) and goes backward, stopping at the first un-signaled node. If I'm correct, that means that fences that are part of a chain can be signaled in any order.
Note that I also considered using a sync file, which has the ability to merge fences, but that required 3 extra ioctls for each syncobj to merge (for the export/merge/import round trip), and AFAICT, fences stay around until the sync file is destroyed, which forces some garbage collection if we want to keep the number of active fences low. One nice thing about the fence-chain/syncobj-timeline logic is that signaled fences are collected automatically when walking the chain.
So I'm not sure that you've actually fixed this point - you either need to force an order (in which case the last job can signal the Vulkan fence)
That options requires adding deps that do not really exist on the last jobs, so I'm not sure I like it.
or you still need a dummy job to do the many-to-one dependency.
Yeah, that's what I've considered doing before realizing timeline syncobjs could solve this problem (assuming I got it right :-/).
Or I may have completely misunderstood timeline syncobjs - definitely a possibility :)
I wouldn't pretend I got it right either :-).
3/ The current implementation lacks information about BO access, so we serialize all jobs accessing the same set of BOs, even if those jobs might just be reading from them (which can happen concurrently). Other drivers pass an access type to the list of referenced BOs to address that. Another option would be to disable implicit deps (deps based on BOs) and force the driver to pass all deps explicitly (interestingly, some drivers have both the no-implicit-dep and r/w flags, probably to support sub-resource access, so we might want to add that one too). I don't see any userspace workaround to that problem, so that one alone would justify extending the existing ioctl or adding a new one.
Yeah - I think we need this. My only comment is that I think the read/write terminology may come back to bite. Better to use 'shared' and 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
Also the current code completely ignores PANFROST_BO_REF_READ. So either that should be defined as 0, or even better we support 3 modes:
- Exclusive ('write' access)
- Shared ('read' access)
- No fence - ensures the BO is mapped, but doesn't add any implicit fences.
The last may make sense when doing explicit fences and e.g. doing front-buffer rendering with a display driver which does implicit fencing.
Sounds good to me.
4/ There's also the fact that submitting one job at a time adds an overhead when QueueSubmit is being passed more than one CommandBuffer. That one is less problematic, but if we're adding a new ioctl we'd better design it to limit the userspace -> kernel transition overhead.
I've no objection - but I doubt the performance effect is significant. I was pleased to see the handling of stride which makes the interface extendable. In particular I suspect at some point we're going to want a priority field in some form.
Right now I'm just trying to collect feedback. I don't intend to get those patches merged until we have a userspace user, but I thought starting the discussion early would be a good thing.
Feel free to suggest other approaches.
Other than the above I didn't see any obvious issues, and I know the Vulkan API is problematic in terms of synchronisation primitives - so if this makes it easier to implement then it seems like a good idea to me.
Thanks a lot for your review.
Regards,
Boris
Hi all,
Dropping in where I may or may not be wanted, so feel free to ignore. :-)
On Thu, Mar 11, 2021 at 7:00 AM Boris Brezillon <boris.brezillon@collabora.com> wrote:
Hi Steven,
On Thu, 11 Mar 2021 12:16:33 +0000 Steven Price <steven.price@arm.com> wrote:
On 11/03/2021 09:25, Boris Brezillon wrote:
Hello,
I've been playing with Vulkan lately and struggled quite a bit to implement VkQueueSubmit with the submit ioctl we have. There are several limiting factors that can be worked around if we really have to, but I think it'd be much easier and future-proof if we introduce a new ioctl that addresses the current limitations:
Hi Boris,
I think what you've proposed is quite reasonable, some detailed comments to your points below.
1/ There can only be one out_sync, but Vulkan might ask us to signal several VkSemaphores and possibly one VkFence too, both of those being based on sync objects in my PoC. Making out_sync an array of syncobjs to attach the render_done fence to would make that possible. The other option would be to collect syncobj updates in userspace in a separate thread and propagate those updates to all semaphores+fences waiting on those events (I think the v3dv driver does something like that, but I didn't spend enough time studying the code to be sure, so I might be wrong).
You should be able to avoid the separate thread to propagate by having a proxy object in user space that maps between the one out_sync of the job and the possibly many Vulkan objects. But I've had this argument before with the DDK... and the upshot of it was that the Vulkan API is unnecessarily complex here and makes this really hard to do in practice. So I agree adding this capability to the kernel is likely the best approach.
This is pretty easy to do from userspace. Just take the out sync_file and stuff it into each of the syncobjs using DRM_IOCTL_SYNCOBJ_FD_TO_HANDLE. That's all the kernel would be doing under the hood anyway. Having it built into your submit ioctl does have the advantage of fewer round-trips to kernel space, though.
2/ Queued jobs might be executed out-of-order (unless they have explicit/implicit deps between them), and Vulkan asks that the out fence be signaled when all jobs are done. Timeline syncobjs are a good match for that use case. All we need to do is pass the same fence syncobj to all jobs being attached to a single QueueSubmit request, but a different point on the timeline. The syncobj timeline wait does the rest and guarantees that we've reached a given timeline point (IOW, all jobs before that point are done) before declaring the fence as signaled. One alternative would be to have dummy 'synchronization' jobs that don't actually execute anything on the GPU but declare a dependency on all other jobs that are part of the QueueSubmit request, and signal the out fence (the scheduler would do most of the work for us, all we have to do is support NULL job heads and signal the fence directly when that happens instead of queueing the job).
I have to admit to being rather hazy on the details of timeline syncobjs, but I thought there was a requirement that the timeline moves monotonically. I.e. if you have multiple jobs signalling the same syncobj just with different points, then AFAIU the API requires that the points are triggered in order.
I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I might be wrong, but my understanding is that queuing fences (addition of new points in the timeline) should happen in order, but signaling can happen in any order. When checking for a signaled fence the fence-chain logic starts from the last point (or from an explicit point if you use the timeline wait flavor) and goes backward, stopping at the first un-signaled node. If I'm correct, that means that fences that are part of a chain can be signaled in any order.
You don't even need a timeline for this. Just have a single syncobj per-queue and make each submit wait on it and then signal it. Alternatively, you can just always hang on to the out-fence from the previous submit and make the next one wait on that. Timelines are overkill here, IMO.
Note that I also considered using a sync file, which has the ability to merge fences, but that required 3 extra ioctls for each syncobj to merge (for the export/merge/import round trip), and AFAICT, fences stay around until the sync file is destroyed, which forces some garbage collection if we want to keep the number of active fences low. One nice thing about the fence-chain/syncobj-timeline logic is that signaled fences are collected automatically when walking the chain.
Yeah, that's the pain when working with sync files. This is one of the reasons why our driver takes an arbitrary number of in/out syncobjs.
So I'm not sure that you've actually fixed this point - you either need to force an order (in which case the last job can signal the Vulkan fence)
That options requires adding deps that do not really exist on the last jobs, so I'm not sure I like it.
or you still need a dummy job to do the many-to-one dependency.
Yeah, that's what I've considered doing before realizing timeline syncobjs could solve this problem (assuming I got it right :-/).
Or I may have completely misunderstood timeline syncobjs - definitely a possibility :)
I wouldn't pretend I got it right either :-).
3/ The current implementation lacks information about BO access, so we serialize all jobs accessing the same set of BOs, even if those jobs might just be reading from them (which can happen concurrently). Other drivers pass an access type to the list of referenced BOs to address that. Another option would be to disable implicit deps (deps based on BOs) and force the driver to pass all deps explicitly (interestingly, some drivers have both the no-implicit-dep and r/w flags, probably to support sub-resource access, so we might want to add that one too). I don't see any userspace workaround to that problem, so that one alone would justify extending the existing ioctl or adding a new one.
Yeah - I think we need this. My only comment is that I think the read/write terminology may come back to bite. Better to use 'shared' and 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
Also the current code completely ignores PANFROST_BO_REF_READ. So either that should be defined as 0, or even better we support 3 modes:
- Exclusive ('write' access)
- Shared ('read' access)
- No fence - ensures the BO is mapped, but doesn't add any implicit fences.
The last may make sense when doing explicit fences and e.g. doing front-buffer rendering with a display driver which does implicit fencing.
This one's really annoying. TBH, we've still not gotten it right on Intel, AFAICT. That is roughly the set of modes you need but you'll have to watch out for window-system buffers. RADV and ANV take slightly different approaches here and they each have their own problems. On the balance, I'd look at what RADV is doing rather than ANV because ANV's approach results in some over-synchronization every time you vkWaitForFences on the WSI fence. I've got a patch floating round somewhere that adds some new kernel API to make that case a bit better but it's a snarly mess. Sorry for being cryptic but it's a 5-page e-mail if I type out all the annoying details. (-:
--Jason
Sounds good to me.
4/ There's also the fact that submitting one job at a time adds an overhead when QueueSubmit is being passed more than one CommandBuffer. That one is less problematic, but if we're adding a new ioctl we'd better design it to limit the userspace -> kernel transition overhead.
I've no objection - but I doubt the performance effect is significant. I was pleased to see the handling of stride which makes the interface extendable. In particular I suspect at some point we're going to want a priority field in some form.
Right now I'm just trying to collect feedback. I don't intend to get those patches merged until we have a userspace user, but I thought starting the discussion early would be a good thing.
Feel free to suggest other approaches.
Other than the above I didn't see any obvious issues, and I know the Vulkan API is problematic in terms of synchronisation primitives - so if this makes it easier to implement then it seems like a good idea to me.
Thanks a lot for your review.
Regards,
Regards,
Boris
Hi Jason,
On Thu, 11 Mar 2021 10:58:46 -0600 Jason Ekstrand <jason@jlekstrand.net> wrote:
Hi all,
Dropping in where I may or may not be wanted, so feel free to ignore. :-)
I'm glad you decided to chime in. :-)
2/ Queued jobs might be executed out-of-order (unless they have explicit/implicit deps between them), and Vulkan asks that the out fence be signaled when all jobs are done. Timeline syncobjs are a good match for that use case. All we need to do is pass the same fence syncobj to all jobs being attached to a single QueueSubmit request, but a different point on the timeline. The syncobj timeline wait does the rest and guarantees that we've reached a given timeline point (IOW, all jobs before that point are done) before declaring the fence as signaled. One alternative would be to have dummy 'synchronization' jobs that don't actually execute anything on the GPU but declare a dependency on all other jobs that are part of the QueueSubmit request, and signal the out fence (the scheduler would do most of the work for us, all we have to do is support NULL job heads and signal the fence directly when that happens instead of queueing the job).
I have to admit to being rather hazy on the details of timeline syncobjs, but I thought there was a requirement that the timeline moves monotonically. I.e. if you have multiple jobs signalling the same syncobj just with different points, then AFAIU the API requires that the points are triggered in order.
I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I might be wrong, but my understanding is that queuing fences (addition of new points in the timeline) should happen in order, but signaling can happen in any order. When checking for a signaled fence the fence-chain logic starts from the last point (or from an explicit point if you use the timeline wait flavor) and goes backward, stopping at the first un-signaled node. If I'm correct, that means that fences that are part of a chain can be signaled in any order.
You don't even need a timeline for this. Just have a single syncobj per-queue and make each submit wait on it and then signal it. Alternatively, you can just always hang on to the out-fence from the previous submit and make the next one wait on that.
That's what I have right now, but it forces the serialization of all jobs that are pushed during a submit (and there can be more than one per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd be better to not force this serialization if we can avoid it.
Timelines are overkill here, IMO.
Mind developing why you think this is overkill? After looking at the kernel implementation I thought using timeline syncobjs would be pretty cheap compared to the other options I considered.
Note that I also considered using a sync file, which has the ability to merge fences, but that required 3 extra ioctls for each syncobj to merge (for the export/merge/import round trip), and AFAICT, fences stay around until the sync file is destroyed, which forces some garbage collection if we want to keep the number of active fences low. One nice thing about the fence-chain/syncobj-timeline logic is that signaled fences are collected automatically when walking the chain.
Yeah, that's the pain when working with sync files. This is one of the reasons why our driver takes an arbitrary number of in/out syncobjs.
So I'm not sure that you've actually fixed this point - you either need to force an order (in which case the last job can signal the Vulkan fence)
That options requires adding deps that do not really exist on the last jobs, so I'm not sure I like it.
or you still need a dummy job to do the many-to-one dependency.
Yeah, that's what I've considered doing before realizing timeline syncobjs could solve this problem (assuming I got it right :-/).
Or I may have completely misunderstood timeline syncobjs - definitely a possibility :)
I wouldn't pretend I got it right either :-).
3/ The current implementation lacks information about BO access, so we serialize all jobs accessing the same set of BOs, even if those jobs might just be reading from them (which can happen concurrently). Other drivers pass an access type to the list of referenced BOs to address that. Another option would be to disable implicit deps (deps based on BOs) and force the driver to pass all deps explicitly (interestingly, some drivers have both the no-implicit-dep and r/w flags, probably to support sub-resource access, so we might want to add that one too). I don't see any userspace workaround to that problem, so that one alone would justify extending the existing ioctl or adding a new one.
Yeah - I think we need this. My only comment is that I think the read/write terminology may come back to bite. Better to use 'shared' and 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
Also the current code completely ignores PANFROST_BO_REF_READ. So either that should be defined as 0, or even better we support 3 modes:
- Exclusive ('write' access)
- Shared ('read' access)
- No fence - ensures the BO is mapped, but doesn't add any implicit fences.
The last may make sense when doing explicit fences and e.g. doing front-buffer rendering with a display driver which does implicit fencing.
This one's really annoying. TBH, we've still not gotten it right on Intel, AFAICT. That is roughly the set of modes you need but you'll have to watch out for window-system buffers. RADV and ANV take slightly different approaches here and they each have their own problems. On the balance, I'd look at what RADV is doing rather than ANV because ANV's approach results in some over-synchronization every time you vkWaitForFences on the WSI fence. I've got a patch floating round somewhere that adds some new kernel API to make that case a bit better but it's a snarly mess. Sorry for being cryptic but it's a 5-page e-mail if I type out all the annoying details. (-:
Ok, I'll have a look at the RADV driver.
Thanks for your feedback.
Boris
On Thu, Mar 11, 2021 at 11:25 AM Boris Brezillon <boris.brezillon@collabora.com> wrote:
Hi Jason,
On Thu, 11 Mar 2021 10:58:46 -0600 Jason Ekstrand <jason@jlekstrand.net> wrote:
Hi all,
Dropping in where I may or may not be wanted, so feel free to ignore. :-)
I'm glad you decided to chime in. :-)
2/ Queued jobs might be executed out-of-order (unless they have explicit/implicit deps between them), and Vulkan asks that the out fence be signaled when all jobs are done. Timeline syncobjs are a good match for that use case. All we need to do is pass the same fence syncobj to all jobs being attached to a single QueueSubmit request, but a different point on the timeline. The syncobj timeline wait does the rest and guarantees that we've reached a given timeline point (IOW, all jobs before that point are done) before declaring the fence as signaled. One alternative would be to have dummy 'synchronization' jobs that don't actually execute anything on the GPU but declare a dependency on all other jobs that are part of the QueueSubmit request, and signal the out fence (the scheduler would do most of the work for us, all we have to do is support NULL job heads and signal the fence directly when that happens instead of queueing the job).
I have to admit to being rather hazy on the details of timeline syncobjs, but I thought there was a requirement that the timeline moves monotonically. I.e. if you have multiple jobs signalling the same syncobj just with different points, then AFAIU the API requires that the points are triggered in order.
I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I might be wrong, but my understanding is that queuing fences (addition of new points in the timeline) should happen in order, but signaling can happen in any order. When checking for a signaled fence the fence-chain logic starts from the last point (or from an explicit point if you use the timeline wait flavor) and goes backward, stopping at the first un-signaled node. If I'm correct, that means that fences that are part of a chain can be signaled in any order.
You don't even need a timeline for this. Just have a single syncobj per-queue and make each submit wait on it and then signal it. Alternatively, you can just always hang on to the out-fence from the previous submit and make the next one wait on that.
That's what I have right now, but it forces the serialization of all jobs that are pushed during a submit (and there can be more than one per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd be better to not force this serialization if we can avoid it.
I'm not familiar with panfrost's needs and I don't work on a tiler and I know there are different issues there. But...
The Vulkan spec requires that all the submits that happen on a given vkQueue happen in-order. Search the spec for "Submission order" for more details.
So, generally speaking, there are some in-order requirements there. Again, not having a lot of tiler experience, I'm not the one to weigh in.
Timelines are overkill here, IMO.
Mind developing why you think this is overkill? After looking at the kernel implementation I thought using timeline syncobjs would be pretty cheap compared to the other options I considered.
If you use a regular syncobj, every time you wait on it it inserts a dependency between the current submit and the last thing to signal it on the CPU timeline. The internal dma_fences will hang around as-needed to ensure those dependencies. If you use a timeline, you have to also track a uint64_t to reference the current time point. If you need to sync a bunch of in-flight stuff at one go, that may work, but if you're trying to serialize, it's just extra tracking for no point. Again, maybe there's something I'm missing and you don't actually want to serialize.
--Jason
Note that I also considered using a sync file, which has the ability to merge fences, but that required 3 extra ioctls for each syncobj to merge (for the export/merge/import round trip), and AFAICT, fences stay around until the sync file is destroyed, which forces some garbage collection if we want to keep the number of active fences low. One nice thing about the fence-chain/syncobj-timeline logic is that signaled fences are collected automatically when walking the chain.
Yeah, that's the pain when working with sync files. This is one of the reasons why our driver takes an arbitrary number of in/out syncobjs.
So I'm not sure that you've actually fixed this point - you either need to force an order (in which case the last job can signal the Vulkan fence)
That options requires adding deps that do not really exist on the last jobs, so I'm not sure I like it.
or you still need a dummy job to do the many-to-one dependency.
Yeah, that's what I've considered doing before realizing timeline syncobjs could solve this problem (assuming I got it right :-/).
Or I may have completely misunderstood timeline syncobjs - definitely a possibility :)
I wouldn't pretend I got it right either :-).
3/ The current implementation lacks information about BO access, so we serialize all jobs accessing the same set of BOs, even if those jobs might just be reading from them (which can happen concurrently). Other drivers pass an access type to the list of referenced BOs to address that. Another option would be to disable implicit deps (deps based on BOs) and force the driver to pass all deps explicitly (interestingly, some drivers have both the no-implicit-dep and r/w flags, probably to support sub-resource access, so we might want to add that one too). I don't see any userspace workaround to that problem, so that one alone would justify extending the existing ioctl or adding a new one.
Yeah - I think we need this. My only comment is that I think the read/write terminology may come back to bite. Better to use 'shared' and 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
Also the current code completely ignores PANFROST_BO_REF_READ. So either that should be defined as 0, or even better we support 3 modes:
- Exclusive ('write' access)
- Shared ('read' access)
- No fence - ensures the BO is mapped, but doesn't add any implicit fences.
The last may make sense when doing explicit fences and e.g. doing front-buffer rendering with a display driver which does implicit fencing.
This one's really annoying. TBH, we've still not gotten it right on Intel, AFAICT. That is roughly the set of modes you need but you'll have to watch out for window-system buffers. RADV and ANV take slightly different approaches here and they each have their own problems. On the balance, I'd look at what RADV is doing rather than ANV because ANV's approach results in some over-synchronization every time you vkWaitForFences on the WSI fence. I've got a patch floating round somewhere that adds some new kernel API to make that case a bit better but it's a snarly mess. Sorry for being cryptic but it's a 5-page e-mail if I type out all the annoying details. (-:
Ok, I'll have a look at the RADV driver.
Thanks for your feedback.
Boris
I'm not familiar with panfrost's needs and I don't work on a tiler and I know there are different issues there. But...
The primary issue is we submit vertex+compute and fragment for each batch as two disjoint jobs (with a dependency of course), reflecting the internal hardware structure as parallel job slots. That we actually require two ioctl() calls and a round trip for this is a design wart inherited from early days of the kernel driver. The downstream Mali driver handles this by allowing multiple jobs to be submitted with a single ioctl, as Boris's patch enables. In every other respect I believe our needs are similar to other renderonly drivers. (What does turnip do?)
On Thu, 11 Mar 2021 12:11:48 -0600 Jason Ekstrand jason@jlekstrand.net wrote:
2/ Queued jobs might be executed out-of-order (unless they have explicit/implicit deps between them), and Vulkan asks that the out fence be signaled when all jobs are done. Timeline syncobjs are a good match for that use case. All we need to do is pass the same fence syncobj to all jobs being attached to a single QueueSubmit request, but a different point on the timeline. The syncobj timeline wait does the rest and guarantees that we've reached a given timeline point (IOW, all jobs before that point are done) before declaring the fence as signaled. One alternative would be to have dummy 'synchronization' jobs that don't actually execute anything on the GPU but declare a dependency on all other jobs that are part of the QueueSubmit request, and signal the out fence (the scheduler would do most of the work for us, all we have to do is support NULL job heads and signal the fence directly when that happens instead of queueing the job).
I have to admit to being rather hazy on the details of timeline syncobjs, but I thought there was a requirement that the timeline moves monotonically. I.e. if you have multiple jobs signalling the same syncobj just with different points, then AFAIU the API requires that the points are triggered in order.
I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I might be wrong, but my understanding is that queuing fences (addition of new points in the timeline) should happen in order, but signaling can happen in any order. When checking for a signaled fence the fence-chain logic starts from the last point (or from an explicit point if you use the timeline wait flavor) and goes backward, stopping at the first un-signaled node. If I'm correct, that means that fences that are part of a chain can be signaled in any order.
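A toy model of that fence-chain behaviour, assuming my reading of the kernel code is right: points are queued in order, the underlying fences may signal in any order, but a wait on point P only completes once everything at or before P has signaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a drm_syncobj timeline: 'chain' holds the points in queue
 * order.  A wait on point P is satisfied only when every fence at a point
 * <= P has signaled, which is what guarantees "all jobs before P are
 * done" even if the fences signal out of order. */
struct point { uint64_t seqno; bool signaled; };

static bool timeline_point_done(const struct point *chain, size_t n,
                                uint64_t p)
{
    for (size_t i = 0; i < n; i++)
        if (chain[i].seqno <= p && !chain[i].signaled)
            return false; /* an earlier point is still pending */
    return true;
}
```

So signaling point 3 before points 1 and 2 is allowed, but a waiter on point 3 won't wake up until 1 and 2 have signaled too.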
You don't even need a timeline for this. Just have a single syncobj per-queue and make each submit wait on it and then signal it. Alternatively, you can just always hang on to the out-fence from the previous submit and make the next one wait on that.
That's what I have right now, but it forces the serialization of all jobs that are pushed during a submit (and there can be more than one per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd be better to not force this serialization if we can avoid it.
I'm not familiar with panfrost's needs and I don't work on a tiler and I know there are different issues there. But...
The Vulkan spec requires that all the submits that happen on a given vkQueue happen in-order. Search the spec for "Submission order" for more details.
Duh, looks like I completely overlooked the "Submission order" guarantees. That said, even after reading this chapter multiple times, I'm not sure what kind of guarantee it gives us, given that execution itself can be out-of-order. My understanding is that submission order matters for implicit deps: say you have 2 distinct VkSubmitInfo, the first one (in submission order) writing to a buffer and the second one reading from it. You really want the first one to be submitted first, and the second one to wait on the implicit BO fence created by the first one; if we were to submit out-of-order, this guarantee wouldn't be met. OTOH, if we have 2 completely independent submits, I don't really see what submission order buys us if execution is out-of-order.
In our case, the kernel driver takes care of the submission serialization (gathering implicit and explicit deps, queuing the job, and assigning the "done" fence to the output sync objects). Once things are queued, it's the scheduler (drm_sched) that decides the execution order.
So, generally speaking, there are some in-order requirements there. Again, not having a lot of tiler experience, I'm not the one to weigh in.
Timelines are overkill here, IMO.
Mind developing why you think this is overkill? After looking at the kernel implementation I thought using timeline syncobjs would be pretty cheap compared to the other options I considered.
If you use a regular syncobj, every time you wait on it it inserts a dependency between the current submit and the last thing to signal it on the CPU timeline. The internal dma_fences will hang around as needed to ensure those dependencies. If you use a timeline, you also have to track a uint64_t to reference the current time point. If you need to sync a bunch of in-flight stuff in one go, that may work, but if you're just trying to serialize, it's extra tracking for no benefit. Again, maybe there's something I'm missing and you don't actually want to serialize.
My understanding (and I might very much be wrong here) is that using a regular syncobj to do this actually enforces not only the submission order but also the execution order (each job waiting on the previous one to complete before being scheduled). The idea of the timeline syncobj approach is that jobs with no inter-dependencies can be started in any order, the scheduler picking the first one whose deps are ready (which might not necessarily match submission order). The timeline syncobj allows us to have one point per kernel submission, and to eventually wait on the last point for the fence passed to vkQueueSubmit(), and on some specific point on the timeline for the pSignalSemaphores entries.
What's more challenging is signal operation ordering:
" Signal operation order is a fundamental ordering in Vulkan, giving meaning to the order in which semaphore and fence signal operations occur when submitted to a single queue.
1. The initial order is determined by the order in which vkQueueSubmit commands are executed on the host, for a single queue, from first to last.
2. The order in which VkSubmitInfo structures are specified in the pSubmits parameter of vkQueueSubmit, from lowest index to highest.
3. The fence signal operation defined by the fence parameter of a vkQueueSubmit or vkQueueBindSparse command is ordered after all semaphore signal operations defined by that command. "
This means we have to add implicit dependencies on the signaling itself. We have two options to guarantee that:
1/ Transfer one of the queue syncobj timeline points to each semaphore and fence after job submission (the point itself depending on the position of the submit entry in the array for semaphores, and the last point for the fence). The problem with this approach is that we now have an extra SYNCOBJ_TRANSFER call per semaphore/fence.
2/ Add SYNC jobs (jobs that do not actually execute on the GPU, but serve as a synchronization point) whose responsibility would be to do this muxing/transfer as part of the batch submission process.
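Option 1 is easy to model: each submit in a vkQueueSubmit batch bumps the queue timeline by one, each signal semaphore takes the point of its own submit, and the fence takes the last point (the spec orders the fence signal after all semaphore signals). A minimal sketch of the point arithmetic, with propagation to the semaphore/fence syncobjs left to one SYNCOBJ_TRANSFER per object (the overhead noted above):

```c
#include <assert.h>
#include <stdint.h>

/* 'base' is the queue timeline value before this vkQueueSubmit call. */

/* Point to hand to a semaphore signaled by submit 'submit_idx' (0-based). */
static uint64_t semaphore_point(uint64_t base, uint32_t submit_idx)
{
    return base + submit_idx + 1;
}

/* The fence takes the last point of the batch, so it is ordered after
 * every semaphore signal operation of the same vkQueueSubmit call. */
static uint64_t fence_point(uint64_t base, uint32_t submit_count)
{
    return base + submit_count;
}
```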
On Fri, Mar 12, 2021 at 1:31 AM Boris Brezillon boris.brezillon@collabora.com wrote:
Duh, looks like I completely overlooked the "Submission order" guarantees. That said, even after reading this chapter multiple times, I'm not sure what kind of guarantee it gives us, given that execution itself can be out-of-order. My understanding is that submission order matters for implicit deps: say you have 2 distinct VkSubmitInfo, the first one (in submission order) writing to a buffer and the second one reading from it. You really want the first one to be submitted first, and the second one to wait on the implicit BO fence created by the first one; if we were to submit out-of-order, this guarantee wouldn't be met. OTOH, if we have 2 completely independent submits, I don't really see what submission order buys us if execution is out-of-order.
Right, this is where things get sticky. What's guaranteed there is submission order, not execution order. But, sadly, there's more hidden in there than you might think. Before we can go there, though, we need to establish a few details of how Mali hardware works.
My understanding (feel free to correct me if I'm wrong) is that, roughly, Mali has two engines which have to work together to render: One engine for compute/vertex/geometry and one for binning and fragment. When you go to submit a draw, you fire off any geometry work on the compute engine and the fragment work on the binning engine. At a tile flush boundary (or renderpass in Vulkan), you insert a dependency between the geometry work you put on the compute engine and the binning/fragment work. In a GL driver, you'd probably kick it off to the kernel at this point.
In Vulkan, things are going to get more tricky. You can't just kick off to the kernel at tile flush boundaries. You also can't assume that all vertex work within a given command buffer is independent of all the fragment work. Let's say you have a sequence like this:
    vkBeginRenderPass() /* Writes to ImageA */
    vkCmdDraw()
    vkCmdDraw()
    ...
    vkEndRenderPass()
    vkPipelineBarrier(imageA /* fragment -> compute */)
    vkCmdDispatch() /* reads imageA, writes BufferB */
    vkBeginRenderPass() /* Writes to ImageC */
    vkCmdBindVertexBuffers(bufferB)
    vkCmdDraw();
    ...
    vkEndRenderPass()
Now you have, in a single command buffer, a draw which writes to an image which is read by a compute shader which writes to a buffer which is read by vertex fetch for another draw. The only way to implement this on a Mali-like device is to ping-pong back and forth between the compute and binning engines. It'd look something like this (assuming my mental model):
    A: Vertex for the first draw on the compute engine
    B: Vertex for the second draw on the compute engine
    C: Fragment for the first draw on the binning engine; depends on A
    D: Fragment for the second draw on the binning engine; depends on B
    E: Compute on the compute engine; depends on C and D
    F: Vertex for the third draw on the compute engine; depends on E
    G: Fragment for the third draw on the binning engine; depends on F
There are a couple of options for how to do this:
1. Each VkCommandBuffer is actually a bunch of command buffers for each engine with dependencies. vkQueueSubmit() calls the kernel submit ioctl a brezillion times to submit them all.
2. Same as 1, only the submit ioctl takes an array of things with dependencies so that you don't have to do as many round-trips to kernel space.
3. Each VkCommandBuffer is two command buffers: one for compute and one for binning, and you use some sort of HW synchronization mechanism to handle the dependencies as you ping-pong between them.
Option 1 is the easiest to get going with if you're working on a kernel driver that was designed for OpenGL but it has the obvious drawback of lots of smaller command buffers and lots of submits. It's not going to scale well. Option 3 is nice if the hardware can do it (I have no idea if Mali can).
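For reference, a batched submit along the lines of option 2 could look roughly like this. Every struct and field name here is invented for illustration - this is a sketch of the shape of such a UAPI, not a proposal of actual panfrost kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical batched-submit UAPI sketch (option 2): one ioctl carries
 * an array of jobs, each with explicit deps expressed as
 * (syncobj handle, timeline point) pairs. */
struct example_syncobj_ref {
    uint32_t handle;   /* syncobj to wait on / signal */
    uint32_t pad;
    uint64_t point;    /* timeline point, 0 for binary syncobjs */
};

struct example_job {
    uint64_t jc;            /* GPU address of the job chain head */
    uint64_t in_syncs;      /* ptr to array of example_syncobj_ref */
    uint64_t out_syncs;     /* ptr to array of example_syncobj_ref */
    uint32_t in_sync_count;
    uint32_t out_sync_count;
    uint64_t bo_refs;       /* ptr to array of {handle, access flags} */
    uint32_t bo_ref_count;
    uint32_t flags;
};

struct example_batch_submit {
    uint64_t jobs;          /* ptr to array of example_job */
    uint32_t job_count;
    uint32_t pad;
};
```

Keeping every struct a multiple of 8 bytes with explicit padding avoids the usual 32/64-bit compat headaches for a real UAPI.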
Sorry if that's all a re-hash of stuff you already know. I'm just trying to establish a baseline here so we're all talking about the same things.
Ok, so what does this have to do with submission order in the spec? The mental model of the Vulkan spec is that you have queues and you submit commands to those queues. Command buffers are just big sequences of commands. Those commands kick off work. The kick-off happens in-order but completion is out-of-order. There are then pipeline barriers which allow you to specify dependencies between those bits of work such as "future compute work depends on previous fragment work". Importantly, those dependencies can even apply across command buffers.
So where does this leave us? Well, it depends on your submit model and exactly how you handle pipeline barriers that sync between engines. If you're taking option 3 above and doing two command buffers for each VkCommandBuffer, then you probably want two serialized timelines, one for each engine, and some mechanism to tell the kernel driver "these two command buffers have to run in parallel" so that your ping-pong works. If you're doing 1 or 2 above, I think you probably still want two simple syncobjs, one for each engine. You don't really have any need to go all that far back in history. All you really need to describe is "command buffer X depends on previous compute work" or "command buffer X depends on previous binning work". As long as your multi-submit ioctl processes the command buffers in-order, you can do it all with two syncobjs.
Sorry for the tome. I hope it wasn't too wrong and was at least a little helpful.
--Jason
On Fri, 12 Mar 2021 09:37:49 -0600 Jason Ekstrand jason@jlekstrand.net wrote:
Right, this is where things get sticky. What's guaranteed there is submission order, not execution order. But, sadly, there's more hidden in there than you might think. Before we can go there, though, we need to establish a few details of how Mali hardware works.
My understanding (feel free to correct me if I'm wrong) is that, roughly, Mali has two engines which have to work together to render: One engine for compute/vertex/geometry and one for binning and fragment. When you go to submit a draw, you fire off any geometry work on the compute engine and the fragment work on the binning engine. At a tile flush boundary (or renderpass in Vulkan), you insert a dependency between the geometry work you put on the compute engine and the binning/fragment work. In a GL driver, you'd probably kick it off to the kernel at this point.
In Vulkan, things are going to get more tricky. You can't just kick off to the kernel at tile flush boundaries. You also can't assume that all vertex work within a given command buffer is independent of all the fragment work. Let's say you have a sequence like this:
    vkBeginRenderPass() /* Writes to ImageA */
    vkCmdDraw()
    vkCmdDraw()
    ...
    vkEndRenderPass()
    vkPipelineBarrier(imageA /* fragment -> compute */)
    vkCmdDispatch() /* reads imageA, writes BufferB */
    vkBeginRenderPass() /* Writes to ImageC */
    vkCmdBindVertexBuffers(bufferB)
    vkCmdDraw();
    ...
    vkEndRenderPass()
Now you have, in a single command buffer, a draw which writes to an image which is read by a compute shader which writes to a buffer which is read by vertex fetch for another draw. The only way to implement this on a Mali like device is to ping-pong back and forth between the compute and binning engines. It'd look something like this (assuming my mental model):
    A: Vertex for the first draw on the compute engine
    B: Vertex for the second draw on the compute engine
    C: Fragment for the first draw on the binning engine; depends on A
    D: Fragment for the second draw on the binning engine; depends on B
    E: Compute on the compute engine; depends on C and D
    F: Vertex for the third draw on the compute engine; depends on E
    G: Fragment for the third draw on the binning engine; depends on F
There are a couple of options for how to do this:
- Each VkCommandBuffer is actually a bunch of command buffers for each engine with dependencies. vkQueueSubmit() calls the kernel submit ioctl a brezillion times to submit them all.
This is what we currently have.
- Same as 1, only the submit ioctl takes an array of things with dependencies so that you don't have to do as many round-trips to kernel space.
This is the PoC I worked on.
- Each VkCommandBuffer is two command buffers: one for compute and one for binning, and you use some sort of HW synchronization mechanism to handle the dependencies as you ping-pong between them.
I didn't consider that option. We have a DOORBELL instruction on Bifrost to wake up the CPU when the GPU wants to report something (not even sure we have something equivalent on Midgard), but there's no inter-queue sync mechanism AFAICT.
Option 1 is the easiest to get going with if you're working on a kernel driver that was designed for OpenGL but it has the obvious drawback of lots of smaller command buffers and lots of submits. It's not going to scale well. Option 3 is nice if the hardware can do it (I have no idea if Mali can).
Sorry if that's all a re-hash of stuff you already know. I'm just trying to establish a baseline here so we're all talking about the same things.
Ok, so what does this have to do with submission order in the spec? The mental model of the Vulkan spec is that you have queues and you submit commands to those queues. Command buffers are just big sequences of commands. Those commands kick off work. The kick-off happens in-order but completion is out-of-order. There are then pipeline barriers which allow you to specify dependencies between those bits of work such as "future compute work depends on previous fragment work". Importantly, those dependencies can even apply across command buffers.
So where does this leave us? Well, it depends on your submit model and exactly how you handle pipeline barriers that sync between engines. If you're taking option 3 above and doing two command buffers for each VkCommandBuffer, then you probably want two serialized timelines, one for each engine, and some mechanism to tell the kernel driver "these two command buffers have to run in parallel" so that your ping-pong works. If you're doing 1 or 2 above, I think you probably still want two simple syncobjs, one for each engine. You don't really have any need to go all that far back in history. All you really need to describe is "command buffer X depends on previous compute work" or "command buffer X depends on previous binning work".
Okay, so this will effectively force in-order execution. Let's take your previous example and add 2 more jobs at the end that have no deps on previous commands:
    vkBeginRenderPass() /* Writes to ImageA */
    vkCmdDraw()
    vkCmdDraw()
    ...
    vkEndRenderPass()
    vkPipelineBarrier(imageA /* fragment -> compute */)
    vkCmdDispatch() /* reads imageA, writes BufferB */
    vkBeginRenderPass() /* Writes to ImageC */
    vkCmdBindVertexBuffers(bufferB)
    vkCmdDraw();
    ...
    vkEndRenderPass()
    vkBeginRenderPass() /* Writes to ImageD */
    vkCmdDraw()
    ...
    vkEndRenderPass()
    A: Vertex for the first draw on the compute engine
    B: Vertex for the second draw on the compute engine
    C: Fragment for the first draw on the binning engine; depends on A
    D: Fragment for the second draw on the binning engine; depends on B
    E: Compute on the compute engine; depends on C and D
    F: Vertex for the third draw on the compute engine; depends on E
    G: Fragment for the third draw on the binning engine; depends on F
    H: Vertex for the fourth draw on the compute engine
    I: Fragment for the fourth draw on the binning engine
When we reach E, we might be waiting for D to finish before scheduling the job, and because of the implicit serialization we have on the compute queue (F implicitly depends on E, and H on F) we can't schedule H either, which could, in theory, be started. I guess that's where the term "submission order" is a bit unclear to me. The action of starting a job sounds like execution order to me (the order you start jobs determines the execution order, since we only have one HW queue per job type). All implicit deps have been calculated when we queued the job to the SW queue, and I thought that would be enough to meet the submission order requirements, but I might be wrong.
The PoC I have was trying to get rid of this explicit serialization on the compute and fragment queues by having one syncobj timeline (queue(<syncpoint>)) and synchronization points (Sx).
    S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point
    A: in-fences=<explicit_deps>, out-fences=<queue(1)>
    B: in-fences=<explicit_deps>, out-fences=<queue(2)>
    C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context
    D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context
    E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA
    F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B
    G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context
    H: in-fences=<explicit_deps>, out-fence=<queue(8)>
    I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer
    S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point
    # QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point
With this solution H can be started before E if the compute slot is empty and E's implicit deps are not done. It's probably overkill, but I thought maximizing GPU utilization was important.
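The gain is easy to model with a toy readiness check matching the job names above: with per-job deps, whether a job can be picked only depends on its own deps, not on its position in submission order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy scheduler readiness check for the example above: a job may be
 * picked as soon as all of its own deps are done, regardless of where
 * it sits in submission order. */
struct job {
    const char *name;
    const char *deps[2]; /* NULL-terminated, at most 2 deps here */
};

static bool job_done(const char *name, const char *const done[],
                     size_t ndone)
{
    for (size_t i = 0; i < ndone; i++)
        if (strcmp(done[i], name) == 0)
            return true;
    return false;
}

static bool job_ready(const struct job *j, const char *const done[],
                      size_t ndone)
{
    for (size_t i = 0; i < 2 && j->deps[i]; i++)
        if (!job_done(j->deps[i], done, ndone))
            return false;
    return true;
}
```

With A, B and C done but D still running, E (deps: C, D) is not ready while H (no deps) is, which is exactly the "H can be started before E" case.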
As long as your multi-submit ioctl processes the command buffers in-order, you can do it all with two syncobjs.
Sorry for the tome. I hope it wasn't too wrong and was at least a little helpful.
It definitely helps, even if the concept of submission order is still unclear to me, especially when applied to GPUs that have a single HW queue, and where submission and execution order could be considered the same thing.
On Fri, 12 Mar 2021 19:25:13 +0100 Boris Brezillon boris.brezillon@collabora.com wrote:
So where does this leave us? Well, it depends on your submit model and exactly how you handle pipeline barriers that sync between engines. If you're taking option 3 above and doing two command buffers for each VkCommandBuffer, then you probably want two serialized timelines, one for each engine, and some mechanism to tell the kernel driver "these two command buffers have to run in parallel" so that your ping-pong works. If you're doing 1 or 2 above, I think you probably still want two simple syncobjs, one for each engine. You don't really have any need to go all that far back in history. All you really need to describe is "command buffer X depends on previous compute work" or "command buffer X depends on previous binning work".
Okay, so this will effectively force in-order execution. Let's take your previous example and add 2 more jobs at the end that have no deps on previous commands:
vkBeginRenderPass() /* Writes to ImageA */ vkCmdDraw() vkCmdDraw() ... vkEndRenderPass() vkPipelineBarrier(imageA /* fragment -> compute */) vkCmdDispatch() /* reads imageA, writes BufferB */ vkBeginRenderPass() /* Writes to ImageC */ vkCmdBindVertexBuffers(bufferB) vkCmdDraw(); ... vkEndRenderPass() vkBeginRenderPass() /* Writes to ImageD */ vkCmdDraw() ... vkEndRenderPass()
A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F
H: Vertex for the fourth draw on the compute engine
I: Fragment for the fourth draw on the binning engine
When we reach E, we might be waiting for D to finish before scheduling the job, and because of the implicit serialization we have on the compute queue (F implicitly depends on E, and H on F) we can't schedule H either, which could, in theory, be started. I guess that's where the term submission order is a bit unclear to me. The action of starting a job sounds like execution order to me (the order in which you start jobs determines the execution order, since we only have one HW queue per job type). All implicit deps have been calculated when we queued the job to the SW queue, and I thought that would be enough to meet the submission-order requirements, but I might be wrong.
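To make the head-of-line-blocking concern concrete, here is a toy simulation of the A..I example above (Python purely as illustration, nothing driver-specific): each engine dequeues its software queue strictly in submission order, so H sits behind E and F on the compute queue even though H has no dependencies of its own.

```python
# Toy simulation of the A..I example: each job runs on one engine and
# each engine executes its software queue strictly in submission order.
jobs = {  # name: (engine, deps)
    "A": ("compute", []),        "B": ("compute", []),
    "C": ("binning", ["A"]),     "D": ("binning", ["B"]),
    "E": ("compute", ["C", "D"]),
    "F": ("compute", ["E"]),     "G": ("binning", ["F"]),
    "H": ("compute", []),        "I": ("binning", ["H"]),
}
queues = {"compute": ["A", "B", "E", "F", "H"],
          "binning": ["C", "D", "G", "I"]}

# With dependency-only scheduling, H would be runnable immediately:
ready_at_submit = sorted(j for j, (_, deps) in jobs.items() if not deps)
assert ready_at_submit == ["A", "B", "H"]

done, timeline = set(), []
while len(done) < len(jobs):
    progressed = False
    for eng, q in queues.items():
        # In-order dequeue: only the head of each queue may run.
        if q and all(d in done for d in jobs[q[0]][1]):
            j = q.pop(0)
            done.add(j); timeline.append(j); progressed = True
    assert progressed, "deadlock"

# In-order dequeue forces H to start after F, despite having no deps:
assert timeline.index("H") > timeline.index("F")
```

So under this model the only way to start H early would be to break the compute queue's implicit serialization, which is exactly what the timeline-syncobj PoC below was trying to do.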
The PoC I have was trying to get rid of this explicit serialization on the compute and fragment queues by having one syncobj timeline (queue(<syncpoint>)) and synchronization points (Sx).
S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point
A: in-fences=<explicit_deps>, out-fence=<queue(1)>
B: in-fences=<explicit_deps>, out-fence=<queue(2)>
C: in-fences=<explicit_deps>, out-fence=<queue(3)> #implicit dep on A through the tiler context
D: in-fences=<explicit_deps>, out-fence=<queue(4)> #implicit dep on B through the tiler context
E: in-fences=<explicit_deps>, out-fence=<queue(5)> #implicit dep on D through imageA
F: in-fences=<explicit_deps>, out-fence=<queue(6)> #implicit dep on E through buffer B
G: in-fences=<explicit_deps>, out-fence=<queue(7)> #implicit dep on F through the tiler context
H: in-fences=<explicit_deps>, out-fence=<queue(8)>
I: in-fences=<explicit_deps>, out-fence=<queue(9)> #implicit dep on H through the tiler buffer
S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore,fence sync point
# QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point
With this solution H can be started before E if the compute slot is empty and E's implicit deps are not done. It's probably overkill, but I thought maximizing GPU utilization was important.
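The guarantee this PoC leans on can be modeled in a few lines (Python purely as illustration): a DRM timeline syncobj is backed by a fence chain, so a wait on point N only completes once every point up to N has signaled, even when jobs finish out of order.

```python
# Toy model of a fence-chain-backed timeline syncobj: waiting on point N
# completes only once every point <= N has signaled.
class Timeline:
    def __init__(self):
        self.signaled = set()

    def signal(self, point):
        self.signaled.add(point)

    def wait_done(self, point):
        return all(p in self.signaled for p in range(1, point + 1))

q = Timeline()
# Jobs complete out of order: H (point 8) and I (point 9) finish while
# E..G (points 5..7) are still running.
for point in (1, 2, 3, 4, 8, 9):
    q.signal(point)
assert not q.wait_done(9)   # S1 must not fire yet: points 5..7 missing
for point in (5, 6, 7):
    q.signal(point)
assert q.wait_done(9)       # all jobs before the sync point are now done
```

That's why a single per-queue timeline plus one point per job is enough to let H run early while S1 still waits for everything.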
Nevermind, I forgot the drm scheduler was dequeuing jobs in order, so 2 syncobjs (one per queue type) is indeed the right approach.
- Each VkCommandBuffer is two command buffers: one for compute and one for binning, and you use some sort of HW synchronization mechanism to handle the dependencies as you ping-pong between them.
I didn't consider that option. We have a DOORBELL instruction on Bifrost to wake up the CPU when the GPU wants to report something (not even sure we have something equivalent on Midgard), but there's no inter-queue sync mechanism AFAICT.
I was about to say that we don't have hardware support. Even considering DOORBELL (which AFAIK isn't piped through to anything on kbase at least), I'm inclined to say we don't have hardware support. Option 2 (what kbase does? dunno how DDK vulkan works though) is probably our best bet, tbh.
On Thu, 11 Mar 2021 12:16:33 +0000 Steven Price steven.price@arm.com wrote:
Also the current code completely ignores PANFROST_BO_REF_READ. So either that should be defined as 0, or even better we support 3 modes:
- Exclusive ('write' access)
- Shared ('read' access)
- No fence - ensures the BO is mapped, but doesn't add any implicit fences.
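A toy model of how those three modes would translate into implicit dependencies, loosely following dma_resv-style semantics (a writer waits on the previous writer and all readers since; a reader waits only on the previous writer; no-fence neither waits nor installs anything). Python and all names here are purely illustrative, not the proposed uAPI:

```python
# Toy model of per-BO implicit fencing with three access modes
# (loosely dma_resv-like; names are made up for illustration).
EXCLUSIVE, SHARED, NO_FENCE = "write", "read", "none"

class Bo:
    def __init__(self):
        self.writer = None    # last exclusive-access job
        self.readers = []     # shared-access jobs since the last write

def add_job_deps(job, bo, mode):
    """Return the jobs `job` must wait on, then record its own access."""
    if mode == NO_FENCE:
        return []                       # mapped, but no implicit sync
    if mode == SHARED:
        deps = [bo.writer] if bo.writer else []
        bo.readers.append(job)          # readers may run concurrently
        return deps
    # EXCLUSIVE: wait for the previous writer and every reader since
    deps = ([bo.writer] if bo.writer else []) + bo.readers
    bo.writer, bo.readers = job, []
    return deps

bo = Bo()
assert add_job_deps("w0", bo, EXCLUSIVE) == []
assert add_job_deps("r0", bo, SHARED) == ["w0"]
assert add_job_deps("r1", bo, SHARED) == ["w0"]   # r0/r1 run concurrently
assert add_job_deps("w1", bo, EXCLUSIVE) == ["w0", "r0", "r1"]
```

This is exactly the read/read concurrency the current serialize-everything behavior loses.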
I looked at other drivers and they seem to have this NO_FENCE/NO_IMPLICIT flag at the job/submit level, conditioned on the presence of an in_sync_fd file. I can see per-BO NO_FENCE being useful for sub-buffer accesses, assuming we don't want to deal with other deps explicitly, but we could also force the NO_IMPLICIT behavior for the whole job and let the user deal with all deps explicitly. TBH, it's a bit unclear to me how that feature would be used by the Vulkan driver, so I'd rather leave that NO_FENCE use case for later.
dri-devel@lists.freedesktop.org