Hi guys,
I'd like to start a new thread about explicit fence synchronization. This time with a Nouveau twist. :-)
First, let me define what I understand by implicit/explicit sync:
Implicit synchronization
* Fences are attached to buffers
* Kernel manages fences automatically based on buffer read/write access

Explicit synchronization
* Fences are passed around independently
* Kernel takes and emits fences to/from user space when submitting work
Implicit synchronization is already implemented in open source drivers, and works well for most use cases. I don't seek to change any of that. My proposal aims to allow some drm drivers to operate in explicit sync mode to get maximal performance, while still remaining fully compatible with the implicit paradigm.
I will try to explain why I think we should support the explicit model as well.
1. Bindless graphics
Bindless graphics is a central concept for reducing OpenGL driver overhead. The idea is that the application binds a large set of buffers to the working set up front using extensions such as GL_ARB_bindless_texture, and they remain resident until the application releases them (compute APIs typically have similar semantics). These working sets can be huge, hundreds or even thousands of buffers, so we would like to opt out of the per-submit overhead of acquiring locks, waiting for fences, and storing fences. Automatically synchronizing these working sets in the kernel also prevents parallelism between channels that share the working set (in fact, sharing just one buffer from the working set causes the jobs of the two channels to be serialized).
2. Evolution of graphics APIs
The graphics API evolution seems to be going in a direction where game engine and middleware vendors demand more control over work submission and synchronization. We expect this trend to continue, with more and more synchronization decisions being pushed to the API level. OpenGL and EGL already provide good explicit command stream level synchronization primitives: glFenceSync and EGL_KHR_wait_sync. Their use is also encouraged - for example, the EGL_KHR_image_base spec clearly states that the application is responsible for synchronizing accesses to EGLImages. If the API that is exposed to developers gives control over synchronization to the developer, then implicit waits inserted by the kernel are unnecessary and unexpected, and can severely hurt performance. Implicit sync also makes it easy for a developer to write code that happens to work on Linux, but fails on other platforms.
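To make those command-stream level primitives concrete, a producer/consumer pair sharing an image might synchronize roughly like this (an illustrative sketch only; the two contexts are assumed to share objects, and draw_scene(), draw_with_texture(), producer_fbo and shared_tex are hypothetical placeholders):

/* Producer: render to the shared image, then insert a fence into the
 * command stream and flush so the consumer can pick it up. */
static GLsync producer_submit(GLuint producer_fbo)
{
	glBindFramebuffer(GL_FRAMEBUFFER, producer_fbo);
	draw_scene();                           /* hypothetical helper */
	GLsync done = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
	glFlush();
	return done;
}

/* Consumer: make the GPU wait for the producer's fence (no CPU stall),
 * then sample from the shared texture.  With EGL_KHR_wait_sync the same
 * wait can be expressed across API boundaries via eglWaitSyncKHR(). */
static void consumer_draw(GLsync done, GLuint shared_tex)
{
	glWaitSync(done, 0, GL_TIMEOUT_IGNORED);
	glBindTexture(GL_TEXTURE_2D, shared_tex);
	draw_with_texture();                    /* hypothetical helper */
	glDeleteSync(done);
}

The point is that the wait is expressed by the application itself; any implicit kernel-side wait on the same buffers would be redundant.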
3. Suballocation
Using user space suballocation can help reduce overhead when a large number of small textures are used. Synchronizing suballocated surfaces implicitly in the kernel doesn't make sense - many channels should be able to access the same kernel-level buffer object simultaneously.
4. Buffer sharing complications
This is not really an argument for explicit sync as such, but I'd like to point out that sharing buffers across SoC engines is often much more complex than just exporting and importing a dma-buf and waiting for the dma-buf fences. Sometimes we need to do color format or tiling layout conversion. Sometimes, at least on Tegra, we need to decompress buffers when we pass them from the GPU to an engine that doesn't support framebuffer compression. These things are not uncommon, particularly on SoCs that combine licensed IP blocks from different vendors. My point is that user space is already heavily involved when sharing buffers between drivers, and giving it some more control over synchronization does not add that much complexity.
Because of the above arguments, I think it makes sense to let some user space drm drivers opt out of implicit synchronization, while allowing them to remain fully compatible with the rest of the drm world that uses implicit synchronization. In practice, this would require three things (a rough user-space sketch follows the list):
(1) Support passing fences (that are not tied to buffer objects) between kernel and user space.
(2) Stop automatically storing fences to the buffers that user space wants to synchronize explicitly.
(3) Allow user space to attach an explicit fence to dma-buf when exporting to another driver that uses implicit sync.
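As a rough user-space sketch of how (1)-(3) could fit together, using the ioctl and flag names introduced later in this series (drm_fd, push, chan and bo_handle are assumed to already exist; this is illustrative, not tested code):

/* Assumes <unistd.h>, xf86drm.h, libdrm's nouveau.h and the uapi headers
 * from this series.  The working-set buffers are assumed to have been
 * marked NOUVEAU_GEM_EXPLICIT_SYNC (patch 6), so the kernel does not
 * attach fences to them at submit time - that covers (2). */
static int submit_and_export(int drm_fd, struct nouveau_pushbuf *push,
			     struct nouveau_object *chan, uint32_t bo_handle)
{
	int fence_fd = -1;
	int ret;

	/* (1) Submit work and receive an explicit fence fd (patches 4/5). */
	ret = nouveau_pushbuf_kick_fence(push, chan, &fence_fd);
	if (ret)
		return ret;

	/* (3) Attach the fence when exporting to an implicit-sync consumer
	 *     (patch 7).  The fd field carries the sync fence fd in and the
	 *     dma-buf fd out. */
	struct drm_prime_handle args = {
		.handle = bo_handle,
		.flags  = DRM_CLOEXEC | DRM_SYNC_FD,
		.fd     = fence_fd,
	};
	ret = drmIoctl(drm_fd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &args);
	close(fence_fd);
	return ret ? ret : args.fd;     /* dma-buf fd on success */
}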
There are still some open issues beyond these. For example, can we skip acquiring the ww mutex for explicitly synchronized buffers? I think we could eventually, at least on unified memory systems where we don't need to migrate between heaps (our downstream Tegra GPU driver does not lock any buffers at submit, it just grabs refcounts for hw). Another quirk is that Nouveau currently waits on the buffer fences when closing the gem object, to ensure that it doesn't unmap too early. We would need to rework that for explicit sync, but that shouldn't be difficult.
I have written a prototype that demonstrates (1) by adding explicit sync fd support to Nouveau. It's not a lot of code, because I only use a relatively small subset of the android sync driver functionality. Thanks to Maarten's rewrite, all I need to do is to allow creating a sync_fence from a drm fence in order to pass it to user space. I don't need to use sync_pt or sync_timeline, or fill in sync_timeline_ops.
I can see why upstream has been reluctant to de-stage the android sync driver in its current form, since (even though it now builds on struct fence) it still duplicates some of the drm fence concepts. I'd like to think that my patches only use the parts of the android sync driver that are genuinely missing from the drm fence model: allowing user space to operate on fence objects that are independent of buffer objects.
The last two patches are mocks that show how (2) and (3) might work out. I haven't done any testing with them yet. Before going any further, I'd like to get your feedback. Can you see the benefits of explicit sync as an alternative synchronization model? Do you think we could use the android sync_fence for passing fences between user space and the kernel? Or did you have something else in mind for explicit sync in the drm world?
Thanks, Lauri
Lauri Peltonen (7):
  android: Support creating sync fence from drm fences
  drm/nouveau: Split nouveau_fence_sync
  drm/nouveau: Add fence fd helpers
  drm/nouveau: Support fence fd's at kickoff
  libdrm: nouveau: Support fence fds
  drm/nouveau: Support marking buffers for explicit sync
  drm/prime: Support explicit fence on export
Modify sync_fence_create to accept an array of 'struct fence' objects. This will allow drm drivers to create sync_fence objects and pass sync fds to and from user space with minimal modifications, without ever creating sync_timeline or sync_pt objects, and without implementing the sync_timeline_ops interface.
Modify the sync driver debug code to not assume that every 'struct fence' (that is associated with a 'struct sync_fence') is embedded within a 'struct sync_pt'.
Signed-off-by: Lauri Peltonen <lpeltonen@nvidia.com>
---
 drivers/staging/android/sw_sync.c    |  3 ++-
 drivers/staging/android/sync.c       | 34 ++++++++++++++++++---------------
 drivers/staging/android/sync.h       | 11 ++++++-----
 drivers/staging/android/sync_debug.c | 37 +++++++++++++++++-------------------
 4 files changed, 44 insertions(+), 41 deletions(-)
diff --git a/drivers/staging/android/sw_sync.c b/drivers/staging/android/sw_sync.c index a76db3f..6949812 100644 --- a/drivers/staging/android/sw_sync.c +++ b/drivers/staging/android/sw_sync.c @@ -184,7 +184,8 @@ static long sw_sync_ioctl_create_fence(struct sw_sync_timeline *obj, }
data.name[sizeof(data.name) - 1] = '\0'; - fence = sync_fence_create(data.name, pt); + fence = sync_fence_create(data.name, + (struct fence *[]){ &pt->base }, 1); if (fence == NULL) { sync_pt_free(pt); err = -ENOMEM; diff --git a/drivers/staging/android/sync.c b/drivers/staging/android/sync.c index e7b2e02..1d0d968 100644 --- a/drivers/staging/android/sync.c +++ b/drivers/staging/android/sync.c @@ -187,28 +187,32 @@ static void fence_check_cb_func(struct fence *f, struct fence_cb *cb) wake_up_all(&fence->wq); }
-/* TODO: implement a create which takes more that one sync_pt */ -struct sync_fence *sync_fence_create(const char *name, struct sync_pt *pt) +struct sync_fence *sync_fence_create(const char *name, + struct fence **fences, int num_fences) { - struct sync_fence *fence; + struct sync_fence *sync_fence; + int size = offsetof(struct sync_fence, cbs[num_fences]); + int i;
- fence = sync_fence_alloc(offsetof(struct sync_fence, cbs[1]), name); - if (fence == NULL) + sync_fence = sync_fence_alloc(size, name); + if (sync_fence == NULL) return NULL;
- fence->num_fences = 1; - atomic_set(&fence->status, 1); + sync_fence->num_fences = num_fences; + atomic_set(&sync_fence->status, 0);
- fence_get(&pt->base); - fence->cbs[0].sync_pt = &pt->base; - fence->cbs[0].fence = fence; - if (fence_add_callback(&pt->base, &fence->cbs[0].cb, - fence_check_cb_func)) - atomic_dec(&fence->status); + for (i = 0; i < num_fences; i++) { + struct fence *f = fences[i]; + struct sync_fence_cb *cb = &sync_fence->cbs[i];
- sync_fence_debug_add(fence); + cb->sync_pt = fence_get(f); + cb->fence = sync_fence; + if (!fence_add_callback(f, &cb->cb, fence_check_cb_func)) + atomic_inc(&sync_fence->status); + } + sync_fence_debug_add(sync_fence);
- return fence; + return sync_fence; } EXPORT_SYMBOL(sync_fence_create);
diff --git a/drivers/staging/android/sync.h b/drivers/staging/android/sync.h index 66b0f43..b8ad72c 100644 --- a/drivers/staging/android/sync.h +++ b/drivers/staging/android/sync.h @@ -246,13 +246,14 @@ void sync_pt_free(struct sync_pt *pt);
/** * sync_fence_create() - creates a sync fence - * @name: name of fence to create - * @pt: sync_pt to add to the fence + * @name: name of the sync fence to create + * @fences: fences to add to the sync fence + * @num_fences: the number of fences in the @fences array * - * Creates a fence containg @pt. Once this is called, the fence takes - * ownership of @pt. + * Creates a sync fence from an array of drm fences. Takes refs to @fences. */ -struct sync_fence *sync_fence_create(const char *name, struct sync_pt *pt); +struct sync_fence *sync_fence_create(const char *name, + struct fence **fences, int num_fences);
/* * API for sync_fence consumers diff --git a/drivers/staging/android/sync_debug.c b/drivers/staging/android/sync_debug.c index 257fc91..2d8873e 100644 --- a/drivers/staging/android/sync_debug.c +++ b/drivers/staging/android/sync_debug.c @@ -81,33 +81,33 @@ static const char *sync_status_str(int status) return "error"; }
-static void sync_print_pt(struct seq_file *s, struct sync_pt *pt, bool fence) +static void sync_print_pt(struct seq_file *s, struct fence *pt, bool fence) { int status = 1; - struct sync_timeline *parent = sync_pt_parent(pt);
- if (fence_is_signaled_locked(&pt->base)) - status = pt->base.status; + if (fence_is_signaled_locked(pt)) + status = pt->status;
- seq_printf(s, " %s%spt %s", - fence ? parent->name : "", - fence ? "_" : "", - sync_status_str(status)); + if (fence) + seq_printf(s, " %d_pt %s", pt->context, + sync_status_str(status)); + else + seq_printf(s, " pt %s", sync_status_str(status));
if (status <= 0) { - struct timeval tv = ktime_to_timeval(pt->base.timestamp); + struct timeval tv = ktime_to_timeval(pt->timestamp);
seq_printf(s, "@%ld.%06ld", tv.tv_sec, tv.tv_usec); }
- if (parent->ops->timeline_value_str && - parent->ops->pt_value_str) { + if (pt->ops->timeline_value_str && + pt->ops->fence_value_str) { char value[64];
- parent->ops->pt_value_str(pt, value, sizeof(value)); + pt->ops->fence_value_str(pt, value, sizeof(value)); seq_printf(s, ": %s", value); if (fence) { - parent->ops->timeline_value_str(parent, value, + pt->ops->timeline_value_str(pt, value, sizeof(value)); seq_printf(s, " / %s", value); } @@ -121,7 +121,8 @@ static void sync_print_obj(struct seq_file *s, struct sync_timeline *obj) struct list_head *pos; unsigned long flags;
- seq_printf(s, "%s %s", obj->name, obj->ops->driver_name); + seq_printf(s, "%d %s %s", obj->context, obj->name, + obj->ops->driver_name);
if (obj->ops->timeline_value_str) { char value[64]; @@ -136,7 +137,7 @@ static void sync_print_obj(struct seq_file *s, struct sync_timeline *obj) list_for_each(pos, &obj->child_list_head) { struct sync_pt *pt = container_of(pos, struct sync_pt, child_list); - sync_print_pt(s, pt, false); + sync_print_pt(s, &pt->base, false); } spin_unlock_irqrestore(&obj->child_list_lock, flags); } @@ -151,11 +152,7 @@ static void sync_print_fence(struct seq_file *s, struct sync_fence *fence) sync_status_str(atomic_read(&fence->status)));
for (i = 0; i < fence->num_fences; ++i) { - struct sync_pt *pt = - container_of(fence->cbs[i].sync_pt, - struct sync_pt, base); - - sync_print_pt(s, pt, true); + sync_print_pt(s, fence->cbs[i].sync_pt, true); }
spin_lock_irqsave(&fence->wq.lock, flags);
Hey,
On 26-09-14 12:00, Lauri Peltonen wrote:
Modify sync_fence_create to accept an array of 'struct fence' objects. This will allow drm drivers to create sync_fence objects and pass sync fd's between user space with minimal modifications, without ever creating sync_timeline or sync_pt objects, and without implementing the sync_timeline_ops interface.
Modify the sync driver debug code to not assume that every 'struct fence' (that is associated with a 'struct sync_fence') is embedded within a 'struct sync_pt'.
Signed-off-by: Lauri Peltonen lpeltonen@nvidia.com
drivers/staging/android/sw_sync.c | 3 ++- drivers/staging/android/sync.c | 34 ++++++++++++++++++--------------- drivers/staging/android/sync.h | 11 ++++++----- drivers/staging/android/sync_debug.c | 37 +++++++++++++++++------------------- 4 files changed, 44 insertions(+), 41 deletions(-)
sync_fence_merge currently depends on the list of fences being sorted to remove duplicates efficiently. Feeding it an unsorted list will probably result in hard to debug bugs. :-)
Split nouveau_fence_sync into two functions:

* nouveau_fence_sync, which only adds a fence wait to the channel command stream, and
* nouveau_bo_sync, which gets the fences from the reservation object and passes them to nouveau_fence_sync.
Signed-off-by: Lauri Peltonen <lpeltonen@nvidia.com>
---
 drm/nouveau_bo.c      | 38 +++++++++++++++++++++++++++++++++++-
 drm/nouveau_bo.h      |  2 ++
 drm/nouveau_display.c |  4 ++--
 drm/nouveau_fence.c   | 54 ++++++++-------------------------------------------
 drm/nouveau_fence.h   |  2 +-
 drm/nouveau_gem.c     |  2 +-
 6 files changed, 51 insertions(+), 51 deletions(-)
diff --git a/drm/nouveau_bo.c b/drm/nouveau_bo.c index 70c3cb5..534734a 100644 --- a/drm/nouveau_bo.c +++ b/drm/nouveau_bo.c @@ -463,6 +463,42 @@ nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo) }
int +nouveau_bo_sync(struct nouveau_bo *nvbo, struct nouveau_channel *chan, bool exclusive) +{ + struct fence *fence; + struct reservation_object *resv = nvbo->bo.resv; + struct reservation_object_list *fobj; + int ret = 0, i; + + if (!exclusive) { + ret = reservation_object_reserve_shared(resv); + + if (ret) + return ret; + } + + fobj = reservation_object_get_list(resv); + fence = reservation_object_get_excl(resv); + + if (fence && (!exclusive || !fobj || !fobj->shared_count)) + return nouveau_fence_sync(fence, chan); + + if (!exclusive || !fobj) + return ret; + + for (i = 0; i < fobj->shared_count && !ret; ++i) { + fence = rcu_dereference_protected(fobj->shared[i], + reservation_object_held(resv)); + ret = nouveau_fence_sync(fence, chan); + + if (ret) + break; + } + + return ret; +} + +int nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible, bool no_wait_gpu) { @@ -1070,7 +1106,7 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict, bool intr, }
mutex_lock_nested(&cli->mutex, SINGLE_DEPTH_NESTING); - ret = nouveau_fence_sync(nouveau_bo(bo), chan, true); + ret = nouveau_bo_sync(nouveau_bo(bo), chan, true); if (ret == 0) { ret = drm->ttm.move(chan, bo, &bo->mem, new_mem); if (ret == 0) { diff --git a/drm/nouveau_bo.h b/drm/nouveau_bo.h index 8ab877b..f97bc26 100644 --- a/drm/nouveau_bo.h +++ b/drm/nouveau_bo.h @@ -84,6 +84,8 @@ int nouveau_bo_validate(struct nouveau_bo *, bool interruptible, bool no_wait_gpu); void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo); void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo); +int nouveau_bo_sync(struct nouveau_bo *, struct nouveau_channel *, + bool exclusive);
struct nouveau_vma * nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *); diff --git a/drm/nouveau_display.c b/drm/nouveau_display.c index 6d0a3cd..41b130c 100644 --- a/drm/nouveau_display.c +++ b/drm/nouveau_display.c @@ -658,7 +658,7 @@ nouveau_page_flip_emit(struct nouveau_channel *chan, spin_unlock_irqrestore(&dev->event_lock, flags);
/* Synchronize with the old framebuffer */ - ret = nouveau_fence_sync(old_bo, chan, false); + ret = nouveau_bo_sync(old_bo, chan, false); if (ret) goto fail;
@@ -722,7 +722,7 @@ nouveau_crtc_page_flip(struct drm_crtc *crtc, struct drm_framebuffer *fb, goto fail_unpin;
/* synchronise rendering channel with the kernel's channel */ - ret = nouveau_fence_sync(new_bo, chan, false); + ret = nouveau_bo_sync(new_bo, chan, false); if (ret) { ttm_bo_unreserve(&new_bo->bo); goto fail_unpin; diff --git a/drm/nouveau_fence.c b/drm/nouveau_fence.c index decfe6c..39b5436 100644 --- a/drm/nouveau_fence.c +++ b/drm/nouveau_fence.c @@ -342,57 +342,19 @@ nouveau_fence_wait(struct nouveau_fence *fence, bool lazy, bool intr) }
int -nouveau_fence_sync(struct nouveau_bo *nvbo, struct nouveau_channel *chan, bool exclusive) +nouveau_fence_sync(struct fence *fence, struct nouveau_channel *chan) { struct nouveau_fence_chan *fctx = chan->fence; - struct fence *fence; - struct reservation_object *resv = nvbo->bo.resv; - struct reservation_object_list *fobj; + struct nouveau_channel *prev = NULL; struct nouveau_fence *f; - int ret = 0, i; - - if (!exclusive) { - ret = reservation_object_reserve_shared(resv); - - if (ret) - return ret; - } - - fobj = reservation_object_get_list(resv); - fence = reservation_object_get_excl(resv); - - if (fence && (!exclusive || !fobj || !fobj->shared_count)) { - struct nouveau_channel *prev = NULL; - - f = nouveau_local_fence(fence, chan->drm); - if (f) - prev = f->channel; - - if (!prev || (prev != chan && (ret = fctx->sync(f, prev, chan)))) - ret = fence_wait(fence, true); - - return ret; - } - - if (!exclusive || !fobj) - return ret; - - for (i = 0; i < fobj->shared_count && !ret; ++i) { - struct nouveau_channel *prev = NULL; - - fence = rcu_dereference_protected(fobj->shared[i], - reservation_object_held(resv)); - - f = nouveau_local_fence(fence, chan->drm); - if (f) - prev = f->channel; + int ret = 0;
- if (!prev || (ret = fctx->sync(f, prev, chan))) - ret = fence_wait(fence, true); + f = nouveau_local_fence(fence, chan->drm); + if (f) + prev = f->channel;
- if (ret) - break; - } + if (!prev || (prev != chan && (ret = fctx->sync(f, prev, chan)))) + ret = fence_wait(fence, true);
return ret; } diff --git a/drm/nouveau_fence.h b/drm/nouveau_fence.h index 986c813..8b70166 100644 --- a/drm/nouveau_fence.h +++ b/drm/nouveau_fence.h @@ -26,7 +26,7 @@ int nouveau_fence_emit(struct nouveau_fence *, struct nouveau_channel *); bool nouveau_fence_done(struct nouveau_fence *); void nouveau_fence_work(struct fence *, void (*)(void *), void *); int nouveau_fence_wait(struct nouveau_fence *, bool lazy, bool intr); -int nouveau_fence_sync(struct nouveau_bo *, struct nouveau_channel *, bool exclusive); +int nouveau_fence_sync(struct fence *, struct nouveau_channel *);
struct nouveau_fence_chan { spinlock_t lock; diff --git a/drm/nouveau_gem.c b/drm/nouveau_gem.c index a43af30..78398d4 100644 --- a/drm/nouveau_gem.c +++ b/drm/nouveau_gem.c @@ -459,7 +459,7 @@ validate_list(struct nouveau_channel *chan, struct nouveau_cli *cli, return ret; }
- ret = nouveau_fence_sync(nvbo, chan, !!b->write_domains); + ret = nouveau_bo_sync(nvbo, chan, !!b->write_domains); if (unlikely(ret)) { if (ret != -ERESTARTSYS) NV_PRINTK(error, cli, "fail post-validate sync\n");
Add nouveau_fence_install, which installs a drm fence as a file descriptor that can be returned to user space.
Add nouveau_fence_sync_fd, which pushes semaphore wait commands for each Nouveau fence contained within the sync fd. If the sync fd contains non-Nouveau fences, those are waited for on the CPU.

Add the missing fence_value_str and timeline_value_str callbacks to the Nouveau fence_ops.
Signed-off-by: Lauri Peltonen <lpeltonen@nvidia.com>
---
 drm/nouveau_fence.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 drm/nouveau_fence.h |  2 ++
 2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/drm/nouveau_fence.c b/drm/nouveau_fence.c index 39b5436..e67d467 100644 --- a/drm/nouveau_fence.c +++ b/drm/nouveau_fence.c @@ -37,6 +37,8 @@ #include "nouveau_dma.h" #include "nouveau_fence.h"
+#include "../drivers/staging/android/sync.h" + static const struct fence_ops nouveau_fence_ops_uevent; static const struct fence_ops nouveau_fence_ops_legacy;
@@ -471,11 +473,78 @@ static bool nouveau_fence_enable_signaling(struct fence *f) return ret; }
+static void nouveau_fence_timeline_value_str(struct fence *fence, + char *str, int size) +{ + struct nouveau_fence *f = from_fence(fence); + struct nouveau_fence_chan *fctx = nouveau_fctx(f); + u32 cur; + + cur = f->channel ? fctx->read(f->channel) : 0; + snprintf(str, size, "%d", cur); +} + +static void +nouveau_fence_value_str(struct fence *fence, char *str, int size) +{ + snprintf(str, size, "%d", fence->seqno); +} + static const struct fence_ops nouveau_fence_ops_uevent = { .get_driver_name = nouveau_fence_get_get_driver_name, .get_timeline_name = nouveau_fence_get_timeline_name, .enable_signaling = nouveau_fence_enable_signaling, .signaled = nouveau_fence_is_signaled, .wait = fence_default_wait, - .release = NULL + .release = NULL, + .fence_value_str = nouveau_fence_value_str, + .timeline_value_str = nouveau_fence_timeline_value_str, }; + +int +nouveau_fence_install(struct fence *fence, const char *name, int *fd_out) +{ +#ifdef CONFIG_SYNC + struct sync_fence *f; + int fd; + + f = sync_fence_create(name, &fence, 1); + if (!f) + return -ENOMEM; + + fd = get_unused_fd(); + if (fd < 0) { + sync_fence_put(f); + return fd; + } + sync_fence_install(f, fd); + *fd_out = fd; + return 0; +#else + return -ENODEV; +#endif +} + +int +nouveau_fence_sync_fd(int fence_fd, struct nouveau_channel *chan) +{ +#ifdef CONFIG_SYNC + int i, ret = 0; + struct sync_fence *fence; + + fence = sync_fence_fdget(fence_fd); + if (!fence) + return -EINVAL; + + for (i = 0; i < fence->num_fences; ++i) { + struct fence *pt = fence->cbs[i].sync_pt; + + ret = nouveau_fence_sync(pt, chan); + if (ret) + break; + } + sync_fence_put(fence); + return ret; +#else + return -ENODEV; +#endif +} diff --git a/drm/nouveau_fence.h b/drm/nouveau_fence.h index 8b70166..76342ea 100644 --- a/drm/nouveau_fence.h +++ b/drm/nouveau_fence.h @@ -27,6 +27,8 @@ bool nouveau_fence_done(struct nouveau_fence *); void nouveau_fence_work(struct fence *, void (*)(void *), void *); int nouveau_fence_wait(struct nouveau_fence *, bool lazy, bool intr); int nouveau_fence_sync(struct fence *, struct nouveau_channel *); +int nouveau_fence_sync_fd(int, struct nouveau_channel *); +int nouveau_fence_install(struct fence *, const char *name, int *);
struct nouveau_fence_chan { spinlock_t lock;
Add a new NOUVEAU_GEM_PUSHBUF_2 ioctl that accepts and emits a sync fence fd from/to user space when user space requests it by passing the corresponding flags.
Signed-off-by: Lauri Peltonen <lpeltonen@nvidia.com>
---
 drm/nouveau_drm.c          |  1 +
 drm/nouveau_gem.c          | 46 ++++++++++++++++++++++++++++++++++++++++++----
 drm/nouveau_gem.h          |  2 ++
 drm/uapi/drm/nouveau_drm.h | 11 +++++++++++
 4 files changed, 56 insertions(+), 4 deletions(-)
diff --git a/drm/nouveau_drm.c b/drm/nouveau_drm.c index 244d78f..74d5ac6 100644 --- a/drm/nouveau_drm.c +++ b/drm/nouveau_drm.c @@ -812,6 +812,7 @@ nouveau_ioctls[] = { DRM_IOCTL_DEF_DRV(NOUVEAU_GPUOBJ_FREE, nouveau_abi16_ioctl_gpuobj_free, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_NEW, nouveau_gem_ioctl_new, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_PUSHBUF, nouveau_gem_ioctl_pushbuf, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), + DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_PUSHBUF_2, nouveau_gem_ioctl_pushbuf_2, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_PREP, nouveau_gem_ioctl_cpu_prep, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_FINI, nouveau_gem_ioctl_cpu_fini, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_INFO, nouveau_gem_ioctl_info, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), diff --git a/drm/nouveau_gem.c b/drm/nouveau_gem.c index 78398d4..ee5782c 100644 --- a/drm/nouveau_gem.c +++ b/drm/nouveau_gem.c @@ -636,15 +636,16 @@ nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli, return ret; }
-int -nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data, - struct drm_file *file_priv) +static int +__nouveau_gem_ioctl_pushbuf(struct drm_device *dev, + struct drm_nouveau_gem_pushbuf *req, + struct drm_nouveau_gem_pushbuf_2 *req_2, + struct drm_file *file_priv) { struct nouveau_abi16 *abi16 = nouveau_abi16_get(file_priv, dev); struct nouveau_cli *cli = nouveau_cli(file_priv); struct nouveau_abi16_chan *temp; struct nouveau_drm *drm = nouveau_drm(dev); - struct drm_nouveau_gem_pushbuf *req = data; struct drm_nouveau_gem_pushbuf_push *push; struct drm_nouveau_gem_pushbuf_bo *bo; struct nouveau_channel *chan = NULL; @@ -725,6 +726,14 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data, } }
+ if (req_2 && (req_2->flags & NOUVEAU_GEM_PUSHBUF_2_FENCE_WAIT)) { + ret = nouveau_fence_sync_fd(req_2->fence, chan); + if (ret) { + NV_PRINTK(error, cli, "fence wait: %d\n", ret); + goto out; + } + } + if (chan->dma.ib_max) { ret = nouveau_dma_wait(chan, req->nr_push + 1, 16); if (ret) { @@ -800,6 +809,16 @@ nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data, goto out; }
+ if (req_2 && (req_2->flags & NOUVEAU_GEM_PUSHBUF_2_FENCE_EMIT)) { + ret = nouveau_fence_install(&fence->base, "nv-pushbuf", + &req_2->fence); + if (ret) { + NV_PRINTK(error, cli, "fence install: %d\n", ret); + WIND_RING(chan); + goto out; + } + } + out: validate_fini(&op, fence, bo); nouveau_fence_unref(&fence); @@ -825,6 +844,25 @@ out_next: return nouveau_abi16_put(abi16, ret); }
+int +nouveau_gem_ioctl_pushbuf_2(struct drm_device *dev, void *data, + struct drm_file *file_priv) +{ + struct drm_nouveau_gem_pushbuf_2 *req_2 = data; + struct drm_nouveau_gem_pushbuf *req = &req_2->base; + + return __nouveau_gem_ioctl_pushbuf(dev, req, req_2, file_priv); +} + +int +nouveau_gem_ioctl_pushbuf(struct drm_device *dev, void *data, + struct drm_file *file_priv) +{ + struct drm_nouveau_gem_pushbuf *req = data; + + return __nouveau_gem_ioctl_pushbuf(dev, req, NULL, file_priv); +} + static inline uint32_t domain_to_ttm(struct nouveau_bo *nvbo, uint32_t domain) { diff --git a/drm/nouveau_gem.h b/drm/nouveau_gem.h index ddab762..7454dea 100644 --- a/drm/nouveau_gem.h +++ b/drm/nouveau_gem.h @@ -27,6 +27,8 @@ extern int nouveau_gem_ioctl_new(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_pushbuf(struct drm_device *, void *, struct drm_file *); +extern int nouveau_gem_ioctl_pushbuf_2(struct drm_device *, void *, + struct drm_file *); extern int nouveau_gem_ioctl_cpu_prep(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_cpu_fini(struct drm_device *, void *, diff --git a/drm/uapi/drm/nouveau_drm.h b/drm/uapi/drm/nouveau_drm.h index 0d7608d..394cd94 100644 --- a/drm/uapi/drm/nouveau_drm.h +++ b/drm/uapi/drm/nouveau_drm.h @@ -115,6 +115,15 @@ struct drm_nouveau_gem_pushbuf { uint64_t gart_available; };
+#define NOUVEAU_GEM_PUSHBUF_2_FENCE_WAIT 0x00000001 +#define NOUVEAU_GEM_PUSHBUF_2_FENCE_EMIT 0x00000002 +struct drm_nouveau_gem_pushbuf_2 { + struct drm_nouveau_gem_pushbuf base; + uint32_t flags; + int32_t fence; + uint64_t reserved; +}; + #define NOUVEAU_GEM_CPU_PREP_NOWAIT 0x00000001 #define NOUVEAU_GEM_CPU_PREP_WRITE 0x00000004 struct drm_nouveau_gem_cpu_prep { @@ -139,9 +148,11 @@ struct drm_nouveau_gem_cpu_fini { #define DRM_NOUVEAU_GEM_CPU_PREP 0x42 #define DRM_NOUVEAU_GEM_CPU_FINI 0x43 #define DRM_NOUVEAU_GEM_INFO 0x44 +#define DRM_NOUVEAU_GEM_PUSHBUF_2 0x45
#define DRM_IOCTL_NOUVEAU_GEM_NEW DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_NEW, struct drm_nouveau_gem_new) #define DRM_IOCTL_NOUVEAU_GEM_PUSHBUF DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_PUSHBUF, struct drm_nouveau_gem_pushbuf) +#define DRM_IOCTL_NOUVEAU_GEM_PUSHBUF_2 DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_PUSHBUF_2, struct drm_nouveau_gem_pushbuf_2) #define DRM_IOCTL_NOUVEAU_GEM_CPU_PREP DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_PREP, struct drm_nouveau_gem_cpu_prep) #define DRM_IOCTL_NOUVEAU_GEM_CPU_FINI DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_FINI, struct drm_nouveau_gem_cpu_fini) #define DRM_IOCTL_NOUVEAU_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_INFO, struct drm_nouveau_gem_info)
Add a new nouveau_pushbuf_kick_fence function that takes and emits a sync fence fd. The fence fd can be waited on, merged with other fence fds, or passed back to the kernel as a prerequisite for a subsequent hw operation.
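Because the returned fd is an ordinary sync fence fd, a CPU-side wait needs no new API; a minimal sketch (assuming push and chan are already set up, and <poll.h>/<unistd.h> are included):

static int kick_and_wait(struct nouveau_pushbuf *push,
			 struct nouveau_object *chan)
{
	int fence_fd = -1;
	int ret;

	ret = nouveau_pushbuf_kick_fence(push, chan, &fence_fd);
	if (ret)
		return ret;

	/* A sync fence fd signals POLLIN once all fences contained in it
	 * have signaled, so a plain poll() blocks until the GPU work behind
	 * this submission has completed. */
	struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
	poll(&pfd, 1, -1);

	close(fence_fd);
	return 0;
}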
Signed-off-by: Lauri Peltonen <lpeltonen@nvidia.com>
---
 include/drm/nouveau_drm.h | 10 +++++
 nouveau/nouveau.h         |  2 +
 nouveau/pushbuf.c         | 93 ++++++++++++++++++++++++++++++++---------------
 3 files changed, 76 insertions(+), 29 deletions(-)
diff --git a/include/drm/nouveau_drm.h b/include/drm/nouveau_drm.h index b18cad0..3e40210 100644 --- a/include/drm/nouveau_drm.h +++ b/include/drm/nouveau_drm.h @@ -171,6 +171,15 @@ struct drm_nouveau_gem_pushbuf { uint64_t gart_available; };
+#define NOUVEAU_GEM_PUSHBUF_2_FENCE_WAIT 0x00000001 +#define NOUVEAU_GEM_PUSHBUF_2_FENCE_EMIT 0x00000002 +struct drm_nouveau_gem_pushbuf_2 { + struct drm_nouveau_gem_pushbuf base; + uint32_t flags; + int32_t fence; + uint64_t reserved; +}; + #define NOUVEAU_GEM_CPU_PREP_NOWAIT 0x00000001 #define NOUVEAU_GEM_CPU_PREP_NOBLOCK 0x00000002 #define NOUVEAU_GEM_CPU_PREP_WRITE 0x00000004 @@ -204,5 +213,6 @@ struct drm_nouveau_sarea { #define DRM_NOUVEAU_GEM_CPU_PREP 0x42 #define DRM_NOUVEAU_GEM_CPU_FINI 0x43 #define DRM_NOUVEAU_GEM_INFO 0x44 +#define DRM_NOUVEAU_GEM_PUSHBUF_2 0x45
#endif /* __NOUVEAU_DRM_H__ */ diff --git a/nouveau/nouveau.h b/nouveau/nouveau.h index a55e2b0..281420f 100644 --- a/nouveau/nouveau.h +++ b/nouveau/nouveau.h @@ -225,6 +225,8 @@ void nouveau_pushbuf_reloc(struct nouveau_pushbuf *, struct nouveau_bo *, int nouveau_pushbuf_validate(struct nouveau_pushbuf *); uint32_t nouveau_pushbuf_refd(struct nouveau_pushbuf *, struct nouveau_bo *); int nouveau_pushbuf_kick(struct nouveau_pushbuf *, struct nouveau_object *channel); +int nouveau_pushbuf_kick_fence(struct nouveau_pushbuf *, + struct nouveau_object *channel, int *fence); struct nouveau_bufctx * nouveau_pushbuf_bufctx(struct nouveau_pushbuf *, struct nouveau_bufctx *);
diff --git a/nouveau/pushbuf.c b/nouveau/pushbuf.c index 6e703a4..8667d05 100644 --- a/nouveau/pushbuf.c +++ b/nouveau/pushbuf.c @@ -33,6 +33,7 @@ #include <string.h> #include <assert.h> #include <errno.h> +#include <unistd.h>
#include <xf86drm.h> #include <xf86atomic.h> @@ -77,7 +78,7 @@ nouveau_pushbuf(struct nouveau_pushbuf *push) }
static int pushbuf_validate(struct nouveau_pushbuf *, bool); -static int pushbuf_flush(struct nouveau_pushbuf *); +static int pushbuf_flush(struct nouveau_pushbuf *, int *);
static bool pushbuf_kref_fits(struct nouveau_pushbuf *push, struct nouveau_bo *bo, @@ -172,7 +173,7 @@ pushbuf_kref(struct nouveau_pushbuf *push, struct nouveau_bo *bo, */ fpush = cli_push_get(push->client, bo); if (fpush && fpush != push) - pushbuf_flush(fpush); + pushbuf_flush(fpush, NULL);
kref = cli_kref_get(push->client, bo); if (kref) { @@ -305,18 +306,21 @@ pushbuf_dump(struct nouveau_pushbuf_krec *krec, int krec_id, int chid) }
static int -pushbuf_submit(struct nouveau_pushbuf *push, struct nouveau_object *chan) +pushbuf_submit(struct nouveau_pushbuf *push, struct nouveau_object *chan, + int *fence) { struct nouveau_pushbuf_priv *nvpb = nouveau_pushbuf(push); struct nouveau_pushbuf_krec *krec = nvpb->list; struct nouveau_device *dev = push->client->device; struct drm_nouveau_gem_pushbuf_bo_presumed *info; struct drm_nouveau_gem_pushbuf_bo *kref; - struct drm_nouveau_gem_pushbuf req; + struct drm_nouveau_gem_pushbuf_2 req_2; + struct drm_nouveau_gem_pushbuf *req = &req_2.base; struct nouveau_fifo *fifo = chan->data; struct nouveau_bo *bo; int krec_id = 0; int ret = 0, i; + int fence_out = -1;
if (chan->oclass != NOUVEAU_FIFO_CHANNEL_CLASS) return -EINVAL; @@ -326,30 +330,42 @@ pushbuf_submit(struct nouveau_pushbuf *push, struct nouveau_object *chan)
nouveau_pushbuf_data(push, NULL, 0, 0);
+ /* TODO: If fence is requested, force kickoff. */ while (krec && krec->nr_push) { - req.channel = fifo->channel; - req.nr_buffers = krec->nr_buffer; - req.buffers = (uint64_t)(unsigned long)krec->buffer; - req.nr_relocs = krec->nr_reloc; - req.nr_push = krec->nr_push; - req.relocs = (uint64_t)(unsigned long)krec->reloc; - req.push = (uint64_t)(unsigned long)krec->push; - req.suffix0 = nvpb->suffix0; - req.suffix1 = nvpb->suffix1; - req.vram_available = 0; /* for valgrind */ - req.gart_available = 0; + req->channel = fifo->channel; + req->nr_buffers = krec->nr_buffer; + req->buffers = (uint64_t)(unsigned long)krec->buffer; + req->nr_relocs = krec->nr_reloc; + req->nr_push = krec->nr_push; + req->relocs = (uint64_t)(unsigned long)krec->reloc; + req->push = (uint64_t)(unsigned long)krec->push; + req->suffix0 = nvpb->suffix0; + req->suffix1 = nvpb->suffix1; + req->vram_available = 0; /* for valgrind */ + req->gart_available = 0; + if (fence) { + req_2.flags = NOUVEAU_GEM_PUSHBUF_2_FENCE_EMIT; + if (*fence >= 0) + req_2.flags |= NOUVEAU_GEM_PUSHBUF_2_FENCE_WAIT; + req_2.fence = *fence; + req_2.reserved = 0; + }
if (dbg_on(0)) pushbuf_dump(krec, krec_id++, fifo->channel);
#ifndef SIMULATE - ret = drmCommandWriteRead(dev->fd, DRM_NOUVEAU_GEM_PUSHBUF, - &req, sizeof(req)); - nvpb->suffix0 = req.suffix0; - nvpb->suffix1 = req.suffix1; - dev->vram_limit = (req.vram_available * + if (fence) + ret = drmCommandWriteRead(dev->fd, DRM_NOUVEAU_GEM_PUSHBUF_2, + &req_2, sizeof(req_2)); + else + ret = drmCommandWriteRead(dev->fd, DRM_NOUVEAU_GEM_PUSHBUF, + req, sizeof(*req)); + nvpb->suffix0 = req->suffix0; + nvpb->suffix1 = req->suffix1; + dev->vram_limit = (req->vram_available * nouveau_device(dev)->vram_limit_percent) / 100; - dev->gart_limit = (req.gart_available * + dev->gart_limit = (req->gart_available * nouveau_device(dev)->gart_limit_percent) / 100; #else if (dbg_on(31)) @@ -362,6 +378,12 @@ pushbuf_submit(struct nouveau_pushbuf *push, struct nouveau_object *chan) break; }
+ if (fence) { + if (fence_out >= 0) + close(fence_out); + fence_out = req_2.fence; + } + kref = krec->buffer; for (i = 0; i < krec->nr_buffer; i++, kref++) { bo = (void *)(unsigned long)kref->user_priv; @@ -385,11 +407,17 @@ pushbuf_submit(struct nouveau_pushbuf *push, struct nouveau_object *chan) krec = krec->next; }
+ if (!ret && fence) { + if (*fence >= 0) + close(*fence); + *fence = fence_out; + } + return ret; }
static int -pushbuf_flush(struct nouveau_pushbuf *push) +pushbuf_flush(struct nouveau_pushbuf *push, int *fence) { struct nouveau_pushbuf_priv *nvpb = nouveau_pushbuf(push); struct nouveau_pushbuf_krec *krec = nvpb->krec; @@ -399,7 +427,7 @@ pushbuf_flush(struct nouveau_pushbuf *push) int ret = 0, i;
if (push->channel) { - ret = pushbuf_submit(push, push->channel); + ret = pushbuf_submit(push, push->channel, fence); } else { nouveau_pushbuf_data(push, NULL, 0, 0); krec->next = malloc(sizeof(*krec)); @@ -469,7 +497,7 @@ pushbuf_refn(struct nouveau_pushbuf *push, bool retry, if (ret) { pushbuf_refn_fail(push, sref, krec->nr_reloc); if (retry) { - pushbuf_flush(push); + pushbuf_flush(push, NULL); nouveau_pushbuf_space(push, 0, 0, 0); return pushbuf_refn(push, false, refs, nr); } @@ -521,7 +549,7 @@ pushbuf_validate(struct nouveau_pushbuf *push, bool retry) if (ret) { pushbuf_refn_fail(push, sref, srel); if (retry) { - pushbuf_flush(push); + pushbuf_flush(push, NULL); return pushbuf_validate(push, false); } } @@ -673,7 +701,7 @@ nouveau_pushbuf_space(struct nouveau_pushbuf *push, krec->nr_reloc + relocs >= NOUVEAU_GEM_MAX_RELOCS || krec->nr_push + pushes >= NOUVEAU_GEM_MAX_PUSH) { if (nvpb->bo && krec->nr_buffer) - pushbuf_flush(push); + pushbuf_flush(push, NULL); flushed = true; }
@@ -767,10 +795,17 @@ nouveau_pushbuf_refd(struct nouveau_pushbuf *push, struct nouveau_bo *bo) }
drm_public int -nouveau_pushbuf_kick(struct nouveau_pushbuf *push, struct nouveau_object *chan) +nouveau_pushbuf_kick_fence(struct nouveau_pushbuf *push, + struct nouveau_object *chan, int *fence) { if (!push->channel) - return pushbuf_submit(push, chan); - pushbuf_flush(push); + return pushbuf_submit(push, chan, fence); + pushbuf_flush(push, fence); return pushbuf_validate(push, false); } + +drm_public int +nouveau_pushbuf_kick(struct nouveau_pushbuf *push, struct nouveau_object *chan) +{ + return nouveau_pushbuf_kick_fence(push, chan, NULL); +}
Do not attach fences automatically to buffers that are marked for explicit synchronization.
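For illustration, user space could flip a buffer to explicit synchronization with something like the following (a sketch against the DRM_IOCTL_NOUVEAU_GEM_SET_INFO ioctl added below; only the handle and bo_flags fields are consumed by the new handler, and drm_fd/bo_handle are assumptions):

static int set_explicit_sync(int drm_fd, uint32_t bo_handle, int enable)
{
	struct drm_nouveau_gem_info info = {
		.handle   = bo_handle,
		/* Setting the flag opts the buffer out of implicit fencing;
		 * clearing it (bo_flags = 0) switches it back. */
		.bo_flags = enable ? NOUVEAU_GEM_EXPLICIT_SYNC : 0,
	};

	return drmIoctl(drm_fd, DRM_IOCTL_NOUVEAU_GEM_SET_INFO, &info);
}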
Signed-off-by: Lauri Peltonen <lpeltonen@nvidia.com>
---
 drm/nouveau_bo.c           |  8 ++++----
 drm/nouveau_bo.h           |  4 ++--
 drm/nouveau_drm.c          |  1 +
 drm/nouveau_gem.c          | 47 +++++++++++++++++++++++++++++++++++++++-------
 drm/nouveau_gem.h          |  6 ++++--
 drm/nouveau_ttm.c          |  8 ++++----
 drm/nv50_display.c         |  2 +-
 drm/uapi/drm/nouveau_drm.h |  5 ++++-
 8 files changed, 60 insertions(+), 21 deletions(-)
diff --git a/drm/nouveau_bo.c b/drm/nouveau_bo.c index 534734a..68b7bdd 100644 --- a/drm/nouveau_bo.c +++ b/drm/nouveau_bo.c @@ -180,7 +180,7 @@ nouveau_bo_fixup_align(struct nouveau_bo *nvbo, u32 flags,
int nouveau_bo_new(struct drm_device *dev, int size, int align, - uint32_t flags, uint32_t tile_mode, uint32_t tile_flags, + uint32_t flags, uint32_t tile_mode, uint32_t bo_flags, struct sg_table *sg, struct nouveau_bo **pnvbo) { @@ -211,7 +211,7 @@ nouveau_bo_new(struct drm_device *dev, int size, int align, INIT_LIST_HEAD(&nvbo->entry); INIT_LIST_HEAD(&nvbo->vma_list); nvbo->tile_mode = tile_mode; - nvbo->tile_flags = tile_flags; + nvbo->bo_flags = bo_flags; nvbo->bo.bdev = &drm->ttm.bdev;
if (!nv_device_is_cpu_coherent(nvkm_device(&drm->device))) @@ -272,7 +272,7 @@ set_placement_range(struct nouveau_bo *nvbo, uint32_t type) * speed up when alpha-blending and depth-test are enabled * at the same time. */ - if (nvbo->tile_flags & NOUVEAU_GEM_TILE_ZETA) { + if (nvbo->bo_flags & NOUVEAU_GEM_TILE_ZETA) { fpfn = vram_pages / 2; lpfn = ~0; } else { @@ -1291,7 +1291,7 @@ nouveau_bo_vm_bind(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem, if (drm->device.info.family >= NV_DEVICE_INFO_V0_CELSIUS) { *new_tile = nv10_bo_set_tiling(dev, offset, new_mem->size, nvbo->tile_mode, - nvbo->tile_flags); + nvbo->bo_flags); }
return 0; diff --git a/drm/nouveau_bo.h b/drm/nouveau_bo.h index f97bc26..ff1edba 100644 --- a/drm/nouveau_bo.h +++ b/drm/nouveau_bo.h @@ -25,7 +25,7 @@ struct nouveau_bo { unsigned page_shift;
u32 tile_mode; - u32 tile_flags; + u32 bo_flags; struct nouveau_drm_tile *tile;
/* Only valid if allocated via nouveau_gem_new() and iff you hold a @@ -68,7 +68,7 @@ extern struct ttm_bo_driver nouveau_bo_driver;
void nouveau_bo_move_init(struct nouveau_drm *); int nouveau_bo_new(struct drm_device *, int size, int align, u32 flags, - u32 tile_mode, u32 tile_flags, struct sg_table *sg, + u32 tile_mode, u32 bo_flags, struct sg_table *sg, struct nouveau_bo **); int nouveau_bo_pin(struct nouveau_bo *, u32 flags); int nouveau_bo_unpin(struct nouveau_bo *); diff --git a/drm/nouveau_drm.c b/drm/nouveau_drm.c index 74d5ac6..3d84f3a 100644 --- a/drm/nouveau_drm.c +++ b/drm/nouveau_drm.c @@ -816,6 +816,7 @@ nouveau_ioctls[] = { DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_PREP, nouveau_gem_ioctl_cpu_prep, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_CPU_FINI, nouveau_gem_ioctl_cpu_fini, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_INFO, nouveau_gem_ioctl_info, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), + DRM_IOCTL_DEF_DRV(NOUVEAU_GEM_SET_INFO, nouveau_gem_ioctl_set_info, DRM_UNLOCKED|DRM_AUTH|DRM_RENDER_ALLOW), };
long diff --git a/drm/nouveau_gem.c b/drm/nouveau_gem.c index ee5782c..bb19507 100644 --- a/drm/nouveau_gem.c +++ b/drm/nouveau_gem.c @@ -149,7 +149,7 @@ nouveau_gem_object_close(struct drm_gem_object *gem, struct drm_file *file_priv)
int nouveau_gem_new(struct drm_device *dev, int size, int align, uint32_t domain, - uint32_t tile_mode, uint32_t tile_flags, + uint32_t tile_mode, uint32_t bo_flags, struct nouveau_bo **pnvbo) { struct nouveau_drm *drm = nouveau_drm(dev); @@ -165,7 +165,7 @@ nouveau_gem_new(struct drm_device *dev, int size, int align, uint32_t domain, flags |= TTM_PL_FLAG_SYSTEM;
ret = nouveau_bo_new(dev, size, align, flags, tile_mode, - tile_flags, NULL, pnvbo); + bo_flags, NULL, pnvbo); if (ret) return ret; nvbo = *pnvbo; @@ -216,7 +216,21 @@ nouveau_gem_info(struct drm_file *file_priv, struct drm_gem_object *gem, rep->size = nvbo->bo.mem.num_pages << PAGE_SHIFT; rep->map_handle = drm_vma_node_offset_addr(&nvbo->bo.vma_node); rep->tile_mode = nvbo->tile_mode; - rep->tile_flags = nvbo->tile_flags; + rep->bo_flags = nvbo->bo_flags; + return 0; +} + +static int +nouveau_gem_set_info(struct drm_file *file_priv, struct drm_gem_object *gem, + struct drm_nouveau_gem_info *req) +{ + struct nouveau_bo *nvbo = nouveau_gem_object(gem); + + if (req->bo_flags & NOUVEAU_GEM_EXPLICIT_SYNC) + nvbo->bo_flags |= NOUVEAU_GEM_EXPLICIT_SYNC; + else + nvbo->bo_flags &= ~NOUVEAU_GEM_EXPLICIT_SYNC; + return 0; }
@@ -231,14 +245,15 @@ nouveau_gem_ioctl_new(struct drm_device *dev, void *data, struct nouveau_bo *nvbo = NULL; int ret = 0;
- if (!pfb->memtype_valid(pfb, req->info.tile_flags)) { - NV_PRINTK(error, cli, "bad page flags: 0x%08x\n", req->info.tile_flags); + if (!pfb->memtype_valid(pfb, req->info.bo_flags)) { + NV_PRINTK(error, cli, "bad page flags: 0x%08x\n", + req->info.bo_flags); return -EINVAL; }
ret = nouveau_gem_new(dev, req->info.size, req->align, req->info.domain, req->info.tile_mode, - req->info.tile_flags, &nvbo); + req->info.bo_flags, &nvbo); if (ret) return ret;
@@ -303,12 +318,14 @@ validate_fini_no_ticket(struct validate_op *op, struct nouveau_fence *fence, { struct nouveau_bo *nvbo; struct drm_nouveau_gem_pushbuf_bo *b; + bool explicit_sync;
while (!list_empty(&op->list)) { nvbo = list_entry(op->list.next, struct nouveau_bo, entry); b = &pbbo[nvbo->pbbo_index]; + explicit_sync = !!(nvbo->bo_flags & NOUVEAU_GEM_EXPLICIT_SYNC);
- if (likely(fence)) + if (likely(fence) && !explicit_sync) nouveau_bo_fence(nvbo, fence, !!b->write_domains);
if (unlikely(nvbo->validate_mapped)) { @@ -947,3 +964,19 @@ nouveau_gem_ioctl_info(struct drm_device *dev, void *data, return ret; }
+int +nouveau_gem_ioctl_set_info(struct drm_device *dev, void *data, + struct drm_file *file_priv) +{ + struct drm_nouveau_gem_info *req = data; + struct drm_gem_object *gem; + int ret; + + gem = drm_gem_object_lookup(dev, file_priv, req->handle); + if (!gem) + return -ENOENT; + + ret = nouveau_gem_set_info(file_priv, gem, req); + drm_gem_object_unreference_unlocked(gem); + return ret; +} diff --git a/drm/nouveau_gem.h b/drm/nouveau_gem.h index 7454dea..abac606 100644 --- a/drm/nouveau_gem.h +++ b/drm/nouveau_gem.h @@ -7,7 +7,7 @@ #include "nouveau_bo.h"
#define nouveau_bo_tile_layout(nvbo) \ - ((nvbo)->tile_flags & NOUVEAU_GEM_TILE_LAYOUT_MASK) + ((nvbo)->bo_flags & NOUVEAU_GEM_TILE_LAYOUT_MASK)
static inline struct nouveau_bo * nouveau_gem_object(struct drm_gem_object *gem) @@ -18,7 +18,7 @@ nouveau_gem_object(struct drm_gem_object *gem) /* nouveau_gem.c */ extern int nouveau_gem_new(struct drm_device *, int size, int align, uint32_t domain, uint32_t tile_mode, - uint32_t tile_flags, struct nouveau_bo **); + uint32_t bo_flags, struct nouveau_bo **); extern void nouveau_gem_object_del(struct drm_gem_object *); extern int nouveau_gem_object_open(struct drm_gem_object *, struct drm_file *); extern void nouveau_gem_object_close(struct drm_gem_object *, @@ -35,6 +35,8 @@ extern int nouveau_gem_ioctl_cpu_fini(struct drm_device *, void *, struct drm_file *); extern int nouveau_gem_ioctl_info(struct drm_device *, void *, struct drm_file *); +extern int nouveau_gem_ioctl_set_info(struct drm_device *, void *, + struct drm_file *);
extern int nouveau_gem_prime_pin(struct drm_gem_object *); struct reservation_object *nouveau_gem_prime_res_obj(struct drm_gem_object *); diff --git a/drm/nouveau_ttm.c b/drm/nouveau_ttm.c index 8058013..a224820 100644 --- a/drm/nouveau_ttm.c +++ b/drm/nouveau_ttm.c @@ -81,12 +81,12 @@ nouveau_vram_manager_new(struct ttm_mem_type_manager *man, u32 size_nc = 0; int ret;
- if (nvbo->tile_flags & NOUVEAU_GEM_TILE_NONCONTIG) + if (nvbo->bo_flags & NOUVEAU_GEM_TILE_NONCONTIG) size_nc = 1 << nvbo->page_shift;
ret = pfb->ram->get(pfb, mem->num_pages << PAGE_SHIFT, mem->page_alignment << PAGE_SHIFT, size_nc, - (nvbo->tile_flags >> 8) & 0x3ff, &node); + (nvbo->bo_flags >> 8) & 0x3ff, &node); if (ret) { mem->mm_node = NULL; return (ret == -ENOSPC) ? 0 : ret; @@ -174,11 +174,11 @@ nouveau_gart_manager_new(struct ttm_mem_type_manager *man, switch (drm->device.info.family) { case NV_DEVICE_INFO_V0_TESLA: if (drm->device.info.chipset != 0x50) - node->memtype = (nvbo->tile_flags & 0x7f00) >> 8; + node->memtype = (nvbo->bo_flags & 0x7f00) >> 8; break; case NV_DEVICE_INFO_V0_FERMI: case NV_DEVICE_INFO_V0_KEPLER: - node->memtype = (nvbo->tile_flags & 0xff00) >> 8; + node->memtype = (nvbo->bo_flags & 0xff00) >> 8; break; default: break; diff --git a/drm/nv50_display.c b/drm/nv50_display.c index fdb3e1a..ce0df41 100644 --- a/drm/nv50_display.c +++ b/drm/nv50_display.c @@ -2352,7 +2352,7 @@ nv50_fb_ctor(struct drm_framebuffer *fb) u8 kind = nouveau_bo_tile_layout(nvbo) >> 8; u8 tile = nvbo->tile_mode;
- if (nvbo->tile_flags & NOUVEAU_GEM_TILE_NONCONTIG) { + if (nvbo->bo_flags & NOUVEAU_GEM_TILE_NONCONTIG) { NV_ERROR(drm, "framebuffer requires contiguous bo\n"); return -EINVAL; } diff --git a/drm/uapi/drm/nouveau_drm.h b/drm/uapi/drm/nouveau_drm.h index 394cd94..49fc749 100644 --- a/drm/uapi/drm/nouveau_drm.h +++ b/drm/uapi/drm/nouveau_drm.h @@ -46,6 +46,7 @@ #define NOUVEAU_GEM_TILE_32BPP 0x00000002 #define NOUVEAU_GEM_TILE_ZETA 0x00000004 #define NOUVEAU_GEM_TILE_NONCONTIG 0x00000008 +#define NOUVEAU_GEM_EXPLICIT_SYNC 0x00040000
struct drm_nouveau_gem_info { uint32_t handle; @@ -54,7 +55,7 @@ struct drm_nouveau_gem_info { uint64_t offset; uint64_t map_handle; uint32_t tile_mode; - uint32_t tile_flags; + uint32_t bo_flags; };
struct drm_nouveau_gem_new { @@ -149,6 +150,7 @@ struct drm_nouveau_gem_cpu_fini { #define DRM_NOUVEAU_GEM_CPU_FINI 0x43 #define DRM_NOUVEAU_GEM_INFO 0x44 #define DRM_NOUVEAU_GEM_PUSHBUF_2 0x45 +#define DRM_NOUVEAU_GEM_SET_INFO 0x46
#define DRM_IOCTL_NOUVEAU_GEM_NEW DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_NEW, struct drm_nouveau_gem_new) #define DRM_IOCTL_NOUVEAU_GEM_PUSHBUF DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_PUSHBUF, struct drm_nouveau_gem_pushbuf) @@ -156,5 +158,6 @@ struct drm_nouveau_gem_cpu_fini { #define DRM_IOCTL_NOUVEAU_GEM_CPU_PREP DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_PREP, struct drm_nouveau_gem_cpu_prep) #define DRM_IOCTL_NOUVEAU_GEM_CPU_FINI DRM_IOW (DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_CPU_FINI, struct drm_nouveau_gem_cpu_fini) #define DRM_IOCTL_NOUVEAU_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_INFO, struct drm_nouveau_gem_info) +#define DRM_IOCTL_NOUVEAU_GEM_SET_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_NOUVEAU_GEM_SET_INFO, struct drm_nouveau_gem_info)
#endif /* __NOUVEAU_DRM_H__ */
Hey,
On 26-09-14 12:00, Lauri Peltonen wrote:
Do not attach fences automatically to buffers that are marked for explicit synchronization.
Signed-off-by: Lauri Peltonen lpeltonen@nvidia.com
drm/nouveau_bo.c | 8 ++++---- drm/nouveau_bo.h | 4 ++-- drm/nouveau_drm.c | 1 + drm/nouveau_gem.c | 47 +++++++++++++++++++++++++++++++++++++++------- drm/nouveau_gem.h | 6 ++++-- drm/nouveau_ttm.c | 8 ++++---- drm/nv50_display.c | 2 +- drm/uapi/drm/nouveau_drm.h | 5 ++++- 8 files changed, 60 insertions(+), 21 deletions(-)
Considering this will break on eviction, is it really needed? If a fence is waited on before starting an operation, synchronization will be a no-op.
~Maarten
Allow user space to provide an explicit sync fence fd when exporting a dma-buf from a gem handle. The fence will be stored as the exclusive fence of the reservation object.
Signed-off-by: Lauri Peltonen <lpeltonen@nvidia.com>
---
 drivers/gpu/drm/drm_prime.c | 41 +++++++++++++++++++++++++++++++++--------
 include/uapi/drm/drm.h      |  9 ++++++++-
 2 files changed, 41 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index 2807a77..c69df2e 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -28,8 +28,10 @@
#include <linux/export.h> #include <linux/dma-buf.h> +#include <linux/reservation.h> #include <drm/drmP.h> #include "drm_internal.h" +#include "../../staging/android/sync.h"
/* * DMA-BUF/GEM Object references and lifetime overview: @@ -329,7 +331,7 @@ static const struct dma_buf_ops drm_gem_prime_dmabuf_ops = { * drm_gem_prime_export - helper library implemention of the export callback * @dev: drm_device to export from * @obj: GEM object to export - * @flags: flags like DRM_CLOEXEC + * @flags: flags like DRM_CLOEXEC or DRM_SYNC_FD * * This is the implementation of the gem_prime_export functions for GEM drivers * using the PRIME helpers. @@ -385,7 +387,7 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, * @dev: dev to export the buffer from * @file_priv: drm file-private structure * @handle: buffer handle to export - * @flags: flags like DRM_CLOEXEC + * @flags: flags like DRM_CLOEXEC or DRM_SYNC_FD * @prime_fd: pointer to storage for the fd id of the create dma-buf * * This is the PRIME export function which must be used mandatorily by GEM @@ -401,6 +403,24 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev, struct drm_gem_object *obj; int ret = 0; struct dma_buf *dmabuf; + struct fence *fence = NULL; + + if (flags & DRM_SYNC_FD) { +#ifdef CONFIG_SYNC + struct sync_fence *sf = sync_fence_fdget(*prime_fd); + if (!sf) + return -ENOENT; + if (sf->num_fences != 1) { + sync_fence_put(sf); + return -EINVAL; + } + fence = fence_get(sf->cbs[0].sync_pt); + sync_fence_put(sf); + flags &= ~DRM_SYNC_FD; +#else + return -ENODEV; +#endif + }
mutex_lock(&file_priv->prime.lock); obj = drm_gem_object_lookup(dev, file_priv, handle); @@ -453,6 +473,14 @@ out_have_obj: goto fail_put_dmabuf;
out_have_handle: + if (fence) { + if (!dmabuf->resv) { + ret = -ENODEV; + goto fail_put_dmabuf; + } + reservation_object_add_excl_fence(dmabuf->resv, fence); + } + ret = dma_buf_fd(dmabuf, flags); /* * We must _not_ remove the buffer from the handle cache since the newly @@ -475,6 +503,7 @@ out: drm_gem_object_unreference_unlocked(obj); out_unlock: mutex_unlock(&file_priv->prime.lock); + fence_put(fence);
return ret; } @@ -624,7 +653,6 @@ int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv) { struct drm_prime_handle *args = data; - uint32_t flags;
if (!drm_core_check_feature(dev, DRIVER_PRIME)) return -EINVAL; @@ -633,14 +661,11 @@ int drm_prime_handle_to_fd_ioctl(struct drm_device *dev, void *data, return -ENOSYS;
/* check flags are valid */ - if (args->flags & ~DRM_CLOEXEC) + if (args->flags & ~(DRM_CLOEXEC | DRM_SYNC_FD)) return -EINVAL;
- /* we only want to pass DRM_CLOEXEC which is == O_CLOEXEC */ - flags = args->flags & DRM_CLOEXEC; - return dev->driver->prime_handle_to_fd(dev, file_priv, - args->handle, flags, &args->fd); + args->handle, args->flags, &args->fd); }
int drm_prime_fd_to_handle_ioctl(struct drm_device *dev, void *data, diff --git a/include/uapi/drm/drm.h b/include/uapi/drm/drm.h index b0b8556..a11b893 100644 --- a/include/uapi/drm/drm.h +++ b/include/uapi/drm/drm.h @@ -661,13 +661,20 @@ struct drm_set_client_cap { };
#define DRM_CLOEXEC O_CLOEXEC +#define DRM_SYNC_FD O_DSYNC struct drm_prime_handle { __u32 handle;
/** Flags.. only applicable for handle->fd */ __u32 flags;
- /** Returned dmabuf file descriptor */ + /** + * DRM_IOCTL_PRIME_FD_TO_HANDLE: + * in: dma-buf fd + * DRM_IOCTL_PRIME_HANDLE_TO_FD: + * in: sync fence fd if DRM_SYNC_FD flag is passed + * out: dma-buf fd + */ __s32 fd; };
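For illustration, user space would drive this roughly as follows. This is a minimal sketch, assuming the DRM_SYNC_FD flag from this patch, a valid GEM handle, and a sync fence fd obtained from an earlier explicit-sync submission; it is not part of the patch itself.

#include <stdint.h>
#include <sys/ioctl.h>
#include <xf86drm.h>   /* drmIoctl() from libdrm */
#include <drm/drm.h>   /* struct drm_prime_handle, DRM_IOCTL_PRIME_HANDLE_TO_FD */

/* Export a GEM handle as a dma-buf and attach an explicit fence while doing
 * so.  'sync_fd' is a sync fence fd from a previous submission.  Returns the
 * dma-buf fd, or -1 on error. */
static int export_with_fence(int drm_fd, uint32_t gem_handle, int sync_fd)
{
	struct drm_prime_handle args = {
		.handle = gem_handle,
		.flags  = DRM_CLOEXEC | DRM_SYNC_FD,	/* DRM_SYNC_FD is added by this patch */
		.fd     = sync_fd,			/* in: fence fd, out: dma-buf fd */
	};

	if (drmIoctl(drm_fd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &args))
		return -1;

	return args.fd;	/* the fence is now the dma-buf's exclusive fence */
}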
On 26-09-14 12:00, Lauri Peltonen wrote:
Allow user space to provide an explicit sync fence fd when exporting a dma-buf from a GEM handle. The fence will be stored as the exclusive fence of the reservation object.
Signed-off-by: Lauri Peltonen lpeltonen@nvidia.com
[...]
 out_have_handle:
+	if (fence) {
+		if (!dmabuf->resv) {
+			ret = -ENODEV;
+			goto fail_put_dmabuf;
+		}
This is never true. A default resv gets allocated, see dma_buf's create function.
~Maarten
On Fri, Sep 26, 2014 at 01:00:12PM +0300, Lauri Peltonen wrote:
Allow user space to provide an explicit sync fence fd when exporting a dma-buf from a GEM handle. The fence will be stored as the exclusive fence of the reservation object.
Signed-off-by: Lauri Peltonen lpeltonen@nvidia.com
All existing userspace treats dma_bufs as long-lived objects. Well, all the userspace that expects implicit syncing; afaik Android shovels lots of dma-bufs around all the time (since ion is using them as its native buffer handles).
So adding an exclusive fence once at export time isn't going to be terribly useful.
So I think a better approach would be to add this as ioctls to the dma-buf fd itself. Then you can also add a "give me the fence(s) for this dma_buf" ioctl, which is useful for interop the other way round (i.e. implicit -> explicit). -Daniel
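To make that direction a bit more concrete, here is a purely hypothetical sketch of what per-dma-buf fence ioctls could look like. Nothing like this exists at this point; the names, ioctl numbers and struct layout are invented for illustration only.

/* Hypothetical additions to the dma-buf uapi -- illustration only. */

#include <linux/ioctl.h>
#include <linux/types.h>

struct dma_buf_fence {
	__s32 fd;	/* sync fence fd: in for ATTACH, out for GET */
	__u32 flags;	/* DMA_BUF_FENCE_EXCL or DMA_BUF_FENCE_SHARED */
};

#define DMA_BUF_FENCE_SHARED	0
#define DMA_BUF_FENCE_EXCL	(1 << 0)

/* Attach an explicit fence to the dma-buf's reservation object. */
#define DMA_BUF_IOC_ATTACH_FENCE	_IOW('b', 0x20, struct dma_buf_fence)
/* Return the currently attached fence(s) as a (merged) sync fence fd. */
#define DMA_BUF_IOC_GET_FENCE		_IOWR('b', 0x21, struct dma_buf_fence)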
On Mon, Sep 29, 2014 at 09:46:48AM +0200, Daniel Vetter wrote:
On Fri, Sep 26, 2014 at 01:00:12PM +0300, Lauri Peltonen wrote:
Allow user space to provide an explicit sync fence fd when exporting a dma-buf from a GEM handle. The fence will be stored as the exclusive fence of the reservation object.
Signed-off-by: Lauri Peltonen lpeltonen@nvidia.com
All existing userspace treats dma_bufs as long-lived objects. Well, all the userspace that expects implicit syncing; afaik Android shovels lots of dma-bufs around all the time (since ion is using them as its native buffer handles).
So adding an exclusive fence once at export time isn't going to be terribly useful.
So I think a better approach would be to add this as ioctls to the dma-buf fd itself. Then you can also add a "give me the fence(s) for this dma_buf" ioctl, which is useful for interop the other way round (i.e. implicit -> explicit).
Yes, I like this. I thought that one could support long-lived dma-bufs by doing a "re-export" when a new fence needs to be attached, but your model is indeed much nicer!
On Sat, Sep 27, 2014 at 08:49:39AM +0200, Maarten Lankhorst wrote:
This is never true. A default resv gets allocated, see dma_buf's create function.
Ah, ok. I'll keep that in mind when writing new versions. :-)
Thanks, Lauri
On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
[...]
Implicit synchronization is already implemented in open source drivers, and works well for most use cases. I don't seek to change any of that. My proposal aims at allowing some drm drivers to operate in explicit sync mode to get maximal performance, while still remaining fully compatible with the implicit paradigm.
Yeah, pretty much what we have in mind on the i915 side too. I didn't look too closely at your patches, so just a few high level comments on your rfc here.
[...]
Because of the above arguments, I think it makes sense to let some user space drm drivers opt out from implicit synchronization, while allowing them to still remain fully compatible with the rest of the drm world that uses implicit synchronization. In practice, this would require three things:
(1) Support passing fences (that are not tied to buffer objects) between kernel and user space.
(2) Stop automatically storing fences to the buffers that user space wants to synchronize explicitly.
The problem with this approach is that you then need hw faulting to make sure the memory is there. Implicit fences aren't just used for syncing, but also to make sure that the gpu still has access to the buffer as long as it needs it. So you need at least a non-exclusive fence attached for each command submission.
Of course on Android you don't have swap (would kill the puny mmc within seconds) and you don't care for letting userspace pin most of memory for gfx. So you'll get away with no fences at all. But for upstream I don't see a good solution unfortunately. Ideas very much welcome.
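For reference, the residency guarantee described here is what implicit-sync drivers get today by attaching at least a shared (non-exclusive) fence to every buffer's reservation object at submit time. A rough sketch of that pattern against the 3.17-era reservation API, with the ww-mutex acquire-context loop and error handling omitted:

#include <linux/fence.h>
#include <linux/reservation.h>
#include <linux/ww_mutex.h>

/* Attach a shared (read) fence to one buffer's reservation object so the
 * kernel knows the GPU may still be using its backing storage.  A real
 * submission path locks all buffers with a ww_acquire_ctx and handles
 * -EDEADLK backoff; that is left out here for brevity. */
static int attach_submit_fence(struct reservation_object *resv,
			       struct fence *job_fence)
{
	int ret;

	ww_mutex_lock(&resv->lock, NULL);

	ret = reservation_object_reserve_shared(resv);	/* make room for one more slot */
	if (!ret)
		reservation_object_add_shared_fence(resv, job_fence);

	ww_mutex_unlock(&resv->lock);
	return ret;
}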
(3) Allow user space to attach an explicit fence to dma-buf when exporting to another driver that uses implicit sync.
There are still some open issues beyond these. For example, can we skip acquiring the ww mutex for explicitly synchronized buffers? I think we could eventually, at least on unified memory systems where we don't need to migrate between heaps (our downstream Tegra GPU driver does not lock any buffers at submit, it just grabs refcounts for hw). Another quirk is that now Nouveau waits on the buffer fences when closing the gem object to ensure that it doesn't unmap too early. We need to rework that for explicit sync, but that shouldn't be difficult.
See above, but you can't avoid attaching fences as long as we still use a buffer-object based gfx memory management model. At least afaics. Which means you need the ordering guarantees imposed by ww mutexes to ensure that the oddball implicit ordered client can't deadlock the kernel's memory management code.
I have written a prototype that demonstrates (1) by adding explicit sync fd support to Nouveau. It's not a lot of code, because I only use a relatively small subset of the android sync driver functionality. Thanks to Maarten's rewrite, all I need to do is to allow creating a sync_fence from a drm fence in order to pass it to user space. I don't need to use sync_pt or sync_timeline, or fill in sync_timeline_ops.
I can see why the upstream has been reluctant to de-stage the android sync driver in its current form, since (even though it now builds on struct fence) it still duplicates some of the drm fence concepts. I'd like to think that my patches only use the parts of the android sync driver that genuinely are missing from the drm fence model: allowing user space to operate on fence objects that are independent of buffer objects.
Imo de-staging the android syncpt stuff needs to happen first, before drivers can use it. Since non-staging stuff really shouldn't depend upon code from staging.
The last two patches are mocks that show how (2) and (3) might work out. I haven't done any testing with them yet. Before going any further, I'd like to get your feedback. Can you see the benefits of explicit sync as an alternative synchronization model? Do you think we could use the android sync_fence for passing fences between user space? Or did you have something else in mind for explicit sync in the drm world?
I'm all for adding explicit syncing. Our plans are roughly:
- Add both an in and an out fence to execbuf to sync with other rendering and give userspace a fence back. Needs two different flags probably. (A rough sketch of what this could look like follows after this list.)
- Maybe add an ioctl to dma-bufs to get at the current implicit fences attached to them (both an exclusive and non-exclusive version). This should help with making explicit and implicit sync work together nicely.
- Add fence support to kms. Probably only worth it together with the new atomic stuff. Again we need an in fence to wait for (one for each buffer) and an out fence. The latter can easily be implemented by extending struct drm_event, which means not a single driver code line needs to be changed for this.
- For de-staging android syncpts we need to de-clutter the internal interfaces and also review all the ioctls exposed. Like you say it should be just the userspace interface for struct drm_fence. Also, it needs testcases and preferably manpages.
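As a strawman for the first item above, an execbuf extension might look roughly like the following. This is invented for illustration only; the flag names, fields and semantics are not an existing uapi.

/* Hypothetical execbuf in/out fence extension -- illustration only. */

#include <linux/types.h>

#define MY_EXEC_FENCE_IN	(1 << 0)  /* wait for fence_fd before running the job */
#define MY_EXEC_FENCE_OUT	(1 << 1)  /* return a fence fd that signals on completion */

struct my_execbuffer {
	__u64 commands_ptr;	/* user pointer to the command stream */
	__u32 commands_len;
	__u32 flags;		/* MY_EXEC_FENCE_IN and/or MY_EXEC_FENCE_OUT */
	__s32 fence_fd;		/* in: fence to wait for; out: fence for this job */
	__u32 pad;
};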
Unfortunately it looks like Intel won't do this all for you due to a bunch of hilarious internal reasons :( At least not anytime soon.
Cheers, Daniel
On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
[...]
(2) Stop automatically storing fences to the buffers that user space wants to synchronize explicitly.
The problem with this approach is that you then need hw faulting to make sure the memory is there. Implicit fences aren't just used for syncing, but also to make sure that the gpu still has access to the buffer as long as it needs it. So you need at least a non-exclusive fence attached for each command submission.
Of course on Android you don't have swap (would kill the puny mmc within seconds) and you don't care for letting userspace pin most of memory for gfx. So you'll get away with no fences at all. But for upstream I don't see a good solution unfortunately. Ideas very much welcome.
Well, I am gonna repeat myself. But yes, you can do explicit sync without associating a fence (at least no struct alloc): just associate a unique per-command-stream number and have it be global, i.e. irrespective of the different execution pipelines your hw has.
For non-scheduling GPUs, roughly today's generation, you keep buffers on an LRU and you know you cannot evict to swap those that have an active id (the hw did not yet write the sequence number back). You update the LRU with each command stream ioctl.
For binding to the GPU GART you can do that as a preamble to the command stream, which most hardware (AMD, Intel, NVIDIA) should be able to do.
For VRAM you have several choices that depend on how you want to manage VRAM. For instance you might want to use it more like a cache and have each command stream preamble with a bunch of copies to VRAM and possibly a bunch of post copies back to RAM. Or you can hold on to the current scheme, but buffer moves now become a preamble to the command stream (i.e. buffer moves are scheduled as a preamble to the command stream).
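A minimal sketch of the bookkeeping described above, assuming a single global submission counter; the structure and helper names are invented for illustration and do not come from any existing driver.

#include <linux/list.h>
#include <linux/types.h>

/* One global, monotonically increasing command stream id. */
static u64 last_submitted_seq;	/* bumped by every submit ioctl */
static u64 last_completed_seq;	/* sequence number the hw has written back */

struct my_bo {
	struct list_head lru;	/* LRU of potentially busy buffers */
	u64 last_used_seq;	/* id of the last submission referencing this bo */
};

/* Called from the submit ioctl for every referenced buffer: no fence struct
 * is allocated, the buffer is just tagged and moved to the LRU tail. */
static void my_bo_mark_busy(struct my_bo *bo, struct list_head *lru_list)
{
	bo->last_used_seq = last_submitted_seq;
	list_move_tail(&bo->lru, lru_list);
}

/* Eviction check: the buffer may be swapped out only once the hw has written
 * back a sequence number at least as new as the buffer's last use. */
static bool my_bo_is_idle(const struct my_bo *bo)
{
	return last_completed_seq >= bo->last_used_seq;
}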
So I do not see what you would consider rocket science about this?
Cheers, Jérôme
On Mon, Sep 29, 2014 at 11:42:19AM -0400, Jerome Glisse wrote:
[...]
So i do not see what you would consider rocket science about this ?
You just implement fences with a seqno value. Indeed not rocket science ;-)
Cheers, Daniel
On 9/29/14 8:42 AM, Jerome Glisse wrote:
[...]
Well i am gonna repeat myself. But yes you can do explicit without associating fence (at least no struct alloc) just associate a unique per command stream number and have it be global ie irrespective of different execution pipeline your hw have.
For non scheduling GPU, today generation roughly, you keep buffer on lru and you know you can not evict buffer to swap for those that have an active id (hw did not yet write the sequence number back). You update the lru with each command stream ioctl.
For binding to GPU GART you can do that as a preamble to the command stream which most hardware (AMD, Intel, NVidia) should be able to do.
For VRAM you have several choice that depends on how you want to manage VRAM. For instance you might want to use it more like a cache and have each command stream preamble with a bunch of copy to VRAM and posibly bunch of post copy back to RAM. Or you can hold to current scheme but buffer move now becomes preamble to command stream (ie buffer move are scheduled as a preamble to command stream).
Additionally, I think the goal is to move to a model where some higher-level object such as a working set, rather than individual buffers, is assigned counters or sync primitives on a per-submission basis. Versioning off tags for individual buffers then moves to working set modification time. This is more feasible if the only thing that needs precise fencing of individual surfaces is lifetime management.
The trend seems to be towards establishing a relatively large working set up front and then submitting many command buffers against it, perhaps with incremental modifications to the working set along the way. This may be what's referred to as the Android model above, but I view it as the "non-glitchy graphic" model going forward.
Thanks, -James
On Mon, Sep 29, 2014 at 10:20:44AM -0700, James Jones wrote:
Additionally, I think the goal is to move to a model where some higher-level object such as a working set, rather than individual buffers, are assigned counters or sync primitives on a per-submission basis. Versioning off tags for individual buffers then moves to working set modification time. This is more feasible if the only thing that needs precise fencing of individual surfaces is lifetime management.
Yeah, there's always ways to make the fence assignment and tracking a bit more efficient, we're playing around with working-set tricks for i915 ourselves. But fundamentally you still have fences for each buffer object (just can't directly access them). And for buffers exported with dma_buf you still need the direct link I think, at least when you care about implicit syncing somewhat.
The trend seems to be towards establishing a relatively large working set up front and then submitting many command buffers against it, perhaps with incremental modifications to the working set along the way. This may be what's referred to as the Android model above, but I view it as the "non-glitchy graphic" model going forward.
Nah, that's something different. Afaik Android drivers don't really bother a lot with swap and page migration and having a working shrinker for gpu objects. At least our android guys need to disable the lowmemory killer since that thing just goes bananas if we drive i915 memory usage against the wall and into swap.
I'm not really sure what you mean by "non-glitchy graphics", for me this can mean anything from avoiding stalls to proper syncing with vblanks to anything else really ... So might be good to elaborate here. -Daniel
Thanks Daniel for your input!
On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
(2) Stop automatically storing fences to the buffers that user space wants to synchronize explicitly.
The problem with this approach is that you then need hw faulting to make sure the memory is there. Implicit fences aren't just used for syncing, but also to make sure that the gpu still has access to the buffer as long as it needs it. So you need at least a non-exclusive fence attached for each command submission.
Of course on Android you don't have swap (would kill the puny mmc within seconds) and you don't care for letting userspace pin most of memory for gfx. So you'll get away with no fences at all. But for upstream I don't see a good solution unfortunately. Ideas very much welcome.
(3) Allow user space to attach an explicit fence to dma-buf when exporting to another driver that uses implicit sync.
There are still some open issues beyond these. For example, can we skip acquiring the ww mutex for explicitly synchronized buffers? I think we could eventually, at least on unified memory systems where we don't need to migrate between heaps (our downstream Tegra GPU driver does not lock any buffers at submit, it just grabs refcounts for hw). Another quirk is that now Nouveau waits on the buffer fences when closing the gem object to ensure that it doesn't unmap too early. We need to rework that for explicit sync, but that shouldn't be difficult.
See above, but you can't avoid attaching fences as long as we still use a buffer-object based gfx memory management model. At least afaics. Which means you need the ordering guarantees imposed by ww mutexes to ensure that the oddball implicit ordered client can't deadlock the kernel's memory management code.
Implicit fences attached to individual buffers are one way for residency management. Do you think a working set based model could work in the DRM framework? For example, something like this:
- Allow user space to create "working set objects" and associate buffers with them. If the user space doesn't want to manage working sets explicitly, it could also use an implicit default working set that contains all buffers that are mapped to the channel vm (on Android we could always use the default working set since we don't need to manage residency). The working sets are initially marked as dirty.
- User space tells which working sets are referenced by each work submission. Kernel locks these working sets, pins all buffers in dirty working sets, and resets the dirty bits. After kicking off work, kernel stores the fence to the _working sets_, and then releases the locks (if an implicit default working set is used, then this would be roughly equivalent to storing a fence to channel vm that tells "this is the last hw operation that might have touched buffers in this address space").
- If swapping doesn't happen, then we just need to check the working set dirty bits at each submit.
- When a buffer is swapped out, all working sets that refer to it need to be marked as dirty.
- When a buffer is swapped out or unmapped, we need to wait for the fences from all working sets that refer to the buffer.
Initially one might think of working sets as a mere optimization - we now need to process a few working sets at every submit instead of many individual buffers. However, it makes a huge difference because of fences: fences that are attached to buffers are used for implicitly synchronizing work across different channels and engines. They are in the performance critical path, and we want to carefully manage them (that's the idea of explicit synchronization). The working set fences, on the other hand, would only be used to guarantee that we don't swap out or unmap something that the GPU might be accessing. We never need to wait for those fences (except when swapping or unmapping), so we can be conservative without hurting performance.
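To illustrate what the working-set model above implies on the kernel side, here is a rough sketch with invented structure and helper names; it is a thought experiment rather than a worked-out design.

#include <linux/fence.h>
#include <linux/list.h>
#include <linux/mutex.h>

/* Hypothetical "working set object": a group of buffers that is pinned and
 * fenced as a unit instead of per buffer. */
struct working_set {
	struct mutex lock;
	struct list_head buffers;	/* all bos belonging to this set */
	bool dirty;			/* membership or residency changed */
	struct fence *last_fence;	/* last submission that used this set */
};

/* Called once per submission for each referenced working set. */
static void working_set_prepare(struct working_set *ws, struct fence *job_fence)
{
	mutex_lock(&ws->lock);

	if (ws->dirty) {
		/* Pin every buffer in the set; driver specific, omitted here. */
		ws->dirty = false;
	}

	/* Remember the last job that may touch this set.  Only the swap-out
	 * and unmap paths ever need to wait on this fence. */
	fence_put(ws->last_fence);
	ws->last_fence = fence_get(job_fence);

	mutex_unlock(&ws->lock);
}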
Imo de-staging the android syncpt stuff needs to happen first, before drivers can use it. Since non-staging stuff really shouldn't depend upon code from staging.
Fully agree. I thought the best way towards that would be to show some driver code that _would_ use it. :)
I'm all for adding explicit syncing. Our plans are roughly:
Add both an in and an out fence to execbuf to sync with other rendering and give userspace a fence back. Needs two different flags probably.
Maybe add an ioctl to dma-bufs to get at the current implicit fences attached to them (both an exclusive and non-exclusive version). This should help with making explicit and implicit sync work together nicely.
Add fence support to kms. Probably only worth it together with the new atomic stuff. Again we need an in fence to wait for (one for each buffer) and an out fence. The latter can easily be implemented by extending struct drm_event, which means not a single driver code line needs to be changed for this.
For de-staging android syncpts we need to de-clutter the internal interfaces and also review all the ioctls exposed. Like you say it should be just the userspace interface for struct drm_fence. Also, it needs testcases and preferably manpages.
This all sounds very similar to what we'd like to do! Maybe we can move forward with these parts, and continue to attach fences at submit until we have a satisfactory solution for the pinning problem?
I'd like to understand what are the concrete steps to de-stage struct sync_fence, since that's the first thing that needs to be done. For example, what do you mean by "de-cluttering the internal interfaces"? Just that we'd move the sync_fence parts from drivers/staging/android/sync.c to, say, drivers/dma-buf/sync-fence.c ? Would we still leave a copy of the existing full driver to staging/android?
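For what it's worth, the user-visible surface of a sync fence fd is already quite small: the fd can be waited on with plain poll(2), and the staging driver additionally exposes SYNC_IOC_WAIT/SYNC_IOC_MERGE/SYNC_IOC_FENCE_INFO ioctls on the same fd. A minimal userspace wait, as a sketch:

#include <errno.h>
#include <poll.h>

/* Wait for a sync fence fd to signal, with a timeout in milliseconds.
 * Returns 0 when signalled, -ETIME on timeout, -errno on other errors. */
static int sync_fd_wait(int fence_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -errno;
	if (ret == 0)
		return -ETIME;
	return 0;	/* POLLIN: every fence behind this fd has signalled */
}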
Thanks, Lauri
Hey,
On 01-10-14 17:14, Lauri Peltonen wrote:
Thanks Daniel for your input!
On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
(2) Stop automatically storing fences to the buffers that user space wants to synchronize explicitly.
The problem with this approach is that you then need hw faulting to make sure the memory is there. Implicit fences aren't just used for syncing, but also to make sure that the gpu still has access to the buffer as long as it needs it. So you need at least a non-exclusive fence attached for each command submission.
Of course on Android you don't have swap (would kill the puny mmc within seconds) and you don't care for letting userspace pin most of memory for gfx. So you'll get away with no fences at all. But for upstream I don't see a good solution unfortunately. Ideas very much welcome.
(3) Allow user space to attach an explicit fence to dma-buf when exporting to another driver that uses implicit sync.
There are still some open issues beyond these. For example, can we skip acquiring the ww mutex for explicitly synchronized buffers? I think we could eventually, at least on unified memory systems where we don't need to migrate between heaps (our downstream Tegra GPU driver does not lock any buffers at submit, it just grabs refcounts for hw). Another quirk is that now Nouveau waits on the buffer fences when closing the gem object to ensure that it doesn't unmap too early. We need to rework that for explicit sync, but that shouldn't be difficult.
See above, but you can't avoid to attach fences as long as we still use a buffer-object based gfx memory management model. At least afaics. Which means you need the ordering guarantees imposed by ww mutexes to ensure that the oddball implicit ordered client can't deadlock the kernel's memory management code.
Implicit fences attached to individual buffers are one way for residency management. Do you think a working set based model could work in the DRM framework? For example, something like this:
- Allow user space to create "working set objects" and associate buffers with them. If the user space doesn't want to manage working sets explicitly, it could also use an implicit default working set that contains all buffers that are mapped to the channel vm (on Android we could always use the default working set since we don't need to manage residency). The working sets are initially marked as dirty.
- User space tells which working sets are referenced by each work submission. Kernel locks these working sets, pins all buffers in dirty working sets, and resets the dirty bits. After kicking off work, kernel stores the fence to the _working sets_, and then releases the locks (if an implicit default working set is used, then this would be roughly equivalent to storing a fence to channel vm that tells "this is the last hw operation that might have touched buffers in this address space").
- If swapping doesn't happen, then we just need to check the working set dirty bits at each submit.
- When a buffer is swapped out, all working sets that refer to it need to be marked as dirty.
- When a buffer is swapped out or unmapped, we need to wait for the fences from all working sets that refer to the buffer.
Initially one might think of working sets as a mere optimization - we now need to process a few working sets at every submit instead of many individual buffers. However, it makes a huge difference because of fences: fences that are attached to buffers are used for implicitly synchronizing work across different channels and engines. They are in the performance critical path, and we want to carefully manage them (that's the idea of explicit synchronization). The working set fences, on the other hand, would only be used to guarantee that we don't swap out or unmap something that the GPU might be accessing. We never need to wait for those fences (except when swapping or unmapping), so we can be conservative without hurting performance.
Imo de-staging the android syncpt stuff needs to happen first, before drivers can use it. Since non-staging stuff really shouldn't depend upon code from staging.
Fully agree. I thought the best way towards that would be to show some driver code that _would_ use it. :)
I'm all for adding explicit syncing. Our plans are roughly:
- Add both an in and an out fence to execbuf to sync with other rendering and give userspace a fence back. Needs two different flags probably.
- Maybe add an ioctl to dma-bufs to get at the current implicit fences attached to them (both an exclusive and non-exclusive version). This should help with making explicit and implicit sync work together nicely.
- Add fence support to kms. Probably only worth it together with the new atomic stuff. Again we need an in fence to wait for (one for each buffer) and an out fence. The latter can easily be implemented by extending struct drm_event, which means not a single driver code line needs to be changed for this.
- For de-staging android syncpts we need to de-clutter the internal interfaces and also review all the ioctls exposed. Like you say it should be just the userspace interface for struct drm_fence. Also, it needs testcases and preferably manpages.
This all sounds very similar to what we'd like to do! Maybe we can move forward with these parts, and continue to attach fences at submit until we have a satisfactory solution for the pinning problem?
You could neuter implicit fences by always attaching the fences as shared when explicit syncing is used. This would work correctly with eviction, and wouldn't cause any unneeded syncing. :)
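In today's terms that would look roughly like the sketch below - skip waiting on the bo's existing fences, but still publish the new fence as shared so eviction and unmap remain correct (a sketch only, assuming a reservation_object-backed bo and a struct fence from the submission):

    #include <linux/fence.h>
    #include <linux/reservation.h>

    /* Sketch: submit path for an explicitly synced bo. */
    static void attach_explicit_submit_fence(struct reservation_object *resv,
                                             struct fence *submit_fence)
    {
            ww_mutex_lock(&resv->lock, NULL);
            if (!reservation_object_reserve_shared(resv))
                    reservation_object_add_shared_fence(resv, submit_fence);
            ww_mutex_unlock(&resv->lock);
    }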
~Maarten
On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
Thanks Daniel for your input!
On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:
(2) Stop automatically storing fences to the buffers that user space wants to synchronize explicitly.
The problem with this approach is that you then need hw faulting to make sure the memory is there. Implicit fences aren't just used for syncing, but also to make sure that the gpu still has access to the buffer as long as it needs it. So you need at least a non-exclusive fence attached for each command submission.
Of course on Android you don't have swap (would kill the puny mmc within seconds) and you don't care for letting userspace pin most of memory for gfx. So you'll get away with no fences at all. But for upstream I don't see a good solution unfortunately. Ideas very much welcome.
(3) Allow user space to attach an explicit fence to dma-buf when exporting to another driver that uses implicit sync.
There are still some open issues beyond these. For example, can we skip acquiring the ww mutex for explicitly synchronized buffers? I think we could eventually, at least on unified memory systems where we don't need to migrate between heaps (our downstream Tegra GPU driver does not lock any buffers at submit, it just grabs refcounts for hw). Another quirk is that now Nouveau waits on the buffer fences when closing the gem object to ensure that it doesn't unmap too early. We need to rework that for explicit sync, but that shouldn't be difficult.
See above, but you can't avoid to attach fences as long as we still use a buffer-object based gfx memory management model. At least afaics. Which means you need the ordering guarantees imposed by ww mutexes to ensure that the oddball implicit ordered client can't deadlock the kernel's memory management code.
Implicit fences attached to individual buffers are one way for residency management. Do you think a working set based model could work in the DRM framework? For example, something like this:
- Allow user space to create "working set objects" and associate buffers with them. If the user space doesn't want to manage working sets explicitly, it could also use an implicit default working set that contains all buffers that are mapped to the channel vm (on Android we could always use the default working set since we don't need to manage residency). The working sets are initially marked as dirty.
- User space tells which working sets are referenced by each work submission. Kernel locks these working sets, pins all buffers in dirty working sets, and resets the dirty bits. After kicking off work, kernel stores the fence to the _working sets_, and then releases the locks (if an implicit default working set is used, then this would be roughly equivalent to storing a fence to channel vm that tells "this is the last hw operation that might have touched buffers in this address space").
- If swapping doesn't happen, then we just need to check the working set dirty bits at each submit.
- When a buffer is swapped out, all working sets that refer to it need to be marked as dirty.
- When a buffer is swapped out or unmapped, we need to wait for the fences from all working sets that refer to the buffer.
Initially one might think of working sets as a mere optimization - we now need to process a few working sets at every submit instead of many individual buffers. However, it makes a huge difference because of fences: fences that are attached to buffers are used for implicitly synchronizing work across different channels and engines. They are in the performance critical path, and we want to carefully manage them (that's the idea of explicit synchronization). The working set fences, on the other hand, would only be used to guarantee that we don't swap out or unmap something that the GPU might be accessing. We never need to wait for those fences (except when swapping or unmapping), so we can be conservative without hurting performance.
Yeah, within the driver (i.e. for private objects which are never exported to dma_buf) we have recently become able to do stuff like this. And your above idea is roughly one of the things we're tossing around for i915.
But the cool stuff with drm is that cmd submission is driver-specific, so you can just go wild with nouveau. Of course you have to convince the nouveau guys (and also have open-source users for the new interface).
For shared buffers I think we should stick with the implicit fences for a while simply because I'm not sure whether it's really worth the fuss. And reworking all the drivers and dma-buf for some working sets is a lot of fuss ;-) Like Maarten said you can mostly short-circuit the implicit fencing by only attaching shared fences.
In case you're curious: The idea is to have a 1:1 association between ppgtt address spaces and what you call the working set above, to implement the buffer svm model in ocl2. Mostly because we expect that applications won't get the more fine-grained buffer list right anyway. And this kind of gang-scheduling of working set sizes should be more efficient for the usual case where everything fits.
Imo de-staging the android syncpt stuff needs to happen first, before drivers can use it. Since non-staging stuff really shouldn't depend upon code from staging.
Fully agree. I thought the best way towards that would be to show some driver code that _would_ use it. :)
Oh, there's the usual chicken&egg where we need a full-blown prototype before we can start merging. Interface work on upstream is super-hard, but given the ridiculous backwards-compat guarantees Linus expects us to keep up, it's totally justified. Mistakes are really expensive. So I'm happy to see you charge ahead here.
I'm all for adding explicit syncing. Our plans are roughly:
- Add both an in and an out fence to execbuf to sync with other rendering and give userspace a fence back. Needs two different flags probably.
- Maybe add an ioctl to dma-bufs to get at the current implicit fences attached to them (both an exclusive and non-exclusive version). This should help with making explicit and implicit sync work together nicely.
- Add fence support to kms. Probably only worth it together with the new atomic stuff. Again we need an in fence to wait for (one for each buffer) and an out fence. The latter can easily be implemented by extending struct drm_event, which means not a single driver code line needs to be changed for this.
- For de-staging android syncpts we need to de-clutter the internal interfaces and also review all the ioctls exposed. Like you say it should be just the userspace interface for struct drm_fence. Also, it needs testcases and preferably manpages.
This all sounds very similar to what we'd like to do! Maybe we can move forward with these parts, and continue to attach fences at submit until we have a satisfactory solution for the pinning problem?
Yeah, that's our plan for i915 too. First add explicit fences, then figure out whether we need to be better at neutering the implicit fences, but only if and where it really gets in the way.
I'd like to understand what the concrete steps are to de-stage struct sync_fence, since that's the first thing that needs to be done. For example, what do you mean by "de-cluttering the internal interfaces"? Just that we'd move the sync_fence parts from drivers/staging/android/sync.c to, say, drivers/dma-buf/sync-fence.c? Would we still leave a copy of the existing full driver in staging/android?
Yeah I guess that would be an approach. Personally I think we should also have basic ioctl testcases for all the ioctls exposed by syncpt fds. And reviewing the kerneldoc for the driver-internal interfaces (which includes removing everything that's now made obsolete by struct fence). Bonus points for documenting the ioctls. We could throw the test binary into libdrm maybe, there's a bunch of others like it already there.
I'm not sure whether/how much google has already for this.
Aside: Will you be at XDC or linux plumbers? Either would be a perfect place to discuss plans and ideas - I'll attend both.
Cheers, Daniel
+Rom who seems to be presenting about mainlining android sync at linux plumbers
On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
You could neuter implicit fences by always attaching the fences as shared when explicit syncing is used. This would work correctly with eviction, and wouldn't cause any unneeded syncing. :)
Yes, that will probably work! So, just to reiterate that I understood you and Daniel correctly:
- de-stage sync_fence and its user space API (the tedious part)
- add dma-buf ioctls for extracting and attaching explicit fences
- Nouveau specific: allow flagging gem buffers for explicit sync
- do not check pre-fences from explicitly synced buffers at submit
- continue to attach a shared fence after submit so that pinning and unmapping continue to work
Then working sets and getting rid of locking all buffers individually can be dealt with later as an optimization.
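For the Nouveau-specific flagging, the smallest possible form would be something like the sketch below, checked in the pushbuf validation loop. The flag and the helper are invented names for illustration, and the shared-fence attach afterwards stays exactly as it is today:

    /* Hypothetical sketch - flag and helper names invented. */
    #define NOUVEAU_GEM_EXPLICIT_SYNC (1 << 31)

    /* in the pushbuf validation loop: */
    if (!(nvbo->flags & NOUVEAU_GEM_EXPLICIT_SYNC)) {
            /* legacy path: wait for the bo's implicit fences */
            ret = wait_for_implicit_fences(nvbo, chan);  /* placeholder */
            if (ret)
                    return ret;
    }
    /* in both cases, attach the submit fence as shared afterwards */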
On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
Implicit fences attached to individual buffers are one way for residency management. Do you think a working set based model could work in the DRM framework? For example, something like this:
- Allow user space to create "working set objects" and associate buffers with them. If the user space doesn't want to manage working sets explicitly, it could also use an implicit default working set that contains all buffers that are mapped to the channel vm (on Android we could always use the default working set since we don't need to manage residency). The working sets are initially marked as dirty.
- User space tells which working sets are referenced by each work submission. Kernel locks these working sets, pins all buffers in dirty working sets, and resets the dirty bits. After kicking off work, kernel stores the fence to the _working sets_, and then releases the locks (if an implicit default working set is used, then this would be roughly equivalent to storing a fence to channel vm that tells "this is the last hw operation that might have touched buffers in this address space").
- If swapping doesn't happen, then we just need to check the working set dirty bits at each submit.
- When a buffer is swapped out, all working sets that refer to it need to be marked as dirty.
- When a buffer is swapped out or unmapped, we need to wait for the fences from all working sets that refer to the buffer.
Initially one might think of working sets as a mere optimization - we now need to process a few working sets at every submit instead of many individual buffers. However, it makes a huge difference because of fences: fences that are attached to buffers are used for implicitly synchronizing work across different channels and engines. They are in the performance critical path, and we want to carefully manage them (that's the idea of explicit synchronization). The working set fences, on the other hand, would only be used to guarantee that we don't swap out or unmap something that the GPU might be accessing. We never need to wait for those fences (except when swapping or unmapping), so we can be conservative without hurting performance.
Yeah, within the driver (i.e. for private objects which are never exported to dma_buf) we have recently become able to do stuff like this. And your above idea is roughly one of the things we're tossing around for i915.
But the cool stuff with drm is that cmd submission is driver-specific, so you can just go wild with nouveau. Of course you have to convince the nouveau guys (and also have open-source users for the new interface).
For shared buffers I think we should stick with the implicit fences for a while simply because I'm not sure whether it's really worth the fuss. And reworking all the drivers and dma-buf for some working sets is a lot of fuss ;-) Like Maarten said you can mostly short-circuit the implicit fencing by only attaching shared fences.
Yes, I'll try to do that.
In case you're curious: The idea is to have a 1:1 association between ppgtt address spaces and what you call the working set above, to implement the buffer svm model in ocl2. Mostly because we expect that applications won't get the more fine-grained buffer list right anyway. And this kind of gang-scheduling of working set sizes should be more efficient for the usual case where everything fits.
If I understood correctly, this would be exactly the same as what I called the "default working set" above. On Android we don't care much about finer grained working sets either, because of UMA and no swapping.
Imo de-staging the android syncpt stuff needs to happen first, before drivers can use it. Since non-staging stuff really shouldn't depend upon code from staging.
Fully agree. I thought the best way towards that would be to show some driver code that _would_ use it. :)
Oh, there's the usual chicken&egg where we need a full-blown prototype before we can start merging. Interface work on upstream is super-hard, but given the ridiculous backwards-compat guarantees Linus expects us to keep up, it's totally justified. Mistakes are really expensive. So I'm happy to see you charge ahead here.
Given that I'm not used to working with upstream, don't expect too much from my "charging ahead". :) I'm still secretly hoping that the Android guys at Google would jump in to help, now that we seem to agree that we could de-stage sync_fence.
I'm all for adding explicit syncing. Our plans are roughly:
- Add both an in and an out fence to execbuf to sync with other rendering and give userspace a fence back. Needs two different flags probably.
- Maybe add an ioctl to dma-bufs to get at the current implicit fences attached to them (both an exclusive and non-exclusive version). This should help with making explicit and implicit sync work together nicely.
- Add fence support to kms. Probably only worth it together with the new atomic stuff. Again we need an in fence to wait for (one for each buffer) and an out fence. The latter can easily be implemented by extending struct drm_event, which means not a single driver code line needs to be changed for this.
- For de-staging android syncpts we need to de-clutter the internal interfaces and also review all the ioctls exposed. Like you say it should be just the userspace interface for struct drm_fence. Also, it needs testcases and preferably manpages.
This all sounds very similar to what we'd like to do! Maybe we can move forward with these parts, and continue to attach fences at submit until we have a satisfactory solution for the pinning problem?
Yeah, that's our plan for i915 too. First add explicit fences, then figure out whether we need to be better at neutering the implicit fences, but only if and where it really gets in the way.
I'd like to understand what the concrete steps are to de-stage struct sync_fence, since that's the first thing that needs to be done. For example, what do you mean by "de-cluttering the internal interfaces"? Just that we'd move the sync_fence parts from drivers/staging/android/sync.c to, say, drivers/dma-buf/sync-fence.c? Would we still leave a copy of the existing full driver in staging/android?
Yeah I guess that would be an approach. Personally I think we should also have basic ioctl testcases for all the ioctls exposed by syncpt fds. And reviewing the kerneldoc for the driver-internal interfaces (which includes removing everything that's now made obsolete by struct fence). Bonus points for documenting the ioctls. We could throw the test binary into libdrm maybe, there's a bunch of others like it already there.
I'm not sure whether/how much google has already for this.
The simplest way to add tests is if we allow user space to create and trigger fences. We don't want to enable that from kernel by default, because that opens possibilities for deadlocks (e.g. a process could deadlock a compositor by passing a fence that it never triggers). Android sync driver solves this by having a separate CONFIG_SW_SYNC_USER that can be used for testing.
Here's a simple test by Google: https://android.googlesource.com/platform/system/core/+/master/libsync/sync_...
And this is their userspace wrapper lib: https://android.googlesource.com/platform/system/core/+/master/libsync/sync.... https://android.googlesource.com/platform/system/core/+/master/libsync/inclu...
Would that go to libdrm now...?
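For reference, a minimal smoke test written against the current staging interface, using the wrapper functions from that lib, looks roughly like this (function names as in the Android libsync sources linked above; the header layout and names may well change when this moves out of staging):

    /* Sketch of a sw_sync smoke test built on the Android libsync helpers. */
    #include <assert.h>
    #include <sync/sync.h>

    int main(void)
    {
            int timeline = sw_sync_timeline_create();
            /* fence that signals when the timeline reaches value 1 */
            int fence = sw_sync_fence_create(timeline, "test", 1);

            assert(sync_wait(fence, 0) < 0);       /* not signalled yet */
            sw_sync_timeline_inc(timeline, 1);     /* trigger it */
            assert(sync_wait(fence, 1000) == 0);   /* now it signals */
            return 0;
    }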
Aside: Will you be at XDC or linux plumbers? Either would be a perfect place to discuss plans and ideas - I'll attend both.
I wasn't going to, but let's see. The former is pretty soon and the latter is sold out. At least Andy Ritger from Nvidia is coming to XDC for sure, and he's been involved in our internal discussions around these topics. So I suggest you have a chat with him at least! :)
Thanks, Lauri
On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
+Rom who seems to be presenting about mainlining android sync at linux plumbers
Also add Greg KH as fyi that we're working on de-staging one of the android subsystems.
On Wed, Oct 01, 2014 at 05:58:52PM +0200, Maarten Lankhorst wrote:
You could neuter implicit fences by always attaching the fences as shared when explicit syncing is used. This would work correctly with eviction, and wouldn't cause any unneeded syncing. :)
Yes, that will probably work! So, just to reiterate that I understood you and Daniel correctly:
- de-stage sync_fence and its user space API (the tedious part)
- add dma-buf ioctls for extracting and attaching explicit fences
- Nouveau specific: allow flagging gem buffers for explicit sync
- do not check pre-fences from explicitly synced buffers at submit
- continue to attach a shared fence after submit so that pinning and unmapping continue to work
- Explicit in/out fences for the pushbuf ioctl are missing in this step, I guess?
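For those pushbuf in/out fences, the uapi addition could be as small as the sketch below. The struct and flag names are invented here just to show the shape of it:

    /* Hypothetical sketch - extending the nouveau pushbuf submission. */
    #include <drm/nouveau_drm.h>

    #define NOUVEAU_GEM_PUSHBUF_FENCE_IN  (1 << 0)
    #define NOUVEAU_GEM_PUSHBUF_FENCE_OUT (1 << 1)

    struct drm_nouveau_gem_pushbuf2 {
            struct drm_nouveau_gem_pushbuf base;
            __u32 flags;
            __s32 fence_in;    /* sync fence fd the job must wait for */
            __s32 fence_out;   /* returned: fd that signals job completion */
    };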
I also think we need some kind of demonstration vehicle using nouveau to satisfy Dave Airlie's open-source userspace requirements for new interfaces. Might be good to chat with him to make sure we have that covered (and how much needs to be there really).
Then working sets and getting rid of locking all buffers individually can be dealt with later as an optimization.
Yeah, sounds like a good plan.
On Wed, Oct 01, 2014 at 07:27:21PM +0200, Daniel Vetter wrote:
On Wed, Oct 01, 2014 at 06:14:16PM +0300, Lauri Peltonen wrote:
Implicit fences attached to individual buffers are one way for residency management. Do you think a working set based model could work in the DRM framework? For example, something like this:
- Allow user space to create "working set objects" and associate buffers with them. If the user space doesn't want to manage working sets explicitly, it could also use an implicit default working set that contains all buffers that are mapped to the channel vm (on Android we could always use the default working set since we don't need to manage residency). The working sets are initially marked as dirty.
- User space tells which working sets are referenced by each work submission. Kernel locks these working sets, pins all buffers in dirty working sets, and resets the dirty bits. After kicking off work, kernel stores the fence to the _working sets_, and then releases the locks (if an implicit default working set is used, then this would be roughly equivalent to storing a fence to channel vm that tells "this is the last hw operation that might have touched buffers in this address space").
- If swapping doesn't happen, then we just need to check the working set dirty bits at each submit.
- When a buffer is swapped out, all working sets that refer to it need to be marked as dirty.
- When a buffer is swapped out or unmapped, we need to wait for the fences from all working sets that refer to the buffer.
Initially one might think of working sets as a mere optimization - we now need to process a few working sets at every submit instead of many individual buffers. However, it makes a huge difference because of fences: fences that are attached to buffers are used for implicitly synchronizing work across different channels and engines. They are in the performance critical path, and we want to carefully manage them (that's the idea of explicit synchronization). The working set fences, on the other hand, would only be used to guarantee that we don't swap out or unmap something that the GPU might be accessing. We never need to wait for those fences (except when swapping or unmapping), so we can be conservative without hurting performance.
Yeah, within the driver (i.e. for private objects which are never exported to dma_buf) we have recently become able to do stuff like this. And your above idea is roughly one of the things we're tossing around for i915.
But the cool stuff with drm is that cmd submission is driver-specific, so you can just go wild with nouveau. Of course you have to convince the nouveau guys (and also have open-source users for the new interface).
For shared buffers I think we should stick with the implicit fences for a while simply because I'm not sure whether it's really worth the fuss. And reworking all the drivers and dma-buf for some working sets is a lot of fuss ;-) Like Maarten said you can mostly short-circuit the implicit fencing by only attaching shared fences.
Yes, I'll try to do that.
In case you're curious: The idea is to have a 1:1 association between ppgtt address spaces and what you call the working set above, to implement the buffer svm model in ocl2. Mostly because we expect that applications won't get the more fine-grained buffer list right anyway. And this kind of gang-scheduling of working set sizes should be more efficient for the usual case where everything fits.
If I understood correctly, this would be exactly the same as what I called the "default working set" above. On Android we don't care much about finer grained working sets either, because of UMA and no swapping.
Yeah, that's pretty much the idea. Well with ocl the SVM buffer working set sizes are attached to an ocl context, but I don't think there'll be more than one of those around really (except maybe when different libraries all use ocl, but don't work together on the same data).
Imo de-staging the android syncpt stuff needs to happen first, before drivers can use it. Since non-staging stuff really shouldn't depend upon code from staging.
Fully agree. I thought the best way towards that would be to show some driver code that _would_ use it. :)
Oh, there's the usual chicken&egg where we need a full-blown prototype before we can start merging. Interface work on upstream is super-hard, but given the ridiculous backwards-compat guarantees Linus expects us to keep up, it's totally justified. Mistakes are really expensive. So I'm happy to see you charge ahead here.
Given that I'm not used to working with upstream, don't expect too much from my "charging ahead". :) I'm still secretly hoping that the Android guys at Google would jump in to help, now that we seem to agree that we could de-stage sync_fence.
Forget that, imo the android guys are 100% absorbed with their own stuff, at least on the gfx side. Occasionally they pipe up with "btw this is what we're doing now".
Also, the problem is that to actually push android stuff out of staging you need a use-case in upstream, which means an open-source gpu driver. There's not a lot of companies who have both that and ship android, and definitely not the nexus/android lead platforms.
Display side would be easier since there's a bunch of kms drivers now upstream. But given that google decided to go ahead with their own adf instead of drm-kms that's also a non-starter.
Heck, I have a hard time dragging our own android folks here at intel into upstream work ...
I'm all for adding explicit syncing. Our plans are roughly:
- Add both an in and an out fence to execbuf to sync with other rendering and give userspace a fence back. Needs two different flags probably.
- Maybe add an ioctl to dma-bufs to get at the current implicit fences attached to them (both an exclusive and non-exclusive version). This should help with making explicit and implicit sync work together nicely.
- Add fence support to kms. Probably only worth it together with the new atomic stuff. Again we need an in fence to wait for (one for each buffer) and an out fence. The latter can easily be implemented by extending struct drm_event, which means not a single driver code line needs to be changed for this.
- For de-staging android syncpts we need to de-clutter the internal interfaces and also review all the ioctls exposed. Like you say it should be just the userspace interface for struct drm_fence. Also, it needs testcases and preferably manpages.
This all sounds very similar to what we'd like to do! Maybe we can move forward with these parts, and continue to attach fences at submit until we have a satisfactory solution for the pinning problem?
Yeah, that's our plan for i915 too. First add explicit fences, then figure out whether we need to be better at neutering the implicit fences, but only if and where it really gets in the way.
I'd like to understand what the concrete steps are to de-stage struct sync_fence, since that's the first thing that needs to be done. For example, what do you mean by "de-cluttering the internal interfaces"? Just that we'd move the sync_fence parts from drivers/staging/android/sync.c to, say, drivers/dma-buf/sync-fence.c? Would we still leave a copy of the existing full driver in staging/android?
Yeah I guess that would be an approach. Personally I think we should also have basic ioctl testcases for all the ioctls exposed by syncpt fds. And reviewing the kerneldoc for the driver-internal interfaces (which includes removing everything that's now made obsolete by struct fence). Bonus points for documenting the ioctls. We could throw the test binary into libdrm maybe, there's a bunch of others like it already there.
I'm not sure whether/how much google has already for this.
The simplest way to add tests is if we allow user space to create and trigger fences. We don't want to enable that from kernel by default, because that opens possibilities for deadlocks (e.g. a process could deadlock a compositor by passing a fence that it never triggers). Android sync driver solves this by having a separate CONFIG_SW_SYNC_USER that can be used for testing.
Here's a simple test by Google: https://android.googlesource.com/platform/system/core/+/master/libsync/sync_...
Hm, usually we expose such test interfaces through debugfs - that way production systems won't ever ship with it (since there are too many exploits in there, especially with secure boot). But since you need it for validation tests (at least for the i915 suite), it should always be there when you need it.
Exposing this as a configurable driver in /dev is imo a no-go. But we should be able to easily convert this into a few debugfs files, so not too much fuss hopefully.
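Something along these lines, presumably: a debugfs file whose open creates a timeline and whose ioctls create and trigger fences on it. A sketch only, with all names invented and the actual fence plumbing left out:

    /* Sketch: the sw_sync test interface re-exposed through debugfs. */
    static const struct file_operations sw_sync_debugfs_fops = {
            .owner          = THIS_MODULE,
            .open           = sw_sync_debugfs_open,    /* creates a timeline */
            .unlocked_ioctl = sw_sync_debugfs_ioctl,   /* create fence / inc */
            .release        = sw_sync_debugfs_release,
    };

    static int __init sw_sync_debugfs_init(void)
    {
            debugfs_create_file("sw_sync", 0600, NULL, NULL,
                                &sw_sync_debugfs_fops);
            return 0;
    }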
And this is their userspace wrapper lib: https://android.googlesource.com/platform/system/core/+/master/libsync/sync.... https://android.googlesource.com/platform/system/core/+/master/libsync/inclu...
Would that go to libdrm now...?
Imo it would make sense. At least we kinda want to use fences outside of Android, and a separate library project feels like overkill.
Aside: Will you be at XDC or linux plumbers? Either would be a perfect place to discuss plans and ideas - I'll attend both.
I wasn't going to, but let's see. The former is pretty soon and the latter is sold out. At least Andy Ritger from Nvidia is coming to XDC for sure, and he's been involved in our internal discussions around these topics. So I suggest you have a chat with him at least! :)
I'll definitely have a chat (and some beers) with Andy, it's been a while since I last saw him ;-)
Cheers, Daniel
Riley (CCed) and I will be at Plumbers in a couple weeks.
There is a session on sync planned in the Android track, and of course we'll be available to chat.
On Thu, Oct 02, 2014 at 10:44:05PM +0200, Daniel Vetter wrote:
On Thu, Oct 02, 2014 at 05:59:51PM +0300, Lauri Peltonen wrote:
Yes, that will probably work! So, just to reiterate that I understood you and Daniel correctly:
- de-stage sync_fence and its user space API (the tedious part)
- add dma-buf ioctls for extracting and attaching explicit fences
- Nouveau specific: allow flagging gem buffers for explicit sync
- do not check pre-fences from explicitly synced buffers at submit
- continue to attach a shared fence after submit so that pinning and unmapping continue to work
- Explicit in/out fences for the pushbuf ioctl are missing in this step, I guess?
Yes, I was missing that. :)
I also think we need some kind of demonstration vehicle using nouveau to satisfy Dave Airlie's open-source userspace requirements for new interfaces. Might be good to chat with him to make sure we have that covered (and how much needs to be there really).
Agreed.
Also, the problem is that to actually push android stuff out of staging you need a use-case in upstream, which means an open-source gpu driver. There's not a lot of companies who have both that and ship android, and definitely not the nexus/android lead platforms.
Display side would be easier since there's a bunch of kms drivers now upstream. But given that google decided to go ahead with their own adf instead of drm-kms that's also a non-starter.
Hmm.. Maybe we could use TegraDRM on the display side.. That and Nouveau would already be two upstream drivers that support explicit sync on Tegra K1.
Also, if we bring sync fds out of staging, one idea would be to add support for EGL_ANDROID_native_fence_sync in mesa, along with some tests. That would demonstrate converting between sync fds and EGLSync objects.
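For reference, the conversion that extension gives to applications looks roughly like this (a sketch; in a real program the KHR/ANDROID entry points are obtained via eglGetProcAddress, and a glFlush is needed before the exported fd becomes valid):

    /* Sketch: sync fd <-> EGLSync via EGL_ANDROID_native_fence_sync. */
    #define EGL_EGLEXT_PROTOTYPES
    #include <EGL/egl.h>
    #include <EGL/eglext.h>

    static int import_then_export(EGLDisplay dpy, int fence_fd)
    {
            /* Import: wrap an existing sync fd so the GPU waits for it. */
            EGLint attribs[] = { EGL_SYNC_NATIVE_FENCE_FD_ANDROID, fence_fd, EGL_NONE };
            EGLSyncKHR in_sync = eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, attribs);
            eglWaitSyncKHR(dpy, in_sync, 0);

            /* Export: create a native fence for the commands issued so far
             * and pull a sync fd out of it for another driver or process. */
            EGLSyncKHR out_sync = eglCreateSyncKHR(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);
            return eglDupNativeFenceFDANDROID(dpy, out_sync);
    }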
Hm, usually we expose such test interfaces through debugfs - that way production systems won't ever ship with it (since there are too many exploits in there, especially with secure boot). But since you need it for validation tests (at least for the i915 suite), it should always be there when you need it.
Exposing this as a configurable driver in /dev is imo a no-go. But we should be able to easily convert this into a few debugfs files, so not too much fuss hopefully.
Good idea!
Aside: Will you be at XDC or linux plumbers? Either would be a perfect place to discuss plans and ideas - I'll attend both.
I wasn't going to, but let's see. The former is pretty soon and the latter is sold out. At least Andy Ritger from Nvidia is coming to XDC for sure, and he's been involved in our internal discussions around these topics. So I suggest you have a chat with him at least! :)
I'll definitely have a chat (and some beers) with Andy, it's been a while since I last saw him ;-)
Change of plans, I'll attend XDC, so see you there! I'll even give a short talk about explicit sync to get some discussions going. :)
Thanks, Lauri
On Mon, Oct 6, 2014 at 2:25 PM, Lauri Peltonen lpeltonen@nvidia.com wrote:
Also, the problem is that to actually push android stuff out of staging you need a use-case in upstream, which means an open-source gpu driver. There's not a lot of companies who have both that and ship android, and definitely not the nexus/android lead platforms.
Display side would be easier since there's a bunch of kms drivers now upstream. But given that google decided to go ahead with their own adf instead of drm-kms that's also a non-starter.
Hmm.. Maybe we could use TegraDRM on the display side.. That and Nouveau would already be two upstream drivers that support explicit sync on Tegra K1.
Also, if we bring sync fds out of staging, one idea would be to add support for EGL_ANDROID_native_fence_sync in mesa, along with some tests. That would demonstrate converting between sync fds and EGLSync objects.
Just read through the extension spec for this and this sounds excellent. So enabling that in mesa and having some basic piglits for the conversion should be more than good enough to fulfill the open-source userspace driver requirement. -Daniel