amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
This patch was tested with intel-kms + radeon/amdgpu/nouveau-kms and provides correct prime sync during page flipping under DRI3/Present.
Should fix https://bugs.freedesktop.org/show_bug.cgi?id=95472 at least for page-flipped presentation.
Suggested-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: Michel Dänzer <michel.daenzer@amd.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
---
 drivers/gpu/drm/i915/intel_display.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 922709b..4b74b96 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -12043,7 +12043,7 @@ static void intel_mmio_flip_work_func(struct work_struct *w)
 	/* For framebuffer backed by dmabuf, wait for fence */
 	resv = i915_gem_object_get_dmabuf_resv(obj);
 	if (resv)
-		WARN_ON(reservation_object_wait_timeout_rcu(resv, false, false,
+		WARN_ON(reservation_object_wait_timeout_rcu(resv, true, false,
 							    MAX_SCHEDULE_TIMEOUT) < 0);
 
 	intel_pipe_update_start(crtc);
@@ -14700,7 +14700,7 @@ intel_prepare_plane_fb(struct drm_plane *plane,
 		if (resv) {
 			long lret;
 
-			lret = reservation_object_wait_timeout_rcu(resv, false, true,
+			lret = reservation_object_wait_timeout_rcu(resv, true, true,
 								   MAX_SCHEDULE_TIMEOUT);
 			if (lret == -ERESTARTSYS)
 				return lret;
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engines. -Chris
On 09/08/2016 08:30 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engine. -Chris
Ok. I think I now understand the difference, but let's check: the exclusive fence is essentially acting a bit like a write lock, and the shared fences a bit like reader locks? So you can have multiple readers but only one writer at a time?
Ie.:
Writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it on end of write access. E.g., write to renderbuffer, write to texture etc.
Readers must wait for exclusive fence, then attach a shared fence per reader and signal it on end of read access? E.g., read from texture, fb, scanout?
Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this?
thanks, -mario
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
On 09/08/2016 08:30 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engine. -Chris
Ok. I think i now understand the difference, but let's check: The exclusive fence is essentially acting a bit like a write-lock, and the shared fences as readers-locks? So you can have multiple readers but only one writer at a time?
That's how we (i915.ko and I hope the rest of the world) are using them. In the model where there is just one reservation object on the GEM object, that reservation object is then shared between internal driver scheduling and external. We are reliant on being able to use buffers on multiple engines by virtue of the shared fences, and to serialise after writes by waiting on the exclusive fence. (So we can have concurrent reads on the display engine, render engines and on the CPU - or alternatively an exclusive writer.)
In the near future, i915 flips will wait on the common reservation object not only for dma-bufs, but also its own GEM objects.
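As a rough illustration of that model (a sketch only, assuming the struct fence / linux/reservation.h API of this kernel; the example_* helpers are made up, and real drivers would schedule these waits on the GPU rather than blocking the CPU):

#include <linux/fence.h>
#include <linux/reservation.h>
#include <linux/sched.h>
#include <linux/ww_mutex.h>

/* Writer: wait for all readers and the previous writer, then publish an
 * exclusive fence covering the new write. */
static long example_begin_write(struct reservation_object *resv,
				struct fence *write_fence)
{
	long ret;

	ret = reservation_object_wait_timeout_rcu(resv, true, true,
						  MAX_SCHEDULE_TIMEOUT);
	if (ret < 0)
		return ret;

	ww_mutex_lock(&resv->lock, NULL);
	reservation_object_add_excl_fence(resv, write_fence);
	ww_mutex_unlock(&resv->lock);
	return 0;
}

/* Reader: wait only for the exclusive (write) fence, then add a shared
 * fence covering the read (scanout, texturing, CPU access, ...). */
static long example_begin_read(struct reservation_object *resv,
			       struct fence *read_fence)
{
	long ret;

	ret = reservation_object_wait_timeout_rcu(resv, false, true,
						  MAX_SCHEDULE_TIMEOUT);
	if (ret < 0)
		return ret;

	ww_mutex_lock(&resv->lock, NULL);
	ret = reservation_object_reserve_shared(resv);
	if (!ret)
		reservation_object_add_shared_fence(resv, read_fence);
	ww_mutex_unlock(&resv->lock);
	return ret;
}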
Ie.:
Writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it on end of write access. E.g., write to renderbuffer, write to texture etc.
Yes.
Readers must wait for exclusive fence, then attach a shared fence per reader and signal it on end of read access? E.g., read from texture, fb, scanout?
Yes.
Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this?
i915.ko requires the client to mark which buffers are written to.
In ttm, there are ttm_validate_buffer objects which mark whether they should be using shared or exclusive fences. Afaict, in amdgpu they are all set to shared, the relevant user interface seems to be amdgpu_bo_list_set(). -Chris
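For reference, the TTM helper that consumes that flag looks roughly like this (a trimmed paraphrase of ttm_eu_fence_buffer_objects() in ttm_execbuf_util.c of this kernel, not the verbatim code; reservation of the shared slots and unreserving are handled elsewhere during validation):

#include <linux/fence.h>
#include <linux/list.h>
#include <linux/reservation.h>
#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_execbuf_util.h>

/* For each validated buffer, entry->shared decides whether the command
 * submission counts as a reader (shared fence) or a writer (exclusive
 * fence) on that BO. */
static void example_fence_validated_buffers(struct list_head *list,
					    struct fence *fence)
{
	struct ttm_validate_buffer *entry;

	list_for_each_entry(entry, list, head) {
		struct ttm_buffer_object *bo = entry->bo;

		if (entry->shared)
			reservation_object_add_shared_fence(bo->resv, fence);
		else
			reservation_object_add_excl_fence(bo->resv, fence);
	}
}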
On 09/09/16 01:23 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
On 09/08/2016 08:30 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engine. -Chris
Ok. I think i now understand the difference, but let's check: The exclusive fence is essentially acting a bit like a write-lock, and the shared fences as readers-locks? So you can have multiple readers but only one writer at a time?
That's how we (i915.ko and I hope the rest of the world) are using them. In the model where here is just one reservation object on the GEM object, that reservation object is then shared between internal driver scheduling and external. We are reliant on being able to use buffers on multiple engines through the virtue of the shared fences, and to serialise after writes by waiting on the exclusive fence. (So we can have concurrent reads on the display engine, render engines and on the CPU - or alternatively an exclusive writer.)
In the near future, i915 flips will wait on the common reservation object not only for dma-bufs, but also its own GEM objects.
Ie.:
Writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it on end of write access. E.g., write to renderbuffer, write to texture etc.
Yes.
Readers must wait for exclusive fence, then attach a shared fence per reader and signal it on end of read access? E.g., read from texture, fb, scanout?
Yes.
Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this?
i915.ko requires the client to mark which buffers are written to.
In ttm, there are ttm_validate_buffer objects which mark whether they should be using shared or exclusive fences. Afaict, in amdgpu they are all set to shared, the relevant user interface seems to be amdgpu_bo_list_set().
This all makes sense to me.
Christian, why is amdgpu setting only shared fences? Can we fix that?
Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
On 09/09/16 01:23 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
On 09/08/2016 08:30 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engine. -Chris
Ok. I think i now understand the difference, but let's check: The exclusive fence is essentially acting a bit like a write-lock, and the shared fences as readers-locks? So you can have multiple readers but only one writer at a time?
That's how we (i915.ko and I hope the rest of the world) are using them. In the model where here is just one reservation object on the GEM object, that reservation object is then shared between internal driver scheduling and external. We are reliant on being able to use buffers on multiple engines through the virtue of the shared fences, and to serialise after writes by waiting on the exclusive fence. (So we can have concurrent reads on the display engine, render engines and on the CPU - or alternatively an exclusive writer.)
In the near future, i915 flips will wait on the common reservation object not only for dma-bufs, but also its own GEM objects.
Ie.:
Writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it on end of write access. E.g., write to renderbuffer, write to texture etc.
Yes.
Readers must wait for exclusive fence, then attach a shared fence per reader and signal it on end of read access? E.g., read from texture, fb, scanout?
Yes.
Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this?
i915.ko requires the client to mark which buffers are written to.
In ttm, there are ttm_validate_buffer objects which mark whether they should be using shared or exclusive fences. Afaict, in amdgpu they are all set to shared, the relevant user interface seems to be amdgpu_bo_list_set().
This all makes sense to me.
Christian, why is amdgpu setting only shared fences? Can we fix that?
No, amdgpu relies on the fact that we even allow concurrent write accesses by userspace.
E.g. one part of the buffer can be rendered by one engine while another part could be rendered by another engine.
Just imagine X which is composing a buffer with both the 3D engine as well as the DMA engine.
All engines need to run in parallel and you need to wait for all of them to finish before scanout.
Everybody who needs exclusive access to the reservation object (like scanout does) needs to wait for all fences, not just the exclusive one.
The Intel driver clearly needs to be fixed here.
Regards, Christian.
On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote:
Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
On 09/09/16 01:23 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
On 09/08/2016 08:30 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engine. -Chris
Ok. I think i now understand the difference, but let's check: The exclusive fence is essentially acting a bit like a write-lock, and the shared fences as readers-locks? So you can have multiple readers but only one writer at a time?
That's how we (i915.ko and I hope the rest of the world) are using them. In the model where here is just one reservation object on the GEM object, that reservation object is then shared between internal driver scheduling and external. We are reliant on being able to use buffers on multiple engines through the virtue of the shared fences, and to serialise after writes by waiting on the exclusive fence. (So we can have concurrent reads on the display engine, render engines and on the CPU - or alternatively an exclusive writer.)
In the near future, i915 flips will wait on the common reservation object not only for dma-bufs, but also its own GEM objects.
Ie.:
Writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it on end of write access. E.g., write to renderbuffer, write to texture etc.
Yes.
Readers must wait for exclusive fence, then attach a shared fence per reader and signal it on end of read access? E.g., read from texture, fb, scanout?
Yes.
Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this?
i915.ko requires the client to mark which buffers are written to.
In ttm, there are ttm_validate_buffer objects which mark whether they should be using shared or exclusive fences. Afaict, in amdgpu they are all set to shared, the relevant user interface seems to be amdgpu_bo_list_set().
This all makes sense to me.
Christian, why is amdgpu setting only shared fences? Can we fix that?
No, amdgpu relies on the fact that we even allow concurrent write accesses by userspace.
E.g. one part of the buffer can be rendered by one engine while another part could be rendered by another engine.
Just imagine X which is composing a buffer with both the 3D engine as well as the DMA engine.
All engines need to run in parallel and you need to wait for all of them to finish before scanout.
Everybody which needs exclusive access to the reservation object (like scanouts do) needs to wait for all fences, not just the exclusive one.
The Intel driver clearly needs to be fixed here.
If you are not using implicit fencing, you have to pass explicit fences instead. -Chris
Am 13.09.2016 um 11:39 schrieb Chris Wilson:
On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote:
Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
On 09/09/16 01:23 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
On 09/08/2016 08:30 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engine. -Chris
Ok. I think i now understand the difference, but let's check: The exclusive fence is essentially acting a bit like a write-lock, and the shared fences as readers-locks? So you can have multiple readers but only one writer at a time?
That's how we (i915.ko and I hope the rest of the world) are using them. In the model where here is just one reservation object on the GEM object, that reservation object is then shared between internal driver scheduling and external. We are reliant on being able to use buffers on multiple engines through the virtue of the shared fences, and to serialise after writes by waiting on the exclusive fence. (So we can have concurrent reads on the display engine, render engines and on the CPU - or alternatively an exclusive writer.)
In the near future, i915 flips will wait on the common reservation object not only for dma-bufs, but also its own GEM objects.
Ie.:
Writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it on end of write access. E.g., write to renderbuffer, write to texture etc.
Yes.
Readers must wait for exclusive fence, then attach a shared fence per reader and signal it on end of read access? E.g., read from texture, fb, scanout?
Yes.
Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this?
i915.ko requires the client to mark which buffers are written to.
In ttm, there are ttm_validate_buffer objects which mark whether they should be using shared or exclusive fences. Afaict, in amdgpu they are all set to shared, the relevant user interface seems to be amdgpu_bo_list_set().
This all makes sense to me.
Christian, why is amdgpu setting only shared fences? Can we fix that?
No, amdgpu relies on the fact that we even allow concurrent write accesses by userspace.
E.g. one part of the buffer can be rendered by one engine while another part could be rendered by another engine.
Just imagine X which is composing a buffer with both the 3D engine as well as the DMA engine.
All engines need to run in parallel and you need to wait for all of them to finish before scanout.
Everybody which needs exclusive access to the reservation object (like scanouts do) needs to wait for all fences, not just the exclusive one.
The Intel driver clearly needs to be fixed here.
If you are not using implicit fencing, you have to pass explicit fences instead.
Which is exactly what we do, but only for the driver-internal command submissions.
All command submissions from the same process can run concurrently with amdgpu; only when we see a fence from another driver or process do we wait for it to complete before starting to run a command submission.
Other drivers can't make any assumptions about what a shared access is actually doing (e.g. writing or reading) with a buffer.
So the model in which i915.ko is using the reservation object and its shared fences is certainly not correct and needs to be fixed.
Regards, Christian.
-Chris
On 13/09/16 09:52 PM, Christian König wrote:
Am 13.09.2016 um 11:39 schrieb Chris Wilson:
On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote:
Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
On 09/09/16 01:23 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
On 09/08/2016 08:30 AM, Chris Wilson wrote:
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping.
No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engine. -Chris
Ok. I think i now understand the difference, but let's check: The exclusive fence is essentially acting a bit like a write-lock, and the shared fences as readers-locks? So you can have multiple readers but only one writer at a time?
That's how we (i915.ko and I hope the rest of the world) are using them. In the model where here is just one reservation object on the GEM object, that reservation object is then shared between internal driver scheduling and external. We are reliant on being able to use buffers on multiple engines through the virtue of the shared fences, and to serialise after writes by waiting on the exclusive fence. (So we can have concurrent reads on the display engine, render engines and on the CPU - or alternatively an exclusive writer.)
In the near future, i915 flips will wait on the common reservation object not only for dma-bufs, but also its own GEM objects.
Ie.:
Writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it on end of write access. E.g., write to renderbuffer, write to texture etc.
Yes.
Readers must wait for exclusive fence, then attach a shared fence per reader and signal it on end of read access? E.g., read from texture, fb, scanout?
Yes.
Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this?
i915.ko requires the client to mark which buffers are written to.
In ttm, there are ttm_validate_buffer objects which mark whether they should be using shared or exclusive fences. Afaict, in amdgpu they are all set to shared, the relevant user interface seems to be amdgpu_bo_list_set().
This all makes sense to me.
Christian, why is amdgpu setting only shared fences? Can we fix that?
No, amdgpu relies on the fact that we even allow concurrent write accesses by userspace.
E.g. one part of the buffer can be rendered by one engine while another part could be rendered by another engine.
Just imagine X which is composing a buffer with both the 3D engine as well as the DMA engine.
All engines need to run in parallel and you need to wait for all of them to finish before scanout.
Everybody which needs exclusive access to the reservation object (like scanouts do) needs to wait for all fences, not just the exclusive one.
The Intel driver clearly needs to be fixed here.
If you are not using implicit fencing, you have to pass explicit fences instead.
Which is exactly what we do, but only for the driver internally command submissions.
All command submissions from the same process can run concurrently with amdgpu, only when we see a fence from another driver or process we wait for it to complete before starting to run a command submission.
Other drivers can't make any assumption on what a shared access is actually doing (e.g. writing or reading) with a buffer.
So the model i915.ko is using the reservation object and it's shared fences is certainly not correct and needs to be fixed.
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and using PRIME slave scanout on radeon leaves artifacts.
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
Yeah, I think as well that this is the primary question here.
IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose.
I absolutely don't mind switching them to writing/reading semantics, but amdgpu really needs multiple writers at the same time.
So in this case the writing side of a reservation object needs to be a collection of fences as well.
FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and using PRIME slave scanout on radeon leaves artifacts.
Yeah, I know. See radeon_display.c radeon_flip_work_func().
We pretty much need the same patch here I've done for amdgpu as well.
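The shape of that fix is essentially the same wait as in the i915 hunk at the top of this thread (sketch only; the surrounding flip-work plumbing is omitted and the function name is made up):

#include <linux/reservation.h>
#include <linux/sched.h>

/* Before arming the page flip, wait for ALL fences on the framebuffer BO's
 * reservation object - shared ones included, not just the exclusive fence. */
static long example_wait_before_flip(struct reservation_object *resv)
{
	return reservation_object_wait_timeout_rcu(resv,
						   true,  /* wait_all */
						   false, /* not interruptible */
						   MAX_SCHEDULE_TIMEOUT);
}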
Regards, Christian.
On Wed, Sep 21, 2016 at 12:30 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
Yeah, I think as well that this is the primary question here.
IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose.
I absolutely don't mind switching to them to writing/reading semantics, but amdgpu really needs multiple writers at the same time.
So in this case the writing side of a reservation object needs to be a collection of fences as well.
You can't have multiple writers with implicit syncing. That confusion is exactly why we called them shared/exclusive. Multiple writers generally means that you do some form of fencing in userspace (unsync'ed gl buffer access is the common one). What you do for private buffers doesn't matter, but when you render into a shared/winsys buffer you really need to set the exclusive fence (and there can only ever be one). So probably needs some userspace adjustments to make sure you don't accidentally set an exclusive write hazard when you don't really want that implicit sync. -Daniel
Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 12:30 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
Yeah, I think as well that this is the primary question here.
IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose.
I absolutely don't mind switching to them to writing/reading semantics, but amdgpu really needs multiple writers at the same time.
So in this case the writing side of a reservation object needs to be a collection of fences as well.
You can't have multiple writers with implicit syncing. That confusion is exactly why we called them shared/exclusive. Multiple writers generally means that you do some form of fencing in userspace (unsync'ed gl buffer access is the common one). What you do for private buffers doesn't matter, but when you render into a shared/winsys buffer you really need to set the exclusive fence (and there can only ever be one). So probably needs some userspace adjustments to make sure you don't accidentally set an exclusive write hazard when you don't really want that implicit sync.
Nope, that isn't true.
We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine.
See amdgpu_sync.c for the implementation. What we do there is take a look at all the fences associated with a reservation object and only sync to those that are from another process.
Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space and on the next submission you can give this identifier as dependency which needs to be satisfied before the command submission can start running.
This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allowing compatibility with protocols like DRI2/DRI3.
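Illustratively, the decision amdgpu makes while walking a reservation object boils down to something like this (a paraphrase, not the real amdgpu_sync.c code; the helper and its parameters are made up):

/* Sketch: a fence found on a reservation object only becomes a dependency
 * of the new command submission if it is "foreign" - from another driver
 * or from another process/owner. Fences from the same owner are allowed to
 * run concurrently with the new submission. */
static bool example_cs_must_wait_for_fence(bool same_device,
					   void *fence_owner, void *cs_owner)
{
	if (!same_device)
		return true;			/* other driver: always sync */

	return fence_owner != cs_owner;		/* other process: sync */
}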
Regards, Christian.
-Daniel
On Wed, Sep 21, 2016 at 1:19 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 12:30 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
Yeah, I think as well that this is the primary question here.
IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose.
I absolutely don't mind switching to them to writing/reading semantics, but amdgpu really needs multiple writers at the same time.
So in this case the writing side of a reservation object needs to be a collection of fences as well.
You can't have multiple writers with implicit syncing. That confusion is exactly why we called them shared/exclusive. Multiple writers generally means that you do some form of fencing in userspace (unsync'ed gl buffer access is the common one). What you do for private buffers doesn't matter, but when you render into a shared/winsys buffer you really need to set the exclusive fence (and there can only ever be one). So probably needs some userspace adjustments to make sure you don't accidentally set an exclusive write hazard when you don't really want that implicit sync.
Nope, that isn't true.
We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine.
See amdgpu_sync.c for the implementation. What we do there is taking a look at all the fences associated with a reservation object and only sync to those who are from another process.
Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space and on the next submission you can give this identifier as dependency which needs to be satisfied before the command submission can start running.
This is called explicit fencing. Implemented with a driver-private primitive (and not sync_file fds like on android), but still conceptually explicit fencing. Implicit fencing really only can handle one writer, at least as currently implemented by struct reservation_object.
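For reference, the struct in question (include/linux/reservation.h in this kernel, roughly) makes that limitation visible: there is exactly one exclusive slot next to a list of shared fences. The comments here are added for illustration.

struct reservation_object {
	struct ww_mutex lock;
	seqcount_t seq;

	struct fence __rcu *fence_excl;			/* at most one "writer" */
	struct reservation_object_list __rcu *fence;	/* list of "readers" */
	struct reservation_object_list *staged;
};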
This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allow compatibility with protocols like DRI2/DRI3.
Instead of the current solution you need to stop attaching exclusive fences to non-shared buffers (which are coordinated using the driver-private explicit fencing you're describing), and only attach exclusive fences to shared buffers (DRI2/3, PRIME, whatever). Since you're doing explicit syncing for internal stuff anyway you can still opt to ignore the exclusive fences if you want to (driven by a flag or something similar). -Daniel
On 21/09/16 09:56 PM, Daniel Vetter wrote:
On Wed, Sep 21, 2016 at 1:19 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 12:30 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
Yeah, I think as well that this is the primary question here.
IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose.
I absolutely don't mind switching to them to writing/reading semantics, but amdgpu really needs multiple writers at the same time.
So in this case the writing side of a reservation object needs to be a collection of fences as well.
You can't have multiple writers with implicit syncing. That confusion is exactly why we called them shared/exclusive. Multiple writers generally means that you do some form of fencing in userspace (unsync'ed gl buffer access is the common one). What you do for private buffers doesn't matter, but when you render into a shared/winsys buffer you really need to set the exclusive fence (and there can only ever be one). So probably needs some userspace adjustments to make sure you don't accidentally set an exclusive write hazard when you don't really want that implicit sync.
Nope, that isn't true.
We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine.
See amdgpu_sync.c for the implementation. What we do there is taking a look at all the fences associated with a reservation object and only sync to those who are from another process.
Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space and on the next submission you can give this identifier as dependency which needs to be satisfied before the command submission can start running.
This is called explicit fencing. Implemented with a driver-private primitive (and not sync_file fds like on android), but still conceptually explicit fencing. Implicit fencing really only can handle one writer, at least as currently implemented by struct reservation_object.
This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allow compatibility with protocols like DRI2/DRI3.
Instead of the current solution you need to stop attaching exclusive fences to non-shared buffers (which are coordinated using the driver-private explicit fencing you're describing),
Err, the current issue is actually that amdgpu never sets an exclusive fence, only ever shared ones. :)
and only attach exclusive fences to shared buffers (DRI2/3, PRIME, whatever).
Still, it occurred to me in the meantime that amdgpu setting the exclusive fence for buffers shared via PRIME (no matter if it's a write or read operation) might be a solution. Christian, what do you think?
Am 21.09.2016 um 17:07 schrieb Michel Dänzer:
On 21/09/16 09:56 PM, Daniel Vetter wrote:
On Wed, Sep 21, 2016 at 1:19 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 12:30 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
Yeah, I think as well that this is the primary question here.
IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose.
I absolutely don't mind switching to them to writing/reading semantics, but amdgpu really needs multiple writers at the same time.
So in this case the writing side of a reservation object needs to be a collection of fences as well.
You can't have multiple writers with implicit syncing. That confusion is exactly why we called them shared/exclusive. Multiple writers generally means that you do some form of fencing in userspace (unsync'ed gl buffer access is the common one). What you do for private buffers doesn't matter, but when you render into a shared/winsys buffer you really need to set the exclusive fence (and there can only ever be one). So probably needs some userspace adjustments to make sure you don't accidentally set an exclusive write hazard when you don't really want that implicit sync.
Nope, that isn't true.
We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine.
See amdgpu_sync.c for the implementation. What we do there is taking a look at all the fences associated with a reservation object and only sync to those who are from another process.
Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space and on the next submission you can give this identifier as dependency which needs to be satisfied before the command submission can start running.
This is called explicit fencing. Implemented with a driver-private primitive (and not sync_file fds like on android), but still conceptually explicit fencing. Implicit fencing really only can handle one writer, at least as currently implemented by struct reservation_object.
This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allow compatibility with protocols like DRI2/DRI3.
Instead of the current solution you need to stop attaching exclusive fences to non-shared buffers (which are coordinated using the driver-private explicit fencing you're describing),
Err, the current issue is actually that amdgpu never sets an exclusive fence, only ever shared ones. :)
Actually amdgpu does set the exclusive fence for buffer migrations, because that is an operation user space doesn't know about, and so it needs "exclusive" access to the buffer.
and only attach exclusive fences to shared buffers (DRI2/3, PRIME, whatever).
Still, it occurred to me in the meantime that amdgpu setting the exclusive fence for buffers shared via PRIME (no matter if it's a write or read operation) might be a solution. Christian, what do you think?
The problem is that we don't have one fence, but many.
E.g. there can be many operations on a buffer at the same time and we need to wait for all of them to complete before it can be displayed.
Regards, Christian.
On 22/09/16 12:15 AM, Christian König wrote:
Am 21.09.2016 um 17:07 schrieb Michel Dänzer:
On 21/09/16 09:56 PM, Daniel Vetter wrote:
On Wed, Sep 21, 2016 at 1:19 PM, Christian König deathsimple@vodafone.de wrote:
We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine.
See amdgpu_sync.c for the implementation. What we do there is taking a look at all the fences associated with a reservation object and only sync to those who are from another process.
Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space and on the next submission you can give this identifier as dependency which needs to be satisfied before the command submission can start running.
This is called explicit fencing. Implemented with a driver-private primitive (and not sync_file fds like on android), but still conceptually explicit fencing. Implicit fencing really only can handle one writer, at least as currently implemented by struct reservation_object.
This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allow compatibility with protocols like DRI2/DRI3.
Instead of the current solution you need to stop attaching exclusive fences to non-shared buffers (which are coordinated using the driver-private explicit fencing you're describing),
Err, the current issue is actually that amdgpu never sets an exclusive fence, only ever shared ones. :)
Actually amdgpu does set the exclusive fence for buffer migrations, cause that is an operation user space doesn't know about and so it needs to be "exclusive" access to the buffer.
and only attach exclusive fences to shared buffers (DRI2/3, PRIME, whatever).
Still, it occurred to me in the meantime that amdgpu setting the exclusive fence for buffers shared via PRIME (no matter if it's a write or read operation) might be a solution. Christian, what do you think?
The problem is that we don't have one fence, but many.
E.g. there can be many operation on a buffer at the same time and we need to wait for all of them to complete before it can be displayed.
Maybe in theory, but with the problem we have in practice right now, the amdgpu GPU should only ever access the shared BO with the same engine.
Anyway, this should be solvable by setting some kind of meta-fence as the exclusive fence, which can internally be mapped to multiple fences, maybe up to one for each ring which can access the BO?
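One existing building block for such a meta-fence would be struct fence_array (sketch only, assuming the fence-array API of this kernel; example_make_meta_fence is made up, and the fences array has to be allocated for and handed over to the fence_array):

#include <linux/fence.h>
#include <linux/fence-array.h>

/* Sketch: fold several fences into one fence that signals once all of them
 * have signaled, so a single fence could stand in for multiple writers. */
static struct fence *example_make_meta_fence(struct fence **fences,
					     unsigned int count)
{
	struct fence_array *array;

	array = fence_array_create(count, fences, fence_context_alloc(1),
				   1, false /* signal when all signaled */);
	if (!array)
		return NULL;

	return &array->base;
}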
Am 21.09.2016 um 17:29 schrieb Michel Dänzer:
On 22/09/16 12:15 AM, Christian König wrote:
Am 21.09.2016 um 17:07 schrieb Michel Dänzer:
On 21/09/16 09:56 PM, Daniel Vetter wrote:
On Wed, Sep 21, 2016 at 1:19 PM, Christian König deathsimple@vodafone.de wrote:
We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine.
See amdgpu_sync.c for the implementation. What we do there is taking a look at all the fences associated with a reservation object and only sync to those who are from another process.
Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space and on the next submission you can give this identifier as dependency which needs to be satisfied before the command submission can start running.
This is called explicit fencing. Implemented with a driver-private primitive (and not sync_file fds like on android), but still conceptually explicit fencing. Implicit fencing really only can handle one writer, at least as currently implemented by struct reservation_object.
This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allow compatibility with protocols like DRI2/DRI3.
Instead of the current solution you need to stop attaching exclusive fences to non-shared buffers (which are coordinated using the driver-private explicit fencing you're describing),
Err, the current issue is actually that amdgpu never sets an exclusive fence, only ever shared ones. :)
Actually amdgpu does set the exclusive fence for buffer migrations, cause that is an operation user space doesn't know about and so it needs to be "exclusive" access to the buffer.
and only attach exclusive fences to shared buffers (DRI2/3, PRIME, whatever).
Still, it occurred to me in the meantime that amdgpu setting the exclusive fence for buffers shared via PRIME (no matter if it's a write or read operation) might be a solution. Christian, what do you think?
The problem is that we don't have one fence, but many.
E.g. there can be many operation on a buffer at the same time and we need to wait for all of them to complete before it can be displayed.
Maybe in theory, but with the problem we have in practice right now, the amdgpu GPU should only ever access the shared BO with the same engine.
That clearly won't work. What both Mesa and the pro stack do with the BO before it is displayed makes it mandatory to execute things in parallel (at least for the non-shared case).
Anyway, this should be solvable by setting some kind of meta-fence as the exclusive fence, which can internally be mapped to multiple fences, maybe up to one for each ring which can access the BO?
I've thought about that as well, but because of the scheduler this approach would also only work if we keep a collection of fences and not just an array.
For a quick workaround I suggest just serializing all accesses to BOs shared with different drivers, but essentially I think it is a perfectly valid requirement to have multiple writers to one BO.
Christian.
On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
For a quick workaround I suggest to just serialize all accesses to BO shared with different drivers, but essentially I think it is a perfectly valid requirement to have multiple writers to one BO.
It is, but it's not possible with implicit sync. If you want parallel write access to the same shared buffer, you _must_ carry around some explicit fences. Within amdgpu you can use driver-specific cookies, for shared buffers we now have sync_file. But multiple writers with implicit sync simply fundamentally don't work, because you have no idea which writer is touching the same subrange you want to touch. -Daniel
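For completeness, handing userspace an explicit fence for a shared buffer looks roughly like this with the sync_file API (sketch; the ioctl plumbing around it is omitted and example_fence_to_fd is made up):

#include <linux/fcntl.h>
#include <linux/fence.h>
#include <linux/file.h>
#include <linux/sync_file.h>

/* Wrap a fence in a sync_file and return an fd that userspace can pass to
 * other processes/drivers as an explicit fence. */
static int example_fence_to_fd(struct fence *fence)
{
	struct sync_file *sync_file;
	int fd;

	fd = get_unused_fd_flags(O_CLOEXEC);
	if (fd < 0)
		return fd;

	sync_file = sync_file_create(fence);
	if (!sync_file) {
		put_unused_fd(fd);
		return -ENOMEM;
	}

	fd_install(fd, sync_file->file);
	return fd;
}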
Am 22.09.2016 um 08:36 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
For a quick workaround I suggest to just serialize all accesses to BO shared with different drivers, but essentially I think it is a perfectly valid requirement to have multiple writers to one BO.
It is, but it's not possible with implicit sync. If you want parallel write access to the same shared buffer, you _must_ carry around some explicit fences. Within amdgpu you can use driver-specific cookies, for shared buffer we now have sync_file. But multiple writers with implicit sync simply fundamentally doesn't work. Because you have no idea with which writer, touching the same subrange you want to touch.
You don't need to separate the BO into subranges which are touched by different engines to allow multiple writers.
AMD hardware (and I'm pretty sure others as well) is perfectly capable of writing to the same memory from multiple engines and even multiple GPUs at the same time.
For a good hint of what is possible see the public AMD ISA documentation about atomic operations, but that is only the start of it.
The crux here is that we need to assume that we will have implicit and explicit sync mixed for backward compatibility.
This implies that we need some mechanism like amdgpu uses in its sync implementation, where every fence is associated with an owner which denotes the domain in which implicit sync happens. If you leave this domain you will automatically run into explicit sync.
Currently we define the borders of this domain in amdgpu on process boundary to keep things like DRI2/DRI3 working as expected.
I really don't see how you want to solve this with a single explicit fence for each reservation object. As long as you have multiple concurrently running operations accessing the same buffer you need to keep one fence for each operation no matter what.
Regards, Christian.
-Daniel
On Thu, Sep 22, 2016 at 12:55 PM, Christian König deathsimple@vodafone.de wrote:
Am 22.09.2016 um 08:36 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
For a quick workaround I suggest to just serialize all accesses to BO shared with different drivers, but essentially I think it is a perfectly valid requirement to have multiple writers to one BO.
It is, but it's not possible with implicit sync. If you want parallel write access to the same shared buffer, you _must_ carry around some explicit fences. Within amdgpu you can use driver-specific cookies, for shared buffer we now have sync_file. But multiple writers with implicit sync simply fundamentally doesn't work. Because you have no idea with which writer, touching the same subrange you want to touch.
You don't need to separate the BO into subranges which are touched by different engines for allowing multiple writers.
AMD hardware and I'm pretty sure others as well are perfectly capable of writing to the same memory from multiple engines and even multiple GPUs at the same time.
For a good hint of what is possible see the public AMD ISA documentation about atomic operations, but that is only the start of it.
The crux here is that we need to assume that we will have implicit and explicit sync mixed for backward compatibility.
This implies that we need some mechanism like amdgpu uses in it's sync implementation where every fence is associated with an owner which denotes the domain in which implicit sync happens. If you leave this domain you will automatically run into explicit sync.
Currently we define the borders of this domain in amdgpu on process boundary to keep things like DRI2/DRI3 working as expected.
I really don't see how you want to solve this with a single explicit fence for each reservation object. As long as you have multiple concurrently running operations accessing the same buffer you need to keep one fence for each operation no matter what.
I can't make sense of what you're saying, and I suspect we put different meaning to different words. So let me define here:
- implicit fencing: Userspace does not track read/writes to buffers, but only the kernel does that. This is the assumption DRI2/3 has. Since synchronization is by necessity on a per-buffer level you can only have 1 writer. In the kernel the cross-driver interface for this is struct reservation_object attached to dma-bufs. If you don't fill out/wait for the exclusive fence in there, your driver is _not_ doing (cross-device) implicit fencing.
- explicit fencing: Userspace passes around distinct fence objects for any work going on on the gpu. The kernel doesn't insert any stall of its own (except for moving buffer objects around ofc). This is what Android does. This also seems to be what amdgpu is doing within one process/owner.
Given that, I'm not sure what you mean with "one explicit fence per reservation_object", since explicit fencing should not attach anything (at least not any exclusive fences) to a reservation_object. It sounds a bit like you have the meaning the other way round (as in explicit fencing = the kernel explicitly takes care of fencing, whereas it's explicit as in explicit fences visible to userspace). -Daniel
Am 22.09.2016 um 14:26 schrieb Daniel Vetter:
On Thu, Sep 22, 2016 at 12:55 PM, Christian König deathsimple@vodafone.de wrote:
Am 22.09.2016 um 08:36 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
For a quick workaround I suggest to just serialize all accesses to BO shared with different drivers, but essentially I think it is a perfectly valid requirement to have multiple writers to one BO.
It is, but it's not possible with implicit sync. If you want parallel write access to the same shared buffer, you _must_ carry around some explicit fences. Within amdgpu you can use driver-specific cookies, for shared buffer we now have sync_file. But multiple writers with implicit sync simply fundamentally doesn't work. Because you have no idea with which writer, touching the same subrange you want to touch.
You don't need to separate the BO into subranges which are touched by different engines for allowing multiple writers.
AMD hardware and I'm pretty sure others as well are perfectly capable of writing to the same memory from multiple engines and even multiple GPUs at the same time.
For a good hint of what is possible see the public AMD ISA documentation about atomic operations, but that is only the start of it.
The crux here is that we need to assume that we will have implicit and explicit sync mixed for backward compatibility.
This implies that we need some mechanism like amdgpu uses in it's sync implementation where every fence is associated with an owner which denotes the domain in which implicit sync happens. If you leave this domain you will automatically run into explicit sync.
Currently we define the borders of this domain in amdgpu on process boundary to keep things like DRI2/DRI3 working as expected.
I really don't see how you want to solve this with a single explicit fence for each reservation object. As long as you have multiple concurrently running operations accessing the same buffer you need to keep one fence for each operation no matter what.
I can't make sense of what you're saying, and I suspect we put different meaning to different words. So let me define here:
- implicit fencing: Userspace does not track read/writes to buffers,
but only the kernel does that. This is the assumption DRI2/3 has. Since synchronization is by necessity on a per-buffer level you can only have 1 writer. In the kernel the cross-driver interface for this is struct reservation_object attached to dma-bufs. If you don't fill out/wait for the exclusive fence in there, you're driver is _not_ doing (cross-device) implicit fencing.
I can confirm that my understanding of implicit fencing is exactly the same as yours.
- explicit fencing: Userspace passes around distinct fence objects for
any work going on on the gpu. The kernel doesn't insert any stall of it's own (except for moving buffer objects around ofc). This is what Android. This also seems to be what amdgpu is doing within one process/owner.
No, that is clearly not my understanding of explicit fencing.
Userspace doesn't necessarily need to pass around distinct fence objects with all of its protocols, and the kernel is still responsible for inserting stalls whenever a userspace protocol or application requires these semantics.
Otherwise you will never be able to use explicit fencing on the Linux desktop with protocols like DRI2/DRI3.
I would expect that every driver in the system waits for all fences of a reservation object as long as it isn't told otherwise by providing a distinct fence object with the IOCTL in question.
Regards, Christian.
On Thu, Sep 22, 2016 at 2:44 PM, Christian König deathsimple@vodafone.de wrote:
- explicit fencing: Userspace passes around distinct fence objects for
any work going on on the gpu. The kernel doesn't insert any stall of it's own (except for moving buffer objects around ofc). This is what Android. This also seems to be what amdgpu is doing within one process/owner.
No, that is clearly not my understanding of explicit fencing.
Userspace doesn't necessarily need to pass around distinct fence objects with all of it's protocols and the kernel is still responsible for inserting stalls whenever an userspace protocol or application requires this semantics.
Otherwise you will never be able to use explicit fencing on the Linux desktop with protocols like DRI2/DRI3.
This is about mixing them. Explicit fencing still means userspace has an explicit piece, separate from buffers (either a sync_file fd, a driver-specific cookie, or similar).
I would expect that every driver in the system waits for all fences of a reservation object as long as it isn't told otherwise by providing a distinct fence object with the IOCTL in question.
Yup agreed. This way if your explicitly-fencing driver reads a shared buffer passed over a protocol that does implicit fencing (like DRI2/3), then it will work.
The other interop direction is explicitly-fencing driver passes a buffer to a consumer which expects implicit fencing. In that case you must attach the right fence to the exclusive slot, but _only_ in that case. Otherwise you end up stalling your explicitly-fencing userspace, since implicit fencing doesn't allow more than 1 writer. For amdgpu one possible way to implement this might be to count how many users a dma-buf has, and if it's more than just the current context set the exclusive fence. Or do an uabi revision and let userspace decide (or at least overwrite it).
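A purely hypothetical sketch of that suggestion (none of this exists in amdgpu; the helper, its shared_externally parameter, and the omitted locking/slot reservation are all assumptions):

#include <linux/fence.h>
#include <linux/reservation.h>

/* Attach the CS fence as exclusive only if the BO is visible outside the
 * current context (e.g. exported via dma-buf); otherwise keep it shared so
 * driver-internal work can still run concurrently.
 * (reservation_object_reserve_shared() and locking omitted.) */
static void example_fence_bo_after_cs(struct reservation_object *resv,
				      struct fence *fence,
				      bool shared_externally)
{
	if (shared_externally)
		reservation_object_add_excl_fence(resv, fence);
	else
		reservation_object_add_shared_fence(resv, fence);
}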
But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-) -Daniel
Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
On Thu, Sep 22, 2016 at 2:44 PM, Christian König deathsimple@vodafone.de wrote:
- explicit fencing: Userspace passes around distinct fence objects for
any work going on on the gpu. The kernel doesn't insert any stall of it's own (except for moving buffer objects around ofc). This is what Android. This also seems to be what amdgpu is doing within one process/owner.
No, that is clearly not my understanding of explicit fencing.
Userspace doesn't necessarily need to pass around distinct fence objects with all of it's protocols and the kernel is still responsible for inserting stalls whenever an userspace protocol or application requires this semantics.
Otherwise you will never be able to use explicit fencing on the Linux desktop with protocols like DRI2/DRI3.
This is about mixing them. Explicit fencing still means userspace has an explicit piece, separate from buffers, (either sync_file fd, or a driver-specific cookie, or similar).
I would expect that every driver in the system waits for all fences of a reservation object as long as it isn't told otherwise by providing a distinct fence object with the IOCTL in question.
Yup agreed. This way if your explicitly-fencing driver reads a shared buffer passed over a protocol that does implicit fencing (like DRI2/3), then it will work.
The other interop direction is explicitly-fencing driver passes a buffer to a consumer which expects implicit fencing. In that case you must attach the right fence to the exclusive slot, but _only_ in that case.
Ok, well, it sounds like you are close to understanding why I can't do exactly this: there simply is no right fence I could attach.
When amdgpu makes the command submissions it doesn't necessarily know that the buffer will be exported and shared with another device later on.
So when the buffer is exported and given to the other device you might have a whole bunch of fences which run concurrently and not in any serial order.
Otherwise you end up stalling your explicitly-fencing userspace, since implicit fencing doesn't allow more than 1 writer. For amdgpu one possible way to implement this might be to count how many users a dma-buf has, and if it's more than just the current context set the exclusive fence. Or do an uabi revision and let userspace decide (or at least overwrite it).
I mean I can pick one fence and wait for the rest to finish manually, but that would certainly defeat the whole effort, wouldn't it?
I completely agree that you have only 1 writer with implicit fencing, but when you switch from explicit fencing back to implicit fencing you can have multiple ones.
But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-)
Well as far as I can see there is no way I can fix amdgpu in this case.
The handling clearly needs to be changed on the receiving side of the reservation objects if I don't completely want to disable concurrent access to BOs in amdgpu.
Regards, Christian.
-Daniel
On 22/09/16 10:22 PM, Christian König wrote:
Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
On Thu, Sep 22, 2016 at 2:44 PM, Christian König deathsimple@vodafone.de wrote:
- explicit fencing: Userspace passes around distinct fence objects for
any work going on on the gpu. The kernel doesn't insert any stall of its own (except for moving buffer objects around ofc). This is what Android does. This also seems to be what amdgpu is doing within one process/owner.
No, that is clearly not my understanding of explicit fencing.
Userspace doesn't necessarily need to pass around distinct fence objects with all of its protocols, and the kernel is still responsible for inserting stalls whenever a userspace protocol or application requires these semantics.
Otherwise you will never be able to use explicit fencing on the Linux desktop with protocols like DRI2/DRI3.
This is about mixing them. Explicit fencing still means userspace has an explicit piece, separate from buffers, (either sync_file fd, or a driver-specific cookie, or similar).
I would expect that every driver in the system waits for all fences of a reservation object as long as it isn't told otherwise by providing a distinct fence object with the IOCTL in question.
Yup agreed. This way if your explicitly-fencing driver reads a shared buffer passed over a protocol that does implicit fencing (like DRI2/3), then it will work.
The other interop direction is when an explicitly-fencing driver passes a buffer to a consumer which expects implicit fencing. In that case you must attach the right fence to the exclusive slot, but _only_ in that case.
Ok, well, it sounds like you are close to understanding why I can't do exactly this: there simply is no right fence I could attach.
When amdgpu makes the command submissions it doesn't necessarily know that the buffer will be exported and shared with another device later on.
So when the buffer is exported and given to the other device you might have a whole bunch of fences which run concurrently and not in any serial order.
I feel like you're thinking too much of buffers shared between GPUs as being short-lived and only shared late. In the use-cases I know about, shared buffers are created separately and shared ahead of time, the actual rendering work is done to non-shared buffers and then just copied to the shared buffers for transfer between GPUs. These copies are always performed by the same context in such a way that they should always be performed by the same HW engine and thus implicitly serialized.
Do you have any specific use-cases in mind where buffers are only shared between GPUs after the rendering operations creating the buffer contents to be shared have already been submitted?
Otherwise you end up stalling your explicitly-fencing userspace, since implicit fencing doesn't allow more than 1 writer. For amdgpu one possible way to implement this might be to count how many users a dma-buf has, and if it's more than just the current context set the exclusive fence. Or do an uabi revision and let userspace decide (or at least overwrite it).
I mean I can pick one fence and wait for the rest to finish manually, but that would certainly defeat the whole effort, wouldn't it?
I'm afraid it's not clear to me why it would. Can you elaborate?
But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-)
Well as far as I can see there is no way I can fix amdgpu in this case.
The handling clearly needs to be changed on the receiving side of the reservation objects if I don't completely want to disable concurrent access to BOs in amdgpu.
Anyway, we need a solution for this between radeon and amdgpu, and I don't think a solution which involves those drivers using reservation object semantics between them which are different from all other drivers is a good idea.
On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
On 22/09/16 10:22 PM, Christian König wrote:
Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
On Thu, Sep 22, 2016 at 2:44 PM, Christian König deathsimple@vodafone.de wrote:
- explicit fencing: Userspace passes around distinct fence objects for
any work going on on the gpu. The kernel doesn't insert any stall of its own (except for moving buffer objects around ofc). This is what Android does. This also seems to be what amdgpu is doing within one process/owner.
No, that is clearly not my understanding of explicit fencing.
Userspace doesn't necessarily need to pass around distinct fence objects with all of its protocols, and the kernel is still responsible for inserting stalls whenever a userspace protocol or application requires these semantics.
Otherwise you will never be able to use explicit fencing on the Linux desktop with protocols like DRI2/DRI3.
This is about mixing them. Explicit fencing still means userspace has an explicit piece, separate from buffers, (either sync_file fd, or a driver-specific cookie, or similar).
I would expect that every driver in the system waits for all fences of a reservation object as long as it isn't told otherwise by providing a distinct fence object with the IOCTL in question.
Yup agreed. This way if your explicitly-fencing driver reads a shared buffer passed over a protocol that does implicit fencing (like DRI2/3), then it will work.
The other interop direction is when an explicitly-fencing driver passes a buffer to a consumer which expects implicit fencing. In that case you must attach the right fence to the exclusive slot, but _only_ in that case.
Ok, well, it sounds like you are close to understanding why I can't do exactly this: there simply is no right fence I could attach.
When amdgpu makes the command submissions it doesn't necessarily know that the buffer will be exported and shared with another device later on.
So when the buffer is exported and given to the other device you might have a whole bunch of fences which run concurrently and not in any serial order.
I feel like you're thinking too much of buffers shared between GPUs as being short-lived and only shared late. In the use-cases I know about, shared buffers are created separately and shared ahead of time, the actual rendering work is done to non-shared buffers and then just copied to the shared buffers for transfer between GPUs. These copies are always performed by the same context in such a way that they should always be performed by the same HW engine and thus implicitly serialized.
Do you have any specific use-cases in mind where buffers are only shared between GPUs after the rendering operations creating the buffer contents to be shared have already been submitted?
Yeah, it should be known which buffer you use (at least in userspace, maybe not in the kernel) for which you need implicit fencing. At least with DRI2/3 it's really obvious which buffers are shared. Same holds for external images and other imported buffers.
Yes that means you need to keep track of a few things in userspace, and you need to add a special flag to CS to make sure the kernel does set the exclusive fence.
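For illustration, the special CS flag mentioned above might look roughly like the sketch below on the kernel side. The flag name and helper are invented here, they are not existing amdgpu uabi; a shared slot is assumed to have been reserved during buffer validation.

/* Hypothetical uabi flag: userspace marks BOs shared via DRI2/3/PRIME. */
#define HYPOTHETICAL_CS_BO_FLAG_IMPLICIT_SYNC   (1 << 0)

/*
 * Sketch: when the CS fence is attached to a BO's reservation object,
 * honour the userspace hint about how the buffer is shared.
 */
static void cs_attach_fence(struct reservation_object *resv,
                            struct fence *cs_fence, u32 bo_flags)
{
        if (bo_flags & HYPOTHETICAL_CS_BO_FLAG_IMPLICIT_SYNC)
                /* Cross-process/device consumers rely on the exclusive slot. */
                reservation_object_add_excl_fence(resv, cs_fence);
        else
                /* Driver-private BO: keep engines running in parallel. */
                reservation_object_add_shared_fence(resv, cs_fence);
}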
Otherwise you end up stalling your explicitly-fencing userspace, since implicit fencing doesn't allow more than 1 writer. For amdgpu one possible way to implement this might be to count how many users a dma-buf has, and if it's more than just the current context set the exclusive fence. Or do an uabi revision and let userspace decide (or at least overwrite it).
I mean I can pick one fence and wait for the rest to finish manually, but that would certainly defeat the whole effort, wouldn't it?
I'm afraid it's not clear to me why it would. Can you elaborate?
But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-)
Well as far as I can see there is no way I can fix amdgpu in this case.
The handling clearly needs to be changed on the receiving side of the reservation objects if I don't completely want to disable concurrent access to BOs in amdgpu.
Anyway, we need a solution for this between radeon and amdgpu, and I don't think a solution which involves those drivers using reservation object semantics between them which are different from all other drivers is a good idea.
Afaik there's also amd+intel machines out there, so really the only option is to either fix amdgpu to correctly set exclusive fences on shared buffers (with the help of userspace hints), or change all the existing drivers. No idea what's simpler to do, since I don't know about amdgpu userspace. -Daniel
On 23/09/16 09:09 PM, Daniel Vetter wrote:
On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
On 22/09/16 10:22 PM, Christian König wrote:
Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-)
Well as far as I can see there is no way I can fix amdgpu in this case.
The handling clearly needs to be changed on the receiving side of the reservation objects if I don't completely want to disable concurrent access to BOs in amdgpu.
Anyway, we need a solution for this between radeon and amdgpu, and I don't think a solution which involves those drivers using reservation object semantics between them which are different from all other drivers is a good idea.
Afaik there's also amd+intel machines out there,
Sure, what I meant was that even if we didn't care about those (which we do), we'd still need a solution between our own drivers.
so really the only option is to either fix amdgpu to correctly set exclusive fences on shared buffers (with the help of userspace hints), or change all the existing drivers.
I got some fresh perspective on the problem while taking a walk, and I'm now fairly convinced that neither amdgpu userspace nor other drivers need to be modified:
It occurred to me that all the information we need is already there in the exclusive and shared fences set by amdgpu. We just need to use it differently to match the expectations of other drivers.
We should be able to set some sort of "pseudo" fence as the exclusive fence of the reservation object. When we are asked by another driver to wait for this fence to signal, we take the current "real" exclusive fence (which we can keep track of e.g. in our BO struct) and shared fences, and wait for all of those to signal; then we mark the "pseudo" exclusive fence as signalled.
Am I missing anything which would prevent this from working?
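For illustration, such a "pseudo" exclusive fence might be built on the struct fence API of this kernel generation roughly as sketched below. All names are invented, the callbacks on the real fences are installed eagerly at creation instead of lazily in enable_signaling, and fence reference counting and freeing of the callback array are glossed over.

#include <linux/atomic.h>
#include <linux/fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct pseudo_excl_fence {
        struct fence base;
        spinlock_t lock;
        atomic_t pending;       /* real fences that haven't signalled yet */
};

struct pseudo_cb {
        struct fence_cb base;
        struct pseudo_excl_fence *pf;
};

static const char *pseudo_driver_name(struct fence *f)   { return "pseudo"; }
static const char *pseudo_timeline_name(struct fence *f) { return "prime-excl"; }

static bool pseudo_enable_signaling(struct fence *f)
{
        return true;    /* callbacks on the real fences are already installed */
}

static const struct fence_ops pseudo_fence_ops = {
        .get_driver_name   = pseudo_driver_name,
        .get_timeline_name = pseudo_timeline_name,
        .enable_signaling  = pseudo_enable_signaling,
        .wait              = fence_default_wait,
};

static void pseudo_real_signalled(struct fence *f, struct fence_cb *cb)
{
        struct pseudo_cb *pcb = container_of(cb, struct pseudo_cb, base);

        if (atomic_dec_and_test(&pcb->pf->pending))
                fence_signal(&pcb->pf->base);
}

/* Build one fence that signals once every currently tracked fence has. */
static struct fence *pseudo_excl_create(struct fence **real, unsigned int count)
{
        struct pseudo_excl_fence *pf = kzalloc(sizeof(*pf), GFP_KERNEL);
        struct pseudo_cb *cbs = kcalloc(count, sizeof(*cbs), GFP_KERNEL);
        unsigned int i;

        if (!pf || !cbs) {
                kfree(pf);
                kfree(cbs);
                return NULL;
        }

        spin_lock_init(&pf->lock);
        atomic_set(&pf->pending, count);
        fence_init(&pf->base, &pseudo_fence_ops, &pf->lock,
                   fence_context_alloc(1), 1);

        for (i = 0; i < count; i++) {
                cbs[i].pf = pf;
                if (fence_add_callback(real[i], &cbs[i].base,
                                       pseudo_real_signalled))
                        /* Real fence already signalled, account for it now. */
                        pseudo_real_signalled(real[i], &cbs[i].base);
        }

        return &pf->base;
}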
On Mon, Sep 26, 2016 at 09:48:37AM +0900, Michel Dänzer wrote:
On 23/09/16 09:09 PM, Daniel Vetter wrote:
On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
On 22/09/16 10:22 PM, Christian König wrote:
Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-)
Well as far as I can see there is no way I can fix amdgpu in this case.
The handling clearly needs to be changed on the receiving side of the reservation objects if I don't completely want to disable concurrent access to BOs in amdgpu.
Anyway, we need a solution for this between radeon and amdgpu, and I don't think a solution which involves those drivers using reservation object semantics between them which are different from all other drivers is a good idea.
Afaik there's also amd+intel machines out there,
Sure, what I meant was that even if we didn't care about those (which we do), we'd still need a solution between our own drivers.
so really the only option is to either fix amdgpu to correctly set exclusive fences on shared buffers (with the help of userspace hints), or change all the existing drivers.
I got some fresh perspective on the problem while taking a walk, and I'm now fairly convinced that neither amdgpu userspace nor other drivers need to be modified:
It occurred to me that all the information we need is already there in the exclusive and shared fences set by amdgpu. We just need to use it differently to match the expectations of other drivers.
We should be able to set some sort of "pseudo" fence as the exclusive fence of the reservation object. When we are asked by another driver to wait for this fence to signal, we take the current "real" exclusive fence (which we can keep track of e.g. in our BO struct) and shared fences, and wait for all of those to signal; then we mark the "pseudo" exclusive fence as signalled.
Am I missing anything which would prevent this from working?
One thing to make sure is that you don't change the (real, privately stored) fences you're waiting on over the lifetime of the public exclusive fence. That might lead to some hilarity wrt potentially creating fence dependency loops. But I think as long as you guarantee that the private internal fences are always amdgpu ones, and never anything imported from a different driver, even that should be fine. Not because this would break the loops, but since amdgpu has a hangcheck it can still guarantee that the fence eventually fires even if there is a loop. -Daniel
Hi
This discussion has gone a little quiet
Was there any agreement about what needed doing to get this working for i965/amdgpu?
Regards
Mike
On Mon, 26 Sep 2016 at 09:04 Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Sep 26, 2016 at 09:48:37AM +0900, Michel Dänzer wrote:
On 23/09/16 09:09 PM, Daniel Vetter wrote:
On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
On 22/09/16 10:22 PM, Christian König wrote:
Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-)
Well as far as I can see there is no way I can fix amdgpu in this case.
The handling clearly needs to be changed on the receiving side of the reservation objects if I don't completely want to disable concurrent access to BOs in amdgpu.
Anyway, we need a solution for this between radeon and amdgpu, and I don't think a solution which involves those drivers using reservation object semantics between them which are different from all other drivers is a good idea.
Afaik there's also amd+intel machines out there,
Sure, what I meant was that even if we didn't care about those (which we do), we'd still need a solution between our own drivers.
so really the only option is to either fix amdgpu to correctly set exclusive fences on shared buffers (with the help of userspace hints), or change all the existing drivers.
I got some fresh perspective on the problem while taking a walk, and I'm now fairly convinced that neither amdgpu userspace nor other drivers need to be modified:
It occurred to me that all the information we need is already there in the exclusive and shared fences set by amdgpu. We just need to use it differently to match the expectations of other drivers.
We should be able to set some sort of "pseudo" fence as the exclusive fence of the reservation object. When we are asked by another driver to wait for this fence to signal, we take the current "real" exclusive fence (which we can keep track of e.g. in our BO struct) and shared fences, and wait for all of those to signal; then we mark the "pseudo" exclusive fence as signalled.
Am I missing anything which would prevent this from working?
One thing to make sure is that you don't change the (real, privately stored) fences you're waiting on over the lifetime of the public exclusive fence. That might lead to some hilarity wrt potentially creating fence dependency loops. But I think as long as you guarantee that the private internal fences are always amdgpu ones, and never anything imported from a different driver, even that should be fine. Not because this would break the loops, but since amdgpu has a hangcheck it can still guarantee that the fence eventually fires even if there is a loop.
-Daniel
On 07/10/16 09:34 PM, Mike Lothian wrote:
This discussion has gone a little quiet
Was there any agreement about what needed doing to get this working for i965/amdgpu?
Christian, do you see anything which could prevent the solution I outlined from working?
Am 11.10.2016 um 05:58 schrieb Michel Dänzer:
On 07/10/16 09:34 PM, Mike Lothian wrote:
This discussion has gone a little quiet
Was there any agreement about what needed doing to get this working for i965/amdgpu?
Christian, do you see anything which could prevent the solution I outlined from working?
I thought about that approach as well, but unfortunately it also has a couple of downsides. Especially keeping the exclusive fence set while we actually don't need it isn't really clean either.
I'm currently a bit busy with other tasks, so I've put Nayan on the road to getting into the kernel driver a bit (he asked for that anyway).
Implementing the simple workaround to sync when we export the DMA-buf should be something like 20 lines of code and fortunately Nayan has an I+A system and so can actually test it.
If it turns out to be more problematic or somebody really starts to need it I'm going to hack on that myself a bit more.
Regards, Christian.
On 11/10/16 09:04 PM, Christian König wrote:
Am 11.10.2016 um 05:58 schrieb Michel Dänzer:
On 07/10/16 09:34 PM, Mike Lothian wrote:
This discussion has gone a little quiet
Was there any agreement about what needed doing to get this working for i965/amdgpu?
Christian, do you see anything which could prevent the solution I outlined from working?
I thought about that approach as well, but unfortunately it also has a couple of downsides. Especially keeping the exclusive fence set while we actually don't need it isn't really clean either.
I was wondering if it's possible to have a singleton pseudo exclusive fence which is used for all BOs. That might keep the overhead acceptably low.
I'm currently a bit busy with other tasks, so I've put Nayan on the road to getting into the kernel driver a bit (he asked for that anyway).
Implementing the simple workaround to sync when we export the DMA-buf should be something like 20 lines of code and fortunately Nayan has an I+A system and so can actually test it.
If it turns out to be more problematic or somebody really starts to need it I'm going to hack on that myself a bit more.
If you mean only syncing when a DMA-buf is exported, that's not enough, as I explained before. The BOs being shared are long-lived, and synchronization between GPUs is required for every command stream submission.
Hi
Just another gentle ping to see where you are with this?
Cheers
Mike
On Wed, 12 Oct 2016 at 01:40 Michel Dänzer michel@daenzer.net wrote:
On 11/10/16 09:04 PM, Christian König wrote:
Am 11.10.2016 um 05:58 schrieb Michel Dänzer:
On 07/10/16 09:34 PM, Mike Lothian wrote:
This discussion has gone a little quiet
Was there any agreement about what needed doing to get this working for i965/amdgpu?
Christian, do you see anything which could prevent the solution I outlined from working?
I thought about that approach as well, but unfortunately it also has a couple of downsides. Especially keeping the exclusive fence set while we actually don't need it isn't really clean either.
I was wondering if it's possible to have a singleton pseudo exclusive fence which is used for all BOs. That might keep the overhead acceptably low.
I'm currently a bit busy with other tasks, so I've put Nayan on the road to getting into the kernel driver a bit (he asked for that anyway).
Implementing the simple workaround to sync when we export the DMA-buf should be something like 20 lines of code and fortunately Nayan has an I+A system and so can actually test it.
If it turns out to be more problematic or somebody really starts to need it I'm going to hack on that myself a bit more.
If you mean only syncing when a DMA-buf is exported, that's not enough, as I explained before. The BOs being shared are long-lived, and synchronization between GPUs is required for every command stream submission.
On 27/10/16 10:33 PM, Mike Lothian wrote:
Just another gentle ping to see where you are with this?
I haven't got a chance to look into this any further.
On 10/28/2016 03:34 AM, Michel Dänzer wrote:
On 27/10/16 10:33 PM, Mike Lothian wrote:
Just another gentle ping to see where you are with this?
I haven't got a chance to look into this any further.
Fwiw., as a proof of concept, the attached experimental patch does work as tested on Intel HD Haswell + AMD R9 380 Tonga under amdgpu and DRI3/Present when applied to drm-next (updated from a few days ago). With DRI_PRIME=1 tearing for page-flipped fullscreen windows is gone under all loads. The tearing with "windowed" windows now looks as expected for regular tearing not related to Prime.
ftrace confirms the i915 driver's pageflip function is waiting on the fence in reservation_object_wait_timeout_rcu() as it should.
That entry->tv.shared needs to be set false for such buffers in amdgpu_bo_list_set() makes sense to me, as that is part of the buffer validation for command stream submission. There are other places in the driver where tv.shared is set, which I haven't checked so far.
I don't know which of these would need to be updated with an "exported bo" check as well, e.g., for video decoding or maybe gpu compute? Adding or removing the check in amdgpu_gem_va_update_vm(), e.g., made no difference. I assume that makes sense because that function seems to deal with amdgpu-internal vm page tables or page table entries for such a bo, not with something visible to external clients?
All I can say is that it fixes 3D rendering under DRI3 + Prime + pageflipping without causing any obvious new problems.
-mario
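For reference, the heart of the experiment described above boils down to roughly the following during CS buffer-list setup. This is paraphrased, not the literal patch, it assumes the amdgpu driver headers, and prime_exported stands for the flag the experimental patch adds to struct amdgpu_bo.

/*
 * Paraphrased sketch: with tv.shared == false, ttm_eu_fence_buffer_objects()
 * will later put the CS fence into the BO's exclusive slot instead of a
 * shared one, which is what importing drivers such as i915 wait on.
 */
static void setup_bo_list_entry(struct amdgpu_bo_list_entry *entry,
                                struct amdgpu_bo *bo)
{
        entry->robj = bo;
        entry->tv.bo = &bo->tbo;
        entry->tv.shared = !bo->prime_exported;  /* experimental flag */
}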
Am 28.10.2016 um 19:37 schrieb Mario Kleiner:
On 10/28/2016 03:34 AM, Michel Dänzer wrote:
On 27/10/16 10:33 PM, Mike Lothian wrote:
Just another gentle ping to see where you are with this?
I haven't got a chance to look into this any further.
Fwiw., as a proof of concept, the attached experimental patch does work as tested on Intel HD Haswell + AMD R9 380 Tonga under amdgpu and DRI3/Present when applied to drm-next (updated from a few days ago). With DRI_PRIME=1 tearing for page-flipped fullscreen windows is gone under all loads. The tearing with "windowed" windows now looks as expected for regular tearing not related to Prime.
Yeah, that's pretty much what I had in mind as well. You additionally need to wait for the shared fences when you export the BO for the first time, but that's only a nitpick.
ftrace confirms the i915 driver's pageflip function is waiting on the fence in reservation_object_wait_timeout_rcu() as it should.
That entry->tv.shared needs to be set false for such buffers in amdgpu_bo_list_set() makes sense to me, as that is part of the buffer validation for command stream submission. There are other places in the driver where tv.shared is set, which I haven't checked so far.
I don't know which of these would need to be updated with an "exported bo" check as well, e.g., for video decoding or maybe gpu compute? Adding or removing the check in amdgpu_gem_va_update_vm(), e.g., made no difference. I assume that makes sense because that function seems to deal with amdgpu-internal vm page tables or page table entries for such a bo, not with something visible to external clients?
Yes, exactly. VM updates don't matter to anybody other than amdgpu. On top of that, we don't even add a fence to the shared slot we reserve (we could probably drop that as an optimization).
Please remove the debugging stuff and the extra code on the VM updates and add a reservation_object_wait_timeout_rcu() to the export path and we should be good to go.
Regards, Christian.
All I can say is that it fixes 3D rendering under DRI3 + Prime + pageflipping without causing any obvious new problems.
-mario
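The export-path wait requested above would amount to something like the following sketch against the 4.9-era API. This is not the actual v2 patch, it assumes the amdgpu driver headers, and prime_exported is again the experimental flag rather than upstream code.

struct dma_buf *amdgpu_gem_prime_export(struct drm_device *dev,
                                        struct drm_gem_object *gobj,
                                        int flags)
{
        struct amdgpu_bo *bo = gem_to_amdgpu_bo(gobj);

        if (amdgpu_ttm_tt_get_usermm(bo->tbo.ttm))
                return ERR_PTR(-EPERM);

        /*
         * Fences attached as shared before the first export are invisible
         * to importers that only look at the exclusive slot, so wait for
         * all of them once here.
         */
        reservation_object_wait_timeout_rcu(bo->tbo.resv, true, false,
                                            MAX_SCHEDULE_TIMEOUT);

        bo->prime_exported = true;      /* experimental flag, see above */

        return drm_gem_prime_export(dev, gobj, flags);
}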
On 10/28/2016 07:48 PM, Christian König wrote:
Am 28.10.2016 um 19:37 schrieb Mario Kleiner:
On 10/28/2016 03:34 AM, Michel Dänzer wrote:
On 27/10/16 10:33 PM, Mike Lothian wrote:
Just another gentle ping to see where you are with this?
I haven't got a chance to look into this any further.
Fwiw., as a proof of concept, the attached experimental patch does work as tested on Intel HD Haswell + AMD R9 380 Tonga under amdgpu and DRI3/Present when applied to drm-next (updated from a few days ago). With DRI_PRIME=1 tearing for page-flipped fullscreen windows is gone under all loads. The tearing with "windowed" windows now looks as expected for regular tearing not related to Prime.
Yeah, that's pretty much what I had in mind as well. You additionally need to wait for the shared fences when you export the BO for the first time, but that's only a nitpick.
ftrace confirms the i915 driver's pageflip function is waiting on the fence in reservation_object_wait_timeout_rcu() as it should.
That entry->tv.shared needs to be set false for such buffers in amdgpu_bo_list_set() makes sense to me, as that is part of the buffer validation for command stream submission. There are other places in the driver where tv.shared is set, which I haven't checked so far.
I don't know which of these would need to be updated with an "exported bo" check as well, e.g., for video decoding or maybe gpu compute? Adding or removing the check in amdgpu_gem_va_update_vm(), e.g., made no difference. I assume that makes sense because that function seems to deal with amdgpu-internal vm page tables or page table entries for such a bo, not with something visible to external clients?
Yes, exactly. VM updates don't matter to anybody other than amdgpu. On top of that, we don't even add a fence to the shared slot we reserve (we could probably drop that as an optimization).
Please remove the debugging stuff and the extra code on the VM updates and add a reservation_object_wait_timeout_rcu() to the export path and we should be good to go.
Done. Patch v2 just sent out after retesting with Tonga only and Intel + Tonga renderoffload. Would be cool if we could get this into 4.9-rc.
Ideally also backported to 4.8, given it is a simple change, as that would be the next official kernel for *buntu 16.04-LTS and derivatives, but that's probably breaking the rules as it doesn't fix a regression?
thanks, -mario
Regards, Christian.
All I can say is that it fixes 3D rendering under DRI3 + Prime + pageflipping without causing any obvious new problems.
-mario
Hi Mario
That fixes the tearing, it's been replaced with a strange stutter, like it's only showing half the number of frames being reported - it's really noticeable in tomb raider
Thanks for your work on this, the stutter is much more manageable than the tearing was
I've attached the patch that applies cleanly to 4.10-wip
On Fri, 28 Oct 2016 at 18:37 Mario Kleiner mario.kleiner.de@gmail.com wrote:
On 10/28/2016 03:34 AM, Michel Dänzer wrote:
On 27/10/16 10:33 PM, Mike Lothian wrote:
Just another gentle ping to see where you are with this?
I haven't got a chance to look into this any further.
Fwiw., as a proof of concept, the attached experimental patch does work as tested on Intel HD Haswell + AMD R9 380 Tonga under amdgpu and DRI3/Present when applied to drm-next (updated from a few days ago). With DRI_PRIME=1 tearing for page-flipped fullscreen windows is gone under all loads. The tearing with "windowed" windows now looks as expected for regular tearing not related to Prime.
ftrace confirms the i915 driver's pageflip function is waiting on the fence in reservation_object_wait_timeout_rcu() as it should.
That entry->tv.shared needs to be set false for such buffers in amdgpu_bo_list_set() makes sense to me, as that is part of the buffer validation for command stream submission. There are other places in the driver where tv.shared is set, which I haven't checked so far.
I don't know which of these would need to be updated with an "exported bo" check as well, e.g., for video decoding or maybe gpu compute? Adding or removing the check in amdgpu_gem_va_update_vm(), e.g., made no difference. I assume that makes sense because that function seems to deal with amdgpu-internal vm page tables or page table entries for such a bo, not with something visible to external clients?
All I can say is that it fixes 3D rendering under DRI3 + Prime + pageflipping without causing any obvious new problems.
-mario
I turned on vsync and everything works great in tomb raider
:D
Thanks again to everyone who made this possible
On Fri, 28 Oct 2016 at 19:37 Mike Lothian mike@fireburn.co.uk wrote:
Hi Mario
That fixes the tearing, it's been replaced with a strange stutter, like it's only showing half the number of frames being reported - it's really noticeable in tomb raider
Thanks for your work on this, the stutter is much more manageable than the tearing was
I've attached the patch that applies cleanly to 4.10-wip
On Fri, 28 Oct 2016 at 18:37 Mario Kleiner mario.kleiner.de@gmail.com wrote:
On 10/28/2016 03:34 AM, Michel Dänzer wrote:
On 27/10/16 10:33 PM, Mike Lothian wrote:
Just another gentle ping to see where you are with this?
I haven't got a chance to look into this any further.
Fwiw., as a proof of concept, the attached experimental patch does work as tested on Intel HD Haswell + AMD R9 380 Tonga under amdgpu and DRI3/Present when applied to drm-next (updated from a few days ago). With DRI_PRIME=1 tearing for page-flipped fullscreen windows is gone under all loads. The tearing with "windowed" windows now looks as expected for regular tearing not related to Prime.
ftrace confirms the i915 driver's pageflip function is waiting on the fence in reservation_object_wait_timeout_rcu() as it should.
That entry->tv.shared needs to be set false for such buffers in amdgpu_bo_list_set() makes sense to me, as that is part of the buffer validation for command stream submission. There are other places in the driver where tv.shared is set, which I haven't checked so far.
I don't know which of these would need to be updated with an "exported bo" check as well, e.g., for video decoding or maybe gpu compute? Adding or removing the check in amdgpu_gem_va_update_vm(), e.g., made no difference. I assume that makes sense because that function seems to deal with amdgpu-internal vm page tables or page table entries for such a bo, not with something visible to external clients?
All I can say is that it fixes 3D rendering under DRI3 + Prime + pageflipping without causing any obvious new problems.
-mario
On 29/10/16 10:58 PM, Mike Lothian wrote:
I turned on vsync and everything works great in tomb raider
:D
Thanks again to everyone who made this possible
On Fri, 28 Oct 2016 at 19:37 Mike Lothian <mike@fireburn.co.uk mailto:mike@fireburn.co.uk> wrote:
Hi Mario
That fixes the tearing, it's been replaced with a strange stutter, like it's only showing half the number of frames being reported - it's really noticeable in tomb raider
I wonder if the stutter might be due to the dGPU writing another frame before the iGPU is done processing the previous one. Christian, does the amdgpu scheduler wait for shared fences of shared BOs to signal before submitting jobs using them to the GPU?
Am 31.10.2016 um 07:44 schrieb Michel Dänzer:
On 29/10/16 10:58 PM, Mike Lothian wrote:
I turned on vsync and everything works great in tomb raider
:D
Thanks again to everyone who made this possible
On Fri, 28 Oct 2016 at 19:37 Mike Lothian <mike@fireburn.co.uk mailto:mike@fireburn.co.uk> wrote:
Hi Mario
That fixes the tearing, it's been replaced with a strange stutter, like it's only showing half the number of frames being reported - it's really noticeable in tomb raider
I wonder if the stutter might be due to the dGPU writing another frame before the iGPU is done processing the previous one. Christian, does the amdgpu scheduler wait for shared fences of shared BOs to signal before submitting jobs using them to the GPU?
Yeah, that should work. We wait for both the exclusive as well as all shared fences before CS or pageflip.
Only on CS do we filter the shared fences so that we don't necessarily wait for submissions from the same process.
Christian.
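Conceptually, the per-fence filtering described above boils down to something like this; a simplified sketch, not the verbatim amdgpu_sync.c code, with the owner values standing in for whatever opaque cookie identifies the submitter.

/*
 * Simplified sketch of amdgpu's CS-time filtering: always honour the
 * exclusive fence, but only wait on shared fences whose owner differs
 * from the current submission, since fences from the same owner are
 * already ordered by the GPU scheduler.
 */
static bool cs_needs_to_wait(bool is_exclusive, void *fence_owner,
                             void *cs_owner)
{
        if (is_exclusive)
                return true;

        /* Shared fence from another process/driver: sync against it. */
        if (fence_owner != cs_owner)
                return true;

        /* Same submitter: the scheduler already serializes these. */
        return false;
}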
On 31/10/16 05:00 PM, Christian König wrote:
Am 31.10.2016 um 07:44 schrieb Michel Dänzer:
On 29/10/16 10:58 PM, Mike Lothian wrote:
I turned on vsync and everything works great in tomb raider
:D
Thanks again to everyone who made this possible
On Fri, 28 Oct 2016 at 19:37 Mike Lothian <mike@fireburn.co.uk mailto:mike@fireburn.co.uk> wrote:
Hi Mario
That fixes the tearing, it's been replaced with a strange stutter, like it's only showing half the number of frames being reported - it's really noticeable in tomb raider
I wonder if the stutter might be due to the dGPU writing another frame before the iGPU is done processing the previous one. Christian, does the amdgpu scheduler wait for shared fences of shared BOs to signal before submitting jobs using them to the GPU?
Yeah, that should work. We wait for both the exclusive as well as all shared fences before CS or pageflip.
Only on CS do we filter the shared fences so that we don't necessarily wait for submissions from the same process.
Note that it can be the same process (Xorg) in the RandR 1.4 multihead case. But it sounds like Mike's stutter can't be due to the issue I was thinking of.
On 29/10/16 02:37 AM, Mario Kleiner wrote:
On 10/28/2016 03:34 AM, Michel Dänzer wrote:
On 27/10/16 10:33 PM, Mike Lothian wrote:
Just another gentle ping to see where you are with this?
I haven't got a chance to look into this any further.
Fwiw., as a proof of concept, the attached experimental patch does work as tested on Intel HD Haswell + AMD R9 380 Tonga under amdgpu and DRI3/Present when applied to drm-next (updated from a few days ago). With DRI_PRIME=1 tearing for page-flipped fullscreen windows is gone under all loads. The tearing with "windowed" windows now looks as expected for regular tearing not related to Prime.
ftrace confirms the i915 driver's pageflip function is waiting on the fence in reservation_object_wait_timeout_rcu() as it should.
That entry->tv.shared needs to be set false for such buffers in amdgpu_bo_list_set() makes sense to me, as that is part of the buffer validation for command stream submission. There are other places in the driver where tv.shared is set, which I haven't checked so far.
I don't know which of these would need to be updated with an "exported bo" check as well, e.g., for video decoding or maybe gpu compute? Adding or removing the check in amdgpu_gem_va_update_vm(), e.g., made no difference. I assume that makes sense because that function seems to deal with amdgpu-internal vm page tables or page table entries for such a bo, not with something visible to external clients?
All I can say is that it fixes 3D rendering under DRI3 + Prime + pageflipping without causing any obvious new problems.
[...]
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c index 7700dc2..bfbfeb9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_prime.c @@ -121,5 +121,8 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_device *dev, if (amdgpu_ttm_tt_get_usermm(bo->tbo.ttm)) return ERR_PTR(-EPERM);
+ bo->prime_exported = true;
+ DRM_DEBUG_PRIME("Exporting prime bo %p\n", bo);
return drm_gem_prime_export(dev, gobj, flags);
}
This will take effect in non-PRIME cases as well, at least DRI3 and GL<->[other API] interop off the top of my head. Is that okay?
On Thu, Sep 22, 2016 at 12:07:24AM +0900, Michel Dänzer wrote:
On 21/09/16 09:56 PM, Daniel Vetter wrote:
On Wed, Sep 21, 2016 at 1:19 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
On Wed, Sep 21, 2016 at 12:30 PM, Christian König deathsimple@vodafone.de wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented?
Yeah, I think as well that this is the primary question here.
IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose.
I absolutely don't mind switching to them to writing/reading semantics, but amdgpu really needs multiple writers at the same time.
So in this case the writing side of a reservation object needs to be a collection of fences as well.
You can't have multiple writers with implicit syncing. That confusion is exactly why we called them shared/exclusive. Multiple writers generally means that you do some form of fencing in userspace (unsync'ed gl buffer access is the common one). What you do for private buffers doesn't matter, but when you render into a shared/winsys buffer you really need to set the exclusive fence (and there can only ever be one). So probably needs some userspace adjustments to make sure you don't accidentally set an exclusive write hazard when you don't really want that implicit sync.
Nope, that isn't true.
We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine.
See amdgpu_sync.c for the implementation. What we do there is take a look at all the fences associated with a reservation object and only sync to those which are from another process.
Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space and on the next submission you can give this identifier as dependency which needs to be satisfied before the command submission can start running.
This is called explicit fencing. Implemented with a driver-private primitive (and not sync_file fds like on android), but still conceptually explicit fencing. Implicit fencing really only can handle one writer, at least as currently implemented by struct reservation_object.
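For readers without the data structure in their head: a reservation_object of this kernel generation has exactly one exclusive slot plus a growable array of shared slots, roughly as follows (condensed from include/linux/reservation.h of that era, with comments added here):

struct reservation_object_list {
        struct rcu_head rcu;
        u32 shared_count, shared_max;
        struct fence __rcu *shared[];           /* many readers */
};

struct reservation_object {
        struct ww_mutex lock;                   /* serializes updates */
        seqcount_t seq;                         /* lets lockless readers retry */

        struct fence __rcu *fence_excl;         /* at most one writer */
        struct reservation_object_list __rcu *fence;    /* the shared fences */
        struct reservation_object_list *staged;
};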
This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allow compatibility with protocols like DRI2/DRI3.
Instead of the current solution you need to stop attaching exclusive fences to non-shared buffers (which are coordinated using the driver-private explicit fencing you're describing),
Err, the current issue is actually that amdgpu never sets an exclusive fence, only ever shared ones. :)
Well, since you sometimes sync and sometimes don't, it is kind of a special case of a semi-exclusive fence (even if attached to the shared slots).
and only attach exclusive fences to shared buffers (DRI2/3, PRIME, whatever).
Still, it occurred to me in the meantime that amdgpu setting the exclusive fence for buffers shared via PRIME (no matter if it's a write or read operation) might be a solution. Christian, what do you think?
Yup, that's what I mean. And it shouldn't cause a problem for shared buffers (at least for protocols where implicit fencing is required), since for those you really can't have multiple concurrent writers. And with the special checks in amdgpu_sync.c that's what's happening in reality; the only difference is that the filtering/selection of what is considered an exclusive fence happens when you sync, and not when you attach them. And that breaks reservation_object assumptions. -Daniel
On 21/09/16 07:30 PM, Christian König wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and using PRIME slave scanout on radeon leaves artifacts.
Yeah, I know. See radeon_display.c radeon_flip_work_func().
We pretty much need the same patch here I've done for amdgpu as well.
Actually, the PRIME slave can't scan out from the shared BOs directly (recall the recent discussion we had about that with Mario), but has to copy from the shared BO to a dedicated scanout BO. These copies need to be synchronized with the primary GPU's copies to the shared BO.
Am 21.09.2016 um 17:13 schrieb Michel Dänzer:
On 21/09/16 07:30 PM, Christian König wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and using PRIME slave scanout on radeon leaves artifacts.
Yeah, I know. See radeon_display.c radeon_flip_work_func().
We pretty much need the same patch here I've done for amdgpu as well.
Actually, the PRIME slave can't scan out from the shared BOs directly (recall the recent discussion we had about that with Mario), but has to copy from the shared BO to a dedicated scanout BO. These copies need to be synchronized with the primary GPU's copies to the shared BO.
Yeah, that thought came to my mind before as well.
Buffer migrations by the kernel caused by a prime export actually set the exclusive fence.
So this shouldn't be an issue in practice when the displaying GPU needs to copy from the BO again anyway.
The only case I can see when this can happen is when the BO is composed directly in system memory by the engines and not migrated there.
Could be that we run into this issue more often in the future, because that is pretty much what we want to have for 4K UVD decode.
Christian.
On 22/09/16 12:21 AM, Christian König wrote:
Am 21.09.2016 um 17:13 schrieb Michel Dänzer:
On 21/09/16 07:30 PM, Christian König wrote:
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and using PRIME slave scanout on radeon leaves artifacts.
Yeah, I know. See radeon_display.c radeon_flip_work_func().
We pretty much need the same patch here I've done for amdgpu as well.
Actually, the PRIME slave can't scan out from the shared BOs directly (recall the recent discussion we had about that with Mario), but has to copy from the shared BO to a dedicated scanout BO. These copies need to be synchronized with the primary GPU's copies to the shared BO.
Yeah, that thought came to my mind before as well.
Buffer migrations by the kernel caused by a prime export actually set the exclusive fence.
So this shouldn't be an issue in practice when the displaying GPU needs to copy from the BO again anyway.
The only case I can see when this can happen is when the BO is composed directly in system memory by the engines and not migrated there.
There is no migration going on in the steady state, just the primary GPU writing to the shared BO and the slave GPU reading from it. If those operations aren't properly synchronized, there is at least intermittent tearing, but there can even be artifacts which stay until that area is updated again.