Since we can't find a consensus on how to move forward with the dma_resv object, I concentrated on changing the approach for amdgpu first.
This new approach changes how the driver stores the command submission fence in the dma_resv object in DMA-buf exported BOs.
For exported BOs we now store the CS fence in a dma_fence_chain container and assign that to the exclusive fence slot.
During synchronization this dma_fence_chain container is unpacked again and the contained fences are handled individually.
This has a bit more overhead than the old approach, but it allows waiting on the exclusive slot for writes again.
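As a rough sketch of the unpacking during synchronization (the function and the per-fence handling below are made up for illustration; only dma_fence_chain_for_each(), to_dma_fence_chain() and chain->fence are existing interfaces):

#include <linux/dma-fence-chain.h>

static int example_handle_fence(struct dma_fence *f)
{
	/* hypothetical per-fence handling, e.g. adding a scheduler dependency */
	return dma_fence_wait(f, true);
}

static int example_sync_to_exclusive(struct dma_fence *excl)
{
	struct dma_fence *iter;
	int r = 0;

	dma_fence_chain_for_each(iter, excl) {
		struct dma_fence_chain *chain = to_dma_fence_chain(iter);
		/* chain nodes wrap the actual CS fence, plain fences are used as-is */
		struct dma_fence *f = chain ? chain->fence : iter;

		r = example_handle_fence(f);
		if (r) {
			dma_fence_put(iter);	/* the iterator holds a reference */
			break;
		}
	}
	return r;
}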
Regards, Christian.
The callback and the irq work are never used at the same time. Putting them into a union saves us 24 bytes and makes the structure only 120 bytes in size.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/dma-buf/dma-fence-chain.c | 2 +- include/linux/dma-fence-chain.h | 8 +++++--- 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c index 7d129e68ac70..1b4cb3e5cec9 100644 --- a/drivers/dma-buf/dma-fence-chain.c +++ b/drivers/dma-buf/dma-fence-chain.c @@ -137,6 +137,7 @@ static void dma_fence_chain_cb(struct dma_fence *f, struct dma_fence_cb *cb) struct dma_fence_chain *chain;
chain = container_of(cb, typeof(*chain), cb); + init_irq_work(&chain->work, dma_fence_chain_irq_work); irq_work_queue(&chain->work); dma_fence_put(f); } @@ -239,7 +240,6 @@ void dma_fence_chain_init(struct dma_fence_chain *chain, rcu_assign_pointer(chain->prev, prev); chain->fence = fence; chain->prev_seqno = 0; - init_irq_work(&chain->work, dma_fence_chain_irq_work);
/* Try to reuse the context of the previous chain node. */ if (prev_chain && __dma_fence_is_later(seqno, prev->seqno, prev->ops)) { diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 10462a029da2..9d6a062be640 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -25,12 +25,14 @@ */ struct dma_fence_chain { struct dma_fence base; - spinlock_t lock; struct dma_fence __rcu *prev; u64 prev_seqno; struct dma_fence *fence; - struct dma_fence_cb cb; - struct irq_work work; + union { + struct dma_fence_cb cb; + struct irq_work work; + }; + spinlock_t lock; };
extern const struct dma_fence_ops dma_fence_chain_ops;
On Thu, Jun 10, 2021 at 11:17:54AM +0200, Christian König wrote:
The callback and the irq work are never used at the same time. Putting them into a union saves us 24 bytes and makes the structure only 120 bytes in size.
Yeah pushing below 128 bytes makes sense.
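For reference, a rough breakdown of where the saving comes from (sizes assume a typical 64-bit build and are an estimate, not taken from the patch):

/*
 * struct dma_fence_cb: struct list_head (16) + dma_fence_func_t (8) = 24 bytes
 * struct irq_work:     struct __call_single_node (16) + func ptr (8) = 24 bytes
 *
 * Two separate members cost 48 bytes, the union costs max(24, 24) = 24,
 * hence the 24 byte saving and the drop from 144 to 120 bytes for
 * struct dma_fence_chain.  Staying below 128 bytes presumably also keeps
 * the allocation in the kmalloc-128 slab bucket instead of kmalloc-192.
 */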
Signed-off-by: Christian König christian.koenig@amd.com
drivers/dma-buf/dma-fence-chain.c | 2 +- include/linux/dma-fence-chain.h | 8 +++++--- 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c index 7d129e68ac70..1b4cb3e5cec9 100644 --- a/drivers/dma-buf/dma-fence-chain.c +++ b/drivers/dma-buf/dma-fence-chain.c @@ -137,6 +137,7 @@ static void dma_fence_chain_cb(struct dma_fence *f, struct dma_fence_cb *cb) struct dma_fence_chain *chain;
chain = container_of(cb, typeof(*chain), cb);
- init_irq_work(&chain->work, dma_fence_chain_irq_work); irq_work_queue(&chain->work); dma_fence_put(f);
} @@ -239,7 +240,6 @@ void dma_fence_chain_init(struct dma_fence_chain *chain, rcu_assign_pointer(chain->prev, prev); chain->fence = fence; chain->prev_seqno = 0;
init_irq_work(&chain->work, dma_fence_chain_irq_work);
/* Try to reuse the context of the previous chain node. */ if (prev_chain && __dma_fence_is_later(seqno, prev->seqno, prev->ops)) {
diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 10462a029da2..9d6a062be640 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -25,12 +25,14 @@ */ struct dma_fence_chain { struct dma_fence base;
- spinlock_t lock; struct dma_fence __rcu *prev; u64 prev_seqno; struct dma_fence *fence;
- struct dma_fence_cb cb;
- struct irq_work work;
Can you pls pull the kerneldoc inline here for these too, and extend the comments to note that @work is only used from the callback, at which point we don't need @cb anymore? For union lifetime tricks we really should document this in the datastructure docs.
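Something along these lines, perhaps (a sketch of the requested inline kerneldoc, not the actual follow-up patch; the wording of the comments is illustrative only):

#include <linux/dma-fence.h>
#include <linux/irq_work.h>
#include <linux/spinlock.h>

struct dma_fence_chain {
	/** @base: fence base class */
	struct dma_fence base;
	/** @prev: previous fence of the chain */
	struct dma_fence __rcu *prev;
	/** @prev_seqno: original previous seqno before garbage collection */
	u64 prev_seqno;
	/** @fence: encapsulated fence */
	struct dma_fence *fence;
	union {
		/**
		 * @cb: callback for signaling
		 *
		 * Used to register the callback for signaling completion of
		 * the chain node.  Never used at the same time as @work.
		 */
		struct dma_fence_cb cb;

		/**
		 * @work: irq work item for signaling
		 *
		 * Only initialized and queued from the @cb callback itself,
		 * at which point @cb is no longer needed.
		 */
		struct irq_work work;
	};
	/** @lock: spinlock for fence handling */
	spinlock_t lock;
};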
With that:
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
I also think it'd be good to specify this clearly in the kerneldoc for dma_fence_add_callback() with something like:
"Note that the registered @cb structure is no longer in use by the signalling code by the time @func is called, and can therefore be used again. This is useful when @func launches a work item because it allows us to put both the struct dma_fence_cb and the work struct (e.g. struct work_struct) into a union to save space."
Feel free to include this in this patch here or do a separate one.
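For illustration, the reuse pattern being described might look like this in a driver (the names are made up; only dma_fence_add_callback() and the guarantee quoted above are real, and a regular work_struct is used instead of the irq_work from the patch):

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct my_job {
	union {
		struct dma_fence_cb cb;		/* in use until the fence signals */
		struct work_struct work;	/* reused once the callback has run */
	};
};

static void my_job_work(struct work_struct *work)
{
	struct my_job *job = container_of(work, struct my_job, work);

	/* process @job outside of the fence signalling path */
	(void)job;
}

static void my_job_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct my_job *job = container_of(cb, struct my_job, cb);

	/*
	 * The signalling code is done with @cb by the time we get here, so
	 * the same memory can be reused as a work item.
	 */
	INIT_WORK(&job->work, my_job_work);
	schedule_work(&job->work);
}

The callback is registered as usual with dma_fence_add_callback(fence, &job->cb, my_job_fence_cb).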
Cheers, Daniel
- union {
struct dma_fence_cb cb;
struct irq_work work;
- };
- spinlock_t lock;
};
extern const struct dma_fence_ops dma_fence_chain_ops;
2.25.1
Add a common allocation helper, cleaning up the mix of kzalloc/kmalloc and some unused code in the selftest.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/dma-buf/st-dma-fence-chain.c | 16 ++++---------- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/drm_syncobj.c | 6 ++--- .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 ++--- drivers/gpu/drm/msm/msm_gem_submit.c | 6 ++--- include/linux/dma-fence-chain.h | 22 +++++++++++++++++++ 6 files changed, 35 insertions(+), 25 deletions(-)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 9525f7f56119..8ce1ea59d31b 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -58,28 +58,20 @@ static struct dma_fence *mock_fence(void) return &f->base; }
-static inline struct mock_chain { - struct dma_fence_chain base; -} *to_mock_chain(struct dma_fence *f) { - return container_of(f, struct mock_chain, base.base); -} - static struct dma_fence *mock_chain(struct dma_fence *prev, struct dma_fence *fence, u64 seqno) { - struct mock_chain *f; + struct dma_fence_chain *f;
- f = kmalloc(sizeof(*f), GFP_KERNEL); + f = dma_fence_chain_alloc(); if (!f) return NULL;
- dma_fence_chain_init(&f->base, - dma_fence_get(prev), - dma_fence_get(fence), + dma_fence_chain_init(f, dma_fence_get(prev), dma_fence_get(fence), seqno);
- return &f->base.base; + return &f->base; }
static int sanitycheck(void *arg) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 90136f9dedd6..325e82621467 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1124,7 +1124,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p
dep->chain = NULL; if (syncobj_deps[i].point) { - dep->chain = kmalloc(sizeof(*dep->chain), GFP_KERNEL); + dep->chain = dma_fence_chain_alloc(); if (!dep->chain) return -ENOMEM; } @@ -1132,7 +1132,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p dep->syncobj = drm_syncobj_find(p->filp, syncobj_deps[i].handle); if (!dep->syncobj) { - kfree(dep->chain); + dma_fence_chain_free(dep->chain); return -EINVAL; } dep->point = syncobj_deps[i].point; diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c index fdd2ec87cdd1..1c5b9ef6da37 100644 --- a/drivers/gpu/drm/drm_syncobj.c +++ b/drivers/gpu/drm/drm_syncobj.c @@ -861,7 +861,7 @@ static int drm_syncobj_transfer_to_timeline(struct drm_file *file_private, &fence); if (ret) goto err; - chain = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL); + chain = dma_fence_chain_alloc(); if (!chain) { ret = -ENOMEM; goto err1; @@ -1402,10 +1402,10 @@ drm_syncobj_timeline_signal_ioctl(struct drm_device *dev, void *data, goto err_points; } for (i = 0; i < args->count_handles; i++) { - chains[i] = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL); + chains[i] = dma_fence_chain_alloc(); if (!chains[i]) { for (j = 0; j < i; j++) - kfree(chains[j]); + dma_fence_chain_free(chains[j]); ret = -ENOMEM; goto err_chains; } diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 66789111a24b..a22cb86730b3 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -2983,7 +2983,7 @@ __free_fence_array(struct eb_fence *fences, unsigned int n) while (n--) { drm_syncobj_put(ptr_mask_bits(fences[n].syncobj, 2)); dma_fence_put(fences[n].dma_fence); - kfree(fences[n].chain_fence); + dma_fence_chain_free(fences[n].chain_fence); } kvfree(fences); } @@ -3097,9 +3097,7 @@ add_timeline_fence_array(struct i915_execbuffer *eb, return -EINVAL; }
- f->chain_fence = - kmalloc(sizeof(*f->chain_fence), - GFP_KERNEL); + f->chain_fence = dma_fence_chain_alloc(); if (!f->chain_fence) { drm_syncobj_put(syncobj); dma_fence_put(fence); diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c index 5480852bdeda..6ff6df6c4791 100644 --- a/drivers/gpu/drm/msm/msm_gem_submit.c +++ b/drivers/gpu/drm/msm/msm_gem_submit.c @@ -586,9 +586,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev, break; }
- post_deps[i].chain = - kmalloc(sizeof(*post_deps[i].chain), - GFP_KERNEL); + post_deps[i].chain = dma_fence_chain_alloc(); if (!post_deps[i].chain) { ret = -ENOMEM; break; @@ -605,7 +603,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev,
if (ret) { for (j = 0; j <= i; ++j) { - kfree(post_deps[j].chain); + dma_fence_chain_free(post_deps[j].chain); if (post_deps[j].syncobj) drm_syncobj_put(post_deps[j].syncobj); } diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 9d6a062be640..5f45689a6d2e 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -12,6 +12,7 @@
#include <linux/dma-fence.h> #include <linux/irq_work.h> +#include <linux/slab.h>
/** * struct dma_fence_chain - fence to represent an node of a fence chain @@ -53,6 +54,27 @@ to_dma_fence_chain(struct dma_fence *fence) return container_of(fence, struct dma_fence_chain, base); }
+/** + * dma_fence_chain_alloc + * + * Returns a new dma_fence_chain object + */ +static inline struct dma_fence_chain *dma_fence_chain_alloc(void) +{ + return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL); +}; + +/** + * dma_fence_chain_free + * @chain: chain node to free + * + * Frees up an allocated but not used dma_fence_chain node. + */ +static inline void dma_fence_chain_free(struct dma_fence_chain *chain) +{ + kfree(chain); +}; + /** * dma_fence_chain_for_each - iterate over all fences in chain * @iter: current fence
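A sketch of the calling convention the helpers are meant to establish (the function and the error condition are made up; only dma_fence_chain_alloc()/_free()/_init() come from the patch):

#include <linux/dma-fence-chain.h>

static int example_add_point(struct dma_fence *prev, struct dma_fence *fence,
			     u64 point)
{
	struct dma_fence_chain *chain;

	chain = dma_fence_chain_alloc();
	if (!chain)
		return -ENOMEM;

	if (!fence) {
		/*
		 * Allocated but never initialized: dma_fence_chain_free()
		 * (a plain kfree) is the right way to drop it.
		 */
		dma_fence_chain_free(chain);
		return -EINVAL;
	}

	/* dma_fence_chain_init() takes over the references passed in. */
	dma_fence_chain_init(chain, dma_fence_get(prev), dma_fence_get(fence),
			     point);

	/*
	 * From here on the node is a live dma_fence; normally it would be
	 * handed to e.g. drm_syncobj_add_point().  Drop it again for the
	 * sketch, which releases the contained references as well.
	 */
	dma_fence_put(&chain->base);
	return 0;
}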
On Thu, Jun 10, 2021 at 11:17:55AM +0200, Christian König wrote:
Add a common allocation helper. Cleaning up the mix of kzalloc/kmalloc and some unused code in the selftest.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/dma-buf/st-dma-fence-chain.c | 16 ++++---------- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/drm_syncobj.c | 6 ++--- .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 ++--- drivers/gpu/drm/msm/msm_gem_submit.c | 6 ++--- include/linux/dma-fence-chain.h | 22 +++++++++++++++++++ 6 files changed, 35 insertions(+), 25 deletions(-)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 9525f7f56119..8ce1ea59d31b 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -58,28 +58,20 @@ static struct dma_fence *mock_fence(void) return &f->base; }
-static inline struct mock_chain {
- struct dma_fence_chain base;
-} *to_mock_chain(struct dma_fence *f) {
- return container_of(f, struct mock_chain, base.base);
-}
static struct dma_fence *mock_chain(struct dma_fence *prev, struct dma_fence *fence, u64 seqno) {
- struct mock_chain *f;
- struct dma_fence_chain *f;
- f = kmalloc(sizeof(*f), GFP_KERNEL);
- f = dma_fence_chain_alloc(); if (!f) return NULL;
- dma_fence_chain_init(&f->base,
dma_fence_get(prev),
dma_fence_get(fence),
- dma_fence_chain_init(f, dma_fence_get(prev), dma_fence_get(fence), seqno);
- return &f->base.base;
- return &f->base;
}
static int sanitycheck(void *arg) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 90136f9dedd6..325e82621467 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1124,7 +1124,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p
dep->chain = NULL; if (syncobj_deps[i].point) {
dep->chain = kmalloc(sizeof(*dep->chain), GFP_KERNEL);
}dep->chain = dma_fence_chain_alloc(); if (!dep->chain) return -ENOMEM;
@@ -1132,7 +1132,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p dep->syncobj = drm_syncobj_find(p->filp, syncobj_deps[i].handle); if (!dep->syncobj) {
kfree(dep->chain);
} dep->point = syncobj_deps[i].point;dma_fence_chain_free(dep->chain); return -EINVAL;
diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c index fdd2ec87cdd1..1c5b9ef6da37 100644 --- a/drivers/gpu/drm/drm_syncobj.c +++ b/drivers/gpu/drm/drm_syncobj.c @@ -861,7 +861,7 @@ static int drm_syncobj_transfer_to_timeline(struct drm_file *file_private, &fence); if (ret) goto err;
- chain = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
- chain = dma_fence_chain_alloc(); if (!chain) { ret = -ENOMEM; goto err1;
@@ -1402,10 +1402,10 @@ drm_syncobj_timeline_signal_ioctl(struct drm_device *dev, void *data, goto err_points; } for (i = 0; i < args->count_handles; i++) {
chains[i] = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
if (!chains[i]) { for (j = 0; j < i; j++)chains[i] = dma_fence_chain_alloc();
kfree(chains[j]);
}dma_fence_chain_free(chains[j]); ret = -ENOMEM; goto err_chains;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 66789111a24b..a22cb86730b3 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -2983,7 +2983,7 @@ __free_fence_array(struct eb_fence *fences, unsigned int n) while (n--) { drm_syncobj_put(ptr_mask_bits(fences[n].syncobj, 2)); dma_fence_put(fences[n].dma_fence);
kfree(fences[n].chain_fence);
} kvfree(fences);dma_fence_chain_free(fences[n].chain_fence);
} @@ -3097,9 +3097,7 @@ add_timeline_fence_array(struct i915_execbuffer *eb, return -EINVAL; }
f->chain_fence =
kmalloc(sizeof(*f->chain_fence),
GFP_KERNEL);
f->chain_fence = dma_fence_chain_alloc(); if (!f->chain_fence) { drm_syncobj_put(syncobj); dma_fence_put(fence);
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c index 5480852bdeda..6ff6df6c4791 100644 --- a/drivers/gpu/drm/msm/msm_gem_submit.c +++ b/drivers/gpu/drm/msm/msm_gem_submit.c @@ -586,9 +586,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev, break; }
post_deps[i].chain =
kmalloc(sizeof(*post_deps[i].chain),
GFP_KERNEL);
post_deps[i].chain = dma_fence_chain_alloc(); if (!post_deps[i].chain) { ret = -ENOMEM; break;
@@ -605,7 +603,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev,
if (ret) { for (j = 0; j <= i; ++j) {
kfree(post_deps[j].chain);
}dma_fence_chain_free(post_deps[j].chain); if (post_deps[j].syncobj) drm_syncobj_put(post_deps[j].syncobj);
diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 9d6a062be640..5f45689a6d2e 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -12,6 +12,7 @@
#include <linux/dma-fence.h> #include <linux/irq_work.h> +#include <linux/slab.h>
/**
- struct dma_fence_chain - fence to represent an node of a fence chain
@@ -53,6 +54,27 @@ to_dma_fence_chain(struct dma_fence *fence) return container_of(fence, struct dma_fence_chain, base); }
+/**
- dma_fence_chain_alloc
- Returns a new dma_fence_chain object
... or NULL on failure.
- */
+static inline struct dma_fence_chain *dma_fence_chain_alloc(void) +{
- return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
+};
+/**
- dma_fence_chain_free
- @chain: chain node to free
- Frees up an allocated but not used dma_fence_chain node.
- */
+static inline void dma_fence_chain_free(struct dma_fence_chain *chain) +{
- kfree(chain);
kfree_rcu, and I guess this means this patch here should be cc: stable.
This is kinda why I'm questioning whether this "dma_fence are protected by rcu" cross driver api is really a good idea. We largely get it wrong in the details in a _ton_ of places.
With the details fixed:
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
/**
- dma_fence_chain_for_each - iterate over all fences in chain
- @iter: current fence
-- 2.25.1
On 11.06.21 at 09:54, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:55AM +0200, Christian König wrote:
Add a common allocation helper. Cleaning up the mix of kzalloc/kmalloc and some unused code in the selftest.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/dma-buf/st-dma-fence-chain.c | 16 ++++---------- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/drm_syncobj.c | 6 ++--- .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 ++--- drivers/gpu/drm/msm/msm_gem_submit.c | 6 ++--- include/linux/dma-fence-chain.h | 22 +++++++++++++++++++ 6 files changed, 35 insertions(+), 25 deletions(-)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 9525f7f56119..8ce1ea59d31b 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -58,28 +58,20 @@ static struct dma_fence *mock_fence(void) return &f->base; }
-static inline struct mock_chain {
- struct dma_fence_chain base;
-} *to_mock_chain(struct dma_fence *f) {
- return container_of(f, struct mock_chain, base.base);
-}
- static struct dma_fence *mock_chain(struct dma_fence *prev, struct dma_fence *fence, u64 seqno) {
- struct mock_chain *f;
- struct dma_fence_chain *f;
- f = kmalloc(sizeof(*f), GFP_KERNEL);
- f = dma_fence_chain_alloc(); if (!f) return NULL;
- dma_fence_chain_init(&f->base,
dma_fence_get(prev),
dma_fence_get(fence),
- dma_fence_chain_init(f, dma_fence_get(prev), dma_fence_get(fence), seqno);
- return &f->base.base;
return &f->base; }
static int sanitycheck(void *arg)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 90136f9dedd6..325e82621467 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1124,7 +1124,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p
dep->chain = NULL; if (syncobj_deps[i].point) {
dep->chain = kmalloc(sizeof(*dep->chain), GFP_KERNEL);
}dep->chain = dma_fence_chain_alloc(); if (!dep->chain) return -ENOMEM;
@@ -1132,7 +1132,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p dep->syncobj = drm_syncobj_find(p->filp, syncobj_deps[i].handle); if (!dep->syncobj) {
kfree(dep->chain);
} dep->point = syncobj_deps[i].point;dma_fence_chain_free(dep->chain); return -EINVAL;
diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c index fdd2ec87cdd1..1c5b9ef6da37 100644 --- a/drivers/gpu/drm/drm_syncobj.c +++ b/drivers/gpu/drm/drm_syncobj.c @@ -861,7 +861,7 @@ static int drm_syncobj_transfer_to_timeline(struct drm_file *file_private, &fence); if (ret) goto err;
- chain = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
- chain = dma_fence_chain_alloc(); if (!chain) { ret = -ENOMEM; goto err1;
@@ -1402,10 +1402,10 @@ drm_syncobj_timeline_signal_ioctl(struct drm_device *dev, void *data, goto err_points; } for (i = 0; i < args->count_handles; i++) {
chains[i] = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
if (!chains[i]) { for (j = 0; j < i; j++)chains[i] = dma_fence_chain_alloc();
kfree(chains[j]);
}dma_fence_chain_free(chains[j]); ret = -ENOMEM; goto err_chains;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 66789111a24b..a22cb86730b3 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -2983,7 +2983,7 @@ __free_fence_array(struct eb_fence *fences, unsigned int n) while (n--) { drm_syncobj_put(ptr_mask_bits(fences[n].syncobj, 2)); dma_fence_put(fences[n].dma_fence);
kfree(fences[n].chain_fence);
} kvfree(fences); }dma_fence_chain_free(fences[n].chain_fence);
@@ -3097,9 +3097,7 @@ add_timeline_fence_array(struct i915_execbuffer *eb, return -EINVAL; }
f->chain_fence =
kmalloc(sizeof(*f->chain_fence),
GFP_KERNEL);
f->chain_fence = dma_fence_chain_alloc(); if (!f->chain_fence) { drm_syncobj_put(syncobj); dma_fence_put(fence);
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c index 5480852bdeda..6ff6df6c4791 100644 --- a/drivers/gpu/drm/msm/msm_gem_submit.c +++ b/drivers/gpu/drm/msm/msm_gem_submit.c @@ -586,9 +586,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev, break; }
post_deps[i].chain =
kmalloc(sizeof(*post_deps[i].chain),
GFP_KERNEL);
post_deps[i].chain = dma_fence_chain_alloc(); if (!post_deps[i].chain) { ret = -ENOMEM; break;
@@ -605,7 +603,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev,
if (ret) { for (j = 0; j <= i; ++j) {
kfree(post_deps[j].chain);
}dma_fence_chain_free(post_deps[j].chain); if (post_deps[j].syncobj) drm_syncobj_put(post_deps[j].syncobj);
diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 9d6a062be640..5f45689a6d2e 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -12,6 +12,7 @@
#include <linux/dma-fence.h> #include <linux/irq_work.h> +#include <linux/slab.h>
/**
- struct dma_fence_chain - fence to represent an node of a fence chain
@@ -53,6 +54,27 @@ to_dma_fence_chain(struct dma_fence *fence) return container_of(fence, struct dma_fence_chain, base); }
+/**
- dma_fence_chain_alloc
- Returns a new dma_fence_chain object
... or NULL on failure.
- */
+static inline struct dma_fence_chain *dma_fence_chain_alloc(void) +{
- return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
+};
+/**
- dma_fence_chain_free
- @chain: chain node to free
- Frees up an allocated but not used dma_fence_chain node.
- */
+static inline void dma_fence_chain_free(struct dma_fence_chain *chain) +{
- kfree(chain);
kfree_rcu, and I guess this means this patch here should be cc: stable.
Nope, kfree() is correct here.
This is to free up fences which have never been initialized, so an RCU grace period isn't necessary because nobody could have a reference yet.
Christian.
This is kinda why I'm questioning whether this "dma_fence are protected by rcu" cross driver api is really a good idea. We largely get it wrong in the details in a _ton_ of places.
With the details fixed:
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
- /**
- dma_fence_chain_for_each - iterate over all fences in chain
- @iter: current fence
-- 2.25.1
On Fri, Jun 11, 2021 at 1:48 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 11.06.21 at 09:54, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:55AM +0200, Christian König wrote:
Add a common allocation helper. Cleaning up the mix of kzalloc/kmalloc and some unused code in the selftest.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/dma-buf/st-dma-fence-chain.c | 16 ++++---------- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/drm_syncobj.c | 6 ++--- .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 6 ++--- drivers/gpu/drm/msm/msm_gem_submit.c | 6 ++--- include/linux/dma-fence-chain.h | 22 +++++++++++++++++++ 6 files changed, 35 insertions(+), 25 deletions(-)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 9525f7f56119..8ce1ea59d31b 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -58,28 +58,20 @@ static struct dma_fence *mock_fence(void) return &f->base; }
-static inline struct mock_chain {
- struct dma_fence_chain base;
-} *to_mock_chain(struct dma_fence *f) {
- return container_of(f, struct mock_chain, base.base);
-}
- static struct dma_fence *mock_chain(struct dma_fence *prev, struct dma_fence *fence, u64 seqno) {
- struct mock_chain *f;
- struct dma_fence_chain *f;
- f = kmalloc(sizeof(*f), GFP_KERNEL);
- f = dma_fence_chain_alloc(); if (!f) return NULL;
- dma_fence_chain_init(&f->base,
dma_fence_get(prev),
dma_fence_get(fence),
- dma_fence_chain_init(f, dma_fence_get(prev), dma_fence_get(fence), seqno);
- return &f->base.base;
return &f->base; }
static int sanitycheck(void *arg)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 90136f9dedd6..325e82621467 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -1124,7 +1124,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p
dep->chain = NULL; if (syncobj_deps[i].point) {
dep->chain = kmalloc(sizeof(*dep->chain), GFP_KERNEL);
dep->chain = dma_fence_chain_alloc(); if (!dep->chain) return -ENOMEM; }
@@ -1132,7 +1132,7 @@ static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p dep->syncobj = drm_syncobj_find(p->filp, syncobj_deps[i].handle); if (!dep->syncobj) {
kfree(dep->chain);
dma_fence_chain_free(dep->chain); return -EINVAL; } dep->point = syncobj_deps[i].point;
diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c index fdd2ec87cdd1..1c5b9ef6da37 100644 --- a/drivers/gpu/drm/drm_syncobj.c +++ b/drivers/gpu/drm/drm_syncobj.c @@ -861,7 +861,7 @@ static int drm_syncobj_transfer_to_timeline(struct drm_file *file_private, &fence); if (ret) goto err;
- chain = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
- chain = dma_fence_chain_alloc(); if (!chain) { ret = -ENOMEM; goto err1;
@@ -1402,10 +1402,10 @@ drm_syncobj_timeline_signal_ioctl(struct drm_device *dev, void *data, goto err_points; } for (i = 0; i < args->count_handles; i++) {
chains[i] = kzalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
chains[i] = dma_fence_chain_alloc(); if (!chains[i]) { for (j = 0; j < i; j++)
kfree(chains[j]);
dma_fence_chain_free(chains[j]); ret = -ENOMEM; goto err_chains; }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 66789111a24b..a22cb86730b3 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -2983,7 +2983,7 @@ __free_fence_array(struct eb_fence *fences, unsigned int n) while (n--) { drm_syncobj_put(ptr_mask_bits(fences[n].syncobj, 2)); dma_fence_put(fences[n].dma_fence);
kfree(fences[n].chain_fence);
} kvfree(fences); }dma_fence_chain_free(fences[n].chain_fence);
@@ -3097,9 +3097,7 @@ add_timeline_fence_array(struct i915_execbuffer *eb, return -EINVAL; }
f->chain_fence =
kmalloc(sizeof(*f->chain_fence),
GFP_KERNEL);
f->chain_fence = dma_fence_chain_alloc(); if (!f->chain_fence) { drm_syncobj_put(syncobj); dma_fence_put(fence);
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c index 5480852bdeda..6ff6df6c4791 100644 --- a/drivers/gpu/drm/msm/msm_gem_submit.c +++ b/drivers/gpu/drm/msm/msm_gem_submit.c @@ -586,9 +586,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev, break; }
post_deps[i].chain =
kmalloc(sizeof(*post_deps[i].chain),
GFP_KERNEL);
post_deps[i].chain = dma_fence_chain_alloc(); if (!post_deps[i].chain) { ret = -ENOMEM; break;
@@ -605,7 +603,7 @@ static struct msm_submit_post_dep *msm_parse_post_deps(struct drm_device *dev,
if (ret) { for (j = 0; j <= i; ++j) {
kfree(post_deps[j].chain);
dma_fence_chain_free(post_deps[j].chain); if (post_deps[j].syncobj) drm_syncobj_put(post_deps[j].syncobj); }
diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 9d6a062be640..5f45689a6d2e 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -12,6 +12,7 @@
#include <linux/dma-fence.h> #include <linux/irq_work.h> +#include <linux/slab.h>
/**
- struct dma_fence_chain - fence to represent an node of a fence chain
@@ -53,6 +54,27 @@ to_dma_fence_chain(struct dma_fence *fence) return container_of(fence, struct dma_fence_chain, base); }
+/**
- dma_fence_chain_alloc
- Returns a new dma_fence_chain object
... or NULL on failure.
- */
+static inline struct dma_fence_chain *dma_fence_chain_alloc(void) +{
- return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
+};
+/**
- dma_fence_chain_free
- @chain: chain node to free
- Frees up an allocated but not used dma_fence_chain node.
- */
+static inline void dma_fence_chain_free(struct dma_fence_chain *chain) +{
- kfree(chain);
kfree_rcu, and I guess this means this patch here should be cc: stable.
Nope, kfree() is correct here.
This is to free up fences which have never been initialized, so an RCU grace period isn't necessary because nobody could have a reference yet.
Hah I got tricked. I see you have the next revision out there already, I'll drop some suggestions on the kerneldoc so it's clearer what this does and why. -Daniel
Christian.
This is kinda why I'm questioning whether this "dma_fence are protected by rcu" cross driver api is really a good idea. We largely get it wrong in the details in a _ton_ of places.
With the details fixed:
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
- /**
- dma_fence_chain_for_each - iterate over all fences in chain
- @iter: current fence
-- 2.25.1
Exercise the newly added functions.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/dma-buf/st-dma-fence-chain.c | 48 ++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 8ce1ea59d31b..855c129c6093 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -95,6 +95,53 @@ static int sanitycheck(void *arg) return err; }
+static int __alloc_free(void *arg) +{ + atomic_t *counter = arg; + int i, j; + + for (i = 0; i < 1024; ++i) { + struct dma_fence_chain *chains[64]; + + for (j = 0; j < ARRAY_SIZE(chains); ++j) + chains[j] = dma_fence_chain_alloc(); + + for (j = 0; j < ARRAY_SIZE(chains); ++j) + dma_fence_chain_free(chains[j]); + + atomic_add(ARRAY_SIZE(chains), counter); + } + return 0; +} + +static int alloc_free(void *arg) +{ + struct task_struct *threads[8]; + atomic_t counter = ATOMIC_INIT(0); + int i, err = 0; + + for (i = 0; i < ARRAY_SIZE(threads); i++) { + threads[i] = kthread_run(__alloc_free, &counter, "dmabuf/%d", + i); + if (IS_ERR(threads[i])) { + err = PTR_ERR(threads[i]); + break; + } + } + + while (i--) { + int ret; + + ret = kthread_stop(threads[i]); + if (ret && !err) + err = ret; + } + + pr_info("Completed %u cycles\n", atomic_read(&counter)); + + return err; +} + struct fence_chains { unsigned int chain_length; struct dma_fence **fences; @@ -677,6 +724,7 @@ int dma_fence_chain(void) { static const struct subtest tests[] = { SUBTEST(sanitycheck), + SUBTEST(alloc_free), SUBTEST(find_seqno), SUBTEST(find_signaled), SUBTEST(find_out_of_order),
On Thu, Jun 10, 2021 at 11:17:56AM +0200, Christian König wrote:
Exercise the newly added functions.
Signed-off-by: Christian König christian.koenig@amd.com
I have honestly no idea what this checks. Spawning a few threads to validate kmalloc/kfree feels a bit silly. Now testing whether we correctly rcu-delay the freeing here would make some sense, but even that feels a bit silly.
I guess if you want this, explain with comments what it does and why? -Daniel
drivers/dma-buf/st-dma-fence-chain.c | 48 ++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 8ce1ea59d31b..855c129c6093 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -95,6 +95,53 @@ static int sanitycheck(void *arg) return err; }
+static int __alloc_free(void *arg) +{
- atomic_t *counter = arg;
- int i, j;
- for (i = 0; i < 1024; ++i) {
struct dma_fence_chain *chains[64];
for (j = 0; j < ARRAY_SIZE(chains); ++j)
chains[j] = dma_fence_chain_alloc();
for (j = 0; j < ARRAY_SIZE(chains); ++j)
dma_fence_chain_free(chains[j]);
atomic_add(ARRAY_SIZE(chains), counter);
- }
- return 0;
+}
+static int alloc_free(void *arg) +{
- struct task_struct *threads[8];
- atomic_t counter = ATOMIC_INIT(0);
- int i, err = 0;
- for (i = 0; i < ARRAY_SIZE(threads); i++) {
threads[i] = kthread_run(__alloc_free, &counter, "dmabuf/%d",
i);
if (IS_ERR(threads[i])) {
err = PTR_ERR(threads[i]);
break;
}
- }
- while (i--) {
int ret;
ret = kthread_stop(threads[i]);
if (ret && !err)
err = ret;
- }
- pr_info("Completed %u cycles\n", atomic_read(&counter));
- return err;
+}
struct fence_chains { unsigned int chain_length; struct dma_fence **fences; @@ -677,6 +724,7 @@ int dma_fence_chain(void) { static const struct subtest tests[] = { SUBTEST(sanitycheck),
SUBTEST(find_seqno), SUBTEST(find_signaled), SUBTEST(find_out_of_order),SUBTEST(alloc_free),
-- 2.25.1
On 11.06.21 at 09:58, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:56AM +0200, Christian König wrote:
Exercise the newly added functions.
Signed-off-by: Christian König christian.koenig@amd.com
I have honestly no idea what this checks. Spawning a few threads to validate kmalloc/kfree feels a bit silly. Now testing whether we correctly rcu-delay the freeing here would make some sense, but even that feels a bit silly.
I guess if you want this, explain with comments what it does and why?
This was solely to figure out whether the garbage collection is working properly and how much overhead it generates.
No actual need to commit it.
Christian.
-Daniel
drivers/dma-buf/st-dma-fence-chain.c | 48 ++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 8ce1ea59d31b..855c129c6093 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -95,6 +95,53 @@ static int sanitycheck(void *arg) return err; }
+static int __alloc_free(void *arg) +{
- atomic_t *counter = arg;
- int i, j;
- for (i = 0; i < 1024; ++i) {
struct dma_fence_chain *chains[64];
for (j = 0; j < ARRAY_SIZE(chains); ++j)
chains[j] = dma_fence_chain_alloc();
for (j = 0; j < ARRAY_SIZE(chains); ++j)
dma_fence_chain_free(chains[j]);
atomic_add(ARRAY_SIZE(chains), counter);
- }
- return 0;
+}
+static int alloc_free(void *arg) +{
- struct task_struct *threads[8];
- atomic_t counter = ATOMIC_INIT(0);
- int i, err = 0;
- for (i = 0; i < ARRAY_SIZE(threads); i++) {
threads[i] = kthread_run(__alloc_free, &counter, "dmabuf/%d",
i);
if (IS_ERR(threads[i])) {
err = PTR_ERR(threads[i]);
break;
}
- }
- while (i--) {
int ret;
ret = kthread_stop(threads[i]);
if (ret && !err)
err = ret;
- }
- pr_info("Completed %u cycles\n", atomic_read(&counter));
- return err;
+}
- struct fence_chains { unsigned int chain_length; struct dma_fence **fences;
@@ -677,6 +724,7 @@ int dma_fence_chain(void) { static const struct subtest tests[] = { SUBTEST(sanitycheck),
SUBTEST(find_seqno), SUBTEST(find_signaled), SUBTEST(find_out_of_order),SUBTEST(alloc_free),
-- 2.25.1
Add some rather sophisticated lockless garbage collection for dma_fence_chain objects.
For this, keep all initialized dma_fence_chain nodes on a queue and trigger garbage collection before a new one is allocated.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/dma-buf/dma-fence-chain.c | 160 +++++++++++++++++++++++++----- include/linux/dma-fence-chain.h | 5 + 2 files changed, 142 insertions(+), 23 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c index 1b4cb3e5cec9..c2f0b69eabb7 100644 --- a/drivers/dma-buf/dma-fence-chain.c +++ b/drivers/dma-buf/dma-fence-chain.c @@ -9,8 +9,53 @@
#include <linux/dma-fence-chain.h>
+static struct dma_fence_chain __rcu *fifo_front; +static struct dma_fence_chain __rcu **fifo_back = &fifo_front; + static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
+/** + * dma_fence_chain_enqueue - enqeue a chain node for garbage collection + * @chain: the chain node to enqueue + * + * Add the chain node to the end of the gc fifo. + */ +static void dma_fence_chain_enqueue(struct dma_fence_chain *chain) +{ + struct dma_fence_chain __rcu **tmp; + + RCU_INIT_POINTER(chain->next, NULL); + tmp = xchg((struct dma_fence_chain __force ***)&fifo_back, + &chain->next); + + /* This is intentionally unordered since we only need the fifo for gc */ + rcu_assign_pointer(*tmp, chain); +} + +/** + * dma_fence_chain_dequeue - deqeueue a chain node for garbage collection + * + * Remove the first chain node from the gc fifo while making sure that always + * keep at least one node on the fifo for lockless fifo implementation. + * Returns the dequeued chain node or NULL. + */ +static struct dma_fence_chain *dma_fence_chain_dequeue(void) +{ + struct dma_fence_chain *chain, *tmp; + + rcu_read_lock(); + chain = rcu_dereference(fifo_front); + /* Never dequeue the last chain node for lockless fifo */ + if (unlikely(!chain || !rcu_access_pointer(chain->next))) { + rcu_read_unlock(); + return NULL; + } + tmp = cmpxchg((struct dma_fence_chain __force **)&fifo_front, + chain, rcu_access_pointer(chain->next)); + rcu_read_unlock(); + return tmp == chain ? chain : NULL; +} + /** * dma_fence_chain_get_prev - use RCU to get a reference to the previous fence * @chain: chain node to get the previous node from @@ -28,6 +73,43 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) return prev; }
+/** + * dma_fence_chain_try_replace - try to replace the prev node + * @chain: Chain node we try to replace prev. + * @prev: the old prev node + * + * Try to replace the previous chain node when it or its containing fence is + * signaled. Returns true if we tried, false if we need to wait. + */ +static bool dma_fence_chain_try_replace(struct dma_fence_chain *chain, + struct dma_fence *prev) +{ + struct dma_fence *replacement, *tmp; + struct dma_fence_chain *prev_chain; + + prev_chain = to_dma_fence_chain(prev); + if (prev_chain) { + if (!dma_fence_is_signaled(prev_chain->fence)) + return false; + + replacement = dma_fence_chain_get_prev(prev_chain); + } else { + if (!dma_fence_is_signaled(prev)) + return false; + + replacement = NULL; + } + + tmp = cmpxchg((struct dma_fence __force **)&chain->prev, prev, + replacement); + if (tmp == prev) + dma_fence_put(tmp); + else + dma_fence_put(replacement); + + return true; +} + /** * dma_fence_chain_walk - chain walking function * @fence: current chain node @@ -38,8 +120,8 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) */ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) { - struct dma_fence_chain *chain, *prev_chain; - struct dma_fence *prev, *replacement, *tmp; + struct dma_fence_chain *chain; + struct dma_fence *prev;
chain = to_dma_fence_chain(fence); if (!chain) { @@ -48,26 +130,8 @@ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) }
while ((prev = dma_fence_chain_get_prev(chain))) { - - prev_chain = to_dma_fence_chain(prev); - if (prev_chain) { - if (!dma_fence_is_signaled(prev_chain->fence)) - break; - - replacement = dma_fence_chain_get_prev(prev_chain); - } else { - if (!dma_fence_is_signaled(prev)) - break; - - replacement = NULL; - } - - tmp = cmpxchg((struct dma_fence __force **)&chain->prev, - prev, replacement); - if (tmp == prev) - dma_fence_put(tmp); - else - dma_fence_put(replacement); + if (!dma_fence_chain_try_replace(chain, prev)) + break; dma_fence_put(prev); }
@@ -205,7 +269,21 @@ static void dma_fence_chain_release(struct dma_fence *fence) dma_fence_put(prev);
dma_fence_put(chain->fence); - dma_fence_free(fence); + WRITE_ONCE(chain->fence, NULL); + + /* + * Don't garbage collect here to avoid recursion and potential stack + * overflow. + */ + chain = dma_fence_chain_dequeue(); + if (!chain) + return; + + if (kref_read(&chain->base.refcount) || + READ_ONCE(chain->fence)) + dma_fence_chain_enqueue(chain); + else + dma_fence_free(&chain->base); }
const struct dma_fence_ops dma_fence_chain_ops = { @@ -218,6 +296,40 @@ const struct dma_fence_ops dma_fence_chain_ops = { }; EXPORT_SYMBOL(dma_fence_chain_ops);
+/** + * dma_fence_chain_garbage_collect - cleanup chain nodes + * + * Do some garbage collection and try to release chain nodes. + */ +void dma_fence_chain_garbage_collect(void) +{ + struct dma_fence_chain *chain = dma_fence_chain_dequeue(); + + if (!chain) + return; + + if (!dma_fence_get_rcu(&chain->base)) { + /* Unused, check if it's also clean */ + if (likely(!READ_ONCE(chain->fence))) { + dma_fence_free(&chain->base); + return; + } + + } else { + struct dma_fence *prev; + + /* Used, do some chain walk */ + prev = dma_fence_chain_get_prev(chain); + if (prev) { + dma_fence_chain_try_replace(chain, prev); + dma_fence_put(prev); + } + dma_fence_put(&chain->base); + } + dma_fence_chain_enqueue(chain); +} +EXPORT_SYMBOL(dma_fence_chain_garbage_collect); + /** * dma_fence_chain_init - initialize a fence chain * @chain: the chain node to initialize @@ -254,5 +366,7 @@ void dma_fence_chain_init(struct dma_fence_chain *chain,
dma_fence_init(&chain->base, &dma_fence_chain_ops, &chain->lock, context, seqno); + dma_fence_chain_enqueue(chain); } + EXPORT_SYMBOL(dma_fence_chain_init); diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 5f45689a6d2e..b412b5396434 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -19,6 +19,7 @@ * @base: fence base class * @lock: spinlock for fence handling * @prev: previous fence of the chain + * @next: next chain node for garbage collection * @prev_seqno: original previous seqno before garbage collection * @fence: encapsulated fence * @cb: callback structure for signaling @@ -27,6 +28,7 @@ struct dma_fence_chain { struct dma_fence base; struct dma_fence __rcu *prev; + struct dma_fence_chain __rcu *next; u64 prev_seqno; struct dma_fence *fence; union { @@ -38,6 +40,8 @@ struct dma_fence_chain {
extern const struct dma_fence_ops dma_fence_chain_ops;
+void dma_fence_chain_garbage_collect(void); + /** * to_dma_fence_chain - cast a fence to a dma_fence_chain * @fence: fence to cast to a dma_fence_array @@ -61,6 +65,7 @@ to_dma_fence_chain(struct dma_fence *fence) */ static inline struct dma_fence_chain *dma_fence_chain_alloc(void) { + dma_fence_chain_garbage_collect(); return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL); };
On Thu, Jun 10, 2021 at 11:17:57AM +0200, Christian König wrote:
Add some rather sophisticated lockless garbage collection for dma_fence_chain objects.
For this, keep all initialized dma_fence_chain nodes on a queue and trigger garbage collection before a new one is allocated.
Signed-off-by: Christian König christian.koenig@amd.com
Uh hand-rolled lockless list, I'm not a fan.
But the real question here is why? This is a global list, so it's going to look great on your desktop, but gpus are for servers now and those are NUMA. So just from that pov doing garbage-collection individually still feels like a much better idea.
So what's the problem you're trying to solve here? -Daniel
drivers/dma-buf/dma-fence-chain.c | 160 +++++++++++++++++++++++++----- include/linux/dma-fence-chain.h | 5 + 2 files changed, 142 insertions(+), 23 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c index 1b4cb3e5cec9..c2f0b69eabb7 100644 --- a/drivers/dma-buf/dma-fence-chain.c +++ b/drivers/dma-buf/dma-fence-chain.c @@ -9,8 +9,53 @@
#include <linux/dma-fence-chain.h>
+static struct dma_fence_chain __rcu *fifo_front; +static struct dma_fence_chain __rcu **fifo_back = &fifo_front;
static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
+/**
- dma_fence_chain_enqueue - enqeue a chain node for garbage collection
- @chain: the chain node to enqueue
- Add the chain node to the end of the gc fifo.
- */
+static void dma_fence_chain_enqueue(struct dma_fence_chain *chain) +{
- struct dma_fence_chain __rcu **tmp;
- RCU_INIT_POINTER(chain->next, NULL);
- tmp = xchg((struct dma_fence_chain __force ***)&fifo_back,
&chain->next);
- /* This is intentionally unordered since we only need the fifo for gc */
- rcu_assign_pointer(*tmp, chain);
+}
+/**
- dma_fence_chain_dequeue - deqeueue a chain node for garbage collection
- Remove the first chain node from the gc fifo while making sure that always
- keep at least one node on the fifo for lockless fifo implementation.
- Returns the dequeued chain node or NULL.
- */
+static struct dma_fence_chain *dma_fence_chain_dequeue(void) +{
- struct dma_fence_chain *chain, *tmp;
- rcu_read_lock();
- chain = rcu_dereference(fifo_front);
- /* Never dequeue the last chain node for lockless fifo */
- if (unlikely(!chain || !rcu_access_pointer(chain->next))) {
rcu_read_unlock();
return NULL;
- }
- tmp = cmpxchg((struct dma_fence_chain __force **)&fifo_front,
chain, rcu_access_pointer(chain->next));
- rcu_read_unlock();
- return tmp == chain ? chain : NULL;
+}
/**
- dma_fence_chain_get_prev - use RCU to get a reference to the previous fence
- @chain: chain node to get the previous node from
@@ -28,6 +73,43 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) return prev; }
+/**
- dma_fence_chain_try_replace - try to replace the prev node
- @chain: Chain node we try to replace prev.
- @prev: the old prev node
- Try to replace the previous chain node when it or its containing fence is
- signaled. Returns true if we tried, false if we need to wait.
- */
+static bool dma_fence_chain_try_replace(struct dma_fence_chain *chain,
struct dma_fence *prev)
+{
- struct dma_fence *replacement, *tmp;
- struct dma_fence_chain *prev_chain;
- prev_chain = to_dma_fence_chain(prev);
- if (prev_chain) {
if (!dma_fence_is_signaled(prev_chain->fence))
return false;
replacement = dma_fence_chain_get_prev(prev_chain);
- } else {
if (!dma_fence_is_signaled(prev))
return false;
replacement = NULL;
- }
- tmp = cmpxchg((struct dma_fence __force **)&chain->prev, prev,
replacement);
- if (tmp == prev)
dma_fence_put(tmp);
- else
dma_fence_put(replacement);
- return true;
+}
/**
- dma_fence_chain_walk - chain walking function
- @fence: current chain node
@@ -38,8 +120,8 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) */ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) {
- struct dma_fence_chain *chain, *prev_chain;
- struct dma_fence *prev, *replacement, *tmp;
struct dma_fence_chain *chain;
struct dma_fence *prev;
chain = to_dma_fence_chain(fence); if (!chain) {
@@ -48,26 +130,8 @@ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) }
while ((prev = dma_fence_chain_get_prev(chain))) {
prev_chain = to_dma_fence_chain(prev);
if (prev_chain) {
if (!dma_fence_is_signaled(prev_chain->fence))
break;
replacement = dma_fence_chain_get_prev(prev_chain);
} else {
if (!dma_fence_is_signaled(prev))
break;
replacement = NULL;
}
tmp = cmpxchg((struct dma_fence __force **)&chain->prev,
prev, replacement);
if (tmp == prev)
dma_fence_put(tmp);
else
dma_fence_put(replacement);
if (!dma_fence_chain_try_replace(chain, prev))
dma_fence_put(prev); }break;
@@ -205,7 +269,21 @@ static void dma_fence_chain_release(struct dma_fence *fence) dma_fence_put(prev);
dma_fence_put(chain->fence);
- dma_fence_free(fence);
- WRITE_ONCE(chain->fence, NULL);
- /*
* Don't garbage collect here to avoid recursion and potential stack
* overflow.
*/
- chain = dma_fence_chain_dequeue();
- if (!chain)
return;
- if (kref_read(&chain->base.refcount) ||
READ_ONCE(chain->fence))
dma_fence_chain_enqueue(chain);
- else
dma_fence_free(&chain->base);
}
const struct dma_fence_ops dma_fence_chain_ops = { @@ -218,6 +296,40 @@ const struct dma_fence_ops dma_fence_chain_ops = { }; EXPORT_SYMBOL(dma_fence_chain_ops);
+/**
- dma_fence_chain_garbage_collect - cleanup chain nodes
- Do some garbage collection and try to release chain nodes.
- */
+void dma_fence_chain_garbage_collect(void) +{
- struct dma_fence_chain *chain = dma_fence_chain_dequeue();
- if (!chain)
return;
- if (!dma_fence_get_rcu(&chain->base)) {
/* Unused, check if it's also clean */
if (likely(!READ_ONCE(chain->fence))) {
dma_fence_free(&chain->base);
return;
}
- } else {
struct dma_fence *prev;
/* Used, do some chain walk */
prev = dma_fence_chain_get_prev(chain);
if (prev) {
dma_fence_chain_try_replace(chain, prev);
dma_fence_put(prev);
}
dma_fence_put(&chain->base);
- }
- dma_fence_chain_enqueue(chain);
+} +EXPORT_SYMBOL(dma_fence_chain_garbage_collect);
/**
- dma_fence_chain_init - initialize a fence chain
- @chain: the chain node to initialize
@@ -254,5 +366,7 @@ void dma_fence_chain_init(struct dma_fence_chain *chain,
dma_fence_init(&chain->base, &dma_fence_chain_ops, &chain->lock, context, seqno);
- dma_fence_chain_enqueue(chain);
}
EXPORT_SYMBOL(dma_fence_chain_init); diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 5f45689a6d2e..b412b5396434 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -19,6 +19,7 @@
- @base: fence base class
- @lock: spinlock for fence handling
- @prev: previous fence of the chain
- @next: next chain node for garbage collection
- @prev_seqno: original previous seqno before garbage collection
- @fence: encapsulated fence
- @cb: callback structure for signaling
@@ -27,6 +28,7 @@ struct dma_fence_chain { struct dma_fence base; struct dma_fence __rcu *prev;
- struct dma_fence_chain __rcu *next; u64 prev_seqno; struct dma_fence *fence; union {
@@ -38,6 +40,8 @@ struct dma_fence_chain {
extern const struct dma_fence_ops dma_fence_chain_ops;
+void dma_fence_chain_garbage_collect(void);
/**
- to_dma_fence_chain - cast a fence to a dma_fence_chain
- @fence: fence to cast to a dma_fence_array
@@ -61,6 +65,7 @@ to_dma_fence_chain(struct dma_fence *fence) */ static inline struct dma_fence_chain *dma_fence_chain_alloc(void) {
- dma_fence_chain_garbage_collect(); return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
};
-- 2.25.1
On 11.06.21 at 10:58, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:57AM +0200, Christian König wrote:
Add some rather sophisticated lockless garbage collection for dma_fence_chain objects.
For this, keep all initialized dma_fence_chain nodes on a queue and trigger garbage collection before a new one is allocated.
Signed-off-by: Christian König christian.koenig@amd.com
Uh hand-rolled lockless list, I'm not a fan.
But the real question here is why? This is a global list, so it's going to look great on your desktop, but gpus are for servers now and those are NUMA. So just from that pov doing garbage-collection individually still feels like a much better idea.
Yeah, I was pondering on that quite a bit as well. That's why I added the multi-threaded alloc/free test.
So what's the problem you're trying to solve here?
I was not sure if the chain is garbage collected enough when used for the dma_resv exclusive object.
But I think we could just drop this for now and just see how it goes.
Christian.
-Daniel
drivers/dma-buf/dma-fence-chain.c | 160 +++++++++++++++++++++++++----- include/linux/dma-fence-chain.h | 5 + 2 files changed, 142 insertions(+), 23 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c index 1b4cb3e5cec9..c2f0b69eabb7 100644 --- a/drivers/dma-buf/dma-fence-chain.c +++ b/drivers/dma-buf/dma-fence-chain.c @@ -9,8 +9,53 @@
#include <linux/dma-fence-chain.h>
+static struct dma_fence_chain __rcu *fifo_front; +static struct dma_fence_chain __rcu **fifo_back = &fifo_front;
- static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
+/**
- dma_fence_chain_enqueue - enqeue a chain node for garbage collection
- @chain: the chain node to enqueue
- Add the chain node to the end of the gc fifo.
- */
+static void dma_fence_chain_enqueue(struct dma_fence_chain *chain) +{
- struct dma_fence_chain __rcu **tmp;
- RCU_INIT_POINTER(chain->next, NULL);
- tmp = xchg((struct dma_fence_chain __force ***)&fifo_back,
&chain->next);
- /* This is intentionally unordered since we only need the fifo for gc */
- rcu_assign_pointer(*tmp, chain);
+}
+/**
- dma_fence_chain_dequeue - deqeueue a chain node for garbage collection
- Remove the first chain node from the gc fifo while making sure that always
- keep at least one node on the fifo for lockless fifo implementation.
- Returns the dequeued chain node or NULL.
- */
+static struct dma_fence_chain *dma_fence_chain_dequeue(void) +{
- struct dma_fence_chain *chain, *tmp;
- rcu_read_lock();
- chain = rcu_dereference(fifo_front);
- /* Never dequeue the last chain node for lockless fifo */
- if (unlikely(!chain || !rcu_access_pointer(chain->next))) {
rcu_read_unlock();
return NULL;
- }
- tmp = cmpxchg((struct dma_fence_chain __force **)&fifo_front,
chain, rcu_access_pointer(chain->next));
- rcu_read_unlock();
- return tmp == chain ? chain : NULL;
+}
- /**
- dma_fence_chain_get_prev - use RCU to get a reference to the previous fence
- @chain: chain node to get the previous node from
@@ -28,6 +73,43 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) return prev; }
+/**
- dma_fence_chain_try_replace - try to replace the prev node
- @chain: Chain node we try to replace prev.
- @prev: the old prev node
- Try to replace the previous chain node when it or its containing fence is
- signaled. Returns true if we tried, false if we need to wait.
- */
+static bool dma_fence_chain_try_replace(struct dma_fence_chain *chain,
struct dma_fence *prev)
+{
- struct dma_fence *replacement, *tmp;
- struct dma_fence_chain *prev_chain;
- prev_chain = to_dma_fence_chain(prev);
- if (prev_chain) {
if (!dma_fence_is_signaled(prev_chain->fence))
return false;
replacement = dma_fence_chain_get_prev(prev_chain);
- } else {
if (!dma_fence_is_signaled(prev))
return false;
replacement = NULL;
- }
- tmp = cmpxchg((struct dma_fence __force **)&chain->prev, prev,
replacement);
- if (tmp == prev)
dma_fence_put(tmp);
- else
dma_fence_put(replacement);
- return true;
+}
- /**
- dma_fence_chain_walk - chain walking function
- @fence: current chain node
@@ -38,8 +120,8 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) */ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) {
- struct dma_fence_chain *chain, *prev_chain;
- struct dma_fence *prev, *replacement, *tmp;
struct dma_fence_chain *chain;
struct dma_fence *prev;
chain = to_dma_fence_chain(fence); if (!chain) {
@@ -48,26 +130,8 @@ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) }
while ((prev = dma_fence_chain_get_prev(chain))) {
prev_chain = to_dma_fence_chain(prev);
if (prev_chain) {
if (!dma_fence_is_signaled(prev_chain->fence))
break;
replacement = dma_fence_chain_get_prev(prev_chain);
} else {
if (!dma_fence_is_signaled(prev))
break;
replacement = NULL;
}
tmp = cmpxchg((struct dma_fence __force **)&chain->prev,
prev, replacement);
if (tmp == prev)
dma_fence_put(tmp);
else
dma_fence_put(replacement);
if (!dma_fence_chain_try_replace(chain, prev))
dma_fence_put(prev); }break;
@@ -205,7 +269,21 @@ static void dma_fence_chain_release(struct dma_fence *fence) dma_fence_put(prev);
dma_fence_put(chain->fence);
- dma_fence_free(fence);
WRITE_ONCE(chain->fence, NULL);
/*
* Don't garbage collect here to avoid recursion and potential stack
* overflow.
*/
chain = dma_fence_chain_dequeue();
if (!chain)
return;
if (kref_read(&chain->base.refcount) ||
READ_ONCE(chain->fence))
dma_fence_chain_enqueue(chain);
else
dma_fence_free(&chain->base);
}
const struct dma_fence_ops dma_fence_chain_ops = {
@@ -218,6 +296,40 @@ const struct dma_fence_ops dma_fence_chain_ops = { }; EXPORT_SYMBOL(dma_fence_chain_ops);
+/**
- dma_fence_chain_garbage_collect - cleanup chain nodes
- Do some garbage collection and try to release chain nodes.
- */
+void dma_fence_chain_garbage_collect(void) +{
- struct dma_fence_chain *chain = dma_fence_chain_dequeue();
- if (!chain)
return;
- if (!dma_fence_get_rcu(&chain->base)) {
/* Unused, check if it's also clean */
if (likely(!READ_ONCE(chain->fence))) {
dma_fence_free(&chain->base);
return;
}
- } else {
struct dma_fence *prev;
/* Used, do some chain walk */
prev = dma_fence_chain_get_prev(chain);
if (prev) {
dma_fence_chain_try_replace(chain, prev);
dma_fence_put(prev);
}
dma_fence_put(&chain->base);
- }
- dma_fence_chain_enqueue(chain);
+} +EXPORT_SYMBOL(dma_fence_chain_garbage_collect);
 /**
  * dma_fence_chain_init - initialize a fence chain
  * @chain: the chain node to initialize
@@ -254,5 +366,7 @@ void dma_fence_chain_init(struct dma_fence_chain *chain,
 	dma_fence_init(&chain->base, &dma_fence_chain_ops, &chain->lock,
 		       context, seqno);
+
+	dma_fence_chain_enqueue(chain);
 }
 EXPORT_SYMBOL(dma_fence_chain_init);
diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h
index 5f45689a6d2e..b412b5396434 100644
--- a/include/linux/dma-fence-chain.h
+++ b/include/linux/dma-fence-chain.h
@@ -19,6 +19,7 @@
  * @base: fence base class
  * @lock: spinlock for fence handling
  * @prev: previous fence of the chain
+ * @next: next chain node for garbage collection
  * @prev_seqno: original previous seqno before garbage collection
  * @fence: encapsulated fence
  * @cb: callback structure for signaling
@@ -27,6 +28,7 @@ struct dma_fence_chain {
 	struct dma_fence base;
 	struct dma_fence __rcu *prev;
+	struct dma_fence_chain __rcu *next;
 	u64 prev_seqno;
 	struct dma_fence *fence;
 	union {
@@ -38,6 +40,8 @@ struct dma_fence_chain {
 
 extern const struct dma_fence_ops dma_fence_chain_ops;
 
+void dma_fence_chain_garbage_collect(void);
+
 /**
  * to_dma_fence_chain - cast a fence to a dma_fence_chain
  * @fence: fence to cast to a dma_fence_array
@@ -61,6 +65,7 @@ to_dma_fence_chain(struct dma_fence *fence)
  */
 static inline struct dma_fence_chain *dma_fence_chain_alloc(void)
 {
+	dma_fence_chain_garbage_collect();
 	return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL);
 };
-- 2.25.1
On Fri, Jun 11, 2021 at 12:07:00PM +0200, Christian König wrote:
On 11.06.21 at 10:58, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:57AM +0200, Christian König wrote:
Add some rather sophisticated lockless garbage collection for dma_fence_chain objects.
For this keep all initialized dma_fence_chain nodes on a queue and trigger garbage collection before a new one is allocated.
Signed-off-by: Christian König christian.koenig@amd.com
Uh hand-rolled lockless list, I'm not a fan.
But the real question here is why? This is a global list, so it's going to look great on your desktop, but gpus are for servers now and those are NUMA. So just from that pov doing garbage-collection individually still feels like a much better idea.
Yeah, I was pondering on that quite a bit as well. That's why I added the multi-threaded alloc/free test.
So what's the problem you're trying to solve here?
I was not sure if the chain is garbage collected enough when used for the dma_resv exclusive object.
But I think we could just drop this for now and just see how it goes.
Yeah I think fixing a perf/tuning problem when we're hitting it is much better than trying to guess what it might be and writing something really clever to anticipate that. We generally get it wrong.
Also, with the explanation of what it tests added, the testcase makes sense. Just some documentation or a comment would have helped. -Daniel
Christian.
-Daniel
 drivers/dma-buf/dma-fence-chain.c | 160 +++++++++++++++++++++++++-----
 include/linux/dma-fence-chain.h   |   5 +
 2 files changed, 142 insertions(+), 23 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
index 1b4cb3e5cec9..c2f0b69eabb7 100644
--- a/drivers/dma-buf/dma-fence-chain.c
+++ b/drivers/dma-buf/dma-fence-chain.c
@@ -9,8 +9,53 @@
 
 #include <linux/dma-fence-chain.h>
 
+static struct dma_fence_chain __rcu *fifo_front;
+static struct dma_fence_chain __rcu **fifo_back = &fifo_front;
+
 static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
 
+/**
+ * dma_fence_chain_enqueue - enqueue a chain node for garbage collection
+ * @chain: the chain node to enqueue
+ *
+ * Add the chain node to the end of the gc fifo.
+ */
+static void dma_fence_chain_enqueue(struct dma_fence_chain *chain)
+{
+	struct dma_fence_chain __rcu **tmp;
+
+	RCU_INIT_POINTER(chain->next, NULL);
+	tmp = xchg((struct dma_fence_chain __force ***)&fifo_back,
+		   &chain->next);
+
+	/* This is intentionally unordered since we only need the fifo for gc */
+	rcu_assign_pointer(*tmp, chain);
+}
+
+/**
+ * dma_fence_chain_dequeue - dequeue a chain node for garbage collection
+ *
+ * Remove the first chain node from the gc fifo while making sure to always
+ * keep at least one node on the fifo for the lockless fifo implementation.
+ *
+ * Returns the dequeued chain node or NULL.
+ */
+static struct dma_fence_chain *dma_fence_chain_dequeue(void)
+{
+	struct dma_fence_chain *chain, *tmp;
+
+	rcu_read_lock();
+	chain = rcu_dereference(fifo_front);
+	/* Never dequeue the last chain node for lockless fifo */
+	if (unlikely(!chain || !rcu_access_pointer(chain->next))) {
+		rcu_read_unlock();
+		return NULL;
+	}
+
+	tmp = cmpxchg((struct dma_fence_chain __force **)&fifo_front,
+		      chain, rcu_access_pointer(chain->next));
+	rcu_read_unlock();
+
+	return tmp == chain ? chain : NULL;
+}
+
- /**
- dma_fence_chain_get_prev - use RCU to get a reference to the previous fence
- @chain: chain node to get the previous node from
@@ -28,6 +73,43 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) return prev; } +/**
- dma_fence_chain_try_replace - try to replace the prev node
- @chain: Chain node we try to replace prev.
- @prev: the old prev node
- Try to replace the previous chain node when it or its containing fence is
- signaled. Returns true if we tried, false if we need to wait.
- */
+static bool dma_fence_chain_try_replace(struct dma_fence_chain *chain,
struct dma_fence *prev)
+{
- struct dma_fence *replacement, *tmp;
- struct dma_fence_chain *prev_chain;
- prev_chain = to_dma_fence_chain(prev);
- if (prev_chain) {
if (!dma_fence_is_signaled(prev_chain->fence))
return false;
replacement = dma_fence_chain_get_prev(prev_chain);
- } else {
if (!dma_fence_is_signaled(prev))
return false;
replacement = NULL;
- }
- tmp = cmpxchg((struct dma_fence __force **)&chain->prev, prev,
replacement);
- if (tmp == prev)
dma_fence_put(tmp);
- else
dma_fence_put(replacement);
- return true;
+}
- /**
- dma_fence_chain_walk - chain walking function
- @fence: current chain node
@@ -38,8 +120,8 @@ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) */ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) {
- struct dma_fence_chain *chain, *prev_chain;
- struct dma_fence *prev, *replacement, *tmp;
- struct dma_fence_chain *chain;
- struct dma_fence *prev; chain = to_dma_fence_chain(fence); if (!chain) {
@@ -48,26 +130,8 @@ struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence) } while ((prev = dma_fence_chain_get_prev(chain))) {
prev_chain = to_dma_fence_chain(prev);
if (prev_chain) {
if (!dma_fence_is_signaled(prev_chain->fence))
break;
replacement = dma_fence_chain_get_prev(prev_chain);
} else {
if (!dma_fence_is_signaled(prev))
break;
replacement = NULL;
}
tmp = cmpxchg((struct dma_fence __force **)&chain->prev,
prev, replacement);
if (tmp == prev)
dma_fence_put(tmp);
else
dma_fence_put(replacement);
if (!dma_fence_chain_try_replace(chain, prev))
dma_fence_put(prev); }break;
@@ -205,7 +269,21 @@ static void dma_fence_chain_release(struct dma_fence *fence) dma_fence_put(prev); dma_fence_put(chain->fence);
- dma_fence_free(fence);
- WRITE_ONCE(chain->fence, NULL);
- /*
* Don't garbage collect here to avoid recursion and potential stack
* overflow.
*/
- chain = dma_fence_chain_dequeue();
- if (!chain)
return;
- if (kref_read(&chain->base.refcount) ||
READ_ONCE(chain->fence))
dma_fence_chain_enqueue(chain);
- else
} const struct dma_fence_ops dma_fence_chain_ops = {dma_fence_free(&chain->base);
@@ -218,6 +296,40 @@ const struct dma_fence_ops dma_fence_chain_ops = { }; EXPORT_SYMBOL(dma_fence_chain_ops); +/**
- dma_fence_chain_garbage_collect - cleanup chain nodes
- Do some garbage collection and try to release chain nodes.
- */
+void dma_fence_chain_garbage_collect(void) +{
- struct dma_fence_chain *chain = dma_fence_chain_dequeue();
- if (!chain)
return;
- if (!dma_fence_get_rcu(&chain->base)) {
/* Unused, check if it's also clean */
if (likely(!READ_ONCE(chain->fence))) {
dma_fence_free(&chain->base);
return;
}
- } else {
struct dma_fence *prev;
/* Used, do some chain walk */
prev = dma_fence_chain_get_prev(chain);
if (prev) {
dma_fence_chain_try_replace(chain, prev);
dma_fence_put(prev);
}
dma_fence_put(&chain->base);
- }
- dma_fence_chain_enqueue(chain);
+} +EXPORT_SYMBOL(dma_fence_chain_garbage_collect);
- /**
- dma_fence_chain_init - initialize a fence chain
- @chain: the chain node to initialize
@@ -254,5 +366,7 @@ void dma_fence_chain_init(struct dma_fence_chain *chain, dma_fence_init(&chain->base, &dma_fence_chain_ops, &chain->lock, context, seqno);
- dma_fence_chain_enqueue(chain); }
- EXPORT_SYMBOL(dma_fence_chain_init);
diff --git a/include/linux/dma-fence-chain.h b/include/linux/dma-fence-chain.h index 5f45689a6d2e..b412b5396434 100644 --- a/include/linux/dma-fence-chain.h +++ b/include/linux/dma-fence-chain.h @@ -19,6 +19,7 @@
- @base: fence base class
- @lock: spinlock for fence handling
- @prev: previous fence of the chain
- @next: next chain node for garbage collection
- @prev_seqno: original previous seqno before garbage collection
- @fence: encapsulated fence
- @cb: callback structure for signaling
@@ -27,6 +28,7 @@ struct dma_fence_chain { struct dma_fence base; struct dma_fence __rcu *prev;
- struct dma_fence_chain __rcu *next; u64 prev_seqno; struct dma_fence *fence; union {
@@ -38,6 +40,8 @@ struct dma_fence_chain { extern const struct dma_fence_ops dma_fence_chain_ops; +void dma_fence_chain_garbage_collect(void);
- /**
- to_dma_fence_chain - cast a fence to a dma_fence_chain
- @fence: fence to cast to a dma_fence_array
@@ -61,6 +65,7 @@ to_dma_fence_chain(struct dma_fence *fence) */ static inline struct dma_fence_chain *dma_fence_chain_alloc(void) {
- dma_fence_chain_garbage_collect(); return kmalloc(sizeof(struct dma_fence_chain), GFP_KERNEL); };
-- 2.25.1
Hi "Christian,
I love your patch! Perhaps something to improve:
[auto build test WARNING on drm-intel/for-linux-next] [also build test WARNING on drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master v5.13-rc6 next-20210616] [cannot apply to drm-tip/drm-tip drm/drm-next] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Christian-K-nig/dma-buf-some-dma_fe... base: git://anongit.freedesktop.org/drm-intel for-linux-next config: alpha-randconfig-s031-20210615 (attached as .config) compiler: alpha-linux-gcc (GCC) 9.3.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # apt-get install sparse # sparse version: v0.6.3-341-g8af24329-dirty # https://github.com/0day-ci/linux/commit/058b99a2929f84c7f2cb1516dcc8674bf3f2... git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Christian-K-nig/dma-buf-some-dma_fence_chain-improvements/20210616-182806 git checkout 058b99a2929f84c7f2cb1516dcc8674bf3f2a661 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=alpha
If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot lkp@intel.com
sparse warnings: (new ones prefixed by >>)
drivers/dma-buf/dma-fence-chain.c:28:15: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct dma_fence_chain **_x_ @@ got struct dma_fence_chain [noderef] __rcu ** @@
drivers/dma-buf/dma-fence-chain.c:28:15: sparse: expected struct dma_fence_chain **_x_ drivers/dma-buf/dma-fence-chain.c:28:15: sparse: got struct dma_fence_chain [noderef] __rcu **
drivers/dma-buf/dma-fence-chain.c:28:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct dma_fence_chain [noderef] __rcu **tmp @@ got struct dma_fence_chain **[assigned] __ret @@
drivers/dma-buf/dma-fence-chain.c:28:13: sparse: expected struct dma_fence_chain [noderef] __rcu **tmp drivers/dma-buf/dma-fence-chain.c:28:13: sparse: got struct dma_fence_chain **[assigned] __ret
vim +28 drivers/dma-buf/dma-fence-chain.c
16 17 /** 18 * dma_fence_chain_enqueue - enqeue a chain node for garbage collection 19 * @chain: the chain node to enqueue 20 * 21 * Add the chain node to the end of the gc fifo. 22 */ 23 static void dma_fence_chain_enqueue(struct dma_fence_chain *chain) 24 { 25 struct dma_fence_chain __rcu **tmp; 26 27 RCU_INIT_POINTER(chain->next, NULL);
28 tmp = xchg((struct dma_fence_chain __force ***)&fifo_back,
29 &chain->next); 30 31 /* This is intentionally unordered since we only need the fifo for gc */ 32 rcu_assign_pointer(*tmp, chain); 33 } 34
--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Hi "Christian,
I love your patch! Perhaps something to improve:
[auto build test WARNING on drm-intel/for-linux-next] [also build test WARNING on drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master v5.13-rc6 next-20210616] [cannot apply to drm-tip/drm-tip drm/drm-next] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Christian-K-nig/dma-buf-some-dma_fe... base: git://anongit.freedesktop.org/drm-intel for-linux-next config: i386-randconfig-s001-20210615 (attached as .config) compiler: gcc-9 (Debian 9.3.0-22) 9.3.0 reproduce: # apt-get install sparse # sparse version: v0.6.3-341-g8af24329-dirty # https://github.com/0day-ci/linux/commit/058b99a2929f84c7f2cb1516dcc8674bf3f2... git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Christian-K-nig/dma-buf-some-dma_fence_chain-improvements/20210616-182806 git checkout 058b99a2929f84c7f2cb1516dcc8674bf3f2a661 # save the attached .config to linux build tree make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot lkp@intel.com
sparse warnings: (new ones prefixed by >>)
drivers/dma-buf/dma-fence-chain.c:28:15: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct dma_fence_chain **__ret @@ got struct dma_fence_chain [noderef] __rcu ** @@
drivers/dma-buf/dma-fence-chain.c:28:15: sparse: expected struct dma_fence_chain **__ret drivers/dma-buf/dma-fence-chain.c:28:15: sparse: got struct dma_fence_chain [noderef] __rcu ** drivers/dma-buf/dma-fence-chain.c:28:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct dma_fence_chain [noderef] __rcu **tmp @@ got struct dma_fence_chain **[assigned] __ret @@ drivers/dma-buf/dma-fence-chain.c:28:13: sparse: expected struct dma_fence_chain [noderef] __rcu **tmp drivers/dma-buf/dma-fence-chain.c:28:13: sparse: got struct dma_fence_chain **[assigned] __ret
vim +28 drivers/dma-buf/dma-fence-chain.c
16 17 /** 18 * dma_fence_chain_enqueue - enqeue a chain node for garbage collection 19 * @chain: the chain node to enqueue 20 * 21 * Add the chain node to the end of the gc fifo. 22 */ 23 static void dma_fence_chain_enqueue(struct dma_fence_chain *chain) 24 { 25 struct dma_fence_chain __rcu **tmp; 26 27 RCU_INIT_POINTER(chain->next, NULL);
28 tmp = xchg((struct dma_fence_chain __force ***)&fifo_back,
29 &chain->next); 30 31 /* This is intentionally unordered since we only need the fifo for gc */ 32 rcu_assign_pointer(*tmp, chain); 33 } 34
--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Not needed any more since dma_fence_chain objects take care of this now.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/drm_syncobj.c | 3 --- 1 file changed, 3 deletions(-)
diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c index 1c5b9ef6da37..e6d144775a87 100644 --- a/drivers/gpu/drm/drm_syncobj.c +++ b/drivers/gpu/drm/drm_syncobj.c @@ -304,9 +304,6 @@ void drm_syncobj_add_point(struct drm_syncobj *syncobj, list_for_each_entry_safe(cur, tmp, &syncobj->cb_list, node) syncobj_wait_syncobj_func(syncobj, cur); spin_unlock(&syncobj->lock); - - /* Walk the chain once to trigger garbage collection */ - dma_fence_chain_for_each(fence, prev); dma_fence_put(prev); } EXPORT_SYMBOL(drm_syncobj_add_point);
Unwrap the explicit fence if it is a dma_fence_chain and sync to the first fence not matching the owner rules.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 118 +++++++++++++---------- 1 file changed, 68 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index 1b2ceccaf5b0..862eb3c1c4c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -28,6 +28,8 @@ * Christian König christian.koenig@amd.com */
+#include <linux/dma-fence-chain.h> + #include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h" @@ -186,6 +188,55 @@ int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence) return amdgpu_sync_fence(sync, fence); }
+/* Determine based on the owner and mode if we should sync to a fence or not */ +static bool amdgpu_sync_test_fence(struct amdgpu_device *adev, + enum amdgpu_sync_mode mode, + void *owner, struct dma_fence *f) +{ + void *fence_owner = amdgpu_sync_get_owner(f); + + /* Always sync to moves, no matter what */ + if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) + return true; + + /* We only want to trigger KFD eviction fences on + * evict or move jobs. Skip KFD fences otherwise. + */ + if (fence_owner == AMDGPU_FENCE_OWNER_KFD && + owner != AMDGPU_FENCE_OWNER_UNDEFINED) + return false; + + /* Never sync to VM updates either. */ + if (fence_owner == AMDGPU_FENCE_OWNER_VM && + owner != AMDGPU_FENCE_OWNER_UNDEFINED) + return false; + + /* Ignore fences depending on the sync mode */ + switch (mode) { + case AMDGPU_SYNC_ALWAYS: + return true; + + case AMDGPU_SYNC_NE_OWNER: + if (amdgpu_sync_same_dev(adev, f) && + fence_owner == owner) + return false; + break; + + case AMDGPU_SYNC_EQ_OWNER: + if (amdgpu_sync_same_dev(adev, f) && + fence_owner != owner) + return false; + break; + + case AMDGPU_SYNC_EXPLICIT: + return false; + } + + WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD, + "Adding eviction fence to sync obj"); + return true; +} + /** * amdgpu_sync_resv - sync to a reservation object * @@ -211,67 +262,34 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
/* always sync to the exclusive fence */ f = dma_resv_excl_fence(resv); - r = amdgpu_sync_fence(sync, f); + dma_fence_chain_for_each(f, f) { + struct dma_fence_chain *chain = to_dma_fence_chain(f); + + if (amdgpu_sync_test_fence(adev, mode, owner, chain ? + chain->fence : f)) { + r = amdgpu_sync_fence(sync, f); + dma_fence_put(f); + if (r) + return r; + break; + } + }
flist = dma_resv_shared_list(resv); - if (!flist || r) - return r; + if (!flist) + return 0;
for (i = 0; i < flist->shared_count; ++i) { - void *fence_owner; - f = rcu_dereference_protected(flist->shared[i], dma_resv_held(resv));
- fence_owner = amdgpu_sync_get_owner(f); - - /* Always sync to moves, no matter what */ - if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) { + if (amdgpu_sync_test_fence(adev, mode, owner, f)) { r = amdgpu_sync_fence(sync, f); if (r) - break; - } - - /* We only want to trigger KFD eviction fences on - * evict or move jobs. Skip KFD fences otherwise. - */ - if (fence_owner == AMDGPU_FENCE_OWNER_KFD && - owner != AMDGPU_FENCE_OWNER_UNDEFINED) - continue; - - /* Never sync to VM updates either. */ - if (fence_owner == AMDGPU_FENCE_OWNER_VM && - owner != AMDGPU_FENCE_OWNER_UNDEFINED) - continue; - - /* Ignore fences depending on the sync mode */ - switch (mode) { - case AMDGPU_SYNC_ALWAYS: - break; - - case AMDGPU_SYNC_NE_OWNER: - if (amdgpu_sync_same_dev(adev, f) && - fence_owner == owner) - continue; - break; - - case AMDGPU_SYNC_EQ_OWNER: - if (amdgpu_sync_same_dev(adev, f) && - fence_owner != owner) - continue; - break; - - case AMDGPU_SYNC_EXPLICIT: - continue; + return r; } - - WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD, - "Adding eviction fence to sync obj"); - r = amdgpu_sync_fence(sync, f); - if (r) - break; } - return r; + return 0; }
/**
On Thu, Jun 10, 2021 at 11:17:59AM +0200, Christian König wrote:
Unwrap a the explicit fence if it is a dma_fence_chain and sync to the first fence not matching the owner rules.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 118 +++++++++++++---------- 1 file changed, 68 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index 1b2ceccaf5b0..862eb3c1c4c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -28,6 +28,8 @@
- Christian König christian.koenig@amd.com
*/
+#include <linux/dma-fence-chain.h>
#include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h" @@ -186,6 +188,55 @@ int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence) return amdgpu_sync_fence(sync, fence); }
+/* Determine based on the owner and mode if we should sync to a fence or not */ +static bool amdgpu_sync_test_fence(struct amdgpu_device *adev,
enum amdgpu_sync_mode mode,
void *owner, struct dma_fence *f)
+{
- void *fence_owner = amdgpu_sync_get_owner(f);
- /* Always sync to moves, no matter what */
- if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED)
return true;
- /* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
- if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Never sync to VM updates either. */
- if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Ignore fences depending on the sync mode */
- switch (mode) {
- case AMDGPU_SYNC_ALWAYS:
return true;
- case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
return false;
break;
- case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
return false;
break;
- case AMDGPU_SYNC_EXPLICIT:
return false;
- }
- WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
- return true;
+}
/**
- amdgpu_sync_resv - sync to a reservation object
@@ -211,67 +262,34 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
/* always sync to the exclusive fence */ f = dma_resv_excl_fence(resv);
- r = amdgpu_sync_fence(sync, f);
- dma_fence_chain_for_each(f, f) {
Jason has some helper for deep-walking fence chains/arrays here I think. Might want to look into that, so that we have some consistency in how we pile up multiple exclusive fences.
Anyway pretty much one of the versions I had in mind too, except I didn't type it up.
Acked-by: Daniel Vetter daniel.vetter@ffwll.ch
struct dma_fence_chain *chain = to_dma_fence_chain(f);
if (amdgpu_sync_test_fence(adev, mode, owner, chain ?
chain->fence : f)) {
r = amdgpu_sync_fence(sync, f);
dma_fence_put(f);
if (r)
return r;
break;
}
}
flist = dma_resv_shared_list(resv);
- if (!flist || r)
return r;
if (!flist)
return 0;
for (i = 0; i < flist->shared_count; ++i) {
void *fence_owner;
f = rcu_dereference_protected(flist->shared[i], dma_resv_held(resv));
fence_owner = amdgpu_sync_get_owner(f);
/* Always sync to moves, no matter what */
if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) {
if (amdgpu_sync_test_fence(adev, mode, owner, f)) { r = amdgpu_sync_fence(sync, f); if (r)
break;
}
/* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Never sync to VM updates either. */
if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Ignore fences depending on the sync mode */
switch (mode) {
case AMDGPU_SYNC_ALWAYS:
break;
case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
continue;
break;
case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
continue;
break;
case AMDGPU_SYNC_EXPLICIT:
continue;
}return r;
WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
r = amdgpu_sync_fence(sync, f);
if (r)
}break;
- return r;
- return 0;
}
/**
2.25.1
On 11.06.21 at 11:07, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:59AM +0200, Christian König wrote:
Unwrap a the explicit fence if it is a dma_fence_chain and sync to the first fence not matching the owner rules.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 118 +++++++++++++---------- 1 file changed, 68 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index 1b2ceccaf5b0..862eb3c1c4c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -28,6 +28,8 @@
- Christian König christian.koenig@amd.com
*/
+#include <linux/dma-fence-chain.h>
- #include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h"
@@ -186,6 +188,55 @@ int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence) return amdgpu_sync_fence(sync, fence); }
+/* Determine based on the owner and mode if we should sync to a fence or not */ +static bool amdgpu_sync_test_fence(struct amdgpu_device *adev,
enum amdgpu_sync_mode mode,
void *owner, struct dma_fence *f)
+{
- void *fence_owner = amdgpu_sync_get_owner(f);
- /* Always sync to moves, no matter what */
- if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED)
return true;
- /* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
- if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Never sync to VM updates either. */
- if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Ignore fences depending on the sync mode */
- switch (mode) {
- case AMDGPU_SYNC_ALWAYS:
return true;
- case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
return false;
break;
- case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
return false;
break;
- case AMDGPU_SYNC_EXPLICIT:
return false;
- }
- WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
- return true;
+}
- /**
- amdgpu_sync_resv - sync to a reservation object
@@ -211,67 +262,34 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync,
/* always sync to the exclusive fence */ f = dma_resv_excl_fence(resv);
- r = amdgpu_sync_fence(sync, f);
- dma_fence_chain_for_each(f, f) {
Jason has some helper for deep-walking fence chains/arrays here I think. Might want to look into that, so that we have some consistency in how we pile up multiple exclusive fences.
Well those helpers are not from Jason, but from me :)
But no, for now the deep inspection is not really helpful here since grabbing a reference to a certain chain node is what makes the handling easier and faster here.
Thinking more about it that should also make it possible for the garbage collection to kick in properly.
Anyway pretty much one of the versions I had in mind too, except I didn't type it up.
Acked-by: Daniel Vetter daniel.vetter@ffwll.ch
Thanks, Christian.
struct dma_fence_chain *chain = to_dma_fence_chain(f);
if (amdgpu_sync_test_fence(adev, mode, owner, chain ?
chain->fence : f)) {
r = amdgpu_sync_fence(sync, f);
dma_fence_put(f);
if (r)
return r;
break;
}
}
flist = dma_resv_shared_list(resv);
- if (!flist || r)
return r;
if (!flist)
return 0;
for (i = 0; i < flist->shared_count; ++i) {
void *fence_owner;
f = rcu_dereference_protected(flist->shared[i], dma_resv_held(resv));
fence_owner = amdgpu_sync_get_owner(f);
/* Always sync to moves, no matter what */
if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) {
if (amdgpu_sync_test_fence(adev, mode, owner, f)) { r = amdgpu_sync_fence(sync, f); if (r)
break;
}
/* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Never sync to VM updates either. */
if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Ignore fences depending on the sync mode */
switch (mode) {
case AMDGPU_SYNC_ALWAYS:
break;
case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
continue;
break;
case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
continue;
break;
case AMDGPU_SYNC_EXPLICIT:
continue;
}return r;
WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
r = amdgpu_sync_fence(sync, f);
if (r)
}break;
- return r;
return 0; }
/**
-- 2.25.1
On Fri, Jun 11, 2021 at 12:09:19PM +0200, Christian König wrote:
On 11.06.21 at 11:07, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:59AM +0200, Christian König wrote:
Unwrap a the explicit fence if it is a dma_fence_chain and sync to the first fence not matching the owner rules.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 118 +++++++++++++---------- 1 file changed, 68 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index 1b2ceccaf5b0..862eb3c1c4c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -28,6 +28,8 @@
- Christian König christian.koenig@amd.com
*/ +#include <linux/dma-fence-chain.h>
- #include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h"
@@ -186,6 +188,55 @@ int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence) return amdgpu_sync_fence(sync, fence); } +/* Determine based on the owner and mode if we should sync to a fence or not */ +static bool amdgpu_sync_test_fence(struct amdgpu_device *adev,
enum amdgpu_sync_mode mode,
void *owner, struct dma_fence *f)
+{
- void *fence_owner = amdgpu_sync_get_owner(f);
- /* Always sync to moves, no matter what */
- if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED)
return true;
- /* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
- if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Never sync to VM updates either. */
- if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Ignore fences depending on the sync mode */
- switch (mode) {
- case AMDGPU_SYNC_ALWAYS:
return true;
- case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
return false;
break;
- case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
return false;
break;
- case AMDGPU_SYNC_EXPLICIT:
return false;
- }
- WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
- return true;
+}
- /**
- amdgpu_sync_resv - sync to a reservation object
@@ -211,67 +262,34 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync, /* always sync to the exclusive fence */ f = dma_resv_excl_fence(resv);
- r = amdgpu_sync_fence(sync, f);
- dma_fence_chain_for_each(f, f) {
Jason has some helper for deep-walking fence chains/arrays here I think. Might want to look into that, so that we have some consistency in how we pile up multiple exclusive fences.
Well those helpers are not from Jason, but from me :)
But no, for now the deep inspection is not really helpful here since grabbing a reference to a certain chain node is what that makes the handling easier and faster here.
Thinking more about it that should also make it possible for the garbage collection to kick in properly.
Hm this is tricky to reason about, but yeah with this here it's a true chain, and you just need to connect them. But then if a buffer is on multiple engines, collapsing things down occasionally might be useful.
But maybe we need to do that in the bigger rework where exclusive fences are also just in the dma_fence_list with a "this is an exclusive one btw" tag.
I think for the vk import case doing the deep scan makes more sense, it's a once-per-frame thing, and there's a much bigger chance that you have a pile of fences from different engines on it already.
I think a comment explaining why we think deep scan isn't a good idea here would be good, just so we can appreciate our foolishness when it all goes wrong :-) -Daniel
Anyway pretty much one of the versions I had in mind too, except I didn't type it up.
Acked-by: Daniel Vetter daniel.vetter@ffwll.ch
Thanks, Christian.
struct dma_fence_chain *chain = to_dma_fence_chain(f);
if (amdgpu_sync_test_fence(adev, mode, owner, chain ?
chain->fence : f)) {
r = amdgpu_sync_fence(sync, f);
dma_fence_put(f);
if (r)
return r;
break;
}
- } flist = dma_resv_shared_list(resv);
- if (!flist || r)
return r;
- if (!flist)
for (i = 0; i < flist->shared_count; ++i) {return 0;
void *fence_owner;
- f = rcu_dereference_protected(flist->shared[i], dma_resv_held(resv));
fence_owner = amdgpu_sync_get_owner(f);
/* Always sync to moves, no matter what */
if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) {
if (amdgpu_sync_test_fence(adev, mode, owner, f)) { r = amdgpu_sync_fence(sync, f); if (r)
break;
}
/* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Never sync to VM updates either. */
if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Ignore fences depending on the sync mode */
switch (mode) {
case AMDGPU_SYNC_ALWAYS:
break;
case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
continue;
break;
case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
continue;
break;
case AMDGPU_SYNC_EXPLICIT:
continue;
}return r;
WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
r = amdgpu_sync_fence(sync, f);
if (r)
}break;
- return r;
- return 0; } /**
-- 2.25.1
On 11.06.21 at 17:18, Daniel Vetter wrote:
On Fri, Jun 11, 2021 at 12:09:19PM +0200, Christian König wrote:
On 11.06.21 at 11:07, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:59AM +0200, Christian König wrote:
Unwrap a the explicit fence if it is a dma_fence_chain and sync to the first fence not matching the owner rules.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 118 +++++++++++++---------- 1 file changed, 68 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index 1b2ceccaf5b0..862eb3c1c4c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -28,6 +28,8 @@ * Christian König christian.koenig@amd.com */ +#include <linux/dma-fence-chain.h>
- #include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h"
@@ -186,6 +188,55 @@ int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence) return amdgpu_sync_fence(sync, fence); } +/* Determine based on the owner and mode if we should sync to a fence or not */ +static bool amdgpu_sync_test_fence(struct amdgpu_device *adev,
enum amdgpu_sync_mode mode,
void *owner, struct dma_fence *f)
+{
- void *fence_owner = amdgpu_sync_get_owner(f);
- /* Always sync to moves, no matter what */
- if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED)
return true;
- /* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
- if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Never sync to VM updates either. */
- if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Ignore fences depending on the sync mode */
- switch (mode) {
- case AMDGPU_SYNC_ALWAYS:
return true;
- case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
return false;
break;
- case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
return false;
break;
- case AMDGPU_SYNC_EXPLICIT:
return false;
- }
- WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
- return true;
+}
- /**
- amdgpu_sync_resv - sync to a reservation object
@@ -211,67 +262,34 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync, /* always sync to the exclusive fence */ f = dma_resv_excl_fence(resv);
- r = amdgpu_sync_fence(sync, f);
- dma_fence_chain_for_each(f, f) {
Jason has some helper for deep-walking fence chains/arrays here I think. Might want to look into that, so that we have some consistency in how we pile up multiple exclusive fences.
Well those helpers are not from Jason, but from me :)
But no, for now the deep inspection is not really helpful here since grabbing a reference to a certain chain node is what that makes the handling easier and faster here.
Thinking more about it that should also make it possible for the garbage collection to kick in properly.
Hm this is tricky to reason about, but yeah with this here it's a true chain, and you just need to connect them. But then if a buffer is on multiple engines, collapsing things down occasionally might be useful.
But maybe we need to do that in the bigger rework where exclusive fences are also just in the dma_fence_list with a "this is an exclusive one btw" tag.
I think for the vk import case doing the deep scan makes more sense, it's a once-per-frame thing, and there's a much bigger chance that you have a pile of fences from different engines on it already.
The problem with Jason's IOCTL is that you *must* do a deep dive and flatten out the fences.
Otherwise somebody could use it to create a deep fence structure with dma_fence_arrays containing dma_fence_arrays, containing dma_fence_arrays etc...
When you then release that structure you overwrite kernel stack because the dma_fence_array does a dma_fence_put() on its entries :)
The dma_fence_chain container is intentionally made in a way to prevent that.
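As a purely illustrative sketch of that design (hypothetical code, not the actual dma_fence_chain_release() and not part of this series): on release a chain node can unlink every predecessor it exclusively owns in a loop, so freeing an arbitrarily long chain never recurses once per node the way nested fence containers can.

#include <linux/dma-fence-chain.h>

static void example_chain_release(struct dma_fence *fence)
{
	struct dma_fence_chain *chain = to_dma_fence_chain(fence);
	struct dma_fence *prev;

	/* Unlink as many predecessors as we exclusively own in a loop. */
	while ((prev = rcu_dereference_protected(chain->prev, true))) {
		struct dma_fence_chain *prev_chain = to_dma_fence_chain(prev);

		/* Stop at nodes somebody else still holds a reference to. */
		if (!prev_chain || kref_read(&prev->refcount) > 1)
			break;

		/* Last reference is ours, plain pointer moves are enough. */
		chain->prev = prev_chain->prev;
		RCU_INIT_POINTER(prev_chain->prev, NULL);
		dma_fence_put(prev);
	}
	dma_fence_put(prev);

	dma_fence_put(chain->fence);
	dma_fence_free(fence);
}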
I think a comment explaining why we think deep scan isn't a good idea here would be good, just so we can appreciate our foolishness when it all goes wrong :-)
Ok, good point.
Thanks, Christian.
-Daniel
Anyway pretty much one of the versions I had in mind too, except I didn't type it up.
Acked-by: Daniel Vetter daniel.vetter@ffwll.ch
Thanks, Christian.
struct dma_fence_chain *chain = to_dma_fence_chain(f);
if (amdgpu_sync_test_fence(adev, mode, owner, chain ?
chain->fence : f)) {
r = amdgpu_sync_fence(sync, f);
dma_fence_put(f);
if (r)
return r;
break;
}
- } flist = dma_resv_shared_list(resv);
- if (!flist || r)
return r;
- if (!flist)
for (i = 0; i < flist->shared_count; ++i) {return 0;
void *fence_owner;
f = rcu_dereference_protected(flist->shared[i], dma_resv_held(resv));
fence_owner = amdgpu_sync_get_owner(f);
/* Always sync to moves, no matter what */
if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) {
if (amdgpu_sync_test_fence(adev, mode, owner, f)) { r = amdgpu_sync_fence(sync, f); if (r)
break;
}
/* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Never sync to VM updates either. */
if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Ignore fences depending on the sync mode */
switch (mode) {
case AMDGPU_SYNC_ALWAYS:
break;
case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
continue;
break;
case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
continue;
break;
case AMDGPU_SYNC_EXPLICIT:
continue;
return r; }
WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
r = amdgpu_sync_fence(sync, f);
if (r)
}break;
- return r;
- return 0; } /**
-- 2.25.1
On Mon, Jun 14, 2021 at 09:25:44AM +0200, Christian König wrote:
On 11.06.21 at 17:18, Daniel Vetter wrote:
On Fri, Jun 11, 2021 at 12:09:19PM +0200, Christian König wrote:
On 11.06.21 at 11:07, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:17:59AM +0200, Christian König wrote:
Unwrap a the explicit fence if it is a dma_fence_chain and sync to the first fence not matching the owner rules.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 118 +++++++++++++---------- 1 file changed, 68 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index 1b2ceccaf5b0..862eb3c1c4c5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -28,6 +28,8 @@ * Christian König christian.koenig@amd.com */ +#include <linux/dma-fence-chain.h>
- #include "amdgpu.h" #include "amdgpu_trace.h" #include "amdgpu_amdkfd.h"
@@ -186,6 +188,55 @@ int amdgpu_sync_vm_fence(struct amdgpu_sync *sync, struct dma_fence *fence) return amdgpu_sync_fence(sync, fence); } +/* Determine based on the owner and mode if we should sync to a fence or not */ +static bool amdgpu_sync_test_fence(struct amdgpu_device *adev,
enum amdgpu_sync_mode mode,
void *owner, struct dma_fence *f)
+{
- void *fence_owner = amdgpu_sync_get_owner(f);
- /* Always sync to moves, no matter what */
- if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED)
return true;
- /* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
- if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Never sync to VM updates either. */
- if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
return false;
- /* Ignore fences depending on the sync mode */
- switch (mode) {
- case AMDGPU_SYNC_ALWAYS:
return true;
- case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
return false;
break;
- case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
return false;
break;
- case AMDGPU_SYNC_EXPLICIT:
return false;
- }
- WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
- return true;
+}
- /**
- amdgpu_sync_resv - sync to a reservation object
@@ -211,67 +262,34 @@ int amdgpu_sync_resv(struct amdgpu_device *adev, struct amdgpu_sync *sync, /* always sync to the exclusive fence */ f = dma_resv_excl_fence(resv);
- r = amdgpu_sync_fence(sync, f);
- dma_fence_chain_for_each(f, f) {
Jason has some helper for deep-walking fence chains/arrays here I think. Might want to look into that, so that we have some consistency in how we pile up multiple exclusive fences.
Well those helpers are not from Jason, but from me :)
But no, for now the deep inspection is not really helpful here since grabbing a reference to a certain chain node is what that makes the handling easier and faster here.
Thinking more about it that should also make it possible for the garbage collection to kick in properly.
Hm this is tricky to reason about, but yeah with this here it's a true chain, and you just need to connect them. But then if a buffer is on multiple engines, collapsing things down occasionally might be useful.
But maybe we need to do that in the bigger rework where exclusive fences are also just in the dma_fence_list with a "this is an exclusive one btw" tag.
I think for the vk import case doing the deep scan makes more sense, it's a once-per-frame thing, and there's a much bigger chance that you have a pile of fences from different engines on it already.
The problem with Jasons IOCTL is that you *must* do a deep dive and flatten out the fences.
Otherwise somebody could use it to create a deep fence structure with dma_fence_arrays containing dma_fence_arrays, containing dma_fence_arrays etc...
When you then release that structure you overwrite kernel stack because the dma_fence_array does a dma_fence_put() on it's entries :)
The dma_fence_chain container is intentionally made in a way to prevent that.
I'm maybe too dense, but why can't we use a chain with Jason's work too? At least from high-level semantics it feels like we're doing exactly the same thing here as the import ioctl does. Only thing different is that the import ioctl doesn't do an actual CS or anything like that. -Daniel
I think a comment explaining why we think deep scan isn't a good idea here would be good, just so we can appreciate our foolishness when it all goes wrong :-)
Ok, good point.
Thanks, Christian.
-Daniel
Anyway pretty much one of the versions I had in mind too, except I didn't type it up.
Acked-by: Daniel Vetter daniel.vetter@ffwll.ch
Thanks, Christian.
struct dma_fence_chain *chain = to_dma_fence_chain(f);
if (amdgpu_sync_test_fence(adev, mode, owner, chain ?
chain->fence : f)) {
r = amdgpu_sync_fence(sync, f);
dma_fence_put(f);
if (r)
return r;
break;
}
- } flist = dma_resv_shared_list(resv);
- if (!flist || r)
return r;
- if (!flist)
for (i = 0; i < flist->shared_count; ++i) {return 0;
void *fence_owner;
f = rcu_dereference_protected(flist->shared[i], dma_resv_held(resv));
fence_owner = amdgpu_sync_get_owner(f);
/* Always sync to moves, no matter what */
if (fence_owner == AMDGPU_FENCE_OWNER_UNDEFINED) {
if (amdgpu_sync_test_fence(adev, mode, owner, f)) { r = amdgpu_sync_fence(sync, f); if (r)
break;
}
/* We only want to trigger KFD eviction fences on
* evict or move jobs. Skip KFD fences otherwise.
*/
if (fence_owner == AMDGPU_FENCE_OWNER_KFD &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Never sync to VM updates either. */
if (fence_owner == AMDGPU_FENCE_OWNER_VM &&
owner != AMDGPU_FENCE_OWNER_UNDEFINED)
continue;
/* Ignore fences depending on the sync mode */
switch (mode) {
case AMDGPU_SYNC_ALWAYS:
break;
case AMDGPU_SYNC_NE_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner == owner)
continue;
break;
case AMDGPU_SYNC_EQ_OWNER:
if (amdgpu_sync_same_dev(adev, f) &&
fence_owner != owner)
continue;
break;
case AMDGPU_SYNC_EXPLICIT:
continue;
return r; }
WARN(debug_evictions && fence_owner == AMDGPU_FENCE_OWNER_KFD,
"Adding eviction fence to sync obj");
r = amdgpu_sync_fence(sync, f);
if (r)
}break;
- return r;
- return 0; } /**
-- 2.25.1
Drop the workaround and instead implement a better solution.
Basically we are now chaining all submissions using a dma_fence_chain container and adding them as exclusive fence to the dma_resv object.
This way other drivers can still sync to the single exclusive fence while amdgpu only syncs to fences from different processes.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 54 +++++++++++++---- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 65 --------------------- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 - 6 files changed, 47 insertions(+), 79 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h index a130e766cbdb..c905a4cfc173 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h @@ -34,6 +34,7 @@ struct amdgpu_fpriv; struct amdgpu_bo_list_entry { struct ttm_validate_buffer tv; struct amdgpu_bo_va *bo_va; + struct dma_fence_chain *chain; uint32_t priority; struct page **user_pages; bool user_invalidated; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 325e82621467..f6f3029f958d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -587,6 +587,20 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, goto out; }
+ amdgpu_bo_list_for_each_entry(e, p->bo_list) { + struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo); + + e->bo_va = amdgpu_vm_bo_find(vm, bo); + + if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) { + e->chain = dma_fence_chain_alloc(); + if (!e->chain) { + r = -ENOMEM; + goto error_validate; + } + } + } + amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold, &p->bytes_moved_vis_threshold); p->bytes_moved = 0; @@ -614,15 +628,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, gws = p->bo_list->gws_obj; oa = p->bo_list->oa_obj;
- amdgpu_bo_list_for_each_entry(e, p->bo_list) { - struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo); - - /* Make sure we use the exclusive slot for shared BOs */ - if (bo->prime_shared_count) - e->tv.num_shared = 0; - e->bo_va = amdgpu_vm_bo_find(vm, bo); - } - if (gds) { p->job->gds_base = amdgpu_bo_gpu_offset(gds) >> PAGE_SHIFT; p->job->gds_size = amdgpu_bo_size(gds) >> PAGE_SHIFT; @@ -644,8 +649,13 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, }
error_validate: - if (r) + if (r) { + amdgpu_bo_list_for_each_entry(e, p->bo_list) { + dma_fence_chain_free(e->chain); + e->chain = NULL; + } ttm_eu_backoff_reservation(&p->ticket, &p->validated); + } out: return r; } @@ -685,9 +695,17 @@ static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser, int error, { unsigned i;
- if (error && backoff) + if (error && backoff) { + struct amdgpu_bo_list_entry *e; + + amdgpu_bo_list_for_each_entry(e, parser->bo_list) { + dma_fence_chain_free(e->chain); + e->chain = NULL; + } + ttm_eu_backoff_reservation(&parser->ticket, &parser->validated); + }
for (i = 0; i < parser->num_post_deps; i++) { drm_syncobj_put(parser->post_deps[i].syncobj); @@ -1260,6 +1278,20 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
+ amdgpu_bo_list_for_each_entry(e, p->bo_list) { + struct dma_resv *resv = e->tv.bo->base.resv; + struct dma_fence_chain *chain = e->chain; + + if (!chain) + continue; + + dma_fence_chain_init(chain, dma_resv_excl_fence(resv), + dma_fence_get(p->fence), 1); + + rcu_assign_pointer(resv->fence_excl, &chain->base); + e->chain = NULL; + } + ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index c3053b83b80c..23219fc3b28c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -42,48 +42,6 @@ #include <linux/pci-p2pdma.h> #include <linux/pm_runtime.h>
-static int -__dma_resv_make_exclusive(struct dma_resv *obj) -{ - struct dma_fence **fences; - unsigned int count; - int r; - - if (!dma_resv_shared_list(obj)) /* no shared fences to convert */ - return 0; - - r = dma_resv_get_fences(obj, NULL, &count, &fences); - if (r) - return r; - - if (count == 0) { - /* Now that was unexpected. */ - } else if (count == 1) { - dma_resv_add_excl_fence(obj, fences[0]); - dma_fence_put(fences[0]); - kfree(fences); - } else { - struct dma_fence_array *array; - - array = dma_fence_array_create(count, fences, - dma_fence_context_alloc(1), 0, - false); - if (!array) - goto err_fences_put; - - dma_resv_add_excl_fence(obj, &array->base); - dma_fence_put(&array->base); - } - - return 0; - -err_fences_put: - while (count--) - dma_fence_put(fences[count]); - kfree(fences); - return -ENOMEM; -} - /** * amdgpu_dma_buf_attach - &dma_buf_ops.attach implementation * @@ -110,24 +68,6 @@ static int amdgpu_dma_buf_attach(struct dma_buf *dmabuf, if (r < 0) goto out;
- r = amdgpu_bo_reserve(bo, false); - if (unlikely(r != 0)) - goto out; - - /* - * We only create shared fences for internal use, but importers - * of the dmabuf rely on exclusive fences for implicitly - * tracking write hazards. As any of the current fences may - * correspond to a write, we need to convert all existing - * fences on the reservation object into a single exclusive - * fence. - */ - r = __dma_resv_make_exclusive(bo->tbo.base.resv); - if (r) - goto out; - - bo->prime_shared_count++; - amdgpu_bo_unreserve(bo); return 0;
out: @@ -150,9 +90,6 @@ static void amdgpu_dma_buf_detach(struct dma_buf *dmabuf, struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- if (attach->dev->driver != adev->dev->driver && bo->prime_shared_count) - bo->prime_shared_count--; - pm_runtime_mark_last_busy(adev_to_drm(adev)->dev); pm_runtime_put_autosuspend(adev_to_drm(adev)->dev); } @@ -406,8 +343,6 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) bo = gem_to_amdgpu_bo(gobj); bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT; bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT; - if (dma_buf->ops != &amdgpu_dmabuf_ops) - bo->prime_shared_count = 1;
dma_resv_unlock(resv); return gobj; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 1c3e3b608332..5d82120fc160 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -829,7 +829,8 @@ int amdgpu_gem_op_ioctl(struct drm_device *dev, void *data, break; } case AMDGPU_GEM_OP_SET_PLACEMENT: - if (robj->prime_shared_count && (args->value & AMDGPU_GEM_DOMAIN_VRAM)) { + if (robj->tbo.base.import_attach && + args->value & AMDGPU_GEM_DOMAIN_VRAM) { r = -EINVAL; amdgpu_bo_unreserve(robj); break; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 96447e1d4c9c..30951b593809 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -871,7 +871,7 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain, return -EINVAL;
/* A shared bo cannot be migrated to VRAM */ - if (bo->prime_shared_count || bo->tbo.base.import_attach) { + if (bo->tbo.base.import_attach) { if (domain & AMDGPU_GEM_DOMAIN_GTT) domain = AMDGPU_GEM_DOMAIN_GTT; else diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index b35962702278..55d7e90eaa72 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -98,7 +98,6 @@ struct amdgpu_bo { struct ttm_buffer_object tbo; struct ttm_bo_kmap_obj kmap; u64 flags; - unsigned prime_shared_count; /* per VM structure for page tables and with virtual addresses */ struct amdgpu_vm_bo_base *vm_bo; /* Constant after initialization */
On Thu, Jun 10, 2021 at 11:18:00AM +0200, Christian König wrote:
Drop the workaround and instead implement a better solution.
Basically we are now chaining all submissions using a dma_fence_chain container and adding them as exclusive fence to the dma_resv object.
This way other drivers can still sync to the single exclusive fence while amdgpu only sync to fences from different processes.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 54 +++++++++++++---- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 65 --------------------- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 - 6 files changed, 47 insertions(+), 79 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h index a130e766cbdb..c905a4cfc173 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h @@ -34,6 +34,7 @@ struct amdgpu_fpriv; struct amdgpu_bo_list_entry { struct ttm_validate_buffer tv; struct amdgpu_bo_va *bo_va;
- struct dma_fence_chain *chain; uint32_t priority; struct page **user_pages; bool user_invalidated;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 325e82621467..f6f3029f958d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -587,6 +587,20 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, goto out; }
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
e->bo_va = amdgpu_vm_bo_find(vm, bo);
if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) {
e->chain = dma_fence_chain_alloc();
if (!e->chain) {
r = -ENOMEM;
goto error_validate;
}
}
- }
- amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold, &p->bytes_moved_vis_threshold); p->bytes_moved = 0;
@@ -614,15 +628,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, gws = p->bo_list->gws_obj; oa = p->bo_list->oa_obj;
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
/* Make sure we use the exclusive slot for shared BOs */
if (bo->prime_shared_count)
e->tv.num_shared = 0;
e->bo_va = amdgpu_vm_bo_find(vm, bo);
- }
- if (gds) { p->job->gds_base = amdgpu_bo_gpu_offset(gds) >> PAGE_SHIFT; p->job->gds_size = amdgpu_bo_size(gds) >> PAGE_SHIFT;
@@ -644,8 +649,13 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, }
error_validate:
- if (r)
- if (r) {
amdgpu_bo_list_for_each_entry(e, p->bo_list) {
dma_fence_chain_free(e->chain);
e->chain = NULL;
}
ttm_eu_backoff_reservation(&p->ticket, &p->validated);
- }
out: return r; } @@ -685,9 +695,17 @@ static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser, int error, { unsigned i;
- if (error && backoff)
if (error && backoff) {
struct amdgpu_bo_list_entry *e;
amdgpu_bo_list_for_each_entry(e, parser->bo_list) {
dma_fence_chain_free(e->chain);
e->chain = NULL;
}
ttm_eu_backoff_reservation(&parser->ticket, &parser->validated);
}
for (i = 0; i < parser->num_post_deps; i++) { drm_syncobj_put(parser->post_deps[i].syncobj);
@@ -1260,6 +1278,20 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct dma_resv *resv = e->tv.bo->base.resv;
struct dma_fence_chain *chain = e->chain;
if (!chain)
continue;
dma_fence_chain_init(chain, dma_resv_excl_fence(resv),
dma_fence_get(p->fence), 1);
rcu_assign_pointer(resv->fence_excl, &chain->base);
So for safety, since this is now a driver interface, I was thinking of a helper which does this entire dance and _also_ adds the new fence to the shared slots. This way we don't let drivers open-code this rather tricky operation.
You probably also want to then combine this with ttm_eu_fence_buffer_objects below I think so you're not walking this list 2x.
Iow I'd put the fence_chain parameter into struct ttm_validate_buffer and then let ttm_eu_fence_buffer_objects() do this, by calling a new dma_resv_add_shared_excl_fence. Ideally the same helper that Jason's sync_file import will also use. We want to hide the inner workings of dma_resv as much as possible ofc.
The other thing I'm wondering is whether this needs to be wrapped in a single seqlock, or whether we have ordering issues if we split up the update here. Setting the exclusive fence before we have also added it to the shared slot can at least violate the "shared fences signal after the exclusive one if both are present" rule.
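For illustration, a rough sketch of the kind of helper discussed here. This is not an existing dma_resv function; the name, the single seqcount write section and the reference handling are all assumptions, and a real version would reuse the context-matching logic of dma_resv_add_shared_fence() instead of blindly appending:

/*
 * Hypothetical helper, not an existing dma_resv API.  It publishes @fence
 * both as the new exclusive fence and in a shared slot under one seqcount
 * write section, so readers never see one update without the other.
 * Caller must hold the reservation lock and must have reserved a shared
 * slot (dma_resv_reserve_shared()) beforehand.
 */
void dma_resv_add_shared_excl_fence(struct dma_resv *obj,
				    struct dma_fence *fence)
{
	struct dma_resv_list *fobj = dma_resv_shared_list(obj);
	struct dma_fence *old = dma_resv_excl_fence(obj);

	dma_resv_assert_held(obj);

	/* One reference for the exclusive slot, one for the shared slot. */
	dma_fence_get(fence);
	dma_fence_get(fence);

	write_seqcount_begin(&obj->seq);

	/* Simplified: a real version would replace an existing fence from
	 * the same context instead of always appending. */
	RCU_INIT_POINTER(fobj->shared[fobj->shared_count++], fence);
	rcu_assign_pointer(obj->fence_excl, fence);

	write_seqcount_end(&obj->seq);

	/* Drop the reservation object's reference to the old exclusive fence. */
	dma_fence_put(old);
}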
Finally I guess we need to sprinkle the manual garbage collector somewhere here too.
But aside from the interface polish and correctness against races I think this looks solid in the big picture.
Cheers, Daniel
e->chain = NULL;
- }
- ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index c3053b83b80c..23219fc3b28c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -42,48 +42,6 @@ #include <linux/pci-p2pdma.h> #include <linux/pm_runtime.h>
-static int -__dma_resv_make_exclusive(struct dma_resv *obj) -{
- struct dma_fence **fences;
- unsigned int count;
- int r;
- if (!dma_resv_shared_list(obj)) /* no shared fences to convert */
return 0;
- r = dma_resv_get_fences(obj, NULL, &count, &fences);
- if (r)
return r;
- if (count == 0) {
/* Now that was unexpected. */
- } else if (count == 1) {
dma_resv_add_excl_fence(obj, fences[0]);
dma_fence_put(fences[0]);
kfree(fences);
- } else {
struct dma_fence_array *array;
array = dma_fence_array_create(count, fences,
dma_fence_context_alloc(1), 0,
false);
if (!array)
goto err_fences_put;
dma_resv_add_excl_fence(obj, &array->base);
dma_fence_put(&array->base);
- }
- return 0;
-err_fences_put:
- while (count--)
dma_fence_put(fences[count]);
- kfree(fences);
- return -ENOMEM;
-}
/**
- amdgpu_dma_buf_attach - &dma_buf_ops.attach implementation
@@ -110,24 +68,6 @@ static int amdgpu_dma_buf_attach(struct dma_buf *dmabuf, if (r < 0) goto out;
- r = amdgpu_bo_reserve(bo, false);
- if (unlikely(r != 0))
goto out;
- /*
* We only create shared fences for internal use, but importers
* of the dmabuf rely on exclusive fences for implicitly
* tracking write hazards. As any of the current fences may
* correspond to a write, we need to convert all existing
* fences on the reservation object into a single exclusive
* fence.
*/
- r = __dma_resv_make_exclusive(bo->tbo.base.resv);
- if (r)
goto out;
- bo->prime_shared_count++;
- amdgpu_bo_unreserve(bo); return 0;
out: @@ -150,9 +90,6 @@ static void amdgpu_dma_buf_detach(struct dma_buf *dmabuf, struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- if (attach->dev->driver != adev->dev->driver && bo->prime_shared_count)
bo->prime_shared_count--;
- pm_runtime_mark_last_busy(adev_to_drm(adev)->dev); pm_runtime_put_autosuspend(adev_to_drm(adev)->dev);
} @@ -406,8 +343,6 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) bo = gem_to_amdgpu_bo(gobj); bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT; bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT;
if (dma_buf->ops != &amdgpu_dmabuf_ops)
bo->prime_shared_count = 1;
dma_resv_unlock(resv); return gobj;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 1c3e3b608332..5d82120fc160 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -829,7 +829,8 @@ int amdgpu_gem_op_ioctl(struct drm_device *dev, void *data, break; } case AMDGPU_GEM_OP_SET_PLACEMENT:
if (robj->prime_shared_count && (args->value & AMDGPU_GEM_DOMAIN_VRAM)) {
if (robj->tbo.base.import_attach &&
args->value & AMDGPU_GEM_DOMAIN_VRAM) { r = -EINVAL; amdgpu_bo_unreserve(robj); break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 96447e1d4c9c..30951b593809 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -871,7 +871,7 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain, return -EINVAL;
/* A shared bo cannot be migrated to VRAM */
- if (bo->prime_shared_count || bo->tbo.base.import_attach) {
- if (bo->tbo.base.import_attach) { if (domain & AMDGPU_GEM_DOMAIN_GTT) domain = AMDGPU_GEM_DOMAIN_GTT; else
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index b35962702278..55d7e90eaa72 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -98,7 +98,6 @@ struct amdgpu_bo { struct ttm_buffer_object tbo; struct ttm_bo_kmap_obj kmap; u64 flags;
- unsigned prime_shared_count; /* per VM structure for page tables and with virtual addresses */ struct amdgpu_vm_bo_base *vm_bo; /* Constant after initialization */
-- 2.25.1
On 11.06.21 at 11:17, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:18:00AM +0200, Christian König wrote:
Drop the workaround and instead implement a better solution.
Basically we are now chaining all submissions using a dma_fence_chain container and adding them as the exclusive fence to the dma_resv object.
This way other drivers can still sync to the single exclusive fence while amdgpu only syncs to fences from different processes.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 54 +++++++++++++---- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 65 --------------------- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 - 6 files changed, 47 insertions(+), 79 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h index a130e766cbdb..c905a4cfc173 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h @@ -34,6 +34,7 @@ struct amdgpu_fpriv; struct amdgpu_bo_list_entry { struct ttm_validate_buffer tv; struct amdgpu_bo_va *bo_va;
- struct dma_fence_chain *chain; uint32_t priority; struct page **user_pages; bool user_invalidated;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 325e82621467..f6f3029f958d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -587,6 +587,20 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, goto out; }
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
e->bo_va = amdgpu_vm_bo_find(vm, bo);
if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) {
e->chain = dma_fence_chain_alloc();
if (!e->chain) {
r = -ENOMEM;
goto error_validate;
}
}
- }
- amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold, &p->bytes_moved_vis_threshold); p->bytes_moved = 0;
@@ -614,15 +628,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, gws = p->bo_list->gws_obj; oa = p->bo_list->oa_obj;
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
/* Make sure we use the exclusive slot for shared BOs */
if (bo->prime_shared_count)
e->tv.num_shared = 0;
e->bo_va = amdgpu_vm_bo_find(vm, bo);
- }
- if (gds) { p->job->gds_base = amdgpu_bo_gpu_offset(gds) >> PAGE_SHIFT; p->job->gds_size = amdgpu_bo_size(gds) >> PAGE_SHIFT;
@@ -644,8 +649,13 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, }
error_validate:
- if (r)
- if (r) {
amdgpu_bo_list_for_each_entry(e, p->bo_list) {
dma_fence_chain_free(e->chain);
e->chain = NULL;
}
ttm_eu_backoff_reservation(&p->ticket, &p->validated);
- } out: return r; }
@@ -685,9 +695,17 @@ static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser, int error, { unsigned i;
- if (error && backoff)
if (error && backoff) {
struct amdgpu_bo_list_entry *e;
amdgpu_bo_list_for_each_entry(e, parser->bo_list) {
dma_fence_chain_free(e->chain);
e->chain = NULL;
}
ttm_eu_backoff_reservation(&parser->ticket, &parser->validated);
}
for (i = 0; i < parser->num_post_deps; i++) { drm_syncobj_put(parser->post_deps[i].syncobj);
@@ -1260,6 +1278,20 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct dma_resv *resv = e->tv.bo->base.resv;
struct dma_fence_chain *chain = e->chain;
if (!chain)
continue;
dma_fence_chain_init(chain, dma_resv_excl_fence(resv),
dma_fence_get(p->fence), 1);
rcu_assign_pointer(resv->fence_excl, &chain->base);
So for safety, since this is now a driver interface, I was thinking of a helper which does this entire dance and _also_ adds the new fence to the shared slots. This way we don't let drivers open-code this rather tricky operation.
Well I only see this as a workaround for amdgpu and don't want to leak it into other drivers.
If somebody else wants to adopt it we should probably consider fixing the dma_resv object instead.
You probably also want to then combine this with ttm_eu_fence_buffer_objects below I think so you're not walking this list 2x.
Iow I'd put the fence_chain parameter into struct ttm_validate_buffer and then let ttm_eu_fence_buffer_objects() do this, by calling a new dma_resv_add_shared_excl_fence. Ideally the same helper that also Jason's sync_file import will use. We want to hide the inner workings of dma_resv as much as possible ofc.
The other thing I'm wondering is whether this needs to be wrapped in a single seqlock or whether we don't have ordering issues if we split up the update here? Setting the exclusive fence before we also added it to the shared slot can at least violate the "shared fences signal after exclusive if both are present"
Uff, good point.
Finally I guess we need to sprinkle the manual garbage collector somewhere here too.
That is done automatically when somebody iterates the chain node.
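For reference, a minimal sketch of such an iteration (handle_fence is a hypothetical stand-in for whatever per-fence processing a driver does, e.g. adding a scheduler dependency or waiting). dma_fence_chain_walk(), which dma_fence_chain_for_each() is built on, drops already signalled links as it walks, which is the garbage collection referred to here:

#include <linux/dma-fence-chain.h>
#include <linux/dma-resv.h>

static int sync_to_exclusive(struct dma_resv *resv,
			     int (*handle_fence)(struct dma_fence *f))
{
	struct dma_fence *excl = dma_resv_excl_fence(resv);
	struct dma_fence *iter;
	int r = 0;

	dma_fence_chain_for_each(iter, excl) {
		struct dma_fence_chain *chain = to_dma_fence_chain(iter);

		/* For a plain (non-chain) fence the loop body runs exactly
		 * once with iter == excl and chain == NULL. */
		r = handle_fence(chain ? chain->fence : iter);
		if (r) {
			dma_fence_put(iter);
			break;
		}
	}
	return r;
}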
Christian.
But aside from the interface polish and correctness against races I think this looks solid in the big picture.
Cheers, Daniel
e->chain = NULL;
- }
- ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index c3053b83b80c..23219fc3b28c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -42,48 +42,6 @@ #include <linux/pci-p2pdma.h> #include <linux/pm_runtime.h>
-static int -__dma_resv_make_exclusive(struct dma_resv *obj) -{
- struct dma_fence **fences;
- unsigned int count;
- int r;
- if (!dma_resv_shared_list(obj)) /* no shared fences to convert */
return 0;
- r = dma_resv_get_fences(obj, NULL, &count, &fences);
- if (r)
return r;
- if (count == 0) {
/* Now that was unexpected. */
- } else if (count == 1) {
dma_resv_add_excl_fence(obj, fences[0]);
dma_fence_put(fences[0]);
kfree(fences);
- } else {
struct dma_fence_array *array;
array = dma_fence_array_create(count, fences,
dma_fence_context_alloc(1), 0,
false);
if (!array)
goto err_fences_put;
dma_resv_add_excl_fence(obj, &array->base);
dma_fence_put(&array->base);
- }
- return 0;
-err_fences_put:
- while (count--)
dma_fence_put(fences[count]);
- kfree(fences);
- return -ENOMEM;
-}
- /**
- amdgpu_dma_buf_attach - &dma_buf_ops.attach implementation
@@ -110,24 +68,6 @@ static int amdgpu_dma_buf_attach(struct dma_buf *dmabuf, if (r < 0) goto out;
r = amdgpu_bo_reserve(bo, false);
if (unlikely(r != 0))
goto out;
/*
* We only create shared fences for internal use, but importers
* of the dmabuf rely on exclusive fences for implicitly
* tracking write hazards. As any of the current fences may
* correspond to a write, we need to convert all existing
* fences on the reservation object into a single exclusive
* fence.
*/
r = __dma_resv_make_exclusive(bo->tbo.base.resv);
if (r)
goto out;
bo->prime_shared_count++;
amdgpu_bo_unreserve(bo); return 0;
out:
@@ -150,9 +90,6 @@ static void amdgpu_dma_buf_detach(struct dma_buf *dmabuf, struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- if (attach->dev->driver != adev->dev->driver && bo->prime_shared_count)
bo->prime_shared_count--;
- pm_runtime_mark_last_busy(adev_to_drm(adev)->dev); pm_runtime_put_autosuspend(adev_to_drm(adev)->dev); }
@@ -406,8 +343,6 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) bo = gem_to_amdgpu_bo(gobj); bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT; bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT;
if (dma_buf->ops != &amdgpu_dmabuf_ops)
bo->prime_shared_count = 1;
dma_resv_unlock(resv); return gobj;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 1c3e3b608332..5d82120fc160 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -829,7 +829,8 @@ int amdgpu_gem_op_ioctl(struct drm_device *dev, void *data, break; } case AMDGPU_GEM_OP_SET_PLACEMENT:
if (robj->prime_shared_count && (args->value & AMDGPU_GEM_DOMAIN_VRAM)) {
if (robj->tbo.base.import_attach &&
args->value & AMDGPU_GEM_DOMAIN_VRAM) { r = -EINVAL; amdgpu_bo_unreserve(robj); break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 96447e1d4c9c..30951b593809 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -871,7 +871,7 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain, return -EINVAL;
/* A shared bo cannot be migrated to VRAM */
- if (bo->prime_shared_count || bo->tbo.base.import_attach) {
- if (bo->tbo.base.import_attach) { if (domain & AMDGPU_GEM_DOMAIN_GTT) domain = AMDGPU_GEM_DOMAIN_GTT; else
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index b35962702278..55d7e90eaa72 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -98,7 +98,6 @@ struct amdgpu_bo { struct ttm_buffer_object tbo; struct ttm_bo_kmap_obj kmap; u64 flags;
- unsigned prime_shared_count; /* per VM structure for page tables and with virtual addresses */ struct amdgpu_vm_bo_base *vm_bo; /* Constant after initialization */
-- 2.25.1
On Fri, Jun 11, 2021 at 12:12:45PM +0200, Christian König wrote:
On 11.06.21 at 11:17, Daniel Vetter wrote:
On Thu, Jun 10, 2021 at 11:18:00AM +0200, Christian König wrote:
Drop the workaround and instead implement a better solution.
Basically we are now chaining all submissions using a dma_fence_chain container and adding them as the exclusive fence to the dma_resv object.
This way other drivers can still sync to the single exclusive fence while amdgpu only syncs to fences from different processes.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 54 +++++++++++++---- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 65 --------------------- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 3 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 - 6 files changed, 47 insertions(+), 79 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h index a130e766cbdb..c905a4cfc173 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h @@ -34,6 +34,7 @@ struct amdgpu_fpriv; struct amdgpu_bo_list_entry { struct ttm_validate_buffer tv; struct amdgpu_bo_va *bo_va;
- struct dma_fence_chain *chain; uint32_t priority; struct page **user_pages; bool user_invalidated;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 325e82621467..f6f3029f958d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -587,6 +587,20 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, goto out; }
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
e->bo_va = amdgpu_vm_bo_find(vm, bo);
if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) {
e->chain = dma_fence_chain_alloc();
if (!e->chain) {
r = -ENOMEM;
goto error_validate;
}
}
- }
- amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold, &p->bytes_moved_vis_threshold); p->bytes_moved = 0;
@@ -614,15 +628,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, gws = p->bo_list->gws_obj; oa = p->bo_list->oa_obj;
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
/* Make sure we use the exclusive slot for shared BOs */
if (bo->prime_shared_count)
e->tv.num_shared = 0;
e->bo_va = amdgpu_vm_bo_find(vm, bo);
- }
- if (gds) { p->job->gds_base = amdgpu_bo_gpu_offset(gds) >> PAGE_SHIFT; p->job->gds_size = amdgpu_bo_size(gds) >> PAGE_SHIFT;
@@ -644,8 +649,13 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p, } error_validate:
- if (r)
- if (r) {
amdgpu_bo_list_for_each_entry(e, p->bo_list) {
dma_fence_chain_free(e->chain);
e->chain = NULL;
}
ttm_eu_backoff_reservation(&p->ticket, &p->validated);
- } out: return r; }
@@ -685,9 +695,17 @@ static void amdgpu_cs_parser_fini(struct amdgpu_cs_parser *parser, int error, { unsigned i;
- if (error && backoff)
- if (error && backoff) {
struct amdgpu_bo_list_entry *e;
amdgpu_bo_list_for_each_entry(e, parser->bo_list) {
dma_fence_chain_free(e->chain);
e->chain = NULL;
}
- ttm_eu_backoff_reservation(&parser->ticket, &parser->validated);
- } for (i = 0; i < parser->num_post_deps; i++) { drm_syncobj_put(parser->post_deps[i].syncobj);
@@ -1260,6 +1278,20 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p, amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
- amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct dma_resv *resv = e->tv.bo->base.resv;
struct dma_fence_chain *chain = e->chain;
if (!chain)
continue;
dma_fence_chain_init(chain, dma_resv_excl_fence(resv),
dma_fence_get(p->fence), 1);
rcu_assign_pointer(resv->fence_excl, &chain->base);
So for safety, since this is now a driver interface, I was thinking of a helper which does this entire dance and _also_ adds the new fence to the shared slots. This way we don't let drivers open-code this rather tricky operation.
Well I only see this as a workaround for amdgpu and don't want to leak it into other drivers.
If somebody else wants to adopt it we should probably consider fixing the dma_resv object instead.
Well semantically this is exactly what the sync_file import from Jason does. And it's kinda also what a lot of the drivers with opt-out implicit sync should be doing.
So I really think we do need this as a dma_resv primitive in form of a function drivers can use. Once we have that rolled out and tested some, then we can fix up the internals of dma_resv to make this case more efficient.
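As a rough idea of how such a primitive would then be used on the driver side (purely hypothetical, reusing the assumed dma_resv_add_shared_excl_fence() name from above and assuming it drops the reservation object's reference to the old exclusive fence), the open-coded block in amdgpu_cs_submit() could become something like:

	amdgpu_bo_list_for_each_entry(e, p->bo_list) {
		struct dma_resv *resv = e->tv.bo->base.resv;

		if (!e->chain)
			continue;

		/* Chain the new CS fence behind the current exclusive fence;
		 * take an extra reference on the old fence because the helper
		 * is assumed to drop the resv's own reference to it. */
		dma_fence_chain_init(e->chain,
				     dma_fence_get(dma_resv_excl_fence(resv)),
				     dma_fence_get(p->fence), 1);

		dma_resv_add_shared_excl_fence(resv, &e->chain->base);

		/* The reservation object now holds the needed references. */
		dma_fence_put(&e->chain->base);
		e->chain = NULL;
	}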
You probably also want to then combine this with ttm_eu_fence_buffer_objects below I think so you're not walking this list 2x.
Iow I'd put the fence_chain parameter into struct ttm_validate_buffer and then let ttm_eu_fence_buffer_objects() do this, by calling a new dma_resv_add_shared_excl_fence. Ideally the same helper that also Jason's sync_file import will use. We want to hide the inner workings of dma_resv as much as possible ofc.
The other thing I'm wondering is whether this needs to be wrapped in a single seqlock or whether we don't have ordering issues if we split up the update here? Setting the exclusive fence before we also added it to the shared slot can at least violate the "shared fences signal after exclusive if both are present"
Uff, good point.
Plus it's actually fairly tricky thing :-) -Daniel
Finally I guess we need to sprinkle the manual garbage collector somewhere here too.
That is done automatically when somebody iterates the chain node.
Christian.
But aside from the interface polish and correctness against races I think this looks solid in the big picture.
Cheers, Daniel
e->chain = NULL;
- }
- ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence); mutex_unlock(&p->adev->notifier_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c index c3053b83b80c..23219fc3b28c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c @@ -42,48 +42,6 @@ #include <linux/pci-p2pdma.h> #include <linux/pm_runtime.h> -static int -__dma_resv_make_exclusive(struct dma_resv *obj) -{
- struct dma_fence **fences;
- unsigned int count;
- int r;
- if (!dma_resv_shared_list(obj)) /* no shared fences to convert */
return 0;
- r = dma_resv_get_fences(obj, NULL, &count, &fences);
- if (r)
return r;
- if (count == 0) {
/* Now that was unexpected. */
- } else if (count == 1) {
dma_resv_add_excl_fence(obj, fences[0]);
dma_fence_put(fences[0]);
kfree(fences);
- } else {
struct dma_fence_array *array;
array = dma_fence_array_create(count, fences,
dma_fence_context_alloc(1), 0,
false);
if (!array)
goto err_fences_put;
dma_resv_add_excl_fence(obj, &array->base);
dma_fence_put(&array->base);
- }
- return 0;
-err_fences_put:
- while (count--)
dma_fence_put(fences[count]);
- kfree(fences);
- return -ENOMEM;
-}
- /**
- amdgpu_dma_buf_attach - &dma_buf_ops.attach implementation
@@ -110,24 +68,6 @@ static int amdgpu_dma_buf_attach(struct dma_buf *dmabuf, if (r < 0) goto out;
- r = amdgpu_bo_reserve(bo, false);
- if (unlikely(r != 0))
goto out;
- /*
* We only create shared fences for internal use, but importers
* of the dmabuf rely on exclusive fences for implicitly
* tracking write hazards. As any of the current fences may
* correspond to a write, we need to convert all existing
* fences on the reservation object into a single exclusive
* fence.
*/
- r = __dma_resv_make_exclusive(bo->tbo.base.resv);
- if (r)
goto out;
- bo->prime_shared_count++;
- amdgpu_bo_unreserve(bo); return 0; out:
@@ -150,9 +90,6 @@ static void amdgpu_dma_buf_detach(struct dma_buf *dmabuf, struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj); struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
- if (attach->dev->driver != adev->dev->driver && bo->prime_shared_count)
bo->prime_shared_count--;
- pm_runtime_mark_last_busy(adev_to_drm(adev)->dev); pm_runtime_put_autosuspend(adev_to_drm(adev)->dev); }
@@ -406,8 +343,6 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct dma_buf *dma_buf) bo = gem_to_amdgpu_bo(gobj); bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT; bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT;
if (dma_buf->ops != &amdgpu_dmabuf_ops)
bo->prime_shared_count = 1;
dma_resv_unlock(resv); return gobj;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c index 1c3e3b608332..5d82120fc160 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c @@ -829,7 +829,8 @@ int amdgpu_gem_op_ioctl(struct drm_device *dev, void *data, break; } case AMDGPU_GEM_OP_SET_PLACEMENT:
if (robj->prime_shared_count && (args->value & AMDGPU_GEM_DOMAIN_VRAM)) {
if (robj->tbo.base.import_attach &&
args->value & AMDGPU_GEM_DOMAIN_VRAM) { r = -EINVAL; amdgpu_bo_unreserve(robj); break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 96447e1d4c9c..30951b593809 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -871,7 +871,7 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain, return -EINVAL; /* A shared bo cannot be migrated to VRAM */
- if (bo->prime_shared_count || bo->tbo.base.import_attach) {
- if (bo->tbo.base.import_attach) { if (domain & AMDGPU_GEM_DOMAIN_GTT) domain = AMDGPU_GEM_DOMAIN_GTT; else
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index b35962702278..55d7e90eaa72 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -98,7 +98,6 @@ struct amdgpu_bo { struct ttm_buffer_object tbo; struct ttm_bo_kmap_obj kmap; u64 flags;
- unsigned prime_shared_count; /* per VM structure for page tables and with virtual addresses */ struct amdgpu_vm_bo_base *vm_bo; /* Constant after initialization */
-- 2.25.1
On 2021-06-10 11:17 a.m., Christian König wrote:
Since we can't find a consensus on how to move forward with the dma_resv object I concentrated on changing the approach for amdgpu first.
This new approach changes how the driver stores the command submission fence in the dma_resv object in DMA-buf exported BOs.
For exported BOs we now store the CS fence in a dma_fence_chain container and assign that one to the exclusive fences slot.
During synchronization this dma_fence_chain container is unpacked again and the containing fences handled individually.
This has a little bit more overhead than the old approach, but it allows for waiting for the exclusive slot for writes again.
Nice!
This seems to work as expected with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880: Some buffers now don't poll readable at first, until the GPU is done processing them.
Unfortunately, as expected, without a high priority context for the compositor which can preempt client drawing, this isn't enough to prevent slow clients from slowing down the compositor as well. But it should already help for fullscreen apps where the compositor can directly scan out the client buffers at least.
On 10.06.21 at 18:34, Michel Dänzer wrote:
On 2021-06-10 11:17 a.m., Christian König wrote:
Since we can't find a consensus on how to move forward with the dma_resv object I concentrated on changing the approach for amdgpu first.
This new approach changes how the driver stores the command submission fence in the dma_resv object in DMA-buf exported BOs.
For exported BOs we now store the CS fence in a dma_fence_chain container and assign that one to the exclusive fences slot.
During synchronization this dma_fence_chain container is unpacked again and the containing fences handled individually.
This has a little bit more overhead than the old approach, but it allows for waiting for the exclusive slot for writes again.
Nice!
This seems to work as expected with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880: Some buffers now don't poll readable at first, until the GPU is done processing them.
Well I'm still pretty sure that any polling on the CPU should be avoided, but yes it is nice to have that working now in general.
Unfortunately, as expected, without a high priority context for the compositor which can preempt client drawing, this isn't enough to prevent slow clients from slowing down the compositor as well. But it should already help for fullscreen apps where the compositor can directly scan out the client buffers at least.
I have seen patches for this flying by internally, but I'm not sure about the status.
Christian.