Re: [PATCH] gpu: drm: remove redundant dma_fence_put() when drm_sched_job_add_dependency() fails

28 Apr 2022

      On 2022-04-28 04:56, Hangyu Hua wrote:
...
On 2022/4/27 22:43, Andrey Grodzovsky wrote:
...
On 2022-04-26 22:31, Hangyu Hua wrote:
...
On 2022/4/26 22:55, Andrey Grodzovsky wrote:
...
On 2022-04-25 22:54, Hangyu Hua wrote:
...
On 2022/4/25 23:42, Andrey Grodzovsky wrote:
...
On 2022-04-25 04:36, Hangyu Hua wrote:
> When drm_sched_job_add_dependency() fails, dma_fence_put() will be called
> internally. Calling it again after drm_sched_job_add_dependency() finishes
> may result in a dangling pointer.
>
> Fix this by removing redundant dma_fence_put().
>
> Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
> ---
>   drivers/gpu/drm/lima/lima_gem.c        | 1 -
>   drivers/gpu/drm/scheduler/sched_main.c | 1 -
>   2 files changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_gem.c b/drivers/gpu/drm/lima/lima_gem.c
> index 55bb1ec3c4f7..99c8e7f6bb1c 100644
> --- a/drivers/gpu/drm/lima/lima_gem.c
> +++ b/drivers/gpu/drm/lima/lima_gem.c
> @@ -291,7 +291,6 @@ static int lima_gem_add_deps(struct drm_file *file, struct lima_submit *submit)
>           err = drm_sched_job_add_dependency(&submit->task->base, fence);
>           if (err) {
> -            dma_fence_put(fence);
>               return err;
Makes sense here
>           }
>       }
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index b81fceb0b8a2..ebab9eca37a8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -708,7 +708,6 @@ int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
>           dma_fence_get(fence);
>           ret = drm_sched_job_add_dependency(job, fence);
>           if (ret) {
> -            dma_fence_put(fence);
Not sure about this one. If you look at the relevant commits,
'drm/scheduler: fix drm_sched_job_add_implicit_dependencies' and
'drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder',
you will see that the dma_fence_put() here balances the extra
dma_fence_get() above.
Andrey
I don't think so. I checked the call chain and found no additional 
dma_fence_get(). But dma_fence_get() needs to be called before 
drm_sched_job_add_dependency() to keep the counter balanced.
I'm not saying there is an additional get. I'm just saying that
drm_sched_job_add_dependency() doesn't grab an extra reference to the
fences it stores, so this needs to be done outside. For that,
drm_sched_job_add_implicit_dependencies() calls dma_fence_get(), and
if the addition fails you just call dma_fence_put() to keep the
counter balanced.
drm_sched_job_add_implicit_dependencies() will call
drm_sched_job_add_dependency(). And drm_sched_job_add_dependency()
already calls dma_fence_put() when it fails. Calling dma_fence_put()
twice doesn't make sense.
dma_fence_get() is in [2]. But dma_fence_put() will be called in [1] 
and [3] when xa_alloc() fails.
The way I see it, [2] and [3] are the matching *get* and *put*
respectively. The [1] *put* is against the original
dma_fence_init->kref_init of the fence, which always sets the refcount
to 1.
Also in support of this see commit 'drm/scheduler: fix 
drm_sched_job_add_implicit_dependencies harder' - it says there 
"drm_sched_job_add_dependency() could drop the last ref"  - this last 
ref is the original refcount set by dma_fence_init->kref
Andrey
You can see that drm_sched_job_add_dependency() has three return paths:
[4], [5] and [1]. [4] and [5] return 0. [1] returns an error.
There will be three weird problems if you're right:

The [5] path will trigger a refcount leak because ret is 0 in *if* [6].

Terminology confusion here: [5] is a 'put', so it cannot cause a leak
by definition. An extra unbalanced 'get' causes a leak because the memory
is never released; an extra 'put' will probably just cause a warning in
kref_put, or maybe a double free.
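
The distinction drawn above between an unbalanced 'get' and an unbalanced 'put' can be sketched with a small userspace model. This is only an illustration: struct model_fence and the model_* and extra_* functions are hypothetical stand-ins, not kernel APIs, and the "released" flag merely marks where a real kref would invoke its release callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a refcounted fence; not kernel code. */
struct model_fence {
    int refcount;   /* models the kref inside struct dma_fence */
    bool released;  /* set once the count reaches zero */
};

static void model_fence_init(struct model_fence *f)
{
    f->refcount = 1;      /* the initial reference, as after kref_init() */
    f->released = false;
}

static void model_fence_get(struct model_fence *f)
{
    assert(!f->released); /* taking a reference on a dead object is a bug */
    f->refcount++;
}

/* Returns false when the put hits an object that was already released. */
static bool model_fence_put(struct model_fence *f)
{
    if (f->released)
        return false;     /* extra put: warning or double free in real code */
    if (--f->refcount == 0)
        f->released = true;
    return true;
}

/* One extra get: the owner's final put no longer releases the object. */
bool extra_get_leaks(void)
{
    struct model_fence f;

    model_fence_init(&f);
    model_fence_get(&f);          /* unbalanced get */
    model_fence_put(&f);          /* owner drops its own reference */
    return !f.released;           /* object stays alive: a leak */
}

/* One extra put: the object is released while the owner still uses it. */
bool extra_put_overreleases(void)
{
    struct model_fence f;

    model_fence_init(&f);
    model_fence_put(&f);          /* unbalanced put releases the object */
    return !model_fence_put(&f);  /* owner's own put now hits a dead object */
}
```

In this model both functions return true: the extra get leaves the object allocated forever, while the extra put makes a later, legitimate put operate on an already released object.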
...
Otherwise [2] and [5] are matching *get* and *put* in here.
Exactly, they are matching - so until this point all good and no 'leak' 
then, no ?
...

The [4] path needs an additional dma_fence_get() to add the fence as a
job dependency. fence is from obj->resv. Taking msm as an example
obj->resv is from etnaviv_ioctl_gem_submit()->submit_lookup_objects().
It is not possible that an object has *refcount == 1* but is
referenced in two places. So the dma_fence_get() called in [2] is for [4].
By the way, [3] doesn't execute in this case.
Still don't see the problem - [2] is the additional dma_fence_get() you 
need here (just as you say above).
...

This one is a doubt. You can see in "[PATCH] drm/scheduler: fix
drm_sched_job_add_implicit_dependencies harder" that
drm_sched_job_add_dependency() could drop the last ref, so we need to do
the dma_fence_get() first. But the last ref will still be dropped in [3] if
drm_sched_job_add_dependency() takes path [1]. And there is only a
*return* between [1] and [3]. Is this necessary? I think Rob Clark
wants to avoid the last ref being dropped in
drm_sched_job_add_implicit_dependencies() because fence is still used
by obj->resv.
In the scenario above - if we go through path [1], the refcount before [1]
starts is 2 - one from the original kref_init and one from [2], and so it's
balanced against 2 puts (one from [1] and one from [3]), so I still don't
see a problem.
I suggest that you give a specific scenario from a fence ref-count
perspective that your patch fixes. I might be wrong, but unless you give
a specific case where the 'put' in [3] is redundant I just can't see it.
Andrey
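
The counting in the path [1] scenario can be traced with a minimal userspace sketch. It assumes, as the scenario does, that the fence enters the loop with a refcount of 1; struct model_fence and the model_* functions are hypothetical and only mirror the [1], [2] and [3] markers used in this thread. They are not the kernel implementation.

```c
#include <assert.h>

/* Hypothetical model: only the reference count is tracked. */
struct model_fence {
    int refcount;
};

static void model_get(struct model_fence *f)
{
    f->refcount++;
}

static void model_put(struct model_fence *f)
{
    assert(f->refcount > 0);   /* a put on a zero count would be a bug */
    f->refcount--;
}

/* Models drm_sched_job_add_dependency() when xa_alloc() fails (path [1]). */
static int model_add_dependency_fail(struct model_fence *f)
{
    model_put(f);              /* [1] */
    return -12;                /* stands in for -ENOMEM */
}

/*
 * Models one iteration of drm_sched_job_add_implicit_dependencies() on the
 * failure path. Records the count after [2], after [1] and after [3] in
 * trace[0..2] and returns the final count.
 */
int model_failure_path(int initial, int trace[3])
{
    struct model_fence f = { .refcount = initial };
    int ret;

    model_get(&f);                         /* [2] */
    trace[0] = f.refcount;

    ret = model_add_dependency_fail(&f);   /* [1] */
    trace[1] = f.refcount;

    if (ret)
        model_put(&f);                     /* [3] */
    trace[2] = f.refcount;

    return f.refcount;
}
```

Starting from 1, the recorded counts are 2, 1 and 0. The model only shows the arithmetic; whether the initial reference is one that this code path is entitled to drop is the question under discussion in this thread.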
...
int drm_sched_job_add_dependency(struct drm_sched_job *job,
                                 struct dma_fence *fence)
{
        ...
        xa_for_each(&job->dependencies, index, entry) {
                if (entry->context != fence->context)
                        continue;

                if (dma_fence_is_later(fence, entry)) {
                        dma_fence_put(entry);
                        xa_store(&job->dependencies, index, fence, GFP_KERNEL);    <---- [4]
                } else {
                        dma_fence_put(fence);    <---- [5]
                }
                return 0;
        }

        ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
        if (ret != 0)
                dma_fence_put(fence);   <---- [1]

        return ret;
}

int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
                                            struct drm_gem_object *obj,
                                            bool write)
{
        struct dma_resv_iter cursor;
        struct dma_fence *fence;
        int ret;

        dma_resv_for_each_fence(&cursor, obj->resv, write, fence) {
                /* Make sure to grab an additional ref on the added fence */
                dma_fence_get(fence);   <---- [2]
                ret = drm_sched_job_add_dependency(job, fence);
                if (ret) {      <---- [6]
                        dma_fence_put(fence);   <---- [3]
                        return ret;
                }
        }
        return 0;
}
Thanks,
hangyu
...
...
int drm_sched_job_add_dependency(struct drm_sched_job *job,
                                 struct dma_fence *fence)
{
        ...
        ret = xa_alloc(&job->dependencies, &id, fence, xa_limit_32b, GFP_KERNEL);
        if (ret != 0)
                dma_fence_put(fence);    <--- [1]

        return ret;
}
EXPORT_SYMBOL(drm_sched_job_add_dependency);

int drm_sched_job_add_implicit_dependencies(struct drm_sched_job *job,
                                            struct drm_gem_object *obj,
                                            bool write)
{
        struct dma_resv_iter cursor;
        struct dma_fence *fence;
        int ret;

        dma_resv_for_each_fence(&cursor, obj->resv, write, fence) {
                /* Make sure to grab an additional ref on the added fence */
                dma_fence_get(fence);    <--- [2]
                ret = drm_sched_job_add_dependency(job, fence);
                if (ret) {
                        dma_fence_put(fence);    <--- [3]
                        return ret;
                }
        }
        return 0;
}
...
...
On the other hand, dma_fence_get() and dma_fence_put() are
meaningless here if there is an extra dma_fence_get(), because the
counter will not decrease to 0 during drm_sched_job_add_dependency().
I checked the call chain as follows:
msm_ioctl_gem_submit()
-> submit_fence_sync()
-> drm_sched_job_add_implicit_dependencies()
Can you maybe trace or print one such example of the problematic
refcount that you are trying to fix? I still don't see where
the problem is.
Andrey
I also wish I could. System logs would make this easy, but I don't
have a corresponding physical GPU device.
drm_sched_job_add_implicit_dependencies() is only used on a few devices.
Thanks.
...
...
Thanks,
Hangyu
...
>               return ret;
>           }
>       }
