That is superfluous now.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
 1 file changed, 5 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 85908c7f913e..65078dd3c82c 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	if (first) {
 		/* Add the entity to the run queue */
 		spin_lock(&entity->rq_lock);
-		if (!entity->rq) {
-			DRM_ERROR("Trying to push to a killed entity\n");
-			spin_unlock(&entity->rq_lock);
-			return;
-		}
 		drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 		drm_sched_wakeup(entity->rq->sched);
Ping. Any objections to that?
Christian.
I forgot about this since we started discussing possible scenarios of processes and threads.
In any case, this check is redundant.

Acked-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>
Nayan
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
But I still have questions about entity->last_user (I didn't notice this before).

Looks to me like there is a race condition with its current usage. Let's say process A was preempted after doing drm_sched_entity_flush->cmpxchg(...); now process B, working on the same (forked) entity, is inside drm_sched_entity_push_job, writes its PID to entity->last_user and also executes drm_sched_rq_add_entity. Now process A runs again and executes drm_sched_rq_remove_entity, inadvertently removing process B's entity from its scheduler rq.

Looks to me like instead we should lock together entity->last_user accesses and adds/removals of the entity to the rq.
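Something along these lines, just as a rough sketch of the idea (not a tested patch; the exact placement in the two functions is an assumption):

    /* drm_sched_entity_push_job(): make the last_user update and the
     * (re-)add to the rq atomic with respect to the flush path.
     */
    spin_lock(&entity->rq_lock);
    entity->last_user = current->group_leader;
    if (first)
            drm_sched_rq_add_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);

    /* drm_sched_entity_flush(): only drop the entity from the rq if we
     * are still the last user, checked under the same lock.
     */
    spin_lock(&entity->rq_lock);
    if (entity->last_user == current->group_leader &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
            drm_sched_rq_remove_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);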
Andrey
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
I can take care of this.
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
Attached.
If the general idea in the patch is OK I can think of a test (and maybe add it to the libdrm amdgpu tests) to actually simulate this scenario with 2 forked concurrent processes working on the same entity's job queue, where one is dying while the other keeps pushing to the same queue. For now I only tested it with a normal boot and running multiple glxgears concurrently - which doesn't really test this code path since I think each of them works on its own FD.
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
I would rather like to avoid taking the lock in the hot path.
How about this:
    /* For killed process disable any more IBs enqueue right now */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
            grab_lock();
            drm_sched_rq_remove_entity(entity->rq, entity);
            if (READ_ONCE(&entity->last_user) != NULL)
                    drm_sched_rq_add_entity(entity->rq, entity);
            drop_lock();
    }
Christian.
I assume that this is the only code change and that no locks are taken in drm_sched_entity_push_job -

What happens if process A runs drm_sched_entity_push_job after this code was executed by the (dying) process B and there are still jobs in the queue (the wait_event terminated prematurely)? The entity has already been removed from the rq, but bool 'first' in drm_sched_entity_push_job will return false, so the entity will not be reinserted back into the rq entity list and no wake-up trigger will happen for process A pushing a new job.
Another issue below -
Andrey
On 08/14/2018 03:05 AM, Christian König wrote:
I would rather like to avoid taking the lock in the hot path.
How about this:
    /* For killed process disable any more IBs enqueue right now */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
            grab_lock();
            drm_sched_rq_remove_entity(entity->rq, entity);
            if (READ_ONCE(&entity->last_user) != NULL)
This condition is true because just now process A did drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, current->group_leader); so the line below executes and the entity is reinserted into the rq. Let's say also that the entity's job queue is empty now. For process A, bool 'first' will be true and hence drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity) will also take place, causing a double insertion of the entity into the rq list.
Andrey
                    drm_sched_rq_add_entity(entity->rq, entity);
            drop_lock();
    }
Christian.
Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
I assume that this is the only code change and no locks are taken in drm_sched_entity_push_job -
What are you talking about? You surely now take locks in drm_sched_entity_push_job():
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
What happens if process A runs drm_sched_entity_push_job after this code was executed by the (dying) process B and there are still jobs in the queue (the wait_event terminated prematurely)? The entity has already been removed from the rq, but bool 'first' in drm_sched_entity_push_job will return false, so the entity will not be reinserted back into the rq entity list and no wake-up trigger will happen for process A pushing a new job.
Thought about this as well, but in this case I would say: Shit happens!
The dying process did some command submission and because of this the entity was killed as well when the process died and that is legitimate.
Another issue below -
Andrey
On 08/14/2018 03:05 AM, Christian König wrote:
I would rather like to avoid taking the lock in the hot path.
How about this:
    /* For killed process disable any more IBs enqueue right now */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
            grab_lock();
            drm_sched_rq_remove_entity(entity->rq, entity);
            if (READ_ONCE(&entity->last_user) != NULL)
This condition is true because just now process A did drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, current->group_leader); so the line below executes and the entity is reinserted into the rq. Let's say also that the entity's job queue is empty now. For process A, bool 'first' will be true and hence drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity) will also take place, causing a double insertion of the entity into the rq list.
Calling drm_sched_rq_add_entity() is harmless, it is protected against double insertion.
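Roughly, the guard there looks like this (paraphrasing from memory, not an exact copy of the file):

    /* drm_sched_rq_add_entity(), simplified: bail out early if the
     * entity is already on the rq list, otherwise link it in.
     */
    if (!list_empty(&entity->list))
            return;
    spin_lock(&rq->lock);
    list_add_tail(&entity->list, &rq->entities);
    spin_unlock(&rq->lock);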
But thinking more about it, your idea of adding a killed or finished flag becomes more and more appealing as a way to get consistent handling here.
Christian.
On 08/14/2018 11:26 AM, Christian König wrote:
Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
I assume that this is the only code change and no locks are taken in drm_sched_entity_push_job -
What are you talking about? You surely now take locks in drm_sched_entity_push_job():
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
Oh, so your code in drm_sched_entity_flush still relies on my code in drm_sched_entity_push_job, OK.
Calling drm_sched_rq_add_entity() is harmless, it is protected against double insertion.
Missed that one, right...
But thinking more about it, your idea of adding a killed or finished flag becomes more and more appealing as a way to get consistent handling here.
Christian.
So to be clear - you would like something like this: remove entity->last_user and add a 'stopped' flag to drm_sched_entity, set it in drm_sched_entity_flush, and in drm_sched_entity_push_job check 'if (entity->stopped)' and, when true, just return an error back to user space instead of pushing the job?
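Just to illustrate what I mean, a rough sketch only (the 'stopped' field, the error value and the changed return type of drm_sched_entity_push_job are assumptions, not an actual patch):

    /* drm_sched_entity_flush(): mark the entity so nothing can be
     * pushed to it anymore, then take it off the rq.
     */
    spin_lock(&entity->rq_lock);
    entity->stopped = true;                 /* assumed new field */
    drm_sched_rq_remove_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);

    /* drm_sched_entity_push_job(): would need to return an error code
     * instead of void so this can be reported back to user space.
     */
    if (entity->stopped) {
            DRM_ERROR("Trying to push to a stopped entity\n");
            return -ENOENT;                 /* error value just an example */
    }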
Andrey