That is superfluous now.
Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
 1 file changed, 5 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 85908c7f913e..65078dd3c82c 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	if (first) {
 		/* Add the entity to the run queue */
 		spin_lock(&entity->rq_lock);
-		if (!entity->rq) {
-			DRM_ERROR("Trying to push to a killed entity\n");
-			spin_unlock(&entity->rq_lock);
-			return;
-		}
 		drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 		drm_sched_wakeup(entity->rq->sched);
Ping. Any objections to that?
Christian.
I forgot about this since we started discussing possible scenarios of processes and threads.
In any case, this check is redundant.

Acked-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>
Nayan
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
But I still have questions about entity->last_user (I didn't notice this before).

Looks to me like there is a race condition with its current usage. Let's say process A was preempted after doing drm_sched_entity_flush->cmpxchg(...); now process B, working on the same (forked) entity, is inside drm_sched_entity_push_job, writes its PID to entity->last_user and also executes drm_sched_rq_add_entity. Now process A runs again and executes drm_sched_rq_remove_entity, inadvertently removing process B's entity from its scheduler rq.

Looks to me like instead we should lock together entity->last_user accesses and adds/removals of the entity to the rq.
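Something along these lines, just as a rough sketch of the idea (not a tested patch; the exact placement in the two functions is an assumption):

    /* drm_sched_entity_push_job(): make the last_user update and the
     * (re-)add to the rq atomic with respect to the flush path.
     */
    spin_lock(&entity->rq_lock);
    entity->last_user = current->group_leader;
    if (first)
            drm_sched_rq_add_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);

    /* drm_sched_entity_flush(): only drop the entity from the rq if we
     * are still the last user, checked under the same lock.
     */
    spin_lock(&entity->rq_lock);
    if (entity->last_user == current->group_leader &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
            drm_sched_rq_remove_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);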
Andrey
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
I can take care of this.
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
Attached.
If the general idea in the patch is OK I can think of a test (and maybe add it to the libdrm amdgpu tests) to actually simulate this scenario with 2 forked concurrent processes working on the same entity's job queue, where one is dying while the other keeps pushing to the same queue. For now I only tested it with a normal boot and running multiple glxgears concurrently - which doesn't really test this code path since I think each of them works on its own FD.
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
I would rather like to avoid taking the lock in the hot path.
How about this:
    /* For killed process disable any more IBs enqueue right now */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
            grab_lock();
            drm_sched_rq_remove_entity(entity->rq, entity);
            if (READ_ONCE(&entity->last_user) != NULL)
                    drm_sched_rq_add_entity(entity->rq, entity);
            drop_lock();
    }
Christian.
I assume that this is the only code change and that no locks are taken in drm_sched_entity_push_job -

What happens if process A runs drm_sched_entity_push_job after this code was executed by the (dying) process B and there are still jobs in the queue (the wait_event terminated prematurely)? The entity has already been removed from the rq, but bool 'first' in drm_sched_entity_push_job will return false, so the entity will not be reinserted back into the rq entity list and no wake-up trigger will happen for process A pushing a new job.
Another issue below -
Andrey
On 08/14/2018 03:05 AM, Christian König wrote:
I would rather like to avoid taking the lock in the hot path.
How about this:
    /* For killed process disable any more IBs enqueue right now */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
            grab_lock();
            drm_sched_rq_remove_entity(entity->rq, entity);
            if (READ_ONCE(&entity->last_user) != NULL)
This condition is true because just now process A did drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, current->group_leader); so the line below executes and the entity is reinserted into the rq. Let's say also that the entity's job queue is empty now. For process A, bool 'first' will be true and hence drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity) will also take place, causing a double insertion of the entity into the rq list.
Andrey
                    drm_sched_rq_add_entity(entity->rq, entity);
            drop_lock();
    }
Christian.
Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
I assume that this is the only code change and no locks are taken in drm_sched_entity_push_job -
What are you talking about? You surely now take locks in drm_sched_entity_push_job():
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
What happens if process A runs drm_sched_entity_push_job after this code was executed by the (dying) process B and there are still jobs in the queue (the wait_event terminated prematurely)? The entity has already been removed from the rq, but bool 'first' in drm_sched_entity_push_job will return false, so the entity will not be reinserted back into the rq entity list and no wake-up trigger will happen for process A pushing a new job.
Thought about this as well, but in this case I would say: Shit happens!
The dying process did some command submission and because of this the entity was killed as well when the process died and that is legitimate.
Another issue below -
Andrey
On 08/14/2018 03:05 AM, Christian König wrote:
I would rather like to avoid taking the lock in the hot path.
How about this:
    /* For killed process disable any more IBs enqueue right now */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
            grab_lock();
            drm_sched_rq_remove_entity(entity->rq, entity);
            if (READ_ONCE(&entity->last_user) != NULL)
This condition is true because just now process A did drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, current->group_leader); so the line below executes and the entity is reinserted into the rq. Let's say also that the entity's job queue is empty now. For process A, bool 'first' will be true and hence drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity) will also take place, causing a double insertion of the entity into the rq list.
Calling drm_sched_rq_add_entity() is harmless, it is protected against double insertion.
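Roughly, the guard there looks like this (paraphrasing from memory, not an exact copy of the file):

    /* drm_sched_rq_add_entity(), simplified: bail out early if the
     * entity is already on the rq list, otherwise link it in.
     */
    if (!list_empty(&entity->list))
            return;
    spin_lock(&rq->lock);
    list_add_tail(&entity->list, &rq->entities);
    spin_unlock(&rq->lock);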
But thinking more about it, your idea of adding a killed or finished flag becomes more and more appealing as a way to get consistent handling here.
Christian.
On 08/14/2018 11:26 AM, Christian König wrote:
Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
I assume that this is the only code change and no locks are taken in drm_sched_entity_push_job -
What are you talking about? You surely now take locks in drm_sched_entity_push_job():
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
Oh, so your code in drm_sched_entity_flush still relies on my code in drm_sched_entity_push_job, OK.
Calling drm_sched_rq_add_entity() is harmless, it is protected against double insertion.
Missed that one, right...
But thinking more about it, your idea of adding a killed or finished flag becomes more and more appealing as a way to get consistent handling here.
Christian.
So to be clear - you would like something like this: remove entity->last_user and add a 'stopped' flag to drm_sched_entity, set it in drm_sched_entity_flush, and in drm_sched_entity_push_job check 'if (entity->stopped)' and, when true, just return an error back to user space instead of pushing the job?
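Just to illustrate what I mean, a rough sketch only (the 'stopped' field, the error value and the changed return type of drm_sched_entity_push_job are assumptions, not an actual patch):

    /* drm_sched_entity_flush(): mark the entity so nothing can be
     * pushed to it anymore, then take it off the rq.
     */
    spin_lock(&entity->rq_lock);
    entity->stopped = true;                 /* assumed new field */
    drm_sched_rq_remove_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);

    /* drm_sched_entity_push_job(): would need to return an error code
     * instead of void so this can be reported back to user space.
     */
    if (entity->stopped) {
            DRM_ERROR("Trying to push to a stopped entity\n");
            return -ENOENT;                 /* error value just an example */
    }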
Andrey