Hi all,
based on my current theory on how a deadlock could happen in the buffer allocation code, these two patches should fix the deadlock without having a use-after-free.
I'm still working on a way to clean up the ttm_bo_init sequence overall, but I'm separating these two out for a hopefully quick fix, since in my book use-after-frees are worse than deadlocks.
Samuel, I'd very much appreciate if you could check that the deadlocks in Hitman remain fixed with these two patches.
Thanks, Nicolai
From: Nicolai Hähnle nicolai.haehnle@amd.com
Fixes a potential race condition in amdgpu that looks as follows:
Task 1: attempt ttm_bo_init, but ttm_bo_validate fails Task 1: add BO to global list anyway Task 2: grabs hold of the BO, waits on its reservation lock Task 1: releases its reference of the BO; never gives up the reservation lock
The patch "drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()" attempts to fix that by releasing the reservation lock in amdgpu code; unfortunately, it introduces a use-after-free when this race _doesn't_ happen.
This patch should fix the race properly by never adding the BO to the global list in the first place.
Cc: Samuel Pitoiset samuel.pitoiset@gmail.com Cc: zhoucm1 david1.zhou@amd.com Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com --- drivers/gpu/drm/ttm/ttm_bo.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 239a957..76bee42 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1215,18 +1215,20 @@ int ttm_bo_init(struct ttm_bo_device *bdev, if (likely(!ret)) ret = ttm_bo_validate(bo, placement, interruptible, false);
- if (!resv) { + if (!resv) ttm_bo_unreserve(bo);
- } else if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) { + if (unlikely(ret)) { + ttm_bo_unref(&bo); + return ret; + } + + if (resv && !(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) { spin_lock(&bo->glob->lru_lock); ttm_bo_add_to_lru(bo); spin_unlock(&bo->glob->lru_lock); }
- if (unlikely(ret)) - ttm_bo_unref(&bo); - return ret; } EXPORT_SYMBOL(ttm_bo_init);
Am 14.02.2017 um 10:18 schrieb Nicolai Hähnle:
From: Nicolai Hähnle nicolai.haehnle@amd.com
Fixes a potential race condition in amdgpu that looks as follows:
Task 1: attempt ttm_bo_init, but ttm_bo_validate fails Task 1: add BO to global list anyway Task 2: grabs hold of the BO, waits on its reservation lock Task 1: releases its reference of the BO; never gives up the reservation lock
The patch "drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()" attempts to fix that by releasing the reservation lock in amdgpu code; unfortunately, it introduces a use-after-free when this race _doesn't_ happen.
This patch should fix the race properly by never adding the BO to the global list in the first place.
Cc: Samuel Pitoiset samuel.pitoiset@gmail.com Cc: zhoucm1 david1.zhou@amd.com Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com
NAK, that is actually not correct either.
The previous implementation doesn't have an use after free, but actually a memory leak.
I completely agree that adding the BO to the LRU in case of an error is incorrect, but ttm_bo_add_to_lru() will add some references to the BO.
So the following ttm_bo_unref() will just drop the initial reference.
Let's clean up this mess by moving the freeing of the BO in case of an error to the caller.
Regards, Christian.
drivers/gpu/drm/ttm/ttm_bo.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 239a957..76bee42 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1215,18 +1215,20 @@ int ttm_bo_init(struct ttm_bo_device *bdev, if (likely(!ret)) ret = ttm_bo_validate(bo, placement, interruptible, false);
- if (!resv) {
- if (!resv) ttm_bo_unreserve(bo);
- } else if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
- if (unlikely(ret)) {
ttm_bo_unref(&bo);
return ret;
- }
- if (resv && !(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) { spin_lock(&bo->glob->lru_lock); ttm_bo_add_to_lru(bo); spin_unlock(&bo->glob->lru_lock); }
- if (unlikely(ret))
ttm_bo_unref(&bo);
- return ret; } EXPORT_SYMBOL(ttm_bo_init);
On 14.02.2017 11:38, Christian König wrote:
Am 14.02.2017 um 10:18 schrieb Nicolai Hähnle:
From: Nicolai Hähnle nicolai.haehnle@amd.com
Fixes a potential race condition in amdgpu that looks as follows:
Task 1: attempt ttm_bo_init, but ttm_bo_validate fails Task 1: add BO to global list anyway Task 2: grabs hold of the BO, waits on its reservation lock Task 1: releases its reference of the BO; never gives up the reservation lock
The patch "drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()" attempts to fix that by releasing the reservation lock in amdgpu code; unfortunately, it introduces a use-after-free when this race _doesn't_ happen.
This patch should fix the race properly by never adding the BO to the global list in the first place.
Cc: Samuel Pitoiset samuel.pitoiset@gmail.com Cc: zhoucm1 david1.zhou@amd.com Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com
NAK, that is actually not correct either.
The previous implementation doesn't have an use after free, but actually a memory leak.
I completely agree that adding the BO to the LRU in case of an error is incorrect, but ttm_bo_add_to_lru() will add some references to the BO.
Yes, but those are bo->list_krefs, not bo->krefs. ttm_bo_unref will put the bo->kref, which should lead to ttm_bo_release. This will then add the buffer to the ddestroy list and eventually put all the list_krefs. (Or possibly, if the ttm_bo_wait in ttm_bo_cleanup_refs_or_queue fails, it will explicitly delete the buffer from the LRU, which should amount to the same, I think).
Or perhaps I'm still missing something, I'm not too familiar with the whole BO reference counting.
So the following ttm_bo_unref() will just drop the initial reference.
Let's clean up this mess by moving the freeing of the BO in case of an error to the caller.
Yes, see my other patches.
Thanks, Nicolai
Regards, Christian.
drivers/gpu/drm/ttm/ttm_bo.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 239a957..76bee42 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1215,18 +1215,20 @@ int ttm_bo_init(struct ttm_bo_device *bdev, if (likely(!ret)) ret = ttm_bo_validate(bo, placement, interruptible, false);
- if (!resv) {
- if (!resv) ttm_bo_unreserve(bo);
- } else if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
- if (unlikely(ret)) {
ttm_bo_unref(&bo);
return ret;
- }
- if (resv && !(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) { spin_lock(&bo->glob->lru_lock); ttm_bo_add_to_lru(bo); spin_unlock(&bo->glob->lru_lock); }
- if (unlikely(ret))
ttm_bo_unref(&bo);
} EXPORT_SYMBOL(ttm_bo_init);return ret;
Am 14.02.2017 um 12:37 schrieb Nicolai Hähnle:
On 14.02.2017 11:38, Christian König wrote:
Am 14.02.2017 um 10:18 schrieb Nicolai Hähnle:
From: Nicolai Hähnle nicolai.haehnle@amd.com
Fixes a potential race condition in amdgpu that looks as follows:
Task 1: attempt ttm_bo_init, but ttm_bo_validate fails Task 1: add BO to global list anyway Task 2: grabs hold of the BO, waits on its reservation lock Task 1: releases its reference of the BO; never gives up the reservation lock
The patch "drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()" attempts to fix that by releasing the reservation lock in amdgpu code; unfortunately, it introduces a use-after-free when this race _doesn't_ happen.
This patch should fix the race properly by never adding the BO to the global list in the first place.
Cc: Samuel Pitoiset samuel.pitoiset@gmail.com Cc: zhoucm1 david1.zhou@amd.com Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com
NAK, that is actually not correct either.
The previous implementation doesn't have an use after free, but actually a memory leak.
I completely agree that adding the BO to the LRU in case of an error is incorrect, but ttm_bo_add_to_lru() will add some references to the BO.
Yes, but those are bo->list_krefs, not bo->krefs. ttm_bo_unref will put the bo->kref, which should lead to ttm_bo_release. This will then add the buffer to the ddestroy list and eventually put all the list_krefs. (Or possibly, if the ttm_bo_wait in ttm_bo_cleanup_refs_or_queue fails, it will explicitly delete the buffer from the LRU, which should amount to the same, I think).
Or perhaps I'm still missing something, I'm not too familiar with the whole BO reference counting.
Ah, yes of course, never mind.
So the following ttm_bo_unref() will just drop the initial reference.
Let's clean up this mess by moving the freeing of the BO in case of an error to the caller.
Yes, see my other patches.
In this case the set is Reviewed-by: Christian König christian.koenig@amd.com.
Regards, Christian.
Thanks, Nicolai
Regards, Christian.
drivers/gpu/drm/ttm/ttm_bo.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 239a957..76bee42 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -1215,18 +1215,20 @@ int ttm_bo_init(struct ttm_bo_device *bdev, if (likely(!ret)) ret = ttm_bo_validate(bo, placement, interruptible, false);
- if (!resv) {
- if (!resv) ttm_bo_unreserve(bo);
- } else if (!(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
- if (unlikely(ret)) {
ttm_bo_unref(&bo);
return ret;
- }
- if (resv && !(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) { spin_lock(&bo->glob->lru_lock); ttm_bo_add_to_lru(bo); spin_unlock(&bo->glob->lru_lock); }
- if (unlikely(ret))
ttm_bo_unref(&bo);
} EXPORT_SYMBOL(ttm_bo_init);return ret;
From: Nicolai Hähnle nicolai.haehnle@amd.com
This reverts commit 38fc4856ad98f230bc91da0421dec69e4aee40f8, which introduces a use-after-free.
The underlying bug should be properly fixed with "drm/ttm: never add BO that failed to validate to the LRU list".
Cc: Samuel Pitoiset samuel.pitoiset@gmail.com Cc: zhoucm1 david1.zhou@amd.com Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 556236a..d1ef1d0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -403,11 +403,8 @@ int amdgpu_bo_create_restricted(struct amdgpu_device *adev, &bo->placement, page_align, !kernel, NULL, acc_size, sg, resv ? resv : &bo->tbo.ttm_resv, &amdgpu_ttm_bo_destroy); - if (unlikely(r != 0)) { - if (!resv) - ww_mutex_unlock(&bo->tbo.resv->lock); + if (unlikely(r != 0)) return r; - }
bo->tbo.priority = ilog2(bo->tbo.num_pages); if (kernel)
On 02/14/2017 10:18 AM, Nicolai Hähnle wrote:
Hi all,
based on my current theory on how a deadlock could happen in the buffer allocation code, these two patches should fix the deadlock without having a use-after-free.
I'm still working on a way to clean up the ttm_bo_init sequence overall, but I'm separating these two out for a hopefully quick fix, since in my book use-after-frees are worse than deadlocks.
Samuel, I'd very much appreciate if you could check that the deadlocks in Hitman remain fixed with these two patches.
Sure, I will check.
Thanks, Nicolai
amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On 02/14/2017 10:18 AM, Nicolai Hähnle wrote:
Hi all,
based on my current theory on how a deadlock could happen in the buffer allocation code, these two patches should fix the deadlock without having a use-after-free.
I'm still working on a way to clean up the ttm_bo_init sequence overall, but I'm separating these two out for a hopefully quick fix, since in my book use-after-frees are worse than deadlocks.
Samuel, I'd very much appreciate if you could check that the deadlocks in Hitman remain fixed with these two patches.
Looks good. Thanks!
Thanks, Nicolai
amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
dri-devel@lists.freedesktop.org