On 4/21/20 11:43 AM, Gerd Hoffmann wrote:
On Sat, Apr 18, 2020 at 02:39:17PM +0800, Caicai wrote:
When a qxl resource is released, the list that needs to be released is fetched from the linked list ring and cleared. When you empty the list, instead of trying to determine whether the ttm buffer object for each qxl in the list is locked, you release the qxl object and remove the element from the list until the list is empty. It was found that the linked list was cleared first, and that the lock on the corresponding ttm Bo for the QXL had not been released, so that the new qxl could not be locked when it used the TTM.
So the dma_resv_reserve_shared() call in qxl_release_validate_bo() is unbalanced? Because the dma_resv_unlock() call in qxl_release_fence_buffer_objects() never happens due to qxl_release_free_list() clearing the list beforehand? Is that correct?
we observe similar issue: RHEL7 guests crashes in qxl_draw_opaque_fb() qxl_release_fence_buffer_objects()
crashdump investigation shows that qlx_object was freed and reused, so its original content was re-written.
At the same time qxl_device have empty release_idr, ant there are no allocated qxl_bo_list entries. i.e. qxl_release_free was really called.
The only way I see for this to happen is that the guest is preempted between qxl_push_{cursor,command}_ring_release() and qxl_release_fence_buffer_objects() calls. The host can complete the qxl command then, signal the guest, and the IRQ handler calls qxl_release_free_list() before qxl_release_fence_buffer_objects() runs.
We think the same: qxl_release was freed by garbage collector before original thread had called qxl_release_fence_buffer_objects().
Looking through the code I think it should be safe to simply swap the qxl_release_fence_buffer_objects() + qxl_push_{cursor,command}_ring_release() calls to close that race window. Can you try that and see if it fixes the bug for you?
I'm going to prepare and test such patch but I have one question here: qxl_push_*_ring_release can be called with interruptible=true and fail How to correctly handle this case? Is the hunk below correct from your POV?
--- a/drivers/gpu/drm/qxl/qxl_ioctl.c +++ b/drivers/gpu/drm/qxl/qxl_ioctl.c @@ -261,12 +261,8 @@ static int qxl_process_single_command(struct qxl_device *qdev, apply_surf_reloc(qdev, &reloc_info[i]); }
+ qxl_release_fence_buffer_objects(release); ret = qxl_push_command_ring_release(qdev, release, cmd->type, true); - if (ret) - qxl_release_backoff_reserve_list(release); <<<< ???? - else - qxl_release_fence_buffer_objects(release); - out_free_bos: out_free_release:
Thank you, Vasily Averin
Hi,
The only way I see for this to happen is that the guest is preempted between qxl_push_{cursor,command}_ring_release() and qxl_release_fence_buffer_objects() calls. The host can complete the qxl command then, signal the guest, and the IRQ handler calls qxl_release_free_list() before qxl_release_fence_buffer_objects() runs.
We think the same: qxl_release was freed by garbage collector before original thread had called qxl_release_fence_buffer_objects().
Ok, nice, I think we can consider the issue being analyzed then ;)
Looking through the code I think it should be safe to simply swap the qxl_release_fence_buffer_objects() + qxl_push_{cursor,command}_ring_release() calls to close that race window. Can you try that and see if it fixes the bug for you?
I'm going to prepare and test such patch but I have one question here: qxl_push_*_ring_release can be called with interruptible=true and fail How to correctly handle this case? Is the hunk below correct from your POV?
Oh, right, the error code path will be quite different, checking ...
--- a/drivers/gpu/drm/qxl/qxl_ioctl.c +++ b/drivers/gpu/drm/qxl/qxl_ioctl.c @@ -261,12 +261,8 @@ static int qxl_process_single_command(struct qxl_device *qdev, apply_surf_reloc(qdev, &reloc_info[i]); }
qxl_release_fence_buffer_objects(release); ret = qxl_push_command_ring_release(qdev, release, cmd->type, true);
if (ret)
qxl_release_backoff_reserve_list(release); <<<< ????
else
qxl_release_fence_buffer_objects(release);
out_free_bos: out_free_release:
if (ret) qxl_release_free(qdev, release);
[ code context added ]
qxl_release_free() checks whenever a release is fenced and signals the fence in case it is so it doesn't wait for the signal forever. So, yes, I think qxl_release_free() should cleanup the release properly in any case and the patch chunk should be correct.
take care, Gerd
qxl_release should not be accesses after qxl_push_*_ring_release() calls: userspace driver can process submitted command quickly, move qxl_release into release_ring, generate interrupt and trigger garbage collector.
It can lead to crashes in qxl driver or trigger memory corruption in some kmalloc-192 slab object
Gerd Hoffmann proposes to swap the qxl_release_fence_buffer_objects() + qxl_push_{cursor,command}_ring_release() calls to close that race window.
cc: stable@vger.kernel.org Fixes: f64122c1f6ad ("drm: add new QXL driver. (v1.4)") Signed-off-by: Vasily Averin vvs@virtuozzo.com --- drivers/gpu/drm/qxl/qxl_cmd.c | 5 ++--- drivers/gpu/drm/qxl/qxl_display.c | 6 +++--- drivers/gpu/drm/qxl/qxl_draw.c | 2 +- drivers/gpu/drm/qxl/qxl_ioctl.c | 5 +---- 4 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/qxl/qxl_cmd.c b/drivers/gpu/drm/qxl/qxl_cmd.c index fa8762d15d0f..05863b253d68 100644 --- a/drivers/gpu/drm/qxl/qxl_cmd.c +++ b/drivers/gpu/drm/qxl/qxl_cmd.c @@ -500,8 +500,8 @@ int qxl_hw_surface_alloc(struct qxl_device *qdev, /* no need to add a release to the fence for this surface bo, since it is only released when we ask to destroy the surface and it would never signal otherwise */ - qxl_push_command_ring_release(qdev, release, QXL_CMD_SURFACE, false); qxl_release_fence_buffer_objects(release); + qxl_push_command_ring_release(qdev, release, QXL_CMD_SURFACE, false);
surf->hw_surf_alloc = true; spin_lock(&qdev->surf_id_idr_lock); @@ -543,9 +543,8 @@ int qxl_hw_surface_dealloc(struct qxl_device *qdev, cmd->surface_id = id; qxl_release_unmap(qdev, release, &cmd->release_info);
- qxl_push_command_ring_release(qdev, release, QXL_CMD_SURFACE, false); - qxl_release_fence_buffer_objects(release); + qxl_push_command_ring_release(qdev, release, QXL_CMD_SURFACE, false);
return 0; } diff --git a/drivers/gpu/drm/qxl/qxl_display.c b/drivers/gpu/drm/qxl/qxl_display.c index 09583a08e141..91f398d51cfa 100644 --- a/drivers/gpu/drm/qxl/qxl_display.c +++ b/drivers/gpu/drm/qxl/qxl_display.c @@ -510,8 +510,8 @@ static int qxl_primary_apply_cursor(struct drm_plane *plane) cmd->u.set.visible = 1; qxl_release_unmap(qdev, release, &cmd->release_info);
- qxl_push_cursor_ring_release(qdev, release, QXL_CMD_CURSOR, false); qxl_release_fence_buffer_objects(release); + qxl_push_cursor_ring_release(qdev, release, QXL_CMD_CURSOR, false);
return ret;
@@ -652,8 +652,8 @@ static void qxl_cursor_atomic_update(struct drm_plane *plane, cmd->u.position.y = plane->state->crtc_y + fb->hot_y;
qxl_release_unmap(qdev, release, &cmd->release_info); - qxl_push_cursor_ring_release(qdev, release, QXL_CMD_CURSOR, false); qxl_release_fence_buffer_objects(release); + qxl_push_cursor_ring_release(qdev, release, QXL_CMD_CURSOR, false);
if (old_cursor_bo != NULL) qxl_bo_unpin(old_cursor_bo); @@ -700,8 +700,8 @@ static void qxl_cursor_atomic_disable(struct drm_plane *plane, cmd->type = QXL_CURSOR_HIDE; qxl_release_unmap(qdev, release, &cmd->release_info);
- qxl_push_cursor_ring_release(qdev, release, QXL_CMD_CURSOR, false); qxl_release_fence_buffer_objects(release); + qxl_push_cursor_ring_release(qdev, release, QXL_CMD_CURSOR, false); }
static void qxl_update_dumb_head(struct qxl_device *qdev, diff --git a/drivers/gpu/drm/qxl/qxl_draw.c b/drivers/gpu/drm/qxl/qxl_draw.c index f8776d60d08e..3599db096973 100644 --- a/drivers/gpu/drm/qxl/qxl_draw.c +++ b/drivers/gpu/drm/qxl/qxl_draw.c @@ -243,8 +243,8 @@ void qxl_draw_dirty_fb(struct qxl_device *qdev, } qxl_bo_kunmap(clips_bo);
- qxl_push_command_ring_release(qdev, release, QXL_CMD_DRAW, false); qxl_release_fence_buffer_objects(release); + qxl_push_command_ring_release(qdev, release, QXL_CMD_DRAW, false);
out_release_backoff: if (ret) diff --git a/drivers/gpu/drm/qxl/qxl_ioctl.c b/drivers/gpu/drm/qxl/qxl_ioctl.c index 8117a45b3610..72f3f1bbb40c 100644 --- a/drivers/gpu/drm/qxl/qxl_ioctl.c +++ b/drivers/gpu/drm/qxl/qxl_ioctl.c @@ -261,11 +261,8 @@ static int qxl_process_single_command(struct qxl_device *qdev, apply_surf_reloc(qdev, &reloc_info[i]); }
+ qxl_release_fence_buffer_objects(release); ret = qxl_push_command_ring_release(qdev, release, cmd->type, true); - if (ret) - qxl_release_backoff_reserve_list(release); - else - qxl_release_fence_buffer_objects(release);
out_free_bos: out_free_release:
On Wed, Apr 29, 2020 at 12:01:24PM +0300, Vasily Averin wrote:
qxl_release should not be accesses after qxl_push_*_ring_release() calls: userspace driver can process submitted command quickly, move qxl_release into release_ring, generate interrupt and trigger garbage collector.
It can lead to crashes in qxl driver or trigger memory corruption in some kmalloc-192 slab object
Gerd Hoffmann proposes to swap the qxl_release_fence_buffer_objects() + qxl_push_{cursor,command}_ring_release() calls to close that race window.
cc: stable@vger.kernel.org Fixes: f64122c1f6ad ("drm: add new QXL driver. (v1.4)") Signed-off-by: Vasily Averin vvs@virtuozzo.com
Pushed to drm-misc-fixes.
thanks, Gerd
dri-devel@lists.freedesktop.org