Yeah, that should probably be the right one.
Christian.
Am 12.01.22 um 03:19 schrieb Chen, Guchun:
[Public]
Hi Christian,
My BAD, I checked that discussion history of this just now. So If I read it correctly, the double check at a different place to skip evict is: " drm/ttm: Double check mem_type of BO while eviction"? It is in 5.16 kernel.
Regards, Guchun
-----Original Message----- From: Christian König ckoenig.leichtzumerken@gmail.com Sent: Tuesday, January 11, 2022 7:27 PM To: Chen, Guchun Guchun.Chen@amd.com; Pan, Xinhui Xinhui.Pan@amd.com; Koenig, Christian Christian.Koenig@amd.com; amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Subject: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
IIRC we have completely dropped this patch in favor of a check at a different place.
Regards, Christian.
Am 11.01.22 um 09:47 schrieb Chen, Guchun:
[Public]
Hi Christian,
Looks this patch still missed in 5.16 kernel. Is it intentional? https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit. kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2 Ftree%2Fdrivers%2Fgpu%2Fdrm%2Fttm%2Fttm_bo.c%3Fh%3Dv5.16&data=04%7 C01%7CGuchun.Chen%40amd.com%7Cf3b7f4971dc8405b0c2908d9d4f55547%7C3dd89 61fe4884e608e11a82d994e183d%7C0%7C0%7C637774972434004088%7CUnknown%7CT WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI 6Mn0%3D%7C3000&sdata=vbuBPHO40J2HGt7abzfzC0nC1DQa62qal5S6TXBRj4w%3 D&reserved=0
Regards, Guchun
-----Original Message----- From: amd-gfx amd-gfx-bounces@lists.freedesktop.org On Behalf Of Pan, Xinhui Sent: Tuesday, November 9, 2021 9:16 PM To: Koenig, Christian Christian.Koenig@amd.com; amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Subject: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
[AMD Official Use Only]
[AMD Official Use Only]
Actually this patch does not totally fix the mismatch of lru list with mem_type as mem_type is changed in ->move() and lru list is changed after that.
During this small period, another eviction could still happed and evict this mismatched BO from sMam(say, its lru list is on vram domain) to sMem. ________________________________________ 发件人: Pan, Xinhui Xinhui.Pan@amd.com 发送时间: 2021年11月9日 21:05 收件人: Koenig, Christian; amd-gfx@lists.freedesktop.org 抄送: dri-devel@lists.freedesktop.org 主题: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
Yes, a stable tag is needed. vulkan guys say 5.14 hit this issue too.
I think that amdgpu_bo_move() does support copy from sysMem to sysMem correctly. maybe something below is needed. diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index c83ef42ca702..aa63ae7ddf1e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -485,7 +485,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict, } if (old_mem->mem_type == TTM_PL_SYSTEM && (new_mem->mem_type == TTM_PL_TT ||
new_mem->mem_type == AMDGPU_PL_PREEMPT)) {
new_mem->mem_type == AMDGPU_PL_PREEMPT ||
new_mem->mem_type == TTM_PL_SYSTEM)) { ttm_bo_move_null(bo, new_mem); goto out; }
otherwise, amdgpu_move_blit() is called to do the system memory copy which use a wrong address. 206 /* Map only what can't be accessed directly */ 207 if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) { 208 *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) + 209 mm_cur->start; 210 return 0; 211 }
line 208, *addr is zero. So when amdgpu_copy_buffer submit job with such addr, page fault happens.
发件人: Koenig, Christian Christian.Koenig@amd.com 发送时间: 2021年11月9日 20:35 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org 抄送: dri-devel@lists.freedesktop.org 主题: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list
Mhm, I'm not sure what the rational behind that is.
Not moving the BO would make things less efficient, but should never cause a crash.
Maybe we should add a CC: stable tag and push it to -fixes instead?
Christian.
Am 09.11.21 um 13:28 schrieb Pan, Xinhui:
[AMD Official Use Only]
I hit vulkan cts test hang with navi23.
dmesg says gmc page fault with address 0x0, 0x1000, 0x2000.... And some debug log also says amdgu copy one BO from system Domain to system Domain which is really weird. ________________________________________ 发件人: Koenig, Christian Christian.Koenig@amd.com 发送时间: 2021年11月9日 20:20 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org 抄送: dri-devel@lists.freedesktop.org 主题: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list
Am 09.11.21 um 12:19 schrieb xinhui pan:
After we move BO to a new memory region, we should put it to the new memory manager's lru list regardless we unlock the resv or not.
Signed-off-by: xinhui pan xinhui.pan@amd.com
Interesting find, did you trigger that somehow or did you just stumbled over it by reading the code?
Patch is Reviewed-by: Christian König christian.koenig@amd.com, I will pick that up for drm-misc-next.
Thanks, Christian.
drivers/gpu/drm/ttm/ttm_bo.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index f1367107925b..e307004f0b28 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -701,6 +701,8 @@ int ttm_mem_evict_first(struct ttm_device *bdev, ret = ttm_bo_evict(bo, ctx); if (locked) ttm_bo_unreserve(bo);
else
ttm_bo_move_to_lru_tail_unlocked(bo); ttm_bo_put(bo); return ret;