If the initial value of `num_entires` (calculated at line 1654) is not an integral multiple of `AMDGPU_GPU_PAGES_IN_CPU_PAGE`, in line 1681 a value greater than the initial value will be assigned to it. That causes `start > last + 1` after line 1708. Then in the next iteration an underflow happens at line 1654. It causes message
*ERROR* Couldn't update BO_VA (-12)
printed in kernel log, and GPU hanging.
Fortify the criteria of the loop to fix this issue.
BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1549 Fixes: a39f2a8d7066 ("drm/amdgpu: nuke amdgpu_vm_bo_split_mapping v2") Reported-by: Xi Ruoyao xry111@mengyan1223.wang Reported-by: Dan Horák dan@danny.cz Cc: stable@vger.kernel.org Signed-off-by: Xi Ruoyao xry111@mengyan1223.wang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ad91c0c3c423..cee0cc9c8085 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1707,7 +1707,7 @@ static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, } start = tmp;
- } while (unlikely(start != last + 1)); + } while (unlikely(start < last + 1));
r = vm->update_funcs->commit(¶ms, fence);
base-commit: a5e13c6df0e41702d2b2c77c8ad41677ebb065b3
On 2021-03-29 20:04 +0200, Christian König wrote:
Simply run "OpenGL area" from gtk3-demo (which just renders a triangle with GL) under Xorg, on MIPS64. See the BugLink.
On 2021-03-29 20:10 +0200, Christian König wrote:
You need to identify the root cause of this, most likely start or last are not a multiple of AMDGPU_GPU_PAGES_IN_CPU_PAGE.
I printk'ed the value of start & last, they are all a multiple of 4 (AMDGPU_GPU_PAGES_IN_CPU_PAGE).
However... `num_entries = last - start + 1` so it became some irrational thing... Either this line is wrong, or someone called amdgpu_vm_bo_update_mapping with [start, last) instead of [start, last], which is unexpected.
On 2021-03-30 02:21 +0800, Xi Ruoyao wrote:
I added BUG_ON(num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE != 0), get:
Hi Christian,
I don't think there is any constraint implemented to ensure `num_entries % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
/* validate the parameters */ if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK || size == 0 || size & AMDGPU_GPU_PAGE_MASK) return -EINVAL;
/* snip */
saddr /= AMDGPU_GPU_PAGE_SIZE; eaddr /= AMDGPU_GPU_PAGE_SIZE;
/* snip */
mapping->start = saddr; mapping->last = eaddr;
If we really want to ensure (mapping->last - mapping->start + 1) % AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.
On 2021-03-30 02:30 +0800, Xi Ruoyao wrote:
Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
Yeah, good point.
I tried it and it broke userspace: Xorg startup fails with EINVAL with this change.
Well in theory it is possible that we always fill the GPUVM on a 4k basis while the native page size of the CPU is larger. Let me double check the code.
BTW: What code base are you based on? The code your post here is quite outdated.
Christian.
On 2021-03-29 21:36 +0200, Christian König wrote:
Linus' tree.
I'll go to sleep now (it's 03:39 here :( ), when I wake up I can try to fetch drm-next or something.
On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of "AMDGPU_GPU_PAGE_MASK" :(.
On my platform, there are two issues both causing the VM corruption. One is fixed by agd5f/linux@fe001e7. Another is in Mesa from userspace: it uses `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver expects it to use `dev_info->virtual_address_alignment`.
If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue and make virtual_address_alignment = 4K. Otherwise, we should fortify the parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the userspace will just get an EINVAL, instead of a slient VM corruption. And someone should tell Mesa developers to fix the code in this case. -- Xi Ruoyao xry111@mengyan1223.wang School of Aerospace Science and Technology, Xidian University
Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
What is fe001e7? A commit id? If yes then that is to short and I can't find it.
Mhm, looking at the kernel code I would rather say Mesa is correct and the kernel driver is broken.
The gart_page_size is limited by the CPU page size, but the virtual_address_alignment isn't.
I rather see this as a kernel bug. Can you test if this code fragment fixes your issue:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index 64beb3399604..e1260b517e1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) } dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; + dev_info->gart_page_size = dev_info->virtual_address_alignment; dev_info->cu_active_number = adev->gfx.cu_info.number; dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; dev_info->ce_ram_size = adev->gfx.ce_ram_size;
Thanks, Christian.
On Tue, 30 Mar 2021 14:55:01 +0200 Christian König christian.koenig@amd.com wrote:
it's a gitlab shortcut for https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be...
Dan
On 2021-03-30 14:55 +0200, Christian König wrote:
It works. I've seen it at https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae1... before (with a common sub-expression, though :).
On 2021-03-30 21:02 +0800, Xi Ruoyao wrote:
Some comment: on an old version of Fedora ported by Loongson, Xorg just hangs without this commit. But on the system I built from source, I didn't see any issue before Linux 5.11. So I misbelieved that it was something already fixed.
Dan: you can try it on your PPC 64 with non-4K page as well.
On Tue, 30 Mar 2021 21:09:12 +0800 Xi Ruoyao xry111@mengyan1223.wang wrote:
yup, looks good here as well, ppc64le (Power9) system with 64KB pages
Dan
On 2021-03-30 15:24 +0200, Christian König wrote:
I think maybe its author considers it a "workaround" like me, as there is already a "virtual_address_alignment" which seems correct.
Can you please send it to the mailing list with me on CC?
I've sent it, together with my patch for using ~PAGE_MASK in parameter validation.
dri-devel@lists.freedesktop.org