AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.
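To illustrate, the flow we are building on looks roughly like this (a minimal sketch of a RAM-to-VRAM migration; error handling, the VRAM allocation and the GPU copy step are elided, and the src_pfns/dst_pfns buffers are illustrative):

    struct migrate_vma migrate = {
            .vma         = vma,
            .start       = start,
            .end         = end,
            .src         = src_pfns,
            .dst         = dst_pfns,
            .pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev),
            /* select system RAM pages as the migration source */
            .flags       = MIGRATE_VMA_SELECT_SYSTEM,
    };

    if (migrate_vma_setup(&migrate))        /* collect and isolate source pages */
            return -EFAULT;
    if (migrate.cpages) {
            /* allocate VRAM pages, fill migrate.dst[], copy the data on the GPU */
            migrate_vma_pages(&migrate);    /* install destination pages */
            migrate_vma_finalize(&migrate); /* drop references on the old pages */
    }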
This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
This work is based on HMM and our SVM memory manager that was recently upstreamed to Dave Airlie's drm-next branch [https://cgit.freedesktop.org/drm/drm/log/?h=drm-next]. On top of that we did some rework of our VRAM management for migrations to remove some incorrect assumptions, allow partially successful migrations, and support GPU memory mappings that mix pages in VRAM and system memory. [https://patchwork.kernel.org/project/dri-devel/list/?series=489811]
In this RFC, patches 1 and 2 are for context to show how we are looking up the SPM memory and registering it with devmap.
Patches 3-5 are the changes we are trying to upstream or rework to make them acceptable upstream.
Alex Sierra (5):
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: generic type as sys mem on migration to ram
  include/linux/mm.h: helper to check zone device generic type
  mm: add generic type support for device zone page migration
  mm: changes to unref pages with Generic type
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 +++++++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |  1 -
 include/linux/mm.h                       |  8 ++++++++
 kernel/resource.c                        |  2 +-
 mm/memremap.c                            |  5 ++++-
 mm/migrate.c                             | 13 ++++++++-----
 6 files changed, 32 insertions(+), 12 deletions(-)
From: Alex Sierra <alex.sierra@amd.com>
When the CPU is connected through XGMI, it has coherent access to VRAM. In that case, the VRAM resource is looked up in the iomem resource tree at the device's GMC aperture base instead of being allocated. This resource, together with the device type (DEVICE_PRIVATE or DEVICE_GENERIC), is used to create the device page map region.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 12 +++++++++---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |  1 -
 kernel/resource.c                        |  2 +-
 3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index c8ca3252cbc2..f5939449a99f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -895,6 +895,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
 	struct resource *res;
 	unsigned long size;
 	void *r;
+	bool xgmi_connected_to_cpu = adev->gmc.xgmi.connected_to_cpu;
 
 	/* Page migration works on Vega10 or newer */
 	if (kfddev->device_info->asic_family < CHIP_VEGA10)
@@ -907,17 +908,22 @@ int svm_migrate_init(struct amdgpu_device *adev)
 	 * should remove reserved size
 	 */
 	size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
-	res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+	if (xgmi_connected_to_cpu)
+		res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
+	else
+		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+
 	if (IS_ERR(res))
 		return -ENOMEM;
 
-	pgmap->type = MEMORY_DEVICE_PRIVATE;
 	pgmap->nr_range = 1;
 	pgmap->range.start = res->start;
 	pgmap->range.end = res->end;
+	pgmap->type = xgmi_connected_to_cpu ?
+			MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE;
 	pgmap->ops = &svm_migrate_pgmap_ops;
 	pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
-	pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	pgmap->flags = 0;
 	r = devm_memremap_pages(adev->dev, pgmap);
 	if (IS_ERR(r)) {
 		pr_err("failed to register HMM device memory\n");
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 21f693767a0d..3881a93192ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -38,7 +38,6 @@
 #define SVM_RANGE_VRAM_DOMAIN (1UL << 0)
 #define SVM_ADEV_PGMAP_OWNER(adev)\
 		((adev)->hive ? (void *)(adev)->hive : (void *)(adev))
-
 struct svm_range_bo {
 	struct amdgpu_bo *bo;
 	struct kref kref;
diff --git a/kernel/resource.c b/kernel/resource.c
index 627e61b0c124..da137553b83e 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -783,7 +783,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)
 
 	return res;
 }
-
+EXPORT_SYMBOL(lookup_resource);
 /*
  * Insert a resource into the resource tree. If successful, return NULL,
  * otherwise return the conflicting resource (compare to __request_resource())
From: Alex Sierra <alex.sierra@amd.com>
For migrations from VRAM to RAM, generic device type memory is accessible to the CPU in the same way as system RAM. The migrate flags select the source pages of the migration; in the generic type case, the source should be selected as SYSTEM.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index f5939449a99f..7b41006c1164 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -653,8 +653,9 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 	migrate.vma = vma;
 	migrate.start = start;
 	migrate.end = end;
-	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
+	migrate.flags = adev->gmc.xgmi.connected_to_cpu ?
+			MIGRATE_VMA_SELECT_SYSTEM : MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 
 	size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
 	size *= npages;
From: Alex Sierra <alex.sierra@amd.com>
Add a helper to check whether a ZONE_DEVICE page is of generic type.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 include/linux/mm.h | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9900aedc195..1af7b9b76948 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1158,6 +1158,13 @@ static inline bool is_device_private_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_generic_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_GENERIC;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
From: Alex Sierra <alex.sierra@amd.com>
This support covers generic type anonymous memory only. Like the device private type, ZONE_DEVICE pages of generic type require an extra reference. Also add support for migrating page metadata for the generic device type.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 mm/migrate.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 20ca887ea769..33e573a992e5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -380,7 +380,8 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 	 * Device private pages have an extra refcount as they are
 	 * ZONE_DEVICE pages.
 	 */
-	expected_count += is_device_private_page(page);
+	expected_count +=
+		(is_device_private_page(page) || is_device_generic_page(page));
 	if (mapping)
 		expected_count += thp_nr_pages(page) + page_has_private(page);
 
@@ -2607,7 +2608,7 @@ static bool migrate_vma_check_page(struct page *page)
 		 * FIXME proper solution is to rework migration_entry_wait() so
 		 * it does not need to take a reference on page.
 		 */
-		return is_device_private_page(page);
+		return is_device_private_page(page) | is_device_generic_page(page);
 	}
 
 	/* For file back page */
@@ -3069,10 +3070,12 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 		mapping = page_mapping(page);
 
 		if (is_zone_device_page(newpage)) {
-			if (is_device_private_page(newpage)) {
+			if (is_device_private_page(newpage) ||
+			    is_device_generic_page(newpage)) {
 				/*
-				 * For now only support private anonymous when
-				 * migrating to un-addressable device memory.
+				 * For now only support private and devdax/generic
+				 * anonymous when migrating to un-addressable
+				 * device memory.
 				 */
 				if (mapping) {
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
From: Alex Sierra <alex.sierra@amd.com>
Refcounts of devmap-managed pages are 1-based instead of 0-based: a refcount of 1 means the page can be freed. This logic was not enabled for the generic memory type, so releasing such a page was treated like a normal page release instead of invoking the device driver's page-free callback.
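For context, the release path in question is put_devmap_managed_page() in mm/memremap.c, which (slightly simplified) looks like this:

    void put_devmap_managed_page(struct page *page)
    {
            int count;

            if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
                    return;

            count = page_ref_dec_return(page);

            /* devmap page refcounts are 1-based: refcount == 1 means free */
            if (count == 1)
                    free_devmap_managed_page(page); /* driver page_free callback path */
            else if (!count)
                    __put_page(page);
    }

Without MEMORY_DEVICE_GENERIC in page_is_devmap_managed(), generic pages never take this path and are freed like normal pages.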
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 include/linux/mm.h | 1 +
 mm/memremap.c      | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1af7b9b76948..83bd2f3e111b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1130,6 +1130,7 @@ static inline bool page_is_devmap_managed(struct page *page)
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
+	case MEMORY_DEVICE_GENERIC:
 		return true;
 	default:
 		break;
diff --git a/mm/memremap.c b/mm/memremap.c
index 16b2fb482da1..d2563fbcf987 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -44,6 +44,7 @@ EXPORT_SYMBOL(devmap_managed_key);
 static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_GENERIC ||
 	    pgmap->type == MEMORY_DEVICE_FS_DAX)
 		static_branch_dec(&devmap_managed_key);
 }
@@ -51,6 +52,7 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
+	    pgmap->type == MEMORY_DEVICE_GENERIC ||
 	    pgmap->type == MEMORY_DEVICE_FS_DAX)
 		static_branch_inc(&devmap_managed_key);
 }
@@ -480,7 +482,8 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_devmap_managed_page(struct page *page)
 {
 	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
+	if (!(is_device_private_page(page) ||
+	      is_device_generic_page(page))) {
 		wake_up_var(&page->_refcount);
 		return;
 	}
On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
> Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.

So you have completely coherent and indistinguishable GPU and CPU memory, and the need for migration is basically a lot like a NUMA policy choice - get better access locality?

> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
I'm not an expert in migrate, but it doesn't look outrageous.
Have you thought about allowing MEMORY_DEVICE_GENERIC to work with hmm_range_fault() so you can have nice uniform RDMA?
People have wanted to do that with MEMORY_DEVICE_PRIVATE, but nobody finished the work.
Jason
On 2021-05-28 9:08 a.m., Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:08:04PM -0400, Felix Kuehling wrote:
>> Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.
>
> So you have completely coherent and indistinguishable GPU and CPU memory, and the need for migration is basically a lot like a NUMA policy choice - get better access locality?
Yes. For a typical GPU compute application it means the GPU gets the best bandwidth/latency, and the CPU can coherently access the results without page faults and migrations. That's especially valuable for applications with persistent compute kernels that want to exploit concurrency between CPU and GPU.
>> This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
>
> I'm not an expert in migrate, but it doesn't look outrageous.
>
> Have you thought about allowing MEMORY_DEVICE_GENERIC to work with hmm_range_fault() so you can have nice uniform RDMA?
Yes. That's our plan for RDMA to unified memory on this system. My understanding was that DEVICE_GENERIC pages should already work with hmm_range_fault. But maybe I'm missing something.
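Roughly, what we have in mind for the RDMA path is something like this (a sketch only; the mmu_interval_notifier setup and the retry loop around it are elided, and variable names are illustrative):

    struct hmm_range range = {
            .notifier          = &notifier, /* mmu_interval_notifier, registered elsewhere */
            .start             = start,
            .end               = end,
            .hmm_pfns          = pfns,
            .default_flags     = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
            .dev_private_owner = owner,     /* only consulted for DEVICE_PRIVATE pages */
    };
    int ret;

    range.notifier_seq = mmu_interval_read_begin(&notifier);
    mmap_read_lock(mm);
    ret = hmm_range_fault(&range);
    mmap_read_unlock(mm);
    /*
     * On success, pfns[] should include DEVICE_GENERIC pages just like
     * normal system RAM, so they can be DMA-mapped for RDMA without the
     * extra owner coordination DEVICE_PRIVATE would require.
     */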
> People have wanted to do that with MEMORY_DEVICE_PRIVATE, but nobody finished the work.
Yeah, for DEVICE_PRIVATE it seems more tricky because the peer device is not the owner of the pages and would need help from the actual owner to get proper DMA addresses.
Regards, Felix
> Jason