v1: AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
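In code terms, the registration boils down to roughly the following (a condensed, hypothetical sketch based on patch 4 of this series; pgmap, dev, r and aper_base are assumed to be set up by the driver):

	struct resource *res;

	/* find the SPM range the BIOS advertised at the GPU's aperture base */
	res = lookup_resource(&iomem_resource, aper_base);

	pgmap->type = MEMORY_DEVICE_GENERIC;
	pgmap->nr_range = 1;
	pgmap->range.start = res->start;
	pgmap->range.end = res->end;

	/* create struct pages for the VRAM so the core mm can work with it */
	r = devm_memremap_pages(dev, pgmap);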
Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.
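For reference, a condensed, hypothetical sketch of the migrate_vma_* flow this series relies on (the real driver code is in patches 4-5 below; the caller is assumed to have allocated the src/dst arrays with one entry per page):

static int svm_migrate_range_sketch(struct vm_area_struct *vma,
				    unsigned long start, unsigned long end,
				    unsigned long *src, unsigned long *dst)
{
	struct migrate_vma migrate = {
		.vma	= vma,
		.start	= start,
		.end	= end,
		.src	= src,
		.dst	= dst,
		/* DEVICE_GENERIC VRAM is CPU-addressable, so the source
		 * pages are selected as ordinary system memory */
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int r;

	r = migrate_vma_setup(&migrate);	/* collect and isolate pages */
	if (r || !migrate.cpages)
		return r;
	/* the driver copies data to VRAM here and fills migrate.dst with
	 * migrate_pfn() entries for the destination device pages */
	migrate_vma_pages(&migrate);		/* replace the CPU mappings */
	migrate_vma_finalize(&migrate);		/* drop refs; restore on failure */
	return 0;
}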
This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
This work is based on HMM and our SVM memory manager that was recently upstreamed to Dave Airlie's drm-next branch:
https://lore.kernel.org/dri-devel/20210527205606.2660-6-Felix.Kuehling@amd.c...
On top of that we did some rework of our VRAM management for migrations to remove some incorrect assumptions, allow partially successful migrations and GPU memory mappings that mix pages in VRAM and system memory:
https://patchwork.kernel.org/project/dri-devel/list/?series=489811
v2: This version of the patch series merges the "[RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount" patch series by Ralph Campbell. On top of that series, it applies our changes to support the device generic type in the migrate_vma helpers. This has been tested on systems with device memory that is coherently accessible by the CPU.
Also addresses the following feedback made in v1:
- Isolate the kernel/resource.c modification in one patch, based on Christoph's feedback.
- Add helpers to check for the generic and private types, to avoid duplicated long lines.
v3:
- Include cover letter from v1
- Rename the dax_layout_is_idle_page function to dax_page_unused in patch "ext4/xfs: add page refcount helper"
Patches 1-2 are rebased versions of Ralph Campbell's ZONE_DEVICE page refcounting patches.
Patches 4-5 are for context, to show how we are looking up the SPM memory and registering it with devmap.
Patches 3, 6-8 are the changes we are trying to upstream or rework to make them acceptable upstream.
Alex Sierra (6):
  kernel: resource: lookup_resource as exported symbol
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: generic type as sys mem on migration to ram
  include/linux/mm.h: helpers to check zone device generic type
  mm: add generic type support to migrate_vma helpers
  mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount
 arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 ++++--
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 fs/dax.c                                 |  8 +--
 fs/ext4/inode.c                          |  5 +-
 fs/xfs/xfs_file.c                        |  4 +-
 include/linux/dax.h                      | 10 ++++
 include/linux/memremap.h                 |  7 +--
 include/linux/mm.h                       | 52 +++---------------
 kernel/resource.c                        |  2 +-
 lib/test_hmm.c                           |  2 +-
 mm/internal.h                            |  8 +++
 mm/memremap.c                            | 69 +++++++-----------------
 mm/migrate.c                             | 13 ++---
 mm/page_alloc.c                          |  3 ++
 mm/swap.c                                | 45 ++--------------
 16 files changed, 83 insertions(+), 164 deletions(-)
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
v2: [AS]: rename dax_layout_is_idle_page func to dax_page_unused
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..321f4ddc6643 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		if (page_ref_count(page) > 1)
+		if (!dax_layout_is_idle_page(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c173c8405856..9ee00186412f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3972,10 +3972,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;

-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei));
+		error = dax_wait_page(ei, page, ext4_wait_dax_page);
 	} while (error == 0);

 	return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..39565fe5f817 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -782,9 +782,7 @@ xfs_break_dax_layouts(
 		return 0;

 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page(inode, page, xfs_wait_dax_page);
 }

 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8b5da1d60dbc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }

+static inline bool dax_page_unused(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb)				\
+	___wait_var_event(&(_page)->_refcount,				\
+		dax_page_unused(_page),					\
+		TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
On Thu, Jun 17, 2021 at 10:16:58AM -0500, Alex Sierra wrote:
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
v2: [AS]: rename dax_layout_is_idle_page func to dax_page_unused
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com

 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..321f4ddc6643 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		if (page_ref_count(page) > 1)
+		if (!dax_layout_is_idle_page(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c173c8405856..9ee00186412f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3972,10 +3972,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;

-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei));
+		error = dax_wait_page(ei, page, ext4_wait_dax_page);
 	} while (error == 0);

 	return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..39565fe5f817 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -782,9 +782,7 @@ xfs_break_dax_layouts(
 		return 0;

 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page(inode, page, xfs_wait_dax_page);

Mechanically, this looks like a straightforward replacement, so:
Acked-by: Darrick J. Wong djwong@kernel.org

--D

 }

 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8b5da1d60dbc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }

+static inline bool dax_page_unused(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb)				\
+	___wait_var_event(&(_page)->_refcount,				\
+		dax_page_unused(_page),					\
+		TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
2.17.1
On Thu, Jun 17, 2021 at 10:16:58AM -0500, Alex Sierra wrote:
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
v2: [AS]: rename dax_layout_is_idle_page func to dax_page_unused
Did you even compile test this?
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com

 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..321f4ddc6643 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
Because you still use dax_layout_is_idle_page() here, not dax_page_unused()...
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		if (page_ref_count(page) > 1)
+		if (!dax_layout_is_idle_page(page))
Here too.
Cheers,
Dave.
On 6/17/2021 6:52 PM, Dave Chinner wrote:
On Thu, Jun 17, 2021 at 10:16:58AM -0500, Alex Sierra wrote:
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
v2: [AS]: rename dax_layout_is_idle_page func to dax_page_unused
Did you even compile test this?
Thanks for catching this Dave, my mistake, I was building without CONFIG_FS_DAX, as I was using DEVICE_GENERIC only. I'll send the corrected patch in reply.
Regards,
Alex Sierra
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com

 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..321f4ddc6643 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
Because you still use dax_layout_is_idle_page() here, not dax_page_unused()...
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		if (page_ref_count(page) > 1)
+		if (!dax_layout_is_idle_page(page))
Here too.
Cheers,
Dave.
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
v3: [AS]: rename dax_layout_is_idle_page func to dax_page_unused
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..4820bb511d68 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_page_unused(page));
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

-		if (page_ref_count(page) > 1)
+		if (!dax_page_unused(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c173c8405856..9ee00186412f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3972,10 +3972,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;

-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei));
+		error = dax_wait_page(ei, page, ext4_wait_dax_page);
 	} while (error == 0);

 	return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5b0f93f73837..39565fe5f817 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -782,9 +782,7 @@ xfs_break_dax_layouts(
 		return 0;

 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page(inode, page, xfs_wait_dax_page);
 }

 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8b5da1d60dbc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }

+static inline bool dax_page_unused(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb)				\
+	___wait_var_event(&(_page)->_refcount,				\
+		dax_page_unused(_page),					\
+		TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
From: Ralph Campbell rcampbell@nvidia.com
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE.
v2: AS: merged this patch in linux 5.11 version
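For a driver, the contract change looks roughly like this (a hypothetical sketch with made-up my_* names; the real conversions are in the diff below):

static void my_return_page_to_pool(struct page *page); /* driver-specific */

static void my_pagemap_page_free(struct page *page)
{
	/* now called when the refcount reaches 0 instead of 1 */
	my_return_page_to_pool(page);
}

static struct page *my_alloc_device_page(unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);

	/* device pages now start free with a zero refcount, so re-arm
	 * the count instead of taking an extra get_page() reference */
	init_page_count(page);
	lock_page(page);
	return page;
}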
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 44 -----------------
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 68 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---------------
 12 files changed, 45 insertions(+), 147 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)

 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 		return NULL;
 	}

-	get_page(page);
+	init_page_count(page);
 	lock_page(page);
 	return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 321f4ddc6643..7b4c6b35b098 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,

 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *	pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 8b5da1d60dbc..05fc982ce153 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)

 static inline bool dax_page_unused(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }

 #define dax_wait_page(_inode, _page, _wait_cb)				\
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..327f32427d21 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {

 struct dev_pagemap_ops {
 	/*
-	 * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
-	 * reach 0 refcount unless there is a refcount bug. This allows the
-	 * device driver to implement its own memory management.)
+	 * Called once the page refcount reaches 0. The reference count
+	 * should be reset to one with init_page_count(page) before reusing
+	 * the page. This allows the device driver to implement its own
+	 * memory management.
 	 */
 	void (*page_free)(struct page *page);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9900aedc195..d8d79bb94be8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,39 +1117,6 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif

-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	if (!static_branch_unlikely(&devmap_managed_key))
-		return false;
-	if (!is_zone_device_page(page))
-		return false;
-	switch (page->pgmap->type) {
-	case MEMORY_DEVICE_PRIVATE:
-	case MEMORY_DEVICE_FS_DAX:
-		return true;
-	default:
-		break;
-	}
-	return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static inline bool is_device_private_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1196,17 +1163,6 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);

-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the page is free and we
-	 * need to inform the device driver through callback. See
-	 * include/linux/memremap.h and HMM for details.
-	 */
-	if (page_is_devmap_managed(page)) {
-		put_devmap_managed_page(page);
-		return;
-	}
-
 	if (put_page_testzero(page))
 		__put_page(page);
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..6998f10350ea 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 }

 	dpage->zone_device_data = rpage;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;

diff --git a/mm/internal.h b/mm/internal.h
index 25d2b2439f19..d3e58746f2d2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -623,4 +623,12 @@ struct migration_target_control {
 	gfp_t gfp_mask;
 };

+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 16b2fb482da1..614b3d600e95 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include "internal.h"

 static DEFINE_XARRAY(pgmap_array);

@@ -37,32 +38,6 @@ unsigned long memremap_compat_align(void)
 EXPORT_SYMBOL_GPL(memremap_compat_align);
 #endif

-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_FS_DAX)
-		static_branch_dec(&devmap_managed_key);
-}
-
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_FS_DAX)
-		static_branch_inc(&devmap_managed_key);
-}
-#else
-static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-}
-static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -87,16 +62,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }

-static unsigned long pfn_next(unsigned long pfn)
-{
-	if (pfn % 1024 == 0)
-		cond_resched();
-	return pfn + 1;
-}
-
-#define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
 	if (pgmap->ops && pgmap->ops->kill)
@@ -152,20 +117,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)

 void memunmap_pages(struct dev_pagemap *pgmap)
 {
-	unsigned long pfn;
 	int i;

 	dev_pagemap_kill(pgmap);
 	for (i = 0; i < pgmap->nr_range; i++)
-		for_each_device_pfn(pfn, pgmap, i)
-			put_page(pfn_to_page(pfn));
+		percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+				    pfn_first(pgmap, i));
 	dev_pagemap_cleanup(pgmap);

 	for (i = 0; i < pgmap->nr_range; i++)
 		pageunmap_range(pgmap, i);

 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
-	devmap_managed_enable_put(pgmap);
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);

@@ -361,8 +324,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 	}

-	devmap_managed_enable_get(pgmap);
-
 	/*
 	 * Clear the pgmap nr_range as it will be incremented for each
 	 * successfully processed range. This communicates how many
@@ -477,16 +438,10 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);

 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
 {
-	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
-		wake_up_var(&page->_refcount);
-		return;
-	}

 	__ClearPageWaiters(page);
-
 	mem_cgroup_uncharge(page);

 	/*
@@ -513,4 +468,19 @@ void free_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
+
+void free_zone_device_page(struct page *page)
+{
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_FS_DAX:
+		/* notify page idle */
+		wake_up_var(&page->_refcount);
+		return;
+	case MEMORY_DEVICE_PRIVATE:
+		free_device_private_page(page);
+		return;
+	default:
+		return;
+	}
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 20ca887ea769..8c2430d3e77b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -376,11 +376,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 {
 	int expected_count = 1;

-	/*
-	 * Device private pages have an extra refcount as they are
-	 * ZONE_DEVICE pages.
-	 */
-	expected_count += is_device_private_page(page);
 	if (mapping)
 		expected_count += thp_nr_pages(page) + page_has_private(page);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..4612c457d0b0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6210,6 +6210,9 @@ void __ref memmap_init_zone_device(struct zone *zone,

 		__init_single_page(page, pfn, zone_idx, nid);

+		/* ZONE_DEVICE pages start with a zero reference count. */
+		set_page_count(page, 0);
+
 		/*
 		 * Mark page reserved as it will need to wait for onlining
 		 * phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index 2cca7141470c..0a12af814065 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -114,12 +114,11 @@ static void __put_compound_page(struct page *page)
 void __put_page(struct page *page)
 {
 	if (is_zone_device_page(page)) {
-		put_dev_pagemap(page->pgmap);
-
 		/*
 		 * The page belongs to the device that created pgmap. Do
 		 * not return it to page allocator.
 		 */
+		free_zone_device_page(page);
 		return;
 	}

@@ -878,29 +877,18 @@ void release_pages(struct page **pages, int nr)
 		if (is_huge_zero_page(page))
 			continue;

+		if (!put_page_testzero(page))
+			continue;
+
 		if (is_zone_device_page(page)) {
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			/*
-			 * ZONE_DEVICE pages that return 'false' from
-			 * page_is_devmap_managed() do not require special
-			 * processing, and instead, expect a call to
-			 * put_page_testzero().
-			 */
-			if (page_is_devmap_managed(page)) {
-				put_devmap_managed_page(page);
-				continue;
-			}
-			if (put_page_testzero(page))
-				put_dev_pagemap(page->pgmap);
+			free_zone_device_page(page);
 			continue;
 		}

-		if (!put_page_testzero(page))
-			continue;
-
 		if (PageCompound(page)) {
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -1142,26 +1130,3 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-	int count;
-
-	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-		return;
-
-	count = page_ref_dec_return(page);
-
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
On 6/17/21 8:16 AM, Alex Sierra wrote:
From: Ralph Campbell rcampbell@nvidia.com
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE.
v2: AS: merged this patch in linux 5.11 version
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com

 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 44 -----------------
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 68 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---------------
 12 files changed, 45 insertions(+), 147 deletions(-)
I think it is great that you are picking this up and trying to revive it.
However, I have a number of concerns about how it affects existing ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users, and I don't see this addressing them. For example, dev_dax_probe() allocates MEMORY_DEVICE_GENERIC struct pages and then:
dev_dax_fault()
  dev_dax_huge_fault()
    __dev_dax_pte_fault()
      vmf_insert_mixed()
which just inserts the PFN into the CPU page tables without increasing the page refcount, so it is zero (whereas it was one before). But using get_page() will trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current notion of free versus allocated for these struct pages. I suppose init_page_count() could be called on all the struct pages in dev_dax_probe() to fix that though.
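(For illustration, that suggestion could look roughly like the following hypothetical fragment in dev_dax_probe(), assuming a single range; a sketch, not a tested fix:)

	/* hypothetical: give each MEMORY_DEVICE_GENERIC page an initial
	 * reference so CPU mappings don't point at refcount-0 pages */
	unsigned long pfn;

	for (pfn = PHYS_PFN(range->start); pfn <= PHYS_PFN(range->end); pfn++)
		init_page_count(pfn_to_page(pfn));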
I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems have clear allocate and free states for backing storage but there are the complications with the page cache references, etc. to consider. The >1 to 1 reference count seems to be used to tell when a page is idle (no I/O, reclaim scanners) rather than free (not allocated to any file) but I'm not 100% sure about that since I don't really understand all the issues around why a file system needs to have a DAX mount option besides knowing that the storage block size has to be a multiple of the page size.
Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
On 6/17/21 8:16 AM, Alex Sierra wrote:
From: Ralph Campbell rcampbell@nvidia.com
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE.
v2: AS: merged this patch in linux 5.11 version
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com

 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 44 -----------------
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 68 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---------------
 12 files changed, 45 insertions(+), 147 deletions(-)
I think it is great that you are picking this up and trying to revive it.
However, I have a number of concerns about how it affects existing ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users, and I don't see this addressing them. For example, dev_dax_probe() allocates MEMORY_DEVICE_GENERIC struct pages and then:
dev_dax_fault()
  dev_dax_huge_fault()
    __dev_dax_pte_fault()
      vmf_insert_mixed()
which just inserts the PFN into the CPU page tables without increasing the page refcount, so it is zero (whereas it was one before). But using get_page() will trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current notion of free versus allocated for these struct pages. I suppose init_page_count() could be called on all the struct pages in dev_dax_probe() to fix that though.
Hi Ralph,
For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So I'm not sure what the reference counting is good for in this case.
Alex is going to add free_zone_device_page support for DEVICE_GENERIC pages (patch 8 of this series). However, even then, it only does anything if there is an actual call to put_page. Where would that call come from in the dev_dax driver case?
I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems have clear allocate and free states for backing storage but there are the complications with the page cache references, etc. to consider. The >1 to 1 reference count seems to be used to tell when a page is idle (no I/O, reclaim scanners) rather than free (not allocated to any file) but I'm not 100% sure about that since I don't really understand all the issues around why a file system needs to have a DAX mount option besides knowing that the storage block size has to be a multiple of the page size.
The only thing that happens in free_zone_device_page for FS_DAX pages is wake_up_var(&page->_refcount). I guess, whoever is waiting for this wake-up will need to be prepared to see a refcount 0 instead of 1 now. I see these callers that would need to be updated:
./fs/ext4/inode.c:	error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c-			atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c-			TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-			ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:	return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-			0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:	return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-			0, 0, xfs_wait_dax_page(inode));
Regarding your page-cache comment, doesn't DAX mean that those file pages are not in the page cache?
Regards, Felix
On 6/28/21 9:46 AM, Felix Kuehling wrote:
Am 2021-06-17 um 3:16 p.m. schrieb Ralph Campbell:
On 6/17/21 8:16 AM, Alex Sierra wrote:
From: Ralph Campbell rcampbell@nvidia.com
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE.
v2: AS: merged this patch in linux 5.11 version
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com

 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 44 -----------------
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 68 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---------------
 12 files changed, 45 insertions(+), 147 deletions(-)
I think it is great that you are picking this up and trying to revive it.
However, I have a number of concerns about how it affects existing ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX users, and I don't see this addressing them. For example, dev_dax_probe() allocates MEMORY_DEVICE_GENERIC struct pages and then:
dev_dax_fault()
  dev_dax_huge_fault()
    __dev_dax_pte_fault()
      vmf_insert_mixed()
which just inserts the PFN into the CPU page tables without increasing the page refcount, so it is zero (whereas it was one before). But using get_page() will trigger VM_BUG_ON_PAGE() if it is enabled. There isn't any current notion of free versus allocated for these struct pages. I suppose init_page_count() could be called on all the struct pages in dev_dax_probe() to fix that though.
Hi Ralph,
For DEVICE_GENERIC pages free_zone_device_page doesn't do anything. So I'm not sure what the reference counting is good for in this case.
Alex is going to add free_zone_device_page support for DEVICE_GENERIC pages (patch 8 of this series). However, even then, it only does anything if there is an actual call to put_page. Where would that call come from in the dev_dax driver case?
Correct, the drivers/dax/device.c driver allocates MEMORY_DEVICE_GENERIC struct pages and doesn't seem to allocate/free the page nor increment/decrement the reference count but it does call vmf_insert_mixed() if the /dev/file is mmap()'ed into a user process' address space. If devm_memremap_pages() returns the array of ZONE_DEVICE struct pages initialized with a reference count of zero, then the CPU page tables will have a PTE/PFN that points to a struct page with a zero reference count. This goes against the normal expectation in the rest of the mm code that assumes a page mapped by a CPU has a non-zero reference count. So yes, nothing "bad" happens because put_page() isn't called but the reference count will differ from other drivers that call vmf_insert_mixed() or vm_insert_page() where the page was allocated with alloc_pages() or similar.
I'm even less clear about how to fix MEMORY_DEVICE_FS_DAX. File systems have clear allocate and free states for backing storage but there are the complications with the page cache references, etc. to consider. The >1 to 1 reference count seems to be used to tell when a page is idle (no I/O, reclaim scanners) rather than free (not allocated to any file) but I'm not 100% sure about that since I don't really understand all the issues around why a file system needs to have a DAX mount option besides knowing that the storage block size has to be a multiple of the page size.
The only thing that happens in free_zone_device_page for FS_DAX pages is wake_up_var(&page->_refcount). I guess, whoever is waiting for this wake-up will need to be prepared to see a refcount 0 instead of 1 now. I see these callers that would need to be updated:
./fs/ext4/inode.c:	error = ___wait_var_event(&page->_refcount,
./fs/ext4/inode.c-			atomic_read(&page->_refcount) == 1,
./fs/ext4/inode.c-			TASK_INTERRUPTIBLE, 0, 0,
./fs/ext4/inode.c-			ext4_wait_dax_page(ei));
--
./fs/fuse/dax.c:	return ___wait_var_event(&page->_refcount,
./fs/fuse/dax.c-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
./fs/fuse/dax.c-			0, 0, fuse_wait_dax_page(inode));
--
./fs/xfs/xfs_file.c:	return ___wait_var_event(&page->_refcount,
./fs/xfs/xfs_file.c-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
./fs/xfs/xfs_file.c-			0, 0, xfs_wait_dax_page(inode));
Regarding your page-cache comment, doesn't DAX mean that those file pages are not in the page cache?
Regards, Felix
I don't really understand the FS_DAX code. I can see the __wait_var_event() is being used when truncating or punching holes in files but I'm not quite sure if it is using the >1 to 1 reference count to know when a page has no "extra" references or if it means the page is actually "free" and no longer assigned to a file. I really think some FS_DAX expert needs to weigh in on these reference count changes.
The AMD architecture for the Frontier supercomputer will have device memory which can be coherently accessed by the CPU. The system BIOS advertises this memory as SPM (special purpose memory) in the UEFI system address map.
The AMDGPU driver needs to be able to look up this resource in order to claim it as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 kernel/resource.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 627e61b0c124..269489bb7097 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -783,7 +783,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)

 	return res;
 }
-
+EXPORT_SYMBOL_GPL(lookup_resource);
 /*
  * Insert a resource into the resource tree. If successful, return NULL,
  * otherwise return the conflicting resource (compare to __request_resource())
When the CPU is connected through XGMI, it has coherent access to the VRAM resource. In this case, that resource is taken from a table in the device gmc aperture base. This resource is used along with the device type, which could be DEVICE_PRIVATE or DEVICE_GENERIC, to create the device page map region.
Signed-off-by: Alex Sierra alex.sierra@amd.com
Reviewed-by: Felix Kuehling Felix.Kuehling@amd.com
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index c8ca3252cbc2..f5939449a99f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -895,6 +895,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
 	struct resource *res;
 	unsigned long size;
 	void *r;
+	bool xgmi_connected_to_cpu = adev->gmc.xgmi.connected_to_cpu;

 	/* Page migration works on Vega10 or newer */
 	if (kfddev->device_info->asic_family < CHIP_VEGA10)
@@ -907,17 +908,22 @@ int svm_migrate_init(struct amdgpu_device *adev)
 	 * should remove reserved size
 	 */
 	size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
-	res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+	if (xgmi_connected_to_cpu)
+		res = lookup_resource(&iomem_resource, adev->gmc.aper_base);
+	else
+		res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
+
 	if (IS_ERR(res))
 		return -ENOMEM;

-	pgmap->type = MEMORY_DEVICE_PRIVATE;
 	pgmap->nr_range = 1;
 	pgmap->range.start = res->start;
 	pgmap->range.end = res->end;
+	pgmap->type = xgmi_connected_to_cpu ?
+			MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE;
 	pgmap->ops = &svm_migrate_pgmap_ops;
 	pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
-	pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	pgmap->flags = 0;
 	r = devm_memremap_pages(adev->dev, pgmap);
 	if (IS_ERR(r)) {
 		pr_err("failed to register HMM device memory\n");
Generic device type memory, on migration from VRAM to RAM, is accessed by the CPU in the same way as system RAM. The migrate flags select the migration source, which in the generic type case should be set to SYSTEM.
Signed-off-by: Alex Sierra alex.sierra@amd.com
Reviewed-by: Felix Kuehling Felix.Kuehling@amd.com
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index f5939449a99f..7b41006c1164 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -653,8 +653,9 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 	migrate.vma = vma;
 	migrate.start = start;
 	migrate.end = end;
-	migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 	migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
+	migrate.flags = adev->gmc.xgmi.connected_to_cpu ?
+			MIGRATE_VMA_SELECT_SYSTEM : MIGRATE_VMA_SELECT_DEVICE_PRIVATE;

 	size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
 	size *= npages;
Two helpers added: one checks whether a zone device page is of generic type; the other checks whether a page is of either private or generic type.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 include/linux/mm.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d8d79bb94be8..f5b247a63044 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1125,6 +1125,14 @@ static inline bool is_device_private_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }

+static inline bool is_device_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		is_zone_device_page(page) &&
+		(page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+		 page->pgmap->type == MEMORY_DEVICE_GENERIC);
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
Device generic type case added to the migrate_vma_pages and migrate_vma_check_page helpers. Both generic and private device types use the same conditions to decide whether to migrate pages from/to device memory.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 mm/migrate.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8c2430d3e77b..3b6aaba96fe6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2602,7 +2602,7 @@ static bool migrate_vma_check_page(struct page *page)
 	 * FIXME proper solution is to rework migration_entry_wait() so
 	 * it does not need to take a reference on page.
 	 */
-	return is_device_private_page(page);
+	return is_device_page(page);
 }

 	/* For file back page */
@@ -3064,10 +3064,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 		mapping = page_mapping(page);

 		if (is_zone_device_page(newpage)) {
-			if (is_device_private_page(newpage)) {
+			if (is_device_page(newpage)) {
 				/*
-				 * For now only support private anonymous when
-				 * migrating to un-addressable device memory.
+				 * For now only support private and generic
+				 * anonymous when migrating to device memory.
 				 */
 				if (mapping) {
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
Add the MEMORY_DEVICE_GENERIC case to the free_zone_device_page callback. Device generic type memory is now able to free its pages properly.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 mm/memremap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 614b3d600e95..6c884e2542a9 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -438,7 +438,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);

 #ifdef CONFIG_DEV_PAGEMAP_OPS
-static void free_device_private_page(struct page *page)
+static void free_device_page(struct page *page)
 {

 	__ClearPageWaiters(page);
@@ -477,7 +477,8 @@ void free_zone_device_page(struct page *page)
 		wake_up_var(&page->_refcount);
 		return;
 	case MEMORY_DEVICE_PRIVATE:
-		free_device_private_page(page);
+	case MEMORY_DEVICE_GENERIC:
+		free_device_page(page);
 		return;
 	default:
 		return;
On 6/17/2021 10:16 AM, Alex Sierra wrote:
v1: AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.
This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
This work is based on HMM and our SVM memory manager that was recently upstreamed to Dave Airlie's drm-next branch https://lore.kernel.org/dri-devel/20210527205606.2660-6-Felix.Kuehling@amd.c...
Corrected link:
https://cgit.freedesktop.org/drm/drm/log/?h=drm-next
Regards, Alex Sierra
On top of that we did some rework of our VRAM management for migrations to remove some incorrect assumptions, allow partially successful migrations and GPU memory mappings that mix pages in VRAM and system memory. https://patchwork.kernel.org/project/dri-devel/list/?series=489811
Corrected link:
https://lore.kernel.org/dri-devel/20210527205606.2660-6-Felix.Kuehling@amd.c...
Regards, Alex Sierra
v2: This version of the patch series merges the "[RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount" patch series by Ralph Campbell. On top of that series, it applies our changes to support the device generic type in the migrate_vma helpers. This has been tested on systems with device memory that is coherently accessible by the CPU.
Also addresses the following feedback made in v1:
- Isolate in one patch kernel/resource.c modification, based
on Christoph's feedback.
- Add helpers check for generic and private type to avoid
duplicated long lines.
v3:
- Include cover letter from v1
- Rename dax_layout_is_idle_page func to dax_page_unused in patch
ext4/xfs: add page refcount helper
Patches 1-2 are rebased versions of Ralph Campbell's ZONE_DEVICE page refcounting patches.
Patches 4-5 are for context, to show how we are looking up the SPM memory and registering it with devmap.
Patches 3, 6-8 are the changes we are trying to upstream or rework to make them acceptable upstream.
Alex Sierra (6):
  kernel: resource: lookup_resource as exported symbol
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: generic type as sys mem on migration to ram
  include/linux/mm.h: helpers to check zone device generic type
  mm: add generic type support to migrate_vma helpers
  mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount
 arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 ++++--
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 fs/dax.c                                 |  8 +--
 fs/ext4/inode.c                          |  5 +-
 fs/xfs/xfs_file.c                        |  4 +-
 include/linux/dax.h                      | 10 ++++
 include/linux/memremap.h                 |  7 +--
 include/linux/mm.h                       | 52 +++---------------
 kernel/resource.c                        |  2 +-
 lib/test_hmm.c                           |  2 +-
 mm/internal.h                            |  8 +++
 mm/memremap.c                            | 69 +++++++-----------------
 mm/migrate.c                             | 13 ++---
 mm/page_alloc.c                          |  3 ++
 mm/swap.c                                | 45 ++--------------
 16 files changed, 83 insertions(+), 164 deletions(-)
On Thu, Jun 17, 2021 at 10:16:57AM -0500, Alex Sierra wrote:
v1: AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.
This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
Is there a way we can test the codepaths touched by this patchset? It doesn't have to be via a complete qemu simulation of the GPU device memory, but some way of creating MEMORY_DEVICE_GENERIC subject to migrate_vma_* helpers so we can test for regressions moving forward.
Thanks,
- Ted
On 2021-06-20 10:14 a.m., Theodore Ts'o wrote:
On Thu, Jun 17, 2021 at 10:16:57AM -0500, Alex Sierra wrote:
v1: AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.
This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
Is there a way we can test the codepaths touched by this patchset? It doesn't have to be via a complete qemu simulation of the GPU device memory, but some way of creating MEMORY_DEVICE_GENERIC subject to migrate_vma_* helpers so we can test for regressions moving forward.
Hi Theodore,
I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in this patch series in a way that is reproducible without special hardware and firmware:
For the reference counting changes we could use the dax driver with hmem and use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests to exercise dax functionality on this type of memory.
For the migration helper changes we could modify or parametrize lib/hmm_test.c to create DEVICE_GENERIC pages instead of DEVICE_PRIVATE. Then run tools/testing/selftests/vm/hmm-tests.c.
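For example, the parametrization could look roughly like this hypothetical fragment in the chunk allocation of lib/test_hmm.c, assuming a new spm_addr module parameter (a sketch only; the actual hmm_test changes ship with the v4 series mentioned later in this thread):

	/* hypothetical: when spm_addr is given, expose fake SPM as
	 * MEMORY_DEVICE_GENERIC instead of the default DEVICE_PRIVATE */
	if (spm_addr) {
		devmem->pagemap.type = MEMORY_DEVICE_GENERIC;
		devmem->pagemap.range.start = spm_addr;
		devmem->pagemap.range.end = spm_addr + DEVMEM_CHUNK_SIZE - 1;
	} else {
		res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
					      "hmm_dmirror");
		if (IS_ERR(res))
			goto err_devmem;
		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
		devmem->pagemap.range.start = res->start;
		devmem->pagemap.range.end = res->end;
	}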
Regards, Felix
Thanks,
- Ted
On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in this patch series in a way that is reproducible without special hardware and firmware:
For the reference counting changes we could use the dax driver with hmem and use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests to exercise dax functionality on this type of memory.
Sorry for the thread necromancy, but now that the merge window is past....
Today I test ext4's dax support, without having any $$$ DAX hardware, by using the kernel command line "memmap=4G!9G:memmap=9G!14G", which reserves memory to create two pmem devices, and then I run xfstests with DAX enabled using qemu or using a Google Compute Engine VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.
If you can give me a recipe for what kernel configs I should enable, and what magic kernel command line arguments to use, then I'd be able to test your patch set with ext4.
Cheers,
- Ted
Am 2021-07-16 um 11:07 a.m. schrieb Theodore Y. Ts'o:
On Wed, Jun 23, 2021 at 05:49:55PM -0400, Felix Kuehling wrote:
I can think of two ways to test the changes for MEMORY_DEVICE_GENERIC in this patch series in a way that is reproducible without special hardware and firmware:
For the reference counting changes we could use the dax driver with hmem and use efi_fake_mem on the kernel command line to create some DEVICE_GENERIC pages. I'm open to suggestions for good user mode tests to exercise dax functionality on this type of memory.
Sorry for the thread necromancy, but now that the merge window is past....
No worries. Alejandro should have a new version of this series in a few days, with updates to hmm_test and some fixes.
Today I test ext4's dax support, without having any $$$ DAX hardware, by using the kernel command line "memmap=4G!9G:memmap=9G!14G", which reserves memory to create two pmem devices, and then I run xfstests with DAX enabled using qemu or using a Google Compute Engine VM, using TEST_DEV=/dev/pmem0 and SCRATCH_DEV=/dev/pmem1.
If you can give me a recipe for what kernel configs I should enable, and what magic kernel command line arguments to use, then I'd be able to test your patch set with ext4.
That would be great!
Regarding kernel config options, it should be the same as what you're using for DAX testing today. We're not changing or adding any Kconfig options. But let me take a stab:
ZONE_DEVICE
HMM_MIRROR
MMU_NOTIFIER
DEVICE_PRIVATE (maybe not needed for your test)
FS_DAX
I'm not sure what you're looking for in terms of kernel command line, other than the memmap options you already found. There are some more options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're already running that ourselves. That will also be in the next revision of this patch series.
If you can run your xfstests with DAX on top of this patch series, that would be very helpful. That's to make sure the ZONE_DEVICE page refcount changes don't break DAX.
Regards, Felix
Cheers,
- Ted
On 7/16/2021 5:14 PM, Felix Kuehling wrote:
> No worries. Alejandro should have a new version of this series in a few days, with updates to hmm_test and some fixes.
The v4 patch series has been sent for review: https://marc.info/?l=dri-devel&m=162654971618911&w=2
Regards, Alex Sierra
> There are some more options to run hmm_test with fake SPM (DEVICE_GENERIC) memory, but we're already running that ourselves. That will also be in the next revision of this patch series.
In order to run the hmm_test with the generic device type enabled, set the following:

Kernel config:
- EFI_FAKE_MEMMAP
- RUNTIME_TESTING_MENU
- TEST_HMM=m

Kernel parameters to fake SP memory (the addresses here are based on my system's usable memory as enumerated by BIOS-e820 at boot; the test requires two SP devices of at least 256MB each):
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

To run the hmm_test with the generic device type, pass the SP addresses to the test_hmm.sh script:
sudo ./test_hmm.sh smoke 0x100000000 0x140000000
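As a sanity check before running the test, the faked regions should appear in the iomem map, assuming a kernel that soft-reserves EFI_MEMORY_SP ranges (which is what the 0x40000 attribute above requests). The output below is illustrative for the example addresses:

    $ grep -i 'soft reserved' /proc/iomem
    100000000-13fffffff : Soft Reserved
    140000000-17fffffff : Soft Reserved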
On 7/17/2021 2:54 PM, Sierra Guiza, Alejandro (Alex) wrote:
Hi Theodore, I wonder if you've had a chance to set the kernel configs from Felix to enable DAX in xfstests.

I've been trying to test FS DAX on my side using virtio-fs + QEMU. Unfortunately, I'm having some problems setting up the environment with the guest kernel. Could you share your VM (QEMU or GCE) setup for running it with xfstests?
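The memmap approach Ted describes should also work in a plain qemu guest, roughly along these lines; the kernel image, rootfs path, and memory size here are illustrative assumptions, and the guest needs enough RAM to cover the reserved ranges:

    qemu-system-x86_64 -enable-kvm -m 24G -smp 4 \
        -kernel /path/to/bzImage \
        -drive file=rootfs.img,if=virtio \
        -append 'root=/dev/vda console=ttyS0 memmap=4G!9G memmap=9G!14G' \
        -nographic

This should leave /dev/pmem0 and /dev/pmem1 in the guest to use as TEST_DEV and SCRATCH_DEV.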
Regards, Alex S.
On 2021-07-23 6:46 p.m., Sierra Guiza, Alejandro (Alex) wrote:
> Hi Theodore, I wonder if you've had a chance to set the kernel configs from Felix to enable DAX in xfstests.
Hi Theodore,
Sorry to keep bugging you. I'm wondering if you've had a chance to test FS DAX with Alex's latest patch series ([PATCH v4 00/13] Support DEVICE_GENERIC memory in migrate_vma_*)? I think that, other than minor cosmetic fixes, it should be ready to merge, if it passes your tests.
Thanks, Felix